"An Architecture for Universal Distributed Tracing"
Monday, October 2, 2017, at 1:00 P.M.
Room 368 (CIT 3rd Floor)
Today's tools and abstractions for monitoring, understanding, and enforcing system behaviors often cannot coherently reason about end-to-end executions, because important context is lost when crossing software component and machine boundaries. This makes it difficult to answer questions about causes of failures, uncover dependencies between components, or understand performance or resource usage; it is even harder to provide end-to-end performance guarantees or isolation between tenants.
To address these challenges, recent research and open-source projects have proposed tracing tools for distributed systems that communicate information between inter-operating system components at runtime by propagating in-band contexts. However, deploying tracing tools is fraught with difficulty, and their applicability is limited, because of the need for both intrusive instrumentation and pervasive deployment. Organizations report drawn-out struggles to deploy tracing tools because of the high developer cost associated with instrumenting all system components to participate in propagating the tool's context. This, along with other obstacles, has hindered the wider development of tracing tools in promising areas such as resource management and online monitoring.
In this thesis I extend concepts introduced by previous tracing tools to two new areas, presenting two tracing tools that achieve very different end-to-end goals. Retro, a resource management framework, provides per-tenant performance guarantees; Retro propagates tenant identifiers alongside requests, and uses them to attribute resource consumption to tenants and make per-tenant scheduling decisions. Pivot Tracing, a monitoring framework, gives operators and users the ability to obtain an arbitrary metric from one point of the system, while selecting, filtering, and grouping by events meaningful at other parts of the system; Pivot Tracing dynamically evaluates monitoring queries at multiple points during execution, and relates information between points by propagating partial query state alongside requests.
Next, I identify and categorize common challenges associated with developing and deploying tracing tools in distributed systems. Despite superficial differences between tracing tools, at their core they share common components and duplicate a majority of their implementation effort. Consequently, I advocate that instrumentation should be only a one-time task, reusable, and independent of any tracing tool; and that developing, deploying, and updating tracing tools should be possible without having to revisit or consider the underlying context propagation mechanisms.
Finally, the principal contribution of this thesis is a layered architecture for general-purpose context propagation in distributed systems called the Tracing Plane. The Tracing Plane enables new tracing tools to be developed and deployed independently without having to revisit system-level instrumentation to deploy new propagation logic. The Tracing Plane abstracts and encapsulates several key components that address the aforementioned challenges, and decouples the components into separate layers that can be addressed independently by different teams of developers. Two key abstractions, provided by the topmost and bottommost layers respectively, separate the concerns of tracing tool developers from those of system developers. First, tracing tool developers should implement contexts using an execution-flow scoped variables abstraction, which I liken to thread-local storage but dynamically scoped to end-to-end executions instead of threads. These variables are grouped into what I term baggage. Second, system developers doing instrumentation should propagate opaque contexts, which I term baggage contexts. Baggage contexts hide their underlying format from system developers and standardize system instrumentation on a simple set of five propagation primitives, thereby avoiding development-time decisions about which tracing tools to support.
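The comparison to thread-local storage can be made concrete with a rough, single-process analogy. The sketch below is not the Tracing Plane API: it uses Python's standard contextvars module, and the variable name baggage, the tenant payload, and the functions handle_request and storage_layer are hypothetical names chosen for illustration. It shows only the scoping idea: a value set where a request enters is visible deeper in the same execution, and concurrent executions do not see each other's values.

```python
import asyncio
import contextvars

# Hypothetical "baggage" variable; the name and the tenant payload are
# illustrative assumptions, not part of the Tracing Plane itself.
baggage = contextvars.ContextVar("baggage", default=None)


async def storage_layer():
    # Deep in the call chain, the value set at the request entry point is
    # visible without being passed as an explicit argument.
    return baggage.get()


async def handle_request(tenant_id):
    # Set the variable for this execution only.
    baggage.set({"tenant": tenant_id})
    # Yield to the event loop so the two requests interleave.
    await asyncio.sleep(0)
    return await storage_layer()


async def main():
    # gather() wraps each coroutine in a task, and each task runs in its own
    # copy of the current context, so the two requests stay isolated.
    return await asyncio.gather(
        handle_request("tenant-a"),
        handle_request("tenant-b"),
    )


results = asyncio.run(main())
```

In a distributed setting the same value would additionally have to be serialized and carried across process and machine boundaries, which is the part this in-process analogy leaves out.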
Implementing the Tracing Plane architecture requires careful consideration of how contexts should behave as they traverse component boundaries, as well as datatypes with well-defined branch and merge semantics for concurrent executions that fan in and out. I propose an implementation based on the theory of conflict-free replicated data types that encapsulates the subtleties of propagating heterogeneous datatypes and provides a general-purpose context that bridges the two key abstractions.
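As a minimal sketch of what well-defined branch and merge semantics can mean, the example below models a context as a mapping from names to grow-only sets, one of the simplest conflict-free replicated data types. The functions add, branch, and merge and the "visited" key are hypothetical names assumed for illustration; the thesis's actual datatypes and encoding are not described in this abstract.

```python
def add(ctx, key, value):
    # Return a new context with `value` added to the grow-only set at `key`.
    out = dict(ctx)
    out[key] = out.get(key, frozenset()) | {value}
    return out


def branch(ctx):
    # When an execution fans out, each side gets its own copy of the context.
    # The set values are immutable, so a shallow copy is enough.
    return dict(ctx), dict(ctx)


def merge(a, b):
    # When executions fan in, take the per-key union. Set union is
    # commutative, associative, and idempotent, so the result does not depend
    # on the order or the number of times the branches are merged.
    out = dict(a)
    for key, values in b.items():
        out[key] = out[key] | values if key in out else values
    return out


root = add({}, "visited", "frontend")
left, right = branch(root)
left = add(left, "visited", "cache")
right = add(right, "visited", "database")
joined = merge(left, right)
```

Here joined["visited"] contains all three component names, and merging in the opposite order, or merging joined with itself, gives the same result. Richer datatypes such as counters need more care to avoid double counting after a branch, which is the kind of subtlety the paragraph above refers to.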
The potential impact of a common architecture for context propagation is significant. Developers would no longer need to make development-time decisions about which tools to deploy and support, nor would they need a priori knowledge of dependencies between components. Such an architecture would also enable more pervasive, more useful, and more diverse tracing tools.
Host: Professor Rodrigo Fonseca