Jonathan Mace

PhD Candidate
Brown University
Department of Computer Science

This web page is deprecated as of September 2018!

I am now a tenure-track faculty member at the Max Planck Institute for Software Systems in Saarbrücken, Germany.

Click here to visit my home page at MPI-SWS

About Me

I graduated in May 2018 and I am now a tenure-track faculty member at the Max Planck Institute for Software Systems in Saarbrücken, Germany!

At Brown, I was advised by Professor Rodrigo Fonseca.

My research focuses on monitoring, understanding, and enforcing distributed system behaviors. In my PhD work I adapted techniques from end-to-end request tracing to new applications in multi-tenant resource management and dynamic causal profiling. My work on Pivot Tracing introduced baggage, a concept for generic cross-system metadata now widely used by tracing systems like Zipkin and OpenTracing.

I am currently working on large-scale end-to-end performance analysis, ranging from data collection, aggregation, and storage, to deriving high-level insights using machine learning and statistical analysis. My ongoing work is a collaboration with researchers in Facebook's tracing and performance groups.

Interests

Distributed Systems
Networking & Operating Systems
Multi-Tenant Cloud Systems
Multi-Resource Scheduling
End-to-End Request Tracing
Data-Driven Performance Analysis

Publications

SoCC 2018

Weighted Sampling of Execution Traces: Capturing More Needles and Less Hay

Pedro Las-Casas, Jonathan Mace, Dorgival Guedes, Rodrigo Fonseca

In Proceedings of the 9th ACM Symposium on Cloud Computing (SoCC '18)

Abstract

End-to-end tracing has emerged recently as a valuable tool to improve the dependability of distributed systems, by performing dynamic verification and diagnosing correctness and performance problems. Contrary to logging, end-to-end traces enable coherent sampling of the entire execution of specific requests, and this is exploited by many deployments to reduce the overhead and storage requirements of tracing. This sampling, however, is usually done uniformly at random, which dedicates a large fraction of the sampling budget to common, 'normal' executions, while missing infrequent, but sometimes important, erroneous or anomalous executions. In this paper we define the representative trace sampling problem, and present a new approach, based on clustering of execution graphs, that is able to bias the sampling of requests to maximize the diversity of execution traces stored towards infrequent patterns. In a preliminary, but encouraging work, we show how our approach chooses to persist representative and diverse executions, even when anomalous ones are very infrequent.

@inproceedings{lascasas2018weighted,
  title={{Weighted Sampling of Execution Traces: Capturing More Needles and Less Hay}},
  author={Las-Casas, Pedro and Mace, Jonathan and Guedes, Dorgival and Fonseca, Rodrigo},
  booktitle={9th ACM Symposium on Cloud Computing (SoCC '18)},
}

Ph.D. Thesis 2018

A Universal Architecture for Cross-Cutting Tools in Distributed Systems

Jonathan Mace

Ph.D. Thesis, Brown University, May 2018

Abstract

Recent research has proposed a variety of cross-cutting tools to help monitor and troubleshoot end-to-end behaviors in distributed systems. However, most prior tools focus on data collection and aggregation, and treat analysis as a distinct step to be performed later, offline. This restricts the applicability of such tools to only doing post-facto analysis. However, this is not a fundamental limitation. Recent research has proposed tools that integrate analysis and decision-making at runtime, to directly enforce end-to-end behaviors and adapt to events.

In this thesis I present two new applications of cross-cutting tools to previously unexplored domains: resource management, and dynamic monitoring. Retro, a cross-cutting tool for resource management, provides end-to-end performance guarantees by propagating tenant identifiers with executions, and using them to attribute resource consumption and enforce throttling decisions. Pivot Tracing, a cross-cutting tool for dynamic monitoring, dynamically monitors metrics and contextualizes them based on properties deriving from arbitrary points in an end-to-end execution.

Retro and Pivot Tracing illustrate the potential breadth of cross-cutting tools in providing visibility and control over distributed system behaviors. From this, I identify and characterize the common challenges associated with developing and deploying cross-cutting tools. This motivates the design of baggage contexts, a general-purpose context that can be shared and reused by different cross-cutting tools. Baggage contexts abstract and encapsulate components that are otherwise duplicated by most cross-cutting tools, and decouple the design of tools into separate layers that can be addressed independently by different teams of developers.

The potential impact of a common architecture for cross-cutting tools is significant. It would enable more pervasive, more useful, and more diverse cross-cutting tools, and make it easier for developers to defer development-time decisions about which tools to deploy and support.

@phdthesis{mace2018thesis,
  title={{A Universal Architecture for Cross-Cutting Tools in Distributed Systems}},
  author={Mace, Jonathan},
  school={Brown University},
  year=2018
}

EuroSys 2018

Universal Context Propagation for Distributed System Instrumentation

Jonathan Mace, Rodrigo Fonseca

In Proceedings of the 13th ACM European Conference on Computer Systems (EuroSys '18)

Abstract

Many tools for analyzing distributed systems propagate contexts along the execution paths of requests, tasks, and jobs, in order to correlate events across process, component and machine boundaries. There is a wide range of existing and proposed uses for these tools, which we call cross-cutting tools, such as tracing, debugging, taint propagation, provenance, auditing, and resource management, but few of them get deployed pervasively in large systems. When they do, they are brittle, hard to evolve, and cannot coexist with each other. While they use very different context metadata, the way they propagate the information alongside execution is the same. Nevertheless, in existing tools, these aspects are deeply intertwined, causing most of these problems. In this paper, we propose a layered architecture for cross-cutting tools that separates concerns of system developers and tool developers, enabling independent instrumentation of systems, and the deployment and evolution of multiple such tools. At the heart of this layering is a general underlying format, baggage contexts, that enables the complete decoupling of system instrumentation for context propagation from tool logic. Baggage contexts make propagation opaque and general, while still maintaining correctness of the metadata under arbitrary concurrency and different data types. We demonstrate the practicality of the architecture with implementations in Java and Go, porting of several existing cross-cutting tools, and instrumenting existing distributed systems with all of them.

@inproceedings{mace2018universal,
  title={{Universal Context Propagation for Distributed System Instrumentation}},
  author={Mace, Jonathan and Fonseca, Rodrigo},
  booktitle={13th ACM European Conference on Computer Systems (EuroSys '18)},
}

SOSP 2017

Canopy: An End-to-End Performance Tracing And Analysis System

Jonathan Kaldor, Jonathan Mace, Michał Bejda, Edison Gao, Wiktor Kuropatwa, Joe O'Neill, Kian Win Ong, Bill Schaller, Pingjia Shan, Brendan Viscomi, Vinod Venkataraman, Kaushik Veeraraghavan, Yee Jiun Song

In Proceedings of the 26th ACM Symposium on Operating Systems Principles (SOSP '17)
-----
Also featured in The Morning Paper

Abstract

This paper presents Canopy, Facebook’s end-to-end performance tracing infrastructure. Canopy records causally related performance data across the end-to-end execution path of requests, including from browsers, mobile applications, and backend services. Canopy processes traces in near real-time, derives user-specified features, and outputs to performance datasets that aggregate across billions of requests. Using Canopy, Facebook engineers can query and analyze performance data in real-time. Canopy addresses three challenges we have encountered in scaling performance analysis: supporting the range of execution and performance models used by different components of the Facebook stack; supporting interactive ad-hoc analysis of performance data; and enabling deep customization by users, from sampling traces to extracting and visualizing features. Canopy currently records and processes over 1 billion traces per day. We discuss how Canopy has evolved to apply to a wide range of scenarios, and present case studies of its use in solving various performance challenges.

@inproceedings{kaldor2017canopy,
  title={{Canopy: An End-to-End Performance Tracing And Analysis System}},
  author={Kaldor, Jonathan and Mace, Jonathan and Bejda, Micha\l{} and Gao, Edison and Kuropatwa, Wiktor and O'Neill, Joe and Ong, Kian Win and Schaller, Bill and Shan, Pingjia and Viscomi, Brendan and Venkataraman, Vinod and Veeraraghavan, Kaushik and Song, Yee Jiun},
  booktitle={26th ACM Symposium on Operating Systems Principles (SOSP '17)},
}

SIGCOMM 2016

2DFQ: Two-Dimensional Fair Queuing for Multi-Tenant Cloud Services

Jonathan Mace, Peter Bodik, Madanlal Musuvathi, Rodrigo Fonseca, Krishnan Varadarajan

In Proceedings of the 2016 ACM SIGCOMM Conference

Abstract

In many important cloud services, different tenants execute their requests in the thread pool of the same process, requiring fair sharing of resources. However, using fair queue schedulers to provide fairness in this context is difficult because of high execution concurrency, and because request costs are unknown and have high variance. Using fair schedulers like WFQ and WF²Q in such settings leads to bursty schedules, where large requests block small ones for long periods of time. In this paper, we propose Two-Dimensional Fair Queuing (2DFQ), which spreads requests of different costs across different threads and minimizes the impact of tenants with unpredictable requests. In evaluation on production workloads from Azure Storage, a large-scale cloud system at Microsoft, we show that 2DFQ reduces the burstiness of service by 1-2 orders of magnitude. On workloads where many large requests compete with small ones, 2DFQ improves 99th percentile latencies by up to 2 orders of magnitude.

@inproceedings{mace20162dfq,
  title={{2DFQ: Two-Dimensional Fair Queuing for Multi-Tenant Cloud Services}},
  author={Mace, Jonathan and Bodik, Peter and Musuvathi, Madanlal and Fonseca, Rodrigo and Varadarajan, Krishnan},
  booktitle={Proceedings of the ACM SIGCOMM 2016 Conference (SIGCOMM '16)},
}

SoCC 2016

Principled Workflow-Centric Tracing of Distributed Systems

Raja R. Sambasivan, Ilari Shafer, Jonathan Mace, Benjamin H. Sigelman, Rodrigo Fonseca, Gregory R. Ganger

In Proceedings of the 7th ACM Symposium on Cloud Computing (SoCC '16)

Abstract

Workflow-centric tracing captures the workflow of causally-related events (e.g., work done to process a request) within and among the components of a distributed system. As distributed systems grow in scale and complexity, such tracing is becoming a critical tool for understanding distributed system behavior. Yet, there is a fundamental lack of clarity about how such infrastructures should be designed to provide maximum benefit for important management tasks, such as resource accounting and diagnosis. Without research into this important issue, there is a danger that workflow-centric tracing will not reach its full potential. To help, this paper distills the design space of workflow-centric tracing and describes key design choices that can help or hinder a tracing infrastructure’s utility for important tasks. Our design space and options for them are based on our experiences developing several previous workflow-tracing infrastructures.

@inproceedings{sambasivan2016principled,
  title={{Principled Workflow-Centric Tracing of Distributed Systems}},
  author={Sambasivan, Raja R and Shafer, Ilari and Mace, Jonathan and Sigelman, Benjamin H and Fonseca, Rodrigo and Ganger, Gregory R},
  booktitle={7th ACM Symposium on Cloud Computing (SoCC '16)},
}

SOSP 2015 Best Paper Award

Pivot Tracing: Dynamic Causal Monitoring for Distributed Systems

Jonathan Mace, Ryan Roelke, Rodrigo Fonseca

In Proceedings of the 25th ACM Symposium on Operating Systems Principles (SOSP '15)
In ACM Transactions on Computer Systems (TOCS, forthcoming 2017)
In Communications of the ACM (CACM, forthcoming 2017)
-----
Also featured in The Morning Paper.

Abstract

Monitoring and troubleshooting distributed systems is notoriously difficult; potential problems are complex, varied, and unpredictable. The monitoring and diagnosis tools commonly used today – logs, counters, and metrics – have two important limitations: what gets recorded is defined a priori, and the information is recorded in a component- or machine-centric way, making it extremely hard to correlate events that cross these boundaries. This paper presents Pivot Tracing, a monitoring framework for distributed systems that addresses both limitations by combining dynamic instrumentation with a novel relational operator: the happened-before join. Pivot Tracing gives users, at runtime, the ability to define arbitrary metrics at one point of the system, while being able to select, filter, and group by events meaningful at other parts of the system, even when crossing component or machine boundaries. We have implemented a prototype of Pivot Tracing for Java-based systems and evaluate it on a heterogeneous Hadoop cluster comprising HDFS, HBase, MapReduce, and YARN. We show that Pivot Tracing can effectively identify a diverse range of root causes such as software bugs, misconfiguration, and limping hardware. We show that Pivot Tracing is dynamic, extensible, and enables cross-tier analysis between inter-operating applications, with low execution overhead.

@inproceedings{mace2015pivot,
  title={{Pivot Tracing: Dynamic Causal Monitoring for Distributed Systems}},
  author={Mace, Jonathan and Roelke, Ryan and Fonseca, Rodrigo},
  booktitle={25th ACM Symposium on Operating Systems Principles (SOSP '15)},
}

NSDI 2015

Retro: Targeted Resource Management in Multi-tenant Distributed Systems

Jonathan Mace, Peter Bodik, Rodrigo Fonseca, Madanlal Musuvathi

In Proceedings of the 12th USENIX Symposium on Networked Systems Design and Implementation (NSDI '15)

Abstract

In distributed systems shared by multiple tenants, effective resource management is an important pre-requisite to providing quality of service guarantees. Many systems deployed today lack performance isolation and experience contention, slowdown, and even outages caused by aggressive workloads or by improperly throttled maintenance tasks such as data replication. In this work we present Retro, a resource management framework for shared distributed systems. Retro monitors per-tenant resource usage both within and across distributed systems, and exposes this information to centralized resource management policies through a high-level API. A policy can shape the resources consumed by a tenant using Retro’s control points, which enforce sharing and rate-limiting decisions. We demonstrate Retro through three policies providing bottleneck resource fairness, dominant resource fairness, and latency guarantees to high-priority tenants, and evaluate the system across five distributed systems: HBase, Yarn, MapReduce, HDFS, and Zookeeper. Our evaluation shows that Retro has low overhead, and achieves the policies’ goals, accurately detecting contended resources, throttling tenants responsible for slowdown and overload, and fairly distributing the remaining cluster capacity.

@inproceedings{mace2015retro,
  title={{Retro: Targeted Resource Management in Multi-tenant Distributed Systems}},
  author={Mace, Jonathan and Bodik, Peter and Fonseca, Rodrigo and Musuvathi, Madanlal},
  booktitle={12th USENIX Symposium on Networked Systems Design and Implementation (NSDI '15)},
}

HPTS 2015

We are Losing Track: a Case for Causal Metadata in Distributed Systems

Rodrigo Fonseca, Jonathan Mace

In Proceedings of the 15th International Workshop on High Performance Transaction Systems (HPTS '15)

As our systems move to more concurrent and distributed execution patterns, the tools and abstractions we have to understand, monitor, schedule, and enforce their behavior become progressively less effective or adequate. We argue that systems should be built with causal propagation of generic metadata as a first class primitive, to serve as the narrow waist upon which many debugging and troubleshooting tools could be built, in an analogy to the role of the IP layer in networking.

@inproceedings{fonseca2015losing,
  title={{We are Losing Track: a Case for Causal Metadata in Distributed Systems}},
  author={Fonseca, Rodrigo and Mace, Jonathan},
  booktitle={15th International Workshop on High Performance Transaction Systems (HPTS '15)},
}

HotDep 2014

Towards General-Purpose Resource Management in Shared Cloud Services

Jonathan Mace, Peter Bodik, Rodrigo Fonseca, Madanlal Musuvathi

In Proceedings of the 10th Workshop on Hot Topics in System Dependability (HotDep '14)

Abstract

In distributed services shared by multiple tenants, managing resource allocation is an important pre-requisite to providing dependability and quality of service guarantees. Many systems deployed today experience contention, slowdown, and even system outages due to aggressive tenants and a lack of resource management. Improperly throttled background tasks, such as data replication, can overwhelm a system; conversely, high-priority background tasks, such as heartbeats, can be subject to resource starvation. In this paper, we outline five design principles necessary for effective and efficient resource management policies that could provide guaranteed performance, fairness, or isolation. We present Retro, a resource instrumentation framework that is guided by these principles. Retro instruments all system resources and exposes detailed, real-time statistics of per-tenant resource consumption, and could serve as a base for the implementation of such policies.

@inproceedings{mace2014towards,
  title={{Towards General-Purpose Resource Management in Shared Cloud Services}},
  author={Mace, Jonathan and Bodik, Peter and Fonseca, Rodrigo and Musuvathi, Madanlal},
  booktitle={10th Workshop on Hot Topics in System Dependability (HotDep '14)},
  year={2014}
}

Experience

2011 - 2018 (expected)

Ph.D. Candidate

Department of Computer Science

Brown University, USA

Summer 2016

Research Intern

Facebook, New York

Summer 2013, Summer 2015

Research Intern

Microsoft Research, Redmond

2011 - 2014

MSc Computer Science

Brown University, USA

2009 - 2011

Software Developer

IBM UK

2005 - 2009

Mathematics & Computer Science

MMathComp, 1st Class

Oxford University, UK

Awards

2017

SIGCOMM Student Scholar

50 Years of the ACM Turing Award Celebration

2016

Facebook Graduate Fellowship

Pervasive Monitoring, Diagnostics, and Analytics of Distributed Systems through Dynamic Causal Tracing

2016

USENIX ATC "Best of the Rest"

Invited speaker for Pivot Tracing: Dynamic Causal Monitoring for Distributed Systems

2015

Best Paper Award

Pivot Tracing: Dynamic Causal Monitoring for Distributed Systems, SOSP '15

2015

Great TA Award

Nominated by students of Brown CS138: Distributed Systems, Spring Semester

2015

Student Scholar

3rd Heidelberg Laureate Forum

2011

Graduate School Fellowship

Brown University

2006

Hertford College Scholarship

Oxford University, Hertford College