Jonathan Mace

Brown University
Department of Computer Science

About Me

I am a sixth-year PhD candidate in the Department of Computer Science at Brown University, advised by Asst. Professor Rodrigo Fonseca. I expect to graduate in May 2018, and I will be applying for academic and industrial positions in New Zealand and the USA.

My research focuses on understanding and improving performance in shared distributed systems -- systems in which many user requests run concurrently, and each request may traverse multiple component, machine, and application boundaries. My research includes topics such as end-to-end request tracing, resource management, distributed scheduling, and per-tenant performance guarantees.

A few problems I am thinking about:

  • Why is it so much harder to monitor, diagnose, and debug distributed systems when they go wrong, compared to a standalone program? Why do we lack equivalent tools for troubleshooting distributed systems? How can we debug a live production system without interrupting or interfering with its ongoing operation?
  • Many cloud settings are made up of numerous disparate services, designed to compose yet be agnostic to one another. What are the abstractions we need in this setting for understanding and improving the end-to-end performance of requests?
  • When we get to the point where we can fully trace our systems to extract rich, detailed data, how do we then generate insights? Hypothetically, these could range from high-level (e.g., subpopulation discovery with respect to user-perceived performance metrics) to low-level (e.g., causality between function calls, system events, and anomalies); yet this presents something of a 'needle-in-a-haystack' problem.

Interests

  • Distributed and Networked Systems
  • Operating Systems
  • End-to-End Request Tracing
  • Large-scale Performance Analysis
  • Multi-Tenant Performance Guarantees
  • Multi-Resource Scheduling

Recent News

  • 2016-12
    First prototype of Baggage Buffers and the Tracing Plane stack released on GitHub, with a paper soon to come.
  • 2016-06
    I'll be interning this summer at Facebook in New York, working on end-to-end request tracing.
  • 2016-04
    2DFQ is accepted to appear in SIGCOMM 2016! This is joint work with my colleagues from Microsoft.
  • I'll be presenting Pivot Tracing at USENIX ATC 2016 on Wednesday 6/22 as part of the "Best of the Rest" session.
  • 2016-01
    I'm very fortunate to be awarded a Facebook Graduate Fellowship!
  • 2015-10
    Pivot Tracing received a Best Paper Award at this year's SOSP.

Publications

SIGCOMM 2016

2DFQ: Two-Dimensional Fair Queuing for Multi-Tenant Cloud Services

Jonathan Mace, Peter Bodik, Madanlal Musuvathi, Rodrigo Fonseca, Krishnan Varadarajan
ACM SIGCOMM 2016

Abstract

In many important cloud services, different tenants execute their requests in the thread pool of the same process, requiring fair sharing of resources. However, using fair queue schedulers to provide fairness in this context is difficult because of high execution concurrency, and because request costs are unknown and have high variance. Using fair schedulers like WFQ and WF²Q in such settings leads to bursty schedules, where large requests block small ones for long periods of time. In this paper, we propose Two-Dimensional Fair Queuing (2DFQ), which spreads requests of different costs across different threads and minimizes the impact of tenants with unpredictable requests. In evaluation on production workloads from Azure Storage, a large-scale cloud system at Microsoft, we show that 2DFQ reduces the burstiness of service by 1-2 orders of magnitude. On workloads where many large requests compete with small ones, 2DFQ improves 99th percentile latencies by up to 2 orders of magnitude.

SOCC 2016

Principled Workflow-Centric Tracing of Distributed Systems

Raja R. Sambasivan, Ilari Shafer, Jonathan Mace, Benjamin H. Sigelman, Rodrigo Fonseca, Gregory R. Ganger
7th ACM Symposium on Cloud Computing (SOCC '16)

Abstract

Workflow-centric tracing captures the workflow of causally-related events (e.g., work done to process a request) within and among the components of a distributed system. As distributed systems grow in scale and complexity, such tracing is becoming a critical tool for understanding distributed system behavior. Yet, there is a fundamental lack of clarity about how such infrastructures should be designed to provide maximum benefit for important management tasks, such as resource accounting and diagnosis. Without research into this important issue, there is a danger that workflow-centric tracing will not reach its full potential. To help, this paper distills the design space of workflow-centric tracing and describes key design choices that can help or hinder a tracing infrastructure’s utility for important tasks. Our design space and options for them are based on our experiences developing several previous workflow-tracing infrastructures.

SOSP 2015

Pivot Tracing: Dynamic Causal Monitoring for Distributed Systems

Jonathan Mace, Ryan Roelke, Rodrigo Fonseca
25th ACM Symposium on Operating Systems Principles (SOSP '15)
Best Paper Award

Abstract

Monitoring and troubleshooting distributed systems is notoriously difficult; potential problems are complex, varied, and unpredictable. The monitoring and diagnosis tools commonly used today – logs, counters, and metrics – have two important limitations: what gets recorded is defined a priori, and the information is recorded in a component- or machine-centric way, making it extremely hard to correlate events that cross these boundaries. This paper presents Pivot Tracing, a monitoring framework for distributed systems that addresses both limitations by combining dynamic instrumentation with a novel relational operator: the happened-before join. Pivot Tracing gives users, at runtime, the ability to define arbitrary metrics at one point of the system, while being able to select, filter, and group by events meaningful at other parts of the system, even when crossing component or machine boundaries. We have implemented a prototype of Pivot Tracing for Java-based systems and evaluate it on a heterogeneous Hadoop cluster comprising HDFS, HBase, MapReduce, and YARN. We show that Pivot Tracing can effectively identify a diverse range of root causes such as software bugs, misconfiguration, and limping hardware. We show that Pivot Tracing is dynamic, extensible, and enables cross-tier analysis between inter-operating applications, with low execution overhead.

NSDI 2015

Retro: Targeted Resource Management in Multi-tenant Distributed Systems

Jonathan Mace, Peter Bodik, Rodrigo Fonseca, Madanlal Musuvathi
12th USENIX Symposium on Networked Systems Design and Implementation (NSDI '15)

Abstract

In distributed systems shared by multiple tenants, effective resource management is an important pre-requisite to providing quality of service guarantees. Many systems deployed today lack performance isolation and experience contention, slowdown, and even outages caused by aggressive workloads or by improperly throttled maintenance tasks such as data replication. In this work we present Retro, a resource management framework for shared distributed systems. Retro monitors per-tenant resource usage both within and across distributed systems, and exposes this information to centralized resource management policies through a high-level API. A policy can shape the resources consumed by a tenant using Retro’s control points, which enforce sharing and rate-limiting decisions. We demonstrate Retro through three policies providing bottleneck resource fairness, dominant resource fairness, and latency guarantees to high-priority tenants, and evaluate the system across five distributed systems: HBase, YARN, MapReduce, HDFS, and ZooKeeper. Our evaluation shows that Retro has low overhead, and achieves the policies’ goals, accurately detecting contended resources, throttling tenants responsible for slowdown and overload, and fairly distributing the remaining cluster capacity.

HotDep 2014

Towards General-Purpose Resource Management in Shared Cloud Services

Jonathan Mace, Peter Bodik, Rodrigo Fonseca, Madanlal Musuvathi
10th Workshop on Hot Topics in Dependability (HotDep '14)

Abstract

In distributed services shared by multiple tenants, managing resource allocation is an important pre-requisite to providing dependability and quality of service guarantees. Many systems deployed today experience contention, slowdown, and even system outages due to aggressive tenants and a lack of resource management. Improperly throttled background tasks, such as data replication, can overwhelm a system; conversely, high-priority background tasks, such as heartbeats, can be subject to resource starvation. In this paper, we outline five design principles necessary for effective and efficient resource management policies that could provide guaranteed performance, fairness, or isolation. We present Retro, a resource instrumentation framework that is guided by these principles. Retro instruments all system resources and exposes detailed, real-time statistics of per-tenant resource consumption, and could serve as a base for the implementation of such policies.

Other Documents

Survey 2017

End-to-End Tracing: Adoption and Use Cases

Jonathan Mace
Survey, Brown University, March 2017

Abstract

This document summarizes information about end-to-end tracing for 26 companies. The information was gathered from documents shared with the Distributed Tracing Workgroup and through in-person conversations at tracing workshops.

M.Sc. Project 2013

Revisiting End-to-End Trace Comparison with Graph Kernels

Jonathan Mace, Rodrigo Fonseca
Master's Project, Brown University, May 2013

Abstract

End-to-end tracing has emerged recently as a valuable tool to improve the dependability of distributed systems by performing dynamic verification and diagnosing correctness and performance problems. End-to-end traces are commonly represented as richly annotated directed acyclic graphs, with events as nodes and their causal dependencies as edges. Being able to automatically compare these graphs at scale is a key primitive for tasks such as clustering, classification, and anomaly detection. In this paper we explore recent developments in the theory of graph kernels, and investigate the feasibility of using a family of kernels based on the Weisfeiler-Lehman graph isomorphism test as an efficient and robust graph comparison primitive. We find that graph kernels provide a good formulation of the execution graph comparison problem, and present preliminary but encouraging results on their ability to distinguish high-level differences between execution graphs.

Experience

  • 2011 - 2018 (expected)

    Ph.D. Candidate

    Brown University Department of Computer Science

  • Summer 2016

    Research Intern

    Facebook, New York

  • Summer 2013, Summer 2015

    Research Intern

    Microsoft Research, Redmond

  • 2009 - 2011

    Software Developer

    IBM UK

  • 2005 - 2009

    Mathematics & Computer Science

    MMathComp, 1st Class

    Oxford University, UK

Awards

  • 2016

    Facebook Graduate Fellowship

    Pervasive Monitoring, Diagnostics, and Analytics of Distributed Systems through Dynamic Causal Tracing

  • 2015

    Best Paper Award

    Pivot Tracing: Dynamic Causal Monitoring for Distributed Systems, SOSP '15

  • 2011

    Graduate School Fellowship

    Brown University

  • 2006

    Hertford College Scholarship

    Oxford University, Hertford College

Demos

Interactive Hadoop Visualization

Execution Graph Comparison

Execution Graph Clustering