I am a tenure-track faculty member at the Max Planck Institute for Software Systems -- a position equivalent to a US assistant professor -- and head of the Cloud Software Systems Group. I am also affiliated with Saarland University (UdS).

I joined MPI-SWS at the end of 2018. Before that, I was a graduate student at Brown University (2011-2018), where I was advised by Rodrigo Fonseca. At Brown, I was supported in part by a Facebook PhD Fellowship, and had various collaborations with researchers at Facebook and Microsoft.

Recent News

  • 2020/11 Our work on DNN serving, "Serving DNNs like Clockwork: Performance Predictability from the Bottom Up" (PDF), received the Distinguished Artifact Award at OSDI 2020!
  • 2020/09 I will be teaching Distributed Systems next summer semester (2021) at Saarland University. It will be TA'ed by two of my PhD students, the excellent Thomas Davidson and Vaastav Anand.
  • 2020/04 Distributed Tracing in Practice is out! It's a book about distributed tracing, co-written with Rebecca Isaacs, Austin Parker, and Daniel Spoonhower. You can get it here.
  • 2020/02 Finalist for the Facebook ML Systems research award, for our work on DNN serving.
  • 2020/01 Teaching Advanced Topics in Cloud and Datacenter Systems this summer semester (2020) at Saarland University.


I lead the Cloud Software Systems Group at the Max Planck Institute for Software Systems. My group's research spans distributed systems, computer networks, and operating systems. We think about the kinds of systems that modern tech companies use, such as public cloud systems (like AWS, Google Cloud, and Azure); open-source systems (like Hadoop, Spark, and Kafka); and emerging paradigms (like microservices and serverless). A central goal for us is to make it easier to operate large, complicated software systems, and to understand their behavior at runtime.

Want to know more? A good place to start is my 2021 Research Statement, which gives an overview of the problems I like and the work I've done.

Systems@MPI. In addition to my group, other systems faculty at MPI include Peter Druschel, Anja Feldmann, Deepak Garg, Yiting Xia, and Antoine Kaufmann.


My current research focuses on observability and tracing in cloud and distributed systems, performance and resource management, and building multi-tenant distributed systems. A few project highlights include the following:

Millenial is a framework for developing flexible and modular microservice applications. Millenial provides a programming abstraction for writing microservice applications that makes it easy to later change aspects of the system's scaffolding and topology -- e.g., which RPC frameworks or backends to use, placement and load balancing, replication, and so on. Using Millenial, applications don't bind early to those choices. Later on, it's easy to change any of these aspects and re-generate a fully functional variant of the application. Millenial makes comparative evaluation easy -- it helps developers empirically pick the best-performing libraries, frameworks, and backends; and it helps researchers perform rigorous comparative evaluation of new prototypes. Millenial is a work in progress, but contact us if you are interested in early access!

Hindsight is a new distributed tracing framework that we have built from the ground up to support edge-case tracing. Hindsight enables detailed end-to-end tracing for rare and outlier requests without the data loss traditionally incurred by sampling-based systems. Hindsight avoids this loss by combining a short per-node history of telemetry, programmatic symptom detection, and rapid distributed retrieval. Check out the arXiv preprint! Its code can be found on GitLab.

Clockwork is a distributed DNN serving system that treats end-to-end performance predictability as a first-class design concern. To achieve this, Clockwork's design eliminates major sources of performance variability and centralizes choices that lead to variability, such as scheduling and admission control. The end result is a system design with extremely tight tail latency. Clockwork received the Distinguished Artifact Award at OSDI 2020, and its code can be found on GitLab.

Pivot Tracing is a cross-component monitoring framework for distributed systems. Often the information needed to troubleshoot cross-component problems is relatively minimal, but inaccessible due to a lack of cross-component visibility. Pivot Tracing combines causal metadata propagation with dynamic instrumentation to overcome this limitation. Using Pivot Tracing, a system operator can use a simple SQL-like interface to define and measure arbitrary system metrics, while grouping, filtering, and aggregating those metrics according to arbitrary identifiers from other system components. Pivot Tracing received the Best Paper Award at SOSP 2015 and its code can be found on GitHub.


All research projects for my group are hosted on MPI's GitLab.

Some older projects can be found on GitHub under JonathanMace, brownsys, and tracingplane.

Current Members

Former Members

Serving DNNs like Clockwork: Performance Predictability from the Bottom Up
Arpan Gujarati, Reza Karimi, Safya Alzayat, Wei Hao, Antoine Kaufmann, Ymir Vigfusson, Jonathan Mace
OSDI, 2020. [PDF]
Distinguished Artifact Award
Distributed Tracing in Practice
Austin Parker, Daniel Spoonhower, Jonathan Mace, Rebecca Isaacs
Textbook, Publication July 2020.
Sifter: Scalable Sampling for Distributed Traces, without Feature Engineering
Pedro Las-Casas, Giorgi Papakerashvili, Vaastav Anand, Jonathan Mace
SoCC, 2019. [PDF]
No DNN Left Behind: Improving Inference in the Cloud with Multi-Tenancy
Amit Samanta, Suhas Shrinivasan, Antoine Kaufmann, Jonathan Mace
arXiv, 2019. [PDF]
Weighted Sampling of Execution Traces: Capturing More Needles and Less Hay
Pedro Las-Casas, Jonathan Mace, Dorgival Guedes, Rodrigo Fonseca
SoCC, 2018. [PDF]
A Universal Architecture for Cross-Cutting Tools in Distributed Systems
Jonathan Mace
Ph.D. Thesis, Brown University, 2018. [PDF]
Dennis M. Ritchie Doctoral Dissertation Award, Honorable Mention
Universal Context Propagation for Distributed System Instrumentation
Jonathan Mace, Rodrigo Fonseca
EuroSys, 2018. [PDF]
Canopy: An End-to-End Performance Tracing And Analysis System
Jonathan Kaldor, Jonathan Mace, Michał Bejda, Edison Gao, Wiktor Kuropatwa, Joe O'Neill, Kian Win Ong, Bill Schaller, Pingjia Shan, Brendan Viscomi, Vinod Venkataraman, Kaushik Veeraraghavan, Yee Jiun Song
SOSP, 2017. [PDF]
Principled Workflow-Centric Tracing of Distributed Systems
Raja R. Sambasivan, Ilari Shafer, Jonathan Mace, Benjamin H. Sigelman, Rodrigo Fonseca, Gregory R. Ganger
SoCC, 2016. [PDF]
2DFQ: Two-Dimensional Fair Queuing for Multi-Tenant Cloud Services
Jonathan Mace, Peter Bodik, Madanlal Musuvathi, Rodrigo Fonseca, Krishnan Varadarajan
SIGCOMM, 2016. [PDF]
Pivot Tracing: Dynamic Causal Monitoring for Distributed Systems
Jonathan Mace, Ryan Roelke, Rodrigo Fonseca
SOSP, 2015. [PDF]
Best Paper Award
We are Losing Track: a Case for Causal Metadata in Distributed Systems
Rodrigo Fonseca, Jonathan Mace
HPTS, 2015. [PDF]
Retro: Targeted Resource Management in Multi-Tenant Distributed Systems
Jonathan Mace, Peter Bodik, Rodrigo Fonseca, Madanlal Musuvathi
NSDI, 2015. [PDF]
Towards General-Purpose Resource Management in Shared Cloud Services
Jonathan Mace, Peter Bodik, Rodrigo Fonseca, Madanlal Musuvathi
HotDep, 2014. [PDF]
Revisiting End-to-End Trace Comparison with Graph Kernels
Jonathan Mace, Rodrigo Fonseca
MSc Project, Brown University, 2013. [PDF]


+49 681 9303-8801

Room 409 in the Saarbrücken building of MPI-SWS. Map

Saarland Informatics Campus E1 5
66123 Saarbrücken