I am leaving MPI-SWS at the end of September 2022. I will be joining the Cloud Reliability group at Microsoft Research in Redmond, WA. I can still be reached using the contact information on this webpage.


I am a tenure-track faculty member at the Max Planck Institute for Software Systems -- a position equivalent to a US assistant professor -- and head of the Cloud Software Systems Group. I am also affiliated with Saarland University (UdS).

I joined MPI-SWS at the end of 2018. Before that, I was a graduate student at Brown University (2011-2018), where I was advised by Rodrigo Fonseca. At Brown, I was supported in part by a Facebook PhD Fellowship, and had various collaborations with researchers at Facebook and Microsoft.

Recent News


  • 2023/01 Our work, "A Qualitative Interview Study of Distributed Tracing Visualisation" (PDF) was accepted to appear in IEEE Transactions on Visualization and Computer Graphics, February 2023.
  • 2022/09 Our work on visualization in systems tools, "See it to Believe it? The Role of Visualisation in Systems Research" (PDF) was accepted at SoCC 2022.
  • 2022/07 Our work on lightweight rollback for serverless, "Groundhog: Reconciling Efficiency and Request Isolation in FaaS" (preprint) was accepted at EuroSys 2023.
  • 2022/06 Our work on tracing edge-cases, "The Benefit of Hindsight: Tracing Edge-Cases in Distributed Systems" (preprint) was accepted at NSDI 2023.
  • 2021/11 I will be serving as co-General Chair of SOSP 2023 alongside my colleagues Peter Druschel and Antoine Kaufmann. SOSP 2023 will be held in Koblenz, Germany!
  • 2020/11 Our work on DNN serving, "Serving DNNs like Clockwork: Performance Predictability from the Bottom Up" (PDF), received the Distinguished Artifact Award at OSDI 2020!
  • 2020/09 I will be teaching Distributed Systems next summer semester (2021) at Saarland University. It will be TA'ed by two of my PhD students, the excellent Thomas Davidson and Vaastav Anand.
  • 2020/04 Distributed Tracing in Practice is out! It's a book about distributed tracing, co-written with Rebecca Isaacs, Austin Parker, and Daniel Spoonhower. You can get it here.
  • 2020/02 Finalist for the Facebook ML Systems research award, for our work on DNN serving.
  • 2020/01 Teaching Advanced Topics in Cloud and Datacenter Systems this summer semester (2020) at Saarland University.

Awards


I lead the Cloud Software Systems Group at the Max Planck Institute for Software Systems. My group's research spans distributed systems, computer networks, and operating systems. We think about the kinds of systems that modern tech companies use, such as public cloud systems (like AWS, Google Cloud, and Azure); open-source systems (like Hadoop, Spark, and Kafka); and emerging paradigms (like microservices and serverless). A central goal for us is to make it easier to operate large, complicated software systems, and to understand their behavior at runtime.

Want to know more? A good place to start is my 2021 Research Statement, which gives an overview of the problems I like and the work I've done.

Systems@MPI. In addition to my group, other systems faculty at MPI include Peter Druschel, Anja Feldmann, Deepak Garg, Yiting Xia, and Antoine Kaufmann.

Research


My current research focuses on observability and tracing in cloud and distributed systems, performance and resource management, and building multi-tenant distributed systems. A few project highlights include the following:

Blueprint is a framework for developing flexible and modular microservice applications. Blueprint provides a programming abstraction for writing microservice applications that makes it easy to later change aspects related to the system's scaffolding and topology -- concerns such as which RPC frameworks or backends to use, placement and load balancing, and replication. Using Blueprint, applications don't early-bind to those choices. Later on, it's easy to change any of these aspects and re-generate a fully functional variant of the application. Blueprint makes comparative evaluation trivially easy -- it helps developers empirically pick the best-performing libraries, frameworks, and backends; and it helps researchers perform rigorous comparative evaluation of new prototypes. Blueprint was published at SOSP 2023 and its code can be found on GitHub.

Hindsight is a new distributed tracing framework that we have built from the ground up to support edge-case tracing. Hindsight enables detailed end-to-end tracing for rare and outlier requests without the data loss traditionally incurred by sampling-based systems. Hindsight avoids this loss by combining a short per-node history of telemetry, programmatic symptom detection, and rapid distributed retrieval. Hindsight was published at NSDI 2023 and its code can be found on GitLab.

Clockwork is a distributed DNN serving system designed for predictable end-to-end performance. Clockwork promotes end-to-end performance predictability as a first-class design concern. To achieve this, Clockwork's design eliminates major sources of performance variability and centralizes choices that lead to variability, such as scheduling and admission control. The end result is a system design with extremely tight tail latency. Clockwork received the Distinguished Artifact Award at OSDI 2020 and its code can be found on GitLab.

Pivot Tracing is a cross-component monitoring framework for distributed systems. Often the information needed to troubleshoot cross-component problems is relatively minimal, but inaccessible due to a lack of cross-component visibility. Pivot Tracing combines causal metadata propagation with dynamic instrumentation to overcome this limitation. Using Pivot Tracing, a system operator can use a simple SQL-like interface to define and measure arbitrary system metrics, while grouping, filtering, and aggregating those metrics according to arbitrary identifiers from other system components. Pivot Tracing received the Best Paper Award at SOSP 2015 and its code can be found on GitHub.

Software


Blueprint can be found on GitHub at Blueprint-uServices.

All research projects for my group are hosted on MPI's GitLab.

Some older projects can be found on GitHub under JonathanMace, brownsys, and tracingplane.

Current Members


Former Members


2024
Cloud Atlas: Efficient Fault Localization for Cloud Systems using Language Models and Causal Insight
Zhiqiang Xie, Yujia Zheng, Lizi Ottens, Kun Zhang, Christos Kozyrakis, Jonathan Mace
preprint, 2024. [PDF]
2023
Detection Is Better Than Cure: A Cloud Incidents Perspective
Vaibhav Ganatra, Anjaly Parayil, Supriyo Ghosh, Yu Kang, Minghua Ma, Chetan Bansal, Suman Nath, Jonathan Mace
ESEC/FSE, 2023. [PDF]
Blueprint: A Toolchain for Highly-Reconfigurable Microservice Applications
Vaastav Anand, Deepak Garg, Antoine Kaufmann, Jonathan Mace
SOSP, 2023. [PDF, GitHub]
Antipode: Enforcing Cross-Service Causal Consistency in Distributed Applications
João Loff, Daniel Porto, João Garcia, Jonathan Mace, Rodrigo Rodrigues
SOSP, 2023. [PDF]
The Benefit of Hindsight: Tracing Edge-Cases in Distributed Systems
Lei Zhang, Zhiqiang Xie, Vaastav Anand, Ymir Vigfusson, Jonathan Mace
NSDI, 2023. [PDF, GitLab]
Groundhog: Reconciling Efficiency and Request Isolation in FaaS
Mohamed Alzayat, Jonathan Mace, Peter Druschel, Deepak Garg
EuroSys, 2023. [PDF]
A Qualitative Interview Study of Distributed Tracing Visualisation: A Characterisation of Challenges and Opportunities
Thomas Davidson, Emily Wall, Jonathan Mace
IEEE Transactions on Visualization and Computer Graphics, February 2023. [PDF]
2022
See it to Believe it? The Role of Visualisation in Systems Research
Thomas Davidson, Jonathan Mace
SoCC, 2022. [PDF]
The Odd One Out: Energy is not like Other Metrics
Vaastav Anand, Zhiqiang Xie, Matheus Stolet, Roberta De Viti, Thomas Davidson, Reyhaneh Karimipour, Safya Alzayat, Jonathan Mace
HotCarbon, 2022. [PDF]
ACT now: Aggregate Comparison of Traces for Incident Localization
Kamala Ramasubramanian, Ashutosh Raina, Jonathan Mace, Peter Alvaro
arXiv preprint, 2022. [PDF]
2020
Serving DNNs like Clockwork: Performance Predictability from the Bottom Up
Arpan Gujarati, Reza Karimi, Safya Alzayat, Wei Hao, Antoine Kaufmann, Ymir Vigfusson, Jonathan Mace
OSDI, 2020. [PDF, GitLab]
Distinguished Artifact Award
Distributed Tracing in Practice
Austin Parker, Daniel Spoonhower, Jonathan Mace, Rebecca Isaacs
Textbook, published July 2020.
2019
Sifter: Scalable Sampling for Distributed Traces, without Feature Engineering
Pedro Las-Casas, Giorgi Papakerashvili, Vaastav Anand, Jonathan Mace
SoCC, 2019. [PDF]
No DNN Left Behind: Improving Inference in the Cloud with Multi-Tenancy
Amit Samanta, Suhas Shrinivasan, Antoine Kaufmann, Jonathan Mace
arXiv preprint, 2019. [PDF]
2018
Weighted Sampling of Execution Traces: Capturing More Needles and Less Hay
Pedro Las-Casas, Jonathan Mace, Dorgival Guedes, Rodrigo Fonseca
SoCC, 2018. [PDF]
A Universal Architecture for Cross-Cutting Tools in Distributed Systems
Jonathan Mace
Ph.D. Thesis, Brown University, 2018. [PDF]
Dennis M. Ritchie Doctoral Dissertation Award, Honorable Mention
Universal Context Propagation for Distributed System Instrumentation
Jonathan Mace, Rodrigo Fonseca
EuroSys, 2018. [PDF]
2017
Canopy: An End-to-End Performance Tracing And Analysis System
Jonathan Kaldor, Jonathan Mace, Michał Bejda, Edison Gao, Wiktor Kuropatwa, Joe O'Neill, Kian Win Ong, Bill Schaller, Pingjia Shan, Brendan Viscomi, Vinod Venkataraman, Kaushik Veeraraghavan, Yee Jiun Song
SOSP, 2017. [PDF]
2016
Principled Workflow-Centric Tracing of Distributed Systems
Raja R. Sambasivan, Ilari Shafer, Jonathan Mace, Benjamin H. Sigelman, Rodrigo Fonseca, Gregory R. Ganger
SoCC, 2016. [PDF]
2DFQ: Two-Dimensional Fair Queuing for Multi-Tenant Cloud Services
Jonathan Mace, Peter Bodik, Madanlal Musuvathi, Rodrigo Fonseca, Krishnan Varadarajan
SIGCOMM, 2016. [PDF]
2015
Pivot Tracing: Dynamic Causal Monitoring for Distributed Systems
Jonathan Mace, Ryan Roelke, Rodrigo Fonseca
SOSP, 2015. [PDF]
Best Paper Award
We are Losing Track: a Case for Causal Metadata in Distributed Systems
Rodrigo Fonseca, Jonathan Mace
HPTS, 2015. [PDF]
Retro: Targeted Resource Management in Multi-Tenant Distributed Systems
Jonathan Mace, Peter Bodik, Rodrigo Fonseca, Madanlal Musuvathi
NSDI, 2015. [PDF]
2014
Towards General-Purpose Resource Management in Shared Cloud Services
Jonathan Mace, Peter Bodik, Rodrigo Fonseca, Madanlal Musuvathi
HotDep, 2014. [PDF]
2013
Revisiting End-to-End Trace Comparison with Graph Kernels
Jonathan Mace, Rodrigo Fonseca
MSc Project, Brown University, 2013. [PDF]

jcmace@mpi-sws.org

+49 681 9303-8801

Room 409 in the Saarbrücken building of MPI-SWS. Map

MPI-SWS
Saarland Informatics Campus E1 5
66123 Saarbrücken
Germany