I am leaving MPI-SWS at the end of September 2022. I will be joining the Cloud Reliability group at Microsoft Research in Redmond, WA. I can still be reached using the contact information on this webpage.
I am a tenure-track faculty member at the Max Planck Institute for Software Systems -- a position equivalent to a US assistant professor -- and head of the Cloud Software Systems Group. I am also affiliated with Saarland University (UdS).
I joined MPI-SWS at the end of 2018. Before that, I was a graduate student at Brown University (2011-2018), where I was advised by Rodrigo Fonseca. At Brown, I was supported in part by a Facebook PhD Fellowship, and had various collaborations with researchers at Facebook and Microsoft.
Recent News
- 2023/01 Our work, "A Qualitative Interview Study of Distributed Tracing Visualisation" (PDF) was accepted to appear in IEEE Transactions on Visualization and Computer Graphics, February 2023.
- 2022/09 Our work on visualization in systems tools, "See it to Believe it? The Role of Visualisation in Systems Research" (PDF) was accepted at SoCC 2022.
- 2022/07 Our work on lightweight rollback for serverless, "Groundhog: Reconciling Efficiency and Request Isolation in FaaS" (preprint) was accepted at EuroSys 2023.
- 2022/06 Our work on tracing edge-cases, "The Benefit of Hindsight: Tracing Edge-Cases in Distributed Systems" (preprint) was accepted at NSDI 2023.
- 2021/11 I will be serving as co-General Chair of SOSP 2023 alongside my colleagues Peter Druschel and Antoine Kaufmann. SOSP 2023 will be held in Koblenz, Germany!
- 2020/11 Our work on DNN serving, "Serving DNNs like Clockwork: Performance Predictability from the Bottom Up" (PDF), received the Distinguished Artifact Award at OSDI 2020!
- 2020/09 I will be teaching Distributed Systems next summer semester (2021) at Saarland University. It will be TA'ed by two of my PhD students, the excellent Thomas Davidson and Vaastav Anand.
- 2020/04 Distributed Tracing in Practice is out! It's a book about distributed tracing, co-written with Rebecca Isaacs, Austin Parker, and Daniel Spoonhower. You can get it here.
- 2020/02 Finalist for the Facebook ML Systems research award, for our work on DNN serving.
- 2020/01 Teaching Advanced Topics in Cloud and Datacenter Systems this summer semester (2020) at Saarland University.
Awards
- 2020/11 Distinguished Artifact Award at OSDI 2020, for Serving DNNs like Clockwork: Performance Predictability from the Bottom Up.
- 2020/02 Finalist for the Facebook ML Systems research award, for our work on DNN serving.
- 2018/10 Honorable Mention for the 2018 Dennis M. Ritchie Doctoral Dissertation Award, for my thesis A Universal Architecture for Cross-Cutting Tools in Distributed Systems.
- 2016/09 Facebook PhD Fellowship in Distributed Systems (2016-2018).
- 2015/10 Best Paper Award at SOSP 2015, for Pivot Tracing: Dynamic Causal Monitoring for Distributed Systems.
I lead the Cloud Software Systems Group at the Max Planck Institute for Software Systems. My group's research spans distributed systems, computer networks, and operating systems. We think about the kinds of systems that modern tech companies use, such as public cloud systems (like AWS, Google Cloud, and Azure); open-source systems (like Hadoop, Spark, and Kafka); and emerging paradigms (like microservices and serverless). A central goal for us is to make it easier to operate large, complicated software systems, and to understand their behavior at runtime.
Want to know more? A good place to start is my 2021 Research Statement, which gives an overview of the problems I like and the work I've done.
Systems@MPI. In addition to my group, other systems faculty at MPI include Peter Druschel, Anja Feldmann, Deepak Garg, Yiting Xia, and Antoine Kaufmann.
Research
My current research focuses on observability and tracing in cloud and distributed systems, performance and resource management, and building multi-tenant distributed systems. A few project highlights include the following:
Blueprint is a framework for developing flexible and modular microservice applications. Blueprint provides a programming abstraction for writing microservice applications that makes it easy to later change aspects related to the system's scaffolding and topology -- e.g., which RPC frameworks or backends to use, placement and load balancing, replication, and so on. Using Blueprint, applications don't early-bind to those choices. Later on, it's easy to change any of these aspects and re-generate a fully functional variant of the application. Blueprint makes comparative evaluation trivially easy -- it helps developers empirically pick the best-performing libraries, frameworks, and backends; and it helps researchers perform rigorous comparative evaluation of new prototypes. Blueprint was published at SOSP 2023 and its code can be found on GitHub.
Hindsight is a new distributed tracing framework that we have built from the ground up to support edge-case tracing. Hindsight enables detailed end-to-end tracing for rare and outlier requests without the data loss traditionally incurred by sampling-based systems. Hindsight achieves this by combining a short per-node history of telemetry, programmatic symptom detection, and rapid distributed retrieval. Hindsight was published at NSDI 2023 and its code can be found on GitLab.
Clockwork is a distributed DNN serving system designed for predictable end-to-end performance. Clockwork promotes end-to-end performance predictability as a first-class design concern. To achieve this, Clockwork's design eliminates major sources of performance variability and centralizes choices that lead to variability such as scheduling and admission control. The end result is a system design with extremely tight tail latency. Clockwork received the Distinguished Artifact Award at OSDI 2020 and its code can be found on GitLab.
Pivot Tracing is a cross-component monitoring framework for distributed systems. Often the information needed to troubleshoot cross-component problems is relatively minimal, but inaccessible due to a lack of cross-component visibility. Pivot Tracing combines causal metadata propagation with dynamic instrumentation to overcome this limitation. Using Pivot Tracing, a system operator can use a simple SQL-like interface to define and measure arbitrary system metrics, while grouping, filtering, and aggregating those metrics according to arbitrary identifiers from other system components. Pivot Tracing received the Best Paper Award at SOSP 2015 and its code can be found on GitHub.
Software
Blueprint can be found on GitHub at Blueprint-uServices.
All research projects for my group are hosted on MPI's GitLab.
Some older projects can be found on GitHub under JonathanMace, brownsys, and tracingplane.
Current Members
- Matheus Stolet, PhD student (2021-)
- Vaastav Anand, PhD student (2020-)
- Safya Alzayat, PhD student (2019-)
- Thomas Davidson, PhD student (2019-)
- Reyhaneh Karimipour, PhD student (2019-)
- Zhiqiang Xie, Visiting MSc intern (2021)
Former Members
- Arpan Gujarati, Postdoc (2020-2021), now at the University of British Columbia.
- Franco Caspe, Erasmus Masters student (2021), now PhD student at Queen Mary University of London.
Thesis: Efficient DNN Serving: Evaluating the feasibility of FPGAs for multi-tenant model serving
- Nicolas Schäfer, Masters student (2019-2020), now at SAP.
Thesis: Pathfinder: Exploiting Inter-Thread Communication for Request Flow Instrumentation
Saarland Informatics Campus E1 5
66123 Saarbrücken
Germany