# Artifact Evaluation Instructions

This document describes how to evaluate the in-kernel implementation of the laminar APA scheduler presented in the paper:

V. Bonifaci, B. Brandenburg, G. D’Angelo, and A. Marchetti-Spaccamela, “Multiprocessor Real-Time Scheduling with Hierarchical Processor Affinities”, Proceedings of the 28th Euromicro Conference on Real-Time Systems (ECRTS 2016), July 2016.

The proposed scheduling algorithm was implemented and evaluated in LITMUSRT, which in turn is based on Linux.

Note: a tutorial on Linux kernel development and detailed instructions on how to compile and configure a Linux kernel that actually works on a given hardware platform are beyond the scope of this document. Basic familiarity with Linux and with the compilation and installation of custom Linux kernels is assumed on the part of the evaluator.

Note: the paper reports empirical data measured on a particular hardware platform available at the time of writing in the shared research cluster of MPI-SWS. To reproduce the exact numbers reported in the paper, it would be necessary to have access to this particular machine, or one virtually identical to it. The focus of this document is hence how to obtain measurements like those reported in the paper (i.e., how to observe similar trends), and not on reproducing the exact numbers.

The rest of the document covers the following main points:

1. What hardware is required?
2. How to obtain, compile, and install the kernel, tools, and test workloads? This part necessarily assumes basic familiarity with Linux kernel development.
3. How to run experiments?
4. How to process the raw data?

We provide instructions both for running the full experiments, which require access to a 24-core machine, and for running toy experiments, for which a 4-core machine suffices. We hope that running the 4-core toy experiments is sufficient to establish that the provided artifact and the general procedure work.

Note: Even running the 24-core experiments will not reproduce the exact numbers reported in the paper, unless you happen to have an identical hardware platform. The general trends, however, should be observable even on other 24-core Intel platforms.

## Hardware Requirements

To run the full experiments, you need a 24-core Intel platform (or larger) running Linux. To run the toy experiments, a 4-core Intel machine (or larger) running Linux will do. In a pinch, you can also use a virtual machine (with four or more virtual cores) to run the toy experiments. However, in a virtual machine, all collected data will be bogus due to the virtualization overheads.

If you have a system with more than 24 cores, you can tell Linux to use only 24 cores by booting with maxcpus=24 specified in the kernel command line.
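As a quick sanity check before running anything, a sketch like the following tests whether a core count suffices and extracts a `maxcpus` setting from a kernel command line. The helper names `has_enough_cores` and `maxcpus_from_cmdline` are hypothetical and not part of the artifact; `nproc` and `/proc/cmdline` are assumed to be available, as they are on typical Linux systems.

```shell
#!/bin/sh
# Hypothetical helpers (not part of the artifact).

# Succeeds if the AVAILABLE core count (second argument; defaults to
# whatever nproc reports) is at least the REQUIRED count (first argument).
has_enough_cores() {
    required=$1
    available=${2:-$(nproc)}
    [ "$available" -ge "$required" ]
}

# Prints the value of a maxcpus= option found in the given kernel
# command line string; prints nothing if the option is absent.
maxcpus_from_cmdline() {
    for arg in $1; do
        case $arg in
            maxcpus=*) echo "${arg#maxcpus=}" ;;
        esac
    done
}
```

For example, `has_enough_cores 24 || echo "toy experiments only"` checks the running system, and `maxcpus_from_cmdline "$(cat /proc/cmdline)"` shows whether the current boot was restricted.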

On your experimental platform, to approximate our settings, you should disable (in the BIOS) architectural features that cause unpredictability such as hyperthreading and cache prefetching. To reproduce our exact setup, you need a Dell PowerEdge R920 server with four sockets, each containing a 12-core Intel Xeon E7-8857 v2 processor.

## Obtaining, Compiling, and Installing the Software Artifact

To conduct experiments similar to those reported in the paper, you need the following software components:

1. The Linux kernel, version 4.1.3.
2. The LITMUSRT patch, version 2015.1.
3. Our patch that adds the laminar APA scheduler.
4. The liblitmus user-space library, which provides the necessary tools for working with a LITMUSRT kernel. The liblitmus library also comes with rtspin, a tool for simulating CPU-bound, periodic real-time tasks, which is employed as the workload in the experiments.
5. The feather-trace-tools project, a collection of tracing and analysis tools that we used to collect overheads under LITMUSRT.
6. Our patch to feather-trace-tools, which adds support for the trace points used in message-passing-based schedulers (such as the implementation of the proposed laminar APA scheduler).
7. The workloads used to stress the kernel while running experiments.

For convenience, we provide items 2–7 together in a single archive that can be downloaded from here:

When unpacked, you should see the following directory contents:

    drwxr-xr-x   2 bbb  wheel      68 May 11 20:52 data
    drwxr-xr-x  20 bbb  wheel     680 May 11 23:28 feather-trace-tools
    -rw-r--r--   1 bbb  wheel  114145 May 12 01:49 kernel-config-for-virtualbox
    drwxr-xr-x  15 bbb  wheel     510 May 11 20:45 liblitmus
    -rw-r--r--   1 bbb  wheel  600909 May 11 21:04 litmus-rt-with-laminar-apa.patch
    drwxr-xr-x   4 bbb  wheel     136 May 11 22:16 workloads


To set up the software environment, carry out the following step-by-step instructions on your Linux host that you will use for the experiments. (This could be a virtual machine, provided your workstation can comfortably run one with at least four virtual cores.) In the following, we assume that you use /usr/local/litmus as the working directory.

Note: all instructions have been tested on a machine running Ubuntu Linux 14.04 LTS. In principle, you should be able to use just about any Linux distribution; however, with different, especially newer compiler versions comes the risk of compilation failures due to newly added warnings (we compile with -Werror).

Support: we have made every effort to make reproducing our work as painless as possible, but working with kernels does come with some challenges. If at any point you run into problems, please feel free to contact Björn Brandenburg (bbb@mpi-sws.org) for help.

### Step 0: System Setup

Make sure your Linux box is set up for Linux kernel development. This means installing the typical Unix C development toolchain, including gcc, make, etc. Any Linux installation that can compile a vanilla Linux kernel should also be able to compile the provided software artifact.

The working directory /usr/local/litmus can be set up as follows:

    cd /usr/local
    sudo mkdir litmus
    export PATH=/usr/local/litmus/ae-laminar-apa/feather-trace-tools:$PATH

You can check that the path was set up correctly by locating the rtspin and ftcat utilities:

    which rtspin
    # expected output:
    # /usr/local/litmus/ae-laminar-apa/liblitmus/rtspin
    which ftcat
    # expected output:
    # /usr/local/litmus/ae-laminar-apa/feather-trace-tools/ftcat

### Step 10: Run an Experiment

Finally, we can launch an experiment. The archive comes both with all workloads used in the experiments reported in the paper, and with toy experiments that allow trying out the kernel if no 24-core machine is available.

There is a shell script for each experiment that takes care of everything: setting up the experiment, launching rtspin processes with appropriate parameters, starting overhead tracing, and tearing everything down again at the end of the experiments.

The experiment scripts are provided in the folder workloads/ of the provided archive. They are organized by required core count and by scheduler:

- the directory workloads/24/apa contains experiment scripts for the laminar APA scheduler (the new algorithm) for platforms with at least 24 cores;
- the directory workloads/24/gfp contains experiment scripts for the global fixed-priority scheduler (the baseline) for platforms with at least 24 cores;
- the directory workloads/4/apa contains experiment scripts for the laminar APA scheduler (the new algorithm) for platforms with at least 4 cores; and
- the directory workloads/4/gfp contains experiment scripts for the global fixed-priority scheduler (the baseline) for platforms with at least 4 cores.

The file names of the shell scripts reflect the parameters of the workload. For example, the file workloads/4/apa/apa-workload_m=04_n=08_u=85_seq=00.sh is for m=4 processor cores, launches a workload consisting of n=8 tasks, and has a total utilization of 85%. The seq tag is simply a sequence number; there are 10 scripts for each parameter combination.
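The naming scheme described above can be decoded mechanically, which is handy when selecting a subset of workloads to run. The following is a minimal sketch; `parse_workload_name` is a hypothetical helper name, and the pattern assumes that all script names follow the `_m=.._n=.._u=.._seq=..` layout of the example.

```shell
#!/bin/sh
# Hypothetical helper (not part of the artifact): prints the parameters
# encoded in an experiment script name as "m=.. n=.. u=.. seq=..".
# Prints nothing if the name does not follow the assumed layout.
parse_workload_name() {
    name=$(basename "$1" .sh)   # drop directory and .sh extension
    printf '%s\n' "$name" | sed -n \
        's/^.*_m=\([0-9][0-9]*\)_n=\([0-9][0-9]*\)_u=\([0-9][0-9]*\)_seq=\([0-9][0-9]*\)$/m=\1 n=\2 u=\3 seq=\4/p'
}

# Example:
#   parse_workload_name workloads/4/apa/apa-workload_m=04_n=08_u=85_seq=00.sh
# prints:
#   m=04 n=08 u=85 seq=00
```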
To launch an experiment, simply run the corresponding script from the root shell. Each experiment script produces raw overhead sample files in the directory in which it is launched. We therefore first move to the (still empty) data/ directory.

    cd /usr/local/litmus/ae-laminar-apa/data
    ../workloads/4/apa/apa-workload_m=04_n=08_u=85_seq=00.sh

Each experiment runs for 60 seconds (plus a few seconds for setup and teardown). While it is running, it should provide a progress indicator (dots appearing, one per second). When the experiment is done, the shell script will terminate.

For example, this is what it should look like after the experiment has completed:

    root@rts44:/usr/local/litmus# cd /usr/local/litmus/ae-laminar-apa/data
    root@rts44:/usr/local/litmus/ae-laminar-apa/data# ../workloads/4/apa/apa-workload_m=04_n=08_u=85_seq=00.sh
    Running apa-workload_m=04_n=08_u=85_seq=00 under LSA-FP-MP for 60 seconds...
    Setting processor 3 to be the dedicated scheduling core.
    Waiting for 8 tasks to finish launching...
    Launching overhead tracer.
    Waiting for overhead tracer to finish launching...
    Released 8 real-time tasks.
    . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
    All tasks finished.
    Sent SIGUSR1 to stop tracers...
    root@rts44:/usr/local/litmus/ae-laminar-apa/data#

The experiment script generated a number of data files that contain overhead samples that were collected with Feather-Trace, the overhead tracing framework built into LITMUSRT. You should now see (at least) the following files in the data/ directory.

    overheads_host=rts44_scheduler=LSA-FP-MP_trace=apa-workload_m=04_n=08_u=85_seq=00_cpu=0.bin
    overheads_host=rts44_scheduler=LSA-FP-MP_trace=apa-workload_m=04_n=08_u=85_seq=00_cpu=1.bin
    overheads_host=rts44_scheduler=LSA-FP-MP_trace=apa-workload_m=04_n=08_u=85_seq=00_cpu=2.bin
    overheads_host=rts44_scheduler=LSA-FP-MP_trace=apa-workload_m=04_n=08_u=85_seq=00_cpu=3.bin
    overheads_host=rts44_scheduler=LSA-FP-MP_trace=apa-workload_m=04_n=08_u=85_seq=00_msg=0.bin
    overheads_host=rts44_scheduler=LSA-FP-MP_trace=apa-workload_m=04_n=08_u=85_seq=00_msg=1.bin
    overheads_host=rts44_scheduler=LSA-FP-MP_trace=apa-workload_m=04_n=08_u=85_seq=00_msg=2.bin
    overheads_host=rts44_scheduler=LSA-FP-MP_trace=apa-workload_m=04_n=08_u=85_seq=00_msg=3.bin

The script creates two files for each processor; the total number of files created hence depends on the size of the experimental platform.

The above example ran a workload under the new laminar APA scheduler. To get data for the baseline scheduler, run the same experiment again, but this time under the global fixed-priority (GFP) scheduler.

    cd /usr/local/litmus/ae-laminar-apa/data
    ../workloads/4/gfp/apa-workload_m=04_n=08_u=85_seq=00.sh

Again, the output should look something like this:

    root@rts44:/usr/local/litmus/ae-laminar-apa/data# ../workloads/4/gfp/apa-workload_m=04_n=08_u=85_seq=00.sh
    Running apa-workload_m=04_n=08_u=85_seq=00 under G-FP-MP for 60 seconds...
    Setting processor 3 to be the dedicated scheduling core.
    Waiting for 8 tasks to finish launching...
    Launching overhead tracer.
    Waiting for overhead tracer to finish launching...
    Released 8 real-time tasks.
    . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
    All tasks finished.
    Sent SIGUSR1 to stop tracers...

You can now run as many experiments as you wish. Obviously, running all workloads will take many hours.
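Since each experiment is a standalone script, running a whole directory of them can be automated with a small loop. The following is a sketch under the assumption that the scripts are executable and can be run back to back without manual intervention; `run_all_workloads` is a hypothetical helper name, not something shipped in the archive.

```shell
#!/bin/sh
# Hypothetical helper (not part of the artifact): runs every experiment
# script found in the given directory, one after another. The scripts are
# started from the current directory, so the raw sample files end up there.
run_all_workloads() {
    dir=$1
    count=0
    for script in "$dir"/*.sh; do
        [ -f "$script" ] || continue    # the glob matched nothing
        echo "Running $script"
        "$script" || { echo "FAILED: $script" >&2; return 1; }
        count=$((count + 1))
    done
    echo "Completed $count experiment(s)"
}
```

From a root shell, this could be used as `cd /usr/local/litmus/ae-laminar-apa/data && run_all_workloads ../workloads/4/apa`, followed by the same call for `../workloads/4/gfp`. At roughly one minute per script, the run time in minutes is at least the number of scripts in the directory.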
## Data Processing

Once a satisfactory amount of data has been collected, the overhead statistics reported in the paper can be obtained with the tools provided in feather-trace-tools/.

The following steps are based on the LITMUSRT overhead tracing tutorial. The focus here is on documenting how to obtain the desired statistics, not on explaining why each step is necessary or what precisely it does. For a more in-depth explanation, please refer to the LITMUSRT tracing tutorial.

### Step 11: Sort Trace Files

In the first processing step, the raw trace files are cleaned up and prepared for further processing.

    ft-sort-traces overheads_*.bin 2>&1 | tee -a overhead-processing.log

### Step 12: Extract Samples

The files ending with the extension .bin are raw trace files in a kernel-defined binary format. Before extracting meaningful statistics, we need to extract the actual data samples.

    ft-extract-samples overheads_*.bin 2>&1 | tee -a overhead-processing.log

### Step 13: Aggregate Samples

At this point, we have many per-processor, per-task-count, per-utilization, etc. files. We are interested in aggregate overhead values across all tested scenarios. Hence we need to combine the individual sample files.

    ft-combine-samples --std overheads_*.float32 2>&1 | tee -a overhead-processing.log

### Step 14: Count Total Number of Samples

In general, the number of samples that were collected for the two schedulers will differ to some extent. To get the minimum number available of each type, which we require for the next step, we count the number of samples in all files.

    ft-count-samples combined-overheads_*.float32 > counts.csv

### Step 15: Draw a Random Sample

To compare sampled maxima in an unbiased way, we need to use an equal number of samples from each population. Since in all likelihood we recorded a different number of samples for each scheduler, we randomize and truncate the data files.

    ft-select-samples counts.csv combined-overheads_*.float32 2>&1 | tee -a overhead-processing.log

### Step 16: Compute Statistics

Finally, we can compute the statistics reported in the paper.

    ft-compute-stats combined-overheads_*.sf32 > stats.csv

The above command reports overheads in terms of processor cycles. For human consumption, it is more convenient to report overheads in terms of microseconds. To obtain the overheads in microseconds, pass the option --cycles-per-usec to ft-compute-stats. The appropriate value can be obtained from /proc/cpuinfo with the following command:

    grep 'cpu MHz' /proc/cpuinfo | uniq

For example, on our test machine (not the one used for the experiments reported in the paper), the output looks as follows.

    bbb@rts44:/usr/local/litmus$ grep 'cpu MHz' /proc/cpuinfo | uniq
    cpu MHz         : 2200.080

In this particular processor, there are 2200.080 cycles per microsecond. Hence we can compute the statistics in microseconds as follows.

    ft-compute-stats --cycles-per-usec 2200.080 combined-overheads_*.sf32 > stats-us.csv
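For repeated runs, Steps 11 to 16 can be chained in a small wrapper. The sketch below invokes the feather-trace-tools commands exactly as given in the steps above; the function names `cycles_per_usec` and `process_overheads` are hypothetical, and the wrapper assumes it is called from the directory that holds the raw `overheads_*.bin` files. It adds no error handling beyond what the individual commands provide.

```shell
#!/bin/sh
# Hypothetical helpers (not part of the artifact).

# Prints the first "cpu MHz" value found in /proc/cpuinfo-style text on stdin.
cycles_per_usec() {
    awk -F: '/^cpu MHz/ { gsub(/[ \t]/, "", $2); print $2; exit }'
}

# Runs Steps 11-16 in order. If a cycles-per-microsecond value is passed as
# the first argument, statistics in microseconds are written as well.
process_overheads() {
    log=overhead-processing.log
    ft-sort-traces overheads_*.bin 2>&1 | tee -a "$log"                 # Step 11
    ft-extract-samples overheads_*.bin 2>&1 | tee -a "$log"             # Step 12
    ft-combine-samples --std overheads_*.float32 2>&1 | tee -a "$log"   # Step 13
    ft-count-samples combined-overheads_*.float32 > counts.csv          # Step 14
    ft-select-samples counts.csv combined-overheads_*.float32 2>&1 \
        | tee -a "$log"                                                 # Step 15
    ft-compute-stats combined-overheads_*.sf32 > stats.csv              # Step 16
    if [ -n "$1" ]; then
        ft-compute-stats --cycles-per-usec "$1" \
            combined-overheads_*.sf32 > stats-us.csv
    fi
}
```

A typical call would be `process_overheads "$(cycles_per_usec < /proc/cpuinfo)"`. Note that the reported clock frequency can vary between cores or over time when frequency scaling is active, so it is worth inspecting the value before relying on it.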


Having just run the two toy experiments mentioned in Step 10 above, the output looks as follows:

    #    Plugin, #cores,        Overhead,                                 Unit, #tasks, #samples,      max, 99.9th perc., 99th perc., 95th perc.,     avg,     med,     min,     std,     var,                                                                                                           file
    G-FP-MP,     04,  CLIENT-REQUEST, microseconds (scale = 1/2200.080000),      *,    74673, 19.34430,     15.64589,    2.43958,    2.20492, 1.44456, 1.34995, 0.16272, 0.83952, 0.70479,   combined-overheads_host=rts44_scheduler=G-FP-MP_trace=apa-workload_m=04_overhead=CLIENT-REQUEST_LATENCY.sf32
    G-FP-MP,     04,             CXS, microseconds (scale = 1/2200.080000),      *,    74713, 21.48013,      3.50275,    1.59358,    1.51313, 1.16294, 1.19087, 0.20408, 0.44905, 0.20165,                      combined-overheads_host=rts44_scheduler=G-FP-MP_trace=apa-workload_m=04_overhead=CXS.sf32
    G-FP-MP,     04,     DSP-HANDLER, microseconds (scale = 1/2200.080000),      *,    74754, 26.97538,     11.48239,    1.30995,    1.03087, 0.73659, 0.69270, 0.09272, 0.55134, 0.30398,              combined-overheads_host=rts44_scheduler=G-FP-MP_trace=apa-workload_m=04_overhead=DSP-HANDLER.sf32
    G-FP-MP,     04, RELEASE-LATENCY,        microseconds (scale = 1/1000),      *,    31554, 24.16400,     17.20824,    6.06674,    3.61900, 1.76023, 1.43000, 0.63200, 1.40904, 1.98534,          combined-overheads_host=rts44_scheduler=G-FP-MP_trace=apa-workload_m=04_overhead=RELEASE-LATENCY.sf32
    G-FP-MP,     04,         RELEASE, microseconds (scale = 1/2200.080000),      *,    31553, 28.52260,     13.56427,    4.46666,    3.48669, 1.68209, 1.59085, 0.44862, 1.03474, 1.07066,                  combined-overheads_host=rts44_scheduler=G-FP-MP_trace=apa-workload_m=04_overhead=RELEASE.sf32
    G-FP-MP,     04,          SCHED2, microseconds (scale = 1/2200.080000),      *,    74715, 15.17581,      0.54998,    0.32999,    0.23999, 0.17463, 0.22363, 0.09545, 0.11122, 0.01237,                   combined-overheads_host=rts44_scheduler=G-FP-MP_trace=apa-workload_m=04_overhead=SCHED2.sf32
    G-FP-MP,     04,           SCHED, microseconds (scale = 1/2200.080000),      *,    74715, 28.92122,     11.12775,    3.12509,    2.54354, 1.26885, 1.13632, 0.34271, 0.70886, 0.50247,                    combined-overheads_host=rts44_scheduler=G-FP-MP_trace=apa-workload_m=04_overhead=SCHED.sf32
    G-FP-MP,     04,    SEND-RESCHED, microseconds (scale = 1/2200.080000),      *,    73766, 19.14249,     11.68789,    2.05356,    1.87311, 1.46852, 1.60312, 0.63543, 0.63861, 0.40782,             combined-overheads_host=rts44_scheduler=G-FP-MP_trace=apa-workload_m=04_overhead=SEND-RESCHED.sf32
    LSA-FP-MP,     04,  CLIENT-REQUEST, microseconds (scale = 1/2200.080000),      *,    74673, 16.72530,      9.88720,    1.43449,    1.23359, 1.13731, 1.09587, 0.85997, 0.46705, 0.21813, combined-overheads_host=rts44_scheduler=LSA-FP-MP_trace=apa-workload_m=04_overhead=CLIENT-REQUEST_LATENCY.sf32
    LSA-FP-MP,     04,             CXS, microseconds (scale = 1/2200.080000),      *,    74713, 10.75552,      1.30645,    0.82133,    0.75725, 0.52628, 0.64998, 0.17863, 0.22476, 0.05051,                    combined-overheads_host=rts44_scheduler=LSA-FP-MP_trace=apa-workload_m=04_overhead=CXS.sf32
    LSA-FP-MP,     04,     DSP-HANDLER, microseconds (scale = 1/2200.080000),      *,    74754, 31.82248,      5.14525,    3.49442,    3.14307, 2.53362, 2.42264, 0.09409, 0.45715, 0.20899,            combined-overheads_host=rts44_scheduler=LSA-FP-MP_trace=apa-workload_m=04_overhead=DSP-HANDLER.sf32
    LSA-FP-MP,     04, RELEASE-LATENCY,        microseconds (scale = 1/1000),      *,    31554, 18.40800,     11.55656,    2.41288,    1.72400, 0.95378, 0.70900, 0.62700, 0.67612, 0.45712,        combined-overheads_host=rts44_scheduler=LSA-FP-MP_trace=apa-workload_m=04_overhead=RELEASE-LATENCY.sf32
    LSA-FP-MP,     04,         RELEASE, microseconds (scale = 1/2200.080000),      *,    31553, 20.57107,     11.54750,    5.53980,    3.29715, 2.27388, 1.96993, 0.53589, 0.91940, 0.84526,                combined-overheads_host=rts44_scheduler=LSA-FP-MP_trace=apa-workload_m=04_overhead=RELEASE.sf32
    LSA-FP-MP,     04,          SCHED2, microseconds (scale = 1/2200.080000),      *,    74715, 16.82439,      0.30590,    0.22999,    0.17999, 0.12171, 0.10727, 0.09409, 0.07053, 0.00497,                 combined-overheads_host=rts44_scheduler=LSA-FP-MP_trace=apa-workload_m=04_overhead=SCHED2.sf32
    LSA-FP-MP,     04,           SCHED, microseconds (scale = 1/2200.080000),      *,    74715, 17.67618,      2.13031,    1.82130,    1.03133, 0.54628, 0.49816, 0.24726, 0.31187, 0.09726,                  combined-overheads_host=rts44_scheduler=LSA-FP-MP_trace=apa-workload_m=04_overhead=SCHED.sf32
    LSA-FP-MP,     04,    SEND-RESCHED, microseconds (scale = 1/2200.080000),      *,    73766, 17.93435,      9.49478,    1.61676,    0.97996, 0.90923, 0.88133, 0.61407, 0.41412, 0.17149,           combined-overheads_host=rts44_scheduler=LSA-FP-MP_trace=apa-workload_m=04_overhead=SEND-RESCHED.sf32


How to interpret this data: In Figure 1 of the paper, the bars labeled “DSP” correspond directly to the DSP-HANDLER overheads listed in the table. The bars labeled “DISPATCHER” correspond to the sum of the three overheads SCHED (scheduler invocation), SCHED2 (post-context-switch activities), and CXS (context switch). For technical reasons, the latter three overheads are measured separately. The other types of overhead were not reported in the paper.
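To obtain a "DISPATCHER" value from the statistics file, the SCHED, SCHED2, and CXS rows can be added up per scheduler plugin. The following is a minimal sketch; `dispatcher_sum` is a hypothetical helper name, and which statistic should be summed (average, maximum, or another column) is an assumption left to the caller, since it is not specified here. Column numbers refer to the comma-separated fields of the table above (7 = max, 11 = avg).

```shell
#!/bin/sh
# Hypothetical helper (not part of the artifact): reads ft-compute-stats
# CSV output on stdin and prints, for each plugin, the sum of the chosen
# column over the SCHED, SCHED2, and CXS rows. The column defaults to 11.
dispatcher_sum() {
    awk -F, -v col="${1:-11}" '
        /^#/ { next }                          # skip the header line
        {
            plugin = $1; kind = $3
            gsub(/^[ \t]+|[ \t]+$/, "", plugin)   # trim padding
            gsub(/^[ \t]+|[ \t]+$/, "", kind)
            if (kind == "SCHED" || kind == "SCHED2" || kind == "CXS")
                sum[plugin] += $col
        }
        END { for (p in sum) printf "%s %.5f\n", p, sum[p] }
    ' | sort
}
```

For example, `dispatcher_sum 11 < stats-us.csv` sums the averages. Applied to the table above, this gives 2.60642 for G-FP-MP and 1.19427 for LSA-FP-MP. Keep in mind that adding up per-overhead maxima (column 7) yields a pessimistic bound, since the three maxima need not occur in the same scheduling event.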

Final remarks: a close look at the above table reveals that it does not exhibit exactly the same trends as the data reported in the paper. This has multiple reasons.

First, with just two toy experiments, the number of samples is too small to draw any firm conclusions.

Second, with a workload for only four cores, the scalability bottlenecks that play a role in the 24-core configuration do not yet manifest.

Third, the data actually stems from a 44-core test machine, and not a 4-core machine (for which the toy experiments are designed). The global scheduler uses all cores and hence never preempts (due to the small number of tasks in the toy experiments). In contrast, the APA scheduler remains constrained by the specified affinities. Therefore, this is not actually a fair comparison; the maxcpus command line option would be required to ensure a level playing field.

This highlights two important points: (1) it is possible to validate the workflow on just about any Linux host with four or more cores, and (2) to replicate the actual reported numbers, a more or less identical machine is needed to run the experiments, and the experiments need to be run at full scale (i.e., all experiments over many hours, resulting in many gigabytes of data).

That said, when running the provided workloads on a 24-core machine (even if not exactly identical) for a reasonable number of task sets (at least a dozen), generally similar trends (the APA scheduler incurring higher, but still bearable, overheads) should become apparent.