Repairing Sequential Consistency in C/C++11

Ori Lahav
MPI-SWS, Germany ∗
orilahav@mpi-sws.org

Viktor Vafeiadis
MPI-SWS, Germany ∗
viktor@mpi-sws.org

Jeehoon Kang
Seoul National University, Korea
jeehoon.kang@sf.snu.ac.kr

Chung-Kil Hur
Seoul National University, Korea
gil.hur@sf.snu.ac.kr

Derek Dreyer
MPI-SWS, Germany ∗
dreyer@mpi-sws.org

Abstract
The C/C++11 memory model defines the semantics of concurrent memory accesses in C/C++, and in particular supports racy “atomic” accesses at a range of different consistency levels, from very weak consistency (“relaxed”) to strong, sequential consistency (“SC”). Unfortunately, as we observe in this paper, the semantics of SC atomic accesses in C/C++11, as well as in all proposed strengthenings of the semantics, is flawed, in that both suggested compilation schemes to Power are unsound. We propose a better semantics for SC accesses that restores the soundness of the compilation schemes to Power, maintains the DRF-SC guarantee, and provides stronger, more useful, guarantees to SC fences. In addition, we formally prove, for the first time, the correctness of the proposed stronger compilation schemes to Power that preserve load-to-store ordering and avoid “out-of-thin-air” reads.

1. Introduction
The C/C++11 memory model (C11 for short) [7] defines the semantics of concurrent memory accesses in C/C++, of which there are two general types: non-atomic and atomic. Non-atomic accesses are intended for normal data: races on such accesses lead to undefined behavior, thus ensuring that it is sound to subject non-atomic accesses to standard sequential optimizations and reorderings. In contrast, atomic accesses are specifically intended for communication between threads: thus, races on atomics are permitted, but at the cost of imposing restrictions on how such accesses may be merged or reordered during compilation.

The degree to which an atomic access may be reordered with other operations—and more generally, the implementation cost of an atomic access—depends on its consistency level, concerning which C11 offers programmers several options according to their needs. Strongest and most expensive are sequentially consistent (SC) accesses, whose primary purpose is to restore the simple interleaving semantics of sequential consistency [18] if a program (when executed under SC semantics) only has races on SC accesses. This property is called “DRF-SC” and was a main design goal for C11. To ensure DRF-SC, the standard compilation schemes for modern architectures must insert hardware “fence” instructions appropriately into the compiled code, with those for weaker architectures (like Power and ARM) introducing a full (strong) fence adjacent to each SC access.

Weaker than SC atomics are release-acquire accesses, which can be used to perform “message passing” between threads without incurring the implementation cost of a full SC access; and weaker and cheaper still are relaxed accesses, which are compiled down to plain loads and stores at the machine level and which provide only the minimal synchronization guaranteed by the hardware. Finally, the C11 model also supports language-level fence instructions, which provide finer-grained control over where hardware fences are to be placed and serve as a barrier to prevent unwanted compiler optimizations.

In this paper, we are mainly concerned with the semantics of SC atomics (i.e., SC accesses and SC fences), and their interplay with the rest of the model. Since sequential consistency is such a classical, well-understood notion, one might expect that the semantics of SC atomics should be totally straightforward, but sadly, as we shall see, it is not!

The main problem arises in programs that mix SC and non-SC accesses to the same location. Although not common, such mixing is freely permitted by the C11 standard, and has legitimate uses—e.g., as a way of enabling faster (non-SC) reads from an otherwise quite strongly synchronized data structure. Indeed, we know of several examples of code in the wild that mixes SC accesses together with release/acquire or relaxed accesses to the same location: seqLocks [8] and Rust’s crossbeam library [2]. Now consider the following program (see Manerkar et al. [20]):

\[
\begin{align*}
& x := \text{sc} 1 \\
& a := x_{\text{acq}} \#1 \\
& c := y_{\text{sc}} \#0 \\
& d := x_{\text{sc}} \#0 \\
& y := \text{sc} 1 \\
\end{align*}
\]

(IRIW-acq-sc)
Here and in all other programs in this paper, we write $a, b, \ldots$ for local variables (registers), and assume that all variables are initialized to 0. The program contains two variables, $x$ and $y$, which are accessed via SC atomic accesses and also read by acquire-atomic accesses. The annotated behavior (reading $a = b = 1$ and $c = d = 0$) corresponds to the two threads observing the writes to $x$ and $y$ as occurring in different orders, and is forbidden by C11. (We defer the explanation of how C11 forbids this behavior to §2.)

Let’s now consider how this program is compiled to Power. Two compilation schemes have been proposed [6]: the first scheme, the one implemented in the GCC and LLVM compilers, inserts a sync fence before each SC access (“leading sync” convention), whereas an alternative scheme inserts a sync fence after each SC access (“trailing sync” convention). The intent of both schemes is to have a strong barrier between every pair of SC accesses, and thereby enforce sequential consistency on programs containing only SC accesses. Nevertheless, by mixing SC and release-acquire accesses, one can quickly get into trouble, as illustrated by IRIW-acq-sc.

In particular, if one compiles the program into Power using the trailing sync convention, then the behavior is allowed. Since all SC accesses are at the end of the threads, the trailing sync fences have no effect, and the example reduces to IRIW with only acquire reads, which is allowed by the Power memory model. In §2.1, we show further examples illustrating that the other, leading-sync scheme also leads to behaviors in the target of compilation to Power that are not permitted in the source.

Although previous work has found multiple problems in the C11 model (e.g., out-of-thin-air problem [28, 10], the lack of monotonicity [27]), none of them until now affected the correctness of compilation to the mainstream architectures. In contrast, the IRIW-acq-sc program and our examples in §2.1 show that both the suggested compilation schemes to Power are unsound with respect to the C11 model, thereby contradicting the published results of [6, 24]. Consequently, the strengthened model of Batty et al. [4] is also not efficiently implementable on Power.

In the remainder of the paper, we propose a way to repair the semantics of SC accesses that resolves the problems mentioned above. In particular, our corrected semantics restores the soundness of the suggested compilation schemes to Power. Moreover, it still satisfies the standard DRF-SC theorem: if a program’s sequentially consistent executions only ever exhibit races on SC atomic accesses, then its semantics under full C11 is also sequentially consistent. It is worth noting that our correction only affects the semantics of programs mixing SC and non-SC accesses to the same location: we show that, without such mixing, our correction coincides with the strengthened model of Batty et al. [4].

We also apply two additional, orthogonal, corrections to the C11 model, which strengthen the semantics of SC fences. The first fix corrects a problem already noted before

\[ a = b = 1 \land c = d = 0 \]

[24, 19, 15], namely that the current semantics of SC fences does not recover sequential consistency, even when SC fences are placed between every two commands in programs with only release/acquire atomic accesses. The second fix provides stronger “cumulativity” guarantees for programs with SC fences. We justify these strengthenings by proving that the compilation schemes for TSO, Power, and ARMv7 remain sound with the stronger semantics.

Finally, we apply another, mostly orthogonal, correction to the C11 model, in order to address the well-known “out-of-thin-air” problem. The problem is that the C11 standard permits certain executions as a result of causality cycles, which break even basic invariant-based reasoning [10]. The correction, which is simple to state formally, is to strengthen the model to enforce load-to-store ordering for atomic accesses, thereby ruling out such causality cycles, at the expense of requiring a less efficient compilation scheme for relaxed accesses. The idea of this correction is not novel—it has been extensively discussed in the literature [28, 10, 27]—but the suggested compilation schemes to Power and ARMv7 have not yet been proven sound. Here, we give the first proof that one of these compilation schemes—the one that places a fake control dependency after every relaxed read—is sound. The proof is surprisingly delicate, and involves a novel argument similar to that in DRF-SC proofs.

Putting all these corrections together, we propose a new model called RC11 (for Repaired C11) that supports nearly all features of the C11 model except consume atomics (§3). We prove correctness of compilation to TSO (§4), Power (§5), and ARMv7 (§6), the soundness of a wide collection of program transformations (§7) and a DRF-SC theorem (§8).

\[ k : W_{ac}(x, 1) \xrightarrow{rf} l : R^{ac}(x, 1) \xrightarrow{rf} n : R^{ac}(y, 1) \xleftarrow{rf} p : W_{ac}(y, 1) \]

\[ m : W_{ac}(y, 0) \xrightarrow{rf} o : W_{ac}(x, 0) \]

Figure 1. An execution of IRIW-acq-sc yielding the result $a = b = 1 \land c = d = 0$.

2. The Semantics of SC Atomics in C11: What’s Wrong, and How Can We Fix It?

The C11 memory model defines the semantics of a program as a set of consistent executions. Each execution is a graph. Its nodes, $E$, are called events and represent the individual memory accesses and fences of the program, while its edges represent various relations among these events:

- The sequenced-before $(ab)$ relation, a.k.a. program order, captures the order of events in the program’s control flow.
- The modification order $(mo)$ is a union of total orders, one for each memory address, totally ordering the writes to that address. Intuitively, it records for each memory
address the globally agreed-upon order in which writes to that address happened.

- Finally, the reads-from (rf) relation associates each write with the set of reads that read from that write. In a consistent execution, the reads-from relation should be functional (and total) in the second argument: a read must read from exactly one write.

As an example, in Fig. 1, we depict an execution of the IRIW-acq-sc program discussed in the introduction. In addition to the events corresponding to the accesses appearing in the program, the execution contains two events for the implicit initialization writes to \( x \) and \( y \), which are assumed to be \( \textsf{sb} \)-before all other events.

**Notation 1.** We write \( \textsf{R}, \textsf{W}, \textsf{F}, \textsf{RMW} \) for the set of read, write, fence, and RMW events in \( E \). (RMW events are “read-modify-write” events, induced by atomic update operations like fetch-and-add and compare-and-swap.) We also write \( E^\text{sc} \) for the set of all SC events in \( E \).

Given a binary relation \( R \), we write \( R^1, R^+, \) and \( R^* \) respectively to denote its reflexive, transitive, and reflexive-transitive closures. The inverse relation is denoted by \( R^{-1} \). We denote by \( R_1; R_2 \) the left composition of two relations \( R_1, R_2 \), and assume that \( ; \) binds tighter than \( \cup \) and \( \setminus \). Finally, we denote by \( [A] \) the identity relation on a set \( A \). In particular, \( [A]; R; [B] = R \cap (A \times B) \).

Based on these three basic relations, let us define some derived relations. First, whenever an acquire or SC read reads from a release or SC write, we say that the write synchronizes with (sw) the read.\(^1\) Next, we say that one event happens before (hb) another event if they are connected by a sequence of \( \textsf{sb} \) or \( \textsf{sw} \) edges. Formally, \( \textsf{hb} \) is taken to be \((\textsf{sb} \cup \textsf{sw})^+\). For example, in Fig. 1, event \( k \) synchronizes with \( l \) and therefore \( k \) happens-before \( l \) and \( m \). Lastly, whenever a read event \( e \) reads from a write that is \( \textsf{mo} \)-before another write \( f \) to the same location, we say that \( e \) reads before (rb) \( f \).

Formally, \( \textsf{rb} = (\textsf{rf}^{-1}; \textsf{mo}) \setminus [E] \). (The “\( \setminus [E] \)” part is needed so that RMW events do not read before themselves.)

Consistent C11 executions require that \( \textsf{hb} \) is irreflexive (i.e., \( \textsf{sb} \cup \textsf{sw} \) is acyclic), and further guarantee coherence (aka SC-per-location). Roughly speaking, coherence ensures that (i) the order of writes to the same location according to \( \textsf{mo} \) does not contradict \( \textsf{hb} \) (COHERENCE-WW); (ii) reads do not read values written in the future (NO-FUTURE-READ and COHERENCE-RW); (iii) reads do not read overwritten values (COHERENCE-WR); and (iv) two \( \textsf{hb} \)-related reads from the same location cannot read from two writes in reversed \( \textsf{mo} \)-order (COHERENCE-RR). We refer the reader to Prop. 1 in §3 for a precise formal definition of coherence.

Now, to give semantics to SC atomics, C11 stipulates that in consistent executions, there should be a strict total order, \( S \), over all SC events, intuitively corresponding to the order in which SC events are executed. This order is required to satisfy a number of conditions (but see Remark 1 below):

**S1** \( S \) must include \( \textsf{hb} \) restricted to SC events (formally: \( E^\text{sc} \); \( \textsf{hb}; [E^\text{sc}] \subseteq S \));

**S2** \( S \) must include \( \textsf{mo} \) restricted to SC events (formally: \( E^\text{sc} \); \( \textsf{mo}; [E^\text{sc}] \subseteq S \));

**S3** \( S \) must include \( \textsf{rb} \) restricted to SC events (formally: \( E^\text{sc} \); \( \textsf{rb}; [E^\text{sc}] \subseteq S \));

**S4-7** \( S \) must obey a few more conditions having to do with SC fences.

**Remark 1.** The S3 condition above, due to Batty et al. [4], is slightly simpler and stronger than the one imposed by the official C11. Crucially, however, all the problems and counterexamples we observe in this section, concerning the C11 semantics of SC atomics, hold for both Batty et al.’s model and the original C11. The reason we use Batty et al.’s version here is that it provides a cleaner starting point for our discussion, and our solution to the problems with C11’s SC semantics will build on it.

Intuitively, the effect of the above conditions is to enforce that, since \( S \) corresponds to the order in which SC events are executed, it should agree with the other global orders of events: \( \textsf{hb}, \textsf{mo}, \) and \( \textsf{rb} \). However, as we will see shortly, condition (S1) is too strong. Before we get there, let us first look at a few examples to illustrate how the conditions on \( S \) interact to enforce sequential consistency.

Consider the classic “store buffering” litmus test:

\[
\begin{align*}
x & := \text{sc} 1 \\
y & := \text{sc} 1 \\
a & := y_{\text{sc}} # 0 \quad b & := x_{\text{sc}} # 0
\end{align*}
\]

(SB)

Here, the annotated behavior is forbidden by C11. Without loss of generality, assume \( x := \text{sc} 1 \) is before \( y := \text{sc} 1 \) in \( S \). By condition S1, since \( y := \text{sc} 1 \) is sequenced before \( b := x_{\text{sc}} \), it is also before \( b := x_{\text{sc}} \) in \( S \). Thus, by transitivity of \( S \), \( x := \text{sc} 1 \) is before \( b := x_{\text{sc}} \) in \( S \). Finally, if the read of \( x \) returns 0, then that means it reads-before \( x := \text{sc} 1 \), which by condition S3 means that \( S \) must order \( b := x_{\text{sc}} \) before \( x := \text{sc} 1 \). This would entail a cycle in \( S \), which is impossible.

Similarly, C11’s conditions guarantee that the following (variant given in [29] of the) 2+2W litmus test disallows the annotated weak behavior:

\[
\begin{align*}
x & := \text{sc} 1 \\
y & := \text{sc} 2 \\
a & := y_{\text{sc}} # 1 \quad b & := x_{\text{sc}} # 1
\end{align*}
\]

(2+2W)

Because of coherence condition (iii), \( \textsf{mo} \)—and thus, by S2, also \( S \)—must order \( x := \text{sc} 2 \) before \( x := \text{sc} 1 \), and \( y := \text{sc} 2 \) before \( y := \text{sc} 1 \); otherwise, the reads at the end of each thread would be reading overwritten values. At the same time, \( x := \text{sc} 1 \) is sequenced before \( y := \text{sc} 2 \), and \( y := \text{sc} 1 \) before \( x := \text{sc} 2 \), and thus by S1, those orderings must be included in \( S \) as well. Together, these induce an illegal cycle in \( S \).

---

\(^1\)The actual definition of \( \textsf{sw} \) contains further cases, which are not relevant for the current discussion. These are included in our formal model in §3.
We will show that the behavior is disallowed according to
As we saw in the introduction, the IRIW-acq-sc example
provide coherence, hardware memory models provide rather
strong ordering guarantees on accesses to the same memory
location. Consequently, for conditions S2 and S3, which only
enforce orderings between accesses to the same location,
ensuring that compilation preserves these conditions is not
difficult, even for weaker architectures like Power and ARM.

When, however, it comes to ensuring a strong ordering
between accesses of different memory locations, as S1 does,
compiling to weaker hardware requires the insertion of appro-
propriate memory fence instructions. In particular, for Power,
to enforce a strong ordering between two hb-related accesses,
there should be a Power sync fence occurring somewhere in
the hb-path (the sequence of sb and sw edges) connecting
the two accesses. Unfortunately, in the presence of mixed
SC and non-SC accesses, the Power compilation schemes do
not always ensure that a sync exists between hb-related SC
accesses. Specifically, if we follow the trailing sync conven-
tion, the hb-path (in Fig. 1) from k to m starting with an sw
edge avoids the sync fence placed after k. Conversely, if we
follow the leading sync convention, the hb-path (in Fig. 2)
from k to m ending with an sw edge avoids the fence placed
before m. The result is that S1 enforces more ordering than
the hardware provides!

So, if requiring that hb (on SC events) be included in S
is too strong a condition, what should we require instead?
The essential insight is that, according to either compilation
scheme, we know that a fence will necessarily exist between
SC accesses a and b if the hb path from a to b starts and ends
with an sb edge. Secondary, note that if a is a write and b
is a read that reads directly from a, then the hardware will
preserve the ordering anyway. These two observations lead
us to replace condition S1 with the following:

\[(S1_{fix})\quad S \text{ must include } hb, \text{ restricted to SC events, and further}
\text{ restricted to } hb\text{-paths that either start and end with a } sb\text{, or consist of a single } rf\text{ edge}
\text{(formally: } [E^{sc}]; (sb } \cup \text{ sb; hb; sb } \cup \text{ rf)}; [E^{sc}] \subseteq S)\]

We note that condition S1fix, although weaker than S1,
suffices to rule out the weak behaviors of the basic litmus tests
(i.e., SB and 2+2W). In fact, just to rule out these behaviors,
it suffices to require sb (on SC events) to be included in S.

Further, we note that in the absence of mixing of SC and
non-SC accesses to the same location, every hb-path between
two SC accesses that does not go through another SC access
is either a direct rf-edge or has to start and end with an sb
edge, in which case the two conditions coincide.

**Fixing the Model** Before formalizing our fix, let us first
rephrase conditions S1–S3 in the more concise style sug-
uggested by Batty et al. [4]. Instead of expressing them as sepa-
rate conditions on a total order S, Batty et al. require a single
acyclicity condition, namely that \([E^{sc}]; (hb } \cup \text{ mo } \cup \text{ rb)}; [E^{sc}]\]
be acyclic. (In general, acyclicity of \(\bigcup R_i \) is equivalent to the
existence of a total order S that contains \(R_1, R_2, \ldots \))

We propose to correct the condition by replacing hb with
\(sb \cup sb; hb; sb \cup rf, \text{ thus requiring the relation}\)

\[psc_1 \triangleq [E^{sc}]; (sb \cup sb; hb; sb \cup rf \cup mo \cup rb); [E^{sc}]\]
to be acyclic instead. Our weaker condition suffices to provide the most basic usefulness criterion for SC accesses, namely the DRF-SC theorem. It turns out that even the acyclicity of \([E^c] \cup \{sb \cup rf \cup mo \cup rb\}; E^c\) suffices to prove DRF-SC. This should not be very surprising: in the extreme case when all accesses are SC, this corresponds exactly to the declarative definition of sequential consistency [25].

2.2 Second Problem: SC Fences are Too Weak

We move on to a second problem, this time involving SC fences. Denote by \(F^c\) the set of SC fences in \(E\). The condition of Batty et al. [5] for the full model is that

\[
psc_{Batty} \triangleq \left( [E^c] \cup [F^c] ; sb \right) \cup \left( \left( E^c \cup F^c \right) ; \left[ sb \cup F^c \right] \right)
\]

is acyclic. This condition generalizes the earlier condition by forbidding \(\left( \left( E^c \cup F^c \right) ; \left[ sb \cup F^c \right] \right)\)-cycles even between non-SC accesses provided they are preceded/followed by an SC fence. This condition rules out weak behaviors of examples such as \(SB\) and \(2+2W\) where all accesses are relaxed and SC fences are placed between them in all threads.

In general, one might expect that inserting an SC fence between every two instructions restores sequential consistency. This holds for hardware memory models, such as TSO, Power, and ARM, for programs with aligned word-sized accesses (for their analogue of SC fences), but does not hold neither in the original C11 model nor in its strengthening [5] for two reasons. The first reason is that C11 declares that programs with racy non-atomic accesses have undefined behavior, and even if fences are placed everywhere such races may exist. There is, however, another way in which putting fences everywhere in C11 does not restore sequential consistency, even if all the accesses are atomic. Consider the following program:

\[
x := rlx 1 \quad \begin{align*}
& Pence \quad \begin{align*}
& a := rlx 1 \\
& Pence \quad \begin{align*}
& b := y_{rlx} / 0 \\
& c := y_{rlx} / 0
\end{align*}
\end{align*}
\]

The annotated behavior is allowed according to the model of Batty et al. [5]. Fig. 3 depicts a consistent execution yielding this behavior, as the only \(psc_{Batty}\)-edge is from \(f_1\) to \(f_2\). Yet, this behavior is disallowed by all implementations of C11. We believe that this is a serious omission of the standard rendering the SC fences too weak, as they cannot be used to enforce sequential consistency. This weakness has also been observed in an C11 implementation of the Chase-Lev deque by Lê et al. [19], who report that the weak semantics of SC fences in C11 requires them to unnecessarily strengthen the access modes of certain relaxed writes to SC. (In the context of the \(RWC+syncs\), it would amount to making the write to \(x\) in the first thread into an SC write.)

Remark 2 (Itanium). A possible justification for this weakness of the standard is that there was a fear that the implementation of fences on Itanium does not guarantee the property of restoring sequential consistency when fences are inserted everywhere. This fear is unfounded for two independent reasons. First, all atomic accesses are compiled to release/acquire Itanium accesses on which Itanium fences guarantee ordering. Second, even if this were not the case, Itanium implementations provide multi-copy atomicity, and thus cannot yield the weak outcome of IRIW even without fences [13, §3.3.7.1]. Nevertheless, the whole discussion is probably not very relevant any more, as Itanium is becoming obsolete.

Fixing the Semantics of SC Fences

Analyzing the execution of \(RWC+syncs\), we note that there is a \(sb; rb; rf\) path from \(f_2\) to \(f_1\), but this path does not contribute to \(psc_{Batty}\). Although both \(rb\) and \(rf\) edges contribute to \(psc\), their composition \(rb; rf\) does not.

To repair the model, we define the extended coherence order, \(eco\), to include the read-from relation, \(rf\), the modification order, \(mo\), the reads-before relation, \(rb\), and also all the compositions of these relations with one another—namely, all orders forced because of the coherence axioms. Our condition then becomes that

\[
psc_2 \triangleq \left( [E^c] \cup [F^c] ; sb \right) \cup \left( \left( E^c \cup F^c \right) ; \left[ sb \cup F^c \right] \cup \left[ E^c \cup F^c \right]\right)
\]

is acyclic, where \(eco \triangleq (rf \cup mo \cup rb)\). This stronger condition rules out the weak behavior of \(RWC+syncs\) because there are \(sb; rb; rf\) paths from one fence to another and vice versa (in one direction via the \(x\) accesses and in the other direction via the \(y\) accesses).

Intuitively speaking, compilation to Power remains correct with this stronger model, since \(eco\) exists only between accesses to the same location, on which Power provides strong ordering guarantees.

Now it is easy to see that, given a program without non-atomic accesses, placing an SC fence between every two accesses guarantees SC. By the definition of SC, it suffices to show that \(eco \cup sb\) is acyclic. Consider a \(eco \cup sb\) cycle. Since \(eco\) and \(sb\) are irreflexive and transitive, the cycle necessarily has the form \((eco; sb)\). Between every two \(eco\) steps, there must be an SC fence. So in effect, we have a cycle in \((eco; sb; [F^c]; sb)\), which can be regrouped to a cycle in \(([F^c]; sb; eco; sb; [F^c])\), which is forbidden by our model.
2.3 Adjustments of the Model

Next, we describe two modifications of our condition above that we employ in our model.

Restoring Fence Cumulativity Consider the following variant of the store buffering program, where the write of \( x := 1 \) has been moved to another thread with a release-acquire synchronization.

\[
\begin{align*}
\text{if} & \quad \text{fence}_{sc} \quad \text{fence}_{sc} \quad \text{fence}_{sc} \quad \text{fence}_{sc} \\
x & := \text{rlx} 1 \\
y & := \text{rel} 1 \\
z & := \text{rel} 1 \\
\end{align*}
\]

(W+RWC)

The annotated behavior corresponds to the writes of \( x \) and \( y \) being observed in different orders by the reads, although SC fences having been used in the observer threads. This behavior is disallowed on TSO, Power and ARM because their fences are cumulative: the fences order not only the writes performed by the thread with the fence instruction, but also the writes of other threads that are observed by the thread in question [21].

In contrast, the behavior is allowed by the model described thus far. Consider the execution shown in Fig. 4. While there is a \( \text{sb} \); \( \text{rb} \); \( \text{sb} \) path from \( f \) to \( f' \), the only path from \( f' \) back to \( f \) is \( \text{sb} \); \( \text{rb} \); \( \text{sb} \); \( \text{sw} \); \( \text{sb} \), and so the execution is allowed.

To disallow such behaviors, we can strengthen the SC condition and require that

\[
\text{psc}_3 \triangleq \left( [E_{sc}] \cup [F_{sc}], \text{hb} \right) \\
\left( \text{sb} \cup \text{sb}; \text{hb}; \text{sb} \cup \text{eco} \right) \\
\left( [E_{sc}] \cup [F_{sc}] \right)
\]

is acyclic, thereby ruling out the cycle in the execution in Fig. 4. (Note that to rule out only the cycle shown in Fig. 4, it would suffice to have replaced only the \( \text{sb} \) to a fence by an \( \text{hb} \). We can, however, also construct examples, where it is useful for the \( \text{sb} \) from a fence to be replaced by \( \text{hb} \). We thus strengthen them both.)

Elimination of SC Accesses Finally, we observe that our condition disallows the elimination of an SC write immediately followed by another SC write to the same location, as well as of an SC read immediately preceded by an SC read from the same location. While no existing compiler performs these eliminations, these are sound under sequential consistency, and one may wish to preserve their soundness under weak memory.

To see the unsoundness of eliminating an overwritten SC write, consider the following program. The annotated behavior is forbidden, but it will become allowed after eliminating \( x := \text{sc} 1 \).

\[
\begin{align*}
a := x_{sc} & \quad \text{(LB+deps)} \\
b := y_{sc} & \quad \text{(WWmerge)} \\
y := x_{rel} & \quad \text{(RRmerge)} \\
c := x_{sc} & \\
d := y_{sc} &
\end{align*}
\]

Similarly, the unsoundness of eliminating a repeated SC read is witnessed by the following program. Again, the annotated behavior is forbidden, but it will become allowed after replacing \( b := x_{sc} \) by \( b := a \).

\[
\begin{align*}
y := x_{sc} & \quad \text{(LB+deps)} \\
\end{align*}
\]

The problem here is that these transformations remove an \( \text{sb} \)-edge, and thus removes an \( \text{sb} \); \( \text{hb} \) path between two SC accesses.\(^2\) Note that the removed \( \text{sb} \)-edges are all edges between same location accesses. Thus, supporting these transformations can be achieved by a slight weakening of our model. The solution is to replace \( \text{sb} \); \( \text{hb} \); \( \text{sb} \) in \( \text{psc}_3 \) by \( \text{sb} \not\in \text{loc} \); \( \text{hb} \); \( \text{sb} \not\in \text{loc} \) where \( \text{sb} \not\in \text{loc} \) denotes \( \text{sb} \)-edges between accesses to different locations. Our final condition becomes acyclicity of the following relation:

\[
\text{psc} \triangleq \left( [E_{sc}] \cup [F_{sc}] ; \text{hb} \right) \\
\left( \text{sb} \cup \text{eco} \cup \text{sb} \not\in \text{loc} ; \text{hb} ; \text{sb} \not\in \text{loc} \right) \\
\left( [E_{sc}] \cup [F_{sc}] \right)
\]

We note that this change does not affect programs that do not mix SC and non-SC accesses to the same location.

2.4 A Final Problem: Out-of-Thin-Air Reads

The C11 memory model suffers from a major problem, known as the “out-of-thin-air problem” [28, 10]. Designed to allow efficient compilation and many optimization opportunities for relaxed accesses, the model happened to be too weak, admitting “thin-air” behaviors, which no implementation exhibits. The standard example is load buffering with some form of dependencies:

\[
\begin{align*}
a := x_{rlx} & \quad \text{(LB+deps)} \\
\end{align*}
\]

In this program, the formalized C11 model by Batty et al. [7] allows reading \( a = b = 1 \) even though the value does not appear in the program. The reason is that the execution where both threads read and write the value 1 is consistent: each read reads from the write of the other thread. As one might expect, such behaviors are very problematic because they invalidate almost all forms of formal reasoning about the programs. In particular, the example above demonstrates a violation of DRF-SC, the most basic guarantee that users of C11 were intended to assume: \( \text{LB+deps} \) has no races under sequential consistency, and yet has some non-SC behavior.

Fixing the model in a way that forbids all out-of-thin-air behaviors and still allows the most efficient compilation is
This is what makes the compilation correctness proof highly non-trivial, as the Power model allows certain cycles involving plain loads and stores. As a result, one has to rely on the “catch-fire” semantics (races on non-atomic accesses result in undefined behavior) for explaining behaviors that involve such cycles. A similar argument is needed for proving the correctness of non-atomic read-write reordering.

3. The Proposed Memory Model

In this section, we formally define our proposed corrected version of the C11 model, which we call RC11. Similar to C11, the RC11 model is given in a “declarative” style in three steps: we associate a set of graphs (called executions) to every program (§3.1), filter this set by imposing a consistency predicate (§3.2), and finally define the outcomes of a program based on the set of its consistent executions (§3.3). At the end of the section, we compare our model with C11 (§3.4).

Before we start, we introduce some further notation. Given a binary relation $R$, $dom(R)$ and $codom(R)$ denote its domain and codomain. Given a function $f$, $\mathcal{F}$ denotes the set of $f$-equivalent pairs ($\mathcal{F} \triangleq \{ (a, b) \mid f(a) = f(b) \}$), and $R|_f$ denotes the restriction of $R$ to $f$-equivalent pairs ($R|_f \triangleq R \cap \mathcal{F}$). When $R$ is a strict partial order, $R|_{\mathcal{F}_i}$ denotes the set of all immediate $R$-edges, i.e., pairs $\langle a, b \rangle \in R$ such that for every $c$, $\langle c, b \rangle \in R$ implies $\langle c, a \rangle \in R^+$, and $\langle a, c \rangle \in R$ implies $\langle b, c \rangle \in R^+$.

We assume finite sets $\text{Loc}$ and $\text{Val}$ of locations and values. We use $x, y, z$ as metavariables for locations and $v$ for values.
each \( c_i \) is a sequential program. Then, the set of executions associated with a given program is defined by induction over the structure of sequential programs. We do not define formally this construction (it depends on the particular syntax and features of the source programming language). In this initial stage the read values are not restricted whatsoever (and \( rf \) and \( mo \) are arbitrary). Note that the set of executions of a program \( P \) is taken to be prefix-closed: an \( sb \)-prefix of an execution of \( P \) (which includes at least the initialization events) is also considered to be an execution of \( P \). By full executions of \( P \), we refer to executions of \( P \) that represent traces generated by the whole program \( P \).

We show an example of an execution in Fig. 5. This is a full execution of the Z6.U program, and is essentially the same as the C11 execution shown in Fig. 2, except that for convenience our executions represent RMWs differently from C11 executions. Here each RMW is represented as two events, a read and a write, related by the \( rmw \) relation, whereas in C11 they are represented by single RMW events, which act as both the read and the write of the RMW. Our choice is in line with the Power and ARM memory models, and simplifies a bit the formal development (e.g., the definition of receptiveness).

### 3.2 Consistent Executions

The main part of the memory model is filtering the consistent executions among all executions of the program. The first obvious restriction is that every read should read some written value (formally, \( R \subseteq codom(rf) \)). We refer to such executions as complete. Next, the model defines several constraints with the help of a number of derived relations.

\[
rs \triangleq \{w; sb; |loc\}; [w|\rightarrow|lin]; (rf; rmw)^+ \quad \text{(release sequence)}
\]

\[
sw \triangleq \{s; rs; sf; [F; sb]; [E|\rightarrow|lin]; (sb; |loc); [E|\rightarrow|acq]\} \quad \text{(synchronizes with)}
\]

\[
hb \triangleq (sb \cup sw)^+ \quad \text{(happens-before)}
\]

An important derived relation is the happens-before order \( (hb) \), which intuitively records when an event is globally perceived as occurring before another one. Happens-before is defined in terms of two more basic definitions. First, following [27], the release sequence \( (rs) \) of a write contains the write itself and all later writes to the same location in the same thread, as well as all RMWs that recursively read from such writes. Next, a release event \( a \) synchronizes with \( (sw) \) an acquire event \( b \), whenever \( b \) (or, in case \( b \) is a fence, some \( sb \)-prior read) reads from the release sequence of \( a \) (or in case \( a \) is a fence, of some \( sb \)-later write). Finally, we say that an event \( a \) happens-before another event \( b \) if there is a path from \( a \) to \( b \) consisting of \( sb \) and \( sw \) edges.

Next, we define the extended coherence order, \( eco \), to be \((rf \cup mo \cup rb)^+\), where \( rb \) is the reads-before order. Since the modification order, \( mo \), is transitive, the definition of \( eco \) can be simplified as follows:

\[
rb \triangleq rf^{-1}; mo \quad \text{(reads-before or from-read)}
\]

\[
eco \triangleq rf \cup (mo \cup rb); rf^+ \quad \text{(extended coherence order)}
\]

We remark that \( eco \) is a strict order. Finally, we define the partial SC relation, \( psc \), as follows.

\[
psc \triangleq \left( [E \cup F; sb]; sb \cup eco \cup [E \cup F; sb]_{\not\emptyset}; sb; \not\emptyset; sb; \not\emptyset \right) ; \left( [E \cup F; sb]; sb; \not\emptyset; sb \right)
\]

where \( sb; \not\emptyset \triangleq [RW]; sb; [RW] \setminus sb; \not\emptyset \).

Using these derived relations, RC11-consistency imposes four constraints on executions:

**Definition 1.** An execution \( G \) is called RC11-consistent if it is complete and the following hold:

- \( hb; eco \) is irreflexive. \( \quad \text{(COHERENCE)} \)
- \( rmw \cap (rb; mo) = \emptyset. \) \( \quad \text{(ATOMICITY)} \)
- \( psc \) is acyclic. \( \quad \text{(SC)} \)
- \( sb \cup rf \) is acyclic. \( \quad \text{(NO-THIN-AIR)} \)

**COHERENCE** ensures that programs with only one shared location are sequentially consistent, as at least two locations are needed for a cycle in \( sb \cup eco \). **ATOMICITY** ensures that the read and the write comprising a RMW are adjacent in \( eco \): there is no write event in between. The **SC** condition is the main novelty of RC11 and will be used to show the DRF-SC theorem and other criteria for ensuring sequential consistency (see §8). Finally, **NO-THIN-AIR** rules out thin-air behaviors, albeit at a performance cost, as we will see in §5.

### 3.3 Program Outcomes

Finally, in order to allow the compilation of non-atomic read and writes to plain machine load and store instructions (as well as the compiler to reorder such accesses), RC11 follows the “catch-fire” approach: races on non-atomic accesses result in undefined behavior, that is, any outcome is allowed. Formally, it is defined as follows.

**Definition 2.** Two events \( a \) and \( b \) are called conflicting in an execution \( G \) if \( a, b \in E, W \in \{\text{typ}(a), \text{typ}(b)\}, a \neq b, \) and \( loc(a) = loc(b) \). A pair \((a, b)\) is called a race in \( G \) (denoted \( (a, b) \in \text{race} \)) if \( a \) and \( b \) are conflicting events in \( G \), and \( (a, b) \not\in hb \cup hb^{-1}. \)

**Definition 3.** An execution \( G \) is called racy if there is some \( (a, b) \in \text{race} \) with \( na \in \{\text{mod}(a), \text{mod}(b)\} \). A program \( P \) has undefined behavior under RC11 if it has some racy RC11-consistent execution.
3.4 Comparison with C11

Besides the new SC and NO-THIN-AIR conditions, RC11 differs in a few other ways from C11.

- It does not support consume accesses, a premise feature of C11 that is not implemented by major compilers, nor locks, as they can be straightforwardly implemented with release-acquire accesses.
- For simplicity, it assumes all locations are initialized.
- It incorporates the fixes proposed by Vafeiadis et al. [27], namely (i) the strengthening of the release sequences definition, (ii) the removal of restrictions about different threads in the definition of synchronization, and (iii) the lack of distinction between atomic and non-atomic locations (and accordingly omitting the problematic rf ⊆ hb condition for non-atomic locations). The third fix avoids out-of-thin-air problems that arise when performing non-atomic accesses to atomic location [5, §5].
- It does not consider “unsequenced races” between atomic accesses to have undefined behavior. (Such gratuitous undefined behavior is not needed for any of our results.)

We have also made three presentational changes: (1) we have a much more concise axiomatization of coherence; (2) we model RMWs using two events; and (3) we do not have a total order over SC atoms.

Proposition 1. RC11’s COHERENCE condition is equivalent to the conjunction of the following constraints of C11:

- hb is irreflexive. (IRREFLEXIVE-HB)
- rf; hb is irreflexive. (NO-FUTURE-READ)
- mo; rf; hb is irreflexive. (COHERENCE-RW)
- mo; hb is irreflexive. (COHERENCE-WW)
- mo; hb; rf⁻¹ is irreflexive. (COHERENCE-WR)
- mo; rf; hb; rf⁻¹ is irreflexive. (COHERENCE-RR)

Proposition 2. The SC condition is equivalent to requiring the existence of a total strict order S on E_asc such that S; asc is irreflexive.

Finally, the next proposition ensures that without mixing SC and non-SC accesses to the same location, RC11 supplies the stronger guarantee of C11. As a consequence, programmers that never mix such accesses may completely ignore the difference between RC11 and C11 regarding SC.

Proposition 3. If all SC accesses are to distinguished locations (for every a, b ∈ E \ Eo, if mod(a) = sc and loc(a) = loc(b) then mod(b) = sc) then [E_asc]; hb; [E_asc] ⊆ psc.
typically more SC reads than SC writes in programs, this scheme is less preferred.

5. Compilation to Power

In this section, we present the Power model and the mappings of language operations to Power instructions. We then prove the correctness of compilation from RC11 to Power.

As a model of the Power architecture, we use the recent declarative model by Alglave et al. [3], which we denote by Power. Its executions are similar to the ones above, with the following exceptions:

- Read/write labels have the form \( R(x, v) \) and \( W(x, v) \) (they do not include a “mode”).
- Power executions track syntactic dependencies between events in the same thread, and derive a relation called preserved program order, denoted \( ppo \), which is a subset of \( sb \) guaranteed to be preserved. The exact definition of \( ppo \) is quite intricate, and is included in Appendix F.
- Power has two types of fence events: a “lightweight fence” and a “full fence”. We denote by \( \text{lwsync} \) and \( \text{sync} \) the set of all lightweight fence and full fence events in a Power execution. Power’s “instruction fence” (\( \text{isync} \)) is used to derive \( ppo \) but is not recorded in executions.

In addition to \( ppo \), the following additional derived relations are needed to define Power-consistency (see [3] for further explanations and details).

\[
\begin{align*}
\text{sync} & \triangleq [R]; sb; [\text{sync}]; sb; [R] \\
\text{lwsync} & \triangleq [R]; sb; [\text{sync}]; sb; [R] \setminus (W \times R) \\
\text{fence} & \triangleq \text{sync} \cup \text{lwsync} \\
\text{hb} & \triangleq ppo \cap \text{fence} \cup \text{rf} & \text{(Power’s happens-before)} \\
\text{prop}_1 & \triangleq [w]; \text{rf}; \text{fence}; \text{hb}_p^*; [w] \\
\text{prop}_2 & \triangleq (\text{mo} \cup \text{urbe}); \text{rf}; (\text{fence}; \text{hb}_p^*)^2; \text{sync}; \text{hb}_p^* \\
\text{prop} & \triangleq \text{prop}_1 \cup \text{prop}_2 & \text{(propagation relation)}
\end{align*}
\]

where for every relation \( c \) (e.g., \( \text{rf}, \text{mo} \), etc.), we denote by \( ci \) and \( ce \) (internal and external) its thread-internal and thread-external restrictions. Formally, \( ci = c \cap sb \) and \( ce = c \setminus sb \).

Definition 5. A Power execution \( G \) is Power-consistent if it is complete and the following hold:

1. \( sb \setminus \text{loc} \cup \text{rf} \cup \text{rb} \cup \text{mo} \) is acyclic. (SC-PER-LOC)
2. \( \text{rbe} \cap \text{prop} \cap \text{hb}_p^* \) is irreflexive. (OBSERVATION)
3. \( \text{mo} \cup \text{prop} \) is acyclic. (PROPAGATION)
4. \( \text{rmw} \cap (\text{rbe} \cup \text{mo}) \) is irreflexive. (POWER-ATOMICITY)
5. \( \text{hb}_p \) is acyclic. (POWER-NO-THIN-AIR)

Remark 3. The model in [3] contains an additional constraint: \( \text{mo} \cup [\text{At}]; \text{sb} \cup [\text{At}] \) should be acyclic (recall that \( At = \text{dom}(\text{rmw}) \cup \text{codom}(\text{rmw}) \)). Since none of our proofs requires this property, we excluded it from Def. 5.

\[
\begin{align*}
(R^2 & \triangleq 1d) \\
(R^\text{rlx} & \triangleq 1d; \text{cmp}; \text{bc}) \\
(R^\text{acq} & \triangleq 1d; \text{cmp}; \text{bc}; \text{sync}; (W^*)^2; \text{lwsync}; \text{st}) \\
(F^\text{nec} & \triangleq \text{lwsync}; (F^\text{ec}) \cap \text{sync}) \\
(R^\text{acq} & \triangleq L; \text{lwarx}; \text{cmp}; \text{bc} \cup \text{bc} \cup \text{stw} \cup \text{bc} \cup \text{L}; \text{L}) \\
(R^\text{acq} & \triangleq (\text{RMW}^\text{rlx})^2; \text{sync} \\
(R^\text{acq} & \triangleq \text{lwsync} \cap (\text{RMW}^\text{rlx})^2) \\
(R^\text{acq} & \triangleq \text{lwsync} \cap (\text{RMW}^\text{rlx})^2); \text{isync}
\end{align*}
\]

Figure 7. Compilation of non-SC primitives to Power.

Leading sync | Trailing sync
--- | ---
(R^\text{rc} & \triangleq \text{sync}; (R^\text{acq})^2) | (R^\text{rc}; (R^\text{rc})^2) \triangleq 1d; \text{sync} | (R^\text{wsc} & \triangleq \text{sync}; (R^\text{wsc})^2) | (R^\text{wsc}; (R^\text{wsc})^2) \triangleq \text{sync} | (R^\text{acq} & \triangleq \text{sync}; (R^\text{acq})^2) | (R^\text{wsc} & \triangleq \text{sync}; (R^\text{wsc})^2)

Figure 8. Compilations of SC accesses to Power.

Unlike RC11, well-formed Power programs do not have undefined behavior. Thus, a function \( O : \text{Loc} \rightarrow \text{Val} \) is an outcome of a Power program \( P \) if it is an outcome of some Power-consistent full execution of \( P \) (see Def. 4).

As already mentioned, the two compilation schemes from C11 to Power that have been proposed in the literature [1] differ only in the mappings used for SC accesses (see Fig. 8). The first compilation scheme follows the leading sync convention, and places a sync fence before each SC access. The alternative scheme follows the trailing sync convention, and places a sync fence after each SC access. Importantly, the same scheme should be used for all SC accesses in the program, since mixing the schemes is unsound. The mappings for the non-SC accesses and fences are common to both schemes and are shown in Fig. 7. Note that our compilation of relaxed reads is stronger than the one proposed for C11 (see §2.4).

Our main theorem says that the compilation schemes are correct. For a program \( P \), we denote by \( \langle \{P\} \rangle \) the Power-program obtained by compiling \( P \) using the scheme in Fig. 7 and either of the schemes in Fig. 8 for SC accesses.

Theorem 1. Given a program \( P \), every outcome of \( \langle \{P\} \rangle \) under Power is an outcome of \( P \) under RC11.

Proof (Outline). The main idea is to consider the compilation as if it happens in three steps, and prove the soundness of each step:

1. Leading sync: Each \( R^\text{acq} \cap W^\text{sc} \cap \text{RMW}^\text{wsc} \) in \( P \) is replaced by \( F^\text{sc} \) followed by \( \text{R} \cap \text{wsc} \cap \text{RMW}^\text{acqrel} \).
   Trailing sync: Each \( R^\text{acq} \cap W^\text{sc} \cap \text{RMW}^\text{wsc} \) in \( P \) is replaced by \( R^\text{acq} \cap W^\text{sc} \cap \text{RMW}^\text{acqrel} \) followed by \( F^\text{sc} \).
2. The mappings in Fig. 7 are applied.
3. Leading sync: Pairs of the form \( \text{sync}; \text{lwsync} \) that originated from \( R^\text{acq} \cap W^\text{sc} \cap \text{RMW}^\text{wsc} \) are reduced to \( \text{sync} \) (eliminating the redundant \( \text{lwsync} \)).
   Trailing sync: Any \( \text{cmp}; \text{bc}; \text{isync} \) sequences originated from \( R^\text{acq} \cap W^\text{sc} \cap \text{RMW}^\text{wsc} \) are reduced to \( \text{sync} \) (eliminating the redundant \( \text{cmp}; \text{bc}; \text{isync} \)).
The resulting Power program is clearly identical to the one obtained by applying the mappings in Figures 7 and 8.

Soundness of the steps (that is, none of them introduces additional outcomes) follows from Lemmas E.2 and H.2 and Appendix F.3.

The main difficulty (and novelty of our proof) lies in proving soundness of the second step, and more specifically in establishing the NO-THIN-AIR condition. Since Power, unlike RC11, does not generally forbid sb U rf cycles, we have to show that such cycles can be untangled to produce a racy RC11-consistent execution, witnessing the undefined behavior. Here, the idea is, similar to DRF-SC proofs, to detect a first rf-edge that closes an sb U rf cycle, and replace it by a different rf-edge that avoids the cycle. This is highly non-trivial because it is unclear how to define a “first” rf-edge when sb U rf is cyclic. To solve this problem, we came up with a different ordering of events, which does not include all sb edges, and Power ensures to be acyclic (a relation we call Power-before in Appendix G).

For completeness, we also show that the conditional branch after the relaxed read is only necessary if we care about enforcing the NO-THIN-AIR condition. That is, let weakRC11 be the model obtained from RC11 by omitting the NO-THIN-AIR condition, and denote by \(\langle P \rangle_{\text{weak}}\), the Power-program obtained by compiling P as above, except that relaxed reads are compiled to plain loads (again, with either leading or trailing syncs for SC accesses). Then, this scheme is correct with respect to the weakRC11 model.

**Theorem 2** (Compilation of weakRC11 to Power). Given a program P, every outcome of \(\langle P \rangle_{\text{weak}}\) under Power is an outcome of P under weakRC11.

Finally, we note that it is also possible to use a lightweight fence (1\(\omega\)sync) instead of a fake control dependency and an instruction fence (1\(\omega\)sync) in the compilation of (all or some) acquire accesses. The correctness follows from Appendix F.3.

### 6. Compilation to ARMv7

The ARMv7 model [3] is very similar to the Power model just presented in §5. There are only two differences.

First, while ARMv7 has analogues for Power’s strong fence (sync) and instruction fence (1\(\omega\)sync), it lacks an analogue for Power’s lightweight fence (1\(\omega\)sync). Thus, on ARMv7 we have \(F^{1\omega sync} = \emptyset\) and so \(fence = sync\).

The second difference is that ARMv7 has a somewhat weaker preserved program order, ppo, than Power, which in particular does not always include \([R_y]; sb; [W_x]\) (following the model in [3]). In our Power compilation proofs, however, we never rely on this property of Power’s ppo (see Appendix F).

The compilation schemes to ARMv7 are essentially the same as those to Power substituting the corresponding ARMv7 instructions for the Power ones: dmb instead of sync and l\(\omega\)sync, and i\(\omega\) instead of l\(\omega\)sync. (Since ARMv7 lacks an analogue for l\(\omega\)sync, the compilation to ARMv7 uses a strong fence (dmb) instead.) The soundness of compilation to ARMv7 follows directly from Theorems 1 and 2.

We note that neither GCC (version 5.4) nor LLVM (version 3.9) map acquire reads into 1d; cmp; bc; i\(\omega\) (instead of 1d; sb). Instead, they emit 1d; dmb (that corresponds to Power’s 1d; sync). With this stronger compilation scheme, there is no correctness problem in compilation of C11 to ARMv7. Nevertheless, if one intends to use i\(\omega\)’s, the same correctness issue arises (e.g., the one in Fig. 1), and RC11 overcomes this issue.

### 7. Correctness of Program Transformations

In this section, we list program transformations that are sound in RC11, and prove that this is the case. As in [27], to have a simple presentation, all of our arguments are performed at the semantic level, as if the transformations were applied to events in an execution. Thus, to prove soundness of a program transformation \(P_{src} \rightsquigarrow P_{tgt}\), we are given an arbitrary RC11-consistent execution \(G_{tgt}\) of \(P_{tgt}\), and construct a RC11-consistent execution \(G_{src}\) of \(P_{src}\), such that either \(G_{src}\) and \(G_{tgt}\) have the same outcome or \(G_{src}\) is racy. In the former case, we show that \(G_{tgt}\) is racy only if \(G_{src}\) is. Consequently, one obtains that every outcome of \(P_{tgt}\) under RC11 is also an outcome of \(P_{src}\) under RC11.

The soundness proofs (sketched in Appendix I) are mostly similar to the proofs in [27], with the main difference concerning the new SC.

**Strengthening** Strengthening transforms the mode \(o\) of an event in the source into \(o'\) in the target where \(o \subseteq o'\). Soundness of this transformation is trivial, because RC11-consistency is monotone with respect to the mode ordering.

**Sequentialization** Sequentialization merges two program threads into one, by interleaving their events in sb. Essentially sequentialization just adds edges to the sb relation. Its
soundness trivially follows from the monotonicity of RC11-consistency with respect to \( \text{sb} \).

**Deordering** Table 1 defines the deorderable pairs, for which we proved the soundness of the transformation \( \{ \text{na}, \text{rlx}, \text{acq}, \text{sc} \} \) satisfying \( \text{na} \subseteq \sigma \); and \( \sigma \) denotes the maximal mode in \( \{ \text{na}, \text{rlx}, \text{rel}, \text{sc} \} \) satisfying \( \text{sc} \subseteq \sigma \).

**Merging** Merges are transformations of the form \( X; Y \leadsto Z \), eliminating one memory access or fence. Fig. 9 defines the set of mergeable pairs. Note that using strengthening, the modes mentioned in Fig. 9 are upper bounds (e.g., \( \text{Racq} \); \( \text{RL} \) can be first strengthened to \( \text{Racq} \); \( \text{RL} \) and then merged). Generally speaking, RC11 supports all mergings that are intended to be mergeable in C11 [27].

**Remark 4.** The elimination of redundant read-after-write allows the write to be non-atomic. Nevertheless, an SC read cannot be eliminated in this case, unless it follows an SC write. Indeed, elimination of a an SC read after a non-SC write is unsound in RC11. We note that while this elimination is allowed by a certain fix of of C11 described in [27], its effectiveness seems to be low, and, in fact, it is already unsound for the model in [4] (see Appendix A.3 for a counterexample). Note also that read-after-RMW elimination does not allow the read to be an acquire read unless the update includes an acquire read (unlike read-after-write). This is due to release sequences: eliminating an acquire read after a relaxed update may remove the synchronization due to a release sequence ending in this update.

**Register Promotion** Finally, “register promotion” is sound in RC11. This global program transformation replaces all the accesses to a memory location by those to a register, provided that the location is used by only one thread. At the execution level, all accesses to a particular location are removed from the execution, provided that they are all \( \text{sb} \)-related.

### 8. Programming Guarantees

In this section, we demonstrate that our semantics for SC atoms (i.e., the SC condition in Def. 1) is not overly weak. We do so by proving theorems stating that programmers who follow certain defensive programming patterns can be assured that their programs exhibit no weak behaviors. The first such theorem is DRF-SC, which says that if a program has no races on non-SC accesses under SC semantics, then its outcomes under RC11 coincide with those under SC.

In our proofs we use the standard declarative definition of SC: an execution is SC-consistent if it is complete, satisfies \( \text{ATOMICITY} \), and \( \text{sb} \cup \text{rf} \cup \text{mo} \cup \text{rb} \) is acyclic [25].

**Theorem 3.** If in all SC-consistent executions of a program \( P \), every race \( \langle a, b \rangle \) has \( \text{mod}(a) = \text{mod}(b) = \text{sc} \), then the outcomes of \( P \) under RC11 coincide with those under SC.

Next, we show that adding a fence instruction between every two accesses to \( \text{shared} \) locations restores SC, or there remains a race in the program, in which case the program has undefined behavior. More formally, we say that a location is shared if it is accessed by more than one threads.

**Definition 6.** A location \( x \) is \( \text{shared} \) in an execution \( G \) if \( \langle a, b \rangle \not\in \text{sb} \cup \text{sb}^{-1} \) for some distinct events \( a, b \in E_x \).

**Theorem 4.** Let \( G \) be an RC11-consistent execution. Suppose that for every two distinct shared locations \( x \) and \( y \), \( \text{for each } E_x \); \( \text{sb} \cup E_y \subseteq \text{sb} \); \( \text{Ec} \); \( \text{sb} \). Then, \( G \) is SC-consistent.

We remark that for the proofs of Theorems 3 and 4, we do not need the full SC condition: for Thm. 3 it suffices for \( [E^C]; \text{sb} \cup [E^C] \); \( \text{sb} \cup [E^C]; \text{sb} \). Then, \( G \) is SC-consistent.

### 9. Related Work

Despite having been developed quite recently, a fair number of problems have been found in the C11 memory model. The model itself was designed by the C++ standard committee based on a paper by Boehm and Adve [9]. During the standardization process, Batty et al. [7] formalized the C11 memory model and proved soundness of its compilation to x86-TSO. They also proposed a number of key technical improvements to the model (such as some coherence axioms), which were incorporated into the standard.

Soon afterwards, Batty et al. [6] and Sarkar et al. [24] studied the compilation of C11 to Power, and incorrectly proved the correctness of two compilation schemes. In their proofs, from a consistent Power execution, they constructed a corresponding C11 execution, which they tried to prove consistent, but in doing so they forgot to check the overly strong condition S1. The examples shown in the introduction and in §2.1 are counterexamples to their theorems.

Quite early on, a number of papers [11, 28, 22, 10] noticed the disastrous effects of thin-air behaviors allowed by the C11 model, and proposed strengthening the definition of consistency by disallowing \( \text{sb} \cup \text{rf} \) cycles. [10] further
discussed how the compilation schemes of relaxed accesses to Power and ARM would be affected by the change, but did not formally prove the correctness of their proposed schemes.

Next, Vafeiadis et al. [27] noticed a number of other problems with the C11 memory model, which invalidated a number of source-to-source program transformations that were assumed to hold. They proposed local fixes to those problems, and showed that these fixes enabled proving correctness of a number of local transformations. We have incorporated their fixes in the RC11-consistency definition.

Then, in 2016, Batty et al. [4] proposed a more concise semantics for SC atomics, whose presentation we have followed in our proposed RC11 model. As their semantics is stronger than C11, it cannot be compiled efficiently to Power, contradicting the claim of that paper. Moreover, as already discussed, SC fences are still too weak according to their model: in particular, putting them between every two accesses in a program with only atomic accesses does not guarantee SC.

Finally, Manerkar et al. [20] recently discovered the problem with trailing-sync compilation to Power (in particular, they observed the IRIW-acq-sc counterexample), and identified the mistake in the existing proof. Independently, we discovered the same problem, as well as the problem with leading-sync compilation. Moreover, in this paper, we have proposed a fix for both problems, and proven that it works.

A number of works [28, 26, 16, 15] have previously studied only small fragments of the C11 model—typically the release/acquire fragment. Among these, Lahav et al. [15] previously proposed strengthening the semantics of SC fences in a different way by treating them as read-modify-writes to a distinguished location. That strengthening, however, was considered in the restricted setting of only release/acquire accesses, and does not directly scale to the full set of C11 access modes. In fact, for the fragment containing only SC fences and release/acquire accesses, RC11-consistency is equivalent to RA-consistency that treats SC fences as RMWs to a distinguished location [15].

10. Conclusion

In this paper, we have introduced the RC11 memory model, which corrects all the known problems of the C11 model. We have further proved (i) the correctness of compilation from RC11 to x86-TSO [23], Power and ARMv7 [3]; (ii) the soundness of various program transformations; (iii) a DRF-SC theorem; and (iv) a theorem showing that for programs without non-atomic accesses, weak behaviors can be always avoided by placing SC fences. It would be useful to mechanize the proofs of this paper in a theorem prover; we leave this for future work.

A certain degree of freedom exists in the design of the SC condition. A very weak version, which still maintains the two formal programming guarantees of this paper, would require acyclicity of \((E^c_2 \cup F^c_2); \text{sb}; (\text{sb} \cup \text{eco}); (E^c_2 \cup \text{sb}; F^c_2))\). In our choice of psc we aimed to provide: (i) stronger guarantees, while still, needless to say, ensuring the correctness of compilation; and (ii) additional optimization opportunities which are sound for sequential consistency (eliminating overwritten SC writes and repeated SC reads).

Regarding the infamous out-of-thin-air problem, we employed in RC11 a conservative solution at the cost of including a fake control dependency after every relaxed read. While this was already considered as a valid solution before, we are the first to prove the correctness of this compilation scheme, as well as the soundness of reordering of independent non-atomic accesses under this model. Correctness of an alternative scheme that places a lightweight fence after every relaxed write is left for future work. It is interesting to check the practical performance costs of each scheme. On the one hand, relaxed writes (which are not followed by a fence) are perhaps rare in real programs, compared to relaxed reads. On the other hand, a control dependency is cheaper than a lightweight fence, and relaxed reads are often anyway followed by a control dependency.

Another important future direction would be to combine our SC constraint with the recent model of Kang et al. [14], which prevents out-of-thin-air values (and avoids undefined behaviors all together), while still allowing the compilation of relaxed reads and writes to plain loads and stores. This is, in particular, crucial for adopting a model like RC11 in a type-safe language, like Java, that cannot allow undefined behaviors. Integrating our SC condition in that model, however, is non-trivial because the Kang et al. [14] model is defined in a very different style from C11, and thus we will have to find an equivalent operational way to check our SC condition. Finally, establishing the correctness of compilation of RC11 to ARMv8 [12] is another important future goal.

Acknowledgments

This research was supported in part by an ERC Consolidator Grant for the project “RustBelt” (grant agreement no. 683289), and in part by Samsung Research Funding Center of Samsung Electronics under Project Number SRFC-IT1502-07. The third author has been supported by a Korea Foundation for Advanced Studies Scholarship.

References


A. Further Examples

A.1 Failure of leading sync convention with SC fences

The following behavior is disallowed according to C11, but allowed by its compilation to Power.

\[
x := _{\text{rlx}} 2 \quad y := _{\text{sc}} 1 \quad x := _{\text{rel}} 1 \quad d := _{\text{rlx}} 2 \quad e := x_{\text{sc}} \# 1 \quad \text{(Rsync+Rsc)}
\]

Under C11, this behavior is forbidden. Consider the following execution (the initialization of \(x\) is omitted):

[Diagram of execution]

The \(\text{rf}\) and \(\text{mo}\) edges are forced because of coherence. Now, the C11 conditions on SC fences require, in particular, that \([E^\text{sc}] := \text{sc}; \text{rb}; [E^\text{sc}] \subseteq S\) and \([E^\text{sc}] := \text{rb}; \text{rb}; [E^\text{sc}] \subseteq S\). Hence, we must have \(S(l, n)\) (essentially because if we had \(S(n, l)\), then \(m\) would have been reading from an overwritten write), as well as \(S(q, l)\) (essentially because if we had \(S(q, l)\), then \(m\) would have been reading from an \(\text{mo}\)-overwritten write before the fence). By transitivity, we thus have \(S(q, n)\) which contradicts condition S1, which requires \(S(n, q)\) because of the happens-before path via \(o\).

The compilation to Power allows the behavior because again the sync fences do not provide sufficient synchronization: again all but one sync fences are useless, as they are placed at the beginning of a thread.

A.2 Failure of write-after-write and read-after-read eliminations using \(\text{psc}_3\)

We present the executions showing the failure of write-after-write and read-after-read eliminations using \(\text{psc}_3\) (see §2.3).

First, the following execution is an execution of \(\text{WWmerge}\) yielding the result \(a = b = c = 0\). The initialization events are omitted.

[Diagram of execution]

This execution is inconsistent with the SC condition that requires acyclicity of \(\text{psc}_3\) (since we have \(\langle k, n \rangle, \langle n, p \rangle, \langle p, q \rangle, \langle q, k \rangle \in \text{psc}_3\)). It is, however, consistent using our final \(\text{psc}\) relation \((k, n) \notin \text{psc}\).

Now, the following execution is an execution of the same \(\text{WWmerge}\) program, but after replacing \(b := x_{\text{sc}}\) by \(b := a\), again yielding the result \(a = b = c = 1 \land d = 0\). This execution is consistent with the SC condition that requires acyclicity of \(\text{psc}_3\) (as well as with our final \(\text{psc}\) relation).

[Diagram of execution]

A.3 Failure of SC-read-after-non-SC-write elimination

\[
y_{\text{sc}} := 2; \quad x := _{\text{rlx}} 1; \quad y := _{\text{sc}} 2; \quad x := _{\text{rel}} 1; \quad y := _{\text{sc}} 1; \quad a := 1; \quad c := _{\text{rlx}} 2; \quad b := _{\text{rlx}} 2 \quad \text{psc}\]

The annotated behavior is allowed under RC11 for the target, but not for the source. The same applies to the model of Batty et al. [4].
B. Programs to Executions: Receptiveness Assumption

To carry out the compilation correctness proof, we need to record syntactic dependencies between instructions, as in the Power model. (This is only needed if one is interested in the NO-THIN-AIR condition; compilation correctness for weakRC11 may completely ignore this extension.) Dependencies are classified into data, address and control dependencies. Accordingly, we extend the definition of an execution (see §3.1), with additional relations data, addr and ctrl. We use deps to denote the union of the three relations. We require data, addr and ctrl to satisfy the following:

1. data ⊆ R × W.
2. addr ⊆ R × (R ∪ W).
3. ctrl ⊆ R × E.
4. ctrl; sb ⊆ ctrl.
5. rmw ⊆ deps.

The dependency relations are calculated from the program syntax, together with the generation of program’s execution (like in Power), and the construction ensures that the above properties hold. Moreover, the construction of executions from programs provides us with the following receptiveness property:

Definition B.1. A function lab' : Event → Label is called a reevaluation of lab : Event → Label if for every event a, the label lab'(a) is identical to lab(a), except possibly for read/written value.

Notation B.1. Given an execution G and a reevaluation lab of G.1ab, lab(G) denotes the execution G' given by: G'.1ab = lab, G'.rf = ∅, and G'.c = G.c for every c ∈ {E, sb, rmw, mo, data, addr, ctrl}.

Assumption B.1 (receptiveness). Let G be an execution of a program P. Let a ∈ R, and suppose that a ∉ dom(deps⁺; (ctrl ∪ addr)). For every v ∈ Val, there exists a reevaluation lab of G.1ab such that:

- lab(G) is an execution of P.
- lab(G).val_r(a) = v.
- lab(b) = G.1ab(b) whenever ⟨a, b⟩ ∉ G.deps⁺.

Note that a more basic receptiveness property follows from this assumption: if a ∉ dom(sb) then for every v ∈ Val, we have that lab(G) is an execution of P, for the reevaluation lab of G.1ab that sets the read value of a to v, and otherwise is identical to G.1ab.

In addition, we assume that the set of executions of a program is prefix-closed:

Notation B.2. Given an execution G and a set E ⊆ E that is downwards closed w.r.t. sb (i.e., a ∈ E whenever ⟨a, b⟩ ∈ sb for some b ∈ E), and contains at least all the initialization events, the restriction of G to E, denoted G|E, is the execution G' given by G'.E = E, G'.1ab = G.1ab|E, and G'.c = [E]; G.c; [E] for c ∈ {sb, rmw, rf, mo, data, addr, ctrl}.

Assumption B.2 (prefix-closed executions). Let G be an execution of a program P, and let E be a subset of E that is downwards closed w.r.t. sb, and contains at least all the initialization events. Then, G|E is an execution of P.

C. Properties of RC11

In this section, we present some basic properties of the derived relations eco, sw, hb and of RC11-consistent executions. We omit some of the proofs that straightforwardly follow from our definitions. For the rest of this section, consider an arbitrary execution G.

Proposition C.1. eco is a strict partial order.

Proposition C.2. Suppose that [w]; sb|loc; [w] ≤ mo and rmw ≤ rb. Then, the following hold:

1. rs ≤ eco⁺.
2. [w]; sw; [R] ≤ eco.
3. [w]; sw; [P] ≤ eco; sb.
4. eco; hb ≤ eco ∪ eco; (sb \ rmw); hb⁺.

Proof.

1. Let ⟨a, b⟩ ∈ rs. Then, by definition, ⟨a, b⟩ ∈ [w]; sb|loc; [w] ≤ mo and rmw ≤ rb, we have ⟨a, b⟩ ∈ eco⁺. Since eco is transitive, we have ⟨a, b⟩ ∈ eco⁺.
Let \( (a, b) \in [w] \setminus [sw; [R]]. \) Then, by definition, we have \( (a, b) \in rs; rf \). Using the previous item, we obtain that \( (a, b) \in eco^2; eco \subseteq eco. \)

3. Let \( (a, b) \in [w]; sw; [F]. \) Then, by definition, we have \( (a, b) \in rs; rf; sb. \) Using the first item, we obtain that \( (a, b) \in eco^2; eco; sb \subseteq eco; sb. \)

4. Let \( (a, c) \in eco; hb \), and let \( b \in E \) be an eco-maximal event satisfying \( (a, b) \in eco \), and \( (b, c) \in hb^2 \). If \( b = c \) then \( (a, c) \in eco \), and we are done. Otherwise, the maximality of \( b \) ensures that \( (b, b') \in sb \setminus sw \) and \( (b', c) \in hb^2 \) for some \( b' \in E. \) Since \( rmw \subseteq rb \subseteq eco \), it follows that \( (a, c) \in eco; (sb \setminus rmw); hb^2. \) \( \Box \)

**Lemma C.1** (Read at end). Let \( a \in R \setminus dom(sb). \) Suppose that \( G' = G'|_{[G, E \setminus \{a\}]} \) is RC11-consistent. Then, there exists an event \( b \in E \) such that the execution \( G'' \) given by \( G''c = G . c \) for every \( c \in \{ E, sb, rmw, data, addr, ctrl, mo \} \), \( G'' \cdot lab = G' \cdot lab \cup \{ a \rightarrow R_\mod(a)(loc(a), val_e(b)) \} \), and \( G'' \cdot rf = G' \cdot rf \cup \{ (b, a) \} \) is RC11-consistent.

**Proof.** Take \( b \) to be the \( mo \)-maximal event in \( E \cdot w_{1oc(a)}. \) It is straightforward to show that \( G'' \), as defined in the statement, is RC11-consistent. \( \Box \)

**Proposition C.3.** Let \( a \in \mathbb{W}^{≤10} \setminus dom(rb). \) Let \( G' = G|_{[G, E \setminus \{a\}]} \). Then, \( [G'. E]; G . hb; [G'. E] = G' . hb. \)

**Proposition C.4.** Let \( G' \) be any execution obtained from \( G \) by possibly changing the value read at some \( a \in \mathbb{R}^{na} \), and the source of the \( rf \)-edge entering the event \( a. \) Then, \( G' . hb = G . hb. \)

**Proposition C.5.** Let \( G' \) be an execution, such that \( G' . E = G . E \cup \{ a \} \) for some event \( a. \) Suppose that \( a \in G' . \mathbb{R}^{na} \), \( G . sb \subseteq G' . sb \), \( G . lab \subseteq G' . lab \), \( G . rmw = G' . rmw \), and \( G' . rf = G . rf \cup \{ (b, a) \} \) for some \( b \in G . E. \) Then, \( [G . E]; G' . hb; [G . E] = G . hb. \)

### D. The RC\( na \) Model

In this section we present a variant of RC11, which has a smaller \( psc \) relation, and is useful in our correctness of compilation proofs. It is based on the following additional derived relations:

\[
\begin{align*}
rb^{na} &\triangleq [r^{na}]; rb \\
r^{\not{na}}b &\triangleq rb \setminus rb^{na} \\
ec^{\not{na}} &\triangleq rf \cup (mo \cup rb^{\not{na}}); rf \\
p^{\not{na}}sc &\triangleq ([ec] \cup [fsc]; hb); (sb \cup ec^{\not{na}}); sb; sb|_{\not{loc}hb}; sb|_{\not{loc}hb}; ([ec] \cup hb; [fsc])
\end{align*}
\]

**Proposition D.1.** If \( rb^{na} \subseteq hb \) then \( psc = p^{\not{na}}sc. \)

**Proof.** Immediately follows from our definitions.

(Note that \( psc \setminus p^{\not{na}}sc \subseteq [fsc]; hb; (rb^{\not{na}}); rf; ([ec] \cup hb; [fsc]), \) and \( hb; rf^{\not{na}} \subseteq hb^{\not{na}}; (sb \cup rf).\) \( \Box \)

We call an execution RC\( na \)-consistent if it satisfies all conditions of Def. 1, except possibly for \( sc \), and \( p^{\not{na}}sc \) is acyclic.

**Lemma D.1.** Let \( G \) be an RC\( na \)-consistent execution of a program \( P. \) Then, either \( G \) is RC11-consistent, or \( P \) has undefined behavior under RC11.

**Proof.** If \( rb^{na} \subseteq hb \), then, by Prop. D.1, \( psc = p^{\not{na}}sc \) and \( G \) is RC11-consistent. Suppose otherwise. We show that \( P \) has undefined behavior under RC11. Let \( a_1, \ldots, a_n \) be an enumeration of \( E \) that respects \( sb \cup rf \) (that is, \( i < j \) whenever \( (a_i, a_j) \in sb \cup rf \)). For every \( 1 \leq i \leq n \), let \( E_i = E_0 \cup \{ a_1, \ldots, a_i \} \) and \( G_i = G|_{E_i}. \) Let \( k \) be the minimal index such that \( G_k . rb^{na} \not\subseteq G_k . hb. \) Then, by Prop. D.1, \( G_{k-1} . psc = G_{k-1} . p^{\not{na}}sc \) is acyclic, and so \( G_{k-1} \) is RC11-consistent. Let \( (a_k, a_k) \in G_k . rb^{na} \setminus G_k . hb. \) Then, we must have \( a_k \in \{ a_k, a_k \}. \) Note also that \( (a_k, a_k) \not\in G_k . hb \) since \( G_k \) satisfies COHERENCE.

Now, if \( G_k \) is RC11-consistent, then we are done (it is a racy execution of \( P). \) Suppose otherwise. We show that \( a_k \neq a_k. \) Indeed, since \( G_k \) is RC\( na \)-consistent but not RC11-consistent, and \( G_{k-1} \) is RC11-consistent, it must be the case that \( \mod(a_k) = sc \), and there exist \( b, f \in E_{k-1} \) such that:

- \( (a_k, b) \in G_k . mo; G_{k-1} . rf^{\not{na}}; (G_{k-1} . hb; [F]) \); \( [G_{k-1} . ec] \)
We use the following additional notations:

- \( (b, f) \in G_{k-1}\) psc\(^*\); \([F^\text{pc}]\)
- \( (f, ak) \in G_{k-1}\) \(hb\); \(G_k\) \(rb^{na}\)

Now, since we have \([E_{k-1}]\); \(G_k\) \(rb\); \(G_k\) \(mo\); \([E_{k-1}]\) \(\subseteq G_{k-1}\) \(rb\), it follows that \((f, b) \in G_{k-1}\) psc. This, however, contradicts the fact that \(G_{k-1}\) is RC11-consistent.

Therefore, we have \(ak = ak\). Let \(x = G.xloc(ak)\). By Lemma C.1, there exists an event \(b \in G_{k-1}\), such that the execution \(G'\) given by \(G'.c = G_k.c\) for every \(c \in \{E, sb, rmw, data, addr, ctrl, mo\}\), \(G'.\) lab = \(G_{k-1}\). lab \(\cup \{ak \mapsto \text{Ra}^n(x, \text{val}_a(b))\}\), and \(G'.rf = G_{k-1}.rf \cup \{(b, ak)\}\) is RC11-consistent.

By Assumption B.1, \(G'\) is an execution of \(P\). In addition, we have \((ak, ak) \notin (b, ak)\), \(G_k\) hb and \(G'.\)hb = \(G_k\) hb by Prop. C.4, and so \(G'\) is racy. Hence, \(P\) has undefined behavior under RC11. \(\Box\)

Next, we prove some lemmas that allow us (under some restrictions) to add a memory access inside a given execution. In what follows, we take \(G\) to be an arbitrary execution.

**Proposition D.2.** If \(a \notin \text{dom}(sb^2; [E^\text{rel}])\), then for every \(b \in E\), we have \(\langle a, b \rangle \in \text{hb} \iff \langle a, b \rangle \in \text{sb}\).

**Proof.** The assumption that \(a \notin \text{dom}(sb^2; [E^\text{rel}])\) ensures that \(a \notin \text{dom}(sb^2; sw)\), and so we have \(\langle a, b \rangle \in \text{hb} \iff \langle a, b \rangle \in \text{sb}\). \(\Box\)

**Lemma D.2** (Add write). Let \(a \in W \setminus \text{dom}(sb^2; [E^\text{rel}]) \setminus \text{At}\). Suppose that \(G' = G_{[\text{G}, E] \{a\}}\) is RC\(n_a\)-consistent. Let \(x = xloc(a)\), and suppose that \((a, b) \in sb; \{x\}\) implies \((a, b) \in sb; \{x\}; sb)\). Then, there exists a relation \(T \subseteq G.W_x \times G.W_x\) such that the execution \(G''\) given by \(G''\).c = \(G.c\) for every \(c \in \{E, \text{lab}, sb, rmw, data, addr, ctrl1\}\), \(G''\).rf = \(G'.rf\), and \(G''\).mo = \(G'.mo\) \(\cup T\) is RC\(n_a\)-consistent.

**Proof.** Let \(C = \{c \in G'.W_x \mid \langle a, c \rangle \in G.sb; G'.mo^2\}\), and take \(T = ((\{a\} \times C) \cup ((G'.W_x \setminus C) \times \{a\})\). It is straightforward to show that \(G''\), as defined in the statement, is RC\(n_a\)-consistent. In particular, we have \(G''.\)psc\(\neq na\) = \(G'.\)psc\(\neq na\). \(\Box\)

**Lemma D.3** (Add rmw write). Suppose that \(\text{rmw}^{-1}; \text{rf}^{-1}; \text{rf}; \text{rmw} \subseteq [G.E].\) Let \(a \in (W \cap \text{At}) \setminus \text{dom}(sb^2; [E^\text{rel}])\). Suppose that \(G' = G_{[\text{G}, E] \{a\}}\) is RC\(n_a\)-consistent. Let \(x = xloc(a)\), and suppose that \((a, b) \in sb; \{x\}\) implies \((a, b) \in sb; \{x\}; sb)\). Then, there exists a relation \(T \subseteq G.W_x \times G.W_x\) such that the execution \(G''\) given by \(G''\).c = \(G.c\) for every \(c \in \{E, \text{lab}, sb, rmw, data, addr, ctrl1\}\), \(G''\).rf = \(G'.rf\), and \(G''\).mo = \(G'.mo\) \(\cup T\) is RC\(n_a\)-consistent.

**Proof.** Let \(b, d \in G'.E\) such that \((b, a) \in G.\)rmw and \((d, b) \in G'.\)rf. Let \(C = \{c \in G'.W_x \mid \langle d, c \rangle \in G'.\)mo\}, and take \(T = ((\{a\} \times C) \cup ((G'.W_x \setminus C) \times \{a\})\). It is straightforward to show that \(G''\), as defined in the statement, is RC\(n_a\)-consistent. \(\Box\)

**Lemma D.4** (Add non-atomic read). Let \(a \in \text{Ra}^{-2} \setminus \text{dom}(sb^2; [E^\text{rel}])\). Suppose that \(G' = G_{[\text{G}, E] \{a\}}\) is RC\(n_a\)-consistent. Then, there exists an event \(b \in G'.\)W such that the execution \(G''\) given by \(G''\).E = \(G.E\), \(G''.\)lab = \(G'.\)lab \(\cup \{a \mapsto \text{Ra}^{-2}(\text{loc}(a), \text{val}_a(b))\}\), \(G''.\)c = \(G.c\) for every \(c \in \{sb, rmw, data, addr, ctrl1\}\), \(G''.\)rf = \(G'.rf\) \(\cup \{(b, a)\}\), and \(G''.\)mo = \(G.mo\) is RC\(n_a\)-consistent.

**Proof.** Let \(x = xloc(a)\). Let \(B = \{b \in G.W_x \mid \langle b, a \rangle \in G.\)rf\(\}; G.hb\}\), and take \(b\) be the \(mo\)-maximal event in \(B\). It is straightforward to show that \(G''\), as defined in the statement, is RC\(n_a\)-consistent. \(\Box\)

**E. Proof of Global Transformation of SC accesses**

In this section we prove the soundness of a global program transformation that either adds an SC fence before every SC access, or adds an SC fence after every SC access, and then replaces all SC accesses by release/acquire ones. This will allow us later to prove the correctness of compilation only for programs that do not contain any SC accesses.

We use the following additional notations:

\[ \text{psc}_f \triangleq [F]; \text{psc}; [F] \quad \text{and} \quad \text{sb}^{'} \triangleq \text{sb} \setminus \text{rmw} \]
Lemma E.1. Let $G$ be an execution satisfying all conditions of Def. 1, except possibly for SC. Suppose that $[\text{RB}^{\text{sc}}]; (sb' \cup sb; hb; sb'); [\text{RB}^{\text{sc}}] \subseteq hb; [\text{F}^s]; hb$. Let $T = sb \cup sb|\not\in_{\text{loc}}; hb; sb|\not\in_{\text{loc}} \cup \text{eco}$. Then:

1. $[\text{F}^s]; hb; \text{eco}'; ([\text{RB}^{\text{sc}}]T; [\text{RB}^{\text{sc}}])^*; \text{eco}'; hb; [\text{F}^s] \subseteq \text{psc}^+_R$.
2. If $\text{psc}_R$ is acyclic, then so is $\text{psc}$.

Proof.

1. We show by induction on $n$, that $[\text{F}^s]; hb; \text{eco}'; ([\text{RB}^{\text{sc}}]T; [\text{RB}^{\text{sc}}])^n; \text{eco}'; hb; [\text{F}^s] \subseteq \text{psc}^+_R$ for every $n \geq 0$. For $n = 0$, the claim holds since $\text{eco}'; \text{eco}'; \subseteq \text{eco}'$, and $[\text{F}^s]; hb; \text{eco}'; hb; [\text{F}^s] \subseteq \text{psc}_R$. Suppose now that $[\text{F}^s]; hb; \text{eco}'; ([\text{RB}^{\text{sc}}]T; [\text{RB}^{\text{sc}}])^{n-1}; \text{eco}'; hb; [\text{F}^s] \subseteq \text{psc}^+_R$, and let $R = [\text{F}^s]; hb; \text{eco}'; ([\text{RB}^{\text{sc}}]T; [\text{RB}^{\text{sc}}])^n; \text{eco}'; hb; [\text{F}^s]$. Expanding the definition of $T$ (keeping in mind that $\text{rmw} \subseteq \text{eco}$) we have $R \subseteq R_1 \cup R_2$, where:

$$R_1 = [\text{F}^s]; hb; \text{eco}'; ([\text{RB}^{\text{sc}}]T; [\text{RB}^{\text{sc}}])^{n-1}; [\text{RB}^{\text{sc}}]; (sb' \cup sb|\not\in_{\text{loc}}; hb; sb|\not\in_{\text{loc}}); [\text{RB}^{\text{sc}}]; \text{eco}'; hb; [\text{F}^s],$$

$$R_2 = [\text{F}^s]; hb; \text{eco}'; ([\text{RB}^{\text{sc}}]T; [\text{RB}^{\text{sc}}])^{n-1}; \text{eco}'; \text{eco}'; hb; [\text{F}^s].$$

Since $\text{eco}; \text{eco}'; \subseteq \text{eco}$, by the induction hypothesis, we have $R_2 \subseteq \text{psc}^+_R$. In addition, since $sb|\not\in_{\text{loc}}; hb; sb|\not\in_{\text{loc}} \subseteq sb'; hb; sb'$, our assumption entails that $R_1$ is contained in

$$R_1' = [\text{F}^s]; hb; \text{eco}'; ([\text{RB}^{\text{sc}}]T; [\text{RB}^{\text{sc}}])^{n-1}; hb; [\text{F}^s]; \text{eco}; \text{eco}; hb; [\text{F}^s],$$

which, in turn, using the induction hypothesis is contained in $\text{psc}^+_R$.

2. Contrapositively, suppose that $\text{psc}$ is cyclic. Then, by definition, the union of the following relations is cyclic:

- $A_1 = [\text{RB}^{\text{sc}}]; T; [\text{RB}^{\text{sc}}]$
- $A_2 = [\text{F}^s]; hb'; (sb \cup \text{eco}); hb'; [\text{F}^s]$
- $A_3 = [\text{RB}^{\text{sc}}]; (sb \cup \text{eco}); hb'; [\text{F}^s]$
- $A_4 = [\text{F}^s]; hb'; (sb \cup \text{eco}); [\text{RB}^{\text{sc}}]$

Consider first the case that $A_1$ is cyclic. Then, since $\text{rmw} \subseteq \text{eco}$, the relation $[\text{RB}^{\text{sc}}]; (sb' \cup sb|\not\in_{\text{loc}}; hb; sb|\not\in_{\text{loc}}); [\text{RB}^{\text{sc}}] \cup \text{eco}$ is cyclic. Our assumption on $G$ entails that $hb; [\text{F}^s]; hb \cup \text{eco}$ is cyclic. Since both $hb; [\text{F}^s]; hb$ and $\text{eco}$ are transitive and irreflexive, we obtain that $[\text{F}^s]; hb; \text{eco}$ is cyclic, which in turn implies that $[\text{F}^s]; hb; \text{eco}; hb; [\text{F}^s] \subseteq \text{psc}_R$ is cyclic.

Now, consider the case that $A_1$ is acyclic. Then, the union of the following two relations must be cyclic:

- $B_1 = [\text{F}^s]; hb'; (sb \cup \text{eco}); ([\text{RB}^{\text{sc}}]T; [\text{RB}^{\text{sc}}])^*; (sb \cup \text{eco}); hb'; [\text{F}^s]$
- $B_2 = [\text{F}^s]; hb'; (sb \cup \text{eco}); [\text{F}^s]$

Note that $B_2 \subseteq \text{psc}_R$. In addition, we have $[\text{F}^s]; hb'; (sb \cup \text{eco}) \subseteq [\text{F}^s]; hb; \text{eco}'$ and $(sb \cup \text{eco}); hb'; [\text{F}^s] \subseteq \text{eco}'; hb; [\text{F}^s]$. By item 1, it follows that $B_1 \subseteq \text{psc}_R$ as well, and so $\text{psc}_R$ is cyclic.

Remark E.1. The proof of Lemma E.1 can be easily adapted for the following alternative $\text{psc}$ relations:

$$\text{psc}_0 \triangleq ([\text{RB}^{\text{sc}}] \cup [\text{F}^s]; sb); (sb \cup \text{eco}); ([\text{RB}^{\text{sc}}] \cup sb; [\text{F}^s])$$
$$\text{psc}_2 \triangleq ([\text{RB}^{\text{sc}}] \cup [\text{F}^s]; sb); (sb \cup sb'; hb; sb' \cup \text{eco}); ([\text{RB}^{\text{sc}}] \cup sb; [\text{F}^s])$$
$$\text{psc}_3 \triangleq ([\text{RB}^{\text{sc}}] \cup [\text{F}^s]; hb); (sb \cup sb'; hb; sb' \cup \text{eco}); ([\text{RB}^{\text{sc}}] \cup hb; [\text{F}^s])$$

This ensures the correctness of the transformation for all alternative fixes presented in §2.

Lemma E.2. Let $G$ be an RC11-consistent execution without any SC accesses. Let $A \subseteq \text{RB}^{\text{sc}} \cup \text{RB}^{\text{rel}}$, such that $[A]; (sb' \cup sb; hb; sb'); [A] \subseteq hb; [\text{F}^s]; hb$, and $[A]; \text{rmw} = \text{rmw}; [A]$. Then, the execution $G'$ obtained from $G$ by changing all modes of events in $A$ to $\text{sc}$ is RC11-consistent.

Proof. The only constraint that is affected by such modification is SC. Now, in $G'$ we have $[G'; \text{RB}^{\text{sc}}]; (G'; sb' \cup G'; sb'; G'; \text{hb}; G'; \text{sb}); [G'; \text{RB}^{\text{sc}}] \subseteq G'; \text{hb}; [G'; \text{F}^s]; G'; \text{hb}$, and by Lemma E.1 it suffices to show that $G'.\text{psc}_R$ is acyclic. This follows from the fact that $G$ satisfies $\text{SC}$, since $G'.\text{psc}_R = G.\text{psc}_R$. 

F. Properties of the Power and ARMv7 Models

In this appendix we provide the full definition of preserved program order (ppo) used by Power and ARMv7, and prove various properties of these models that are needed in our compilation correctness proof.

F.1 Preserved Program Order

ppo is defined based on the four dependencies — data, addr, ctrl, ctrl_{isync} — that satisfy the following properties:

1. data ⊆ R × W.
2. addr ⊆ R × (R ∪ W).
3. ctrl_{isync} ⊆ ctrl ⊆ R × E.
4. ctrl; sb ⊆ ctrl.
5. ctrl_{isync}; sb ⊆ ctrl_{isync}.
6. rmw ⊆ data ∪ addr ∪ ctrl.
7. rmw; sb ⊆ ctrl.

1 – 5 hold by definition (see [3]). 6 – 7 hold due to the compilation scheme: it always places a dependency from the load to the store that form an RMW pair, and a branch after each (conditional) store in such pairs.

The relation deps includes all types of dependencies:

\[ \text{deps} \triangleq \text{data} \cup \text{addr} \cup \text{ctrl} \]

Herd’s definition of ppo is as follows:

\[
\begin{align*}
\text{rdw} & \triangleq (\text{rbe}; \text{rfe}) \cap \text{sb} \\
\text{ii}_0 & \triangleq \text{addr} \cup \text{data} \cup \text{rdw} \cup \text{rfi} \\
\text{ci}_0 & \triangleq \text{ctrl}_{isync} \cup \text{detour} \\
\text{ic}_0 & \triangleq \emptyset \\
\text{cc}^\text{Power}_0 & \triangleq \text{data} \cup \text{ctrl} \cup \text{addr}; \text{sb}^2 \cup \text{sb}|_{\text{loc}} \\
\text{cc}^\text{ARMv7}_0 & \triangleq \text{data} \cup \text{ctrl} \cup \text{addr}; \text{sb}^2
\end{align*}
\]

\[ \text{ppo} \triangleq [R]; \text{ii}; [R] \cup [R]; \text{ic}; [W] \]

where, ii, ic, ci, cc are inductively defined as follows:

\[
\begin{array}{cccc}
\text{ii}_0 & \text{ci} & \text{ic}; \text{ci} & \text{ii}; \text{ii} \\
\text{ii} & \text{ii} & \text{ii} & \text{ii} \\
\text{ic}_0 & \text{ii} & \text{cc} & \text{ic}; \text{cc} & \text{ii}; \text{ic} \\
\text{ic} & \text{ic} & \text{ic} & \text{ic} & \text{ic} \\
\text{ci}_0 & \text{ci}; \text{ii} & \text{cc}; \text{ci} \\
\text{ci} & \text{ci} & \text{ci} \\
\text{cc}_0 & \text{ci} & \text{ci}; \text{ic} & \text{cc}; \text{cc} \\
\text{cc} & \text{cc} & \text{cc} & \text{cc} \\
\end{array}
\]

Note that \( \text{ci} \subseteq \text{ii} \subseteq \text{ic} \), as well as \( \text{ci} \subseteq \text{cc} \subseteq \text{ic} \).

Alternatively the relations ii, ic, ci, cc can be defined as follows:

\[ xy \triangleq \bigcup_{n \geq 1} x^1 y^1_0; x^2 y^2_0; \ldots; x^n y^n_0 \]

where:

- \( x, y, x^1, \ldots, x^n, y^1, \ldots, y^n \in \{i, c\} \).
- If \( x = c \) then \( x^1 = c \).
- For every \( 1 \leq i \leq n - 1 \), if \( y^i = c \) then \( x^{i+1} = c \).
- If \( y = i \) then \( y^n = i \).
Note that the only difference between Power and ARMv7 is in the definition of $cc_0$. Henceforth, we only assume ARMv7’s definition, which is weaker, so our proofs apply for both Power and ARMv7.

Next, we prove some useful properties of $ppo$. In all propositions below we assume some Power-consistent execution.

**Proposition F.1.** $ppo$ is transitive.

**Proof.** Immediately follows from the definition. □

**Proposition F.2.** $[w]; psbloc \subseteq ii$.

**Proof.** Let $(a, b) \in [w]; psbloc$ and let $x = \text{loc}(a)$. Then, by definition, $a \in \mathbb{W}_x$, $b \in \mathbb{R}_x$, $(a, b) \in sb$, and there is no $c \in \mathbb{W}_x$ such that $(a, c), (c, b) \in sb$. Since $G$ is complete, there exists some $d \in \mathbb{W}_x$ such that $(d, b) \in rf$. If $d = a$, then we are done since $ri \subseteq ii$. Otherwise, since $G$ satisfies SC-PER-LOC, we have $(a, d) \in mo$, $(a, b) \notin sb$, and $(b, d) \notin sb$. It follows that $(a, d) \in moe$ and $(d, b) \in rf$. Thus, we have $(a, b) \in detour \subseteq ii$. □

**Proposition F.3.** $(deps \cup addr; sb); [w]; psbloc; ppo; [w] \subseteq ppo$.

**Proof.** Let $a, b, c, d \in E$ such that $(a, b) \in (deps \cup addr; sb); [w]$, $(b, c) \in psbloc$, and $(c, d) \in ppo; [w]$. If $(a, b) \in ctrl$, then by definition, we have $(a, d) \in ctrl$, and so $(a, d) \in ppo$. If $(a, b) \in addr; sb$, then by definition, we have $(a, d) \in cc$, and so $(a, d) \in ppo$. Otherwise, $(a, b) \in addr \cup data \subseteq ii$. By Prop. F.2, we also have $(b, c) \in ii$. Hence, $(a, c) \in ii$, and so $(a, c) \in ppo$. It follows that $(a, d) \in ppo$. □

**Proposition F.4.** $(deps \cup addr; sb); [R]; sb; [w] \subseteq ppo$.

**Proof.** Let $a, b, c \in E$ such that $(a, b) \in (deps \cup addr; sb); [R]$ and $(b, c) \in sb; [w]$. If $(a, b) \in ctrl$, then by definition, we have $(a, c) \in ctrl$, and so $(a, c) \in ppo$. Otherwise, $(a, b) \in addr; sb$. In this case, we have $(a, c) \in addr; sb$, and so $(a, c) \in ppo$. □

**Proposition F.5.** Let $R = deps \cup addr; sb \cup psbloc$. Then, $(deps \cup addr; sb); R^+; [w] \subseteq ppo$.

**Proof.** We prove by induction that for every $n \geq 0$, $(deps \cup addr; sb); R^n; [w] \subseteq ppo$. For $n = 0$, we have $(deps \cup addr; sb); [w] \subseteq ppo$ by definition. Let $n \geq 1$ and suppose that $( deps \cup addr; sb ); R^k; [w] \subseteq ppo$ for every $k < n$. Let $(a, b) \in (deps \cup addr; sb); R^n; [w]$. Let $c \in E$ such that $(a, c) \in (deps \cup addr; sb)$, and $(c, b) \in R^n$. If $c \in R$, then we are done using Prop. F.4. Otherwise, $c \in \mathbb{W}$, and $(c, b) \in psbloc; R^{n-1}$. Let $d$ be the sb-maximal event satisfying $(c, d) \in psbloc$ and $(d, b) \in R^{k}$ for some $k \leq n - 1$. The maximality of $d$ ensures that $(d, b) \in (deps \cup addr; sb); R^{k-1}$. By the induction hypothesis, we have $(d, b) \in ppo$. Hence, we have $(a, b) \in (deps \cup addr; sb); [w]; psbloc; ppo; [w]$, and the claim follows by Prop. F.3. □

**Proposition F.6.** Let $R = deps \cup addr; sb \cup psbloc$. Then, $rfe; R^+; [w] \subseteq rfe; ppo$.

**Proof.** Let $(a, c) \in rfe; R^+; [w]$. Let $b$ be the sb-maximal event satisfying $(a, b) \in rfe$ and $(b, c) \in R^+$. If $(b, c) \in (deps \cup addr; sb); R^+$, then we are done by Prop. F.5. Otherwise, let $d$ be the sb-maximal element such that $(b, d) \in psbloc$ and $(d, c) \in R^+$. Then, $d \in R$, and since $c \in \mathbb{W}$, we have $(d, c) \in R^+$. The maximality of $b$ and SC-PER-LOC ensure that $(b, d) \in rfe$, and so $(b, d) \in ppo$. The maximality of $d$ ensures that $(d, c) \in (deps \cup addr; sb); R^+$. By Prop. F.5, we have $(d, c) \in ppo$, and so $(a, c) \in rfe; ppo$. □

**Proposition F.7.** $ppo^?; rbi \subseteq ppo; mo; yo \cup mo \cup rbi$.

**Proof.** For any $n \geq 0$, let $ppo_n$ denote $ppo$-edges that are formed by at most $n$ basic $ppo$ edges $(i_0, i_0, c_0)$, and $c_0$. Then, $ppo^? = \bigcup_{n \geq 0} ppo_n$. The proof proceeds by induction on $n$. For $n = 0$, the claim obviously holds. Suppose now that it holds for $n - 1$, and let $(a, b) \in ppo_n$ and $(b, c) \in rbi$. Then, $b$ must be a read event, and so there exists $a'$ such that $(a, a') \in ppo_{n-1}$ and $(a', b) \in ii_0 \cup ci_0$. This leads to five cases:

- $(a', b) \in addr$. In this case we have $(a', c) \in cc_0$, and so $(a, c) \in ppo$.
- $(a', b) \in rfe$. In this case we have $(a', c) \in rbi$, and the claim follows by the induction hypothesis.

21 2016/11/16
\[ (a', b) \in \text{rfi.} \] In this case we have \( \langle a', c \rangle \in \text{mo,} \) and so \( \langle a, c \rangle \in \text{ppo}^2; \text{mo}. \)

\[ (a', b) \in \text{ctrl}_{\text{sync}.} \] In this case we have \( \langle a', c \rangle \in \text{ci}_0, \) and so \( \langle a, c \rangle \in \text{ppo}. \)

\[ (a', b) \in \text{detour.} \] In this case we have \( \langle a', c \rangle \in \text{mo}, \) and so \( \langle a, c \rangle \in \text{ppo}^2; \text{mo}. \)

\section*{F.2 Additional Properties}

**Proposition F.8.** \( \text{rmw} \cap (\text{rb}; \text{mo}) = \emptyset. \)

\textit{Proof.} \textsc{power-atomicity} condition ensures that \( \text{rmw} \cap (\text{rbe}; \text{mo}) = \emptyset. \) In addition, in every execution we have \( \text{rmw} \subseteq \text{sb, rbe; sb} \subseteq \text{sb, sb; mo}; \text{sb} \subseteq \text{sb, sb}; \text{sb} \subseteq \text{rmw}. \) It follows that \( \text{rmw} \cap (\text{rb}; \text{mo}) = \emptyset. \)

**Proposition F.9.** Let \( R \in \{ \text{sync, fence} \}. \) Then, \( R; \text{hp}^*; \text{rbi} \subseteq R; \text{hp}^*; \text{mo}^2. \)

\textit{Proof.} We prove by induction on \( n \) that for every \( n \geq 0, \) we have \( R; \text{hp}^n; \text{rbi} \subseteq R; \text{hp}^*; \text{mo}^2. \) For \( n = 0, \) the claim follows since \( R; \text{rbi} \subseteq R. \) Now, suppose it holds for \( n - 1, \) and let \( a, b, c, d \) such that \( \langle a, b \rangle \in R; \text{hp}^{n-1} \), \( \langle b, c \rangle \in \text{hp}_p, \) and \( \langle c, d \rangle \in \text{rbi}. \) If \( \langle b, c \rangle \in \text{rfi}, \) then we have \( \langle b, d \rangle \in \text{mo,} \) and so \( \langle a, d \rangle \in R; \text{hp}^*; \text{mo}. \) If \( \langle b, c \rangle \in \text{fence}, \) then we have \( \langle b, d \rangle \in \text{fence,} \) and so \( \langle a, d \rangle \in R; \text{hp}^* \). Otherwise, we have \( \langle b, c \rangle \in \text{ppo}, \) and the claim follows using \textit{Prop. F.7} and the induction hypothesis.

**Proposition F.10.** \text{fence is transitive.}

\textit{Proof.} Immediately follows from the definition of \text{fence.}

**Proposition F.11.** \( \text{fence}; \text{hp}^* \subseteq \text{sb} \cup \text{fence}; [W]; \text{hp}^*. \)

\textit{Proof.} Let \( a, b, c \in E \) such that \( \langle a, b \rangle \in \text{fence} \) and \( \langle b, c \rangle \in \text{hp}^*. \) If \( \langle b, c \rangle \in \text{sb}, \) then the claim follows since \( \text{fence} \subseteq \text{sb}. \) Suppose otherwise. Then, there exists \( \langle d, e \rangle \in \text{rfi} \) such that \( \langle b, d \rangle \in \text{hp}^* \cap \text{sb} \) and \( \langle e, c \rangle \in \text{hp}^*. \) It follows that \( \langle a, d \rangle \in \text{fence,} \) and so \( \langle a, c \rangle \in \text{fence}; [W]; \text{hp}^*. \)

**Proposition F.12.** \( [\text{rb}]; \text{sb}; \text{(fence: hp}^*)^2; \text{sync} \subseteq \text{(fence: hp}^*)^2; \text{sync}. \)

\textit{Proof.} Immediately follows from the definition of \text{sync} and \textit{Prop. F.11.}

**Proposition F.13.** \( \text{eco}^2; \text{(fence: hp}^*)^2; \text{sync}; \text{hp}^* \text{ is acyclic.} \)

\textit{Proof.} By definition, we have \( \text{eco}^2 = (\text{mo} \cup \text{rbe})^2; \text{rfi}^2 \cup \text{rbi}; \text{rfi}^2 \cup \text{rbi}; \text{rfi}. \) Thus, it suffices to show that the union of the following relations is acyclic:

\[ A = ((\text{mo} \cup \text{rbe})^2; \text{rfi}^2 \cup \text{rbi}; \text{rfi}^2; \text{sync}; \text{hp}^* \]

\[ B = \text{rbi}; \text{rfi}; (\text{fence}; \text{hp}^*)^2; \text{sync}; \text{hp}^* \]

By \textit{Prop. F.9}, \( A; B \subseteq A \) and \( B; B \subseteq B; A. \) Hence, it suffices to show that \( A \) is acyclic and \( B \) is irreflexive. Acyclicity of \( A \) follows from Power’s \textsc{propagation} condition, since we have \( A \subseteq \text{mo}^2; \text{prop}_2 \) (using \textit{Prop. F.12}). Irreflexivity of \( B \) also follows from \textsc{propagation}, using \textit{Prop. F.9}.

**Proposition F.14.** \( \text{Let } A = \{ a \in W \mid \exists b \in F. \langle b, a \rangle \in \text{sb}_{\text{imm}}; \text{rmw} \}. \) \( \text{Then, } (\text{sb}^2; [F]; \text{sb} \cup [A]; \text{mo}^2); \text{rfi}; \text{hp}^*; (\text{sb}; [F])^2 \text{ is a strict partial order.} \)

\textit{Proof.} Let \( R = (\text{sb}^2; [F]; \text{sb} \cup [A]; \text{mo}^2); \text{rfi}; \text{hp}^*; (\text{sb}; [F])^2. \) The fact that \( R \) is transitive follows from the following facts (obtained by expanding the relevant definitions):

\[ \text{sb}; [F]; (\text{sb}^2; [F]; \text{sb} \cup [A]; \text{mo}^2); \text{rfi} \subseteq \text{fence}; \text{rfi} \subseteq \text{hp}^+. \]

\[ \text{rfi}; \text{hp}^*; \text{sb}^2; [F]; \text{sb}; \text{rfi} \subseteq \text{rfi}; \text{hp}^+; \text{fence}; \text{rfi} \subseteq \text{rfi}; \text{hp}^+. \]

\[ \text{rfi}; \text{hp}^*; [A]; \text{mo}^2; \text{rfi} \subseteq \text{rfi}; \text{hp}^+; \text{rmw}; \text{sb} \cup \text{sb}; [F]; \text{sb}; \text{rfi} \subseteq \text{rfi}; \text{hp}^+; (\text{ppo} \cup \text{fence}); \text{rfi} \subseteq \text{rfi}; \text{hp}^+. \]

Now, to see that \( R \) is irreflexive, note that \( \langle a, a \rangle \in R \) implies (using these three properties) that \( \langle a, a \rangle \in \text{hp}^+ \) which contradicts \textsc{power-no-thin-air}.

**Proposition F.15.** \( \text{eco}; (\text{sb} \cup \text{fence}; \text{hp}^*) \text{ is irreflexive.} \)
Proof. $\text{eco; sb}$ is irreflexive using \text{SC-PER-LOC}. By Prop. F.11, it suffices to show that $\text{eco; fence; [\W]; hb}_p^*$ is irreflexive. Suppose otherwise, and let $a, b \in E$ such that $\langle a, b \rangle \in \text{eco}$ and $\langle b, a \rangle \in \text{fence; [\W]; hb}_p^*$. First, if $\langle a, b \rangle \in \text{sb}$, then we have $\langle a, a \rangle \in \text{fence; hb}_p^* \subseteq \text{hb}_p^+$, which contradicts \text{POWER-NO-THIN-AIR}. Suppose otherwise, and consider the possible cases:

- $\langle a, b \rangle \in \text{rfe}$. In this case we obtain $\langle a, a \rangle \in \text{hb}_p^+$, which contradicts \text{POWER-NO-THIN-AIR}.
- $\langle a, b \rangle \in \text{mo; rf}_f$. Let $c \in E$ such that $\langle a, c \rangle \in \text{mo}$ and $\langle c, b \rangle \in \text{rf}_f^\mathsf{2}$. Then, we have $\langle c, a \rangle \in \text{prop}_1$, and we obtain that $\text{mo; prop}_1$ is not irreflexive, which contradicts \text{PROPAGATION}.
- $\langle a, b \rangle \in \text{rbe}; \text{rf}_f^\mathsf{2}$. Let $c \in W$ such that $\langle a, c \rangle \in \text{rbe}$ and $\langle c, b \rangle \in \text{rf}_f^\mathsf{2}$. Let $d \in W$ such that $\langle b, d \rangle \in \text{fence}$ and $\langle d, a \rangle \in \text{hb}_p^*$. Then, we have $\langle c, d \rangle \in \text{prop}_1$, and obtain a violation of \text{OBSERVATION}.
- $\langle a, b \rangle \in \text{rbi; rfe}$. Let $c \in W$ such that $\langle a, c \rangle \in \text{rbi}$ and $\langle c, b \rangle \in \text{rfe}$. By Prop. F.9, we have $\langle b, c \rangle \in \text{fence; hb}_p^*; \text{mo}^\mathsf{?}$. Let $d \in E$ such that $\langle b, d \rangle \in \text{fence; hb}_p^*$ and $\langle d, c \rangle \in \text{mo}^\mathsf{?}$. Then, we have $\langle c, d \rangle \in \text{prop}_1$, and we obtain that $\text{mo}^\mathsf{?}; \text{prop}_1$ is not irreflexive, which contradicts \text{PROPAGATION}. □

F.3 Removing Redundant Fences

Lemma F.1. Let $G$ be a Power execution, and let $\langle a, b \rangle \in [\text{sync}; \text{sb}]_{\text{imm}}; [\text{f1sync}]$. Let $G'$ be the execution obtained from $G$ by removing $b$ ($G' = G_{(G, E, \{b\})}$. If $G'$ is Power-consistent, then so is $G$.

Proof. Since $b$’s immediate $\text{sb}$-predecessor is a full fence, we have $G'.\text{fence} = G.\text{fence}$. Then, it is easy to see that for every relation $c$ mentioned in Def. 5, we have $G'.c = G.c$, and so if $G'$ is Power-consistent, then so is $G$. □

Lemma F.2. Let $G$ be a Power execution, and let $\langle a, b \rangle \in [E; \{\text{sb}]_{\text{imm}} \cap \text{ctrl}_{\text{f1sync}}\}; [F]$. Let $G'$ be the execution obtained from $G$ by removing the $\text{ctrl}_{\text{f1sync}}$ dependency edges from a onwards ($G'.\text{ctrl}_{\text{f1sync}} = G.\text{ctrl}_{\text{f1sync}} \backslash \{(a) \times E\}$). If $G'$ is Power-consistent, then so is $G$.

Proof. Since $a$’s immediate $\text{sb}$-successor is a fence, we have $\langle a, c \rangle \in G.\text{fence}$ for every $c \in \text{RW}$ such that $\langle a, c \rangle \in \text{sb}$. Now, by omitting $\text{ctrl}_{\text{f1sync}}$ dependency edges from $a$ onwards, we may remove $\text{ppo}$-edges from $a$, but whenever $\text{ppo}$ is used to form an $\text{hb}_p$-edge, it can be replaced by a $\text{fence}$-edge. Consequently, for every relation $c$ mentioned in Def. 5, we have $G'.c = G.c$, and so if $G'$ is Power-consistent, then so is $G$. □

G. Power-before Relation

In this section, we define a relation that we call \text{Power-before (pb)}, and show that if $\text{pb}$ is acyclic in some execution $G$ of a program $P$, then either $G$ is RC11-consistent, or $P$ has undefined behavior under RC11. This relation is the key for showing that \text{NO-THIN-AIR} holds when proving compilation correctness. (Thus, if one is only interested in weakRC11-consistency, this section can be completely ignored.)

In what follows we assume an execution $G$.

\text{pb} is given by:

\[
\begin{align*}
\text{psbloc} & \triangleq \text{sb}|_{\text{loc}}; [R] \setminus \text{sb}|_{\text{loc}}; [W]; \text{sb} & \text{(preserved sb-loc)}
\text{pbi} & \triangleq \text{deps} \cup \text{addr}; \text{sb} \cup [E^{\text{rel}}]; \text{sb} \cup \text{psbloc} \cup \text{sb}; [E^{\text{rel}}] & \text{(internal Power-before)}
\text{pb} & \triangleq \text{pbi} \cup \text{rfe} & \text{(Power-before)}
\end{align*}
\]

Clearly, $\text{pb} \subseteq \text{sb} \cup \text{rf}$, and so $\text{pb}$ is acyclic in every RC11-consistent execution.

Proposition G.1. If $G$ is weakRC11-consistent, then $\text{rf} \subseteq \text{pb}$.

Proof. COHERENCE guarantees that $\text{rfi} \subseteq \text{psbloc} \subseteq \text{pbi}$, and by definition we have $\text{rfe} \subseteq \text{pb}$. □

Proposition G.2. For every weakRC11-consistent execution $G$, $\text{hb} \subseteq \text{sb} \cup \text{pb}^+$.

Proof. It suffices to show that $\text{sb}^?; \text{swe}; \text{sb}^? \subseteq \text{pb}^+$. By definition, we have $\text{sb}^?; \text{swe}; \text{sb}^? \subseteq \text{sb}^?; [E^{\text{rel}}]; \text{sb}^?; (\text{rf} \cup \text{rmw})^+; [E^{\text{rel}}]; \text{sb}^?$.  

23 2016/11/16
The claim follows because we have:

- $sb^2; E^{rel}_i; sb^2 \subseteq pb^*\rf$
- $\rf \subseteq pb$ and $rmw \subseteq deps \subseteq pb_i$.
- $[R^{rel}_i]; sb^2 \subseteq pb^2_i$.

\[\blacklozenge\]

**Proposition G.3.** If $pb$ is acyclic, but $sb \cup \rf$ is cyclic, then $(\rf; [\na] \setminus hb; sb) \neq \emptyset$.

**Proof.** A cycle in $sb \cup \rf$ implies a cycle in $\rf; sb$. Since $\rf; [R^{rel}_i]; sb$ and $(\rf; [\na] \setminus hb; sb)$ are contained in $pb^+$ (using Prop. G.2 for the latter), there must exist an edge $(a, b) \in \rf; sb$ that is neither in $\rf; [R^{rel}_i]; sb$ nor in $(\rf; [\na] \setminus hb); sb$. Then, we have $(a, b) \in (\rf; [\na] \setminus hb); sb$. \[\blacklozenge\]

**Lemma G.1.** Suppose that $G$ is a weakRC11-consistent execution of a program $P$, and that $pb$ is acyclic, but $G$ is not RC11-consistent. Then, $P$ has undefined behavior under $RC11$.

**Proof.** Since $G$ is weakRC11-consistent but not RC11-consistent, we have that $sb \cup \rf$ is cyclic. By Prop. G.3, $\rf; [\na] \not\subseteq hb$. We show that this implies that $P$ has undefined behavior under RC11.

Let $a_1, \ldots, a_n$ be an enumeration of $E$ that respects $pb$ (that is, $i < j$ whenever $(a_i, a_j) \in pb^+$). For every $1 \leq i \leq n$, let $E_i = \{a_1, \ldots, a_i\}$. Let $k$ be the minimal index such that $[E_k]; \rf; [\na] \not\subseteq hb$. Then, we have $(a_j, a_k) \in \rf; [\na] \setminus hb$ for some $j < k$. Let $B = dom(sb^2; [E_k])$ and $H = B \setminus E_k$.

Claim 1: $h \in [\na] \cup \wnr$ for every $h \in H$.

**Proof:** Otherwise, since $[\na] \cup \wnr \cup \wnf; sb \subseteq pb$, we would obtain $(h, a) \in pb$ for some $a \in E_k$. This contradicts the fact that $h \not\in E_k$. \[\blacklozenge\]

Claim 2: $(h, b) \not\in sb^2$ for every $h \in H$ and $b \in B \cap \wnr$.

**Proof:** Suppose otherwise. Let $a \in E_k$ such that $(b, a) \in sb^2$. It follows that $(h, a) \in sb^2; E^{rel}_i; sb^2$, and so $(h, a) \in pb^+$. Hence, $h \in E_k$ as well, which contradicts our assumption. \[\blacklozenge\]

Claim 3: $(h, b) \not\in deps^*; ctrl$ for every $h \in H$ and $b \in B$.

**Proof:** Suppose otherwise. Let $a \in E_k$ such that $(b, a) \in sb^2$. Since $ctrl; sb^2 \subseteq ctrl$, it follows that $(h, a) \in deps^+$, and so $(h, a) \in pb^+$. This contradicts the fact that $h \not\in E_k$. \[\blacklozenge\]

Claim 4: $(h, b) \not\in deps^*; addr$ for every $h \in H$ and $b \in B$.

**Proof:** Suppose otherwise. Let $a \in E_k$ such that $(b, a) \in sb^2$. Then, $(h, a) \in deps^*; addr; sb^2 \subseteq pb^+$. This contradicts the fact that $h \not\in E_k$. \[\blacklozenge\]

Let $h_1, \ldots, h_m$ be an enumeration of $H$ that respects $sb$, and let $H_i = \{h_1, \ldots, h_i\}$ for every $0 \leq i \leq m$.

Claim 5: For every $1 \leq i \leq m$, $h_i \not\in dom(deps^+; [E_k \cup H_{i-1}])$.

**Proof:** Suppose otherwise, and let $a \in E_k \cup H_{i-1}$ such that $(h_i, a) \in deps^+$. Then, $(h_i, a) \in pb^+$. If $a \in E_k$, then $h_i \in E_k$ as well, which contradicts our assumption. Hence, we have $a \in H_{i-1}$. This contradicts the fact that the $h_i$’s enumeration respects $sb$. \[\blacklozenge\]

Claim 6: Let $1 \leq i \leq m$, and let $x = \loc(h_i)$. Let $a \in (E_k \cup H_{i-1}) \cap R_x$ and suppose that $(h_i, a) \in sb$. Then, $(h_i, a) \in sb; ([E_k \cup H_{i-1}] \cap \wnr) \setminus sb$.

**Proof:** Suppose otherwise. Let $i \leq j \leq m$ be the maximal index satisfying $h_j \in E_x$, $(h_i, h_j) \in sb^+$ and $(h_j, a) \in sb$. Then, $(h_j, a) \in psbloc$, and so $(h_j, a) \in pb$. If $a \in E_k$, then $h_j \in E_k$ as well, which contradicts our assumption. Hence, we have $a \in H_{i-1}$. This contradicts the fact that the $h_i$’s enumeration respects $sb$. \[\blacklozenge\]

For every $1 \leq i \leq n$, let and $G_i = G_{E_i}$. Since $G_i.rf \subseteq G_i.pb$ (Prop. G.1), all the $G_i$’s are weakRC11-consistent. Additionally, $G_i.pb$ is acyclic for every $1 \leq i \leq n$.

We inductively construct a sequences of labeling functions $lab_0, \ldots, lab_m : B \rightarrow Label$ and executions $G'_0, \ldots, G'_m$ such that the following hold:

1. For every $0 \leq i \leq m$, $G'_i.E = E_k \cup H_i$. 

\[\blacklozenge\]
2. For every $0 \leq i \leq m$, $G'_i.\text{lab} = \text{lab}_i|_{G'_i.\text{E}}$.
3. For every $0 \leq i \leq m$, $G'_i$ is $\text{RC}_{\text{na}}$-consistent.
4. For every $0 \leq i \leq m$, $\langle a_j, a_k \rangle, \langle a_k, a_j \rangle \not\in G'_i.\text{hh}$.
5. For every $0 \leq i \leq m$, $\text{lab}_i(G'_i|B)$ is an execution of $P$.
6. For every $0 \leq i \leq m$, $G.\text{rmw}^{-1}; G'_i.\text{rf}^{-1}; G'_i.\text{rf}; G.\text{rmw} \subseteq [G.E]$.

Finally, we would obtain that $G'_m$ is a racy $\text{RC}_{\text{na}}$-consistent execution with $G'_m.\text{E} = B$, and $\text{lab}_m(G'_m|B) = \text{lab}_m(G|B)$ is an execution of $P$. Hence, $G'_m$ is an execution of $P$, and by Lemma D.1, $G'_m$ is RC11-consistent or $P$ has undefined behavior under RC11. Since $G'_m$ is racy, in any case we would obtain that $P$ has undefined behavior under RC11.

First, we define $\text{lab}_0$ and $G'_0$. The minimality of $k$ and Prop. G.3 ensure that $G_{k-1}$ is RC11-consistent. Hence, Lemma D.4 ensures that there exists some event $b \in E_k$ such that the execution $G'$ given by $G'.c = G_k.c$ for every $c \in \{\text{sb}, \text{rmw}, \text{data}, \text{addr}, \text{ctrl}, \text{mo}\}$, $G'.\text{lab} = G_k.\text{lab}[a_k \mapsto \text{RA}(G.\text{loc}(a_k), G.\text{val}_v(b))]$, and $G'.\text{rf} = G_k.\text{rf} \cup \{(b, a_k)\}$ is $\text{RC}_{\text{na}}$-consistent. Additionally, by Prop. C.4, $G'.\text{hb} = G_k.\text{hb}$, and so, we have $\langle a_j, a_k \rangle, \langle a_k, a_j \rangle \not\in G'.\text{hb}$.

Next, let $1 \leq i \leq m$, and suppose that $\text{lab}_{i-1}$ and $G'_{i-1}$ are defined. We construct $\text{lab}_i$ and $G'_i$. By Claim 1 above, we have $h_i \in G.R^{\text{pa}} \cup G.\text{W=}^{\text{riri}}$. Let $G'_i$ be the execution obtained from $G'_{i-1}$ by adding the event $h_i$, labeled with $\text{lab}_{i-1}(h_i)$, and the $\text{sb}, \text{rmw}$, and dependency edges from/to $h_i$ as in $G'_i|B$. By Claim 2 above, we also have $h_i \not\in \text{dom}(G'_i.\text{sb}; G'_i.\text{E}^{\text{ra}}))$. Let $x = G.\text{loc}(h_i)$, and consider the two cases:

$h_i \in G.R^{\text{pa}}$: Since $G'_{i-1}$ is $\text{RC}_{\text{na}}$-consistent, Lemma D.4 ensures that there exists some event $b \in E_k \cup H_i$ such that the execution $G'$ given by $G'.E = E_k \cup H_i$, $G'.\text{lab} = G'_{i-1}.\text{lab} \cup \{h_i \mapsto \text{RA}(x, G'_i.\text{val}_v(b))\}$, $G'.c = G'_i.c$ for every $c \in \{\text{sb}, \text{rmw}, \text{data}, \text{addr}, \text{ctrl}, \text{mo}\}$, and $G'.\text{rf} = G'_i.\text{rf} \cup \{(b, a_k)\}$ is $\text{RC}_{\text{na}}$-consistent. Additionally, by Prop. C.4, $G'.\text{hb} = G'_i.\text{hb}$, and so, we have $\langle a_j, a_k \rangle, \langle a_k, a_j \rangle \not\in G'.\text{hb}$.

$h_i \in G.\text{W=}^{\text{riri}}$: By Claim 6 above, we have that for every $b \in G'_i.\text{E}$, if $\langle h_i, b \rangle \in G'_i.\text{sb}; [G'_i.\text{R}_{x}]$ then $\langle a, b \rangle \in G'_i.\text{sb}; [G'_i.\text{W}_{x}]; G'_i.\text{sb}$. Thus, since $G'_{i-1}$ is $\text{RC}_{\text{na}}$-consistent, and $G.\text{rmw}^{-1}; G'_{i-1}.\text{rf}^{-1}; G'_{i-1}.\text{rf}; G.\text{rmw} \subseteq [G.E]$, Lemmas D.2 and D.3 ensure that there exists $T \subseteq G'_i.\text{W}_{x} \times G'_i.\text{W}_{x}$ such that the execution $G'$ given by $G'.E = E_k \cup H_i$, $G'.\text{lab} = \text{lab}_{i-1}|_{G'.E}$, $G'.c = G'_i.c$ for every $c \in \{\text{sb}, \text{rmw}, \text{data}, \text{addr}, \text{ctrl}\}$, $G'.\text{rf} = G'_{i-1}.\text{rf}$, and $G'_i.\text{mo} = G'_{i-1}.\text{mo} \cup T$ is $\text{RC}_{\text{na}}$-consistent. We take $\text{lab}_i = \text{lab}_{i-1}$ and $G'_i = G'$. It is straightforward to see that $\text{lab}_{i-1}$ and $G'$ satisfy the required conditions. In particular, $G'_i.\text{lab} = \text{lab}_{i-1}|_{E_k \cup H_{i-1}}$, and $G'_i.\text{data} \subseteq G'_i.\text{data}, G'_i.\text{addr} \subseteq G'_i.\text{addr}$, and $G'_i.\text{ctrl} \subseteq G'_i.\text{ctrl}$.

H. Proof of Compilation Correctness

Lemma H.1. Let $G$ be an execution without SC accesses. Let $G_p$ be a Power-execution. Suppose that the following hold:

- $G.data \subseteq G_p.data, G.addr \subseteq G_p.addr$, and $G.ctrl \subseteq G_p.ctrl$.
- $G.rmw; G.sb \subseteq G_p.ctrl$.
- $G.F^{\text{sc}} \subseteq G_p.F^{\text{risci}}$ and $G.F^{\text{sc}} = G_p.F^{\text{risci}}$.
- $G.W^{\text{rel}} \subseteq A$ where $A = \{a \in G_p.W \mid \exists b \in G_p.F. \langle b, a \rangle \in G_p.sb|_{\text{imm}}; G_p.rmw\}$. 

25 2016/11/16
Then:

- If \( G \) and \( G_p \) have the same outcome.
- If \( G_p \) is Power-consistent, then \( G \) is weakRC11-consistent and \( G.p_b \) is acyclic.

**Proof.** The first claim easily follows from our definitions. Suppose that \( G_p \) is Power-consistent. Before proving the second claim, we present some properties relating \( G \) and \( G_p \).

1. \( G.swe; G.sb\subseteq (G.p.F); G.p.sb \cup [A]; G.p.mo^2); G.p.rfe; G.p.hb^*; (G.p.sb; [G.p.F])^2 \) (follows from the definition of \( swe \))

2. \( G.hb\subseteq G.p.sb \cup [G.p.F]; G.p.sb \cup ([A] \cup G.p.rmw); G.p.mo^2); G.p.rfe; G.p.hb^*; (G.p.sb; [G.p.F])^2 \) (follows from Item 1 using Prop. F.14; note that \( G.p.sb; [A] \subseteq G.p.sb^2; [F]; G.p.sb \cup G.p.rmw \))

3. \( [G.RW]; (G.ssb \setminus G.rmw); G.hb\subseteq G.p.sb \cup G.p.fence; G.p.hb^*; (G.p.sb; [G.p.F])^2 \) (again follows from Item 1 using Prop. F.14)

4. \( [G.F^b]; G.hb; [G.RW] \subseteq [G.p.F^b]; G.p.sb; G.p.hb^*; [G.p.RW] \) (easily follows from Item 2)

In addition, in order to apply Prop. C.2 in the proof below, we note that:

- \( [G.W]; G.sbb_{loc}; [G.W] \subseteq G.mo: \) Indeed, we have \( [G.W]; G.sbb_{loc}; [G.W] = [G.W]; G.p.sb_{loc}; [G.W] \) and \( G.mo = G.p.mo \), and the claim follows by Power’s SC-PER-LOC condition.

- \( G.rmw \subseteq G.rb: \) Indeed, we have \( G.rmw = G.p.rmw \) and \( G.rb = G.p.rb \), and the claim follows by Power’s SC-PER-LOC and the fact that \( G \) is complete.

Next, we show that \( G \) is weakRC11-consistent. Clearly, it is complete (since \( G.R = G.p.R \) and \( G.rf = G.p.rf \)).

**COHERENCE.** We show that \( G.eco; G.hb \) is irreflexive. The irreflexivity of \( G.hb \) follows from Prop. F.14. Now, applying Prop. C.2, it suffices to show that \( G.eco \cup G.eco; (G.sbb \setminus G.rmw); G.hb \) is irreflexive. First, \( G.eco = G.p.eco \) is irreflexive because of SC-PER-LOC. Second, by property 3 above, we have \( G.eco; (G.sbb \setminus G.rmw); G.hb^2; [G.RW] \subseteq G.p.eco; (G.p.sb \cup G.p.fence; G.p.hb^*) \). By Prop. F.15, \( G.p.eco; (G.p.sb \cup G.p.fence; G.p.hb^*) \) is irreflexive.

**ATOMICITY.** By Prop. F.8, we have \( G.p.rmw \cap (G.p.rb; G.p.mo) = \emptyset \). Then, \( G.rmw \cap (G.rb; G.mo) = \emptyset \) immediately follows since \( G.rmw = G.p.rmw, G.rb = G.p.rb, \) and \( G.mo = G.p.mo \).

**SC.** We show that \( G.psc \) is acyclic. Assuming no SC accesses, we have \( G.psc = R_1 \sqcup R_2 \) where \( R_1 = [G.F^p]; G.hb; G.eco; G.hb; [G.F^p] \) and \( R_2 = [G.F^b]; G.hb; [G.F^b] \). Since \( R_2 \) is irreflexive and \( R_2^2; R_1 \subseteq R_1 \), it suffices to prove the acyclicity of \( R_1 \). To this end, we show that \( G.eco; G.hb; [G.F^b]; G.hb; [G.RW] \) is acyclic. Applying Prop. C.2, it suffices to show that \( G.eco; (G.sbb \setminus G.rmw); G.hb^2; [G.F^p]; G.hb; [G.RW] \) is acyclic. Using properties 3-4 above (and applying several simple simplifications), it suffices to show that the following relation is acyclic:

\[
\]

Using the definition of \( sync \), this relation is equal to:

\[
G.p.eco; (G.p.fence; G.p.hb^*)^2; G.p.sync; G.p.hb^*; [G.p.RW].
\]

Its acyclicity then follows by Prop. F.13.

Next, we show that \( G.p_b \) is acyclic. Suppose otherwise. Then, there are \( a_1, \ldots, a_n \) such that \( \langle a_i, a_{i+1} \rangle \in G.rfe; G.pbi^+ \) for every \( 1 \leq i \leq n \) (where \( a_{n+1} = a_1 \)). We show that \( \langle a_i, a_{i+1} \rangle \in G.p.hb^+ \) for every \( 1 \leq i \leq n \) (which contradicts POWER-NO-THIN-AIR). Let \( 1 \leq i \leq n \), and let \( b \in E \) such that \( \langle a_i, b \rangle \in G.rfe = G.p.rfe \) and \( \langle b, a_{i+1} \rangle \in G.pbi^+ \). If \( \langle b, a_{i+1} \rangle \in G.p.fence, \) then we are done since \( G.p.rfe, G.p.fence \subseteq G.p.hb^+ \). Otherwise, it follows that \( \langle b, a_{i+1} \rangle \in (G.p.deps \cup G.p.addr; G.p.sb \cup G.p.psblc)^+ \). By Prop. F.6, we have \( \langle a_i, a_{i+1} \rangle \in G.p.rfe; G.p.ppo \subseteq G.hb^+ \).
Lemma H.2. Given a program $P$ without SC accesses, every outcome of $\langle P \rangle$ under Power is an outcome of $P$ under RC11.

Proof. Given a full Power-consistent Power execution $G_p$ of $\langle P \rangle$, the compilation scheme (see Fig. 7) ensures that there exists some full execution $G$ of $P$ for which the properties of Lemma H.1 hold. Here we assumed that all RMW write attempts ($\text{stwcx}$) succeed in the first attempt. Indeed, otherwise, one could always remove the RMW reads ($\text{lwx}$) that precede the failed $\text{stwcx}$ attempts while preserving Power-consistency as well as the outcome of the execution. Now, Lemma H.1 ensures that $G$ has the same outcome as $G_p$. $G$ is weakRC11-consistent, and $G$.pb is acyclic. By Lemma G.1, either $G$.sb $\cup$ $G$.rf is acyclic (and no-THIN-AIR holds) or $P$ has undefined behavior under RC11. In any case, we obtain that the outcome of $G_p$ is an outcome of $P$ under RC11.

I. Proofs for §7 (Correctness of Program Transformations)

In this appendix, we state (and outline the proofs of) the properties that ensure the soundness of the transformations discussed in §7. For this purpose, it is technically convenient to employ a different presentation of RMWs, that treat them as single events (like in C11). To this end, we consider RMW-executions, defined as the executions in §3, with the following exceptions:

- Labels in RMW-executions may also be $\text{RMW}(x, v, w)$ where $o \in \{r1x, acq, rel, acqrel, sc\}$. Both sets $G$.r and $G$.w include all events $a$ with typ$(a) = \text{RMW}$, while $G$.RMW denotes the set of all events $a$ with typ$(a) = \text{RMW}$.
- RMW-executions do not include an rmw component.

RC11-consistency for RMW-executions is also defined as for executions, with the following exceptions:

- $G$.rb $\triangleq$ rf$^{-1}$:mo $\setminus$ E.
- Instead of ATOMICITY we now require:

$$\text{rf} \cap (\text{mo; mo}) = \emptyset.$$  

(ATOMICTY-RMW)

The rest of the notions are defined for RMW-executions exactly as for executions above.

There exists a trivial one-to-one correspondence, denoted by $\sim$, between executions according to §3 and RMW-executions (the latter are obtained by collapsing rmw-edges to single RMW events).

Proposition I.1. Suppose that $G \sim G^{\text{RMW}}$ for some execution $G$ and RMW-execution $G^{\text{RMW}}$. Then:

- $G$ is RC11-consistent iff $G^{\text{RMW}}$ is RC11-consistent.
- $G$ is racy iff $G^{\text{RMW}}$ is racy.

Using this correspondence, we may define and prove the correctness of transformations on RMW-executions.

Lemma I.1 (Strengthening). Let $G_{tgt}$ be an RMW-execution, obtained from an RMW-execution $G_{src}$ by strengthening some access/fence modes ($G_{src}.\text{mod}(a) \subseteq G_{tgt}.\text{mod}(a)$ for every $a \in G_{src}.E$). Then:

- If $G_{tgt}$ is RC11-consistent, then so is $G_{src}$.  
- If $G_{tgt}$ is racy, then so is $G_{src}$.

Proof. Easily follows from our definitions, because both properties are monotone with respect to the mode ordering.

Lemma I.2 (Sequentialization). Let $G_{tgt}$ be an RMW-execution, and let $\langle a, b \rangle \in \text{sb} \setminus \text{sb}; \text{sb}$. Let $G_{src}$ be the RMW-execution obtained from $G$ by removing the sb-edge $\langle a, b \rangle$. Then:

- If $G_{tgt}$ is RC11-consistent, then so is $G_{src}$.  
- If $G_{tgt}$ is racy, then so is $G_{src}$.

Proof. Easily follows from our definitions, because both properties are monotone with respect to sb.

Next, to state the soundness of deordering transformations, we use the following definition of adjacency.
Definition I.1. Let $R$ be a strict partial order on a set $A$. A pair $(a, b) \in A \times A$ is called $R$-adjacent if the following hold for every $c \in A$:

- If $(c, a) \in R$ then $(c, b) \in R$.
- If $(b, c) \in R$ then $(a, c) \in R$.

Lemma I.3 (Non-load-store deordering). Let $G_{\text{tgt}}$ be an RMW-execution, and let $a, b \in G_{\text{tgt}}.E$ such that $(a, b)$ is $G_{\text{tgt}}.sb$-adjacent. Let $G_{\text{src}}$ be the RMW-execution obtained from $G_{\text{src}}$ by adding an $sb$-edge $(a, b)$. Suppose that the labels of $a$ and $b$ form a deorderable pair according to Table 1, except for the load-store deorderable pairs $(R; W, R; \text{RMW}, \text{RMW}; W)$. Then:

- If $G_{\text{tgt}}$ is RC11-consistent, then so is $G_{\text{src}}$.
- If $G_{\text{tgt}}$ is racy, then so is $G_{\text{src}}$.

Proof. It is straightforward to verify that all components and derived relations in $G_{\text{src}}$ are identical to those of $G_{\text{tgt}}$ except for: $G_{\text{src}}.sb = G_{\text{tgt}}.sb \cup \{(a, b)\}$ and $G_{\text{src}}.hb = G_{\text{tgt}}.hb \cup \{(a, b)\}$. Then, the fact that $G_{\text{src}}$ is RC11-consistent, easily follows from the fact that $G_{\text{tgt}}$ is RC11-consistent. In particular, since $a, b$ is not a load-store deorderable pair, assuming that $G_{\text{tgt}}$ satisfies NO-THIN-AIR, we cannot have $(b, a) \in (G_{\text{src}}.sb \cup G_{\text{src}}.rf)^{\perp}$, so the additional $sb$-edge $(a, b)$ cannot close an $sb \cup rf$ cycle. Finally, since $G_{\text{src}}.race = G_{\text{tgt}}.race$, we have that $G_{\text{src}}$ is racy if $G_{\text{tgt}}$ is racy. □

Lemma I.4 (Load-store deordering). Let $G_{\text{tgt}}$ be an RMW-execution, and let $a, b \in G_{\text{tgt}}.E$ such that $(a, b)$ is $G_{\text{tgt}}.sb$-adjacent. Let $G_{\text{src}}$ be the RMW-execution obtained from $G_{\text{src}}$ by adding an $sb$-edge $(a, b)$. Suppose that the labels of $a$ and $b$ form a load-store deorderable pair $(R; W, R; \text{RMW}, \text{RMW}; W)$ according to Table 1. Then:

- If $G_{\text{tgt}}$ is RC11-consistent, then $G_{\text{src}}$ is weakRC11-consistent and $G_{\text{src}}.pb$ is acyclic.
- If $G_{\text{tgt}}$ is racy, then so is $G_{\text{src}}$.

Proof. The proof is similar to the proof of Lemma I.3. The fact that $G_{\text{src}}$ is weakRC11-consistent follows from the fact that $G_{\text{tgt}}$ is RC11-consistent. In addition, since $G_{\text{src}}.pb = G_{\text{tgt}}.pb \subseteq G_{\text{src}}.sb \cup G_{\text{src}}.rf$, assuming that $G_{\text{tgt}}$ satisfies NO-THIN-AIR, we have that $G_{\text{src}}.pb$ is acyclic. □

Using Lemma G.1, one obtains the soundness of load-store deordering according to Table 1.

Notation I.1. For a binary relation $R$ on a set $A$ and an element $a \in A$, we denote by $R_{a}^\uparrow$ the set $\{b \in A \mid \langle b, a \rangle \in R\}$, and by $R_{a}^\downarrow$ the set $\{b \in A \mid \langle a, b \rangle \in R\}$.

Lemma I.5 (Read-read merging). Let $G_{\text{tgt}}$ be an RC11-consistent RMW-execution. Let $a \in R \setminus \text{RMW}$, and let $a' \in E$ such that $\langle a', a \rangle \in rf$. Let $b \notin E$, and let $G_{\text{src}}$ be the RMW-execution satisfying:

- $G_{\text{src}}.E = G_{\text{tgt}}.E \cup \{b\}$.
- $G_{\text{src}}.lab = G_{\text{tgt}}.lab \cup \{a \mapsto G_{\text{tgt}}.lab(a)\}$.
- $G_{\text{src}}.sb = G_{\text{tgt}}.sb \cup \{(a, b)\} \cup (G_{\text{tgt}}.sb_a^\uparrow \times \{b\}) \cup (\{b\} \times G_{\text{tgt}}.sb_a^\downarrow)$.
- $G_{\text{src}}.rf = G_{\text{tgt}}.rf \cup \{(a', b)\}$.
- $G_{\text{src}}.mo = G_{\text{tgt}}.mo$.

Then, $G_{\text{src}}$ is RC11-consistent, and it is racy if $G_{\text{tgt}}$ is racy.

Proof. By definition, $G_{\text{src}}$ is complete, and ATOMICITY-RMW holds (since $G_{\text{src}}.mo = G_{\text{tgt}}.mo$ and $b \notin G_{\text{src}}.R \setminus G_{\text{src}}.\text{RMW}$). It is also easy to see that we have:

- $G_{\text{src}}.eco = G_{\text{tgt}}.eco \cup (G_{\text{tgt}}.eco_{a}^\uparrow \times \{b\}) \cup (\{b\} \times G_{\text{tgt}}.eco_{a}^\downarrow)$.
- $G_{\text{src}}.hb = G_{\text{tgt}}.hb \cup \{(a, b)\} \cup (G_{\text{tgt}}.hb_{a}^\uparrow \times \{b\}) \cup (\{b\} \times G_{\text{tgt}}.hb_{a}^\downarrow)$.

Hence, $G_{\text{src}}$ satisfies COHERENCE. To see that NO-THIN-AIR holds, note that if we had $\langle b, a \rangle \in (G_{\text{src}}.sb \cup G_{\text{src}}.rf)^{\perp}$, then we would have $\langle a, a \rangle \in (G_{\text{tgt}}.sb \cup G_{\text{tgt}}.rf)^{\perp}$; and similarly, if we had $\langle b, a' \rangle \in (G_{\text{src}}.sb \cup G_{\text{src}}.rf)^{\perp}$, then we would have $\langle a, a' \rangle \in (G_{\text{tgt}}.sb \cup G_{\text{tgt}}.rf)^{\perp}$. It remains to show that $G_{\text{src}}.psc$ is acyclic. If $G_{\text{tgt}}.mod(a) \neq sc$, then we have $G_{\text{src}}.psc = G_{\text{tgt}}.psc$, and the claim follows since $G_{\text{tgt}}$ satisfies SC. Otherwise, we have:

- $G_{\text{src}}.psc = G_{\text{tgt}}.psc \cup \{(a, b)\} \cup (G_{\text{tgt}}.psc_{a}^\uparrow \times \{b\}) \cup (\{b\} \times G_{\text{tgt}}.psc_{a}^\downarrow)$. 
This implies that a $G_{\text{src}.\text{psc}} \cup G_{\text{src}.\text{psc}}$ cycle would imply a $G_{\text{tgt}.\text{psc}} \cup G_{\text{tgt}.\text{psc}}$ cycle. Finally, if $(c, b) \in G_{\text{src}.\text{race}}$, then we have $(c, a) \in G_{\text{tgt}.\text{race}}$. 

\begin{lemma}[Write-write merging] Let $G_{\text{tgt}}$ be an RC11-consistent RMW-execution. Let $b \in W \setminus \text{RMW}$, $a \not\in E$, and $v \in \text{Val}$. Let $G_{\text{src}}$ be the RMW-execution satisfying:

\begin{itemize}
  \item $G_{\text{src}.\text{E}} = G_{\text{tgt}.\text{E}} \cup \{a\}$.
  \item $G_{\text{src}.\text{lab}} = G_{\text{tgt}.\text{lab}} \cup \{a \mapsto \text{RMW}(b, v)\}$.
  \item $G_{\text{src}.\text{sb}} = G_{\text{tgt}.\text{sb}} \cup \{b \mapsto (a) \times G_{\text{tgt}.\text{sb}}\}$.
  \item $G_{\text{src}.\text{rf}} = G_{\text{tgt}.\text{rf}}$.
  \item $G_{\text{src}.\text{mo}} = G_{\text{tgt}.\text{mo}} \cup \{(a, b) \mapsto (a) \times G_{\text{tgt}.\text{mo}}\}$.
\end{itemize}

Then, $G_{\text{src}}$ is RC11-consistent, and it is racy if $G_{\text{tgt}}$ is racy.

\begin{proof}
By definition, $G_{\text{src}}$ is complete. To see that ATOMICITY-RMW holds, note that we have $G_{\text{src}.\text{mo}}; G_{\text{src}.\text{mo}}; \text{RMW} \subseteq G_{\text{tgt}.\text{mo}}; G_{\text{tgt}.\text{mo}} \cup \{(a) \times G_{\text{src}.\text{E}}\}$, and that $a$ has no outgoing $\text{rf}$-edges. It is also easy to see that we have:

\begin{itemize}
  \item $G_{\text{src}.\text{eco}} = G_{\text{tgt}.\text{eco}} \cup \{\}(a, b) \times (a) \cup \{b \times G_{\text{tgt}.\text{eco}}\}$.
  \item $G_{\text{src}.\text{hb}} = G_{\text{tgt}.\text{hb}} \cup \{a \mapsto (a, b) \times (a) \times G_{\text{tgt}.\text{hb}}\}$.
\end{itemize}

Hence, $G_{\text{src}}$ satisfies COHERENCE. To see that NO-THIN-AIR holds, note that if we had $(b, a) \in (G_{\text{src}.\text{sb}} \cup G_{\text{src}.\text{rf}})$, then we would have $(b, b) \in (G_{\text{tgt}.\text{sb}} \cup G_{\text{tgt}.\text{rf}})$. It remains to show that $G_{\text{src}.\text{psc}}$ is acyclic. If $G_{\text{tgt}.\text{mod}}(a) \neq \text{sc}$, then we have $G_{\text{src}.\text{psc}} = G_{\text{tgt}.\text{psc}}$, and the claim follows since $G_{\text{tgt}}$ satisfies SC. Otherwise, we have:

\begin{itemize}
  \item $G_{\text{src}.\text{psc}} = G_{\text{tgt}.\text{psc}} \cup \{(a, b) \times (a) \times G_{\text{tgt}.\text{psc}}\}$.
\end{itemize}

This implies that a $G_{\text{src}.\text{psc}} \cup G_{\text{src}.\text{psc}}$ cycle would imply a $G_{\text{tgt}.\text{psc}} \cup G_{\text{tgt}.\text{psc}}$ cycle. Finally, if $(c, b) \in G_{\text{src}.\text{race}}$, then we have $(c, a) \in G_{\text{tgt}.\text{race}}$. 
\end{proof}

\begin{lemma}[Write/RMW-read merging] Let $G_{\text{tgt}}$ be an RC11-consistent RMW-execution. Let $a \in W$ and $b \not\in E$. Let $o \in \text{Ord}$, such that:

\begin{itemize}
  \item If $\text{typ}(a) = \text{w}$ and $o = \text{sc}$, then $\text{mod}(a) = \text{sc}$.
  \item If $\text{typ}(a) = \text{RMW}$, then $o \subseteq \text{mod}(a)$.
\end{itemize}

Let $G_{\text{src}}$ be the RMW-execution satisfying:

\begin{itemize}
  \item $G_{\text{src}.\text{E}} = G_{\text{tgt}.\text{E}} \cup \{b\}$.
  \item $G_{\text{src}.\text{lab}} = G_{\text{tgt}.\text{lab}} \cup \{b \mapsto \text{RMW}(a, v)\}$.
  \item $G_{\text{src}.\text{sb}} = G_{\text{tgt}.\text{sb}} \cup \{a \mapsto (a) \times (\text{RMW}(a, v))\}$.
  \item $G_{\text{src}.\text{rf}} = G_{\text{tgt}.\text{rf}} \cup \{(a, b)\}$.
  \item $G_{\text{src}.\text{mo}} = G_{\text{tgt}.\text{mo}}$.
\end{itemize}

Then, $G_{\text{src}}$ is RC11-consistent, and it is racy if $G_{\text{tgt}}$ is racy.

\begin{proof}
Similar to the proof of Lemma I.5.
\end{proof}

\begin{lemma}[Write-RMW merging] Let $G_{\text{tgt}}$ be an RC11-consistent RMW-execution. Let $b \in W \setminus \text{RMW}$, $a \not\in E$, $v \in \text{Val}$, and $o \in \text{Ord}$ such that $o_v = \text{mod}(b)$. Let $G_{\text{src}}$ be the RMW-execution satisfying:

\begin{itemize}
  \item $G_{\text{src}.\text{E}} = G_{\text{tgt}.\text{E}} \cup \{a\}$.
  \item $G_{\text{src}.\text{lab}} = G_{\text{tgt}.\text{lab}} \cup \{b \mapsto \text{RMW}(a, v)\}$.
  \item $G_{\text{src}.\text{sb}} = G_{\text{tgt}.\text{sb}} \cup \{a \mapsto (a) \times G_{\text{tgt}.\text{sb}}\}$.
  \item $G_{\text{src}.\text{rf}} = G_{\text{tgt}.\text{rf}} \cup \{(a, b)\}$.
  \item $G_{\text{src}.\text{mo}} = G_{\text{tgt}.\text{mo}} \cup \{(a, b) \mapsto (a) \times G_{\text{tgt}.\text{mo}}\}$.
\end{itemize}

Then, $G_{\text{src}}$ is RC11-consistent, and it is racy if $G_{\text{tgt}}$ is racy.

\begin{proof}
By definition, $G_{\text{src}}$ is complete. To see that ATOMICITY-RMW holds, note that we have $G_{\text{src}.\text{mo}}; G_{\text{src}.\text{mo}}; \text{RMW} \subseteq G_{\text{tgt}.\text{mo}}; G_{\text{tgt}.\text{mo}} \cup \{(a) \times G_{\text{src}.\text{E}}\}$, and that $a$ has only an $\text{rf}$-edge to its immediate $G_{\text{src}.\text{mo}}$-successor $b$. The rest of the properties are proved as in the proof of Lemma I.6. 
\end{proof}
Lemma I.9 (RMW-RMW merging). Let $G_{tgt}$ be an RC11-consistent RMW-execution. Let $a \in E$ with $\text{lab}(a) = \text{RMW}^o(x, v_r, v_w)$. Let $b \notin E$ and $v \in \text{Val}$, and let $G_{src}$ be the RMW-execution satisfying:

- $G_{src}.E = G_{tgt}.E \uplus \{b\}$.
- $G_{src}.\text{lab} = G_{tgt}.\text{lab}[a \mapsto \text{RMW}^o(x, v_r, v)] \cup \{b \mapsto \text{RMW}^o(x, v, v_w)\}$.
- $G_{src}.\text{sb} = G_{tgt}.\text{sb} \cup \{(a, b)\} \cup (G_{tgt}.\text{sb}_a^+ \times \{b\}) \cup (\{b\} \times G_{tgt}.\text{sb}_b^+)$. 
- $G_{src}.\text{rf} = G_{tgt}.\text{rf} \cup \{(a, b)\}$.
- $G_{src}.\text{mo} = G_{tgt}.\text{mo} \cup \{(a, b)\} \cup (G_{tgt}.\text{mo}_a^+ \times \{b\}) \cup (\{b\} \times G_{tgt}.\text{mo}_b^+)$. 

Then, $G_{src}$ is RC11-consistent, and it is racy if $G_{tgt}$ is racy.

Lemma I.10 (Fence-fence merging). Let $G_{tgt}$ be an RC11-consistent RMW-execution. Let $a \in F$, $b \notin E$, and let $G_{src}$ be the RMW-execution satisfying:

- $G_{src}.E = G_{tgt}.E \uplus \{b\}$.
- $G_{src}.\text{lab} = G_{tgt}.\text{lab} \cup \{b \mapsto G_{tgt}.\text{lab}(a)\}$.
- $G_{src}.\text{sb} = G_{tgt}.\text{sb} \cup \{(a, b)\} \cup (G_{tgt}.\text{sb}_a^+ \times \{b\}) \cup (\{b\} \times G_{tgt}.\text{sb}_b^+)$. 
- $G_{src}.\text{rf} = G_{tgt}.\text{rf}$.
- $G_{src}.\text{mo} = G_{tgt}.\text{mo}$.

Then, $G_{src}$ is RC11-consistent, and it is racy if $G_{tgt}$ is racy.

Proof. By definition, $G_{src}$ is complete, and ATOMICITY-RMW holds since $G_{src}.\text{rf} = G_{tgt}.\text{rf}$ and $G_{src}.\text{mo} = G_{tgt}.\text{mo}$. It is also easy to see that we have $G_{src}.\text{eco} = G_{tgt}.\text{eco}$ and:

- $G_{src}.\text{hb} = G_{tgt}.\text{hb} \cup \{(a, b)\} \cup (G_{tgt}.\text{hb}_a^+ \times \{b\}) \cup (\{b\} \times G_{tgt}.\text{hb}_b^+)$. 

Hence, $G_{src}$ satisfies COHERENCE. To see that NO-THIN-AIR holds, note that if we had $(a, b) \in G_{src}.\text{sb} \cup G_{src}.\text{rf}$, then we would have $(a, a) \in G_{tgt}.\text{sb} \cup G_{tgt}.\text{rf}$. It remains to show that $G_{src}.\text{psc}$ is acyclic. If $G_{tgt}.\text{mod}(a) \neq \text{sc}$, then we have $G_{src}.\text{psc} = G_{tgt}.\text{psc}$, and the claim follows since $G_{tgt}$ satisfies SC. Otherwise, we have:

- $G_{src}.\text{psc} = G_{tgt}.\text{psc} \cup \{(a, b)\} \cup (G_{tgt}.\text{psc}_a^+ \times \{b\}) \cup (\{b\} \times G_{tgt}.\text{psc}_b^+)$. 

This implies that a $G_{src}.\text{psc} \cup G_{src}.\text{psc}$ cycle would imply a $G_{tgt}.\text{psc} \cup G_{tgt}.\text{psc}$ cycle. Finally, $G_{src}.\text{race} = G_{tgt}.\text{race}$, so $G_{src}$ is racy if $G_{tgt}$ is racy. 

Soundness of register promotion is proved in two steps. First, we show that if all accesses to some location are in one thread, then they can be safely weakened to non-atomic accesses. Second, we show that these non-atomic accesses can be safely removed (replaced by register assignments at the program level).

Lemma I.11 (Register promotion-a). Let $G_{tgt}$ be an RC11-consistent RMW-execution. Suppose that all accesses to some location $x$ are related by $G_{tgt}.\text{sb}$. Let $G_{src}$ be the RMW-execution obtained by strengthening the accesses mode of all accesses to $x$ to $\text{sc}$. Then, $G_{src}$ is RC11-consistent, and it is racy if $G_{tgt}$ is racy.

Proof. By definition, we have $G_{src}.c = G_{tgt}.c$ for $c \in \{\text{sb}, \text{rf}, \text{mo}, \text{eco}\}$. It is also easy to see that $G_{src}.\text{hb} = G_{tgt}.\text{hb}$. Hence, $G_{src}$ is complete, and ATOMICITY, COHERENCE, NO-THIN-AIR hold for $G_{src}$ since they hold for $G_{tgt}$. To see that $G_{src}.\text{psc}$ is acyclic, it suffices to note that $G_{src}.\text{psc} \subseteq G_{tgt}.\text{psc} \cup G_{tgt}.\text{sb}$ (acyclicity of $G_{tgt}.\text{psc} \cup G_{tgt}.\text{sb}$ follows from the acyclicity of $G_{tgt}.\text{psc}$ since $\text{sb} \subseteq \text{psc} \cup \text{psc}^+$ in every execution). Finally, if $(a, b) \in G_{tgt}.\text{race}$ and $\text{na} \in \{G_{tgt}.\text{mod}(a), G_{tgt}.\text{mod}(b)\}$, then the same holds in $G_{src}$; we must have $\text{loc}(a) \neq x$ if $(a, b) \notin G_{tgt}.\text{hb} \cup (G_{tgt}.\text{hb})^{-1}$. 

Lemma I.12 (Register promotion-b). Let $G_{tgt}$ be an RC11-consistent RMW-execution. Let $x \in \text{Loc}$ and let $X = \{b \in E \mid \text{loc}(b) = x\}$. Suppose that all accesses in $X$ are related by $G_{tgt}.\text{sb}$. Let $a \notin E$, let $G_{src}$ be an RMW-execution satisfying:

- $G_{src}.E = G_{tgt}.E \uplus \{a\}$.
- $G_{src}.\text{lab} = G_{tgt}.\text{lab} \cup \{a \mapsto L\}$ where $L$ is some access label with mode $\text{na}$ and location $x$.
- $G_{src}.\text{sb} \supset G_{tgt}.\text{sb}$ and every event in $X$ is $G_{src}.\text{sb}$-related to $a$.
- $G_{src}.\text{rf} = G_{tgt}.\text{rf}$ if $G_{src}.\text{typ}(a) = W \setminus \text{RMW}$, and otherwise $G_{src}.\text{rf} = G_{tgt}.\text{rf} \cup \{(\max G_{src}.\text{sb} G_{src}.\text{sb}_a^+, a)\}$. 

30 2016/11/16
Then, \(G_{src}\) is RC11-consistent, and it is racy if \(G_{tgt}\) is racy.

Proof. Easily follows from our definitions.

## J. Proofs for §8 (Programming Guarantees)

**Theorem 3.** If in all SC-consistent executions of a program \(P\), every race \(\langle a, b \rangle\) has \(\mod(a) = \mod(b) = sc\), then the outcomes of \(P\) under RC11 coincide with those under SC.

Proof. Let \(P\) be a program, and suppose that every race \(\langle a, b \rangle\) in some SC-consistent execution of \(P\) has \(\mod(a) = \mod(b) = sc\). We prove that \(P\) has no weak behaviors. Suppose toward a contradiction that there exists an execution \(G\) of \(P\) that is RC11-consistent but not SC-consistent. (Note that if \(P\) has undefined behavior under RC11, then there exists a racy RC11-consistent execution of \(P\), and our assumption ensures that this execution is not SC-consistent.)

We call an execution \(G'\) a prefix of an execution \(G\) if it is obtained by restricting \(G\) to a set \(E\) of events that contains the set \(E_0\) of initialization events, and is closed with respect to \(G_{sb} \cup G_{rf}\) (\(a \in E\) whenever \(b \in E\) and \(\langle a, b \rangle \in G_{sb} \cup G_{rf}\)). It is easy to show that \(G'\) is RC11-consistent, provided that \(G\) is RC11-consistent.

**Notation J.1.** For an execution \(G\), \(G_{rf}\vert_{sc}\) denotes the restriction of \(G_{rf}\) to SC accesses \((G_{rf}\vert_{sc} = [G.E\vert_{sc}; G_{rf}; [G.E^+])\). A similar notation is used for \(G_{mo}\) and \(G_{rb}\).

For a set of events \(E\), let \(\Pi(E)\) denote the set of all pairs \(\langle a, b \rangle \in E \times E\) of conflicting events, such that \(\{G_{mod}(a), G_{mod}(b) \} \neq \{sc\}\) and \(\langle a, b \rangle, \langle b, a \rangle \notin (G_{sb} \cup G_{rf}\vert_{sc})^+\). Let \(a_1, \ldots, a_n\) be an enumeration of \(E \setminus E_0\) that respects \(G_{sb} \cup G_{rf}\) (that is, \(i < j\) whenever \(\langle a_i, a_j \rangle \in G_{sb} \cup G_{rf}\)). For every \(1 \leq i \leq n\), let \(E_i = E_0 \cup \{a_1, \ldots, a_i\}\) and \(G_i = G\vert_{E_i}\). Since the \(G_i\)'s are all prefixes of \(G\), all of them are RC11-consistent.

**Claim:** For every \(1 \leq i \leq n\), if \(\Pi(E_i) = \emptyset\) then \(G_i\) is SC-consistent.

**Proof:** Suppose that \(\Pi(E_i) = \emptyset\). Since \(G\) satisfies COHERENCE, it follows that:

- \(G_i.rf \subseteq (G_{sb} \cup G_{rf}\vert_{sc})^+\).
- \(G_i.mo \subseteq (G_{sb} \cup G_{rf}\vert_{sc})^+ \cup G_{mo}\vert_{sc}.$
- \(G_i.rb \subseteq (G_{sb} \cup G_{rf}\vert_{sc})^+ \cup G_{rb}\vert_{sc}.$

Hence, we have \(G_i.sb \cup G_i.rf \cup G_i.mo \cup G_i.rb \subseteq R^+,\) where \(R = G_{sb} \cup G_{rf}\vert_{sc} \cup G_{mo}\vert_{sc} \cup G_{rb}\vert_{sc}.$

Since \(G\) satisfies the SC condition, we have that \(R\) is acyclic, and so \(G_i\) is SC-consistent (ATOMICITY holds since it holds for \(G\)).

Now, since \(G\) is not SC-consistent, we have \(\Pi(G.E) \neq \emptyset\). Let \(k = \min\{i | \Pi(E_i) \neq \emptyset\}\). Then, \(\Pi(E_{k-1}) = \emptyset\) (and so, \(G_{k-1}\) is SC-consistent), and there exists some \(j < k\), such that \(a_j\) and \(a_k\) are conflicting, \(\{G_{mod}(a_j), G_{mod}(a_k)\} \neq \{sc\}\), and \(\langle a_j, a_k \rangle, \langle a_k, a_j \rangle \notin (G_{sb} \cup G_{rf}\vert_{sc})^+\). Let \(B = \{b \in E_k | \langle b, a_k \rangle \in G_{sb}\}\). Since \(\langle a_j, a_k \rangle \notin (G_{sb} \cup G_{rf}\vert_{sc})^+\), and \(G_{k-1}.rf \subseteq (G_{sb} \cup G_{rf}\vert_{sc})^+\), we have \(\langle a_j, b \rangle \notin (G_{sb} \cup G_{rf})^+\) for every event \(b \in B\). Let \(x = \loc(a_k)\), and consider two cases:

- \(\typ(a_k) = W:$$

  **Claim:** \(\langle a_j, a_k \rangle \in G_{k}.race\).

  **Proof:** Clearly, we have \(\langle a_k, a_j \rangle \notin (G_{k}.sb \cup G_{k}.rf)^+\) (\(a_k\) has no outgoing sb and rf edges in \(G_k\)). In addition, we have \(\langle a_j, a_k \rangle \notin (G_{k}.sb \cup G_{k}.rf)^+\) (otherwise, \(\langle a_j, b \rangle \in (G_{sb} \cup G_{rf})^+\) for some \(b \in B\)).

- **Claim:** \(G_k\) is not SC-consistent.

  **Proof:** Since \(\langle a_j, a_k \rangle \in G_{k}.race\) and \(\{G_{mod}(a_j), G_{mod}(a_k)\} \neq \{sc\}\), the claim follows from our assumption.

- **Claim:** \(a_k \notin G_{At}\).
Proof: Suppose otherwise, and let $b \in G.E$ such that $\langle b, a_k \rangle \in \text{rmw}$. Since $G_k$ is not SC-consistent, but $G_{k-1}$ is SC-consistent, it must be the case that $\langle a_k, c \rangle \in G.\text{mo}$ and $\langle c, a_k \rangle \in (G.\text{sb} \cup G.\text{rf} \cup G.\text{mo} \cup G.\text{rb})^+$ for some $c \in E_{k-1}$. Let $d \in E_{k-1}$ such that $\langle c, d \rangle \in (G.\text{sb} \cup G.\text{rf} \cup G.\text{mo} \cup G.\text{rb})^*$ and $\langle d, a_k \rangle \in G.\text{sb} \cup G.\text{mo} \cup G.\text{rb}$. Then, we also have $\langle c, d \rangle \in (G_{k-1}.\text{sb} \cup G_{k-1}.\text{rf} \cup G_{k-1}.\text{mo} \cup G_{k-1}.\text{rb})^*$. If $\langle d, a_k \rangle \in G.\text{mo} \cup G.\text{rb}$, then we obtain $\langle d, c \rangle \in G.\text{mo} \cup G.\text{rb}$, and so $\langle d, c \rangle \in G_{k-1}.\text{mo} \cup G_{k-1}.\text{rb}$, which contradicts the fact that $G_{k-1}$ is SC-consistent. Otherwise, $\langle d, a_k \rangle \in G.\text{sb}$. It follows that $\langle d, b \rangle \in G.\text{sb}$. Now, COHERENCE ensures that $G.\text{rmw} \subseteq G.\text{rb}$, and it follows that $\langle b, c \rangle \in G.\text{rb}$. Hence, $\langle b, c \rangle \in G_{k-1}.\text{rb}$, which again contradicts the fact that $G_{k-1}$ is SC-consistent. \hfill \Box

Let $G'_k$ be the extension of $G_{k-1}$ with the event $a_k$ (with the same label as in $G_k$), the $\text{sb}$-edges of $G_k$, and the $\text{mo}$-edges $\{\langle a, a_k \rangle | a \in G_{k-1}.\text{loc}\}$. It is easy to see that $G'_k$ is SC-consistent as well (in particular, it is important here that $a_k \notin G.\text{At}$). Except for $\text{mo}$, it is identical to $G_k$, and so it is an execution of $P$ and $\langle a_j, a_k \rangle \in G'_k.\text{race}$. Since $\{G.\text{mod}(a_j), G.\text{mod}(a_k)\} \neq \{\text{sc}\}$, this contradicts our assumption.

- $\text{typ}(a_k) = \text{R}$:
  In this case, we must have $\text{typ}(a_j) = \text{W}$. Let

$$E = \{a \in G.E | \langle a, a_k \rangle \in (G.\text{sb} \cup G_{k-1}.\text{rf})^* \lor \langle a, a_j \rangle \in (G.\text{sb} \cup G_{k-1}.\text{rf})^*\}.$$

Let $G'$ be the restriction of $G_k$ to the events in $E$. Since $G'|_{E \setminus \{a_k\}}$ is a prefix of $G_{k-1}$, it is SC-consistent. Let $c = \max_{G.\text{sb} \cup G_{k-1}.\text{loc}} G'.\text{loc}$, and consider two cases.

- $c \neq a_j$:
  Let $G''$ be the execution obtained from $G'$ by (i) modifying the value read at $a_k$ to $\text{val}_\text{W}(c)$, and (ii) adding the reads-from edge $\langle c, a \rangle$. It is easy to see that $G''$ is SC-consistent, and Assumption B.1 ensures that it is an execution of $P$. Additionally, $\langle a_j, a_k \rangle \notin (G''.\text{sb} \cup G''.\text{rf})^+$ (there are no outgoing $\text{sb}$ and $\text{rf}$ edges from $a_j$ in $G''$), and so, $\langle a_j, a_k \rangle \in G''.\text{race}$. Since $\{G.\text{mod}(a_j), G.\text{mod}(a_k)\} \neq \{\text{sc}\}$, this contradicts our assumption.

- $c = a_j$:
  Let $d$ be the immediate $G.\text{mo}$-predecessor of $c$, and let $G''$ be the execution obtained from $G'$ by (i) modifying the value read at $a_k$ to $\text{val}_\text{W}(d)$, and (ii) adding the reads-from edge $\langle d, a \rangle$. Again, it is easy to see that $G''$ is SC-consistent, and Assumption B.1 ensures that it is an execution of $P$. As in the previous case we obtain a contradiction to our assumption. \hfill \Box

Theorem 4. Let $G$ be an RC11-consistent execution. Suppose that for every two distinct shared locations $x$ and $y$, $[E_x]; \text{sb}; [E_y] \subseteq \text{sb}; [F_{xy}]; \text{sb}$. Then, $G$ is SC-consistent.

Proof. It suffices to show that $\text{sb} \cup \text{eco}$ is acyclic (recall that $\text{eco} = \text{eco} \setminus \text{sb}$). Consider a cycle in $\text{sb} \cup \text{eco}$ of a minimal length. Cycles with at most one $\text{eco}$ edge are ruled out by COHERENCE. Hence, our cycle must have at least two $\text{eco}$ edges. Let $a_1, b_1, a_2, b_2, \ldots, a_n, b_n \in E$ (where $n \geq 2$) such that $\langle a_i, b_i \rangle \in \text{eco}$ and $\langle b_i, a_{i+1} \rangle \in \text{sb}$ for every $1 \leq i \leq n$ (where we take $a_{n+1}$ to be $a_1$). The events $a_1, b_1, \ldots, a_n, b_n$ are all accesses to shared locations (since $\langle a_i, b_i \rangle \in \text{eco}$ $\subseteq=_{\text{loc}}$ for every $1 \leq i \leq n$). In addition, we have $\text{loc}(b_i) \neq \text{loc}(a_{i+1})$ for every $1 \leq i \leq n$ (otherwise we would have $\langle a_i, a_{i+1} \rangle \in \text{eco}; \text{sb}|_{\text{loc}} \subseteq \text{eco}$, which contradicts the minimality of the cycle). Therefore, our assumption entails that there exist $f_1, \ldots, f_n \in F_{\text{sc}}$ such that $\langle b_i, f_i \rangle \in \text{sb}$ and $\langle f_i, a_{i+1} \rangle \in \text{sb}$ for every $1 \leq i \leq n$. It follows that $\langle f_i, f_{i+1} \rangle \in [F_{xy}]; \text{sb}; \text{eco}; \text{sb}; [F_{sc}] \subseteq \text{psc}$ for every $1 \leq i \leq n$ (where we take $f_{n+1}$ to be $f_1$). This contradicts the fact that $G$ satisfies the $\text{sc}$ constraint. \hfill \Box