# **Taming Release-Acquire Consistency**

Ori Lahav Nick Giannarakis Viktor Vafeiadis Max Planck Institute for Software Systems (MPI-SWS)



## C11's release/acquire declarative memory model

#### Store buffering

$$x = y = 0$$

$$x := 1; \quad | \quad y := 1;$$

$$\text{print } y \quad | \quad \text{print } x$$

both threads may print 0



#### Message passing

$$\begin{array}{c|c} x = 0 \\ m := 42; & \text{while } x = 0 \\ x := 1 & \text{skip}; \\ print m \end{array}$$

only 42 may be printed



**RA model:** C11 model where all reads are acquire, all writes are release, and all atomic updates are acquire/release

#### Good balance between performance and programmability:

- Supports intended hardware/compiler optimizations:
  - Elimination of redundant adjacent accesses
- Store-load reordering:  $\forall x \rightarrow \exists y \rightarrow \exists x \text{ (unlike SC)}$
- DRF theorem:

No data races under SC ensures no weak behaviors

Monotonicity:

Adding synchronization does not introduce behaviors (unlike TSO)

- Verified compilation schemes for x86-TSO and Power
- Program logics: RSL, GPS, OGRA

### Strong release/acquire

Problem: Some behaviors allowed by RA are never observed.

$$egin{array}{c} x := 1; \ y := 2; \ z_1 := 1 \end{array} egin{array}{c} ext{print } z_1; \ ext{print } z_2; \ ext{print } x; \ ext{z}_2 := 1 \end{array} egin{array}{c} y := 1; \ x := 2; \ z_2 := 1 \end{array}$$

C11 allows printing 1, 1, 1, 1.



**Proposed solution:** Rule out  $hb \cup mo$  cycles.

- No additional cost:
  - Compilation schemes are not affected.
- Same compiler optimizations are sound.
- No better deal for Power:

Power model restricted to RA accesses = strong RA

#### **SC-fences**

**Problem:** SC-fences are overly weak.

$$x = y = 0$$

$$x := 1 \quad \text{print } x; \quad \text{print } y; \quad x := 1$$

$$\text{print } y \quad \text{print } x \quad y := 1$$

C11 allows both threads to print 1, 0.



**Proposed solution:** Model SC-fences as atomic updates of a distinguished fence location.

- RA semantics enforces all fence events to be ordered by hb.
- Compilation schemes are not affected.
- Adding fences to guarantee SC:
  - Between every two racy accesses
- Between racy writes and racy reads for *client-server programs*

## **Operational semantics**

- Based on point-to-point communication.
- Each processor has a local memory and an outgoing message buffer.
- Processors non-deterministically choose between performing their own commands and processing incoming messages.
- Messages are processed in the order they were issued.





- Processing a message updates the local memory and adds the message to the outgoing buffer.
- Global timestamps are used to ensure coherence properties:
  - Every write obtains a new timestamp, larger than all previous ones.
  - When processing a message, the local memory is updated only if the message's timestamp is larger than the stored one.



If the first thread prints 8, the second thread cannot print 7

