Memory Sequential Consistency

Understand the distinction between memory consistency and cache coherency. Learn why sequential consistency in shared memory is important.

Problems with Shared Memory

A very important note on Cache

For this note, cache refers to the private data cache belonging to each core, not the LLC (Last-Level Cache), such as L3, which is shared by all cores and serves as the last cache level before main memory (assuming the architecture includes L1, L2, and L3 caches).

Multicore Access of Shared Memory Objects

Figure 1

| Core C1 | Core C2 | Comments |
| --- | --- | --- |
| S1: Store data = NEW; | | data is initialized to 0 |
| S2: Store flag = SET; | L1: Load r1 = flag; | flag is initialized != SET |
| | B1: if (r1 != SET) goto L1; | B1 is a branch instruction and can cause L1 and B1 to repeat |
| | L2: Load r2 = data; | |

The table above shows two cores, core C1 and core C2, with access to the variables data and flag located in shared memory. At first glance, the expectation is for core C2's r2 register (which receives the value of data) to read NEW.

Preserved Ordering: S1, S2, L1, L2

However, this expectation assumes that the CPU does not reorder instructions (see Why are instructions reordered?). If these instructions are reordered as follows, core C2's r2 register (holding the value of the data variable) will return 0 instead of NEW.

Instructions Reordered: S2, L1, L2, S1

Figure 2

To illustrate this, consider the modified table.

| Core C1 | Core C2 | Coherence of data | Coherence of flag |
| --- | --- | --- | --- |
| S2: Store flag = SET; | | C2: read-only | C1: read-write |
| | L1: Load r1 = flag; | C2: read-only | C2: read-only |
| | B1: if (r1 != SET) goto L1; | N/A | N/A |
| | L2: Load r2 = data; | C2: read-only | C2: read-only |
| S1: Store data = NEW; | | C1: read-write | C2: read-only |

This clearly shows the consequence of reordering instructions in the above scenario: the branch instruction B1 does not jump back to L1 (because r1 = SET), so L2 loads data into r2, which holds 0 because core C1's store S1 was reordered after core C2's load L2.
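As a concrete illustration, here is a minimal sketch of this data/flag pattern in C11 atomics (the NEW and SET values and the function names are illustrative, not from the table). The release store on flag forbids S1 from moving after S2, and the acquire load forbids L2 from moving before L1, which is exactly the ordering the example requires.

```c
#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

#define NEW 42
#define SET 1

int data;            /* payload: a plain variable                  */
atomic_int flag;     /* guard: published with release semantics    */

void *producer(void *arg) {                 /* core C1 */
    data = NEW;                                           /* S1 */
    atomic_store_explicit(&flag, SET,
                          memory_order_release);          /* S2: cannot move above S1 */
    return NULL;
}

void *consumer(void *arg) {                 /* core C2 */
    while (atomic_load_explicit(&flag,
                                memory_order_acquire) != SET)
        ;                                                 /* L1 + B1 spin */
    int r2 = data;                                        /* L2: guaranteed to see NEW */
    printf("r2 = %d\n", r2);
    return NULL;
}

int main(void) {
    pthread_t t1, t2;
    pthread_create(&t2, NULL, consumer, NULL);
    pthread_create(&t1, NULL, producer, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    return 0;
}
```

With both halves of the release/acquire pair in place, a consumer that observes flag == SET is guaranteed to observe data == NEW.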

Next, consider the following table, which also honors the SWMR invariant.

Figure 3

| Core C1 | Core C2 | Comments |
| --- | --- | --- |
| S1: x = NEW; | S2: y = NEW; | x and y initialized to 0 |
| L1: r1 = y; | L2: r2 = x; | |

The following order-dependent outcomes are possible after execution. In other words, this example is non-deterministic.

  • (r1, r2) = (0, NEW) : S1, L1, S2, L2

  • (r1, r2) = (NEW, 0) : S2, L2, S1, L1

  • (r1, r2) = (NEW, NEW) : S1, S2, L1, L2

Some x86 systems from AMD and Intel use FIFO write buffers to enhance performance. In such systems, (r1, r2) = (0, 0) is also possible. A FIFO write buffer delays writes to the cache or main memory; while a write sits in a core's buffer, it has not yet reached memory, so a load on another core can still return the stale value. The mfence, lfence, sfence, or clflush instructions can be used to enforce ordering in these cases.
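The following is a minimal sketch of this store-buffering case using C11 relaxed atomics (the function names and iteration count are illustrative). On an x86 machine, running it long enough typically surfaces the (0, 0) outcome; uncommenting the seq_cst fences, which compile to mfence on x86, rules it out.

```c
#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

atomic_int x, y;     /* both initialized to 0; NEW is 1 here */
int r1, r2;

void *run_c1(void *arg) {
    atomic_store_explicit(&x, 1, memory_order_relaxed);    /* S1 */
    /* atomic_thread_fence(memory_order_seq_cst); */       /* drains the write buffer */
    r1 = atomic_load_explicit(&y, memory_order_relaxed);   /* L1 */
    return NULL;
}

void *run_c2(void *arg) {
    atomic_store_explicit(&y, 1, memory_order_relaxed);    /* S2 */
    /* atomic_thread_fence(memory_order_seq_cst); */
    r2 = atomic_load_explicit(&x, memory_order_relaxed);   /* L2 */
    return NULL;
}

int main(void) {
    for (int i = 0; i < 100000; i++) {
        atomic_store(&x, 0);
        atomic_store(&y, 0);
        pthread_t t1, t2;
        pthread_create(&t1, NULL, run_c1, NULL);
        pthread_create(&t2, NULL, run_c2, NULL);
        pthread_join(t1, NULL);
        pthread_join(t2, NULL);
        if (r1 == 0 && r2 == 0)
            printf("iteration %d: (r1, r2) = (0, 0)\n", i);
    }
    return 0;
}
```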

Each outcome above honors the cache coherence invariants and is thus coherent. This illustrates the difference between memory consistency and cache coherence and highlights why consistency matters.

Memory consistency models define specifications for the behavior of multithreaded programs. The cache coherence invariants can all be satisfied while instruction reordering in a multithreaded program still breaks consistency.

Memory consistency models specify what values dynamic loads may return and the final state of memory. They divide executions into two categories: those that obey the model (MC executions) and those that do not (non-MC executions). An implementation that permits only MC executions is an MC implementation.

Why are instructions reordered?

This is not a comprehensive list of reasons, but it is enough to develop a general understanding.

  • Store-Store reordering: Reordering can occur if a core has a non-FIFO write buffer, allowing stores to depart in a different order than they entered. This (dynamic) reordering can occur if the first store misses in the cache while the second hits, or if the second store can coalesce with an earlier store. Reordering stores to different memory locations has no effect on a single-threaded program, but it can be observed by other threads in a multithreaded one.

  • Load-Load reordering: Dynamic instruction scheduling is a feature of many modern cores (see Tomasulo's algorithm), and reordering loads has a similar effect. For example, if the branch instruction B1 is removed, there is no longer a control dependency between L1 and L2, which allows reordering. Recall that control dependence normally precludes such reordering: an instruction J that depends on the outcome of a branch instruction I (conditionals using jmp/branch instructions) must not be reordered to precede I.

  • Load-Store and Store-Load reordering: Cores may reorder loads and stores to different addresses from the same thread. This can appear safe; however, recall Figure 3 from the section Multicore Access of Shared Memory Objects. Reordering an earlier store with a later load (store-load reordering) could yield the order L1, L2, S1, S2 and return 0 for both r1 and r2, a Read-After-Write (RAW) violation. Likewise, reordering an earlier load with a later store (load-store reordering) is a Write-After-Read (WAR) violation. (See: Architecture hazards / Anti-dependence.) A sketch after this list maps each reordering class to the C11 ordering that forbids it.
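The mapping below is standard C11 semantics, not something the original text prescribes; the variable names are illustrative.

```c
#include <stdatomic.h>

atomic_int a, b;

void ordering_examples(void) {
    /* Store-Store: a release store keeps earlier stores ordered before it. */
    atomic_store_explicit(&a, 1, memory_order_relaxed);
    atomic_store_explicit(&b, 1, memory_order_release);      /* S(a) stays before S(b) */

    /* Load-Load and Load-Store: an acquire load keeps later accesses after it. */
    int r1 = atomic_load_explicit(&b, memory_order_acquire);
    int r2 = atomic_load_explicit(&a, memory_order_relaxed); /* cannot move above the acquire */

    /* Store-Load: the costliest case; a full seq_cst fence is required. */
    atomic_store_explicit(&a, 2, memory_order_relaxed);
    atomic_thread_fence(memory_order_seq_cst);               /* mfence on x86 */
    int r3 = atomic_load_explicit(&b, memory_order_relaxed);

    (void)r1; (void)r2; (void)r3;
}
```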

Common Misunderstanding

Cache coherence alone does not define the behavior of shared memory.

  • The goal of cache coherence is to make the caches in a multicore architecture invisible, so that the system appears to behave like a single-core system. It is essentially oriented toward making the cores' respective caches coherent to the degree that the system appears to operate with a single cache. Even after this is achieved, we must still deal with the data consistency (or inconsistency) that can be present between core caches.

    • Suppose the SWMR (single-writer, multiple-reader) invariant is honored: each epoch allows either a single core to read and write a given memory address A, or multiple cores to read (only) that address.

    • The Data-Value invariant ensures that the value at memory address A at the start of an epoch equals the value at A at the end of its most recent read-write epoch.

    • These invariants make caches coherent but do not solve consistency issues introduced by instruction reordering. Such issues arise in the context of shared memory, where multithreaded programs can potentially access shared data in an order that breaks consistency but does not break coherency.

  • Cache coherence is granular, typically deals with cache blocks, and isn't concerned with access to multiple cache blocks. Real-world programs access variables across numerous cache blocks.

  • It is possible to implement a memory system without coherence and caches.

    • Not usually ideal

Sequential Consistency

A memory consistency model is said to be sequentially consistent if, in Lamport's words, "the result of any execution is the same as if the operations of all processors (cores) were executed in some sequential order, and the operations of each individual processor (core) appear in this sequence in the order specified by its program."

Another key difference between coherence and consistency appears when observing sequential consistency: coherence operates on a per-cache-block basis, whereas consistency is established across all blocks.

Sequentially consistent execution requires the following:

  1. All cores insert their loads and stores into the memory order <m, respecting their program order, regardless of whether the accesses are to the same or different addresses. (Let <p and <m denote program order and memory order, respectively; e.g., L(a) <p L(b) means that the load of address a precedes the load of address b in program order.)

    • Load -> Load sequential: If L(a) <p L(b), then L(a) <m L(b)

    • Load -> Store sequential: If L(a) <p S(b), then L(a) <m S(b)

    • Store -> Store sequential: If S(a) <p S(b), then S(a) <m S(b)

    • Store -> Load sequential: If S(a) <p L(b), then S(a) <m L(b)

  2. Every load gets its value from the last store before it to the same address.

    • Value of L(a) = Value of MAX_<m { S(a) | S(a) <m L(a) }, where MAX_<m denotes "latest in memory order."

    • That is, a load L(a) returns the value of the latest store S(a) that precedes it in memory order. (A brute-force check of these rules against Figure 3 follows this list.)
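These two requirements can be checked by brute force. The sketch below (illustrative, with NEW = 1) enumerates every interleaving of Figure 3 that preserves each core's program order and prints the resulting (r1, r2); none of the six yields (0, 0), confirming that sequential consistency forbids that outcome.

```c
#include <stdio.h>

/* Figure 3 with NEW = 1: core C1 runs {S1: x = 1; L1: r1 = y;},
   core C2 runs {S2: y = 1; L2: r2 = x;}. Each row below assigns the
   four slots of memory order <m to a core (0 = C1, 1 = C2). */
int main(void) {
    const char *names[2][2] = { { "S1", "L1" }, { "S2", "L2" } };
    int orders[6][4] = {
        {0,0,1,1}, {0,1,0,1}, {0,1,1,0},
        {1,0,0,1}, {1,0,1,0}, {1,1,0,0}
    };
    for (int m = 0; m < 6; m++) {
        int x = 0, y = 0, r1 = 0, r2 = 0, pc[2] = {0, 0};
        printf("order:");
        for (int s = 0; s < 4; s++) {
            int core = orders[m][s];
            int op = pc[core]++;             /* program order within a core */
            printf(" %s", names[core][op]);
            if (core == 0) { if (op == 0) x = 1; else r1 = y; }
            else           { if (op == 0) y = 1; else r2 = x; }
        }
        printf("  ->  (r1, r2) = (%d, %d)\n", r1, r2);
    }
    return 0;
}
```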

Naive Implementations

The Switch

Given a set of cores C_i, i ∈ ℕ, each core attempts to execute in program order. A single switch points to at most one core at a time, allows it to complete a single memory access, and then repeats by selecting another core. The sequence of switch selections defines the memory order, as sketched below.
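A minimal sketch of the switch model (all names illustrative): at each step the switch selects one core and lets it complete exactly one memory access in that core's program order, and the sequence of selections defines the memory order <m.

```c
#include <stdio.h>
#include <stdlib.h>

enum { NCORES = 2, NOPS = 2 };

typedef struct { int is_store, addr, val; } Op;

int main(void) {
    int memory[2] = { 0, 0 };            /* addr 0 = x, addr 1 = y */
    /* The Figure 3 program with NEW = 1. */
    Op prog[NCORES][NOPS] = {
        { {1, 0, 1}, {0, 1, 0} },        /* core C1: S1 (store x), L1 (load y) */
        { {1, 1, 1}, {0, 0, 0} },        /* core C2: S2 (store y), L2 (load x) */
    };
    int pc[NCORES] = { 0, 0 }, completed = 0;

    while (completed < NCORES * NOPS) {
        int c = rand() % NCORES;         /* the switch selects one core  */
        if (pc[c] == NOPS) continue;     /* that core has finished       */
        Op op = prog[c][pc[c]++];        /* one access, in program order */
        if (op.is_store)
            memory[op.addr] = op.val;
        else
            printf("core C%d loads addr %d -> %d\n", c + 1, op.addr, memory[op.addr]);
        completed++;
    }
    return 0;
}
```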

Multitasking Uniprocessor

Execute all threads of a multithreaded program on a single sequential core (a uniprocessor). A single core performs one memory access at a time by construction, so the result is trivially sequentially consistent.

Optimized Sequential Consistency Implementations with Cache Coherence

  • Let modified (M) denote an L1 block that one core can write and read.

  • Let shared (S) denote an L1 block from which one or more cores can only read.

  • Let GetM and GetS denote coherence requests to obtain a block in M and S, respectively.

Non-Binding Prefetching

A non-binding prefetch for block B requests that the coherent memory system change B's coherence state. In most cases, this changes B's state to permit loads (shared or modified state) or stores (modified state) by issuing GetS or GetM requests, respectively. Non-binding in this context refers to the out-of-order prefetching of coherence permissions; these out-of-order prefetches do not impact consistency as long as the loads and stores themselves are performed in program order. That is, the order in which coherence permissions are prefetched places no binding on the order in which the loads and stores execute.

Non-binding prefetches place the data in a hardware-coherent cache, not a register. (See Binding vs Non-binding prefetches).
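At the software level, GCC and Clang expose an analogous non-binding data prefetch through __builtin_prefetch, which asks the memory system to pull a block toward the cache without binding a value to any register. The loop and prefetch distance below are illustrative.

```c
/* __builtin_prefetch(addr, rw, locality): rw = 0 requests read permission
   (akin to GetS); rw = 1 requests write permission (akin to GetM). The
   demand load below still executes in program order. */
long sum_with_prefetch(const int *a, int n) {
    long sum = 0;
    for (int i = 0; i < n; i++) {
        if (i + 16 < n)
            __builtin_prefetch(&a[i + 16], 0, 3);  /* non-binding: cache only    */
        sum += a[i];                               /* binding: value to register */
    }
    return sum;
}
```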

Speculative Cores

Recall the concepts of speculative execution and branch prediction, where a branch predictor may allow the processor to speculate on a branch outcome, jump to the predicted target, and begin processing instructions. A misprediction requires the processor to squash the speculatively pipelined instructions. If the speculation resulted in the execution of loads and stores that are eventually squashed, those squashed loads and stores can resemble non-binding prefetches.

Consider the following example. A speculative load that misses in the L1 cache issues a non-binding GetS, allowing the core to retrieve the value from a peer core, cache it, and update the register. If the speculation is discovered to be a misprediction, the register update is squashed, clearing the loaded value and any other architectural effect of the load. However, because non-binding prefetches do not inherently impact execution (provided program order is sustained), there is no need to evict the cached instruction (instruction cache) or the cached data (data cache).

Executing instructions that operate on (spatially or temporally) local data may realize performance improvements because of locality. Subsequent execution of the identical instruction will be a cache hit.

Recall the special note on the private data cache (not the LLC). Because a squashed load's effect on the cache is the same as a non-binding prefetch's, the misprediction is effectively masked as one. Suppose a misprediction is discovered (usually in the Instruction Decode or Execute pipeline stage) after the load has already updated a register. In that case, only the register update is squashed; no removal or invalidation is necessary in the cache because its effect on memory order is negligible.

See Spectre

Dynamically Scheduled Cores

Simply put, dynamic instruction scheduling is an optimization used by modern cores. If two load instructions, L1 and L2, are dynamically rescheduled, memory consistency issues may arise. Many cores will speculatively execute L2 before L1 and predict that this reordering will remain hidden from other cores; if it becomes visible, sequential consistency is violated.

A core must therefore verify the correctness of such a prediction on systems that enforce sequential consistency. Gharachorloo discovered two ways to verify a prediction; a sketch of the second check follows the list.

  • After speculatively executing L2 and before committing it, the core can check that the speculatively accessed block has not left the cache. If the block remains in the cache, its value could not have changed between the load's execution and its commit, because another core would first have to obtain write permission for the block. The core therefore tracks the address loaded by L2 and compares it against evicted blocks and incoming coherence requests. An incoming GetM (indicating that another core wants the block in the modified state) implies that the other core could observe the out-of-order L2 value; the value is no longer hidden as assumed, so the speculative load must be squashed.

  • The second method replays the load when the core is ready to commit it. If the replayed value of L2 does not match the value obtained by the original, speculatively executed L2, then load-load reordering has produced an observable difference, and the execution must be squashed.
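A minimal sketch of the second check, commit-time load replay (the structures and values are illustrative, not a real core's implementation):

```c
#include <stdio.h>

static int memory[16];                 /* toy coherent memory */

typedef struct {
    int addr;
    int speculative_value;             /* value the early (out-of-order) L2 returned */
} SpeculativeLoad;

/* Replay the load at commit; a mismatch means the early execution was observable. */
static int commit_load(const SpeculativeLoad *ld) {
    int replayed = memory[ld->addr];   /* the replayed, in-order load */
    if (replayed != ld->speculative_value) {
        printf("mismatch at addr %d: squash and re-execute\n", ld->addr);
        return 0;                      /* squash */
    }
    return 1;                          /* values match: commit proceeds */
}

int main(void) {
    SpeculativeLoad l2 = { .addr = 3, .speculative_value = 0 };
    memory[3] = 7;                     /* another core wrote between execute and commit */
    commit_load(&l2);                  /* detects the mismatch and squashes */
    return 0;
}
```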

Additional Optimizations and Considerations

Refer to Adve and Gharachorloo, "Shared Memory Consistency Models: A Tutorial."
