18-742 Parallel Computer Architecture
Lecture 5: Cache Coherence
Chris Craik (TA)
Carnegie Mellon University

Readings: Coherence

Required, for review:
- Papamarcos and Patel, "A low-overhead coherence solution for multiprocessors with private cache memories," ISCA 1984.
- Kelm et al., "Cohesion: A Hybrid Memory Model for Accelerators," ISCA 2010.

Required:
- Censier and Feautrier, "A new solution to coherence problems in multicache systems," IEEE Trans. Comput., 1978.
- Goodman, "Using cache memory to reduce processor-memory traffic," ISCA 1983.
- Laudon and Lenoski, "The SGI Origin: a ccNUMA highly scalable server," ISCA 1997.
- Lenoski et al., "The Stanford DASH Multiprocessor," IEEE Computer, 25(3):63-79, 1992.
- Martin et al., "Token coherence: decoupling performance and correctness," ISCA 2003.

Recommended:
- Baer and Wang, "On the inclusion properties for multi-level cache hierarchies," ISCA 1988.
- Lamport, "How to Make a Multiprocessor Computer That Correctly Executes Multiprocess Programs," IEEE Trans. Comput., Sept 1979, pp. 690-691.
- Culler and Singh, Parallel Computer Architecture, Chapters 5 and 8.

Shared Memory Model

Many parallel programs communicate through shared memory: Proc 0 writes to an address, and Proc 1 later reads it. This implies communication between the two.

  Proc 0: Mem[A] = 1
  Proc 1: ... print Mem[A]

Each read should receive the value last written by anyone. This requires synchronization (what does "last written" mean?). And what if Mem[A] is cached, at either end? A two-thread C sketch of this example follows.
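To make the shared-memory example concrete, here is a minimal two-thread C sketch. It is not from the slides: the flag-based synchronization and all names are illustrative assumptions. Without coherent caches, the load of mem_A in proc1 could observe a stale 0 even after the flag is set, which is exactly the problem the rest of this lecture addresses.

```c
/* A minimal sketch of the slide's Proc 0 / Proc 1 example as two threads
 * sharing memory. The "ready" flag is the synchronization that defines what
 * "last written" means; C11 release/acquire ordering stands in for the
 * coherence and consistency machinery discussed in this lecture. */
#include <stdatomic.h>
#include <stdio.h>
#include <pthread.h>

static int mem_A = 0;            /* the shared location Mem[A] */
static atomic_int ready = 0;     /* hypothetical synchronization flag */

static void *proc0(void *arg) {
    (void)arg;
    mem_A = 1;                   /* Proc 0: Mem[A] = 1 */
    atomic_store_explicit(&ready, 1, memory_order_release);
    return NULL;
}

static void *proc1(void *arg) {
    (void)arg;
    while (!atomic_load_explicit(&ready, memory_order_acquire))
        ;                        /* spin until Proc 0's write is visible */
    printf("%d\n", mem_A);       /* must print 1, never a stale 0 */
    return NULL;
}

int main(void) {
    pthread_t t0, t1;
    pthread_create(&t0, NULL, proc0, NULL);
    pthread_create(&t1, NULL, proc1, NULL);
    pthread_join(t0, NULL);
    pthread_join(t1, NULL);
    return 0;
}
```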
The Cache Coherence Problem

Basic question: if multiple processors cache the same block, how do they ensure they all see a consistent state?

Consider processors P1 and P2 with private caches, connected to main memory over an interconnection network, where location x initially holds 1000:

  1. P1: ld r2, x        (P1's cache now holds x = 1000)
  2. P2: ld r2, x        (P2's cache also holds x = 1000)
  3. P1: add r1, r2, r4
     P1: st x, r1        (P1's cached copy becomes 2000)
  4. P2: ld r5, x        (should NOT load the stale 1000)

Cache Coherence: Whose Responsibility?

Software:
- Can the programmer ensure coherence if caches are invisible to software?
- What if the ISA provided a cache-flush instruction? What needs to be flushed (which lines, and which caches)? When does this need to be done?

Hardware:
- Simplifies software's job
- Knows the sharer set
- Doesn't need to be conservative in synchronization

Coherence: Guarantees

Writes to location A by P0 should (eventually) be seen by P1, and all writes to A should appear in some order. Coherence needs to provide:
- Write propagation: guarantee that updates will propagate
- Write serialization: provide a consistent global order seen by all processors
Write serialization needs a global point of serialization for store ordering. Ordering between writes to *different* locations is a memory consistency model problem: a separate issue.

Coherence: Update vs. Invalidate

How can we safely update replicated data?
- Option 1: push updates to all copies
- Option 2: ensure there is only one copy (local), and update it

On a read: if the local copy isn't valid, put out a request (if another node has a copy, it returns it; otherwise memory does).

On a write: read the block into the cache as before, then:
- Update protocol: write to the block, and simultaneously broadcast the written data to sharers (other nodes update their cached copies if present)
- Invalidate protocol: write to the block, and simultaneously broadcast an invalidation of the address to sharers (other nodes evict the block from their caches)

Update vs. Invalidate: Which Do We Want?

Write frequency and sharing behavior are critical.

Update:
+ If the sharer set is constant and updates are infrequent, avoids the cost of the invalidate-reacquire pattern (broadcast updates instead)
- If data is rewritten without intervening reads by other cores, the updates were useless
- With a write-through cache policy, the bus becomes a bottleneck

Invalidate:
+ After the invalidation broadcast, the writing core has exclusive access rights
+ Only cores that keep reading after each write retain a copy
- If write contention is high, leads to ping-ponging (rapid mutual invalidation and reacquisition)

Cache Coherence Methods

How do we ensure that the proper caches are updated?

Snoopy bus [Goodman83, Papamarcos84]:
- Bus-based, single point of serialization
- Processors observe other processors' actions and infer ownership
- E.g., P1 puts a "read-exclusive" request for A on the bus; P0 sees this and invalidates its own copy of A

Directory [Censier78, Lenoski92, Laudon97]:
- Single point of serialization per block, distributed among nodes
- Processors make explicit requests for blocks
- The directory tracks ownership (the sharer set) for each block and coordinates invalidation appropriately
- E.g., P1 asks the directory for an exclusive copy; the directory asks P0 to invalidate, waits for the ACK, then responds to P1

Snoopy Bus vs. Directory Coherence

Snoopy:
+ Critical path is short: miss -> bus transaction to memory
+ Global serialization is easy: the bus provides it already (arbitration)
+ Simple: bus-based uniprocessors adapt easily
- Requires a single point of serialization (the bus): not scalable (it is not quite true that snooping needs a bus; recent work on this later)

Directory:
- Requires extra storage to track sharer sets (can be approximate: false positives are OK)
- Adds indirection to the critical path: request -> directory -> memory
- Protocols and race conditions are more complex
+ Exactly as scalable as the interconnect and directory storage (much more scalable than a bus)

Snoopy-Bus Coherence

Snoopy Cache Coherence

Caches "snoop" (observe) each other's read/write operations. A simple protocol for a write-through, no-write-allocate cache, with actions PrRd, PrWr, BusRd, BusWr (notation: ObservedEvent/Action):

- Valid:   PrRd/--; PrWr/BusWr; an observed BusWr -> Invalid
- Invalid: PrRd/BusRd -> Valid; PrWr/BusWr (stays Invalid: no allocation on write)

Snoopy Invalidation Protocol: MSI

Extend the single valid bit per block to three states:
- M(odified): this cache line is the only copy, and it is dirty
- S(hared): this cache line is one of several copies
- I(nvalid): not present

A read miss makes a Read request on the bus and saves the block in S state. A write miss makes a ReadEx request and saves the block in M state. When a processor snoops a ReadEx from another writer, it must invalidate its own copy (if any). An S -> M upgrade can be made without re-reading the data from memory (via an Invl request).

MSI State Machine (notation: ObservedEvent/Action) [Culler/Singh96]

- M: PrRd/--; PrWr/--; BusRd/Flush -> S; BusRdX/Flush -> I
- S: PrRd/--; BusRd/--; PrWr/BusRdX -> M; BusRdX/-- -> I
- I: PrRd/BusRd -> S; PrWr/BusRdX -> M

A code sketch of these transitions follows.
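Here is a minimal C sketch of the MSI transitions just listed, from a single cache's point of view. All types and function names are hypothetical; a real controller would also handle bus arbitration, data movement, and writeback queues, which these two functions abstract away.

```c
/* A compact sketch of the MSI state machine above for one cache line.
 * msi_on_proc handles this processor's own accesses; msi_on_snoop handles
 * transactions observed on the bus from other caches. */
#include <stdbool.h>

typedef enum { INVALID, SHARED, MODIFIED } msi_state_t;
typedef enum { BUS_NONE, BUS_RD, BUS_RDX, BUS_FLUSH } bus_op_t;

/* Processor-side events: returns the bus transaction this cache must issue. */
bus_op_t msi_on_proc(msi_state_t *s, bool is_write) {
    switch (*s) {
    case MODIFIED:
        return BUS_NONE;                        /* PrRd/--, PrWr/-- */
    case SHARED:
        if (!is_write) return BUS_NONE;         /* PrRd/-- */
        *s = MODIFIED; return BUS_RDX;          /* PrWr/BusRdX (S -> M upgrade) */
    case INVALID:
        if (is_write) { *s = MODIFIED; return BUS_RDX; }  /* PrWr/BusRdX */
        *s = SHARED; return BUS_RD;             /* PrRd/BusRd */
    }
    return BUS_NONE;
}

/* Snooped events from other caches: returns BUS_FLUSH if we must write back. */
bus_op_t msi_on_snoop(msi_state_t *s, bus_op_t observed) {
    bus_op_t action = BUS_NONE;
    if (*s == MODIFIED)
        action = BUS_FLUSH;                     /* BusRd/Flush, BusRdX/Flush */
    if (observed == BUS_RDX)
        *s = INVALID;                           /* any copy is invalidated */
    else if (observed == BUS_RD && *s == MODIFIED)
        *s = SHARED;                            /* downgrade M -> S */
    return action;
}
```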
MSI Correctness

Write propagation: immediately after a write, the new value exists only in the writer's cache, and the block is in M state. The transition into M ensures all other caches are in I state, so a subsequent read in another thread misses and issues a BusRd; the BusRd causes a flush (writeback), and the read sees the new value.

Write serialization: only one cache can be in M state at a time. Entering M generates a BusRdX, and the BusRdX forces a transition out of M in all other caches. The order of block ownership (M state) therefore defines the write order, and this order is global by virtue of the central bus.

More States: MESI

The Invalid -> Shared -> Modified sequence takes two bus operations. What if the data is not shared? The broadcasts are unnecessary. Add an Exclusive state: this is the only copy, and it is clean.
- A block becomes Exclusive if, during the BusRd, no other cache had it. A wired-OR "shared" signal on the bus determines this: snooping caches assert the signal if they also hold a copy.
- A later BusRd from another cache causes a transition into Shared.
- A silent Exclusive -> Modified transition is possible on a write!
MESI is also called the Illinois protocol [Papamarcos84].

MESI State Machine (S = shared signal asserted, S' = not asserted) [Culler/Singh96]

- I: PrRd(S')/BusRd -> E; PrRd(S)/BusRd -> S; PrWr/BusRdX -> M
- E: PrRd/--; PrWr/-- -> M (silent); BusRd/$-transfer -> S
- S: PrRd/--; BusRd/--; PrWr/BusRdX -> M
- M: PrRd/--; PrWr/--; BusRd/Flush -> S
- All states: an incoming BusRdX moves the block to I (with a Flush when leaving M)

Snoopy Invalidation Tradeoffs

Should a downgrade from M go to S or I?
- S: if the data is likely to be reused
- I: if the data is likely to be migratory

On a BusRd, should the data come from another cache or from memory?
- Another cache may be faster (cache-to-cache transfer), if memory is slow or highly contended
- Memory is simpler: it doesn't need to wait to see if a cache has the data first, and there is less contention at the other caches; but this requires a writeback on the M downgrade

Is a writeback on the Modified -> Shared downgrade necessary? One alternative: an Owner (O) state (a MOESI system). One cache owns the latest data (memory is not updated); the memory writeback happens only when all caches have evicted their copies.

Update-Based Coherence: Dragon

Four states:
- E(xclusive): only copy, clean
- Sc (shared, clean): clean with respect to the Sm copy, not with respect to memory
- Sm (shared, modified): the Owner state of MOESI
- M(odified): only copy, dirty

The use of updates allows multiple copies of dirty data. There is no I state, because there are no invalidations (only ordinary evictions). Invariant: at most one Sm copy per sharer set. If a cache is Sm, it is authoritative; the Sc caches are clean relative to it, not clean relative to memory.

McCreight, E., "The Dragon computer system: an early overview," Tech Report, Xerox Corp., Sept 1984 (cited in Culler/Singh96).

Dragon State Machine (S = shared signal asserted, S' = not asserted) [Culler/Singh96]

- On a miss: PrRdMiss/BusRd(S') -> E; PrRdMiss/BusRd(S) -> Sc; PrWrMiss/BusRd(S') -> M; PrWrMiss/(BusRd(S); BusUpd) -> Sm
- E:  PrRd/--; PrWr/-- -> M; BusRd/-- -> Sc
- Sc: PrRd/--; BusUpd/Update; PrWr/BusUpd(S) -> Sm; PrWr/BusUpd(S') -> M
- Sm: PrRd/--; PrWr/BusUpd(S); BusUpd/Update -> Sc; BusRd/Flush; PrWr/BusUpd(S') -> M
- M:  PrRd/--; PrWr/--; BusRd/Flush -> Sm

Update-Protocol Tradeoffs

Shared-Modified state vs. keeping memory up-to-date: with multiple sharers, updating memory on every write is equivalent to a write-through cache.

Immediate vs. lazy notification of an Sc-block eviction:
- Immediate: if only one sharer is left, it can go to E or M and save a later BusUpd (only one is saved)
- Lazy: no extra traffic required upon eviction

Directory-Based Coherence

Directory-Based Protocols

Required when scaling past the capacity of a single bus. The system is distributed, but coherence still requires a single point of serialization (for write serialization); that point can be different for every block (striped across nodes). We can therefore reason about the protocol for a single block: one server (the directory node) and many clients (the private caches). The directory receives Read and ReadEx requests and sends Invl requests: invalidation is explicit (as opposed to inferred, as on a snoopy bus).

Directory: Data Structures

  Address   Sharing state
  0x00      Shared: {P0, P1, P2}
  0x04      --
  0x08      Exclusive: P2
  0x0C      --
  ...       ...

The key operation to support is a set-inclusion test. False positives are OK: we want to know which caches *may* contain a copy of a block, and spurious invalidations are ignored. The false-positive rate determines performance. The most accurate (and most expensive) representation is a full bit-vector; compressed representations, linked lists, and Bloom filters [Zebchuk09] are all possible. Here, we assume the directory has perfect knowledge. A bit-vector sketch follows.
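Here is a C sketch of the full bit-vector representation, assuming up to 64 nodes; dir_entry_t and the helper names are illustrative, not from the lecture. The set-inclusion test is a single bit probe, which is why the bit-vector is the most accurate (and, per entry, the most storage-hungry) choice.

```c
/* A sketch of a full-bit-vector directory entry for up to 64 nodes. An exact
 * sharer set never produces spurious Invls; approximate schemes (coarse
 * vectors, Bloom filters) trade storage for false positives, which only cost
 * extra, harmless invalidation messages. */
#include <stdint.h>
#include <stdbool.h>

typedef enum { DIR_UNCACHED, DIR_SHARED, DIR_EXCLUSIVE } dir_state_t;

typedef struct {
    dir_state_t state;
    uint64_t    sharers;    /* bit i set => node i may hold a copy */
} dir_entry_t;

/* Set-inclusion test: may node n hold a copy of this block? */
bool dir_may_have_copy(const dir_entry_t *e, int n) {
    return (e->sharers >> n) & 1;
}

/* A read adds the requestor to the sharer set. */
void dir_add_sharer(dir_entry_t *e, int n) {
    e->sharers |= (uint64_t)1 << n;
    e->state = DIR_SHARED;
}

/* A write leaves exactly one copy; all others must have been invalidated. */
void dir_make_exclusive(dir_entry_t *e, int owner) {
    e->sharers = (uint64_t)1 << owner;
    e->state = DIR_EXCLUSIVE;
}
```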
Directory: Basic Operations

The directory follows the semantics of a snoop-based system, but with explicit request and reply messages. It:
- Receives Read, ReadEx, and Upgrade requests from nodes
- Sends Inval/Downgrade messages to sharers if needed
- Forwards the request to memory if needed
- Replies to the requestor and updates the sharing state

Protocol design is flexible: the exact forwarding paths depend on the implementation (for example, whether to do cache-to-cache transfer).

MESI Directory Transaction: Read

P0 acquires an address for reading (Culler/Singh Fig. 8.16):
  1. Read:  P0 -> Home
  2. DatEx (or DatShr, if other sharers exist):  Home -> P0

ReadEx with a Former Owner

  1.  RdEx:  P0 -> Home
  2.  Invl:  Home -> Owner
  3a. Rev (revision message updating the directory):  Owner -> Home
  3b. DatEx:  Owner -> P0

Contention Resolution (for a Write)

P0 and P1 both request an exclusive copy:
  1a. RdEx:  P0 -> Home
  1b. RdEx:  P1 -> Home
  2a. DatEx: Home -> P0  (P0 wins)
  2b. NACK:  Home -> P1  (entry is busy)
  3.  RdEx:  P1 -> Home  (retry)
  4.  Invl:  Home -> P0
  5a. Rev:   P0 -> Home
  5b. DatEx: Home -> P1

A sketch of the home node's side of these transactions follows.
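Here is a C sketch of how a home node might serve the RdEx transactions above, extending the earlier dir_entry_t with a busy flag. All message and helper names are hypothetical, and for brevity the shared-case path grants exclusivity without collecting invalidation ACKs, which a real protocol must do.

```c
/* A sketch of a home node serving RdEx. Races are escaped by NACKing
 * requests that hit a busy (pending-invalidate) entry, as on the slide;
 * the NACKed requestor retries. send_msg is a hypothetical interconnect
 * primitive. */
#include <stdint.h>
#include <stdbool.h>

typedef enum { MSG_RDEX, MSG_INVL, MSG_REV, MSG_DATEX, MSG_NACK } msg_t;
typedef enum { DIR_UNCACHED, DIR_SHARED, DIR_EXCLUSIVE } dir_state_t;

void send_msg(int dst, msg_t m, uint64_t block);   /* provided elsewhere */

typedef struct {
    dir_state_t state;
    uint64_t    sharers;   /* bit-vector, as in the earlier sketch */
    bool        busy;      /* a transaction is in flight for this block */
} dir_entry_t;

void home_on_rdex(dir_entry_t *e, uint64_t block, int requestor) {
    if (e->busy) {                          /* pending invalidate: NACK, retry */
        send_msg(requestor, MSG_NACK, block);
        return;
    }
    if (e->state == DIR_EXCLUSIVE) {        /* former owner must give up line */
        int owner = __builtin_ctzll(e->sharers);  /* index of lone sharer bit */
        e->busy = true;
        send_msg(owner, MSG_INVL, block);   /* owner Revs home, DatExs requestor */
    } else {
        for (int n = 0; n < 64; n++)        /* invalidate any other sharers */
            if (((e->sharers >> n) & 1) && n != requestor)
                send_msg(n, MSG_INVL, block);
        send_msg(requestor, MSG_DATEX, block);  /* simplified: no ACK collection */
    }
    e->sharers = (uint64_t)1 << requestor;
    e->state = DIR_EXCLUSIVE;
}

void home_on_rev(dir_entry_t *e) {
    e->busy = false;                        /* owner has surrendered the line */
}
```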
Issues with Contention Resolution

Race conditions must be escaped by NACKing requests to busy (pending-invalidate) entries, by queuing requests and granting them in sequence, or by some combination of the two. With NACKs, the original requestor retries.

Fairness: which requestor should be preferred in a conflict? Interconnect delivery order and distance both matter. We guarantee only that *some* node makes forward progress. Ping-ponging is a higher-level issue, with solutions like combining trees (for locks/barriers) and better shared-data-structure design.

Protocol and Directory Tradeoffs

Forwarding vs. strict request-reply:
- Speculative replies from the memory/directory node decrease critical-path length in the best case
- But they mean a more complex implementation (and potentially more network traffic)

Directory storage can imply protocol design. E.g., a linked list for the sharer set shortens the critical path by creating a chain of request forwarding, but increases complexity.

Hansson et al., "Avoiding message-dependent deadlock in network-based systems on chip," VLSI Design 2007.

Other Issues / Backup

Memory Consistency (Briefly)

We consider only sequential consistency [Lamport79] here. Sequential consistency gives the appearance that:
- All operations (reads and writes) happen atomically in one global order
- Operations from a single thread occur in program order within this stream

  Proc 0:  A = 1;
  Proc 1:  while (A == 0) ;  B = 1;
  Proc 2:  while (B == 0) ;  print A     // A = 1?

Thus, ordering between different memory locations exists: under SC, Proc 2 must print A = 1. More relaxed models exist; they usually require memory barriers when synchronizing.

Correctness Issue: Inclusion

What happens with multilevel caches? The snooping level (say, the L2) must know about all data in the private hierarchy above it (the L1). An inclusive cache is one solution [Baer88].

What about directories? The same requirement holds: the L2 must contain all data that the L1 contains, and must propagate invalidates upward to the L1 (see the sketch below).

Other options:
- Non-inclusive: the inclusion property is optional. (Why would the L2 ever evict a line when it has more sets and more ways than the L1? Prefetching!)
- Exclusive: a line is in the L1 xor the L2 (AMD K7)
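Here is a C sketch of maintaining inclusion between a private L1 and an inclusive L2, per the slide above. cache_t and its helpers are hypothetical stand-ins for real tag-lookup and eviction machinery; the point is that every L2 eviction or snooped invalidation must back-invalidate the L1, so the L2's tags stay a superset of the L1's.

```c
/* A sketch of the inclusion property: whenever an inclusive L2 gives up a
 * block (capacity eviction, or an invalidation arriving from the coherence
 * fabric), it must also remove any copy in the L1 above it. */
#include <stdint.h>
#include <stdbool.h>

typedef uint64_t addr_t;
typedef struct cache cache_t;               /* hypothetical cache handle */

bool cache_present(cache_t *c, addr_t block);  /* tag lookup */
void cache_remove(cache_t *c, addr_t block);   /* drop line, writing back if dirty */

/* Called whenever the L2 gives up a block, for any reason. */
void l2_drop_block(cache_t *l2, cache_t *l1, addr_t block) {
    if (cache_present(l1, block))
        cache_remove(l1, block);   /* back-invalidate: keep L1 a subset of L2 */
    cache_remove(l2, block);
}

/* A snooped invalidation only needs to probe the L2 tags: inclusion
 * guarantees the L1 holds nothing the L2 does not, so no separate L1
 * snoop port is required. */
void on_snoop_invalidate(cache_t *l2, cache_t *l1, addr_t block) {
    if (cache_present(l2, block))
        l2_drop_block(l2, l1, block);
}
```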