18-742 Parallel Computer Architecture
Lecture 5: Cache Coherence
Chris Craik (TA)
Carnegie Mellon University

Readings: Coherence

Required, for review:
- Papamarcos and Patel, "A low-overhead coherence solution for multiprocessors with private cache memories," ISCA 1984.
- Kelm et al., "Cohesion: A Hybrid Memory Model for Accelerators," ISCA 2010.

Required:
- Censier and Feautrier, "A new solution to coherence problems in multicache systems," IEEE Trans. Comput., 1978.
- Goodman, "Using cache memory to reduce processor-memory traffic," ISCA 1983.
- Laudon and Lenoski, "The SGI Origin: a ccNUMA highly scalable server," ISCA 1997.
- Lenoski et al., "The Stanford DASH Multiprocessor," IEEE Computer, 25(3):63-79, 1992.
- Martin et al., "Token coherence: decoupling performance and correctness," ISCA 2003.

Recommended:
- Baer and Wang, "On the inclusion properties for multi-level cache hierarchies," ISCA 1988.
- Lamport, "How to Make a Multiprocessor Computer That Correctly Executes Multiprocess Programs," IEEE Trans. Comput., Sept 1979, pp. 690-691.
- Culler and Singh, Parallel Computer Architecture, Chapters 5 and 8.

Shared Memory Model

Many parallel programs communicate through shared memory: Proc 0 writes to an address, and Proc 1 later reads it. This implies communication between the two.

  Proc 0: Mem[A] = 1
  Proc 1: ... print Mem[A]

Each read should receive the value last written by anyone. This requires synchronization (what does "last written" mean?). And what if Mem[A] is cached, at either end? A two-thread C sketch of this example follows.
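To make the shared-memory example concrete, here is a minimal two-thread C sketch. It is not from the slides: the flag-based synchronization and all names are illustrative assumptions. Without coherent caches, the load of mem_A in proc1 could observe a stale 0 even after the flag is set, which is exactly the problem the rest of this lecture addresses.

```c
/* A minimal sketch of the slide's Proc 0 / Proc 1 example as two threads
 * sharing memory. The "ready" flag is the synchronization that defines what
 * "last written" means; C11 release/acquire ordering stands in for the
 * coherence and consistency machinery discussed in this lecture. */
#include <stdatomic.h>
#include <stdio.h>
#include <pthread.h>

static int mem_A = 0;            /* the shared location Mem[A] */
static atomic_int ready = 0;     /* hypothetical synchronization flag */

static void *proc0(void *arg) {
    (void)arg;
    mem_A = 1;                   /* Proc 0: Mem[A] = 1 */
    atomic_store_explicit(&ready, 1, memory_order_release);
    return NULL;
}

static void *proc1(void *arg) {
    (void)arg;
    while (!atomic_load_explicit(&ready, memory_order_acquire))
        ;                        /* spin until Proc 0's write is visible */
    printf("%d\n", mem_A);       /* must print 1, never a stale 0 */
    return NULL;
}

int main(void) {
    pthread_t t0, t1;
    pthread_create(&t0, NULL, proc0, NULL);
    pthread_create(&t1, NULL, proc1, NULL);
    pthread_join(t0, NULL);
    pthread_join(t1, NULL);
    return 0;
}
```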
The Cache Coherence Problem

Basic question: if multiple processors cache the same block, how do they ensure they all see a consistent state?

Consider processors P1 and P2 with private caches, connected to main memory over an interconnection network, where location x initially holds 1000:

  1. P1: ld r2, x        (P1's cache now holds x = 1000)
  2. P2: ld r2, x        (P2's cache also holds x = 1000)
  3. P1: add r1, r2, r4
     P1: st x, r1        (P1's cached copy becomes 2000)
  4. P2: ld r5, x        (should NOT load the stale 1000)

Cache Coherence: Whose Responsibility?

Software:
- Can the programmer ensure coherence if caches are invisible to software?
- What if the ISA provided a cache-flush instruction? What needs to be flushed (which lines, and which caches)? When does this need to be done?

Hardware:
- Simplifies software's job
- Knows the sharer set
- Doesn't need to be conservative in synchronization

Coherence: Guarantees

Writes to location A by P0 should (eventually) be seen by P1, and all writes to A should appear in some order. Coherence needs to provide:
- Write propagation: guarantee that updates will propagate
- Write serialization: provide a consistent global order seen by all processors
Write serialization needs a global point of serialization for store ordering. Ordering between writes to *different* locations is a memory consistency model problem: a separate issue.

Coherence: Update vs. Invalidate

How can we safely update replicated data?
- Option 1: push updates to all copies
- Option 2: ensure there is only one copy (local), and update it

On a read: if the local copy isn't valid, put out a request (if another node has a copy, it returns it; otherwise memory does).

On a write: read the block into the cache as before, then:
- Update protocol: write to the block, and simultaneously broadcast the written data to sharers (other nodes update their cached copies if present)
- Invalidate protocol: write to the block, and simultaneously broadcast an invalidation of the address to sharers (other nodes evict the block from their caches)

Update vs. Invalidate: Which Do We Want?

Write frequency and sharing behavior are critical.

Update:
+ If the sharer set is constant and updates are infrequent, avoids the cost of the invalidate-reacquire pattern (broadcast updates instead)
- If data is rewritten without intervening reads by other cores, the updates were useless
- With a write-through cache policy, the bus becomes a bottleneck

Invalidate:
+ After the invalidation broadcast, the writing core has exclusive access rights
+ Only cores that keep reading after each write retain a copy
- If write contention is high, leads to ping-ponging (rapid mutual invalidation and reacquisition)

Cache Coherence Methods

How do we ensure that the proper caches are updated?

Snoopy bus [Goodman83, Papamarcos84]:
- Bus-based, single point of serialization
- Processors observe other processors' actions and infer ownership
- E.g., P1 puts a "read-exclusive" request for A on the bus; P0 sees this and invalidates its own copy of A

Directory [Censier78, Lenoski92, Laudon97]:
- Single point of serialization per block, distributed among nodes
- Processors make explicit requests for blocks
- The directory tracks ownership (the sharer set) for each block and coordinates invalidation appropriately
- E.g., P1 asks the directory for an exclusive copy; the directory asks P0 to invalidate, waits for the ACK, then responds to P1

Snoopy Bus vs. Directory Coherence

Snoopy:
+ Critical path is short: miss -> bus transaction to memory
+ Global serialization is easy: the bus provides it already (arbitration)
+ Simple: bus-based uniprocessors adapt easily
- Requires a single point of serialization (the bus): not scalable (it is not quite true that snooping needs a bus; recent work on this later)

Directory:
- Requires extra storage to track sharer sets (can be approximate: false positives are OK)
- Adds indirection to the critical path: request -> directory -> memory
- Protocols and race conditions are more complex
+ Exactly as scalable as the interconnect and directory storage (much more scalable than a bus)

Snoopy-Bus Coherence

Snoopy Cache Coherence

Caches "snoop" (observe) each other's read/write operations. A simple protocol for a write-through, no-write-allocate cache, with actions PrRd, PrWr, BusRd, BusWr (notation: ObservedEvent/Action):

- Valid:   PrRd/--; PrWr/BusWr; an observed BusWr -> Invalid
- Invalid: PrRd/BusRd -> Valid; PrWr/BusWr (stays Invalid: no allocation on write)

Snoopy Invalidation Protocol: MSI

Extend the single valid bit per block to three states:
- M(odified): this cache line is the only copy, and it is dirty
- S(hared): this cache line is one of several copies
- I(nvalid): not present

A read miss makes a Read request on the bus and saves the block in S state. A write miss makes a ReadEx request and saves the block in M state. When a processor snoops a ReadEx from another writer, it must invalidate its own copy (if any). An S -> M upgrade can be made without re-reading the data from memory (via an Invl request).

MSI State Machine (notation: ObservedEvent/Action) [Culler/Singh96]

- M: PrRd/--; PrWr/--; BusRd/Flush -> S; BusRdX/Flush -> I
- S: PrRd/--; BusRd/--; PrWr/BusRdX -> M; BusRdX/-- -> I
- I: PrRd/BusRd -> S; PrWr/BusRdX -> M

A code sketch of these transitions follows.
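Here is a minimal C sketch of the MSI transitions just listed, from a single cache's point of view. All types and function names are hypothetical; a real controller would also handle bus arbitration, data movement, and writeback queues, which these two functions abstract away.

```c
/* A compact sketch of the MSI state machine above for one cache line.
 * msi_on_proc handles this processor's own accesses; msi_on_snoop handles
 * transactions observed on the bus from other caches. */
#include <stdbool.h>

typedef enum { INVALID, SHARED, MODIFIED } msi_state_t;
typedef enum { BUS_NONE, BUS_RD, BUS_RDX, BUS_FLUSH } bus_op_t;

/* Processor-side events: returns the bus transaction this cache must issue. */
bus_op_t msi_on_proc(msi_state_t *s, bool is_write) {
    switch (*s) {
    case MODIFIED:
        return BUS_NONE;                        /* PrRd/--, PrWr/-- */
    case SHARED:
        if (!is_write) return BUS_NONE;         /* PrRd/-- */
        *s = MODIFIED; return BUS_RDX;          /* PrWr/BusRdX (S -> M upgrade) */
    case INVALID:
        if (is_write) { *s = MODIFIED; return BUS_RDX; }  /* PrWr/BusRdX */
        *s = SHARED; return BUS_RD;             /* PrRd/BusRd */
    }
    return BUS_NONE;
}

/* Snooped events from other caches: returns BUS_FLUSH if we must write back. */
bus_op_t msi_on_snoop(msi_state_t *s, bus_op_t observed) {
    bus_op_t action = BUS_NONE;
    if (*s == MODIFIED)
        action = BUS_FLUSH;                     /* BusRd/Flush, BusRdX/Flush */
    if (observed == BUS_RDX)
        *s = INVALID;                           /* any copy is invalidated */
    else if (observed == BUS_RD && *s == MODIFIED)
        *s = SHARED;                            /* downgrade M -> S */
    return action;
}
```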
MSI Correctness

Write propagation: immediately after a write, the new value exists only in the writer's cache, and the block is in M state. The transition into M ensures all other caches are in I state, so a subsequent read in another thread misses and issues a BusRd; the BusRd causes a flush (writeback), and the read sees the new value.

Write serialization: only one cache can be in M state at a time. Entering M generates a BusRdX, and the BusRdX forces a transition out of M in all other caches. The order of block ownership (M state) therefore defines the write order, and this order is global by virtue of the central bus.

More States: MESI

The Invalid -> Shared -> Modified sequence takes two bus operations. What if the data is not shared? The broadcasts are unnecessary. Add an Exclusive state: this is the only copy, and it is clean.
- A block becomes Exclusive if, during the BusRd, no other cache had it. A wired-OR "shared" signal on the bus determines this: snooping caches assert the signal if they also hold a copy.
- A later BusRd from another cache causes a transition into Shared.
- A silent Exclusive -> Modified transition is possible on a write!
MESI is also called the Illinois protocol [Papamarcos84].

MESI State Machine (S = shared signal asserted, S' = not asserted) [Culler/Singh96]

- I: PrRd(S')/BusRd -> E; PrRd(S)/BusRd -> S; PrWr/BusRdX -> M
- E: PrRd/--; PrWr/-- -> M (silent); BusRd/$-transfer -> S
- S: PrRd/--; BusRd/--; PrWr/BusRdX -> M
- M: PrRd/--; PrWr/--; BusRd/Flush -> S
- All states: an incoming BusRdX moves the block to I (with a Flush when leaving M)

Snoopy Invalidation Tradeoffs

Should a downgrade from M go to S or I?
- S: if the data is likely to be reused
- I: if the data is likely to be migratory

On a BusRd, should the data come from another cache or from memory?
- Another cache may be faster (cache-to-cache transfer), if memory is slow or highly contended
- Memory is simpler: it doesn't need to wait to see if a cache has the data first, and there is less contention at the other caches; but this requires a writeback on the M downgrade

Is a writeback on the Modified -> Shared downgrade necessary? One alternative: an Owner (O) state (a MOESI system). One cache owns the latest data (memory is not updated); the memory writeback happens only when all caches have evicted their copies.

Update-Based Coherence: Dragon

Four states:
- E(xclusive): only copy, clean
- Sc (shared, clean): clean with respect to the Sm copy, not with respect to memory
- Sm (shared, modified): the Owner state of MOESI
- M(odified): only copy, dirty

The use of updates allows multiple copies of dirty data. There is no I state, because there are no invalidations (only ordinary evictions). Invariant: at most one Sm copy per sharer set. If a cache is Sm, it is authoritative; the Sc caches are clean relative to it, not clean relative to memory.

McCreight, E., "The Dragon computer system: an early overview," Tech Report, Xerox Corp., Sept 1984 (cited in Culler/Singh96).

Dragon State Machine (S = shared signal asserted, S' = not asserted) [Culler/Singh96]

- On a miss: PrRdMiss/BusRd(S') -> E; PrRdMiss/BusRd(S) -> Sc; PrWrMiss/BusRd(S') -> M; PrWrMiss/(BusRd(S); BusUpd) -> Sm
- E:  PrRd/--; PrWr/-- -> M; BusRd/-- -> Sc
- Sc: PrRd/--; BusUpd/Update; PrWr/BusUpd(S) -> Sm; PrWr/BusUpd(S') -> M
- Sm: PrRd/--; PrWr/BusUpd(S); BusUpd/Update -> Sc; BusRd/Flush; PrWr/BusUpd(S') -> M
- M:  PrRd/--; PrWr/--; BusRd/Flush -> Sm

Update-Protocol Tradeoffs

Shared-Modified state vs. keeping memory up-to-date: with multiple sharers, updating memory on every write is equivalent to a write-through cache.

Immediate vs. lazy notification of an Sc-block eviction:
- Immediate: if only one sharer is left, it can go to E or M and save a later BusUpd (only one is saved)
- Lazy: no extra traffic required upon eviction

Directory-Based Coherence

Directory-Based Protocols

Required when scaling past the capacity of a single bus. The system is distributed, but coherence still requires a single point of serialization (for write serialization); that point can be different for every block (striped across nodes). We can therefore reason about the protocol for a single block: one server (the directory node) and many clients (the private caches). The directory receives Read and ReadEx requests and sends Invl requests: invalidation is explicit (as opposed to inferred, as on a snoopy bus).

Directory: Data Structures

  Address   Sharing state
  0x00      Shared: {P0, P1, P2}
  0x04      --
  0x08      Exclusive: P2
  0x0C      --
  ...       ...

The key operation to support is a set-inclusion test. False positives are OK: we want to know which caches *may* contain a copy of a block, and spurious invalidations are ignored. The false-positive rate determines performance. The most accurate (and most expensive) representation is a full bit-vector; compressed representations, linked lists, and Bloom filters [Zebchuk09] are all possible. Here, we assume the directory has perfect knowledge. A bit-vector sketch follows.
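Here is a C sketch of the full bit-vector representation, assuming up to 64 nodes; dir_entry_t and the helper names are illustrative, not from the lecture. The set-inclusion test is a single bit probe, which is why the bit-vector is the most accurate (and, per entry, the most storage-hungry) choice.

```c
/* A sketch of a full-bit-vector directory entry for up to 64 nodes. An exact
 * sharer set never produces spurious Invls; approximate schemes (coarse
 * vectors, Bloom filters) trade storage for false positives, which only cost
 * extra, harmless invalidation messages. */
#include <stdint.h>
#include <stdbool.h>

typedef enum { DIR_UNCACHED, DIR_SHARED, DIR_EXCLUSIVE } dir_state_t;

typedef struct {
    dir_state_t state;
    uint64_t    sharers;    /* bit i set => node i may hold a copy */
} dir_entry_t;

/* Set-inclusion test: may node n hold a copy of this block? */
bool dir_may_have_copy(const dir_entry_t *e, int n) {
    return (e->sharers >> n) & 1;
}

/* A read adds the requestor to the sharer set. */
void dir_add_sharer(dir_entry_t *e, int n) {
    e->sharers |= (uint64_t)1 << n;
    e->state = DIR_SHARED;
}

/* A write leaves exactly one copy; all others must have been invalidated. */
void dir_make_exclusive(dir_entry_t *e, int owner) {
    e->sharers = (uint64_t)1 << owner;
    e->state = DIR_EXCLUSIVE;
}
```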
Directory: Basic Operations

The directory follows the semantics of a snoop-based system, but with explicit request and reply messages. It:
- Receives Read, ReadEx, and Upgrade requests from nodes
- Sends Inval/Downgrade messages to sharers if needed
- Forwards the request to memory if needed
- Replies to the requestor and updates the sharing state

Protocol design is flexible: the exact forwarding paths depend on the implementation (for example, whether to do cache-to-cache transfer).

MESI Directory Transaction: Read

P0 acquires an address for reading (Culler/Singh Fig. 8.16):
  1. Read:  P0 -> Home
  2. DatEx (or DatShr, if other sharers exist):  Home -> P0

ReadEx with a Former Owner

  1.  RdEx:  P0 -> Home
  2.  Invl:  Home -> Owner
  3a. Rev (revision message updating the directory):  Owner -> Home
  3b. DatEx:  Owner -> P0

Contention Resolution (for a Write)

P0 and P1 both request an exclusive copy:
  1a. RdEx:  P0 -> Home
  1b. RdEx:  P1 -> Home
  2a. DatEx: Home -> P0  (P0 wins)
  2b. NACK:  Home -> P1  (entry is busy)
  3.  RdEx:  P1 -> Home  (retry)
  4.  Invl:  Home -> P0
  5a. Rev:   P0 -> Home
  5b. DatEx: Home -> P1

A sketch of the home node's side of these transactions follows.
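Here is a C sketch of how a home node might serve the RdEx transactions above, extending the earlier dir_entry_t with a busy flag. All message and helper names are hypothetical, and for brevity the shared-case path grants exclusivity without collecting invalidation ACKs, which a real protocol must do.

```c
/* A sketch of a home node serving RdEx. Races are escaped by NACKing
 * requests that hit a busy (pending-invalidate) entry, as on the slide;
 * the NACKed requestor retries. send_msg is a hypothetical interconnect
 * primitive. */
#include <stdint.h>
#include <stdbool.h>

typedef enum { MSG_RDEX, MSG_INVL, MSG_REV, MSG_DATEX, MSG_NACK } msg_t;
typedef enum { DIR_UNCACHED, DIR_SHARED, DIR_EXCLUSIVE } dir_state_t;

void send_msg(int dst, msg_t m, uint64_t block);   /* provided elsewhere */

typedef struct {
    dir_state_t state;
    uint64_t    sharers;   /* bit-vector, as in the earlier sketch */
    bool        busy;      /* a transaction is in flight for this block */
} dir_entry_t;

void home_on_rdex(dir_entry_t *e, uint64_t block, int requestor) {
    if (e->busy) {                          /* pending invalidate: NACK, retry */
        send_msg(requestor, MSG_NACK, block);
        return;
    }
    if (e->state == DIR_EXCLUSIVE) {        /* former owner must give up line */
        int owner = __builtin_ctzll(e->sharers);  /* index of lone sharer bit */
        e->busy = true;
        send_msg(owner, MSG_INVL, block);   /* owner Revs home, DatExs requestor */
    } else {
        for (int n = 0; n < 64; n++)        /* invalidate any other sharers */
            if (((e->sharers >> n) & 1) && n != requestor)
                send_msg(n, MSG_INVL, block);
        send_msg(requestor, MSG_DATEX, block);  /* simplified: no ACK collection */
    }
    e->sharers = (uint64_t)1 << requestor;
    e->state = DIR_EXCLUSIVE;
}

void home_on_rev(dir_entry_t *e) {
    e->busy = false;                        /* owner has surrendered the line */
}
```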
Issues with Contention Resolution

Race conditions must be escaped by NACKing requests to busy (pending-invalidate) entries, by queuing requests and granting them in sequence, or by some combination of the two. With NACKs, the original requestor retries.

Fairness: which requestor should be preferred in a conflict? Interconnect delivery order and distance both matter. We guarantee only that *some* node makes forward progress. Ping-ponging is a higher-level issue, with solutions like combining trees (for locks/barriers) and better shared-data-structure design.

Protocol and Directory Tradeoffs

Forwarding vs. strict request-reply:
- Speculative replies from the memory/directory node decrease critical-path length in the best case
- But they mean a more complex implementation (and potentially more network traffic)

Directory storage can imply protocol design. E.g., a linked list for the sharer set shortens the critical path by creating a chain of request forwarding, but increases complexity.

Hansson et al., "Avoiding message-dependent deadlock in network-based systems on chip," VLSI Design 2007.

Other Issues / Backup

Memory Consistency (Briefly)

We consider only sequential consistency [Lamport79] here. Sequential consistency gives the appearance that:
- All operations (reads and writes) happen atomically in one global order
- Operations from a single thread occur in program order within this stream

  Proc 0:  A = 1;
  Proc 1:  while (A == 0) ;  B = 1;
  Proc 2:  while (B == 0) ;  print A     // A = 1?

Thus, ordering between different memory locations exists: under SC, Proc 2 must print A = 1. More relaxed models exist; they usually require memory barriers when synchronizing.

Correctness Issue: Inclusion

What happens with multilevel caches? The snooping level (say, the L2) must know about all data in the private hierarchy above it (the L1). An inclusive cache is one solution [Baer88].

What about directories? The same requirement holds: the L2 must contain all data that the L1 contains, and must propagate invalidates upward to the L1 (see the sketch below).

Other options:
- Non-inclusive: the inclusion property is optional. (Why would the L2 ever evict a line when it has more sets and more ways than the L1? Prefetching!)
- Exclusive: a line is in the L1 xor the L2 (AMD K7)
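Here is a C sketch of maintaining inclusion between a private L1 and an inclusive L2, per the slide above. cache_t and its helpers are hypothetical stand-ins for real tag-lookup and eviction machinery; the point is that every L2 eviction or snooped invalidation must back-invalidate the L1, so the L2's tags stay a superset of the L1's.

```c
/* A sketch of the inclusion property: whenever an inclusive L2 gives up a
 * block (capacity eviction, or an invalidation arriving from the coherence
 * fabric), it must also remove any copy in the L1 above it. */
#include <stdint.h>
#include <stdbool.h>

typedef uint64_t addr_t;
typedef struct cache cache_t;               /* hypothetical cache handle */

bool cache_present(cache_t *c, addr_t block);  /* tag lookup */
void cache_remove(cache_t *c, addr_t block);   /* drop line, writing back if dirty */

/* Called whenever the L2 gives up a block, for any reason. */
void l2_drop_block(cache_t *l2, cache_t *l1, addr_t block) {
    if (cache_present(l1, block))
        cache_remove(l1, block);   /* back-invalidate: keep L1 a subset of L2 */
    cache_remove(l2, block);
}

/* A snooped invalidation only needs to probe the L2 tags: inclusion
 * guarantees the L1 holds nothing the L2 does not, so no separate L1
 * snoop port is required. */
void on_snoop_invalidate(cache_t *l2, cache_t *l1, addr_t block) {
    if (cache_present(l2, block))
        l2_drop_block(l2, l1, block);
}
```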