A Low-Overhead Coherence Solution for Multiprocessors with Private Cache Memories
Also known as “Snoopy cache”
Paper by: Mark S. Papamarcos and Janak H. Patel
Presented by: Cameron Mott
3/25/2005

Outline
- Goals
- Outline
- Examples
- Solutions
- Details on this method
- Results
- Analysis
- Success
- Comments/Questions

Goals
- Reduce bus traffic
- Reduce bus wait
- Increase the possible number of processors before the bus saturates
- Increase processor utilization
- Low cost
- Extensible
- Long life for the strategy

Structure
The typical layout for a multi-processor machine:

Difficulties
- Bus speed and saturation limit processor utilization (there is a single time-shared bus with an arbitration mechanism).
- This scheme suffers from the well-known data consistency or “cache coherence” problem, where two processors have the same writable data block in their private caches.

Coherence example
- Process communication in shared-memory multiprocessors can be implemented by exchanging information through shared variables.
- This sharing can result in several copies of a shared block in one or more caches at the same time.

Time | Event                 | Cache contents for CPU A | Cache contents for CPU B | Memory contents for location X
-----|-----------------------|--------------------------|--------------------------|-------------------------------
0    |                       |                          |                          | 1
1    | CPU A reads X         | 1                        |                          | 1
2    | CPU B reads X         | 1                        | 1                        | 1
3    | CPU A stores 0 into X | 0                        | 1                        | 0

Enforcing Coherence
Styles:
- Hardware based
  - Use a global table; the table keeps track of what memory is held and where.
- “Snoopy” cache
  - No need for centralized hardware
  - All processors share the same cache bus
  - Each cache “snoops” on, or listens to, cache transactions from other processors
  - Used in CSM machines using a bus

Snoopy caches
To solve coherence, each processor can send out the address of the block that is being written in its cache; every other cache that holds that block then invalidates its local copy (this is called broadcast invalidate).

Other Snoopy Methods
Broadcast-Invalidate
Any write to cache transmits the address throughout the system. Other caches check their directory and purge the block if it exists locally.
This does not require extra status bits, but it does eat up a lot of bus time.

Improvements to above
- Introduce a bias filter. The bias filter is a small associative memory that stores the most frequently invalidated blocks.

Goodman’s Strategy
- Goodman proposes a strategy for multiprocessor systems with independent caches but a shared bus.
- An invalidate is broadcast only the first time a block is written in cache (thus “write once”). That first write is also written through to main memory.
- If a block in cache is written more than once, the later writes stay in the cache, and the block must be written back to memory before it is replaced.

Write-Once
Combination of write-through and write-back.

Example
Online example: http://www.cs.tcd.ie/Jeremy.Jones/vivio/caches/writeOnceHelp.htm
Note that the only browser that displayed this on my computer was IE…

Details
Two bits in each block in the cache keep track of the status of that block.
1. Invalid = The data in this line is not present or is not valid.
2. Exclusive-Unmodified (Excl-Unmod) = This is an exclusive cache line. The line is coherent with memory and is held unmodified in only one cache. The cache owns the line and can modify it without having to notify the rest of the system. No other caches in the system may have a copy of this line.
3. Shared-Unmodified (Shared-Unmod) = This is a shared cache line. The line is coherent with memory and may be present in several caches. Caches must notify the rest of the system about any changes to this line. The main memory owns this cache line.
4. Exclusive-Modified (Excl-Mod) = There is modified data in this cache line. The line is incoherent with memory, so the cache is said to own the line. No other caches in the system may have a copy of this line.

Other papers discuss MESI caches. How does this fit with Papamarcos and Patel’s work?
- M: Exclusive Modified
- E: Exclusive Unmodified
- S: Shared Unmodified
- I: Invalid

Details (cont.)
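The four states, and the way a snooping cache reacts to the Read and Read With Intent to Modify bus cycles covered in these Details slides, can be sketched as a small state table. This is a minimal Python sketch for illustration only, not the paper's hardware design: the names State, on_local_write_hit, and on_snoop are hypothetical, the bus is reduced to the two strings "read" and "rwitm", and the abort-and-retry versus direct-supply choice for a modified line is collapsed into a single flag.

```python
from enum import Enum


class State(Enum):
    """Two status bits per block; values are the matching MESI letters."""
    INVALID = "I"
    EXCL_UNMOD = "E"
    SHARED_UNMOD = "S"
    EXCL_MOD = "M"


def on_local_write_hit(state):
    """Local processor writes a block it already holds.

    Returns (new_state, needs_bus_invalidate).
    """
    if state is State.INVALID:
        raise ValueError("a write to an Invalid line is a miss, not a hit")
    # Only a Shared-Unmod line has to notify the other caches;
    # an exclusive line can be modified without any bus signal.
    return State.EXCL_MOD, state is State.SHARED_UNMOD


def on_snoop(state, bus_op):
    """Another cache's request for this block is seen on the bus.

    Returns (new_state, must_supply_or_write_back).
    """
    if bus_op not in ("read", "rwitm"):
        raise ValueError(f"unknown bus operation: {bus_op!r}")
    if state is State.INVALID:
        return State.INVALID, False
    # A modified line is the only up-to-date copy, so it has to be
    # written back (or supplied directly) before the request completes.
    holds_modified_data = state is State.EXCL_MOD
    if bus_op == "read":
        return State.SHARED_UNMOD, holds_modified_data
    # Read With Intent to Modify: every other copy is invalidated.
    return State.INVALID, holds_modified_data
```

For example, on_snoop(State.EXCL_MOD, "read") yields Shared-Unmod with the write-back flag set, while on_local_write_hit(State.EXCL_UNMOD) moves to Excl-Mod with no bus signal.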
Snoopy cache actions:
- Read With Intent to Modify: This is the “write” cycle. If the address on the bus matches a Shared or Exclusive line, the line is invalidated. If a line is Modified, the cache must cause the bus action to abort, write the modified line back to memory, invalidate the line, and then allow the bus read to retry. Alternatively, the owning cache can supply the line directly to the requestor across the bus.
- Read: If the address on the bus matches a Shared line, there is no change. If the line is Exclusive, the state changes to Shared. If a line is Modified, the cache must cause the bus action to abort, write the modified line back to memory, change the line to Shared, and then allow the bus read to retry. Alternatively, the owning cache can supply the line to the requestor directly and change its state to Shared.

Flow diagrams
- Another cache can now provide the requested memory. This changes the status bits to shared-unmod.
- The block is also written back to memory if another cache had an Excl-Mod entry for that block. The status of that block is then changed to shared-unmod after being written and shared with the other processor.
- Writes cause any other cache to set the corresponding entry to invalid.
- If memory provided the block, the status becomes exclusive-unmod.
- No signal is necessary if the status is not shared-unmod.

Problems
What if a block is Shared-Unmodified and two caches attempt to change the block at the same time?
- Depending on the implementation, the bus provides the “sync” mechanism.
- Only one processor can have control of the bus at any one time. This provides a contention mechanism to determine which processor wins.
- Requires that this operation be indivisible.

Results
Results were analyzed using an approximation algorithm. Is this appropriate? Can an approximation be used to justify the algorithm?
Accuracy of the approximation: error rate of less than 5% in certain circumstances.

Parameters

Variable label | Number assumed for calculations | Description
---------------|---------------------------------|------------
N              |                                 | Number of processors
a              | 90%                             | Processor memory reference rate (cache requests)
m              | 5%                              | Miss ratio
w              | 20%                             | Fraction of memory references that are writes
d              | 50%                             | Probability that a block in cache has been locally modified (“dirty”)
u              | 30%                             | Fraction of write requests that reference unmodified blocks
s              | 5%                              | Fraction of write requests that reference shared blocks
A              | 1                               | Number of cycles required for bus arbitration
T              | 2                               | Number of cycles for a block transfer
I              | 2                               | Number of cycles for a block invalidate
W              |                                 | Average waiting time per bus request
b              | (Derived)                       | Average number of bus requests per unit of useful processor activity

Miss Ratio

Miss Ratio (Cont)

Degree of Sharing

Write Back Probability

Block Transfer Time

Cost of implementing

Note
- This algorithm and structure do have a finite limit on the number of supported processors.
- Diminishing returns in performance are noted as the number of processors increases. Thus, this strategy should not be used in systems of 30 processors or more (as an estimate).
- This all depends on the system parameters, of course, but the strategy is limited by these factors.
- For a system with a limited number of processors, this strategy is very effective, and it is in use today.

References
- A Low-Overhead Coherence Solution for Multiprocessors with Private Cache Memories. Mark S. Papamarcos and Janak H. Patel.
- “Cache Coherence.” Srini Devadas. http://csg.csail.mit.edu/u/d/devadas/public_html/6.004/Lectures/lect23/sld001.htm
- “Dynamic Decentralized Cache Schemes for MIMD Parallel Processors.” Tu Phan. http://www.cs.nmsu.edu/~pfeiffer/classes/573/sem/s03/presentations/Dynamic%20Decentralized%20Cache%20Schemes.ppt
- H&P 3rd. Edition. Mark Smotherman. http://www.cs.clemson.edu/~mark/464/hp3e6.html
- Vivio: Write Once cache coherency protocol. Jeremy Jones. http://www.cs.tcd.ie/Jeremy.Jones/vivio/caches/writeOnceHelp.htm