2015-12-01

Lecture 10: Cache Coherence
• Introduction
• Directory protocols
• Snoopy protocols
• L1-L2 consistency

Zebo Peng, IDA, LiTH          TDTS 08 – Lecture 10

Review: What is a Cache?
A small, fast memory used to improve memory system performance. It exploits spatial and temporal locality.
We have conceptually many caches:
• Registers are a cache for the first-level cache;
• The first-level cache is a cache for the second-level cache;
• The second-level cache is a cache for main memory; and
• Main memory is a cache for the hard disk (virtual memory).
(Figure: the hierarchy from smaller/faster to bigger/slower: processor registers, L1 cache, L2 cache, main memory, disk, tape, etc.)

What Happens with a Multiprocessor?
(Figure: processors Pi and Pj, each with a private cache ($), sharing memory M and I/O.)
Different processors may access values at the same memory location:
• Multiple copies of the same data may exist in different caches.
• How do we ensure data integrity at all times? An update by a processor at time t should be available to the other processors at time t+1.
• I/O may also address main memory directly.
In particular, WRITE operations must be carefully coordinated.

Write Through
All writes go to main memory as well as to the cache.
• Cache and memory are consistent.
• Writes slow down to main-memory access speed.
• This is acceptable, since the write percentage is small (ca. 15%).
For a multiprocessor:
• There is additional inconsistency due to other cache copies of the same memory location.
• Multiple CPUs must therefore monitor main-memory traffic to keep their local caches up to date.
• This may lead to lots of traffic and monitoring activity.

Write Back
Updates are initially made in the cache only.
• Cache and memory are not consistent by nature.
• An update bit for the cache slot is set when an update occurs.
• If a block is to be replaced, it is written to main memory only if its update bit is set.
For a multiprocessor:
• Other caches also get out of sync: a mechanism must be used to maintain cache coherence.
• I/O must access main memory through the cache as well.
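The two write policies can be contrasted in a toy uniprocessor sketch (a minimal sketch with a hypothetical Cache class; real caches track whole lines rather than single words and handle replacement automatically):

```python
# Toy model of the write-through vs. write-back policies described above.
# Write-through updates memory on every store; write-back defers the update
# until the line is replaced, using an update (dirty) bit.

class Cache:
    def __init__(self, memory, write_back=False):
        self.memory = memory          # backing store: {address: value}
        self.lines = {}               # address -> value held in the cache
        self.update_bits = set()      # addresses modified since last memory write
        self.write_back = write_back  # selects the policy

    def write(self, addr, value):
        self.lines[addr] = value
        if self.write_back:
            self.update_bits.add(addr)     # defer: just mark the line dirty
        else:
            self.memory[addr] = value      # write through: memory stays consistent

    def evict(self, addr):
        # On replacement, a write-back cache flushes only dirty lines.
        if self.write_back and addr in self.update_bits:
            self.memory[addr] = self.lines[addr]
            self.update_bits.discard(addr)
        self.lines.pop(addr, None)

mem = {0x10: 0}
wb = Cache(mem, write_back=True)
wb.write(0x10, 7)
print(mem[0x10])      # -> 0: memory is stale until the line is evicted
wb.evict(0x10)
print(mem[0x10])      # -> 7
```

This is exactly the multiprocessor problem in miniature: between the write and the eviction, any other observer of `mem` sees a stale value.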
Software Solutions
Based on code analysis:
• Determine which data items may become unsafe for caching, e.g., global variables used and updated by several processes. Mark them so that they are not cached.
• Alternatively, determine the unsafe periods and insert code to enforce cache coherence.
The compiler and the OS deal with the problem:
• The overhead is transferred to compile time.
• Design complexity is transferred from hardware to software.
• However, software tends to make conservative decisions, leading to inefficient cache utilization and low performance of the memory system.

Hardware Solutions
Use cache coherence protocols:
• Dynamic recognition of potential problems: a run-time solution.
• More efficient use of the cache.
• Transparent to the programmer.
Two main techniques:
• Directory protocols
• Snoopy protocols

Lecture 10: Cache Coherence
• Introduction
• Directory protocols
• Snoopy protocols
• L1-L2 consistency

Directory Protocols
A directory is used to collect and maintain information about copies of data in caches.
• The directory is stored in the memory system.
• Memory references are checked against the directory, and the appropriate transfers are performed.
• A memory access may be avoided by copying the data from another cache rather than fetching it from memory.
• This creates a central bottleneck, the directory; we can have multiple directories.
• Effective in large-scale systems with complex interconnection schemes, such as NUMA.

Directory Protocol Example
Each node maintains a directory for a portion of memory, together with cache status.
(Figure: nodes, each with a cache coherence directory.)

Directory Protocol Example (Cont'd)
(Figure: location 798 resides in the memory of node 1; node 2 obtains a copy.)

Memory Access Sequence
Node 2 processor 3 (P2-3) requests location 798, which is in the memory of node 1:
1. P2-3 issues a read request on the snoopy bus of node 2.
2. The directory on node 2 recognizes that the location is on node 1.
3. Node 2's directory sends a request to node 1's directory.
4. Node 1's directory requests the contents of 798.
5. Node 1's memory puts the data on the (node 1 local) bus.
6. Node 1's directory gets the data from the (node 1 local) bus.
7. The data is transferred to node 2's directory.
8. Node 2's directory puts the data on the (node 2 local) bus.
9. The data is picked up, put in P2-3's cache, and delivered to the processor.

Cache Coherence Operations
• Node 1's directory keeps a note that node 2 has a copy of the data (at location 798).
• If the data is modified in a cache, this is broadcast to the other nodes; local directories monitor and purge their local caches if necessary.
• A local directory monitors changes to local data in remote caches and marks the memory invalid until write back.
• A local directory forces a write back if the memory location is requested by another processor.

Lecture 10: Cache Coherence
• Introduction
• Directory protocols
• Snoopy protocols
• L1-L2 consistency

Snoopy Protocols
Distribute the cache coherence responsibility among all cache controllers:
• The initiating cache controller recognizes that a line is shared.
• Updates are announced to all other caches.
• Each cache controller "snoops" on the network to observe these broadcast notifications, and reacts accordingly.
Ideally suited to a bus-based multiprocessor system:
• The bus provides a simple means of broadcasting and snooping.
• Bus traffic increases due to broadcasting and snooping.
Two approaches: write invalidate and write update.
(Figure: processors with private caches ($) on a shared bus to memory M.)

Write Invalidate SP
Suitable for multiple readers, but one writer at a time.
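The directory-based read sequence traced in the Memory Access Sequence slide above can be sketched as follows (a minimal sketch with hypothetical Node and read names; it folds the nine steps into one function and ignores the local snoopy buses inside each node):

```python
# Toy model of a directory-based remote read: node 2's directory forwards
# the request to the home node (node 1), which supplies the data and records
# the new sharer in its directory.

class Node:
    def __init__(self, name, memory):
        self.name = name
        self.memory = memory          # {address: value} owned by this node
        self.directory = {}           # address -> set of node names holding a copy
        self.cache = {}               # local cache: address -> value

def read(requesting_node, home_node, addr):
    """Read of addr, homed at home_node, issued by requesting_node."""
    # Steps 1-2: the request misses locally, and the local directory
    # recognizes that addr belongs to the home node.
    if addr in requesting_node.cache:
        return requesting_node.cache[addr]
    # Steps 3-6: the home directory fetches the value from its memory.
    value = home_node.memory[addr]
    # The home directory records that the requester now holds a copy
    # (needed later to invalidate copies or force a write back).
    home_node.directory.setdefault(addr, set()).add(requesting_node.name)
    # Steps 7-9: the data travels back and is cached at the requester.
    requesting_node.cache[addr] = value
    return value

node1 = Node("node1", {798: 42})      # location 798 lives in node 1's memory
node2 = Node("node2", {})
print(read(node2, node1, 798))        # -> 42
print(node1.directory[798])           # -> {'node2'}
```

The directory entry is what replaces broadcasting: on a later write, the home node consults `directory[798]` and contacts only the recorded sharers.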
Generally, a line may be shared among several caches for reading purposes.
• When one of the caches wants to write to the line, it first issues a notice to invalidate the line in all the other caches.
• The writing processor then has exclusive access until the line is required by another processor.
Used in Pentium and PowerPC systems:
• The state of every cache line is marked as Modified, Exclusive, Shared, or Invalid: the MESI protocol.

Snoopy Cache Organization
(Figure: each processor has a cache with duplicated tags and snoop hardware, connected by address/data lines to memory; a dirty bit marks modified lines.)
The MESI states:
• Modified: modified, and differs from memory.
• Exclusive: same as in memory, and not in other caches.
• Shared: same as in memory, and present in other caches.
• Invalid: contents not valid.

MESI State Transition Diagram
(Figure: two state diagrams over the states Modified, Exclusive, Shared, and Invalid. One is for the cache line at the initiating processor, where the next event is triggered by the processor: read hit (RH), write hit (WH), read miss shared (RMS), write miss (WM). The other is for the line in the snooping caches, where the next event is triggered by the bus: snoop hit on read (SHR), snoop hit on write/invalidate (SHW). Transitions involve line fill, copy back, invalidate, and read-with-intent-to-modify operations.)

Write Update SP
Works well with multiple readers and writers:
• The updated word is distributed to all other processors.
(Figure: processors with private caches ($) on a shared bus to memory M.)
It may generate many unnecessary updates:
• if a processor just reads a value once and does not need it again; or
• if a processor updates a value many times before it is read by the other processors (bad programming).

Invalidate vs. Update Protocols
An update protocol may generate many unnecessary cache updates.
• However, if two processors make interleaved reads and updates to a variable, an update protocol is better.
An invalidate protocol may lead to many memory accesses.
Both protocols suffer from false-sharing overheads:
• Two words are not shared; however, they lie on the same cache line.
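The write-invalidate transitions above can be sketched as a toy simulator (hypothetical CacheLine, processor_read, and processor_write names; no bus model, arbitration, or copy back of Modified data):

```python
# Toy model of MESI write-invalidate for a single line held in two caches:
# a read miss fills the line as Exclusive or Shared, and a write broadcasts
# an invalidate (SHW) to all other copies before entering Modified.

MODIFIED, EXCLUSIVE, SHARED, INVALID = "M", "E", "S", "I"

class CacheLine:
    def __init__(self):
        self.state = INVALID
        self.value = None

def processor_read(line, others, memory_value):
    """Read at the initiating processor (read hit or read-miss fill)."""
    if line.state == INVALID:                     # read miss: fill the line
        shared = any(o.state != INVALID for o in others)
        for o in others:                          # SHR: other valid copies
            if o.state in (MODIFIED, EXCLUSIVE):  # drop to Shared (a Modified
                o.state = SHARED                  # copy would first copy back)
        line.value = memory_value
        line.state = SHARED if shared else EXCLUSIVE
    return line.value

def processor_write(line, others, value):
    """Write at the initiating processor: invalidate all other copies."""
    for o in others:                              # broadcast invalidate (SHW)
        o.state = INVALID
    line.value = value
    line.state = MODIFIED                         # dirty, differs from memory

a, b = CacheLine(), CacheLine()
processor_read(a, [b], 10)        # a fills the line: Exclusive
processor_read(b, [a], 10)        # b fills the same line: both become Shared
processor_write(a, [b], 11)       # a writes: b's copy is invalidated
print(a.state, b.state)           # -> M I
```

After the write, processor a has exclusive access; a later read by b would miss, force a copy back, and return both lines to Shared.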
Most modern machines use invalidate protocols, since we usually have the situation of one writer with many readers.

Directory vs. Snoopy Schemes
Snoopy caches:
• Each coherence operation is sent to all processors. This generates large traffic, which is an inherent limitation.
• Easy to implement on a bus-based system.
• Not feasible for machines with memory distributed across a large number of sub-systems.
Directory caches:
• The need for a broadcast medium is replaced by the directory.
• The additional information stored in the directory may add significant overhead.
• The underlying network must be able to carry all the coherence requests.
• The directory is a point of contention; therefore, distributed directory schemes are often used.

Lecture 10: Cache Coherence
• Introduction
• Directory protocols
• Snoopy protocols
• L1-L2 consistency

L1-L2 Cache Consistency
Cache coherence techniques apply only to caches connected to a bus or other interconnection mechanism, typically L2 caches.
However, processors often have L1 caches that are not connected to a bus, so no snoopy protocol can be used there.

L1-L2 Cache Consistency (Cont'd)
Solution: extend the cache coherence protocols to the L1 caches:
• Each L1 line should keep track of the state of the corresponding L2 line, and
• L1 should write through to L2.
This requires:
• L1 must be a subset of L2.
• The associativity of the L2 cache should be equal to or greater than that of the L1 cache.
  • Ex.: if L2 is 2-way set associative while L1 is 4-way set associative, it doesn't work.
(Figure: processor P with an L1 cache backed by an L2 cache.)
If L1 has a write-back policy, the interaction between L1 and L2 is more complex.

An Example: Alpha-Server 4100
A four-processor shared-memory symmetric multiprocessor system.
Each processor has a three-level cache hierarchy:
• L1 consists of two direct-mapped on-chip caches, one for instructions and one for data.
  • Write-through to L2, with a write buffer.
• L2 is an on-chip three-way set-associative cache with write-back to L3.
• L3 is an off-chip direct-mapped cache with write-back to main memory.

Summary
• Cache coherence in multiprocessor systems is an important issue to be considered; otherwise, performance will suffer.
• Additional hardware is required to coordinate access to data that might have multiple cache copies.
• The underlying technique must provide guarantees on the correct semantics.
• Both hardware and software solutions can be used; there are several different protocols to choose from among the hardware solutions.