Two Techniques for Improving the Performance of Memory Consistency Models*

Presented by: Michael Bauer
ECE 259/CPS 221, Spring Semester 2008
Dr. Lebeck

* Based on "Two Techniques for Improving the Performance of Memory Consistency Models" in the 1991 International Conference on Parallel Processing

Outline

1. Motivation
2. Definitions
3. Prefetching
4. Prefetching Example
5. Speculative Execution
6. Combined Example
7. Conclusions

Motivation

Problems:
1. Sequential Consistency (SC) is slow: too much waiting
2. Release Consistency (RC) is fast, but complex to program

Ideas:
1. In most cases, execution is correct regardless of whether the ordering arcs are enforced
2. Use non-binding prefetching to hide the latency of memory accesses
3. Use speculative execution to hide the latency of memory accesses
4. Make the implementation general to all memory consistency models

Definitions

• Read Performed: a read is said to perform when its return value is bound and cannot be modified by other write operations
• Write Performed: a write is said to perform when the value written by the write operation is visible to all processors
• Non-binding Prefetch: a non-binding prefetch brings data into the cache and keeps it coherent until the processor actually requires the data

How does this differ from a binding prefetch? Why is this difference important?

Prefetching

Implementation:
• Prefetch from the read/write buffer (previously used to issue only one outstanding request at a time)
• Addresses are already known -> prefetches don't need to speculate on addresses
• Maintain coherence of the data until it is ready to be used

Caveats:
• Requires invalidation-based coherence (why?)
• Requires complex hardware: a pipelined memory system and lockup-free caches
• Needs enough lookahead (why?)

Prefetching Example

Without prefetching:
1. lock L (miss)
2. read C (miss)
3. read D (hit)
4. read E[D] (miss)
5. unlock L (hit)

With prefetching:
1. lock L (miss)
2. read C (hit)
3. read D (hit)
4. read E[D] (miss)
5.
unlock L (hit)

Cycle counts (miss = 100 cycles, hit = 1 cycle; "(p)" marks a prefetched access; read E[D] depends on read D, so it costs 100 + 1):

                     lock L   read C   read D   read E[D]   Total
SC, no prefetching    100      100       1       100 + 1     302
RC, no prefetching    100        1       1       100 + 1     203
SC, prefetching       100      1 (p)     1       100 + 1     203
RC, prefetching       100      0 (p)     1       100 + 1     202

Speculative Execution

Three components of speculative execution:
1. Speculation Mechanism – provides support for starting speculative execution and keeping track of its state
2. Detection Mechanism – determines when speculation has failed by violating some invariant of execution
3. Correction Mechanism – allows recovery from violations

Speculation Mechanism

Stores have two cases:
- SC: stores remain in the ROB until they have performed
- RC: stores can issue at the front of the ROB after only address translation

Loads are more complicated:
1. First, speculatively perform the load
2. Then monitor cache coherence traffic for the given cache line using the speculative-load buffer

Speculative-load buffer entry fields: Load Address | Acquire | Done | Store Tag

Advantages:
• Only have to access the cache once, not twice as with prefetching
• Allows accesses to issue out of order

Example:
1. Add r5, r4, r1
2. Store r5, A
3. Load B, r6
Instruction 3 can issue out of order!

Detection Mechanism

Two conditions for load correctness:
1. The Store Tag field must be null
2. The Done field must be set if the Acquire field is set

Monitor invalidations, updates, and replacements for matches to the load address. Any match is conservatively assumed to make the speculation incorrect.

SC: treat all loads as acquires. RC: treat only loads of uncached locations as acquires. (How does this give us RC?)

Recovery Mechanism

We need some mechanism for recovering from mis-speculation, similar to recovering from mis-predicted branches.

Two cases for recovery:
1. The load has completed: pessimistically assume that all instructions after it depend on it.
2. The load has not yet completed: only the load has to be reissued, since nobody has used its value yet. This gives a small performance boost.

Combined Example

Code segment:
read A (miss)
write B (miss)
write C (miss)
read D (hit)
read E[D] (miss)

Results?

Conclusions

1.
Can use prefetching and speculative execution to hide long-latency memory accesses
2. The techniques are general enough to be applied to any memory consistency model
3. They allow the use of complex out-of-order (OOO) cores even with SC

The BIG Question: Do these techniques make relaxed memory consistency models even better than SC, or do they narrow the gap between SC and more relaxed models? E.g., does SC + ILP = RC?
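The totals in the prefetching example can be checked with one line of arithmetic per case, assuming (as in the example) a 100-cycle miss, a 1-cycle hit, and a dependent read E[D] costing 100 + 1:

```python
# Totals for the four counted accesses (lock L, read C, read D, read E[D]);
# the unlock hit is not counted in the slide's totals.
sc_no_pf = 100 + 100 + 1 + (100 + 1)  # SC, no prefetching: fully serialized
rc_no_pf = 100 +   1 + 1 + (100 + 1)  # RC, no prefetching
sc_pf    = 100 +   1 + 1 + (100 + 1)  # SC with prefetching: read C now hits
rc_pf    = 100 +   0 + 1 + (100 + 1)  # RC with prefetching: read C fully hidden
assert (sc_no_pf, rc_no_pf, sc_pf, rc_pf) == (302, 203, 203, 202)
```

Note that prefetching alone closes almost the entire SC/RC gap in this example: SC with prefetching matches RC without it.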
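To make the binding vs. non-binding distinction concrete, here is a toy cache model (all class and method names are my own, not from the paper) in which a prefetched line remains subject to invalidation, so a later read still returns the coherent value:

```python
class ToyCache:
    """Toy model of a cache fed by non-binding prefetches."""

    def __init__(self, memory):
        self.memory = memory   # address -> value (shared "main memory")
        self.lines = {}        # locally cached copies

    def prefetch(self, addr):
        # Non-binding: just warms the cache; the value is NOT yet bound.
        self.lines[addr] = self.memory[addr]

    def invalidate(self, addr):
        # Invalidation-based coherence keeps the prefetched copy honest.
        self.lines.pop(addr, None)

    def read(self, addr):
        if addr not in self.lines:                 # miss: fetch fresh copy
            self.lines[addr] = self.memory[addr]
        return self.lines[addr]                    # hit otherwise

memory = {0xC: 1}
cache = ToyCache(memory)
cache.prefetch(0xC)         # issued early, e.g. while waiting on lock L
memory[0xC] = 2             # another processor writes C ...
cache.invalidate(0xC)       # ... and coherence invalidates our copy
assert cache.read(0xC) == 2  # non-binding: we still see the new value
```

A binding prefetch would have returned the stale value 1, which is why non-binding prefetching is safe under any consistency model, and why the scheme requires invalidation-based coherence.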
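The detection mechanism's load-correctness check can be sketched as follows, assuming one speculative-load buffer entry with the four fields shown earlier (the class and method names are hypothetical):

```python
class SpecLoadEntry:
    """One speculative-load buffer entry: Load Address | Acquire | Done | Store Tag."""

    def __init__(self, load_address, acquire=False, store_tag=None):
        self.load_address = load_address
        self.acquire = acquire      # is this load treated as an acquire?
        self.done = False           # set once the acquire has performed
        self.store_tag = store_tag  # pending store this load depends on, or None
        self.violated = False

    def coherence_event(self, address):
        """An invalidation, update, or replacement was seen for `address`.
        Any match to the load address is conservatively treated as a
        mis-speculation."""
        if address == self.load_address:
            self.violated = True

    def can_retire(self):
        """The two conditions for load correctness: (1) Store Tag is null,
        (2) Done is set whenever Acquire is set -- and no coherence event
        matched the load address."""
        return (not self.violated
                and self.store_tag is None
                and (self.done or not self.acquire))

entry = SpecLoadEntry(load_address=0x40, acquire=True)
assert not entry.can_retire()   # Acquire set but Done not yet set
entry.done = True
assert entry.can_retire()       # both correctness conditions now hold
entry.coherence_event(0x40)     # invalidation matches the load address
assert not entry.can_retire()   # speculation failed; recovery is needed
```

When `can_retire()` returns False after a matching coherence event, the recovery mechanism above kicks in: reissue the load, and, if the load had already completed, conservatively redo the instructions after it.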