Two Techniques for Improving the Performance of Memory Consistency Models*

Presented by: Michael Bauer
ECE 259/CPS 221
Spring Semester 2008
Dr. Lebeck
• Based on "Two Techniques for Improving the Performance of Memory Consistency Models"
from the 1991 International Conference on Parallel Processing
Outline
1. Motivation
2. Definitions
3. Prefetching
4. Prefetching Example
5. Speculative Execution
6. Combined Example
7. Conclusions
Motivation
Problems:
1. Sequential Consistency (SC) is
slow, too much waiting
2. Release Consistency (RC) is fast,
but complex to program
Ideas:
1. In most cases, execution is correct even without enforcing ordering arcs
2. Use non-binding prefetching to hide latency of memory accesses
3. Use speculative execution to hide latency of memory accesses
4. Make implementation general to all memory consistency models
Definitions
• Read Performed: a read is said to perform when its return
value is bound and cannot be modified by other write
operations
• Write Performed: a write is said to perform when the
value written by the write operation is visible to all
processors
• Non-binding Prefetch: a non-binding prefetch brings data
into the cache and keeps it coherent until the
processor actually requires the data
How does this differ from a binding prefetch?
Why is this difference important?
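To make the binding/non-binding distinction concrete, here is a sketch in C. GCC and Clang's `__builtin_prefetch` is a non-binding prefetch: it is only a hint that pulls a line toward the cache, and the value is bound only at the later load, which coherence keeps correct even if another processor writes in between. (The function and the lookahead distance of 8 are illustrative, not from the slides.)

```c
#include <stddef.h>

/* Sketch: non-binding prefetching over an array. The prefetch does NOT
 * bind a value; the read "performs" only at the `sum += a[i]` load, so
 * any write that invalidates the prefetched line is still observed. */
long sum_with_prefetch(const long *a, size_t n) {
    long sum = 0;
    for (size_t i = 0; i < n; i++) {
        if (i + 8 < n)
            __builtin_prefetch(&a[i + 8], /*rw=*/0, /*locality=*/3);
        sum += a[i];  /* value bound here, not at the prefetch */
    }
    return sum;
}
```

A binding prefetch, by contrast, would fix the value at prefetch time, which is exactly why it cannot be issued early under a strict consistency model.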
Prefetching
Implementation:
• Prefetch for accesses waiting in the read/write buffer (which previously
issued only one outstanding request at a time)
• Addresses are already known -> prefetches don't need to speculate on addresses
• Maintain coherence of the data until it's ready to be used
Caveats:
• Requires invalidation-based coherence (why?)
• Requires complex hardware: a pipelined memory system and lockup-free
caches
• Needs enough lookahead in the instruction stream (why?)
Prefetching Example
Without Prefetching          With Prefetching
1. lock L     (miss)         1. lock L     (miss)
2. read C     (miss)         2. read C     (hit)
3. read D     (hit)          3. read D     (hit)
4. read E[D]  (miss)         4. read E[D]  (miss)
5. unlock L   (hit)          5. unlock L   (hit)

Cycle counts (miss = 100 cycles, hit = 1 cycle; "(p)" marks a prefetched access):

                  Without Prefetching        With Prefetching
SC Total Cycles   100+100+1+100+1 = 302      100+1(p)+1+100+1 = 203
RC Total Cycles   100+1+1+100+1   = 203      100+0(p)+1+100+1 = 202
Speculative Execution
Three Components to Speculative Execution
1. Speculation Mechanism – provide support for starting speculative
execution and keeping track of state
2. Detection Mechanism – determine when speculation has failed due to
violating some invariant of execution
3. Correction Mechanism – allow for recovery from violations
Speculation Mechanism
Stores have two cases:
- Under SC, stores remain in the reorder buffer (ROB) until they have performed
- Under RC, stores can issue at the front of the ROB after only address translation
Loads are more complicated:
1. First, speculatively perform the load
2. Then monitor cache coherence traffic for the given cache line using the
speculative-load buffer (fields: Load Address, Acquire, Done, Store Tag)
Advantages:
• Only have to access the cache once, not twice as with prefetching
• Allows accesses to issue out of order
Example
1. Add r5, r4, r1
2. Store r5, A
3. Load B, r6
Instruction 3 can issue out of order (OOO)!
Detection Mechanism
Two Conditions for Load Correctness:
1. The Store Tag must be null
2. The Done field must be set if the Acquire field is set
Monitor invalidations, updates, and replacements for matches to the load address.
Any match is conservatively assumed to make the speculation incorrect.
SC: treat all loads as acquires.
RC: treat only loads of uncached locations as acquires.
(How does this give us RC?)
(Speculative-load buffer entry: Load Address | Acquire | Done | Store Tag)
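The per-entry check can be sketched as follows. Field names come from the slide; the struct layout, types, and the convention that a null Store Tag is 0 are my assumptions:

```c
#include <stdbool.h>
#include <stdint.h>

/* One speculative-load buffer entry (layout is an illustrative assumption). */
struct spec_load_entry {
    uint64_t load_address; /* cache-line address being monitored          */
    bool     acquire;      /* this load is treated as an acquire          */
    bool     done;         /* the acquire has performed                   */
    uint64_t store_tag;    /* nonzero while a prior store is outstanding  */
};

/* The slide's two correctness conditions: the Store Tag must be null,
 * and if Acquire is set then Done must also be set. */
bool load_is_correct(const struct spec_load_entry *e) {
    return e->store_tag == 0 && (!e->acquire || e->done);
}

/* Any invalidation, update, or replacement matching the load address
 * conservatively squashes the speculation. */
bool must_squash(const struct spec_load_entry *e, uint64_t line_addr) {
    return e->load_address == line_addr;
}
```

Under this sketch, "SC treats all loads as acquires" just means every entry is created with `acquire = true`.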
Recovery Mechanism
Need to have some mechanism for recovering from mis-speculation,
similar to when branches are mis-predicted.
Two Cases for Recovery:
1. The load has already committed: pessimistically assume that all
instructions after it depend on it, and flush them.
2. The load has not committed: only the load needs to be reissued, since
nothing has used its value yet. This gives a small performance boost.
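The two recovery cases reduce to one decision, sketched here (the enum names and the consumed-value flag are illustrative, not the paper's hardware interface):

```c
/* Recovery sketch: choose between the slide's two cases when a
 * speculative load turns out to be wrong. */
enum recovery { FLUSH_AFTER_LOAD, REISSUE_LOAD_ONLY };

enum recovery recover(int load_has_committed) {
    /* Committed load: its value may have propagated, so flush everything
     * after it, exactly as on a branch mispredict.
     * Uncommitted load: nothing has consumed the value, so reissuing
     * just the load is enough. */
    return load_has_committed ? FLUSH_AFTER_LOAD : REISSUE_LOAD_ONLY;
}
```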
Combined Example
Code Segment
read A (miss)
write B (miss)
write C (miss)
read D (hit)
read E[D] (miss)
Results?
Conclusions
1. Can use prefetching and speculative execution to hide
long latency memory accesses
2. Techniques are general enough to be applied to any
memory consistency model
3. Allow for the use of complex OOO cores even with SC
The BIG Question: Do these techniques make relaxed
memory consistency models even better than SC, or do
they narrow the gap between SC and more relaxed models?
e.g., does SC + ILP = RC?