
CS 7810
Lecture 19
Coherence Decoupling: Making Use of Incoherence
J. Huh, J. Chang, D. Burger, G. Sohi
Proceedings of ASPLOS-XI
October 2004
Coherence / Consistency
• Coherence guarantees (i) that a write will
eventually be seen by other processors, and (ii) write
serialization (all processors see writes to the same location
in the same order)
• The consistency model defines the ordering of writes and
reads to different memory locations – the hardware
guarantees a certain consistency model and the
programmer attempts to write correct programs with
those assumptions
Consistency Examples

Example 1 (initially A = B = 0):
  P1:  A = 1;
       if (B == 0)
           critical section
  P2:  B = 1;
       if (A == 0)
           critical section

Example 2 (initially A = B = 0):
  P1:  A = 1;
  P2:  if (A == 1)
           B = 1;
  P3:  if (B == 1)
           register = A;

Example 3:
  P1:  Data = 2000;
       Head = 1;
  P2:  while (Head == 0)
       {}
       … = Data;
Snooping-Based Cache Coherence
• Caches share a bus; every cache sees each transaction
in the same cycle; every cache manages itself
• When one cache writes to a block, every other cache
invalidates its copy of that block
• When a cache has a read miss, the block is provided
by memory or the last writer
• Protocols are defined by states: MSI, MESI, MOESI
[Figure: four processors, each with private caches, connected by a shared bus to memory]
Directory-Based Cache Coherence
• A directory keeps track of the sharing status of each block
• Every request goes to the directory and the directory then
sends directives to each cache – the directory is the point
of serialization (just as the bus is, in a snooping protocol)
• For example, on a write, the request reaches the directory,
the directory sends invalidates to other sharers, and
permissions are granted to the writer
[Figure: four processors, each with private caches, connected by a network to memory and the directory]
TLDS (Thread-Level Data Speculation)
• A certain ordering of reads and writes is assumed – if that
ordering is violated, the thread is re-executed
• The coherence protocol is used to propagate writes
[Figure: four speculative threads, each with private caches, sharing memory]
The Traditional Model
• No thread is speculative – a parallel application with
synchronization points and parallel regions is guaranteed
to execute correctly with no need for re-execution
• Threads wait at synchronization points and wait for the
correct permissions for every block of data
[Figure: four non-speculative threads, each with private caches, sharing memory]
Coherence Decoupling
• A simple coherence protocol is often a slow
protocol – for example, a simple protocol may not
allow multiple outstanding requests
• Coherence decoupling: maintain a fast but potentially
incorrect protocol alongside a slow but correct backing
protocol; this incurs fewer stalls in the common case
and occasional recoveries
Coherence Decoupling
• A coherence operation is broken into two
components: (i) acquiring and using the value,
(ii) receiving the correct set of permissions
SCL (Speculative Cache Lookup) Protocol
• Why does speculative cache look-up work?
 False sharing: a line was invalidated, but a
different word was written to
 Silent stores or value locality
 If there is spare bandwidth, updated values
can be pushed out to sharers
Implementation
• The Miss Status Holding Register (MSHR) keeps
track of outstanding requests – it can buffer the
speculative value and check it against the correct
value when the coherence reply arrives – on a
mis-speculation, the instruction is treated like a
branch misprediction
• Speculation on a coherence operation is no
different from traditional forms of speculation
Coherence Decoupling Components (figure not reproduced)
Microbenchmark Behavior (figure not reproduced)
Results (two slides of figures, not reproduced)
Summary
• Arguments for coherence decoupling:
 Reduces protocol complexity
 Reduces programming complexity
 Marginal hardware overhead
 Coherence misses will emerge as greater
bottlenecks?
• What is the expected trend for CMPs?