Token Coherence: A Framework for Implementing Multiple-CMP Systems Mike Marty

Token Coherence: A Framework for Implementing Multiple-CMP Systems
Mike Marty¹, Jesse Bingham², Mark Hill¹, Alan Hu², Milo Martin³, and David Wood¹
¹University of Wisconsin-Madison
²University of British Columbia
³University of Pennsylvania
February 17th, 2005
(C) 2005 Multifacet Project
Summary
• Microprocessor → Chip Multiprocessor (CMP)
• Symmetric Multiprocessor (SMP) → Multiple CMPs
• Problem: Coherence with Multiple CMPs
• Old Solution: Hierarchical Directory (Complex & Slow)
• New Solution: Apply Token Coherence
– Developed for glueless multiprocessor [2003]
– Keep: Flat for Correctness
– Exploit: Hierarchical for performance
• Less Complex & Faster than Hierarchical Directory
Improving Multiple-CMP Systems using Token Coherence
Outline
• Motivation and Background
– Coherence in Multiple-CMP Systems
– Example: DirectoryCMP
• Token Coherence: Flat for Correctness
• Token Coherence: Hierarchical for Performance
• Evaluation
Coherence in Multiple-CMP Systems
• Chip Multiprocessors (CMPs) emerging
• Larger systems will be built with Multiple CMPs
[Figure: four CMPs (CMP 1 to CMP 4) joined by an interconnect; each CMP has processors (P) with instruction (I) and data (D) caches, an on-chip interconnect, and L2 banks.]
Problem: Hierarchical Coherence
• Intra-CMP protocol for coherence within CMP
• Inter-CMP protocol for coherence between CMPs
• Interactions between protocols increase complexity
– explodes state space
[Figure: CMP 1 to CMP 4 joined by an interconnect, with intra-CMP coherence inside each chip and inter-CMP coherence between chips.]
Improving Multiple-CMP Systems with Token Coherence
• Token Coherence allows Multiple-CMP systems to be...
– Flat for correctness (low complexity), but
– Hierarchical for performance (fast)
[Figure: CMP 1 to CMP 4 joined by an interconnect, annotated with the correctness substrate and the performance protocol.]
Example: DirectoryCMP
• 2-level MOESI directory
• Race conditions!
[Figure: CMP 0 (P0 to P3) and CMP 1 (P4 to P7), each with per-processor L1 I&D caches and a shared L2 / directory, backed by memory/directory. A Store B triggers getx, fwd, inv, inv ack, ack, data/ack, and WB messages while block B moves among the M, S, O, and I states.]
Token Coherence Summary
• Token Coherence separates performance from correctness
• Correctness Substrate: Enforces coherence invariant and prevents starvation
1. Safety with Token Counting
2. Starvation Avoidance with Persistent Requests
• Performance Policy: Makes the common case fast
– Transient requests to seek tokens
• Unordered, untracked, unacknowledged
– Possible prediction, multicast, filters, etc.
Outline
• Motivation and Background
• Token Coherence: Flat for Correctness
– Safety
– Starvation Avoidance
• Token Coherence: Hierarchical for Performance
• Evaluation
Example: Token Coherence [ISCA 2003]
[Figure: four processors (P0 to P3), each with L1 I&D caches and an L2, joined by an interconnect to memory modules (mem 0 and mem 3 are labeled); a Load B and a Store B are in flight.]
• Each memory block initialized with T tokens
• Tokens stored in memory, caches, & messages
• At least one token to read a block
• All tokens to write a block
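The four rules above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Holder` class, the `transfer` helper, and the value `T = 4` are hypothetical names and numbers chosen for the example, not part of the protocol description.

```python
from dataclasses import dataclass

T = 4  # assumed token count per block (hypothetical; the slides only call it T)


@dataclass
class Holder:
    """Anything that can hold tokens for one block: memory, a cache, or a message."""
    name: str
    tokens: int = 0

    def can_read(self) -> bool:
        # At least one token is needed to read the block.
        return self.tokens >= 1

    def can_write(self) -> bool:
        # All T tokens are needed to write the block.
        return self.tokens == T


def transfer(src: Holder, dst: Holder, count: int) -> None:
    """Move tokens between holders; tokens are never created or destroyed."""
    if not 0 <= count <= src.tokens:
        raise ValueError("cannot send tokens the source does not hold")
    src.tokens -= count
    dst.tokens += count


memory = Holder("mem", tokens=T)   # each block starts with T tokens at memory
p0, p1 = Holder("P0"), Holder("P1")

transfer(memory, p0, 1)            # P0 can now read
transfer(memory, p1, T - 1)        # P1 holds the rest, but not all T
readers_only = p0.can_read() and p1.can_read() and not p1.can_write()

transfer(p0, p1, 1)                # P0 gives up its token; P1 can now write
total = memory.tokens + p0.tokens + p1.tokens
```

Because a writer needs every token, no reader can hold one at the same time, and the total across all holders stays at T.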
Extending to Multiple-CMP System
[Figure: CMP 0 (P0, P1) and CMP 1 (P2, P3); each processor has L1 I&D caches, each CMP has an on-chip interconnect and a shared L2, and an interconnect joins the CMPs with mem 0 and mem 1.]
Extending to Multiple-CMP System
[Figure: CMP 0 (P0, P1) and CMP 1 (P2, P3), each with per-processor L1 I&D caches, an on-chip interconnect, and a shared L2, joined with mem 0 and mem 1 by an interconnect; a Store B is in progress.]
• Token counting remains flat
• Tokens to caches
– Handles shared caches and other complex hierarchies
Safety Recap
• Safety: Maintain coherence invariant
– Only one writer, or multiple readers
• Tokens for Safety
– T Tokens associated with each memory block
– # tokens encoded in 1 + log2(T) bits
– Processor acquires all tokens to write, a single token to read
• Tokens passed to nodes in glueless multiprocessor scheme
– But CMPs have private and shared caches
• Tokens passed to caches in Multiple-CMP system
– Arbitrary cache hierarchy easily handled
– Flat for correctness
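As a worked example of the encoding size in the recap above: one way to read 1 + log2 T is as the number of bits needed to hold any count from 0 to T inclusive when T is a power of two. That reading is an assumption, since the slide does not say what the extra bit covers. The helper name `token_count_bits` is hypothetical.

```python
def token_count_bits(total_tokens: int) -> int:
    """Bits needed to represent any token count from 0 to total_tokens inclusive."""
    if total_tokens < 1:
        raise ValueError("a block needs at least one token")
    # int.bit_length() is the number of bits needed for the largest value, T.
    # For T a power of two this equals 1 + log2(T).
    return total_tokens.bit_length()


bits_for_16 = token_count_bits(16)   # 1 + log2(16)
bits_for_64 = token_count_bits(64)   # 1 + log2(64)
```

With T = 64 the count therefore fits in 7 bits per block.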
Some Token Counting Implications
• Memory must store tokens
– Separate RAM
– Use extra ECC bits
– Token cache
• T sized to # caches to allow read-only copies in all caches
• Replacements cannot be silent
– Tokens must not be lost or dropped
• Targeted for invalidate-based protocols
– Not a solution for write-through or update protocols
• Tokens must be identified by block address
– Address must be in all token-carrying messages
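Two of the implications above, non-silent replacements and address-tagged token messages, can be sketched as follows. The `TokenMessage`, `Cache`, and `Memory` classes are hypothetical illustrations, and sending evicted tokens to memory is one assumed destination among several possible ones.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass(frozen=True)
class TokenMessage:
    address: int   # tokens are meaningless without the block they belong to
    tokens: int


@dataclass
class Cache:
    tokens: Dict[int, int] = field(default_factory=dict)  # address -> token count

    def evict(self, address: int) -> Optional[TokenMessage]:
        """Replace a block; any tokens it holds must be sent on, never dropped."""
        count = self.tokens.pop(address, 0)
        if count == 0:
            return None   # nothing held, so nothing to send
        return TokenMessage(address, count)


@dataclass
class Memory:
    tokens: Dict[int, int] = field(default_factory=dict)

    def receive(self, msg: TokenMessage) -> None:
        self.tokens[msg.address] = self.tokens.get(msg.address, 0) + msg.tokens


cache = Cache({0x40: 3})
memory = Memory({0x40: 1})
msg = cache.evict(0x40)
if msg is not None:
    memory.receive(msg)   # all four tokens for block 0x40 are accounted for
```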
Starvation Avoidance
• Request messages can miss tokens
– In-flight tokens
• Transient Requests are not tracked throughout system
– Incorrect filtering, multicast, destination-set prediction, etc
• Possible Solution: Retries
– Retry w/ optional randomized backoff is effective for races
• Guaranteed Solution: Persistent Requests
– Heavyweight request guaranteed to succeed
– Should be rare (uses more bandwidth)
– Locates all tokens in the system
– Orders competing requests
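The escalation described above (transient request, optional retries with randomized backoff, then a persistent request) might look like the sketch below. The function `acquire_tokens`, its parameters, and the backoff numbers are all hypothetical; setting `max_retries=0` would model a policy that goes straight to a persistent request after the first miss.

```python
import random
from typing import Callable, Tuple


def acquire_tokens(
    try_transient: Callable[[], bool],
    issue_persistent: Callable[[], None],
    max_retries: int = 3,     # hypothetical limit, not from the slides
    base_backoff: int = 8,    # hypothetical backoff unit, in cycles
    seed: int = 0,
) -> Tuple[str, int]:
    """Return how the request finished and the backoff cycles spent waiting."""
    rng = random.Random(seed)
    waited = 0
    for attempt in range(1 + max_retries):
        if try_transient():
            return "transient", waited
        # Randomized, growing backoff makes a repeat collision less likely.
        waited += rng.randint(0, base_backoff << attempt)
    issue_persistent()        # heavyweight, but guaranteed to succeed
    return "persistent", waited


outcomes = iter([False, False, True])   # tokens missed twice, then found
log = []
raced, _ = acquire_tokens(lambda: next(outcomes), lambda: log.append("persistent"))
starved, _ = acquire_tokens(lambda: False, lambda: log.append("persistent"))
```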
Starvation Avoidance
[Figure: CMP 0 (P0, P1) and CMP 1 (P2, P3) with per-processor L1 I&D caches, a shared L2 per CMP, and mem 0 and mem 1; three processors issue Store B and send GETX transient requests.]
• Tokens move freely in the system
– Transient requests can miss in-flight tokens
– Incorrect speculation, filters, prediction, etc
Starvation Avoidance
[Figure: CMP 0 (P0, P1) and CMP 1 (P2, P3) with per-processor L1 I&D caches, a shared L2 per CMP, and mem 0 and mem 1; three Store B requests are outstanding.]
• Solution: issue Persistent Request
– Heavyweight request guaranteed to succeed
– Methods: Centralized [2003] and Distributed (New)
Old Scheme: Central Arbiter [2003]
[Figure: the Store B requests of P0, P1, and P2 time out; each sends a persistent request to arbiter 0, whose queue reads B: P0, B: P2, B: P1.]
– Processors issue persistent requests
Old Scheme: Central Arbiter [2003]
[Figure: arbiter 0 (queue B: P0, B: P2, B: P1) broadcasts the activate for P0; every L1, shared L2, and memory records B: P0.]
– Processors issue persistent requests
– Arbiter orders and broadcasts activate
Old Scheme: Central Arbiter [2003]
[Figure: handoff from P0 to P2 in three numbered steps (1, 2, 3); the entry at every L1, shared L2, and memory changes from B: P0 to B: P2.]
– Processor sends deactivate to arbiter
– Arbiter broadcasts deactivate (and next activate)
– Bottom Line: handoff is 3 message latencies
Improved Scheme: Distributed Arbitration [NEW]
[Figure: P0, P1, and P2 issue Store B; every L1, shared L2, and memory holds a table with entries P0: B, P1: B, P2: B.]
– Processors broadcast persistent requests
Improved Scheme: Distributed Arbitration [NEW]
[Figure: P0, P1, and P2 issue Store B; every L1, shared L2, and memory holds a table with entries P0: B, P1: B, P2: B.]
– Processors broadcast persistent requests
– Fixed priority (processor number)
Improved Scheme: Distributed Arbitration [NEW]
[Figure: tables at every L1, shared L2, and memory list P0: B, P1: B, P2: B; a step marked 1 is shown at P0.]
– Processors broadcast persistent requests
– Fixed priority (processor number)
– Processors broadcast deactivate
Improved Scheme: Distributed Arbitration [NEW]
[Figure: with P0's entry removed, the tables at every L1, shared L2, and memory list only P1: B and P2: B.]
– Bottom line: Handoff is a single message latency
• Subtle point: P0 and P1 must wait until next “wave”
Implementing Distributed Persistent Requests
• Table at each cache
– Sized to N entries for each processor (we use N=1)
– Indexed by processor ID
– Content-addressable by Address
• Each incoming message must access table
– Not on the critical path, so a slow CAM is acceptable
• Activate/deactivate reordering cannot be allowed
– Persistent request virtual channel must be point-to-point ordered
– Or, other solution such as sequence numbers or acks
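A minimal sketch of the per-cache table described above, with one entry per processor (N = 1), indexed by processor ID and searched by address. The class and method names are hypothetical. Treating the lowest processor number as the highest priority is an assumption; the slides say only that priority is fixed by processor number. The "wave" rule is not modeled.

```python
from typing import List, Optional


class PersistentRequestTable:
    """One table per cache: one entry per processor, indexed by processor ID."""

    def __init__(self, num_processors: int) -> None:
        # The maximum number of processors is fixed up front.
        self.entries: List[Optional[int]] = [None] * num_processors

    def activate(self, proc_id: int, address: int) -> None:
        self.entries[proc_id] = address

    def deactivate(self, proc_id: int) -> None:
        self.entries[proc_id] = None

    def active_requester(self, address: int) -> Optional[int]:
        """Search by address (the CAM lookup); lowest processor number wins."""
        for proc_id, entry in enumerate(self.entries):
            if entry == address:
                return proc_id
        return None


B = 0x80
table = PersistentRequestTable(num_processors=4)
for proc in (2, 0, 1):          # broadcasts may arrive in any order
    table.activate(proc, B)
first = table.active_requester(B)
table.deactivate(0)             # P0 finishes and broadcasts its deactivate
second = table.active_requester(B)
```

Since every cache applies the same activates and deactivates, each one can decide locally where to send tokens for block B, which is why no arbiter is needed for the handoff.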
Implementing Distributed Persistent Requests
• Should reads be distinguished from writes?
– Not necessary, but
– Persistent Read request is helpful
• Implications of flat distributed arbitration
– Simple → flat for correctness
– Global broadcast when used
• Fortunately, persistent requests are rare in typical workloads (0.3%)
• Bad workload (very high contention) would burn bandwidth
– Maximum # processors must be architected
• What about a hierarchical persistent request scheme?
– Possible, but correctness is no longer flat
– Make the common case fast
Reducing Unnecessary Traffic
• Problem: Which token-holding cache responds with data?
• Solution: Distinguish one token as the owner token
– The owner includes data with token response
– Clean vs. dirty owner distinction also useful for writebacks
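The owner-token rule above can be illustrated with a small sketch: every token holder gives up its tokens for a write request, but only the holder of the owner token attaches data. The `BlockState` and `Response` types and both functions are hypothetical names used for illustration only.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class BlockState:
    tokens: int = 0            # total tokens held, owner token included
    has_owner: bool = False    # True if one of those tokens is the owner token
    dirty: bool = False        # dirty owner: memory's copy is stale


@dataclass
class Response:
    tokens: int
    includes_owner: bool
    data: Optional[bytes]


def respond_to_write(state: BlockState, data: bytes) -> Optional[Response]:
    """Give up all tokens for a write request; only the owner attaches data."""
    if state.tokens == 0:
        return None
    resp = Response(state.tokens, state.has_owner, data if state.has_owner else None)
    state.tokens, state.has_owner, state.dirty = 0, False, False
    return resp


def writeback_needs_data(state: BlockState) -> bool:
    # A clean owner can return its tokens without data; a dirty owner cannot.
    return state.has_owner and state.dirty


sharer = BlockState(tokens=1)
owner = BlockState(tokens=2, has_owner=True, dirty=True)
owner_must_write_data = writeback_needs_data(owner)
from_sharer = respond_to_write(sharer, b"B")
from_owner = respond_to_write(owner, b"B")
```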
Outline
• Motivation and Background
• Token Coherence: Flat for Correctness
• Token Coherence: Hierarchical for Performance
– TokenCMP
– Another look at performance policies
• Evaluation
Hierarchical for Performance: TokenCMP
• Target System:
– 2-8 CMPs
– Private L1s, shared L2 per CMP
– Any interconnect, but high-bandwidth
• Performance Policy Goals:
–
–
–
–
Slide 29
Aggressively acquire tokens
Exploit on-chip locality and bandwidth
Respect cache hierarchy
Detecting and handling missed tokens
Improving Multiple-CMP Systems using Token Coherence
Hierarchical for Performance: TokenCMP
• Approach:
– On L1 miss, broadcast within own CMP
• Local cache responds if possible
– On L2 miss, broadcast to other CMPs
– Appropriate L2 bank responds or broadcasts within its CMP
• Optionally filter
– Responses between CMPs carry extra tokens for future locality
• Handling missed tokens:
– Timeout after average memory latency
– Invoke persistent request (no retries)
• Larger systems can use filters, multicast, soft-state directories
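The escalation order in the approach above can be traced with a short sketch. It models only the sequence of steps taken for one request, not message timing or filtering. The function name and its two callback parameters are hypothetical stand-ins for "did the tokens arrive?".

```python
from typing import Callable, List


def handle_l1_miss(
    local_cmp_satisfies: Callable[[], bool],
    remote_cmps_satisfy: Callable[[], bool],
) -> List[str]:
    """Return the steps taken for one request, in order, until tokens arrive."""
    steps = ["broadcast within own CMP"]
    if local_cmp_satisfies():
        return steps
    steps.append("broadcast to other CMPs")   # the on-chip L2 also missed
    if remote_cmps_satisfy():
        return steps
    # Tokens were missed: wait about one average memory latency, then escalate.
    steps.append("timeout")
    steps.append("persistent request")        # no transient retries
    return steps


on_chip = handle_l1_miss(lambda: True, lambda: False)
off_chip = handle_l1_miss(lambda: False, lambda: True)
missed = handle_l1_miss(lambda: False, lambda: False)
```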
Other Optimizations in TokenCMP
• Implementing E-state
– Memory responds with all tokens on read request
– Use clean/dirty owner distinction to eliminate writing back unwritten data
• Implementing Migratory Sharing
– What is it?
• A processor’s read request results in exclusive permission if responder has exclusive permission and wrote the block
– In TokenCMP, simply return all tokens
• Non-speculative delay
– Hold block for some # cycles so permission isn’t stolen prematurely
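The migratory-sharing rule above reduces to a small decision about how many tokens a responder hands to a reader. The sketch below is a simplification under stated assumptions: the function name and parameters are hypothetical, and returning exactly one token in the non-migratory case is an assumed default rather than something the slides specify.

```python
def tokens_to_return(held: int, total: int, wrote_block: bool,
                     migratory: bool = True) -> int:
    """Number of tokens a responder gives to a read request (simplified)."""
    if held == 0:
        return 0
    exclusive = held == total
    if migratory and exclusive and wrote_block:
        # Migratory sharing: hand over everything so the reader's
        # likely upcoming write needs no second request.
        return held
    return 1   # assumed default: one token is enough to read


migrated = tokens_to_return(held=8, total=8, wrote_block=True)
read_only = tokens_to_return(held=8, total=8, wrote_block=False)
partial = tokens_to_return(held=3, total=8, wrote_block=True)
```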
Another Look at Performance Policies
• How to find tokens?
– Broadcast
– Broadcast w/ filters
– Multicast (destination-set prediction)
– Directories (soft or hard)
• Who responds with data?
– Owner token
• TokenCMP uses Owner token for Inter-CMP responses
– Other heuristics
• For TokenCMP intra-CMP responses, cache responds if it has extra tokens
Transient Requests May Reduce Complexity
• Processor holds the only required state about request
• L2 controller in TokenCMP very simple:
– Re-broadcasts L1 request message on a miss
– Re-broadcasts or filters external request messages
– Possible states:
• no tokens (I)
• all tokens (M)
• some tokens (S)
– Bounce unexpected tokens to memory
• DirectoryCMP’s L2 controller is complex
– Allocates MSHR on miss and forward
– Issues invalidates and receives acks
– Orders all intra-CMP requests and writebacks
– 57 states in our L2 implementation!
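The three TokenCMP L2 states listed above follow directly from the token count, which a short sketch makes concrete. The function name `l2_state` is hypothetical.

```python
def l2_state(tokens: int, total: int) -> str:
    """Map a block's token count at the L2 to the state named on the slide."""
    if not 0 <= tokens <= total:
        raise ValueError("token count out of range")
    if tokens == 0:
        return "I"    # no tokens
    if tokens == total:
        return "M"    # all tokens
    return "S"        # some tokens


states = [l2_state(n, 4) for n in range(5)]
```

No transient states appear here because the requesting processor, not the L2, tracks the outstanding request.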
Writebacks
• DirectoryCMP uses “3-phase writebacks”
– L1 issues writeback request
– L2 enters transient state or blocks request
– L2 responds with writeback ack
– L1 sends data
• TokenCMP uses “fire-and-forget” writebacks
– Immediately send tokens and data
– Heuristic: Only send data if # tokens > 1
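A fire-and-forget writeback is a single message with no request/ack handshake. The sketch below applies only the stated heuristic (attach data when more than one token is held). It deliberately ignores the owner-token and clean/dirty rules, which a real protocol would also have to honor. The `Writeback` type and function name are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Writeback:
    address: int
    tokens: int
    data: Optional[bytes]


def fire_and_forget(address: int, tokens: int, data: bytes) -> Writeback:
    """Build the one message an L1 sends on eviction; nothing is awaited."""
    if tokens < 1:
        raise ValueError("nothing to write back")
    # Heuristic from the slide: attach data only if # tokens > 1.
    return Writeback(address, tokens, data if tokens > 1 else None)


single = fire_and_forget(0x40, 1, b"B")
several = fire_and_forget(0x40, 3, b"B")
```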
Outline
• Motivation and Background
• Token Coherence: Flat for Correctness
• Token Coherence: Hierarchical for Performance
• Evaluation
– Model checking
– Performance w/ commercial workloads
– Robustness
TokenCMP Evaluation
• Simple?
– Some anecdotal examples and comparisons
– Model checking
• Fast?
– Full-system simulation w/ commercial workloads
• Robust?
– Micro-benchmarks to simulate high contention
Complexity Evaluation with Model Checking
This work performed by Jesse Bingham and Alan Hu of the University of British Columbia
• Methods:
– TLA+ and TLC
– DirectoryCMP omits all intra-CMP details
– TokenCMP’s correctness substrate modeled
• Result:
– Complexity similar between TokenCMP and non-hierarchical DirectoryCMP
– Correctness Substrate verified to be correct and deadlock-free
– All possible performance protocols correct
Performance Evaluation
• Target System:
– 4 CMPs, 4 procs/cmp
– 2GHz OoO SPARC, 8MB shared L2 per chip
– Directly connected interconnect
• Methods: Multifacet GEMS simulator
– Simics augmented with timing models
– Released soon: http://www.cs.wisc.edu/gems
• Benchmarks:
– Performance: Apache, Spec, OLTP
– Robustness: Locking uBenchmark
Full-system Simulation: Runtime
– TokenCMP performs 9-50% faster than DirectoryCMP
[Figure: runtime chart for TokenCMP and DirectoryCMP; a second version of the chart adds the labels "DRAM Directory" and "Perfect L2".]
Full-system Simulation: Inter-CMP Traffic
– TokenCMP traffic is reasonable (or better)
• DirectoryCMP control overhead greater than broadcast for small system
Full-system Simulation: Intra-CMP Traffic
Performance Robustness
[Figure: three locking micro-benchmark charts, each with an axis running from more contention to less contention; the first two are labeled "correctness substrate only".]
Summary
• Microprocessor → Chip Multiprocessor (CMP)
• Symmetric Multiprocessor (SMP) → Multiple CMPs
• Problem: Coherence with Multiple CMPs
• Old Solution: Hierarchical Directory (Complex & Slow)
• New Solution: Apply Token Coherence
– Developed for glueless multiprocessor [2003]
– Keep: Flat for Correctness
– Exploit: Hierarchical for performance
• Less Complex & Faster than Hierarchical Directory