Coherence Ordering for Ring-based Chip Multiprocessors University of Wisconsin-Madison

Coherence Ordering for
Ring-based Chip Multiprocessors
Mike Marty and Mark D. Hill
University of Wisconsin-Madison
Overview
Rings a viable interconnect for future CMPs
Problem: Ring != Bus for ordering
▫ Bus-based snooping coherence not sufficient
Solutions:
▫ ORDERING-POINT: establish an ordering point
▫ GREEDY-ORDER: greedily order requests
▫ RING-ORDER: complete requests in ring order
RING-ORDER offers fast and stable performance
Outline
Introduction and Motivation
Ring-based Coherence Protocols
Application to a CMP
Results
Conclusion
Future CMPs
Bus?
Crossbar?
Packet-Switched?
Ring?
The “Cell” Processor
Ring Interconnect
Why?
Short, fast point-to-point links
Fewer (data) ports
Less complex than packet-switched
Simple, distributed arbitration
Exploitable ordering for coherence
Cache Coherence for a Ring
Ring is broadcast and offers ordering
Apply existing bus-based snooping protocols?
NO!
Order properties of ring are different
Ring Order != Bus Order
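[Figure: four-node ring (P12, P3, P6, P9) carrying two requests, A and B. P12 observes the order {A, B}; P6 observes the order {B, A}.]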
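The ordering mismatch in the figure above can be reproduced with a small model. This is a sketch under stated assumptions, not the authors' simulator: the ring is unidirectional, each hop takes one time unit, and request A is assumed to come from P9 and request B from P3, both issued at the same time. The function name `arrival_order` and the index mapping are illustrative only.

```python
def arrival_order(num_nodes, requests, observer):
    """Return request names in the order they reach `observer`.

    requests maps a name to (source_node, issue_time). Nodes are numbered
    0..num_nodes-1 and messages move one hop per time unit toward higher
    node numbers, wrapping around (an assumed model of a unidirectional ring).
    """
    arrivals = []
    for name, (source, issue_time) in requests.items():
        hops = (observer - source) % num_nodes  # distance along the ring
        arrivals.append((issue_time + hops, name))
    return [name for _, name in sorted(arrivals)]


# Clock positions mapped to indices: P12 -> 0, P3 -> 1, P6 -> 2, P9 -> 3.
P12, P3, P6, P9 = 0, 1, 2, 3
# Assumption: A is issued by P9 and B by P3, both at time 0.
requests = {"A": (P9, 0), "B": (P3, 0)}

print(arrival_order(4, requests, P12))  # ['A', 'B']
print(arrival_order(4, requests, P6))   # ['B', 'A']
```

On a bus every node would see the same sequence. Here the sequence depends on where the observer sits, which is why a bus-based snooping protocol cannot be applied unchanged.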
Snooping Protocols for Rings
Assumptions:
▫ Unidirectional ring
• Multiple rings per-address OK
▫ Write-back, write-invalidate caches
▫ Eager request forwarding
• e.g., forward message then snoop
• [Strauss et al. ISCA 2006]
Can total bus order be recreated? YES
ORDERING-POINT Example
[Figure sequence, 6 frames: ring of nodes P1 to P11 plus an ordering point. Annotations recovered from each frame:]
Frame 1: P9 issues a Store; message "P9 getM (inactive)" is on the ring; the ordering point is in S; P3 is in O.
Frame 2: message "P9 getM" is now labeled "own request ordered"; P3 goes O → I.
Frame 3: the ordering point goes S → I.
Frame 4: message "P9 ACK" is on the ring; P3 sends "Data to P9".
Frame 5: P6 issues a Store; message "P6 getM" enters the ring while "P9 ACK" and "Data to P9" are still shown.
Frame 6: P9 shows "Store Complete"; "Data to P6" and "P6 getM" are on the ring; P6's Store is pending.
Bottom line: ORDERING-POINT
Requests totally ordered
+ Stable, predictable performance
Slow
– Requests not active immediately
Extra control overhead
– N + N/2 hops for request message
– N/2 hops for Ack message
Can requests be active immediately?
YES
(e.g., IBM Power4/5)
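The control overhead quoted above can be put into numbers. The sketch below only evaluates the slide's own expressions (N + N/2 hops for the request, N/2 hops for the Ack). The ring size of 12 is an assumption taken from the 12 positions drawn in the examples, and `one_lap_hops` is a hypothetical helper giving the single lap a request would need if it were active as soon as it was issued.

```python
def ordering_point_control_hops(n):
    """Control-message hops per request on an n-node ring, per the slide."""
    request_hops = n + n / 2  # N + N/2 hops for the request message
    ack_hops = n / 2          # N/2 hops for the Ack message
    return request_hops, ack_hops, request_hops + ack_hops


def one_lap_hops(n):
    """Hops for a message that is active at once and circles the ring once."""
    return n


n = 12  # assumed ring size, matching the 12 positions drawn in the examples
request, ack, total = ordering_point_control_hops(n)
print(request, ack, total)  # 18.0 6.0 24.0
print(one_lap_hops(n))      # 12
```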
GREEDY-ORDER Example
[Figure sequence, 5 frames: ring of nodes P1 to P12. Annotations recovered from each frame:]
Frame 1: P9 issues a Store; message "P9 getM" with a "response:" field is on the ring; P12 goes S → I; P3 is in O.
Frame 2: P3 goes O → I and is marked "will send data"; message "P9 getM" now carries "response: ACK".
Frame 3: P6 issues a Store; message "P6 getM" with a "response:" field enters the ring while "P9 getM" ("response: ACK") is still in flight.
Frame 4: P3 sends "Data to P9"; P9 is marked "acked"; "P6 getM" continues around the ring.
Frame 5: P9 is in M and performs its Store; "P6 getM" is still on the ring; P6 shows "Store RETRY".
Bottom line: GREEDY-ORDER
Average case is fast
+ Request active immediately
Requires combined snoop response
▫ Synchronous timing of snoops for efficiency
Resorts to unbounded # of retries in conflict
▫ Will conditions eventually allow request completion?
▫ Probabilistic system (e.g. Ethernet)
Recap
Existing Solutions:
1. ORDERING-POINT
• Establishes total order
• Extra latency and control message overhead
2. GREEDY-ORDER
• Fast in common case
• Unbounded retries
Ideal Solution
▫ Fast for average case
▫ Stable for worst case (no retries)
New Approach: RING-ORDER
+ Requests complete in order of ring position
▫ Fully exploits ring ordering
+ Initial request always succeeds
▫ No retries, No ordering point
▫ Fast, stable, predictable performance
Key: Use token counting
▫ All tokens to write, one token to read
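The token-counting rule above (all tokens to write, one token to read) can be sketched as a permission check. This is a minimal illustration, not the protocol implementation: the token count of 4 and the names `BlockCopy` and `move_tokens` are assumptions made for the example.

```python
from dataclasses import dataclass

TOTAL_TOKENS = 4  # assumed token count per block; the slides do not give one


@dataclass
class BlockCopy:
    """One node's copy of a block, tracked only by its token count."""
    tokens: int = 0

    def can_read(self):
        return self.tokens >= 1

    def can_write(self):
        return self.tokens == TOTAL_TOKENS


def move_tokens(src, dst, count):
    """Transfer tokens between copies; tokens are never created or destroyed."""
    if not 0 <= count <= src.tokens:
        raise ValueError("cannot move more tokens than the source holds")
    src.tokens -= count
    dst.tokens += count


memory = BlockCopy(tokens=TOTAL_TOKENS)
p3, p9 = BlockCopy(), BlockCopy()

move_tokens(memory, p3, 1)            # P3 becomes a reader
move_tokens(memory, p9, 1)            # P9 becomes a reader
print(p3.can_read(), p9.can_write())  # True False

move_tokens(p3, p9, 1)                # P9 collects every token before writing
move_tokens(memory, p9, 2)
print(p9.can_write(), p3.can_read())  # True False
```

Because tokens are conserved, a writer holding all of them implies that no other copy can be readable at the same time.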
RING-ORDER Example
[Figure sequence, 4 frames: ring of nodes P1 to P12; the legend marks tokens and the priority token. Annotations recovered from each frame:]
Frame 1: P9 issues a Store; message "P9 getM" is on the ring.
Frame 2: "P9 getM" is still shown; a label "FurthestDest = P9" appears.
Frame 3: P6 issues a Store; message "P6 getM" is on the ring; the "FurthestDest = P9" label is still shown.
Frame 4: the "FurthestDest = P9" label is still shown; P9 and P6 both show "Store Complete".
RING-ORDER Recap
Key: Exploit Order of Ring with token counting
▫ Requests never race with tokens
Furthest Destination field
▫ Carried in responses, tracked in MSHRs
▫ Determines if tokens need to keep moving
Priority token ensures liveness
Data satisfies all requestors during traversal
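The recap above can be illustrated with a toy traversal. This is a deliberately simplified sketch, not the RING-ORDER protocol itself: it assumes a single holder owns all tokens, that tokens and data travel together in one direction, and that every pending request is a getM. It ignores the priority token, reads, and MSHR bookkeeping. The function name `ring_order_completion` is hypothetical.

```python
def ring_order_completion(num_nodes, token_holder, requesters):
    """Toy model: order in which pending getM requests complete.

    Nodes are labeled 1..num_nodes and tokens (with data) move toward higher
    labels, wrapping from num_nodes back to 1. All tokens travel together.
    """
    if not requesters or token_holder in requesters:
        raise ValueError("need at least one requester other than the holder")

    def distance(node):
        return (node - token_holder) % num_nodes

    # The requester furthest along the ring from the holder is the last stop.
    furthest_dest = max(requesters, key=distance)

    completed = []
    node = token_holder
    while node != furthest_dest:
        node = node % num_nodes + 1      # advance one hop
        if node in requesters:
            completed.append(node)       # this requester's store completes
    return completed, furthest_dest


# Assumption: P3 holds all tokens; P9 and P6 both have a getM outstanding.
order, furthest = ring_order_completion(12, token_holder=3, requesters={9, 6})
print(order, furthest)  # [6, 9] 9
```

In this model the requests complete in the order the tokens pass the requesters, regardless of which request was issued first, and the tokens stop moving once they reach the furthest destination.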
RING-ORDER vs. Token Coherence
                              Token Coherence                  RING-ORDER
Safety                        token counting                   token counting
Liveness                      retries + persistent requests    priority token + ring order
DRAM state (bits per block)   Log2 (# tokens)                  1
Applying to Baseline CMP
Interfacing with Memory Controllers
Problem: When should memory respond?
Solution: 1-bit per block of memory
▫ Owner bit for ORDERING-POINT and GREEDY-ORDER
▫ Token-count bit for RING-ORDER
• All or none tokens
Cache the bits in a Memory Interface Cache
▫ Eliminates costly DRAM accesses
▫ Enables GREEDY-ORDER to meet snoop timing
Outline
Introduction and Motivation
Ring-based Coherence Protocols
Application to a CMP
Results
▫ Methodology
▫ Runtime
▫ Traffic
▫ Performance Stability
Conclusion
Methodology
Full-system Simulation
▫ Virtutech Simics
▫ Wisconsin GEMS
• GPL software
• http://www.cs.wisc.edu/gems
Workloads:
▫ Commercial: OLTP, Apache, SpecJBB, Zeus
▫ Scientific: OMPart, OMPfma3d, OMPmgrid
Protocols:
▫ ORDERING-POINT
▫ GREEDY-ORDER (called –IDEAL in paper)
▫ RING-ORDER
Simulation Parameters 1/2
SPARC, 4GHz
64KB I&D, 4-way, 2-cycle access
1MB, 4-way, 15-cycle data access
8MB, 16-way, 25-cycle bank access
Simulation Parameters 2/2
275-cycle DRAM access
Memory Interface Cache
128KB, 16-way
256-bits per tag
Ring Link:
8-cycles total delay
80-bytes per cycle
Normalized Runtime
[Figure: bar chart of normalized runtime (y-axis 0 to 1) for Ordering-Point, Greedy-Order, and Ring-Order on Apache, OLTP, SpecJBB, Zeus, OMPfma3d, OMPmgrid, and OMPart.]
RING-ORDER is up to 52% faster than ORDERING-POINT
Ring Bandwidth
[Figure: stacked bar chart of normalized traffic (bytes), y-axis 0 to 1.2, for Ordering-Point, Greedy-Order, and Ring-Order on Apache, OLTP, SpecJBB, Zeus, OMPfma3d, OMPmgrid, and OMPart. Stack categories: Request Control, Response Control, Writeback Control, Response Data, Writeback Data.]
RING-ORDER uses up to 34% less bandwidth
GREEDY-ORDER Starvation
[Figure: per-processor event timeline with columns time, Processor 3, Processor 4, Processor 6, and Processor 7, starting at cycle 631597033 and covering cycles ...045 to ...262. Recoverable events: repeated "issue getM", "Complete", "ack p7, send data", and "ack p3, send data" entries, interleaved with retries numbered RETRY #1 to RETRY #4 and RETRY #10, RETRY #11. A final entry, +70,000 cycles later, shows RETRY #1402.]
Retries
MAX Retries/Request
Workload    GREEDY-ORDER    RING-ORDER
Apache      10              0
OLTP        8               0
SpecJBB     11              0
Zeus        14              0
OMPmgrid    timed out       0
OMPart      29              0
OMPfma3d    10              0
RING-ORDER offers stable, bounded performance
Conclusion
Rings a viable interconnect for CMPs
Ring != Bus for ordering
RING-ORDER protocol offers best of:
▫ ORDERING-POINT (stable) and,
▫ GREEDY-ORDER (fast)
P.S. RING-ORDER requires NO system-wide snoop response
▫ Useful for hierarchy of rings
BACKUP SLIDES
Flexible Snooping [Strauss et al. ISCA 2006]
Eager vs. Lazy forwarding
Key Differences:
▫ Targets coherence between bus-based CMPs
▫ Logical ring on message-passing interconnect
▫ Protocol similar to GREEDY-ORDER
• Uses a separate combined snoop response message
RING-ORDER also works with logical ring
▫ Possible to extend protocol to send data off the ring
Lazy vs. Eager Forwarding applies to RING-ORDER
▫ Synergistic fit to reduce snoop power