Coherence Ordering for Ring-based Chip Multiprocessors
Mike Marty and Mark D. Hill
University of Wisconsin-Madison

Overview
Rings a viable interconnect for future CMPs
Problem: Ring != Bus for ordering
▫ Bus-based snooping coherence not sufficient
Solutions:
▫ ORDERING-POINT: establish an ordering point
▫ GREEDY-ORDER: greedily order requests
▫ RING-ORDER: complete requests in ring order
RING-ORDER offers stability and performance

Outline
Introduction and Motivation
Ring-based Coherence Protocols
Application to a CMP
Results
Conclusion

Future CMPs
Bus? Crossbar? Packet-Switched? Ring?

The “Cell” Processor

Ring Interconnect
Why?
Short, fast point-to-point links
Fewer (data) ports
Less complex than packet-switched
Simple, distributed arbitration
Exploitable ordering for coherence

Cache Coherence for a Ring
Ring is broadcast and offers ordering
Apply existing bus-based snooping protocols?
NO! Order properties of ring are different

Ring Order != Bus Order
[Figure: ring with nodes P12, P3, P6, and P9 and two requests, A and B, in flight; one node observes the order {A, B} while another observes {B, A}.]

Outline
Introduction and Motivation
Ring-based Coherence Protocols
Application to a CMP
Results
Conclusion

Snooping Protocols for Rings
Assumptions:
▫ Unidirectional ring
• Multiple rings per-address OK
▫ Write-back, write-invalidate caches
▫ Eager request forwarding
• e.g., forward message then snoop
• [Strauss et al. ISCA 2006]
Can total bus order be recreated?
YES

ORDERING-POINT Example
[Figure: animated ring diagram of nodes P1 to P12 with a designated ordering point. Labels recovered from the frames, in order: P9 issues a getM for a Store (the request is inactive at first); "own request ordered"; the owner moves O to I and the sharer S to I; "P9 ACK" and "Data to P9"; P6 issues a getM for a Store; P9's Store completes; "Data to P6".]

Bottom line: ORDERING-POINT
Requests totally ordered
+ Stable, predictable performance
Slow
– Requests not active immediately
Extra control overhead
– N + N/2 hops for request message
– N/2 hops for Ack message

Can requests be active immediately?
YES (e.g., IBM Power4/5)

GREEDY-ORDER Example
[Figure: animated ring diagram of nodes P1 to P12. Labels recovered from the frames, in order: P9 issues a getM for a Store; the combined response is "ACK" and the owner (O to I) "will send data"; P6 issues a getM for a Store; "Data to P9"; P9 reaches M with its request acked; P6's getM ends in RETRY.]

Bottom line: GREEDY-ORDER
Average case is fast
+ Request active immediately
Requires combined snoop response
▫ Synchronous timing of snoops for efficiency
Resorts to unbounded # of retries in conflict
▫ Will conditions eventually allow request completion?
▫ Probabilistic system (e.g. Ethernet)

Recap
Existing Solutions:
1. ORDERING-POINT
• Establishes total order
• Extra latency and control message overhead
2. 
GREEDY-ORDER
• Fast in common case
• Unbounded retries
Ideal Solution
▫ Fast for average case
▫ Stable for worst case (no retries)

New Approach: RING-ORDER
+ Requests complete in order of ring position
▫ Fully exploits ring ordering
+ Initial request always succeeds
▫ No retries, no ordering point
▫ Fast, stable, predictable performance
Key: Use token counting
▫ All tokens to write, one token to read

RING-ORDER Example
[Figure: animated ring diagram of nodes P1 to P12 with a legend for "token" and "priority token" markers. Labels recovered from the frames, in order: P9 issues a getM for a Store; "FurthestDest = P9"; P6 issues a getM for a Store; P9's Store completes; P6's Store completes.]

RING-ORDER Recap
Key: Exploit order of ring with token counting
▫ Requests never race with tokens
Furthest Destination field
▫ Carried in responses, tracked in MSHRs
▫ Determines if tokens need to keep moving
Priority token ensures liveness
Data satisfies all requestors during traversal

RING-ORDER vs. Token Coherence
                              Token Coherence                  RING-ORDER
Safety                        token counting                   token counting
Liveness                      retries + persistent requests    priority token + ring order
DRAM state (bits per block)   Log2(# tokens)                   1

Outline
Introduction and Motivation
Ring-based Coherence Protocols
Application to a CMP
Results
Conclusion

Applying to Baseline CMP

Interfacing with Memory Controllers
Problem: When should memory respond?
Solution: 1 bit per block of memory
▫ Owner bit for ORDERING-POINT and GREEDY-ORDER
▫ Token-count bit for RING-ORDER
• All or none tokens
Cache the bits in a Memory Interface Cache
▫ Eliminates costly DRAM accesses
▫ Enables GREEDY-ORDER to meet snoop timing

Outline
Introduction and Motivation
Ring-based Coherence Protocols
Application to a CMP
Results
▫ Methodology
▫ Runtime
▫ Traffic
▫ Performance Stability
Conclusion

Methodology
Full-system Simulation
▫ Virtutech Simics
▫ Wisconsin GEMS
• GPL software
• http://www.cs.wisc.edu/gems
Workloads:
▫ Commercial: OLTP, Apache, SpecJBB, Zeus
▫ Scientific: OMPart, OMPfma3d, OMPmgrid
Protocols:
▫ ORDERING-POINT
▫ GREEDY-ORDER (called –IDEAL in paper)
▫ RING-ORDER

Simulation Parameters 1/2
SPARC, 4 GHz
64KB I&D, 4-way, 2-cycle access
1MB, 4-way, 15-cycle data access
8MB, 16-way, 25-cycle bank access

Simulation Parameters 2/2
275-cycle DRAM access
Memory Interface Cache: 128KB, 16-way, 256 bits per tag
Ring Link: 8 cycles total delay, 80 bytes per cycle

Normalized Runtime
[Figure: bar chart of normalized runtime (axis 0 to 1) for Apache, OLTP, SpecJBB, Zeus, OMPfma3d, OMPmgrid, and OMPart under Ordering-Point, Greedy-Order, and Ring-Order.]
RING-ORDER is up to 52% faster than ORDERING-POINT

Ring Bandwidth
[Figure: stacked bar chart of normalized traffic (bytes, axis 0 to 1.2) for Apache, OLTP, SpecJBB, Zeus, OMPfma3d, OMPmgrid, and OMPart under Ordering-Point, Greedy-Order, and Ring-Order, broken down into Request Control, Response Control, Writeback Control, Response Data, and Writeback Data.]
RING-ORDER uses up to 34% less bandwidth

GREEDY-ORDER Starvation
[Figure: timeline with one column each for Processor 3, Processor 4, Processor 6, and Processor 7; the first timestamp shown is 631597033. The columns interleave "issue getM", "RETRY #n", "Complete", and "ack p3, send data" / "ack p7, send data" events. Some requests complete after a few retries, while one is still retrying after +70,000 cycles (RETRY #1402).]

Retries
MAX Retries/Request
Workload    GREEDY-ORDER    RING-ORDER
Apache      10              0
OLTP        8               0
SpecJBB     11              0
Zeus        14              0
OMPmgrid    timed out       0
OMPart      29              0
OMPfma3d    10              0
RING-ORDER offers stable, bounded performance

Conclusion
Rings a viable interconnect for CMPs
Ring != Bus for ordering
RING-ORDER protocol offers best of:
▫ ORDERING-POINT (stable), and
▫ GREEDY-ORDER (fast)
P.S. RING-ORDER requires NO system-wide snoop response
▫ Useful for hierarchy of rings

BACKUP SLIDES

Flexible Snooping [Strauss et al. ISCA 2006]
Eager vs. Lazy forwarding
Key Differences:
▫ Targets coherence between bus-based CMPs
▫ Logical ring on message-passing interconnect
▫ Protocol similar to GREEDY-ORDER
• Uses a separate combined snoop response message
RING-ORDER also works with logical ring
▫ Possible to extend protocol to send data off the ring
Lazy vs. Eager Forwarding applies to RING-ORDER
▫ Synergistic fit to reduce snoop power
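
Token counting sketch
The RING-ORDER slides state the safety rule as "all tokens to write, one token to read". The Python sketch below illustrates only that rule. The token count of 4, the CacheBlock class, and the transfer helper are assumptions made for illustration and do not come from the slides or the paper. It does not model the priority token, the Furthest Destination field, or ring ordering.

```python
from dataclasses import dataclass

TOTAL_TOKENS = 4  # assumed tokens per block; the slides do not give a number


@dataclass
class CacheBlock:
    """One cache's view of a block: how many of its tokens this cache holds."""
    tokens: int = 0

    def can_read(self) -> bool:
        # One token is enough to read.
        return self.tokens >= 1

    def can_write(self) -> bool:
        # A write needs every token, so no other cache can still read.
        return self.tokens == TOTAL_TOKENS


def transfer(src: CacheBlock, dst: CacheBlock, count: int) -> None:
    """Move tokens between caches; tokens are never created or destroyed."""
    if count < 0 or count > src.tokens:
        raise ValueError("cannot transfer more tokens than held")
    src.tokens -= count
    dst.tokens += count


# Hypothetical scenario: the owner starts with all tokens and gives one away.
owner = CacheBlock(tokens=TOTAL_TOKENS)
requester = CacheBlock()
transfer(owner, requester, 1)
```

After the single transfer both caches can read and neither can write; the requester can write only once it has collected the remaining tokens from the owner.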