Token Coherence: A Framework for Implementing Multiple-CMP Systems

Mike Marty (1), Jesse Bingham (2), Mark Hill (1), Alan Hu (2), Milo Martin (3), and David Wood (1)
(1) University of Wisconsin-Madison
(2) University of British Columbia
(3) University of Pennsylvania

February 17th, 2005
(C) 2005 Multifacet Project

Summary
• Microprocessor → Chip Multiprocessor (CMP)
• Symmetric Multiprocessor (SMP) → Multiple CMPs
• Problem: Coherence with Multiple CMPs
• Old Solution: Hierarchical Directory (Complex & Slow)
• New Solution: Apply Token Coherence
  – Developed for glueless multiprocessor [2003]
  – Keep: Flat for Correctness
  – Exploit: Hierarchical for performance
• Less Complex & Faster than Hierarchical Directory

Slide 2 Improving Multiple-CMP Systems using Token Coherence

Outline
• Motivation and Background
  – Coherence in Multiple-CMP Systems
  – Example: DirectoryCMP
• Token Coherence: Flat for Correctness
• Token Coherence: Hierarchical for Performance
• Evaluation

Coherence in Multiple-CMP Systems
• Chip Multiprocessors (CMPs) emerging
• Larger systems will be built with Multiple CMPs
[Figure: CMP 1 to CMP 4 joined by an interconnect; each CMP contains processors (P) with private instruction and data caches (I, D), an on-chip interconnect, and L2 banks]

Problem: Hierarchical Coherence
• Intra-CMP protocol for coherence within CMP
• Inter-CMP protocol for coherence between CMPs
• Interactions between protocols increase complexity
  – explodes state space
[Figure: CMP 1 to CMP 4 on an interconnect, labeled with Intra-CMP Coherence inside a chip and Inter-CMP Coherence between chips]

Improving Multiple CMP Systems with Token Coherence
• Token Coherence allows Multiple-CMP systems to be...
  – Flat for correctness, but
  – Hierarchical for performance
[Figure: CMP 1 to CMP 4 on an interconnect; labels: Low Complexity, Fast, Correctness Substrate, Performance Protocol]

Example: DirectoryCMP
• 2-level MOESI Directory
• RACE CONDITIONS!
[Figure: CMP 0 (P0 to P3) and CMP 1 (P4 to P7), each processor with an L1 I&D cache and each CMP with a shared L2 / directory, backed by memory/directory. A Store B triggers getx, fwd, inv, ack, data/ack, and WB messages at both levels, with state annotations for block B (M, S, O, I)]

Token Coherence Summary
• Token Coherence separates performance from correctness
• Correctness Substrate: Enforces coherence invariant and prevents starvation
  1. Safety with Token Counting
  2. Starvation Avoidance with Persistent Requests
• Performance Policy: Makes the common case fast
  – Transient requests to seek tokens
    • Unordered, untracked, unacknowledged
  – Possible prediction, multicast, filters, etc.

Outline
• Motivation and Background
• Token Coherence: Flat for Correctness
  – Safety
  – Starvation Avoidance
• Token Coherence: Hierarchical for Performance
• Evaluation

Example: Token Coherence [ISCA 2003]
[Figure: four nodes P0 to P3, each with an L1 I&D cache, an L2, and local memory (mem 0 to mem 3), joined by an interconnect; Load B and Store B requests are shown at the processors]
• Each memory block initialized with T tokens
• Tokens stored in memory, caches, & messages
• At least one token to read a block
• All tokens to write a block

Extending to Multiple-CMP System
[Figure: CMP 0 (P0, P1) and CMP 1 (P2, P3); each processor has an L1 I&D cache, each CMP has an on-chip interconnect and a shared L2, and mem 0 and mem 1 attach through the global interconnect]

Extending to Multiple-CMP System
[Figure: the same two-CMP system with a Store B request issued on CMP 0]
• Token counting remains flat
• Tokens to caches
  – Handles shared caches and other complex hierarchies

Safety Recap
• Safety: Maintain coherence invariant
  – Only one writer, or multiple readers
• Tokens for Safety
  – T tokens associated with each memory block
  – # tokens encoded in 1 + log2(T) bits
  – Processor acquires all tokens to write, a single token to read
• Tokens passed to nodes in glueless multiprocessor scheme
  – But CMPs have private and shared caches
• Tokens passed to caches in Multiple-CMP system
  – Arbitrary cache hierarchy easily handled
  – Flat for correctness

Some Token Counting Implications
• Memory must store tokens
  – Separate RAM
  – Use extra ECC bits
  – Token cache
• T sized to # caches to allow read-only copies in all caches
• Replacements cannot be silent
  – Tokens must not be lost or dropped
• Targeted for invalidate-based protocols
  – Not a solution for write-through or update protocols
• Tokens must be identified by block address
  – Address must be in all token-carrying messages

Starvation Avoidance
• Request messages can miss tokens
  – In-flight tokens
• Transient Requests are not tracked throughout system
  – Incorrect filtering, multicast, destination-set prediction, etc.
• Possible Solution: Retries
  – Retry w/ optional randomized backoff is effective for races
• Guaranteed Solution: Persistent Requests
  – Heavyweight request guaranteed to succeed
  – Should be rare (uses more bandwidth)
  – Locates all tokens in the system
  – Orders competing requests

Starvation Avoidance
[Figure: P0, P1, and P2 each issue Store B and send GETX requests across the two-CMP system]
• Tokens move freely in the system
  – Transient requests can miss in-flight tokens
  – Incorrect speculation, filters, prediction, etc.

Starvation Avoidance
[Figure: the Store B requests from P0, P1, and P2 remain outstanding]
• Solution: issue Persistent Request
  – Heavyweight request guaranteed to succeed
  – Methods: Centralized [2003] and Distributed (New)

Old Scheme: Central Arbiter [2003]
[Figure: P0, P1, and P2 time out on Store B and send persistent requests to arbiter 0, which queues B: P0, B: P2, B: P1. The arbiter activates P0's request at every cache, then hands off to P2 in three numbered message steps]
– Processors issue persistent requests
– Arbiter orders and broadcasts activate
– Processor sends deactivate to arbiter
– Arbiter broadcasts deactivate (and next activate)
– Bottom Line: handoff is 3 message latencies

Improved Scheme: Distributed Arbitration [NEW]
[Figure: P0, P1, and P2 each issue Store B; every L1, shared L2, and memory holds a table with the entries P0: B, P1: B, P2: B]
– Processors broadcast persistent requests
– Fixed priority (processor number)
– Processors broadcast deactivate
[Figure: after P0 deactivates, the tables at every cache and memory hold only P1: B and P2: B]
– Bottom line: Handoff is a single message latency
  • Subtle point: P0 and P1 must wait until next “wave”

Implementing Distributed Persistent Requests
• Table at each cache
  – Sized to N entries for each processor (we use N=1)
  – Indexed by processor ID
  – Content-addressable by address
• Each incoming message must access table
  – Not on the critical path, so it can be a slow CAM
• Activate/deactivate reordering cannot be allowed
  – Persistent request virtual channel must be point-to-point ordered
  – Or another solution, such as sequence numbers or acks
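The per-cache table described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' hardware design: the class name `PersistentRequestTable`, the `broadcast` helper, and the rule that the lowest processor number wins are assumptions made for the example (the slides say only "fixed priority (processor number)"). The sketch also assumes activates and deactivates arrive in order at every table, and it leaves out the "wave" rule.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class PersistentRequestTable:
    """Hypothetical per-cache table: one entry per processor (N=1)."""
    num_procs: int  # the architected maximum number of processors
    entries: Dict[int, int] = field(default_factory=dict)  # proc ID -> block address

    def activate(self, proc_id: int, addr: int) -> None:
        # Indexed by processor ID; a processor has at most one entry.
        if not 0 <= proc_id < self.num_procs:
            raise ValueError("processor ID outside the architected maximum")
        self.entries[proc_id] = addr

    def deactivate(self, proc_id: int) -> None:
        self.entries.pop(proc_id, None)

    def active_requester(self, addr: int) -> Optional[int]:
        # CAM-style lookup by address. Assumed priority: lowest ID wins.
        matches = [p for p, a in self.entries.items() if a == addr]
        return min(matches) if matches else None


def broadcast(tables: List[PersistentRequestTable], op: str, *args: int) -> None:
    """Deliver the same activate/deactivate to every table (ordered delivery assumed)."""
    for table in tables:
        getattr(table, op)(*args)


B = 0x40  # hypothetical block address
tables = [PersistentRequestTable(num_procs=4) for _ in range(6)]
for proc in (2, 0, 1):  # arrival order does not matter with fixed priority
    broadcast(tables, "activate", proc, B)
first = tables[0].active_requester(B)
broadcast(tables, "deactivate", 0)  # P0 is done; one broadcast hands off
second = tables[0].active_requester(B)
```

Because every table applies the same fixed priority to the same entries, each cache independently agrees on the next requester once the deactivate arrives, which is why the handoff costs a single message latency.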
Implementing Distributed Persistent Requests
• Should reads be distinguished from writes?
  – Not necessary, but
  – Persistent Read request is helpful
• Implications of flat distributed arbitration
  – Simple flat for correctness
  – Global broadcast when used
    • Fortunately they are rare in typical workloads (0.3%)
    • Bad workload (very high contention) would burn bandwidth
  – Maximum # processors must be architected
• What about a hierarchical persistent request scheme?
  – Possible, but correctness is no longer flat
  – Make the common case fast

Reducing Unnecessary Traffic
• Problem: Which token-holding cache responds with data?
• Solution: Distinguish one token as the owner token
  – The owner includes data with token response
  – Clean vs. dirty owner distinction also useful for writebacks

Outline
• Motivation and Background
• Token Coherence: Flat for Correctness
• Token Coherence: Hierarchical for Performance
  – TokenCMP
  – Another look at performance policies
• Evaluation

Hierarchical for Performance: TokenCMP
• Target System:
  – 2-8 CMPs
  – Private L1s, shared L2 per CMP
  – Any interconnect, but high-bandwidth
• Performance Policy Goals:
  – Aggressively acquire tokens
  – Exploit on-chip locality and bandwidth
  – Respect cache hierarchy
  – Detect and handle missed tokens

Hierarchical for Performance: TokenCMP
• Approach:
  – On L1 miss, broadcast within own CMP
    • Local cache responds if possible
  – On L2 miss, broadcast to other CMPs
  – Appropriate L2 bank responds or broadcasts within its CMP
    • Optionally filter
  – Responses between CMPs carry extra tokens for future locality
• Handling missed tokens:
  – Timeout after average memory latency
  – Invoke persistent request (no retries)
• Larger systems can use filters, multicast, soft-state directories

Other Optimizations in TokenCMP
• Implementing E-state
  – Memory responds with all tokens on read request
  – Use clean/dirty owner distinction to eliminate writing back unwritten data
• Implementing Migratory Sharing
  – What is it?
    • A processor’s read request results in exclusive permission if responder has exclusive permission and wrote the block
  – In TokenCMP, simply return all tokens
• Non-speculative delay
  – Hold block for some # cycles so permission isn’t stolen prematurely

Another Look at Performance Policies
• How to find tokens?
  – Broadcast
  – Broadcast w/ filters
  – Multicast (destination-set prediction)
  – Directories (soft or hard)
• Who responds with data?
  – Owner token
    • TokenCMP uses Owner token for Inter-CMP responses
  – Other heuristics
    • For TokenCMP intra-CMP responses, cache responds if it has extra tokens

Transient Requests May Reduce Complexity
• Processor holds the only required state about request
• L2 controller in TokenCMP very simple:
  – Re-broadcasts L1 request message on a miss
  – Re-broadcasts or filters external request messages
  – Possible states:
    • no tokens (I)
    • all tokens (M)
    • some tokens (S)
  – Bounce unexpected tokens to memory
• DirectoryCMP’s L2 controller is complex
  – Allocates MSHR on miss and forward
  – Issues invalidates and receives acks
  – Orders all intra-CMP requests and writebacks
  – 57 states in our L2 implementation!
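The three TokenCMP L2 states listed above follow directly from a token count, which is part of why the controller can stay small. The Python below is a minimal sketch under stated assumptions: the function names `l2_state`, `can_read`, and `can_write` are invented for the example, and the sketch ignores the owner token and whether valid data accompanies the tokens.

```python
def l2_state(tokens_held: int, total_tokens: int) -> str:
    """Map a token count for one block to the I / S / M states named on the slide."""
    if not 0 <= tokens_held <= total_tokens:
        raise ValueError("token count must lie between 0 and T")
    if tokens_held == 0:
        return "I"  # no tokens
    if tokens_held == total_tokens:
        return "M"  # all tokens
    return "S"      # some tokens


def can_read(tokens_held: int) -> bool:
    """At least one token is needed to read a block."""
    return tokens_held >= 1


def can_write(tokens_held: int, total_tokens: int) -> bool:
    """All T tokens are needed to write a block."""
    return tokens_held == total_tokens


T = 16  # hypothetical token count per block
states = [l2_state(n, T) for n in (0, 1, 15, 16)]
```

A directory controller, by contrast, has to track pending requests, invalidations, and acknowledgements, which is where transient states accumulate.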
Writebacks
• DirectoryCMP uses “3-phase writebacks”
  – L1 issues writeback request
  – L2 enters transient state or blocks request
  – L2 responds with writeback ack
  – L1 sends data
• TokenCMP uses “fire-and-forget” writebacks
  – Immediately send tokens and data
  – Heuristic: Only send data if # tokens > 1

Outline
• Motivation and Background
• Token Coherence: Flat for Correctness
• Token Coherence: Hierarchical for Performance
• Evaluation
  – Model checking
  – Performance w/ commercial workloads
  – Robustness

TokenCMP Evaluation
• Simple?
  – Some anecdotal examples and comparisons
  – Model checking
• Fast?
  – Full-system simulation w/ commercial workloads
• Robust?
  – Micro-benchmarks to simulate high contention

Complexity Evaluation with Model Checking
This work performed by Jesse Bingham and Alan Hu of the University of British Columbia
• Methods:
  – TLA+ and TLC
  – DirectoryCMP omits all intra-CMP details
  – TokenCMP’s correctness substrate modeled
• Result:
  – Complexity similar between TokenCMP and non-hierarchical DirectoryCMP
  – Correctness Substrate verified to be correct and deadlock-free
  – All possible performance protocols correct

Performance Evaluation
• Target System:
  – 4 CMPs, 4 procs/CMP
  – 2GHz OoO SPARC, 8MB shared L2 per chip
  – Directly connected interconnect
• Methods: Multifacet GEMS simulator
  – Simics augmented with timing models
  – Released soon: http://www.cs.wisc.edu/gems
• Benchmarks:
  – Performance: Apache, Spec, OLTP
  – Robustness: Locking uBenchmark

Full-system Simulation: Runtime
[Figure: runtime chart; annotations: DRAM Directory, Perfect L2]
– TokenCMP performs 9-50% faster than DirectoryCMP

Full-system Simulation: Inter-CMP Traffic
[Figure: inter-CMP traffic chart]
– TokenCMP traffic is reasonable (or better)
  • DirectoryCMP control overhead greater than broadcast for small system

Full-system Simulation: Intra-CMP Traffic
[Figure: intra-CMP traffic chart]

Performance Robustness
• Locking micro-benchmark (correctness substrate only)
[Figure: results plotted along an axis running from more contention to less contention]

Performance Robustness
• Locking micro-benchmark
[Figure: results plotted along an axis running from more contention to less contention]

Summary
• Microprocessor → Chip Multiprocessor (CMP)
• Symmetric Multiprocessor (SMP) → Multiple CMPs
• Problem: Coherence with Multiple CMPs
• Old Solution: Hierarchical Directory (Complex & Slow)
• New Solution: Apply Token Coherence
  – Developed for glueless multiprocessor [2003]
  – Keep: Flat for Correctness
  – Exploit: Hierarchical for performance
• Less Complex & Faster than Hierarchical Directory
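To make the "flat for correctness" point in the summary concrete, the following Python sketch models token counting for a single block across a flat set of holders. It is an illustration under stated assumptions, not the verified substrate: the `TokenSystem` class and the holder names are invented for the example, transfers are treated as atomic (in-flight messages are not modeled), and the owner token and data are ignored.

```python
from typing import Dict, List, Optional


class TokenSystem:
    """Hypothetical model of one block's tokens spread over memory and caches."""

    def __init__(self, total_tokens: int, holders: List[str]) -> None:
        self.total = total_tokens
        self.tokens: Dict[str, int] = {h: 0 for h in holders}
        self.tokens["memory"] = total_tokens  # block starts with all T tokens at memory

    def transfer(self, src: str, dst: str, count: int) -> None:
        # Tokens are only moved, never created or dropped.
        if count < 1 or self.tokens[src] < count:
            raise ValueError("cannot send tokens that are not held")
        self.tokens[src] -= count
        self.tokens[dst] += count

    def readers(self) -> List[str]:
        return [h for h, n in self.tokens.items() if h != "memory" and n >= 1]

    def writer(self) -> Optional[str]:
        for holder, n in self.tokens.items():
            if holder != "memory" and n == self.total:
                return holder
        return None

    def check_invariant(self) -> None:
        assert sum(self.tokens.values()) == self.total  # token conservation
        w = self.writer()
        assert w is None or self.readers() == [w]       # one writer or many readers


# The holder list is flat: L1s and L2s from any CMP are treated alike.
caches = ["cmp0.l1.p0", "cmp0.l2", "cmp1.l1.p2", "cmp1.l2"]
system = TokenSystem(total_tokens=4, holders=caches)
system.transfer("memory", "cmp0.l1.p0", 1)
system.transfer("memory", "cmp1.l1.p2", 1)
system.check_invariant()
readers_before = system.readers()

# P2 wants to write, so it must collect every token in the system.
system.transfer("cmp0.l1.p0", "cmp1.l1.p2", 1)
system.transfer("memory", "cmp1.l1.p2", 2)
system.check_invariant()
```

Nothing in the invariant refers to which CMP a cache belongs to, so a performance policy is free to exploit the hierarchy without affecting safety.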