Cache Coherence Techniques for Multicore Processors
Dissertation Defense
Mike Marty, 12/19/2007

Key Contributions
Trend: Multicore ring interconnects emerging
Challenge: Order of ring != order of bus
Contribution: New protocol that exploits ring order

Trend: Multicore now the basic building block
Challenge: Hierarchical coherence for Multiple-CMP systems is complex
Contribution: DirectoryCMP and TokenCMP

Trend: Workload consolidation with space sharing
Challenge: Physical hierarchies often do not match workloads
Contribution: Virtual Hierarchies

Outline
Introduction and Motivation
• Multicore Trends
Virtual Hierarchies
• Focus of presentation
Multiple-CMP Coherence
Ring-based Coherence
Conclusion

Is SMP + On-chip Integration == Multicore?
[Diagram: a traditional SMP design mapped onto one chip: cores P0-P3 with private caches, a shared bus, and a memory controller]

Multicore Trends
Trend: On-chip interconnect
• Competes for the same resources as cores and caches
• Ring is an emerging multicore interconnect

Trend: Latency/bandwidth tradeoffs
• Increasing on-chip wire delay and memory latency
• Coherence protocol interacts with the shared-cache hierarchy

Trend: Multicore is the basic building block
• Multiple-CMP systems instead of SMPs
• Hierarchical systems required

Trend: Workload consolidation with space sharing
• More cores enable more workload consolidation
• Space sharing instead of time sharing
• Opportunities to optimize caching and coherence
[Diagram: one multicore chip space-shared among VM 1, VM 2, and VM 3]

Outline
Introduction and Motivation
Virtual Hierarchies [ISCA 2007, IEEE Micro Top Pick 2008]
• Focus of presentation
Multiple-CMP Coherence
Ring-based Coherence
Conclusion

Virtual Hierarchy Motivations
[Diagram: APP 1 through APP 4 space-sharing one chip]
• Space sharing
• Server (workload) consolidation
• Tiled architectures

Motivation: Server Consolidation
[Diagram sequence: a 64-core tiled CMP (each tile: core, L1, L2 cache bank) space-shared among consolidated workloads: a www server, database servers #1 and #2, and middleware servers]
The consolidation scenario motivates four goals:
• Optimize performance: keep each workload's data close to the tiles it runs on
• Isolate performance: one workload should not interfere with another
• Dynamic partitioning: the assignment of workloads to tiles changes over time
• Inter-VM sharing: e.g., VMware's content-based page sharing, up to 60% reduced memory

Outline
Introduction and Motivation
Virtual Hierarchies
• Expanded Motivation
• Non-hierarchical approaches
• Proposed Virtual Hierarchies
• Evaluation
• Related Work
Ring-based and Multiple-CMP Coherence
Conclusion

Tiled Architecture Memory System
[Diagram: tiled CMP with a core, L1, and L2 cache bank per tile, plus memory controllers]
Global broadcast is too expensive.

TAG-DIRECTORY
[Diagram: a tile reading block A sends its request (1) to a central duplicate-tag directory, which forwards it (2) to the tile holding A; that tile responds with data (3)]
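To make this baseline concrete, here is a minimal sketch of the three-hop indirection a TAG-DIRECTORY-style protocol performs; the class, message strings, and addresses are my own illustration, not the dissertation's implementation.

```python
# Minimal sketch of TAG-DIRECTORY-style indirection (illustrative only).
# The directory tracks, per block, which tile currently holds it; a miss
# goes to the directory (hop 1), is forwarded to the holder (hop 2), and
# the holder supplies data to the requestor (hop 3).

class DuplicateTagDirectory:
    def __init__(self):
        self.holder = {}                 # block address -> tile id

    def handle_getM(self, block, requestor):
        prev = self.holder.get(block)
        self.holder[block] = requestor   # requestor becomes the new holder
        if prev is None:
            return f"memory supplies block {hex(block)} to tile {requestor}"
        return (f"fwd to tile {prev}; "
                f"tile {prev} sends block {hex(block)} to tile {requestor}")

directory = DuplicateTagDirectory()
print(directory.handle_getM(0x4000, requestor=3))   # cold miss: memory responds
print(directory.handle_getM(0x4000, requestor=9))   # 3-hop cache-to-cache transfer
```

STATIC-BANK-DIRECTORY, next, replaces the central structure with home banks selected by interleaving on the block address.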
STATIC-BANK-DIRECTORY
[Diagram: a tile reading block A sends its request (1) to the statically mapped home bank for A, which forwards it (2) to the tile holding A; that tile responds with data (3)]

STATIC-BANK-DIRECTORY with hypervisor-managed cache
[Diagram: the same request flow, but with the hypervisor managing the address-to-bank mapping so that block A's home bank falls within the requesting workload's own tiles]

Goals
                              {STATIC-BANK, TAG}-DIRECTORY    STATIC-BANK-DIRECTORY w/ hypervisor-managed cache
Optimize Performance          No                              Yes
Isolate Performance           No                              Yes
Allow Dynamic Partitioning    Yes                             ?
Support Inter-VM Sharing      Yes                             Yes
Hypervisor/OS Simplicity      Yes                             No

Outline (recap): Proposed Virtual Hierarchies

Virtual Hierarchies
Key Idea: Overlay a two-level cache and coherence hierarchy
- First level harmonizes with the VM/workload
- Second level allows inter-VM sharing, migration, and reconfiguration

VH: First-Level Protocol
Goals:
• Exploit locality from space affinity
• Isolate resources
Strategy: Directory protocol
• Interleave directories across first-level tiles
• Store the L2 block at the first-level directory tile
Questions:
• How to name directories?
• How to name sharers?

VH: Naming First-level Directory
Select the Dynamic Home Tile with a VM Config Table
• Hardware VM Config Table at each tile
• Set by the hypervisor during scheduling
Example: the address bits above the 6-bit block offset (…000101) index the per-tile VM Config Table; entries 0 through 5 map to p12, p13, p14, p12, p13, p14 and entry 63 maps to p12, so index 5 selects Dynamic Home Tile p14. (A code sketch of this lookup appears a few slides below.)

VH: Dynamic Home Tile Actions
The Dynamic Home Tile either:
• Returns data cached at its L2 bank
• Generates forwards/invalidates
• Issues a second-level request
Stable first-level states (a subset):
• Typical: M, E, S, I
• Atypical:
  ILX: L2 Invalid, points to exclusive tile
  SLS: L2 Shared, other tiles share
  SLSX: L2 Shared, other tiles share, exclusive to first level

VH: Naming First-level Sharers
Any tile can share the block.
Solution: full bit-vector
• 64 bits for a 64-tile system
• Names multiple sharers or a single exclusive tile
Alternatives:
• First-level broadcast
• (Dynamic) coarse granularity

Virtual Hierarchies
Two solutions for global coherence: VHA and VHB
[Diagram: per-VM first-level regions with a global second level backed by the memory controller(s)]

Protocol VHA
Directory as the second-level protocol
• Any tile can act as a first-level directory
• How to track and name first-level directories?
Full bit-vector of sharers to name any tile
• State stored in DRAM
• Possibly cached on-chip
+ Maximum scalability and message efficiency
- DRAM state (~12.5% overhead)

VHA Example
[Diagram: (1) a tile sends getM A to its dynamic home tile, (2) the home tile issues a second-level getM A to the directory at the memory controller, (3)-(4) forwards travel to the remote first-level directory and on to the tile holding A, (5)-(6) data returns to the requestor]

VHA: Handling Races
Blocking directories
• Handle races within the same protocol
• Require a blocking buffer plus wakeup/replay logic
[Diagram: a directory blocked on getM A while another getM A waits]
Inter-intra races
• Naïve blocking leads to deadlock!
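Stepping back for a moment, here is the promised sketch of the dynamic-home-tile lookup from the Naming First-level Directory slide. It is my own illustration: the field widths and the hypervisor's tile assignment are assumptions chosen to match the slide's example, and the real table is a small hardware structure written at VM scheduling time.

```python
# Minimal sketch of the dynamic-home-tile lookup (illustrative only).
# Address bits above the 64-byte block offset index a per-tile VM Config
# Table whose entries name only tiles belonging to the requestor's VM.

BLOCK_OFFSET_BITS = 6          # 64-byte blocks, as in the slide's example
TABLE_ENTRIES = 64             # one entry per index, written by the hypervisor

def dynamic_home_tile(addr, vm_config_table):
    index = (addr >> BLOCK_OFFSET_BITS) % TABLE_ENTRIES
    return vm_config_table[index]

# Assumed mapping: the hypervisor has scheduled this VM onto tiles 12, 13,
# and 14, so every table entry names one of those three tiles.
vm_config_table = [(12, 13, 14)[i % 3] for i in range(TABLE_ENTRIES)]

print(dynamic_home_tile(0x000140, vm_config_table))  # index 5 -> tile 14
```

Because every entry names a tile inside the VM's partition, first-level coherence traffic never leaves that partition.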
VHA: Handling Races (cont.)
[Diagram: two first-level directories, each blocked on a getM A, with a second-level FWD A stuck behind a blocked directory]
Possible solution:
• Always handle the second-level message at the first level
• But this causes an explosion of the state space
The second level may interrupt first-level actions:
• First-level indirections, invalidations, writebacks

VHA: Handling Races (cont.)
Reduce the state-space explosion with Safe States:
• A subset of the transient states
• Immediately handle the second-level message
• Limit concurrency between the protocols
Algorithm:
• Level-one requests either complete, or enter a safe state before issuing a level-two request
• Level-one directories handle level-two forwards once a safe state is reached (they may stall)
• Level-two requests are eventually handled by the level-two directory
• Completion messages unblock directories

Virtual Hierarchies
Two solutions for global coherence: VHA and VHB
[Diagram: per-VM first-level regions with a global second level backed by the memory controller(s)]

Protocol VHB
Broadcast as the second-level protocol
• Broadcast locates first-level directory tiles
• The memory controller tracks the outstanding second-level requestor
Attach a token count to each block
• T tokens per block: one token to read, all T to write
• Allows 1 bit at memory per block
• Eliminates system-wide ACK responses
(A code sketch of this token-counting rule follows a few slides below.)

Protocol VHB: Token Coalescing
Memory logically holds all or none of a block's tokens:
• Enables the 1-bit token count
A replacing tile sends its tokens to the memory controller:
• The message usually contains all tokens
Process:
• Tokens are held in a Token Holding Buffer (THB)
• A FIND broadcast is initiated to locate other first-level directories with tokens
• First-level directories respond to the THB and send their tokens
• Repeat on a race

VHB Example
[Diagram: (1) a tile sends getM A to its dynamic home tile, (2) the home tile issues a global getM A broadcast, (3) the memory controller responds with data plus tokens, (4)-(5) the broadcast also reaches the first-level directory holding A, which forwards it so the remaining tokens reach the requestor]

Goals
                              {DRAM, STATIC-BANK, TAG}-DIRECTORY    STATIC-BANK-DIRECTORY w/ hypervisor-managed cache    Virtual Hierarchies: VHA and VHB
Optimize Performance          No                                    Yes                                                  Yes
Isolate Performance           No                                    Yes                                                  Yes
Allow Dynamic Partitioning    Yes                                   ?                                                    Yes
Support Inter-VM Sharing      Yes                                   Yes                                                  Yes
Hypervisor/OS Simplicity      Yes                                   No                                                   Yes

VHNULL
Are two levels really necessary? VHNULL: first level only
Implications:
• Many OS modifications for a single-OS environment
• Dynamic partitioning requires cache flushes
• Inter-VM sharing is difficult
• Hypervisor complexity increases
• Requires atomic updates of VM Config Tables
• Limits optimized placement policies

VH: Capacity/Latency Trade-off
Maximize capacity:
• Store the only L2 copy at the dynamic home tile
• But L2 access time is penalized
• Especially for large VMs
Minimize L2 access latency/bandwidth:
• Replicate data in the local L2 slice
• Selective/adaptive replication is well studied: ASR [Beckmann et al.], CC [Chang et al.]
• But the dynamic home tile is still needed for the first level
Can we exploit the virtual hierarchy for placement?
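Before turning to placement, here is the token-counting sketch promised in the VHB slide. It is illustrative only: the token total and the bookkeeping structure are my assumptions, and the real protocol also handles coalescing through the THB and the FIND broadcast.

```python
# Minimal sketch of VHB's token rule: T tokens per block, one token needed
# to read, all T needed to write.  Because memory logically holds either all
# tokens or none, its token count fits in a single bit per block.

T = 64  # assumed here: one token per tile in a 64-tile system

class BlockTokens:
    def __init__(self):
        self.memory_has_all = True      # the 1-bit token count at memory
        self.tile_tokens = {}           # tile id -> tokens held on chip

    def can_read(self, tile):
        return self.tile_tokens.get(tile, 0) >= 1

    def can_write(self, tile):
        return self.tile_tokens.get(tile, 0) == T

    def memory_supplies(self, tile):
        # Memory responds only when it holds all tokens; afterwards it has none.
        assert self.memory_has_all
        self.memory_has_all = False
        self.tile_tokens[tile] = T

b = BlockTokens()
b.memory_supplies(tile=3)
print(b.can_write(3), b.can_read(3))   # True True: tile 3 holds all T tokens
```

The single memory_has_all flag is exactly the 1-bit-per-block state the slide refers to: memory either owns every token for a block or none of them.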
VH: Data Placement Optimization Policy
Data from memory is placed in the tile's local L2 bank
• No tag is allocated at the dynamic home tile
Use second-level coherence on the first sharing miss
• Then allocate a tag at the dynamic home tile for future sharing misses
Benefits:
• Private data allocates in the tile's local L2 bank
• The overhead of replicating data is reduced
• Fast, first-level sharing for widely shared data

Outline (recap): Evaluation

VH Evaluation Methods
Wisconsin GEMS
Target system: 64-core tiled CMP
• In-order SPARC cores
• 1 MB, 16-way L2 cache per tile, 10-cycle access
• 2D mesh interconnect, 16-byte links, 5-cycle link latency
• Eight on-chip memory controllers, 275-cycle DRAM latency

VH Evaluation: Simulating Consolidation
Challenge: bring-up of consolidated workloads
Solution: approximate virtualization
• Combine existing Simics checkpoints: a script turns single-workload checkpoints (e.g., an 8p checkpoint with Memory0, P0-P7, PCI0, DISK0) into one 64p checkpoint (P0-P63) with per-VM copies (VM0_Memory0, VM0_PCI0, VM0_DISK0, VM1_Memory0, VM1_PCI0, VM1_DISK0, …)

VH Evaluation: Simulating Consolidation
At simulation time, Ruby handles the mapping:
• Converts <Processor ID, 32-bit address> pairs into 36-bit physical addresses
• Schedules VMs onto adjacent cores by steering Simics requests to the appropriate L1 controllers
• Memory controllers are evenly interleaved
Bottom line:
• Static scheduling
• No hypervisor execution simulated
• No content-based page sharing
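A minimal sketch of the address flattening just described. The 8-cores-per-VM mapping and the 4 GB-per-checkpoint size are my assumptions; the slide only states that a <Processor ID, 32-bit address> pair becomes a 36-bit address.

```python
# Illustrative only: flatten a per-VM <processor id, 32-bit address> pair into
# one 36-bit physical address by stacking each VM's memory image into its own
# 4 GB slice, the way a consolidated Ruby target could lay them out.

CORES_PER_VM = 8
VM_ADDRESS_BITS = 32            # each checkpoint sees a 32-bit address space

def to_physical_36(processor_id, vm_address):
    vm_id = processor_id // CORES_PER_VM          # static VM-to-core mapping
    return (vm_id << VM_ADDRESS_BITS) | vm_address

# Processor 13 belongs to VM 1, so its addresses land in the second 4 GB slice.
print(hex(to_physical_36(13, 0x0001_2340)))       # 0x100012340
```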
VH Evaluation: Workloads
OLTP, SpecJBB, Apache, Zeus
• Separate instance of Solaris for each VM
Homogeneous consolidation
• Simulate the same-size workload N times
• Unit of work identical across all workloads
• (Each workload staggered by 1,000,000+ instructions)
Heterogeneous consolidation
• Simulate different-size, different workloads
• Report cycles-per-transaction for each workload

VH Evaluation: Baseline Protocols
DRAM-DIRECTORY:
• 1 MB directory cache per controller
• Each tile nominally private, but replication limited
TAG-DIRECTORY:
• 3-cycle central tag directory (1024 ways), non-pipelined
• Replication limited
STATIC-BANK-DIRECTORY:
• Home tiles interleave by frame address
• Home tile stores the only L2 copy

VH Evaluation: VHA and VHB Protocols
VHA
• Based on the DirectoryCMP implementation
• Dynamic home tile stores the only L2 copy
VHB with optimizations
• Private-data placement optimization policy (shared data is stored at the home tile, private data is not)
• Can violate inclusiveness (evict an L2 tag with sharers)
• Memory data returned directly to the requestor

Micro-benchmark: Sharing Latency
[Line chart: average sharing latency in cycles (0-120) versus processors per VM (0-64) for Dram-Dir, Static-Bank-Dir, Tag-Dir, VH_A, and VH_B]

Result: Runtime for 8x8p Homogeneous Consolidation
[Bar chart: normalized runtime (0-1.2) for OLTP, Apache, Zeus, and SpecJBB under each protocol]

Result: Memory Stall Cycles for 8x8p Homogeneous Consolidation
[Stacked bar chart: normalized memory stall cycles (0-1.2) for OLTP, Apache, Zeus, and SpecJBB, broken into Off-chip, Local L2, Remote L1, and Remote L2 components]

Result: Runtime for 16x4p Homogeneous Consolidation
[Bar chart: normalized runtime (0-1.2) for OLTP, Apache, Zeus, and SpecJBB under each protocol]

Result: Runtime for 4x16p Homogeneous Consolidation
[Bar chart: normalized runtime (0-1.2) for OLTP, Apache, Zeus, and SpecJBB under each protocol]

Result: Heterogeneous Consolidation, mixed1 configuration
[Bar chart: normalized cycles-per-transaction (CPT, 0-1.2) for Dram-Dir, Static-Bank-Dir, Tag-Dir, VH_A, and VH_B across VM0: Apache, VM1: Apache, VM2: OLTP, VM3: OLTP, VM4: JBB, VM5: JBB, VM6: JBB]

Result: Heterogeneous Consolidation, mixed2 configuration
[Bar chart: normalized cycles-per-transaction (CPT, 0-1.2) for Dram-Dir, Static-Bank-Dir, Tag-Dir, VH_A, and VH_B across VM0: Apache, VM1: Apache, VM2: Apache, VM3: Apache, VM4: OLTP, VM5: OLTP, VM6: OLTP, VM7: OLTP]

Effect of Replication
Treat each tile's L2 bank as private:

                     Apache 8x8p   OLTP 8x8p   Zeus 8x8p   JBB 8x8p
DRAM-DIR             19.7%         14.4%       9.29%       0.06%
STATIC-BANK-DIR      -33.0%        3.31%       -7.02%      -11.2%
TAG-DIR              1.27%         3.91%       1.63%       -0.22%
VHA                  n/a           n/a         n/a         n/a
VHB                  -11.0%        -5.22%      -0.98%      -0.12%

Outline (recap): Related Work

Virtual Hierarchies: Related Work
Commercial systems usually support partitioning
Sun (Starfire and others)
• Physical partitioning
• No coherence between partitions
IBM's LPAR
• Logical partitions, time-slicing of processors
• Global coherence, but does not optimize space sharing

Virtual Hierarchies: Related Work
Systems approaches to space affinity:
• Cellular Disco; managing L2 via the OS [Cho et al.]
Shared L2 cache partitioning
• Way-based, replacement-based
• Molecular Caches (~VHNULL)
Cache organization and replication
• D-NUCA, NuRapid, Cooperative Caching, ASR
Quality of service
• Virtual Private Caches [Nesbit et al.]
• More

Virtual Hierarchies: Related Work
Coherence protocol implementations
• Token coherence with multicast
• Multicast snooping
Two-level directory
• Compaq Piranha
• Pruning caches [Scott et al.]

Summary: Virtual Hierarchies
Contribution: the virtual hierarchy idea
• An alternative to physical, hard-wired hierarchies
• Optimized for space sharing and workload consolidation
Contribution: VHA and VHB
• Two-level virtual hierarchy implementations
Published in ISCA 2007 and selected as a 2008 IEEE Micro Top Pick

Outline
Introduction and Motivation
Virtual Hierarchies
Ring-based Coherence [MICRO 2006]
• Skip, 5-minute version, or 15-minute version?
Multiple-CMP Coherence [HPCA 2005]
• Skip, 5-minute version, or 15-minute version?
Conclusion
Contribution: Ring-based Coherence
Problem: order of bus != order of ring
• Cannot apply bus-based snooping protocols
Existing solutions
• Use unbounded retries to handle contention
• Use a performance-costly ordering point
Contribution: RING-ORDER
• Exploits the round-robin order of the ring
• Fast and stable performance
Appears in MICRO 2006

Contribution: Multiple-CMP Coherence
Hierarchy is now the default, and it increases complexity
• Most prior hierarchical protocols use bus-based nodes
Contribution: DirectoryCMP
• Two-level directory protocol
Contribution: TokenCMP
• Extends token coherence to Multiple-CMP systems
• Flat for correctness, hierarchical for performance
Appears in HPCA 2005

Other Research and Contributions
Wisconsin GEMS
• ISCA '05 tutorial, CMP development, release, support
Amdahl's Law in the Multicore Era
• Mark D. Hill and Michael R. Marty, to appear in IEEE Computer
ASR: Adaptive Selective Replication for CMP Caches
• Beckmann et al., MICRO 2006
LogTM-SE: Decoupling Hardware Transactional Memory from Caches
• Yen et al., HPCA 2007

Key Contributions
Trend: Multicore ring interconnects emerging
Challenge: Order of ring != order of bus
Contribution: New protocol that exploits ring order
Trend: Multicore now the basic building block
Challenge: Hierarchical coherence for Multiple-CMP systems is complex
Contribution: DirectoryCMP and TokenCMP
Trend: Workload consolidation with space sharing
Challenge: Physical hierarchies often do not match workloads
Contribution: Virtual Hierarchies

Backup Slides

What about Physical Hierarchy / Clusters?
[Diagram: 16 cores grouped into four clusters; each cluster's four cores have private L1 caches and share an L2]

Physical Hierarchy / Clusters
[Diagram: the same clustered CMP with consolidated workloads (www server, database server #1, middleware server #1) spanning cluster boundaries]
• Interference between workloads in shared caches
• Lots of prior work on partitioning a single shared L2 cache

Protocol VHNULL
Example: steps for VM migration from tiles {M} to tiles {N}
1. Stop all threads on {M}
2. Flush {M} caches
3. Update {N} VM Config Tables
4. Start threads on {N}

Protocol VHNULL
Example: inter-VM content-based page sharing
• Is read-only sharing possible with VHNULL?
VMware's implementation:
• A global hash table stores hashes of pages
• Guest pages are scanned by the VMM and hashes computed
• Full comparison of pages on a hash match
Potential VHNULL implementation:
• How does the hypervisor scan guest pages? Are they modified in a cache?
• Even read-only pages must initially be written at some point

5-minute Ring Coherence

Ring Interconnect
Why?
• Short, fast point-to-point links
• Fewer (data) ports
• Less complex than packet-switched
• Simple, distributed arbitration
• Exploitable ordering for coherence

Cache Coherence for a Ring
• Ring is a broadcast medium and offers ordering
• Apply existing bus-based snooping protocols? NO!
• The ordering properties of a ring are different

Ring Order != Bus Order
[Diagram: a ring with P12, P9, P6, and P3; two requests A and B circulate, with A shown near P9 and B near P3; P12 observes the order {A, B} while P6 observes {B, A}]
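A small illustration of this point, using my own example; the injection points and the ring direction are assumptions chosen to match the picture. On a unidirectional ring, two nodes can observe a pair of concurrent requests in opposite orders, which a bus can never do.

```python
# Illustrative only: two requests injected simultaneously at nodes 3 and 9 of
# a unidirectional (clockwise) ring are observed in opposite orders by nodes
# 12 and 6, so no bus-like total order exists.

RING = [3, 6, 9, 12]        # node ids in clockwise order (assumed layout)

def hops(src, dst):
    """Clockwise hops a message takes from src to dst."""
    return (RING.index(dst) - RING.index(src)) % len(RING)

def arrival_order(observer, sources):
    """Order in which simultaneously injected requests reach the observer."""
    return sorted(sources, key=lambda src: hops(src, observer))

print(arrival_order(12, [3, 9]))   # [9, 3]: P12 sees P9's request first
print(arrival_order(6,  [3, 9]))   # [3, 9]: P6 sees P3's request first
```

A bus-based snooping protocol relies on every cache seeing requests in one total order, which this little example shows a ring does not provide by itself.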
Ring-based Coherence
Existing solutions:
1. ORDERING-POINT
   • Establishes a total order
   • Extra latency and control-message overhead
2. GREEDY-ORDER
   • Fast in the common case
   • Unbounded retries
Ideal solution
• Fast in the average case
• Stable in the worst case (no retries)

New Approach: RING-ORDER
+ Requests complete in order of ring position
  • Fully exploits the ring ordering
+ Initial requests always succeed
  • No retries, no ordering point
  • Fast, stable, predictable performance
Key: use token counting
• All tokens to write, one token to read

RING-ORDER Example
[Diagram sequence on a 12-node ring (P1-P12), with a legend distinguishing ordinary tokens from the priority token: getM requests circulate for competing stores, a requestor records FurthestDest = P9, tokens flow around the ring, and the stores complete in ring order]

Ring-based Coherence: Results Summary
System: 8 cores with private L2s and a shared L3
Key results:
• RING-ORDER outperforms ORDERING-POINT by 7-86% with in-order cores
• RING-ORDER offers similar, or slightly better, performance than GREEDY-ORDER
• Pathological starvation did occur with GREEDY-ORDER

5-minute Multiple-CMP Coherence

Problem: Hierarchical Coherence
Intra-CMP protocol for coherence within a CMP
Inter-CMP protocol for coherence between CMPs
Interactions between the protocols increase complexity
• The state space explodes, especially without a bus
[Diagram: four CMPs on an interconnect, with intra-CMP coherence inside each chip and inter-CMP coherence between chips]

Hierarchical Coherence Example: Sun Wildfire
[Diagram: CPUs with caches on a snooping bus, plus a memory interface to the second level]
• First-level bus-based snooping protocol
• Second-level directory protocol
• The interface is key:
  • Accesses directory state
  • Asserts a "bus ignore" signal if necessary
  • Replays the bus request when the second level completes

Solution #1: DirectoryCMP
Two-level directory protocol for Multiple-CMP systems
• Arbitrary interconnect ordering on- and off-chip
• Non-nacking; safe states help resolve races
• The design of DirectoryCMP led to VHA
Advantages:
• Powerful, scalable, solid baseline
Disadvantages:
• Complex (~63 states at the interface), not model-checked
• Second-level indirections are slow without a directory cache

Improving Multiple-CMP Systems with Token Coherence
Token coherence allows Multiple-CMP systems to be:
• Flat for correctness (low complexity)
• Hierarchical for performance (fast)
[Diagram: four CMPs on an interconnect, with a flat correctness substrate spanning the system and a hierarchical performance protocol]

Solution #2: TokenCMP
Extend token coherence to Multiple-CMP systems
• Flat for correctness, hierarchical for performance
• Enables a model-checkable solution
Flat correctness:
• Global set of T tokens, passed to individual caches
• End-to-end token counting
• Keep the flat persistent-request scheme

TokenCMP Performance Policies
TokenCMPA:
• Two-level broadcast
• The L2 broadcasts off-chip on a miss
• A local cache responds if it has extra tokens
• Responses from off-chip carry extra tokens
TokenCMPB:
• On-chip broadcast on an L2 miss only (local indirection)
TokenCMPC: extra states for further filtering
TokenCMPA-PRED: persistent-request prediction

M-CMP Coherence: Summary of Results
System: four 4-core CMPs
Notable results:
• TokenCMP is 2-32% faster than DirectoryCMP with in-order cores
• TokenCMPA, TokenCMPB, and TokenCMPC all perform similarly
• Persistent-request prediction greatly helps Zeus
• TokenCMP gains diminish with out-of-order cores
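As a closing backup note, here is a highly simplified sketch of the token collection shown in the RING-ORDER example a few slides back. It is my own simplification with an assumed node count and token placement; the real protocol also manages the priority token and tokens that are in flight. The same all-tokens-to-write invariant underlies TokenCMP's flat correctness.

```python
# Highly simplified: a store's getM walks the ring once; every node holding
# tokens for the block hands them over, and the requestor records the furthest
# supplier (FurthestDest) so it knows when the last tokens have arrived.

NUM_NODES = 12                          # assumed 12-node ring

def collect_tokens_for_store(requestor, token_holders, total_tokens):
    """Walk the ring once; return (tokens gathered, furthest supplier)."""
    gathered = token_holders.pop(requestor, 0)   # tokens already held locally
    furthest = None
    for hop in range(1, NUM_NODES):
        node = (requestor + hop) % NUM_NODES
        if node in token_holders:
            gathered += token_holders.pop(node)
            furthest = node                      # furthest supplier so far
    assert gathered == total_tokens, "store cannot complete yet"
    return gathered, furthest

holders = {5: 4, 8: 8}                  # the block's 12 tokens, spread out
print(collect_tokens_for_store(2, holders, total_tokens=12))  # (12, 8)
```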