Cache Coherence Techniques for Multicore Processors
Dissertation Defense
Mike Marty, 12/19/2007

Key Contributions
Trend: Multicore ring interconnects emerging
Challenge: Order of ring != order of bus
Contribution: New protocol that exploits ring order

Trend: Multicore now the basic building block
Challenge: Hierarchical coherence for Multiple-CMP systems is complex
Contribution: DirectoryCMP and TokenCMP

Trend: Workload consolidation with space sharing
Challenge: Physical hierarchies often do not match workloads
Contribution: Virtual Hierarchies

Outline
Introduction and Motivation
• Multicore Trends
Virtual Hierarchies
• Focus of presentation
Multiple-CMP Coherence
Ring-based Coherence
Conclusion

Is SMP + On-chip Integration == Multicore?
[Diagram: a traditional SMP design mapped onto one chip: cores P0-P3 with private caches, a shared bus, and a memory controller]

Multicore Trends
Trend: On-chip interconnect
• Competes for the same resources as cores and caches
• Ring is an emerging multicore interconnect

Trend: Latency/bandwidth tradeoffs
• Increasing on-chip wire delay and memory latency
• Coherence protocol interacts with the shared-cache hierarchy

Trend: Multicore is the basic building block
• Multiple-CMP systems instead of SMPs
• Hierarchical systems required

Trend: Workload consolidation with space sharing
• More cores enable more workload consolidation
• Space sharing instead of time sharing
• Opportunities to optimize caching and coherence
[Diagram: one multicore chip space-shared among VM 1, VM 2, and VM 3]

Outline
Introduction and Motivation
Virtual Hierarchies [ISCA 2007, IEEE Micro Top Pick 2008]
• Focus of presentation
Multiple-CMP Coherence
Ring-based Coherence
Conclusion

Virtual Hierarchy Motivations
[Diagram: APP 1 through APP 4 space-sharing one chip]
• Space sharing
• Server (workload) consolidation
• Tiled architectures

Motivation: Server Consolidation
[Diagram sequence: a 64-core tiled CMP (each tile: core, L1, L2 cache bank) space-shared among consolidated workloads: a www server, database servers #1 and #2, and middleware servers]
The consolidation scenario motivates four goals:
• Optimize performance: keep each workload's data close to the tiles it runs on
• Isolate performance: one workload should not interfere with another
• Dynamic partitioning: the assignment of workloads to tiles changes over time
• Inter-VM sharing: e.g., VMware's content-based page sharing, up to 60% reduced memory

Outline
Introduction and Motivation
Virtual Hierarchies
• Expanded Motivation
• Non-hierarchical approaches
• Proposed Virtual Hierarchies
• Evaluation
• Related Work
Ring-based and Multiple-CMP Coherence
Conclusion

Tiled Architecture Memory System
[Diagram: tiled CMP with a core, L1, and L2 cache bank per tile, plus memory controllers]
Global broadcast is too expensive.

TAG-DIRECTORY
[Diagram: a tile reading block A sends its request (1) to a central duplicate-tag directory, which forwards it (2) to the tile holding A; that tile responds with data (3)]
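To make this baseline concrete, here is a minimal sketch of the three-hop indirection a TAG-DIRECTORY-style protocol performs; the class, message strings, and addresses are my own illustration, not the dissertation's implementation.

```python
# Minimal sketch of TAG-DIRECTORY-style indirection (illustrative only).
# The directory tracks, per block, which tile currently holds it; a miss
# goes to the directory (hop 1), is forwarded to the holder (hop 2), and
# the holder supplies data to the requestor (hop 3).

class DuplicateTagDirectory:
    def __init__(self):
        self.holder = {}                 # block address -> tile id

    def handle_getM(self, block, requestor):
        prev = self.holder.get(block)
        self.holder[block] = requestor   # requestor becomes the new holder
        if prev is None:
            return f"memory supplies block {hex(block)} to tile {requestor}"
        return (f"fwd to tile {prev}; "
                f"tile {prev} sends block {hex(block)} to tile {requestor}")

directory = DuplicateTagDirectory()
print(directory.handle_getM(0x4000, requestor=3))   # cold miss: memory responds
print(directory.handle_getM(0x4000, requestor=9))   # 3-hop cache-to-cache transfer
```

STATIC-BANK-DIRECTORY, next, replaces the central structure with home banks selected by interleaving on the block address.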
STATIC-BANK-DIRECTORY
[Diagram: a tile reading block A sends its request (1) to the statically mapped home bank for A, which forwards it (2) to the tile holding A; that tile responds with data (3)]

STATIC-BANK-DIRECTORY with hypervisor-managed cache
[Diagram: the same request flow, but with the hypervisor managing the address-to-bank mapping so that block A's home bank falls within the requesting workload's own tiles]

Goals
                              {STATIC-BANK, TAG}-DIRECTORY    STATIC-BANK-DIRECTORY w/ hypervisor-managed cache
Optimize Performance          No                              Yes
Isolate Performance           No                              Yes
Allow Dynamic Partitioning    Yes                             ?
Support Inter-VM Sharing      Yes                             Yes
Hypervisor/OS Simplicity      Yes                             No

Outline (recap): Proposed Virtual Hierarchies

Virtual Hierarchies
Key Idea: Overlay a two-level cache and coherence hierarchy
- First level harmonizes with the VM/workload
- Second level allows inter-VM sharing, migration, and reconfiguration

VH: First-Level Protocol
Goals:
• Exploit locality from space affinity
• Isolate resources
Strategy: Directory protocol
• Interleave directories across first-level tiles
• Store the L2 block at the first-level directory tile
Questions:
• How to name directories?
• How to name sharers?

VH: Naming First-level Directory
Select the Dynamic Home Tile with a VM Config Table
• Hardware VM Config Table at each tile
• Set by the hypervisor during scheduling
Example: the address bits above the 6-bit block offset (…000101) index the per-tile VM Config Table; entries 0 through 5 map to p12, p13, p14, p12, p13, p14 and entry 63 maps to p12, so index 5 selects Dynamic Home Tile p14. (A code sketch of this lookup appears a few slides below.)

VH: Dynamic Home Tile Actions
The Dynamic Home Tile either:
• Returns data cached at its L2 bank
• Generates forwards/invalidates
• Issues a second-level request
Stable first-level states (a subset):
• Typical: M, E, S, I
• Atypical:
  ILX: L2 Invalid, points to exclusive tile
  SLS: L2 Shared, other tiles share
  SLSX: L2 Shared, other tiles share, exclusive to first level

VH: Naming First-level Sharers
Any tile can share the block.
Solution: full bit-vector
• 64 bits for a 64-tile system
• Names multiple sharers or a single exclusive tile
Alternatives:
• First-level broadcast
• (Dynamic) coarse granularity

Virtual Hierarchies
Two solutions for global coherence: VHA and VHB
[Diagram: per-VM first-level regions with a global second level backed by the memory controller(s)]

Protocol VHA
Directory as the second-level protocol
• Any tile can act as a first-level directory
• How to track and name first-level directories?
Full bit-vector of sharers to name any tile
• State stored in DRAM
• Possibly cached on-chip
+ Maximum scalability and message efficiency
- DRAM state (~12.5% overhead)

VHA Example
[Diagram: (1) a tile sends getM A to its dynamic home tile, (2) the home tile issues a second-level getM A to the directory at the memory controller, (3)-(4) forwards travel to the remote first-level directory and on to the tile holding A, (5)-(6) data returns to the requestor]

VHA: Handling Races
Blocking directories
• Handle races within the same protocol
• Require a blocking buffer plus wakeup/replay logic
[Diagram: a directory blocked on getM A while another getM A waits]
Inter-intra races
• Naïve blocking leads to deadlock!
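Stepping back for a moment, here is the promised sketch of the dynamic-home-tile lookup from the Naming First-level Directory slide. It is my own illustration: the field widths and the hypervisor's tile assignment are assumptions chosen to match the slide's example, and the real table is a small hardware structure written at VM scheduling time.

```python
# Minimal sketch of the dynamic-home-tile lookup (illustrative only).
# Address bits above the 64-byte block offset index a per-tile VM Config
# Table whose entries name only tiles belonging to the requestor's VM.

BLOCK_OFFSET_BITS = 6          # 64-byte blocks, as in the slide's example
TABLE_ENTRIES = 64             # one entry per index, written by the hypervisor

def dynamic_home_tile(addr, vm_config_table):
    index = (addr >> BLOCK_OFFSET_BITS) % TABLE_ENTRIES
    return vm_config_table[index]

# Assumed mapping: the hypervisor has scheduled this VM onto tiles 12, 13,
# and 14, so every table entry names one of those three tiles.
vm_config_table = [(12, 13, 14)[i % 3] for i in range(TABLE_ENTRIES)]

print(dynamic_home_tile(0x000140, vm_config_table))  # index 5 -> tile 14
```

Because every entry names a tile inside the VM's partition, first-level coherence traffic never leaves that partition.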
VHA: Handling Races (cont.)
[Diagram: two first-level directories, each blocked on a getM A, with a second-level FWD A stuck behind a blocked directory]
Possible solution:
• Always handle the second-level message at the first level
• But this causes an explosion of the state space
The second level may interrupt first-level actions:
• First-level indirections, invalidations, writebacks

VHA: Handling Races (cont.)
Reduce the state-space explosion with Safe States:
• A subset of the transient states
• Immediately handle the second-level message
• Limit concurrency between the protocols
Algorithm:
• Level-one requests either complete, or enter a safe state before issuing a level-two request
• Level-one directories handle level-two forwards once a safe state is reached (they may stall)
• Level-two requests are eventually handled by the level-two directory
• Completion messages unblock directories

Virtual Hierarchies
Two solutions for global coherence: VHA and VHB
[Diagram: per-VM first-level regions with a global second level backed by the memory controller(s)]

Protocol VHB
Broadcast as the second-level protocol
• Broadcast locates first-level directory tiles
• The memory controller tracks the outstanding second-level requestor
Attach a token count to each block
• T tokens per block: one token to read, all T to write
• Allows 1 bit at memory per block
• Eliminates system-wide ACK responses
(A code sketch of this token-counting rule follows a few slides below.)

Protocol VHB: Token Coalescing
Memory logically holds all or none of a block's tokens:
• Enables the 1-bit token count
A replacing tile sends its tokens to the memory controller:
• The message usually contains all tokens
Process:
• Tokens are held in a Token Holding Buffer (THB)
• A FIND broadcast is initiated to locate other first-level directories with tokens
• First-level directories respond to the THB and send their tokens
• Repeat on a race

VHB Example
[Diagram: (1) a tile sends getM A to its dynamic home tile, (2) the home tile issues a global getM A broadcast, (3) the memory controller responds with data plus tokens, (4)-(5) the broadcast also reaches the first-level directory holding A, which forwards it so the remaining tokens reach the requestor]

Goals
                              {DRAM, STATIC-BANK, TAG}-DIRECTORY    STATIC-BANK-DIRECTORY w/ hypervisor-managed cache    Virtual Hierarchies: VHA and VHB
Optimize Performance          No                                    Yes                                                  Yes
Isolate Performance           No                                    Yes                                                  Yes
Allow Dynamic Partitioning    Yes                                   ?                                                    Yes
Support Inter-VM Sharing      Yes                                   Yes                                                  Yes
Hypervisor/OS Simplicity      Yes                                   No                                                   Yes

VHNULL
Are two levels really necessary? VHNULL: first level only
Implications:
• Many OS modifications for a single-OS environment
• Dynamic partitioning requires cache flushes
• Inter-VM sharing is difficult
• Hypervisor complexity increases
• Requires atomic updates of VM Config Tables
• Limits optimized placement policies

VH: Capacity/Latency Trade-off
Maximize capacity:
• Store the only L2 copy at the dynamic home tile
• But L2 access time is penalized
• Especially for large VMs
Minimize L2 access latency/bandwidth:
• Replicate data in the local L2 slice
• Selective/adaptive replication is well studied: ASR [Beckmann et al.], CC [Chang et al.]
• But the dynamic home tile is still needed for the first level
Can we exploit the virtual hierarchy for placement?
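Before turning to placement, here is the token-counting sketch promised in the VHB slide. It is illustrative only: the token total and the bookkeeping structure are my assumptions, and the real protocol also handles coalescing through the THB and the FIND broadcast.

```python
# Minimal sketch of VHB's token rule: T tokens per block, one token needed
# to read, all T needed to write.  Because memory logically holds either all
# tokens or none, its token count fits in a single bit per block.

T = 64  # assumed here: one token per tile in a 64-tile system

class BlockTokens:
    def __init__(self):
        self.memory_has_all = True      # the 1-bit token count at memory
        self.tile_tokens = {}           # tile id -> tokens held on chip

    def can_read(self, tile):
        return self.tile_tokens.get(tile, 0) >= 1

    def can_write(self, tile):
        return self.tile_tokens.get(tile, 0) == T

    def memory_supplies(self, tile):
        # Memory responds only when it holds all tokens; afterwards it has none.
        assert self.memory_has_all
        self.memory_has_all = False
        self.tile_tokens[tile] = T

b = BlockTokens()
b.memory_supplies(tile=3)
print(b.can_write(3), b.can_read(3))   # True True: tile 3 holds all T tokens
```

The single memory_has_all flag is exactly the 1-bit-per-block state the slide refers to: memory either owns every token for a block or none of them.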
VH: Data Placement Optimization Policy
Data from memory is placed in the tile's local L2 bank
• No tag is allocated at the dynamic home tile
Use second-level coherence on the first sharing miss
• Then allocate a tag at the dynamic home tile for future sharing misses
Benefits:
• Private data allocates in the tile's local L2 bank
• The overhead of replicating data is reduced
• Fast, first-level sharing for widely shared data

Outline (recap): Evaluation

VH Evaluation Methods
Wisconsin GEMS
Target system: 64-core tiled CMP
• In-order SPARC cores
• 1 MB, 16-way L2 cache per tile, 10-cycle access
• 2D mesh interconnect, 16-byte links, 5-cycle link latency
• Eight on-chip memory controllers, 275-cycle DRAM latency

VH Evaluation: Simulating Consolidation
Challenge: bring-up of consolidated workloads
Solution: approximate virtualization
• Combine existing Simics checkpoints: a script turns single-workload checkpoints (e.g., an 8p checkpoint with Memory0, P0-P7, PCI0, DISK0) into one 64p checkpoint (P0-P63) with per-VM copies (VM0_Memory0, VM0_PCI0, VM0_DISK0, VM1_Memory0, VM1_PCI0, VM1_DISK0, …)

VH Evaluation: Simulating Consolidation
At simulation time, Ruby handles the mapping:
• Converts <Processor ID, 32-bit address> pairs into 36-bit physical addresses
• Schedules VMs onto adjacent cores by steering Simics requests to the appropriate L1 controllers
• Memory controllers are evenly interleaved
Bottom line:
• Static scheduling
• No hypervisor execution simulated
• No content-based page sharing
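A minimal sketch of the address flattening just described. The 8-cores-per-VM mapping and the 4 GB-per-checkpoint size are my assumptions; the slide only states that a <Processor ID, 32-bit address> pair becomes a 36-bit address.

```python
# Illustrative only: flatten a per-VM <processor id, 32-bit address> pair into
# one 36-bit physical address by stacking each VM's memory image into its own
# 4 GB slice, the way a consolidated Ruby target could lay them out.

CORES_PER_VM = 8
VM_ADDRESS_BITS = 32            # each checkpoint sees a 32-bit address space

def to_physical_36(processor_id, vm_address):
    vm_id = processor_id // CORES_PER_VM          # static VM-to-core mapping
    return (vm_id << VM_ADDRESS_BITS) | vm_address

# Processor 13 belongs to VM 1, so its addresses land in the second 4 GB slice.
print(hex(to_physical_36(13, 0x0001_2340)))       # 0x100012340
```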
VH Evaluation: Workloads
OLTP, SpecJBB, Apache, Zeus
• Separate instance of Solaris for each VM
Homogeneous consolidation
• Simulate the same-size workload N times
• Unit of work identical across all workloads
• (Each workload staggered by 1,000,000+ instructions)
Heterogeneous consolidation
• Simulate different-size, different workloads
• Report cycles-per-transaction for each workload

VH Evaluation: Baseline Protocols
DRAM-DIRECTORY:
• 1 MB directory cache per controller
• Each tile nominally private, but replication limited
TAG-DIRECTORY:
• 3-cycle central tag directory (1024 ways), non-pipelined
• Replication limited
STATIC-BANK-DIRECTORY:
• Home tiles interleave by frame address
• Home tile stores the only L2 copy

VH Evaluation: VHA and VHB Protocols
VHA
• Based on the DirectoryCMP implementation
• Dynamic home tile stores the only L2 copy
VHB with optimizations
• Private-data placement optimization policy (shared data is stored at the home tile, private data is not)
• Can violate inclusiveness (evict an L2 tag with sharers)
• Memory data returned directly to the requestor

Micro-benchmark: Sharing Latency
[Line chart: average sharing latency in cycles (0-120) versus processors per VM (0-64) for Dram-Dir, Static-Bank-Dir, Tag-Dir, VH_A, and VH_B]

Result: Runtime for 8x8p Homogeneous Consolidation
[Bar chart: normalized runtime (0-1.2) for OLTP, Apache, Zeus, and SpecJBB under each protocol]

Result: Memory Stall Cycles for 8x8p Homogeneous Consolidation
[Stacked bar chart: normalized memory stall cycles (0-1.2) for OLTP, Apache, Zeus, and SpecJBB, broken into Off-chip, Local L2, Remote L1, and Remote L2 components]

Result: Runtime for 16x4p Homogeneous Consolidation
[Bar chart: normalized runtime (0-1.2) for OLTP, Apache, Zeus, and SpecJBB under each protocol]

Result: Runtime for 4x16p Homogeneous Consolidation
[Bar chart: normalized runtime (0-1.2) for OLTP, Apache, Zeus, and SpecJBB under each protocol]

Result: Heterogeneous Consolidation, mixed1 configuration
[Bar chart: normalized cycles-per-transaction (CPT, 0-1.2) for Dram-Dir, Static-Bank-Dir, Tag-Dir, VH_A, and VH_B across VM0: Apache, VM1: Apache, VM2: OLTP, VM3: OLTP, VM4: JBB, VM5: JBB, VM6: JBB]

Result: Heterogeneous Consolidation, mixed2 configuration
[Bar chart: normalized cycles-per-transaction (CPT, 0-1.2) for Dram-Dir, Static-Bank-Dir, Tag-Dir, VH_A, and VH_B across VM0: Apache, VM1: Apache, VM2: Apache, VM3: Apache, VM4: OLTP, VM5: OLTP, VM6: OLTP, VM7: OLTP]

Effect of Replication
Treat each tile's L2 bank as private:

                     Apache 8x8p   OLTP 8x8p   Zeus 8x8p   JBB 8x8p
DRAM-DIR             19.7%         14.4%       9.29%       0.06%
STATIC-BANK-DIR      -33.0%        3.31%       -7.02%      -11.2%
TAG-DIR              1.27%         3.91%       1.63%       -0.22%
VHA                  n/a           n/a         n/a         n/a
VHB                  -11.0%        -5.22%      -0.98%      -0.12%

Outline (recap): Related Work

Virtual Hierarchies: Related Work
Commercial systems usually support partitioning
Sun (Starfire and others)
• Physical partitioning
• No coherence between partitions
IBM's LPAR
• Logical partitions, time-slicing of processors
• Global coherence, but does not optimize space sharing

Virtual Hierarchies: Related Work
Systems approaches to space affinity:
• Cellular Disco; managing L2 via the OS [Cho et al.]
Shared L2 cache partitioning
• Way-based, replacement-based
• Molecular Caches (~VHNULL)
Cache organization and replication
• D-NUCA, NuRapid, Cooperative Caching, ASR
Quality of service
• Virtual Private Caches [Nesbit et al.]
• More

Virtual Hierarchies: Related Work
Coherence protocol implementations
• Token coherence with multicast
• Multicast snooping
Two-level directory
• Compaq Piranha
• Pruning caches [Scott et al.]

Summary: Virtual Hierarchies
Contribution: the virtual hierarchy idea
• An alternative to physical, hard-wired hierarchies
• Optimized for space sharing and workload consolidation
Contribution: VHA and VHB
• Two-level virtual hierarchy implementations
Published in ISCA 2007 and selected as a 2008 IEEE Micro Top Pick

Outline
Introduction and Motivation
Virtual Hierarchies
Ring-based Coherence [MICRO 2006]
• Skip, 5-minute version, or 15-minute version?
Multiple-CMP Coherence [HPCA 2005]
• Skip, 5-minute version, or 15-minute version?
Conclusion
Contribution: Ring-based Coherence
Problem: order of bus != order of ring
• Cannot apply bus-based snooping protocols
Existing solutions
• Use unbounded retries to handle contention
• Use a performance-costly ordering point
Contribution: RING-ORDER
• Exploits the round-robin order of the ring
• Fast and stable performance
Appears in MICRO 2006

Contribution: Multiple-CMP Coherence
Hierarchy is now the default, and it increases complexity
• Most prior hierarchical protocols use bus-based nodes
Contribution: DirectoryCMP
• Two-level directory protocol
Contribution: TokenCMP
• Extends token coherence to Multiple-CMP systems
• Flat for correctness, hierarchical for performance
Appears in HPCA 2005

Other Research and Contributions
Wisconsin GEMS
• ISCA '05 tutorial, CMP development, release, support
Amdahl's Law in the Multicore Era
• Mark D. Hill and Michael R. Marty, to appear in IEEE Computer
ASR: Adaptive Selective Replication for CMP Caches
• Beckmann et al., MICRO 2006
LogTM-SE: Decoupling Hardware Transactional Memory from Caches
• Yen et al., HPCA 2007

Key Contributions
Trend: Multicore ring interconnects emerging
Challenge: Order of ring != order of bus
Contribution: New protocol that exploits ring order
Trend: Multicore now the basic building block
Challenge: Hierarchical coherence for Multiple-CMP systems is complex
Contribution: DirectoryCMP and TokenCMP
Trend: Workload consolidation with space sharing
Challenge: Physical hierarchies often do not match workloads
Contribution: Virtual Hierarchies

Backup Slides

What about Physical Hierarchy / Clusters?
[Diagram: 16 cores grouped into four clusters; each cluster's four cores have private L1 caches and share an L2]

Physical Hierarchy / Clusters
[Diagram: the same clustered CMP with consolidated workloads (www server, database server #1, middleware server #1) spanning cluster boundaries]
• Interference between workloads in shared caches
• Lots of prior work on partitioning a single shared L2 cache

Protocol VHNULL
Example: steps for VM migration from tiles {M} to tiles {N}
1. Stop all threads on {M}
2. Flush {M} caches
3. Update {N} VM Config Tables
4. Start threads on {N}

Protocol VHNULL
Example: inter-VM content-based page sharing
• Is read-only sharing possible with VHNULL?
VMware's implementation:
• A global hash table stores hashes of pages
• Guest pages are scanned by the VMM and hashes computed
• Full comparison of pages on a hash match
Potential VHNULL implementation:
• How does the hypervisor scan guest pages? Are they modified in a cache?
• Even read-only pages must initially be written at some point

5-minute Ring Coherence

Ring Interconnect
Why?
• Short, fast point-to-point links
• Fewer (data) ports
• Less complex than packet-switched
• Simple, distributed arbitration
• Exploitable ordering for coherence

Cache Coherence for a Ring
• Ring is a broadcast medium and offers ordering
• Apply existing bus-based snooping protocols? NO!
• The ordering properties of a ring are different

Ring Order != Bus Order
[Diagram: a ring with P12, P9, P6, and P3; two requests A and B circulate, with A shown near P9 and B near P3; P12 observes the order {A, B} while P6 observes {B, A}]
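A small illustration of this point, using my own example; the injection points and the ring direction are assumptions chosen to match the picture. On a unidirectional ring, two nodes can observe a pair of concurrent requests in opposite orders, which a bus can never do.

```python
# Illustrative only: two requests injected simultaneously at nodes 3 and 9 of
# a unidirectional (clockwise) ring are observed in opposite orders by nodes
# 12 and 6, so no bus-like total order exists.

RING = [3, 6, 9, 12]        # node ids in clockwise order (assumed layout)

def hops(src, dst):
    """Clockwise hops a message takes from src to dst."""
    return (RING.index(dst) - RING.index(src)) % len(RING)

def arrival_order(observer, sources):
    """Order in which simultaneously injected requests reach the observer."""
    return sorted(sources, key=lambda src: hops(src, observer))

print(arrival_order(12, [3, 9]))   # [9, 3]: P12 sees P9's request first
print(arrival_order(6,  [3, 9]))   # [3, 9]: P6 sees P3's request first
```

A bus-based snooping protocol relies on every cache seeing requests in one total order, which this little example shows a ring does not provide by itself.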
Ring-based Coherence
Existing solutions:
1. ORDERING-POINT
   • Establishes a total order
   • Extra latency and control-message overhead
2. GREEDY-ORDER
   • Fast in the common case
   • Unbounded retries
Ideal solution
• Fast in the average case
• Stable in the worst case (no retries)

New Approach: RING-ORDER
+ Requests complete in order of ring position
  • Fully exploits the ring ordering
+ Initial requests always succeed
  • No retries, no ordering point
  • Fast, stable, predictable performance
Key: use token counting
• All tokens to write, one token to read

RING-ORDER Example
[Diagram sequence on a 12-node ring (P1-P12), with a legend distinguishing ordinary tokens from the priority token: getM requests circulate for competing stores, a requestor records FurthestDest = P9, tokens flow around the ring, and the stores complete in ring order]

Ring-based Coherence: Results Summary
System: 8 cores with private L2s and a shared L3
Key results:
• RING-ORDER outperforms ORDERING-POINT by 7-86% with in-order cores
• RING-ORDER offers similar, or slightly better, performance than GREEDY-ORDER
• Pathological starvation did occur with GREEDY-ORDER

5-minute Multiple-CMP Coherence

Problem: Hierarchical Coherence
Intra-CMP protocol for coherence within a CMP
Inter-CMP protocol for coherence between CMPs
Interactions between the protocols increase complexity
• The state space explodes, especially without a bus
[Diagram: four CMPs on an interconnect, with intra-CMP coherence inside each chip and inter-CMP coherence between chips]

Hierarchical Coherence Example: Sun Wildfire
[Diagram: CPUs with caches on a snooping bus, plus a memory interface to the second level]
• First-level bus-based snooping protocol
• Second-level directory protocol
• The interface is key:
  • Accesses directory state
  • Asserts a "bus ignore" signal if necessary
  • Replays the bus request when the second level completes

Solution #1: DirectoryCMP
Two-level directory protocol for Multiple-CMP systems
• Arbitrary interconnect ordering on- and off-chip
• Non-nacking; safe states help resolve races
• The design of DirectoryCMP led to VHA
Advantages:
• Powerful, scalable, solid baseline
Disadvantages:
• Complex (~63 states at the interface), not model-checked
• Second-level indirections are slow without a directory cache

Improving Multiple-CMP Systems with Token Coherence
Token coherence allows Multiple-CMP systems to be:
• Flat for correctness (low complexity)
• Hierarchical for performance (fast)
[Diagram: four CMPs on an interconnect, with a flat correctness substrate spanning the system and a hierarchical performance protocol]

Solution #2: TokenCMP
Extend token coherence to Multiple-CMP systems
• Flat for correctness, hierarchical for performance
• Enables a model-checkable solution
Flat correctness:
• Global set of T tokens, passed to individual caches
• End-to-end token counting
• Keep the flat persistent-request scheme

TokenCMP Performance Policies
TokenCMPA:
• Two-level broadcast
• The L2 broadcasts off-chip on a miss
• A local cache responds if it has extra tokens
• Responses from off-chip carry extra tokens
TokenCMPB:
• On-chip broadcast on an L2 miss only (local indirection)
TokenCMPC: extra states for further filtering
TokenCMPA-PRED: persistent-request prediction

M-CMP Coherence: Summary of Results
System: four 4-core CMPs
Notable results:
• TokenCMP is 2-32% faster than DirectoryCMP with in-order cores
• TokenCMPA, TokenCMPB, and TokenCMPC all perform similarly
• Persistent-request prediction greatly helps Zeus
• TokenCMP gains diminish with out-of-order cores
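As a closing backup note, here is a highly simplified sketch of the token collection shown in the RING-ORDER example a few slides back. It is my own simplification with an assumed node count and token placement; the real protocol also manages the priority token and tokens that are in flight. The same all-tokens-to-write invariant underlies TokenCMP's flat correctness.

```python
# Highly simplified: a store's getM walks the ring once; every node holding
# tokens for the block hands them over, and the requestor records the furthest
# supplier (FurthestDest) so it knows when the last tokens have arrived.

NUM_NODES = 12                          # assumed 12-node ring

def collect_tokens_for_store(requestor, token_holders, total_tokens):
    """Walk the ring once; return (tokens gathered, furthest supplier)."""
    gathered = token_holders.pop(requestor, 0)   # tokens already held locally
    furthest = None
    for hop in range(1, NUM_NODES):
        node = (requestor + hop) % NUM_NODES
        if node in token_holders:
            gathered += token_holders.pop(node)
            furthest = node                      # furthest supplier so far
    assert gathered == total_tokens, "store cannot complete yet"
    return gathered, furthest

holders = {5: 4, 8: 8}                  # the block's 12 tokens, spread out
print(collect_tokens_for_store(2, holders, total_tokens=12))  # (12, 8)
```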