18-742 Parallel Computer Architecture
Lecture 11: Caching in Multi-Core Systems
Prof. Onur Mutlu and Gennady Pekhimenko
Carnegie Mellon University
Fall 2012, 10/01/2012

Review: Multi-core Issues in Caching
• How does the cache hierarchy change in a multi-core system?
• Private cache: cache belongs to one core
• Shared cache: cache is shared by multiple cores
[Figure: four cores each with a private L2 cache vs. four cores sharing a single L2 cache, both backed by a DRAM memory controller]

Outline
• Multi-cores and Caching: Review
• Utility-based partitioning
• Cache compression
  – Frequent value
  – Frequent pattern
  – Base-Delta-Immediate
• Main memory compression
  – IBM MXT
  – Linearly Compressed Pages (LCP)

Review: Shared Caches Between Cores
• Advantages:
  – Dynamic partitioning of available cache space
    • No fragmentation due to static partitioning
  – Easier to maintain coherence
  – Shared data and locks do not ping-pong between caches
• Disadvantages:
  – Cores incur conflict misses due to other cores' accesses
    • Misses due to inter-core interference
    • Some cores can destroy the hit rate of other cores
      – What kind of access patterns could cause this?
  – Guaranteeing a minimum level of service (or fairness) to each core is harder (how much space, how much bandwidth?)
  – High bandwidth is harder to obtain (N cores → N ports?)

Shared Caches: How to Share?
• Free-for-all sharing
  – Placement/replacement policies are the same as in a single-core system (usually LRU or pseudo-LRU)
  – Not thread/application aware
  – An incoming block evicts a block regardless of which threads the blocks belong to
• Problems
  – A cache-unfriendly application can destroy the performance of a cache-friendly application
  – Not all applications benefit equally from the same amount of cache: free-for-all might prioritize those that do not benefit
  – Reduced performance, reduced fairness

Problem with Shared Caches
[Figure sequence: thread t1 running alone on Core 1 fills the shared L2; thread t2 running alone on Core 2 fills the shared L2; when t1 and t2 run together, t1 occupies most of the shared L2.]
t2's throughput is significantly reduced due to unfair cache sharing.

Controlled Cache Sharing
• Utility-based cache partitioning
  – Qureshi and Patt, "Utility-Based Cache Partitioning: A Low-Overhead, High-Performance, Runtime Mechanism to Partition Shared Caches," MICRO 2006.
  – Suh et al., "A New Memory Monitoring Scheme for Memory-Aware Scheduling and Partitioning," HPCA 2002.
• Fair cache partitioning
  – Kim et al., "Fair Cache Sharing and Partitioning in a Chip Multiprocessor Architecture," PACT 2004.
• Shared/private mixed cache mechanisms
  – Qureshi, "Adaptive Spill-Receive for Robust High-Performance Caching in CMPs," HPCA 2009.

Utility-Based Shared Cache Partitioning
• Goal: maximize system throughput
• Observation: not all threads/applications benefit equally from caching → simple LRU replacement is not good for system throughput
• Idea: allocate more cache space to applications that obtain the most benefit from more space
• The high-level idea can be applied to other shared resources as well.
• Qureshi and Patt, "Utility-Based Cache Partitioning: A Low-Overhead, High-Performance, Runtime Mechanism to Partition Shared Caches," MICRO 2006.
• Suh et al., "A New Memory Monitoring Scheme for Memory-Aware Scheduling and Partitioning," HPCA 2002.

Utility-Based Cache Partitioning (I)
• Utility U(a, b) = misses with a ways – misses with b ways (the benefit of growing an application's allocation from a to b ways)
[Figure: misses per 1000 instructions vs. number of ways allocated from a 16-way 1MB L2, illustrating applications with low utility, high utility, and saturating utility]

Utility-Based Cache Partitioning (II)
• Idea: give more cache to the application that benefits more from cache
[Figure: MPKI vs. allocated ways for equake and vpr, contrasting the UTIL partitioning point with the partition that LRU induces]

Utility-Based Cache Partitioning (III)
Three components:
  1. Utility monitors (UMON), one per core
  2. Partitioning algorithm (PA) — see the sketch below
  3. Replacement support to enforce partitions
[Figure: two cores with private I$/D$, each feeding a UMON; the UMONs drive the partitioning algorithm, which controls the shared L2 cache in front of main memory]
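As a concrete illustration of the partitioning algorithm, here is a minimal C sketch of greedy, marginal-utility way allocation. It assumes each core's UMON supplies misses[c][w], the miss count core c would incur with w ways; the actual UCP paper uses a more refined "lookahead" algorithm, so this is only an approximation of the idea (function and array names are illustrative).

```c
#define NUM_CORES 2
#define NUM_WAYS  16

/* Greedily hand out ways one at a time to the core whose miss count
 * drops the most from receiving one more way (highest marginal utility).
 * misses[c][w] = misses core c would incur with w ways (from its UMON). */
void partition_ways(const unsigned misses[NUM_CORES][NUM_WAYS + 1],
                    unsigned alloc[NUM_CORES])
{
    for (int c = 0; c < NUM_CORES; c++)
        alloc[c] = 0;

    for (int w = 0; w < NUM_WAYS; w++) {
        int best_core = 0;
        long best_gain = -1;
        for (int c = 0; c < NUM_CORES; c++) {
            /* Utility of one extra way: U = misses(a) - misses(a + 1). */
            long gain = (long)misses[c][alloc[c]] - (long)misses[c][alloc[c] + 1];
            if (gain > best_gain) {
                best_gain = gain;
                best_core = c;
            }
        }
        alloc[best_core]++;
    }
}
```

The resulting alloc[] is then enforced by the replacement support: roughly, the replacement logic keeps each core within its allocated number of ways per set.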
Cache Capacity
• How to get more cache without making it physically larger?
• Idea: data compression for on-chip caches

Base-Delta-Immediate Compression: Practical Data Compression for On-Chip Caches
Gennady Pekhimenko, Vivek Seshadri, Onur Mutlu, Todd C. Mowry, Phillip B. Gibbons*, Michael A. Kozuch*

Executive Summary
• Off-chip memory latency is high
  – Large caches can help, but at significant cost
• Compressing data in the cache enables a larger cache at low cost
• Problem: decompression is on the execution critical path
• Goal: design a new compression scheme that has (1) low decompression latency, (2) low cost, and (3) high compression ratio
• Observation: many cache lines have low-dynamic-range data
• Key Idea: encode cache lines as a base plus multiple differences
• Solution: Base-Delta-Immediate compression with low decompression latency and high compression ratio
  – Outperforms three state-of-the-art compression mechanisms

Motivation for Cache Compression
• Significant redundancy in data:
  0x00000000 0x0000000B 0x00000003 0x00000004 …
• How can we exploit this redundancy?
  – Cache compression helps
  – Provides the effect of a larger cache without making it physically larger

Background on Cache Compression
[Figure: the CPU and L1 cache hold uncompressed data; the L2 cache holds compressed data, so an L2 hit requires decompression on the way back to the L1]
• Key requirements:
  – Fast (low decompression latency)
  – Simple (avoid complex hardware changes)
  – Effective (good compression ratio)

Zero Value Compression
• Idea: specialized handling for zero data (a very common value)
• Advantages:
  – Low decompression latency
  – Low complexity
• Disadvantages:
  – Low average compression ratio

Frequent Value Compression
• Idea: encode cache lines based on frequently occurring values
• Advantages:
  – Good compression ratio
• Disadvantages:
  – Needs profiling
  – High decompression latency
  – High complexity

Frequent Pattern Compression
• Idea: encode cache lines based on frequently occurring patterns, e.g., a half-word that is zero (see the sketch below)
• Advantages:
  – Good compression ratio
• Disadvantages:
  – High decompression latency (5–8 cycles)
  – High complexity (for some designs)
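To make the pattern-based approach concrete, here is a simplified C sketch that classifies a single 32-bit word into a few FPC-style patterns. The pattern set shown is a representative subset chosen for this sketch, not the exact prefix encoding from the FPC paper.

```c
#include <stdint.h>

/* Simplified frequent-pattern-style classification of one 32-bit word. */
enum pattern {
    PAT_ZERO,        /* whole word is zero                      */
    PAT_SIGNED_BYTE, /* value fits in a sign-extended byte      */
    PAT_SIGNED_HALF, /* value fits in a sign-extended half-word */
    PAT_REP_BYTES,   /* one byte repeated four times            */
    PAT_UNCOMPRESSED /* no pattern matches; store the raw word  */
};

enum pattern classify_word(uint32_t w)
{
    int32_t  s = (int32_t)w;
    uint32_t b = w & 0xFF;

    if (w == 0)
        return PAT_ZERO;                       /* prefix only             */
    if (s >= -128 && s <= 127)
        return PAT_SIGNED_BYTE;                /* prefix + 1 data byte    */
    if (s >= -32768 && s <= 32767)
        return PAT_SIGNED_HALF;                /* prefix + 2 data bytes   */
    if (w == (b | (b << 8) | (b << 16) | (b << 24)))
        return PAT_REP_BYTES;                  /* prefix + 1 data byte    */
    return PAT_UNCOMPRESSED;                   /* prefix + 4 data bytes   */
}
```

Because each word's encoded size depends on which pattern it matches, locating and expanding word i requires examining the prefixes of all earlier words in the line, which is one source of FPC's multi-cycle decompression latency.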
Shortcomings of Prior Work
[Table: Zero, Frequent Value, and Frequent Pattern compression compared on decompression latency, complexity, and compression ratio; each prior mechanism falls short on at least one of these axes]
• Our proposal: BΔI

Outline
• Motivation & Background
• Key Idea & Our Mechanism
• Evaluation
• Conclusion

Key Data Patterns in Real Applications
• Zero values: initialization, sparse matrices, NULL pointers
  0x00000000 0x00000000 0x00000000 0x00000000 …
• Repeated values: common initial values, adjacent pixels
  0x000000FF 0x000000FF 0x000000FF 0x000000FF …
• Narrow values: small values stored in a big data type
  0x00000000 0x0000000B 0x00000003 0x00000004 …
• Other patterns: pointers to the same memory region
  0xC04039C0 0xC04039C8 0xC04039D0 0xC04039D8 …

How Common Are These Patterns?
[Figure: per-benchmark cache coverage (%) of zero, repeated-value, and other patterns for SPEC2006, database, and web workloads with a 2MB L2 cache; "Other Patterns" includes narrow values]
• On average, 43% of the cache lines belong to these key patterns

Key Data Patterns in Real Applications (revisited)
• All of these patterns are examples of Low Dynamic Range: the differences between values are significantly smaller than the values themselves
  – Zero values, repeated values, narrow values, and pointers into the same memory region all fit this description

Key Idea: Base+Delta (B+Δ) Encoding
• Example: a 32-byte uncompressed cache line of 4-byte values 0xC04039C0, 0xC04039C8, 0xC04039D0, …, 0xC04039F8 is stored as the 4-byte base 0xC04039C0 plus 1-byte deltas 0x00, 0x08, 0x10, …, 0x38
  – A 12-byte compressed cache line: 20 bytes saved
• Fast decompression: a vector addition
• Simple hardware: arithmetic and comparison
• Effective: good compression ratio

B+Δ: Compression Ratio
[Figure: per-benchmark compression ratio for SPEC2006, database, and web workloads with a 2MB L2 cache]
• Good average compression ratio (1.40)
• But some benchmarks have a low compression ratio

Can We Do Better?
• Uncompressible cache line (with a single base):
  0x00000000 0x09A40178 0x0000000B 0x09A4A838 …
• Key idea: use more bases, e.g., two instead of one
• Pro:
  – More cache lines can be compressed
• Cons:
  – Unclear how to find these bases efficiently
  – Higher overhead (due to additional bases)

B+Δ with Multiple Arbitrary Bases
[Figure: geometric-mean compression ratio as the number of bases varies from 1 to 16]
• 2 bases is the best option based on our evaluations

How to Find Two Bases Efficiently?
1. First base: the first element in the cache line (the Base+Delta part)
2. Second base: an implicit base of 0 (the Immediate part)
• Advantages over two arbitrary bases:
  – Better compression ratio
  – Simpler compression logic
→ Base-Delta-Immediate (BΔI) Compression

B+Δ (with two arbitrary bases) vs. BΔI
• Average compression ratio is close, but BΔI is simpler

BΔI Implementation
• Decompressor design
  – Low latency
• Compressor design
  – Low cost and complexity
• BΔI cache organization
  – Modest complexity

BΔI Decompressor Design
[Figure: the compressed line's base B0 is added to each delta Δ0–Δ3 in parallel (a vector addition), producing the uncompressed values V0–V3]

BΔI Compressor Design
[Figure: the 32-byte uncompressed line is fed in parallel to eight compression units — zero, repeated values, and six (base size, delta size) combinations: 8-byte base with 1-, 2-, or 4-byte deltas, 4-byte base with 1- or 2-byte deltas, and 2-byte base with 1-byte deltas. Each unit outputs a compression flag and compressed cache line (CFlag & CCL); selection logic picks the smallest successful encoding.]

BΔI Compression Unit: 8-byte B0, 1-byte Δ
[Figure: the 32-byte uncompressed line is viewed as 8-byte values V0–V3; the base B0 = V0 is subtracted from every value in parallel to form Δ0–Δ3, and each delta is checked against the 1-byte range. If every element is within range, the unit emits B0 plus the deltas ("Yes"); otherwise this encoding fails ("No").]
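The following C sketch illustrates one such unit in software terms, here the "4-byte base, 1-byte Δ" configuration that matches the earlier pointer example: the line's first element is the explicit base, zero is the implicit base, and a per-element bit records which base was used. The struct fields and mask encoding are illustrative choices for this sketch, not the paper's exact bit layout.

```c
#include <stdint.h>
#include <stdbool.h>

#define N_VALS 8   /* a 32-byte line viewed as eight 4-byte values */

struct bdi_line {
    uint32_t base;          /* explicit base (first element of the line) */
    uint8_t  delta[N_VALS]; /* 1-byte delta per element                  */
    uint8_t  use_zero;      /* bit i set: delta[i] is relative to 0      */
};

/* Returns true (and fills out) if every element is within a 1-byte
 * delta of either the explicit base or the implicit zero base. */
bool bdi_compress(const uint32_t v[N_VALS], struct bdi_line *out)
{
    out->base = v[0];
    out->use_zero = 0;
    for (int i = 0; i < N_VALS; i++) {
        int64_t d_base = (int64_t)v[i] - (int64_t)out->base;
        int64_t d_zero = (int64_t)v[i];
        if (d_base >= -128 && d_base <= 127) {
            out->delta[i] = (uint8_t)d_base;
        } else if (d_zero >= -128 && d_zero <= 127) {
            out->delta[i] = (uint8_t)d_zero;
            out->use_zero |= 1u << i;
        } else {
            return false;   /* this (base size, delta size) unit fails */
        }
    }
    return true;
}

/* Decompression is a single parallel add of each delta to its base. */
void bdi_decompress(const struct bdi_line *in, uint32_t v[N_VALS])
{
    for (int i = 0; i < N_VALS; i++) {
        uint32_t base = ((in->use_zero >> i) & 1) ? 0 : in->base;
        v[i] = base + (uint32_t)(int8_t)in->delta[i];
    }
}
```

Decompression reduces to the masked vector addition shown in the decompressor slide, which is what keeps BΔI's latency low.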
BΔI Cache Organization
[Figure: a conventional 2-way cache with 32-byte lines vs. the BΔI organization — a 4-way tag store over a data store of 8-byte segments (S0–S7 per set), with compression encoding bits (C) per tag]
• Twice as many tags; tags map to multiple adjacent data segments
• 2.3% area overhead for a 2MB cache

Qualitative Comparison with Prior Work
• Zero-based designs
  – ZCA [Dusser+, ICS'09]: zero-content augmented cache
  – ZVC [Islam+, PACT'09]: zero-value cancelling
  – Limited applicability (only zero values)
• FVC [Yang+, MICRO'00]: frequent value compression
  – High decompression latency and complexity
• Pattern-based compression designs
  – FPC [Alameldeen+, ISCA'04]: frequent pattern compression
    • High decompression latency (5 cycles) and complexity
  – C-Pack [Chen+, T-VLSI Systems'10]: practical implementation of an FPC-like algorithm
    • High decompression latency (8 cycles)

Outline
• Motivation & Background
• Key Idea & Our Mechanism
• Evaluation
• Conclusion

Methodology
• Simulator
  – x86 event-driven simulator based on Simics [Magnusson+, Computer'02]
• Workloads
  – SPEC2006 benchmarks, TPC, Apache web server
  – 1–4 core simulations for 1 billion representative instructions
• System parameters
  – L1/L2/L3 cache latencies from CACTI [Thoziyoor+, ISCA'08]
  – 4GHz x86 in-order cores, 512kB–16MB L2, simple memory model (300-cycle latency for row misses)

Compression Ratio: BΔI vs. Prior Work
[Figure: per-benchmark compression ratios of ZCA, FVC, FPC, and BΔI for SPEC2006, database, and web workloads with a 2MB L2]
• BΔI achieves the highest compression ratio (1.53 geometric mean)

Single-Core: IPC and MPKI
[Figure: normalized IPC and normalized MPKI vs. L2 cache size, baseline (no compression) vs. BΔI; IPC gains of roughly 3.6%–8.1% and MPKI reductions of roughly 13%–24% across L2 sizes]
• BΔI achieves the performance of a 2X-size cache
• Performance improves due to the decrease in MPKI

Single-Core: Effect on Cache Capacity
[Figure: normalized IPC (fixed L2 cache latency) for 512kB-2way, 512kB-4way-BΔI, 1MB-4way, 1MB-8way-BΔI, 2MB-8way, 2MB-16way-BΔI, and 4MB-16way configurations; annotated gaps of 2.3%, 1.7%, and 1.3%]
• BΔI achieves performance close to the upper bound of a conventional cache of twice the size

Multi-Core Workloads
• Applications are classified along two axes:
  – Compressibility: effective cache size increase (Low Compr. (LC) < 1.40, High Compr. (HC) ≥ 1.40)
  – Sensitivity: performance gain with more cache (Low Sens. (LS) < 1.10, High Sens. (HS) ≥ 1.10, going from 512kB to 2MB)
• Three classes of applications: LCLS, HCLS, HCHS (no LCHS applications)
• For 2-core runs: random mixes from each possible class pair (20 each, 120 total workloads)

Multi-Core: Weighted Speedup
[Figure: normalized weighted speedup of ZCA, FVC, FPC, and BΔI for the six class-pair mixes (LCLS-LCLS, LCLS-HCLS, HCLS-HCLS, LCLS-HCHS, HCLS-HCHS, HCHS-HCHS) and the geometric mean; BΔI gains range from roughly 3.4%–4.5% for low-sensitivity mixes to 10.9%–18.0% when a high-sensitivity application is present]
• If at least one application in the mix is cache-sensitive, the performance improvement is highest
• BΔI's performance improvement is the highest of the four mechanisms (9.5% on average)

Other Results in Paper
• Sensitivity study of having more than 2X tags
  – Up to 1.98 average compression ratio
• Effect on bandwidth consumption
  – 2.31X decrease on average
• Detailed quantitative comparison with prior work
• Cost analysis of the proposed changes
  – 2.3% L2 cache area increase

Conclusion
• A new Base-Delta-Immediate compression mechanism
• Key insight: many cache lines can be efficiently represented using base + delta encoding
• Key properties:
  – Low-latency decompression
  – Simple hardware implementation
  – High compression ratio with high coverage
• Improves cache hit ratio and performance of both single-core and multi-core workloads
  – Outperforms state-of-the-art cache compression techniques: FVC and FPC

Linearly Compressed Pages: A Main Memory Compression Framework with Low Complexity and Low Latency
Gennady Pekhimenko, Vivek Seshadri, Yoongu Kim, Hongyi Xin, Onur Mutlu, Phillip B. Gibbons*, Michael A. Kozuch*, Todd C. Mowry

Executive Summary
• Main memory is a limited shared resource
• Observation: significant data redundancy
• Idea: compress data in main memory
• Problem: how to avoid a latency increase?
• Solution: Linearly Compressed Pages (LCP): fixed-size, cache-line-granularity compression
  1. Increases capacity (69% on average)
  2. Decreases bandwidth consumption (46%)
  3. Improves overall performance (9.5%)

Challenges in Main Memory Compression
1. Address computation
2. Mapping and fragmentation
3. Physically tagged caches

Address Computation
• In an uncompressed page, cache line Li starts at a fixed offset: 0, 64, 128, …, (N−1)×64
• In a compressed page with variable-size lines, the starting offsets of L1, L2, …, LN−1 are no longer known up front (see the sketch below)
[Figure: an uncompressed page with fixed 64B cache-line offsets vs. a compressed page where each line's offset is unknown ("?")]
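A small C sketch of why this matters: with fixed 64B lines the offset of line i is a single multiply, while with per-line variable compressed sizes the offset requires per-line size metadata and a serialized prefix sum on every access. The size[] metadata array here is purely illustrative.

```c
#include <stddef.h>
#include <stdint.h>

#define LINES_PER_PAGE 64

/* Uncompressed page: the offset is available in the same cycle as the access. */
size_t uncompressed_offset(unsigned line)
{
    return (size_t)line * 64;
}

/* Variable-size compressed page: needs metadata and a prefix sum over all
 * preceding lines, adding latency to every main-memory access. */
size_t variable_offset(unsigned line, const uint16_t size[LINES_PER_PAGE])
{
    size_t off = 0;
    for (unsigned i = 0; i < line; i++)
        off += size[i];
    return off;
}
```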
Mapping and Fragmentation
[Figure: a 4kB virtual page maps, via virtual-to-physical translation, to a physical page of unknown compressed size; variable physical page sizes lead to fragmentation]

Physically Tagged Caches
[Figure: the core issues a virtual address; TLB-based address translation sits on the critical path, and the resulting physical address is used to tag-match the L2 cache lines]
• Cache tags are physical addresses, so any change to the physical address layout interacts with this critical path

Shortcomings of Prior Work
[Table: prior main memory compression designs — IBM MXT [IBM J.R.D. '01] and Robust Main Memory Compression [ISCA '05] — compared with LCP (our proposal) on access latency, decompression latency, complexity, and compression ratio]

Linearly Compressed Pages (LCP): Key Idea
• Compress every cache line of an uncompressed page (4kB = 64 × 64B) to the same fixed size; e.g., 4:1 compression yields a 1kB compressed-data region
• Lines that do not compress to the target size are kept in a separate exception storage region (E)
• A metadata region (M, 64B) records for each cache line whether it is compressible and, if not, where it lives in the exception region (an illustrative layout follows below)
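As a rough picture of the layout, here is an illustrative C struct for a 4:1-compressed LCP page and the resulting line-address computation. The metadata encoding, exception-region capacity, and field sizes are assumptions made for this sketch; the paper's exact encoding differs.

```c
#include <stdint.h>

#define LINES_PER_PAGE 64
#define SLOT_BYTES     16          /* fixed compressed-line size (4:1)      */

struct lcp_meta {                  /* illustrative per-line metadata        */
    uint8_t is_exception;          /* 1: line stored uncompressed           */
    uint8_t exception_index;       /* slot in the exception region          */
};

struct lcp_page {
    uint8_t         data[LINES_PER_PAGE * SLOT_BYTES]; /* 1kB data region   */
    struct lcp_meta meta[LINES_PER_PAGE];              /* metadata region   */
    uint8_t         exceptions[14][64]; /* capacity depends on the physical
                                           page size the OS picked          */
};

/* Address of line i: one multiply in the common case, one indirection
 * through metadata for exception (incompressible) lines. */
const uint8_t *lcp_line(const struct lcp_page *p, unsigned i)
{
    if (p->meta[i].is_exception)
        return p->exceptions[p->meta[i].exception_index];
    return &p->data[i * SLOT_BYTES];
}
```

The key point is that the common-case line address is again base + i × fixed_size, so main memory accesses avoid the serialized offset computation of variable-size schemes.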
LCP Overview
• Page table entry extension
  – Compression type and size
  – Extended physical base address
• Operating system management support
  – 4 memory pools (512B, 1kB, 2kB, 4kB)
• Changes to cache tagging logic
  – Physical page base address + cache line index (within a page)
• Handling page overflows
• Compression algorithms: BDI [PACT'12], FPC [ISCA'04]

LCP Optimizations
• Metadata cache
  – Avoids additional requests to metadata
• Memory bandwidth reduction
  – Four adjacent compressed 64B cache lines can be fetched in 1 transfer instead of 4
• Zero pages and zero cache lines
  – Handled separately in the TLB (1 bit) and in metadata (1 bit per cache line)
• Integration with cache compression
  – BDI and FPC

Methodology
• Simulator
  – x86 event-driven simulators: Simics-based [Magnusson+, Computer'02] for CPU, Multi2Sim [Ubal+, PACT'12] for GPU
• Workloads
  – SPEC2006 benchmarks, TPC, Apache web server, GPGPU applications
• System parameters
  – L1/L2/L3 cache latencies from CACTI [Thoziyoor+, ISCA'08]
  – 512kB–16MB L2, simple memory model

Compression Ratio Comparison
[Figure: geometric-mean compression ratios for SPEC2006, database, and web workloads with a 2MB L2 cache — Zero Page 1.30, FPC 1.59, LCP (BDI) 1.62, LCP (BDI+FPC-fixed) 1.69, MXT 2.31, LZ 2.60]
• LCP-based frameworks achieve average compression ratios competitive with prior work

Bandwidth Consumption Decrease
[Figure: normalized BPKI (lower is better) for cache-only compression (FPC-cache, BDI-cache), memory-only compression ((None, LCP-BDI)), and combined (cache, memory) configurations ((FPC, FPC), (BDI, LCP-BDI), (BDI, LCP-BDI+FPC-fixed)); geometric means range from about 0.92 down to 0.54]
• LCP frameworks significantly reduce bandwidth consumption (46% on average)

Performance Improvement
  Cores | LCP-BDI | (BDI, LCP-BDI) | (BDI, LCP-BDI+FPC-fixed)
    1   |  6.1%   |      9.5%      |          9.3%
    2   | 13.9%   |     23.7%      |         23.6%
    4   | 10.7%   |     22.6%      |         22.5%
• LCP frameworks significantly improve performance

Conclusion
• A new main memory compression framework called LCP (Linearly Compressed Pages)
  – Key idea: a fixed size for compressed cache lines within a page and a fixed compression algorithm per page
• LCP evaluation:
  – Increases capacity (69% on average)
  – Decreases bandwidth consumption (46%)
  – Improves overall performance (9.5%)
  – Decreases off-chip bus energy (37%)