Decoupled Compressed Cache: Exploiting Spatial Locality for Energy-Optimized Compressed Caching
Somayeh Sardashti and David A. Wood
University of Wisconsin-Madison

Communication vs. Computation
[Figure: on-chip communication costs roughly 200X the energy of computation (Keckler, Micro 2011)]
Improving cache utilization is critical for energy efficiency!

Compressed Cache: Compress and Compact Blocks
+ Higher effective cache size
+ Small area overhead
+ Higher system performance
+ Lower system energy
Previous work limits compression effectiveness:
- Limited number of tags
- High internal fragmentation
- Energy-expensive re-compaction

Decoupled Compressed Cache (DCC)
Saving system energy by improving LLC utilization through cache compression.
• Decoupled super-blocks
• Non-contiguous sub-blocks
Compared to a 2X LLC, DCC outperforms it with only 1.08X the LLC area: 14% higher performance and 12% lower energy.

Outline
• Motivation
• Compressed caching
• Our proposal: Decoupled Compressed Cache
• Experimental results
• Conclusions

Uncompressed Caching
A fixed one-to-one tag/data mapping.
[Figure: tag array and data array, one data block per tag]

Compressed Caching
• Compress blocks to make room.
• Compact compressed blocks.
• Add more tags to increase effective cache capacity.
[Figure: tag array and data array with compressed, compacted blocks]

(1) Compression: how to compress blocks?
• There are different compression algorithms.
• Not the focus of this work — but which algorithm matters!
[Figure: a compressor shrinks a 64-byte block to 20 bytes]

Compression Potentials
[Figure: compression ratio vs. cycles to decompress across algorithms; ratios range from about 1.5 to 3.9]
Compression Ratio = Original Size / Compressed Size
A high compression ratio potentially yields a large normalized effective cache capacity.
We use C-PACK+Z for the rest of the talk!

(2) Compaction: how to store and find compressed blocks?
• Critical to achieving the compression potential.
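Before turning to specific compaction schemes, the arithmetic above can be made concrete. The following toy sketch is illustrative only — it is not the C-PACK+Z algorithm, and the 16-byte sub-block size merely anticipates the sub-blocked designs discussed below. It computes the compression ratio for the slide's 64-byte-to-20-byte example and the internal fragmentation left when storage is allocated in whole sub-blocks:

```python
import math

SUB_BLOCK = 16  # assumed sub-block size in bytes, matching the later slides

def compression_ratio(original: int, compressed: int) -> float:
    """Compression ratio = original size / compressed size."""
    return original / compressed

def sub_blocks_needed(compressed: int, sub_block: int = SUB_BLOCK) -> int:
    """A compressed block must occupy a whole number of sub-blocks."""
    return math.ceil(compressed / sub_block)

ratio = compression_ratio(64, 20)       # 64B block compressed to 20B -> 3.2
blocks = sub_blocks_needed(20)          # needs 2 sub-blocks -> 32B allocated
waste = blocks * SUB_BLOCK - 20         # 12B lost to internal fragmentation
```

The `waste` term is exactly the internal fragmentation that the compaction schemes on the next slides trade off against lookup complexity.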
• This work focuses on compaction.

Fixed-Size Compressed Cache (FixedC) [Kim, WMPI 2002; Yang, Micro 2002]
Internal fragmentation!
[Figure: tag and data arrays with fixed-size compressed slots, leaving unused space inside each slot]

Variable-Size Compressed Cache (VSC) [Alameldeen, ISCA 2004]
[Figure: tags map to variable numbers of contiguous data sub-blocks]

Previous Compressed Caches
[Figure: normalized effective capacity of prior designs (about 1.7 to 3.1) vs. a potential of 3.9; example of a 10B compressed block stored in 16B sub-blocks]
Normalized Effective Capacity = Number of Valid Blocks / Max Number of (Uncompressed) Blocks
(Limit 1) Limited tags/metadata: reaching the potential requires adding 4X or more tags — high area overhead.
(Limit 2) Internal fragmentation: low cache capacity utilization.
(Limit 3) Energy-expensive re-compaction: when an update grows block B to two sub-blocks, VSC must re-compact the set to keep each block's sub-blocks contiguous — up to 3X higher LLC dynamic energy!
[Figure: VSC tag and data arrays before and after re-compacting block B]

Decoupled Compressed Cache (DCC)
(1) Exploiting spatial locality → low area overhead
(2) Decoupling the tag/data mapping → eliminates energy-expensive re-compaction and reduces internal fragmentation
(3) Co-DCC: dynamically co-compacting super-blocks → further reduces internal fragmentation

(1) Exploiting Spatial Locality
Neighboring blocks co-reside in the LLC (89% on average).
DCC tracks LLC blocks at super-block granularity.
[Figure: one super-block tag covers a quad (Q: blocks A, B, C, D), each block with its own coherence state; a singleton (S: block E) uses one tag; 2X super tags replace 4X per-block tags]
Up to 4X blocks with low area overhead!

(2) Decoupling the Tag/Data Mapping
DCC decouples the tag/data mapping to eliminate re-compaction: on an update, a block's sub-blocks are allocated flexibly and need not be contiguous.
Back pointers identify the owner block of each sub-block: each back pointer holds the owner's tag ID and block ID.
[Figure: super tags, back-pointer array (Tag ID, Blk ID), and data sub-blocks; Quad (Q): A, B, C, D; Singleton (S): E]

(3) Co-Compacting Super-Blocks
Co-DCC dynamically co-compacts super-blocks.
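To make the decoupled mapping concrete, here is a minimal sketch of the back-pointer match that finds a block's (possibly non-contiguous) sub-blocks. The `BackPointer` structure and field names are hypothetical illustrations, not the paper's exact hardware layout:

```python
from dataclasses import dataclass
from typing import List

@dataclass
class BackPointer:
    valid: bool
    tag_id: int   # which super tag in the set owns this sub-block
    blk_id: int   # which block (0-3) within that super-block

def gather_sub_blocks(back_ptrs: List[BackPointer],
                      tag_id: int, blk_id: int) -> List[int]:
    """Return the data-array indices of the sub-blocks owned by one block.

    Because the mapping is decoupled, the owner's sub-blocks may sit
    anywhere in the set, so no re-compaction is needed when a
    neighboring block changes size.
    """
    return [i for i, bp in enumerate(back_ptrs)
            if bp.valid and bp.tag_id == tag_id and bp.blk_id == blk_id]

# Block 1 of super tag 0 owns sub-blocks 0 and 5 — non-contiguous on purpose:
ptrs = [BackPointer(True, 0, 1), BackPointer(True, 0, 3),
        BackPointer(False, 0, 0), BackPointer(True, 1, 0),
        BackPointer(True, 0, 2), BackPointer(True, 0, 1)]
owned = gather_sub_blocks(ptrs, tag_id=0, blk_id=1)
```

Growing a block on an update only means claiming another free sub-block and setting its back pointer — the contrast with VSC's set-wide re-compaction above.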
Co-DCC packs the compressed blocks of a super-block into shared sub-blocks, reducing internal fragmentation.
[Figure: the blocks of a quad (Q: A, B, C, D) sharing data sub-blocks]

Experimental Methodology
• Integrated DCC with the AMD Bulldozer cache.
- We model the timing and allocation constraints of sequential regions at the LLC in detail.
- No alignment network is needed.
• Verilog implementation and synthesis of the tag-match and sub-block selection logic.
- One additional cycle of latency due to sub-block selection.

Full-system simulation with a GEMS-based simulator.
A wide range of applications with different levels of cache sensitivity:
- Commercial workloads: apache, jbb, oltp, zeus
- SPEC-OMP: ammp, applu, equake, mgrid, wupwise
- PARSEC: blackscholes, canneal, freqmine
- SPEC 2006 mixes (m1-m8): bzip2, libquantum-bzip2, libquantum, gcc, astar-bwaves, cactus-mcf-milc-bwaves, gcc-omnetpp-mcf-bwaves-lbm-milc-cactus-bzip, omnetpp-lbm

System configuration:
Cores        | Eight OOO cores, 3.2 GHz
L1I$/L1D$    | Private, 32 KB, 8-way
L2$          | Private, 256 KB, 8-way
L3$          | Shared, 8 MB, 16-way, 8 banks
Main memory  | 4 GB, 16 banks, 800 MHz bus frequency, DDR3

Effective LLC Capacity vs. Area
[Figure: normalized LLC area vs. normalized effective LLC capacity for Baseline, FixedC, VSC, DCC, Co-DCC, and a 2X baseline]

Area overheads:
Components           | FixedC/VSC-2X | DCC  | Co-DCC
Tag array            | 6.3%          | 2.1% | 11.3%
Back-pointer array   | 0             | 4.4% | 5.4%
(De-)compressors     | 1.8%          | 1.8% | 1.8%
Total area overhead  | 8.1%          | 8.3% | 18.5%

(Co-)DCC Performance
[Figure: normalized runtime across configurations: 0.96, 0.95, 0.93, 0.90, 0.86]
(Co-)DCC boost system performance significantly.

(Co-)DCC Energy Consumption
[Figure: normalized system energy across configurations: 0.96, 0.97, 0.93, 0.91, 0.88]
(Co-)DCC reduce system energy by reducing the number of accesses to main memory.
Summary
Analyzed the limits of compressed caching:
• Limited number of tags
• Internal fragmentation
• Energy-expensive re-compaction
Decoupled Compressed Cache:
• Improves the performance and energy of compressed caching
• Decoupled super-blocks
• Non-contiguous sub-blocks
Co-DCC further reduces internal fragmentation.
Practical designs [details in the paper].

Backup
(De-)compression overhead; DCC data array organization with AMD Bulldozer; DCC timing; DCC lookup; applications; Co-DCC design; LLC effective capacity; LLC miss rate; memory dynamic energy; LLC dynamic energy.

(De-)Compression Overhead
Parameters             | Compressor | Decompressor
Pipeline depth         | 6          | 2
Latency (cycles)       | 16         | 9
Area (mm²)             | 0.016      | 0.016
Power consumption (mW) | 25.84      | 19.01

DCC Data Array Organization (AMD Bulldozer)
[Figure: the data array is split into four sub-ranks (SR0-SR3), each with its own 4-bit sub-block address off the set address; A-phase and B-phase flops stage the accesses. Example: A0 is uncompressed (sub-blocks A0.0-A0.3), while B1 and C3 are each compressed to two sub-blocks.]

DCC Timing
[Figure: DCC pipeline timing]

DCC Lookup
1. Access the super tags and back pointers in parallel.
2. Find the matching back pointers.
3.
Read the corresponding sub-blocks and decompress them.
[Figure: lookup example — block C hits in a quad super tag (Q); the back pointers matching its tag ID and block ID select C's data sub-blocks. Quad (Q): A, B, C, D; Singleton (S): E]

Applications
Grouped by sensitivity: sensitive to cache capacity; sensitive to cache capacity and latency; sensitive to latency; cache insensitive.
SPEC 2006 mixes (m1-m8), sensitive to cache capacity: bzip2, libquantum-bzip2, libquantum, gcc, astar-bwaves, cactus-mcf-milc-bwaves, gcc-omnetpp-mcf-bwaves-lbm-milc-cactus-bzip, omnetpp-lbm

Co-DCC Design
[Figure: Co-DCC tag entry — a super-block tag plus, per block, a coherence state (3b), a compression bit (1b), and a begin pointer (7b), along with sharers and an end pointer; the compressed blocks of super-block A (A0, A1, A2) are packed back-to-back into shared sub-blocks.]

LLC Effective Cache Capacity
[Chart: normalized LLC effective capacity, 1.0-4.0]

LLC Miss Rate
[Chart: normalized LLC miss rate, 0.20-1.00]

Memory Dynamic Energy
[Chart: normalized memory dynamic energy, 0.40-1.00]

LLC Dynamic Energy
[Chart: normalized LLC dynamic energy, 0-6]
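The three lookup steps above can be sketched end-to-end. This is a software model under stated assumptions, not the paper's exact design: it assumes 64-byte blocks, four-block quads, and sets indexed at super-block granularity so a quad's four neighbors share a set; all structure names and field widths are illustrative:

```python
BLOCK_BITS = 6   # 64-byte cache blocks
SB_BLOCKS = 4    # a quad super-block covers four neighboring blocks

def decompose(addr: int, num_sets: int):
    """Split an address into (super_tag, set_index, blk_id).

    blk_id picks one of the four neighbors inside the super-block;
    indexing by super-block address keeps the quad in one set.
    """
    blk_id = (addr >> BLOCK_BITS) & (SB_BLOCKS - 1)
    sb_addr = addr >> (BLOCK_BITS + 2)
    return sb_addr // num_sets, sb_addr % num_sets, blk_id

def dcc_lookup(addr, super_tags, back_ptrs, num_sets):
    """Model of the lookup: (1) probe super tags and back pointers,
    (2) find the back pointers matching the hit (tag_id, blk_id),
    (3) return the sub-block indices to read and decompress."""
    super_tag, set_idx, blk_id = decompose(addr, num_sets)
    for tag_id, (tag, valid) in enumerate(super_tags[set_idx]):
        if tag == super_tag and valid[blk_id]:          # step 1: tag match
            return [i for i, bp in enumerate(back_ptrs[set_idx])
                    if bp == (tag_id, blk_id)]          # step 2: BP match
    return None                                         # miss

# One set, one quad tag (0x12) with blocks 0 and 2 valid;
# block 2 of that quad owns sub-blocks 1 and 3.
tags = [[(0x12, [True, False, True, False])]]
bptrs = [[(0, 0), (0, 2), (9, 9), (0, 2)]]
addr = (0x12 << 8) | (2 << 6)    # block 2 of super-block 0x12
```

In hardware, steps 1 and 2 happen in parallel and step 3 drives the sub-block selection logic that costs the one extra cycle reported in the methodology.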