CAMEO: A Cache-Like Memory Organization for 3D Memory Systems
12/15/2014, MICRO, Cambridge, UK
Chiachen Chou, Georgia Tech; Aamer Jaleel, Intel; Moinuddin K. Qureshi, Georgia Tech

EXECUTIVE SUMMARY
• How to use Stacked DRAM: Cache or Memory?
• Cache: software-transparent, fine-grained data transfer, but sacrifices memory capacity
• Memory: larger memory capacity, but needs software support and coarse-grained data transfer
• CAMEO: software-transparent, fine-grained data transfer, and almost full memory capacity
• Results: CAMEO outperforms both Cache (50% speedup) and Two-Level Memory (50% speedup) by providing 78% speedup

MEMORY BANDWIDTH WALL
• Computer systems face the memory bandwidth wall
• Stacked DRAM (e.g., High Bandwidth Memory, Hybrid Memory Cube) vs. commodity DRAM: 2-8X bandwidth at 0.5-1X latency
• Stacked DRAM helps overcome the bandwidth wall
(Courtesy: JEDEC, Intel)

HYBRID MEMORY SYSTEM
• Stacked DRAM: 1-4 GB; Commodity DRAM: 8-16 GB
• Stacked DRAM capacity is about 0.25X that of commodity DRAM
• How to use Stacked DRAM: Cache or Main Memory?
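The capacity consequence of the two options can be made concrete with a minimal sketch (illustrative code, not from the talk; the function names are assumptions) for the 4GB-stacked / 12GB-commodity configuration used throughout the deck:

```python
# Illustrative sketch contrasting the two ways to use stacked DRAM
# next to commodity DRAM. Function names are assumptions for this sketch.

GB = 1 << 30

def as_cache_capacity(stacked=4 * GB, offchip=12 * GB):
    # Cache: stacked DRAM holds copies of off-chip lines,
    # so it adds no OS-visible capacity.
    return offchip

def as_memory_capacity(stacked=4 * GB, offchip=12 * GB):
    # Memory (Two-Level Memory): stacked DRAM extends the address space.
    return stacked + offchip

print(as_cache_capacity() // GB)   # 12 (GB visible to the OS)
print(as_memory_capacity() // GB)  # 16 (GB visible to the OS)
```

CAMEO's goal, described next, is to get the 16GB of the memory organization while keeping the cache organization's transparency and 64B transfers.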
AGENDA
• Introduction
• Background
  – Cache
  – Two-Level Memory
• CAMEO
  – Concept
  – Implementation
• Methodology
• Results
• Summary

HARDWARE-MANAGED CACHE
• Memory hierarchy (fast to slow): CPU, L1$, L2$, shared L3$, stacked DRAM, off-chip DRAM
• Stacked DRAM is architected as a DRAM Cache

HARDWARE-MANAGED CACHE
• On an L3 miss, the L4 (DRAM) cache is accessed; on an L4 miss, off-chip memory is accessed, with 64B data transfers
• Cache: software transparency and fine-grained data transfer, but no capacity benefit

3D DRAM AS A CACHE
• Stacked 4GB DRAM acts as a cache for commodity 12GB off-chip DRAM, so only 12GB is OS-visible (vs. 16GB if the stacked DRAM were memory)

                   Cache    TLM      CAMEO
  Need OS Support  No       Yes      No
  Data Transfer @  64B      4KB      64B
  Memory Capacity  No 3D    Plus 3D  Plus 3D

TWO-LEVEL MEMORY (TLM)
• 4GB stacked DRAM + 12GB off-chip DRAM = 16GB of OS-visible memory
• Stacked DRAM is architected as part of the OS-visible memory space (Two-Level Memory)

TWO-LEVEL MEMORY (NO MIGRATION)
• The 4GB stacked DRAM holds 25% of the pages; the 12GB off-chip DRAM holds 75%
• Static page mapping does not exploit locality

TWO-LEVEL MEMORY (WITH MIGRATION)
• With OS support, pages migrate between stacked and off-chip DRAM in 4KB transfers, even though an L3 miss needs only 64B
• TLM: OS support and inefficient use of bandwidth

MOTIVATION
• Baseline: 12GB off-chip DRAM; each configuration adds 4GB stacked DRAM
• Configurations: Cache (4+12), TLM (4+12), DoubleUse (4+16)
[Chart: speedup over baseline for Small Working Set (<12GB), Large Working Set (>12GB), and Overall; a 31% gap is annotated]
• Cache performs poorly in Large WS workloads, as TLM does in Small WS workloads

OVERVIEW
                   Cache    TLM      Ideal
  Need OS Support  No       Yes      No
  Data Transfer @  64B      4KB      64B
  Memory Capacity  No 3D    Plus 3D  Plus 3D
CAMEO: A CAche-Like MEmory Organization
• 4GB stacked DRAM + 12GB off-chip DRAM = 16GB of OS-visible memory
• Software gets the full capacity; hardware performs the data migration

CAMEO: A CAche-Like MEmory Organization
• On an L3 miss, hardware swaps 64B lines between stacked and off-chip memory (fine-grained transfer)
• CAMEO transfers only 64B cache lines

CAMEO – CONGRUENCE GROUP
• Stacked memory holds lines 0 to N-1 (e.g., line A); off-chip memory holds lines N to 2N-1 (line B), 2N to 3N-1 (line C), and 3N to 4N-1 (line D)
• The four lines A, B, C, D that share an offset form a congruence group

MIGRATION IN CONGRUENCE GROUP
• Initial layout: A B C D (line A is in stacked DRAM)
• Request to B: swap lines A and B → B A C D
• Request to B: hit in stacked DRAM → B A C D
• Request to C: swap lines C and B → C A B D
• Swapping changes a line's location and requires an indexing structure to keep track of it

LINE LOCATION TABLE (LLT)
• Location table for a congruence group with layout C A B D (physical locations 00, 01, 10, 11):

  Line  Physical Location
  A     01
  B     10
  C     00
  D     11

LINE LOCATION TABLE (LLT)
• Size per congruence group: log2(4) = 2 bits per line × 4 lines = 8 bits (1 byte)
• 64M congruence groups → 64MB of table
• Storing the LLT in SRAM is impractical

LLT IN DRAM
• Storing the LLT in DRAM incurs serialization latency
• Optimize for the common case: a hit in stacked DRAM
• Co-locate the LLT entry of each congruence group with its data in stacked DRAM as a LEAD (Location Entry And Data): 64 bytes of data plus a 1-byte LLT entry, 31 LEADs per 2KB row, at a 1.5% capacity loss

AVOID LLT LOOKUP LATENCY FOR HIT
• On a stacked DRAM hit (line in stacked memory), the co-located LLT entry arrives with the data: one access
• Co-locating the LLT avoids the lookup latency on hits

AVOID LLT LOOKUP LATENCY FOR MISS
• On a stacked DRAM miss (line in off-chip memory), use a Line Location Predictor to fetch data from the predicted location in parallel with the LLT lookup
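The congruence-group swap and Line Location Table described above can be simulated in a few lines of Python (a minimal sketch; the class and method names are illustrative, not from the paper):

```python
# Minimal sketch of one CAMEO congruence group: one stacked-DRAM slot
# plus three off-chip slots, with a 1-byte Line Location Table (LLT,
# 2 bits per line) tracking where each of the four lines resides.
# Names are illustrative, not from the CAMEO paper.

class CongruenceGroup:
    STACKED = 0  # physical location 0 is the stacked-DRAM slot

    def __init__(self, lines=("A", "B", "C", "D")):
        self.slots = list(lines)  # contents of physical locations 0..3
        # LLT: line -> physical location (2 bits each, 8 bits total)
        self.llt = {line: loc for loc, line in enumerate(lines)}

    def access(self, line):
        """Return 'hit' if the line is in stacked DRAM; otherwise swap
        it with the current stacked-DRAM resident and return 'miss'."""
        loc = self.llt[line]
        if loc == self.STACKED:
            return "hit"
        evicted = self.slots[self.STACKED]
        self.slots[self.STACKED], self.slots[loc] = line, evicted
        self.llt[line], self.llt[evicted] = self.STACKED, loc
        return "miss"

g = CongruenceGroup()
print(g.access("B"), g.slots)  # miss ['B', 'A', 'C', 'D']
print(g.access("B"), g.slots)  # hit  ['B', 'A', 'C', 'D']
print(g.access("C"), g.slots)  # miss ['C', 'A', 'B', 'D']
```

The three accesses reproduce the B, B, C sequence from the migration slide, ending in the C A B D layout that the LLT slide tabulates; at 1 byte per group, 64M such groups yield the 64MB table that motivates storing the LLT in DRAM.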
• Stacked DRAM and the predicted off-chip location are accessed in parallel; the LEAD entry verifies the location when both are ready

AVOID LLT LOOKUP LATENCY FOR MISS
• The Line Location Predictor (LLP) makes an M-ary prediction: stacked, off-chip #1, off-chip #2, or off-chip #3
• The LLP uses the instruction address and the last location to make the prediction, and costs 64 bytes per core
• Accuracy: Always-Stacked predictor 70%; LLP 92%

AVOIDING LLT LATENCY OVERHEAD
• On a hit in stacked DRAM: co-locate the LLT of each congruence group with its data in stacked DRAM
• On a miss in stacked DRAM: use the Line Location Predictor to fetch data from the predicted location in parallel
• We co-locate the Line Location Table and use the Line Location Predictor to mitigate the latency overhead

METHODOLOGY
• System: CPU with stacked DRAM, commodity DRAM, and SSD
• Core: 3.2GHz, 2-wide, out-of-order
• Chip: 32 cores, 32MB 32-way shared L3 cache

METHODOLOGY
             Stacked DRAM                 Commodity DRAM
  Capacity   4GB                          12GB
  Bus        DDR 3.2GHz, 128-bit          DDR 1.6GHz, 64-bit
  Latency    22ns                         44ns
  Channels   16 channels, 16 banks each   8 channels, 8 banks each

METHODOLOGY
• SSD latency: 32 microseconds
• Baseline: 12GB off-chip DRAM
• Cache: Alloy Cache [MICRO'12]
• Two-Level Memory: page migration enabled
• SPEC2006 in rate mode: Small Working Set (<12GB) and Large Working Set (>12GB)

PERFORMANCE IMPROVEMENT
[Chart: speedup of Cache, TLM, and CAMEO on Small WS benchmarks — gcc, milc, soplex, libq, xalanc, omnetpp, leslie, sphinx3, bzip2, dealII, astar, GMEAN]
• CAMEO is as good as Cache in Small WS apps

PERFORMANCE IMPROVEMENT
[Chart: speedup of Cache, TLM, and CAMEO on Large WS benchmarks — mcf, lbm, Gems, bwaves, cactus, zeusmp, GMEAN — and overall]
• CAMEO outperforms TLM by 28% in Large WS apps
• Overall, CAMEO outperforms both Cache and TLM, and comes very close to DoubleUse

Thank You!

BACKUP SLIDES

LINE LOCATION TABLE
• Size of the location table per congruence group grows with the number of locations: log2(4) = 2 bits per line × 4 lines = 8 bits (1 byte)

  # Locations  Size
  4            1 byte
  6            2.5 bytes
  8            3 bytes

POWER AND ENERGY
[Chart: power and EDP of Cache, TLM, and CAMEO, normalized to baseline; 34% and 14% differences are annotated]
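As a companion to the backup material, the Line Location Predictor from the main slides can also be sketched as a small per-core table (a minimal sketch under assumed details; the 256-entry geometry and the hash are illustrative choices consistent with the quoted 64-byte-per-core budget at 2 bits per entry, not the paper's exact design):

```python
# Minimal sketch of a Line Location Predictor (LLP): a small per-core
# table, indexed by a hash of the miss-causing instruction address,
# that predicts which physical location (0 = stacked, 1-3 = off-chip
# slots) holds the requested line. Geometry and hashing are assumptions.

class LineLocationPredictor:
    def __init__(self, entries=256):
        self.entries = entries
        self.table = [0] * entries  # 2-bit location code per entry

    def _index(self, pc):
        # Simple hash of the instruction address (illustrative)
        return (pc >> 2) % self.entries

    def predict(self, pc):
        """Predicted physical location for a miss from this instruction."""
        return self.table[self._index(pc)]

    def update(self, pc, actual_location):
        # On resolution, remember where this instruction's line was found
        self.table[self._index(pc)] = actual_location & 0b11

llp = LineLocationPredictor()
llp.update(0x400940, 2)       # line was found in off-chip slot #2
print(llp.predict(0x400940))  # 2: next miss from this PC probes slot #2
```

On an L3 miss, the predicted location is accessed in parallel with the stacked-DRAM LEAD lookup, and the LEAD entry verifies the prediction once both are ready.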