Integrating Memory Compression and Decompression with Coherence Protocols in DSM Multiprocessors
Lakshmana R Vittanala, Intel
Mainak Chaudhuri, IIT Kanpur

Talk in Two Slides (1/2)
Memory footprint of data-intensive workloads is ever-increasing
– We explore compression to reduce memory pressure in a medium-scale DSM multiprocessor
Dirty blocks evicted from the last level of cache are sent to the home node
– Compress in the home memory controller
A last-level cache miss request from a node is sent to the home node
– Decompress in the home memory controller

Talk in Two Slides (2/2)
No modification in the processor
– Cache hierarchy sees decompressed blocks
All changes are confined to the directory-based cache coherence protocol
– Leverage spare core(s) to execute compression-enabled protocols in software
– Extend the directory structure for compression book-keeping
Use a hybrid of two compression algorithms
– On 16 nodes running seven scientific computing workloads: 73% storage saving on average with at most 15% increase in execution time

Contributions
Two major contributions
– First attempt to look at compression/decompression as directory protocol extensions in mid-range servers
– First proposal to execute a compression-enabled directory protocol in software on spare core(s) of a multi-core die
  – Makes the solution attractive in many-core systems

Sketch
Background: Programmable Protocol Core
Directory Protocol Extensions
Compression/Decompression Algorithms
Simulation Results
Related Work and Summary

Programmable Protocol Core
Past studies have considered off-die programmable protocol processors
– They offer flexibility in the choice of coherence protocols compared to hardwired FSMs, but suffer from performance loss [Sun S3.mp, Sequent STiNG, Stanford FLASH, Piranha, …]
With on-die integration of the memory controller and the availability of a large number of on-die cores, programmable protocol cores may become an attractive design
– Recent studies show almost no performance loss [IEEE TPDS, Aug'07]

Programmable Protocol Core
In our simulated system, each node contains
– One complex out-of-order issue core which runs the application thread
– One or two simple in-order static dual-issue programmable protocol core(s) which run the directory-based cache coherence protocol in software
– On-die integrated memory controller, network interface, and router
Compression/decompression algorithms are integrated into the directory protocol software

Programmable Protocol Core
[Node diagram: an out-of-order core running the application thread (AT) and an in-order protocol core/protocol processor running the protocol thread (PT), each with private IL1 and DL1 caches, share an L2; the on-die memory controller connects to SDRAM and the router connects to the network.]

Anatomy of a Protocol Handler
On arrival of a coherence transaction at the memory controller of a node, a protocol handler is scheduled on the protocol core of that node
– Calculates the directory address if home node (simple hash function on the transaction address)
– Reads the 64-bit directory entry if home node
– Carries out simple integer arithmetic operations to figure out coherence actions
– May send messages to remote nodes
– May initiate transactions to the local OOO core

Baseline Directory Protocol
Invalidation-based three-state (MSI) bitvector protocol
– Derived from the SGI Origin MESI protocol and improved to handle early and late intervention races better
64-bit directory entry (matching the protocol core's 64-bit datapath): a 4-bit state field (four states: L, M, and two busy states), 44 unused bits, and a 16-bit sharer vector
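To make the layout concrete, here is a minimal C sketch of such a directory entry. It is not from the paper: the type, macro, and function names are ours, and the exact bit positions are assumptions (the slide only fixes the field widths, 4 + 44 + 16 = 64, and their left-to-right order). The 44 unused bits are where most of the compression meta-data described later ends up.

```c
/* Illustrative (assumed) layout of the 64-bit directory entry:
 * bits 63:60 state, bits 59:16 unused, bits 15:0 sharer vector
 * (one presence bit per node on a 16-node system). */
#include <stdint.h>

typedef uint64_t dir_entry_t;

#define DIR_STATE_SHIFT 60              /* assumed: state in the top 4 bits */
#define DIR_SHARER_MASK 0xFFFFull       /* assumed: sharers in bits 15:0    */

static inline unsigned dir_state(dir_entry_t e) {
    return (unsigned)(e >> DIR_STATE_SHIFT) & 0xFu;
}

static inline unsigned dir_sharers(dir_entry_t e) {
    return (unsigned)(e & DIR_SHARER_MASK);
}

static inline int dir_is_sharer(dir_entry_t e, unsigned node) {
    return (int)((dir_sharers(e) >> node) & 1u);   /* node in [0, 15] */
}
```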
Directory Protocol Extensions
Compression support
– All handlers that update memory blocks need extension with the compression algorithm
– Two major categories: writeback handlers and GET intervention response handlers
  – The latter involves a state demotion from M to S and hence requires an update of the memory block at home
  – GETX interventions do not require a memory update, as they involve an ownership hand-off only
Decompression support
– All handlers that access memory in response to last-level cache miss requests

Directory Protocol Extensions
Compression support (writeback cases)
[Message-flow diagram: a writeback (WB) travels from the source processor (SP) via its protocol processor (SPP) to the home protocol processor (HPP), which compresses the block into DRAM and returns WB_ACK.]

Directory Protocol Extensions
Compression support (writeback cases)
[Message-flow diagram: a writeback (WB) from the home processor (HP) goes directly to the home protocol processor (HPP), which compresses the block into DRAM.]

Directory Protocol Extensions
Compression support (intervention cases)
[Message-flow diagram: a GET from the requesting processor (RP) travels via its protocol processor (RPP) to the home protocol processor (HPP); HPP forwards the GET intervention to the dirty processor (DP), which sends a sharing writeback (SWB) to home, where it is compressed into DRAM, and a PUT reply to the requester.]

Directory Protocol Extensions
Compression support (intervention cases)
[Message-flow diagram: as above, but the dirty block is held by the home processor (HP); HP returns the uncompressed block, which HPP compresses into DRAM while the PUT reply goes to the requester (RP via RPP).]

Directory Protocol Extensions
Compression support (intervention cases)
[Message-flow diagram: the requester is the home processor (HP); its GET goes to HPP, the intervention reaches the dirty processor (DP), and the uncompressed PUT reply is compressed into DRAM at home before HP receives the PUT.]

Directory Protocol Extensions
Decompression support
[Message-flow diagram: a GET/GETX from a remote processor (RP) travels via RPP to HPP; HPP reads the compressed block from DRAM, decompresses it, and returns PUT/PUTX.]

Directory Protocol Extensions
Decompression support
[Message-flow diagram: the same flow when the requester is the home processor (HP); HPP decompresses from DRAM and returns PUT/PUTX.]

Compression Algorithms
Consider one 64-bit chunk of a 128-byte cache block at a time

Algorithm I
Original                  Compressed     Encoding
All zero                  Zero bytes     00
MS 4 bytes zero           LS 4 bytes     01
MS 4 bytes = LS 4 bytes   LS 4 bytes     10
None                      Full 64 bits   11

Algorithm II
Differs in encoding 10: LS 4 bytes zero; the compressed block stores the MS 4 bytes.
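As a concrete illustration of the encoding table, here is a hedged C sketch of per-chunk compression. Function and constant names are ours, each chunk is treated as a uint64_t value (byte-order handling elided), and the real handlers are protocol software on the protocol core rather than portable C. The block-level driver (algorithm speculation and the maxCsz early abort) is described on the next slides.

```c
/* Sketch of the per-chunk encodings in the table above. Each 64-bit
 * chunk yields a 2-bit encoding (later collected into the 32-bit
 * header) plus 0, 4, or 8 payload bytes appended at *out. */
#include <stdint.h>
#include <string.h>

enum { ALG_I, ALG_II };   /* which of the two algorithms is in use */

static unsigned compress_chunk(uint64_t chunk, int alg, uint8_t **out)
{
    uint32_t ms = (uint32_t)(chunk >> 32);  /* most-significant 4 bytes  */
    uint32_t ls = (uint32_t)chunk;          /* least-significant 4 bytes */

    if (chunk == 0)
        return 0x0;                         /* 00: all zero, no payload  */
    if (ms == 0) {
        memcpy(*out, &ls, 4); *out += 4;    /* 01: MS half zero, keep LS */
        return 0x1;
    }
    if (alg == ALG_I && ms == ls) {
        memcpy(*out, &ls, 4); *out += 4;    /* 10 (Alg I): equal halves  */
        return 0x2;
    }
    if (alg == ALG_II && ls == 0) {
        memcpy(*out, &ms, 4); *out += 4;    /* 10 (Alg II): LS half zero */
        return 0x2;
    }
    memcpy(*out, &chunk, 8); *out += 8;     /* 11: incompressible chunk  */
    return 0x3;
}
```

In the worst case a chunk costs its full 8 bytes plus 2 header bits; the running payload size is what gets compared against maxCsz in the early-abort trade-off described below.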
Compression Algorithms
Ideally we would compute the compressed size under both algorithms for each of the 16 double words in a cache block and pick the better one
– The overhead is too high
Trade-off #1
– Speculate based on the first 64 bits
– If MS 32 bits ^ LS 32 bits = 0, the two halves are equal: use Algorithm I (the condition covers two of its cases, all zero and equal halves)
– If MS 32 bits & LS 32 bits = 0, use Algorithm II (the condition is necessary for all three of its compressible cases: all zero, MS half zero, LS half zero)

Compression Algorithms
Trade-off #2
– If the compression ratio is low, it is better to avoid the decompression overhead
  – Decompression is fully on the critical path
– After compressing every 64 bits, compare the running compressed size against a threshold maxCsz (48 bytes performed best)
– Abort compression and store the entire block uncompressed as soon as the threshold is crossed

Compression Algorithms
Meta-data
– Required for decompression
– Most meta-data are stored in the 44 unused bits of the directory entry
– The cache controller generates the uncompressed block address, so the directory address computation remains unchanged
– 32 bits to locate the compressed block
  – The compressed block size is a multiple of 4 bytes, but we extend it to the next 8-byte boundary to have a cushion for future use
  – 32 bits addressing 8-byte-aligned blocks therefore allow us to address 32 GB of compressed memory

Compression Algorithms
Meta-data
– Two bits to identify the compression algorithm
  – Algorithm I, Algorithm II, uncompressed, or all zero
  – All-zero blocks do not store anything in memory
– For each 64 bits we need to know one of four encodings
  – Maintained in a 32-bit header (two bits for each of the 16 double words)
– Optimization to speed up relocation: store the size of the compressed block in the directory entry
  – Requires four bits (16 double words maximum)
– 70 bits of meta-data per compressed block

Decompression Example
Directory entry information
– 32-bit address: 0x4fd1276a
  – Actual address = 0x4fd1276a << 3
– Compression state: 01
  – Algorithm II was used
– Compressed size: 0101
  – Actual size = 5 × 8 = 40 bytes (not used in decompression)
Header information
– 32-bit header: 00 11 10 00 00 01…
  – Upper 64 bits used encoding 00 of Algorithm II
  – Next 64 bits used encoding 11 of Algorithm II
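Putting the meta-data together, here is a hedged C sketch of block decompression; names are ours. The 2-bit compression state values follow the example above (01 = Algorithm II); placing chunk 0's encoding in the header's top two bits matches the example's left-to-right reading, but is otherwise our assumption, as are the codes for "uncompressed" and "all zero". Note that the 32-bit header cannot fit in the directory entry's 44 unused bits alongside the other 38 meta-data bits, so it presumably lives with the compressed block.

```c
/* Sketch: rebuild a 128-byte block from the 2-bit compression state,
 * the 32-bit per-chunk header, and the compressed payload bytes. */
#include <stdint.h>
#include <string.h>

enum { ST_ALG_I = 0, ST_ALG_II = 1,        /* 01 = Alg II, per the example */
       ST_UNCOMPRESSED = 2, ST_ALL_ZERO = 3 };   /* assumed code points    */

static void decompress_block(unsigned state, uint32_t header,
                             const uint8_t *in, uint64_t out[16])
{
    if (state == ST_ALL_ZERO)     { memset(out, 0, 128); return; }
    if (state == ST_UNCOMPRESSED) { memcpy(out, in, 128); return; }

    for (int i = 0; i < 16; i++) {          /* 16 double words per block */
        unsigned enc = (header >> (30 - 2 * i)) & 0x3u;  /* chunk 0 first */
        uint32_t half = 0;
        switch (enc) {
        case 0x0:                           /* all zero: nothing stored   */
            out[i] = 0;
            break;
        case 0x1:                           /* MS half zero: LS stored    */
            memcpy(&half, in, 4); in += 4;
            out[i] = (uint64_t)half;
            break;
        case 0x2:                           /* algorithm-specific case    */
            memcpy(&half, in, 4); in += 4;
            if (state == ST_ALG_I)
                out[i] = ((uint64_t)half << 32) | half;  /* equal halves  */
            else
                out[i] = (uint64_t)half << 32;           /* LS half zero  */
            break;
        default:                            /* 0x3: stored verbatim       */
            memcpy(&out[i], in, 8); in += 8;
            break;
        }
    }
}
```

For the example above, the payload would be fetched starting at byte address 0x4fd1276a << 3; the first chunk decodes to zero (encoding 00) and the second is copied verbatim (encoding 11).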
Performance Optimization
Protocol thread occupancy is critical
– Two protocol cores
– Out-of-order NI scheduling to improve protocol core utilization
– Cached message buffer (filled with the writeback payload)
  – 16 uncached loads/stores to the message buffer would be needed during compression if it were not cached
  – Caching requires invalidating the buffer contents at the end of compression (a coherence issue)
  – Flushing dirty contents would occupy the datapath, so we allow only cached loads
– The compression ratio remains unaffected

Storage Saving
[Bar chart: percentage of memory storage saved for Barnes, FFT, FFTW, LU, Ocean, Radix, and Water on a 0–80% scale; the savings range from 16% to 73%, with 21% and 66% among the per-application values.]

Slowdown
[Bar chart: execution time normalized to the uncompressed baseline (1.00–1.60) for Barnes, FFT, FFTW, LU, Ocean, Radix, and Water under five designs: 1PP, 2PP, 2PP+OOO NI, 2PP+OOO NI+CLS, and 2PP+OOO NI+CL; the annotated best-design slowdowns are 2%, 5%, 7%, 1%, 11%, 15%, and 8%.]

Memory Stall Cycles
[Chart: memory stall cycle comparison per application; details not recoverable from the extracted slide.]

Protocol Core Occupancy
Dynamic instruction count and handler occupancy

Application   w/o compression      w/ compression
Barnes        29.1 M (7.5 ns)      215.5 M (31.9 ns)
FFT           82.7 M (6.7 ns)      185.6 M (16.7 ns)
FFTW          177.8 M (10.5 ns)    417.6 M (22.7 ns)
LU            11.4 M (6.3 ns)      29.2 M (14.8 ns)
Ocean         376.6 M (6.7 ns)     1553.5 M (24.1 ns)
Radix         24.7 M (8.1 ns)      87.0 M (36.9 ns)
Water         62.4 M (5.5 ns)      137.3 M (8.8 ns)

Occupancy is still hidden under the fastest memory access (40 ns)

Related Work
Dictionary-based
– IBM MXT
– X-Match
– X-RL
– Not well-suited for cache-block grain
Frequent pattern-based
– Applied to on-chip cache blocks
Zero-aware compression
– Applied to memory blocks
See the paper for more details

Summary
Explored memory compression and decompression as coherence protocol extensions in DSM multiprocessors
The compression-enabled handlers run on simple core(s) of a multi-core node
The protocol core occupancy increases significantly, but can still be hidden under the memory access latency
On seven scientific computing workloads, our best design saves 16% to 73% of memory while slowing down execution by at most 15%

Integrating Memory Compression and Decompression with Coherence Protocols in DSM Multiprocessors
THANK YOU!
Lakshmana R Vittanala, Intel
Mainak Chaudhuri, IIT Kanpur