Integrating Memory Compression and Decompression with Coherence Protocols in DSM Multiprocessors
Lakshmana R Vittanala, Intel
Mainak Chaudhuri, IIT Kanpur

Talk in Two Slides (1/2)
Memory footprint of data-intensive workloads is ever-increasing
– We explore compression to reduce memory pressure in a medium-scale DSM multiprocessor
Dirty blocks evicted from the last level of cache are sent to the home node
– Compress in the home memory controller
A last-level cache miss request from a node is sent to the home node
– Decompress in the home memory controller

Talk in Two Slides (2/2)
No modification in the processor
– Cache hierarchy sees decompressed blocks
All changes are confined to the directory-based cache coherence protocol
– Leverage spare core(s) to execute compression-enabled protocols in software
– Extend the directory structure for compression book-keeping
Use a hybrid of two compression algorithms
– On 16 nodes running seven scientific computing workloads: 73% storage saving on average with at most 15% increase in execution time

Contributions
Two major contributions
– First attempt to look at compression/decompression as directory protocol extensions in mid-range servers
– First proposal to execute a compression-enabled directory protocol in software on spare core(s) of a multi-core die
  – Makes the solution attractive in many-core systems

Sketch
Background: Programmable Protocol Core
Directory Protocol Extensions
Compression/Decompression Algorithms
Simulation Results
Related Work and Summary

Programmable Protocol Core
Past studies have considered off-die programmable protocol processors
– They offer flexibility in the choice of coherence protocols compared to hardwired FSMs, but suffer from performance loss [Sun S3.mp, Sequent STiNG, Stanford FLASH, Piranha, …]
With on-die integration of the memory controller and the availability of a large number of on-die cores, programmable protocol cores may become an attractive design
– Recent studies show almost no performance loss [IEEE TPDS, Aug'07]

Programmable Protocol Core
In our simulated system, each node contains
– One complex out-of-order issue core which runs the application thread
– One or two simple in-order static dual-issue programmable protocol core(s) which run the directory-based cache coherence protocol in software
– On-die integrated memory controller, network interface, and router
Compression/decompression algorithms are integrated into the directory protocol software

Programmable Protocol Core
[Node diagram: an out-of-order core running the application thread (AT) and an in-order protocol core/protocol processor running the protocol thread (PT), each with private IL1 and DL1 caches, share an L2; the on-die memory controller connects to SDRAM and the router connects to the network.]

Anatomy of a Protocol Handler
On arrival of a coherence transaction at the memory controller of a node, a protocol handler is scheduled on the protocol core of that node
– Calculates the directory address if home node (simple hash function on the transaction address)
– Reads the 64-bit directory entry if home node
– Carries out simple integer arithmetic operations to figure out coherence actions
– May send messages to remote nodes
– May initiate transactions to the local OOO core

Baseline Directory Protocol
Invalidation-based three-state (MSI) bitvector protocol
– Derived from the SGI Origin MESI protocol and improved to handle early and late intervention races better
64-bit directory entry (matching the protocol core's 64-bit datapath): a 4-bit state field (four states: L, M, and two busy states), 44 unused bits, and a 16-bit sharer vector
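To make the layout concrete, here is a minimal C sketch of such a directory entry. It is not from the paper: the type, macro, and function names are ours, and the exact bit positions are assumptions (the slide only fixes the field widths, 4 + 44 + 16 = 64, and their left-to-right order). The 44 unused bits are where most of the compression meta-data described later ends up.

```c
/* Illustrative (assumed) layout of the 64-bit directory entry:
 * bits 63:60 state, bits 59:16 unused, bits 15:0 sharer vector
 * (one presence bit per node on a 16-node system). */
#include <stdint.h>

typedef uint64_t dir_entry_t;

#define DIR_STATE_SHIFT 60              /* assumed: state in the top 4 bits */
#define DIR_SHARER_MASK 0xFFFFull       /* assumed: sharers in bits 15:0    */

static inline unsigned dir_state(dir_entry_t e) {
    return (unsigned)(e >> DIR_STATE_SHIFT) & 0xFu;
}

static inline unsigned dir_sharers(dir_entry_t e) {
    return (unsigned)(e & DIR_SHARER_MASK);
}

static inline int dir_is_sharer(dir_entry_t e, unsigned node) {
    return (int)((dir_sharers(e) >> node) & 1u);   /* node in [0, 15] */
}
```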
Directory Protocol Extensions
Compression support
– All handlers that update memory blocks need extension with the compression algorithm
– Two major categories: writeback handlers and GET intervention response handlers
  – The latter involves a state demotion from M to S and hence requires an update of the memory block at home
  – GETX interventions do not require a memory update, as they involve an ownership hand-off only
Decompression support
– All handlers that access memory in response to last-level cache miss requests

Directory Protocol Extensions
Compression support (writeback cases)
[Message-flow diagram: a writeback (WB) travels from the source processor (SP) via its protocol processor (SPP) to the home protocol processor (HPP), which compresses the block into DRAM and returns WB_ACK.]

Directory Protocol Extensions
Compression support (writeback cases)
[Message-flow diagram: a writeback (WB) from the home processor (HP) goes directly to the home protocol processor (HPP), which compresses the block into DRAM.]

Directory Protocol Extensions
Compression support (intervention cases)
[Message-flow diagram: a GET from the requesting processor (RP) travels via its protocol processor (RPP) to the home protocol processor (HPP); HPP forwards the GET intervention to the dirty processor (DP), which sends a sharing writeback (SWB) to home, where it is compressed into DRAM, and a PUT reply to the requester.]

Directory Protocol Extensions
Compression support (intervention cases)
[Message-flow diagram: as above, but the dirty block is held by the home processor (HP); HP returns the uncompressed block, which HPP compresses into DRAM while the PUT reply goes to the requester (RP via RPP).]

Directory Protocol Extensions
Compression support (intervention cases)
[Message-flow diagram: the requester is the home processor (HP); its GET goes to HPP, the intervention reaches the dirty processor (DP), and the uncompressed PUT reply is compressed into DRAM at home before HP receives the PUT.]

Directory Protocol Extensions
Decompression support
[Message-flow diagram: a GET/GETX from a remote processor (RP) travels via RPP to HPP; HPP reads the compressed block from DRAM, decompresses it, and returns PUT/PUTX.]

Directory Protocol Extensions
Decompression support
[Message-flow diagram: the same flow when the requester is the home processor (HP); HPP decompresses from DRAM and returns PUT/PUTX.]

Compression Algorithms
Consider one 64-bit chunk of a 128-byte cache block at a time

Algorithm I
Original                  Compressed     Encoding
All zero                  Zero bytes     00
MS 4 bytes zero           LS 4 bytes     01
MS 4 bytes = LS 4 bytes   LS 4 bytes     10
None                      Full 64 bits   11

Algorithm II
Differs in encoding 10: LS 4 bytes zero; the compressed block stores the MS 4 bytes.
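As a concrete illustration of the encoding table, here is a hedged C sketch of per-chunk compression. Function and constant names are ours, each chunk is treated as a uint64_t value (byte-order handling elided), and the real handlers are protocol software on the protocol core rather than portable C. The block-level driver (algorithm speculation and the maxCsz early abort) is described on the next slides.

```c
/* Sketch of the per-chunk encodings in the table above. Each 64-bit
 * chunk yields a 2-bit encoding (later collected into the 32-bit
 * header) plus 0, 4, or 8 payload bytes appended at *out. */
#include <stdint.h>
#include <string.h>

enum { ALG_I, ALG_II };   /* which of the two algorithms is in use */

static unsigned compress_chunk(uint64_t chunk, int alg, uint8_t **out)
{
    uint32_t ms = (uint32_t)(chunk >> 32);  /* most-significant 4 bytes  */
    uint32_t ls = (uint32_t)chunk;          /* least-significant 4 bytes */

    if (chunk == 0)
        return 0x0;                         /* 00: all zero, no payload  */
    if (ms == 0) {
        memcpy(*out, &ls, 4); *out += 4;    /* 01: MS half zero, keep LS */
        return 0x1;
    }
    if (alg == ALG_I && ms == ls) {
        memcpy(*out, &ls, 4); *out += 4;    /* 10 (Alg I): equal halves  */
        return 0x2;
    }
    if (alg == ALG_II && ls == 0) {
        memcpy(*out, &ms, 4); *out += 4;    /* 10 (Alg II): LS half zero */
        return 0x2;
    }
    memcpy(*out, &chunk, 8); *out += 8;     /* 11: incompressible chunk  */
    return 0x3;
}
```

In the worst case a chunk costs its full 8 bytes plus 2 header bits; the running payload size is what gets compared against maxCsz in the early-abort trade-off described below.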
Compression Algorithms
Ideally we would compute the compressed size under both algorithms for each of the 16 double words in a cache block and pick the better one
– The overhead is too high
Trade-off #1
– Speculate based on the first 64 bits
– If MS 32 bits ^ LS 32 bits = 0, the two halves are equal: use Algorithm I (the condition covers two of its cases, all zero and equal halves)
– If MS 32 bits & LS 32 bits = 0, use Algorithm II (the condition is necessary for all three of its compressible cases: all zero, MS half zero, LS half zero)

Compression Algorithms
Trade-off #2
– If the compression ratio is low, it is better to avoid the decompression overhead
  – Decompression is fully on the critical path
– After compressing every 64 bits, compare the running compressed size against a threshold maxCsz (48 bytes performed best)
– Abort compression and store the entire block uncompressed as soon as the threshold is crossed

Compression Algorithms
Meta-data
– Required for decompression
– Most meta-data are stored in the 44 unused bits of the directory entry
– The cache controller generates the uncompressed block address, so the directory address computation remains unchanged
– 32 bits to locate the compressed block
  – The compressed block size is a multiple of 4 bytes, but we extend it to the next 8-byte boundary to have a cushion for future use
  – 32 bits addressing 8-byte-aligned blocks therefore allow us to address 32 GB of compressed memory

Compression Algorithms
Meta-data
– Two bits to identify the compression algorithm
  – Algorithm I, Algorithm II, uncompressed, or all zero
  – All-zero blocks do not store anything in memory
– For each 64 bits we need to know one of four encodings
  – Maintained in a 32-bit header (two bits for each of the 16 double words)
– Optimization to speed up relocation: store the size of the compressed block in the directory entry
  – Requires four bits (16 double words maximum)
– 70 bits of meta-data per compressed block

Decompression Example
Directory entry information
– 32-bit address: 0x4fd1276a
  – Actual address = 0x4fd1276a << 3
– Compression state: 01
  – Algorithm II was used
– Compressed size: 0101
  – Actual size = 5 × 8 = 40 bytes (not used in decompression)
Header information
– 32-bit header: 00 11 10 00 00 01…
  – Upper 64 bits used encoding 00 of Algorithm II
  – Next 64 bits used encoding 11 of Algorithm II
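Putting the meta-data together, here is a hedged C sketch of block decompression; names are ours. The 2-bit compression state values follow the example above (01 = Algorithm II); placing chunk 0's encoding in the header's top two bits matches the example's left-to-right reading, but is otherwise our assumption, as are the codes for "uncompressed" and "all zero". Note that the 32-bit header cannot fit in the directory entry's 44 unused bits alongside the other 38 meta-data bits, so it presumably lives with the compressed block.

```c
/* Sketch: rebuild a 128-byte block from the 2-bit compression state,
 * the 32-bit per-chunk header, and the compressed payload bytes. */
#include <stdint.h>
#include <string.h>

enum { ST_ALG_I = 0, ST_ALG_II = 1,        /* 01 = Alg II, per the example */
       ST_UNCOMPRESSED = 2, ST_ALL_ZERO = 3 };   /* assumed code points    */

static void decompress_block(unsigned state, uint32_t header,
                             const uint8_t *in, uint64_t out[16])
{
    if (state == ST_ALL_ZERO)     { memset(out, 0, 128); return; }
    if (state == ST_UNCOMPRESSED) { memcpy(out, in, 128); return; }

    for (int i = 0; i < 16; i++) {          /* 16 double words per block */
        unsigned enc = (header >> (30 - 2 * i)) & 0x3u;  /* chunk 0 first */
        uint32_t half = 0;
        switch (enc) {
        case 0x0:                           /* all zero: nothing stored   */
            out[i] = 0;
            break;
        case 0x1:                           /* MS half zero: LS stored    */
            memcpy(&half, in, 4); in += 4;
            out[i] = (uint64_t)half;
            break;
        case 0x2:                           /* algorithm-specific case    */
            memcpy(&half, in, 4); in += 4;
            if (state == ST_ALG_I)
                out[i] = ((uint64_t)half << 32) | half;  /* equal halves  */
            else
                out[i] = (uint64_t)half << 32;           /* LS half zero  */
            break;
        default:                            /* 0x3: stored verbatim       */
            memcpy(&out[i], in, 8); in += 8;
            break;
        }
    }
}
```

For the example above, the payload would be fetched starting at byte address 0x4fd1276a << 3; the first chunk decodes to zero (encoding 00) and the second is copied verbatim (encoding 11).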
Performance Optimization
Protocol thread occupancy is critical
– Two protocol cores
– Out-of-order NI scheduling to improve protocol core utilization
– Cached message buffer (filled with the writeback payload)
  – 16 uncached loads/stores to the message buffer would be needed during compression if it were not cached
  – Caching requires invalidating the buffer contents at the end of compression (a coherence issue)
  – Flushing dirty contents would occupy the datapath, so we allow only cached loads
– The compression ratio remains unaffected

Storage Saving
[Bar chart: percentage of memory storage saved for Barnes, FFT, FFTW, LU, Ocean, Radix, and Water on a 0–80% scale; the savings range from 16% to 73%, with 21% and 66% among the per-application values.]

Slowdown
[Bar chart: execution time normalized to the uncompressed baseline (1.00–1.60) for Barnes, FFT, FFTW, LU, Ocean, Radix, and Water under five designs: 1PP, 2PP, 2PP+OOO NI, 2PP+OOO NI+CLS, and 2PP+OOO NI+CL; the annotated best-design slowdowns are 2%, 5%, 7%, 1%, 11%, 15%, and 8%.]

Memory Stall Cycles
[Chart: memory stall cycle comparison per application; details not recoverable from the extracted slide.]

Protocol Core Occupancy
Dynamic instruction count and handler occupancy

Application   w/o compression      w/ compression
Barnes        29.1 M (7.5 ns)      215.5 M (31.9 ns)
FFT           82.7 M (6.7 ns)      185.6 M (16.7 ns)
FFTW          177.8 M (10.5 ns)    417.6 M (22.7 ns)
LU            11.4 M (6.3 ns)      29.2 M (14.8 ns)
Ocean         376.6 M (6.7 ns)     1553.5 M (24.1 ns)
Radix         24.7 M (8.1 ns)      87.0 M (36.9 ns)
Water         62.4 M (5.5 ns)      137.3 M (8.8 ns)

Occupancy is still hidden under the fastest memory access (40 ns)

Related Work
Dictionary-based
– IBM MXT
– X-Match
– X-RL
– Not well-suited for cache-block grain
Frequent pattern-based
– Applied to on-chip cache blocks
Zero-aware compression
– Applied to memory blocks
See the paper for more details

Summary
Explored memory compression and decompression as coherence protocol extensions in DSM multiprocessors
The compression-enabled handlers run on simple core(s) of a multi-core node
The protocol core occupancy increases significantly, but can still be hidden under the memory access latency
On seven scientific computing workloads, our best design saves 16% to 73% of memory while slowing down execution by at most 15%

Integrating Memory Compression and Decompression with Coherence Protocols in DSM Multiprocessors
THANK YOU!
Lakshmana R Vittanala, Intel
Mainak Chaudhuri, IIT Kanpur