Integrating Memory
Compression and Decompression
with Coherence Protocols in DSM
Multiprocessors
Lakshmana R Vittanala
Intel
Mainak Chaudhuri
IIT Kanpur
Talk in Two Slides (1/2)

Memory footprint of data-intensive workloads is ever-increasing
– We explore compression to reduce memory pressure in a medium-scale DSM multiprocessor

Dirty blocks evicted from the last-level cache are sent to the home node
– Compress in the home memory controller

A last-level cache miss request from a node is sent to the home node
– Decompress in the home memory controller
Talk in Two Slides (2/2)

No modification to the processor
– Cache hierarchy sees decompressed blocks

All changes are confined to the directory-based cache coherence protocol
– Leverage spare core(s) to execute
compression-enabled protocols in software
– Extend directory structure for compression
book-keeping

Use a hybrid of two compression algorithms
– On 16 nodes running seven scientific computing workloads, 73% storage saving on average with at most 15% increase in execution time
Contributions

Two major contributions
– First attempt to look at
compression/decompression as directory
protocol extensions in mid-range servers
– First proposal to execute a compression-enabled directory protocol in software on spare core(s) of a multi-core die

Makes the solution attractive in many-core
systems
Sketch

Background: Programmable Protocol Core
Directory Protocol Extensions
Compression/Decompression Algorithms
Simulation Results
Related Work and Summary
Programmable Protocol Core

Past studies have considered off-die programmable protocol processors
– They offer flexibility in the choice of coherence protocols compared to hardwired FSMs, but suffer from performance loss [Sun S3.mp, Sequent STiNG, Stanford FLASH, Piranha, …]

With on-die integration of the memory controller and the availability of a large number of on-die cores, programmable protocol cores may become an attractive design
– Recent studies show almost no performance loss [IEEE TPDS, Aug'07]
Programmable Protocol Core

In our simulated system, each node
contains
– One complex out-of-order issue core which
runs the application thread
– One or two simple in-order static dual-issue programmable protocol core(s) which run the directory-based cache coherence protocol in software
– On-die integrated memory controller,
network interface, and router

Compression/decompression algorithms
are integrated into the directory protocol
software
Programmable Protocol Core

(Node block diagram: an OOO core running the application thread (AT) with its IL1/DL1 and L2, an in-order protocol core/protocol processor running the protocol thread (PT) with its own IL1/DL1, an on-die memory controller attached to SDRAM, and a router attached to the network.)
Anatomy of a Protocol Handler

On arrival of a coherence transaction at
the memory controller of a node, a
protocol handler is scheduled on the
protocol core of that node
– Calculates the directory address if home
node (simple hash function on transaction
address)
– Reads 64-bit directory entry if home node
– Carries out simple integer arithmetic
operations to figure out coherence actions
– May send messages to remote nodes
– May initiate transactions to local OOO core
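To make the handler anatomy concrete, here is a minimal C sketch of a home-node GET handler following the steps above. Every name (txn_t, handle_get, the directory bit layout, the message types) is an illustrative assumption, not the actual protocol software.

#include <stdint.h>
#include <stdio.h>

#define BLOCK_SHIFT 7                 /* 128-byte cache blocks            */
#define STATE_MASK  0xFULL            /* 4 state bits (position assumed)  */
#define STATE_M     0x2ULL            /* assumed encoding of state M      */

typedef struct { uint64_t addr; int src; } txn_t;

static uint64_t directory[1 << 20];   /* toy directory, one entry/block   */

static void send_msg(int node, const char *type, uint64_t addr)
{
    printf("msg %-16s -> node %d, block 0x%llx\n",
           type, node, (unsigned long long)addr);
}

/* Scheduled on the protocol core when a GET arrives at the home node's
 * memory controller. */
void handle_get(const txn_t *t)
{
    /* Simple hash on the transaction address yields the directory entry. */
    uint64_t idx   = (t->addr >> BLOCK_SHIFT) & ((1 << 20) - 1);
    uint64_t entry = directory[idx];          /* 64-bit directory entry   */

    if ((entry & STATE_MASK) == STATE_M) {
        /* Dirty at a remote owner (one-hot owner bit assumed in the
         * sharer field): forward a GET intervention. */
        int owner = 0;
        uint64_t vec = (entry >> 48) & 0xFFFF;
        while (!(vec & 1) && owner < 15) { vec >>= 1; owner++; }
        send_msg(owner, "GET-intervention", t->addr);
    } else {
        /* Clean at home: add the requester to the 16-bit sharer vector
         * (assumed in bits 48..63) and reply with the data. */
        directory[idx] = entry | (1ULL << (48 + t->src));
        send_msg(t->src, "PUT", t->addr);
    }
}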
Baseline Directory Protocol

Invalidation-based three-state (MSI)
bitvector protocol
– Derived from SGI Origin MESI protocol and
improved to handle early and late
intervention races better
64-bit directory entry (matching the 64-bit datapath):
– 4 bits: states (L, M, and two busy states)
– 44 bits: unused
– 16 bits: sharer vector
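In C, the entry can be viewed as a bitfield; only the 4/44/16 widths come from the slide, the field order is an assumption:

#include <stdint.h>

/* Assumed low-to-high packing of the 64-bit directory entry. */
typedef struct {
    uint64_t state   : 4;   /* L, M, and two busy states               */
    uint64_t unused  : 44;  /* later reused for compression meta-data  */
    uint64_t sharers : 16;  /* one bit per node (16-node system)       */
} dir_entry_t;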
Sketch

Background: Programmable Protocol Core
Directory Protocol Extensions
Compression/Decompression Algorithms
Simulation Results
Related Work and Summary
Directory Protocol Extensions

Compression support
– All handlers that update memory blocks need to be extended with the compression algorithm
– Two major categories: writeback handlers and GET intervention response handlers
  – The latter involves a state demotion from M to S and hence requires an update of the memory block at home
  – GETX interventions do not require a memory update as they involve an ownership hand-off only

Decompression support
– All handlers that access memory in response to last-level cache miss requests
Directory Protocol Extensions

Compression support (writeback cases)
(Diagram: remote writeback. The source processor (SP) sends a WB to the source protocol processor (SPP), which forwards the WB to the home protocol processor (HPP); the HPP compresses the block into DRAM and returns a WB_ACK.)
Directory Protocol Extensions

Compression support (writeback cases)
(Diagram: local writeback. The home processor (HP) sends a WB to the home protocol processor (HPP), which compresses the block into DRAM.)
Directory Protocol Extensions

Compression support (intervention cases)
(Diagram: requester, home, and dirty node all distinct. The requesting processor (RP) sends a GET through the requesting protocol processor (RPP) to the home protocol processor (HPP); the HPP forwards the GET intervention to the dirty processor (DP), which sends a sharing writeback (SWB) to the HPP, where it is compressed into DRAM, while a PUT carries the data back to the RP.)
Directory Protocol Extensions

Compression support (intervention cases)
(Diagram: dirty owner is the home processor (HP). The GET travels RP to RPP to HPP; the HPP intervenes at the HP, which returns the block uncompressed; the HPP compresses it into DRAM and a PUT returns to the requester.)
Directory Protocol Extensions

Compression support (intervention cases)
(Diagram: requester at home. The home processor's (HP) GET goes to the HPP, which sends the GET intervention to the dirty processor (DP); the DP returns the block uncompressed as a PUT, which the HPP compresses into DRAM and delivers to the HP.)
Directory Protocol Extensions

Decompression support
(Diagram: remote miss. The RP's GET/GETX travels through the RPP to the HPP; the HPP reads the compressed block from DRAM, decompresses it, and returns a PUT/PUTX along the reverse path.)
Directory Protocol Extensions

Decompression support
(Diagram: local miss. The HP's GET/GETX goes to the HPP, which reads the compressed block from DRAM, decompresses it, and returns a PUT/PUTX.)
Sketch

Background: Programmable Protocol Core
Directory Protocol Extensions
Compression/Decompression Algorithms
Simulation Results
Related Work and Summary
Compression Algorithms

Consider each 64-bit chunk of a 128-byte cache block at a time

Algorithm I:
Original                   Compressed    Encoding
All zero                   Zero bytes    00
MS 4 bytes zero            LS 4 bytes    01
MS 4 bytes = LS 4 bytes    LS 4 bytes    10
None of the above          64 bits       11

Algorithm II differs in encoding 10: LS 4 bytes zero; the compressed block stores the MS 4 bytes.
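As a concrete illustration, here is a hedged C sketch of the per-chunk encoders implied by the table; the function names and the append-to-buffer convention are assumptions:

#include <stdint.h>
#include <string.h>

/* Compress one 64-bit chunk under Algorithm I; returns the 2-bit
 * encoding and appends the compressed bytes at out[*pos]. */
static unsigned encode_chunk_alg1(uint64_t chunk, uint8_t *out, int *pos)
{
    uint32_t ms = (uint32_t)(chunk >> 32);    /* most significant 4 bytes  */
    uint32_t ls = (uint32_t)chunk;            /* least significant 4 bytes */

    if (chunk == 0)                           /* 00: all zero, store nothing */
        return 0x0;
    if (ms == 0) {                            /* 01: MS 4 bytes zero         */
        memcpy(out + *pos, &ls, 4); *pos += 4;
        return 0x1;
    }
    if (ms == ls) {                           /* 10: halves identical        */
        memcpy(out + *pos, &ls, 4); *pos += 4;
        return 0x2;
    }
    memcpy(out + *pos, &chunk, 8); *pos += 8; /* 11: stored verbatim         */
    return 0x3;
}

/* Algorithm II reinterprets encoding 10: LS 4 bytes zero, MS stored. */
static unsigned encode_chunk_alg2(uint64_t chunk, uint8_t *out, int *pos)
{
    uint32_t ms = (uint32_t)(chunk >> 32);
    uint32_t ls = (uint32_t)chunk;

    if (chunk == 0)
        return 0x0;
    if (ms == 0) {
        memcpy(out + *pos, &ls, 4); *pos += 4;
        return 0x1;
    }
    if (ls == 0) {                            /* 10: LS zero, keep MS bytes  */
        memcpy(out + *pos, &ms, 4); *pos += 4;
        return 0x2;
    }
    memcpy(out + *pos, &chunk, 8); *pos += 8;
    return 0x3;
}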
Compression Algorithms

Ideally we would compute the compressed size under both algorithms for each of the 16 double-words in a cache block and pick the better one
– The overhead is too high

Trade-off #1
– Speculate based on the first 64 bits
– If MS 32 bits ^ LS 32 bits == 0, use Algorithm I (covers two cases of Algorithm I)
– If MS 32 bits & LS 32 bits == 0, use Algorithm II (covers three cases of Algorithm II)
– (A combined sketch of both trade-offs follows Trade-off #2)
Compression Algorithms

Trade-off #2
– If the compression ratio is low, it is better to avoid the decompression overhead
  – Decompression is fully on the critical path
– After compressing every 64 bits, compare the running compressed size against the threshold maxCsz (best: 48 bytes)
– Abort compression and store the entire block uncompressed as soon as the threshold is crossed
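Putting the two trade-offs together, a sketch of whole-block compression that reuses the hypothetical encode_chunk_* helpers from the previous sketch; the fallback when neither speculation test fires is an assumption:

#define CHUNKS  16      /* 128-byte block = 16 double-words          */
#define MAX_CSZ 48      /* abort threshold in bytes (best value)     */

/* Returns the compressed size in bytes, or -1 if the block must be
 * stored uncompressed. 'out' must hold at least 56 bytes (threshold
 * plus one more 8-byte chunk); 'header' receives two encoding bits
 * per 64-bit chunk. */
static int compress_block(const uint64_t blk[CHUNKS],
                          uint8_t *out, uint32_t *header)
{
    /* Trade-off #1: speculate on the first 64 bits.                 */
    uint32_t ms = (uint32_t)(blk[0] >> 32), ls = (uint32_t)blk[0];
    int use_alg1 = ((ms ^ ls) == 0);    /* covers two Alg-I cases    */
    /* (ms & ls) == 0 favors Algorithm II; if neither test fires we
     * also default to Algorithm II here (an assumption).            */

    int pos = 0;
    *header = 0;
    for (int i = 0; i < CHUNKS; i++) {
        unsigned enc = use_alg1
            ? encode_chunk_alg1(blk[i], out, &pos)
            : encode_chunk_alg2(blk[i], out, &pos);
        *header |= enc << (2 * i);

        /* Trade-off #2: abort once the running size crosses maxCsz. */
        if (pos > MAX_CSZ)
            return -1;   /* store the whole block uncompressed       */
    }
    return pos;
}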
Compression Algorithms

Meta-data
– Required for decompression
– Most meta-data are stored in the unused 44 bits of the directory entry
– The cache controller generates the uncompressed block address, so the directory address computation remains unchanged
– 32 bits to locate the compressed block
  – The compressed block size is a multiple of 4 bytes, but we extend it to the next 8-byte boundary to leave a cushion for future use
  – 32 bits allow us to address 32 GB of compressed memory
Compression Algorithms

Meta-data
– Two bits to identify the compression algorithm
  – Algorithm I, Algorithm II, uncompressed, or all zero
  – All-zero blocks do not store anything in memory
– For each 64 bits we need to know one of four encodings
  – Maintained in a 32-bit header (two bits for each of the 16 double-words)
– Optimization to speed up relocation: store the size of the compressed block in the directory entry
  – Requires four bits (16 double-words maximum)
– 70 bits of meta-data per compressed block
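The 70 bits break down as 32 (location) + 2 (algorithm) + 4 (size) + 32 (per-chunk header). A C sketch of one possible packing, with 38 bits in the directory entry and the 32-bit header kept with the compressed data; the bit placement is an assumption, only the field widths come from the slides:

#include <stdint.h>

typedef struct {
    uint64_t state    : 4;   /* coherence state                          */
    uint64_t comp_loc : 32;  /* compressed block address, in 8 B units   */
    uint64_t comp_alg : 2;   /* Alg I / Alg II / uncompressed / all zero */
    uint64_t comp_sz  : 4;   /* size in double-words, for relocation     */
    uint64_t unused   : 6;
    uint64_t sharers  : 16;  /* sharer bitvector                         */
} dir_entry_comp_t;
/* The remaining 32 bits (2 encoding bits x 16 chunks) form the header
 * stored alongside the compressed block in memory. */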
Decompression Example

Directory entry information
– 32-bit address: 0x4fd1276a
  – Actual address = 0x4fd1276a << 3
– Compression state: 01
  – Algorithm II was used
– Compressed size: 0101
  – Actual size = 40 bytes (not used in decompression)

Header information
– 32-bit header: 00 11 10 00 00 01…
  – Upper 64 bits used encoding 00 of Algorithm II
  – Next 64 bits used encoding 11 of Algorithm II
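A hedged C sketch of the matching decompression loop for Algorithm II; the chunk-to-header-bit ordering and the helper's name are assumptions:

#include <stdint.h>
#include <string.h>

/* Rebuild a 128-byte block compressed with Algorithm II; 'header'
 * holds two encoding bits per 64-bit chunk. */
static void decompress_alg2(const uint8_t *in, uint32_t header,
                            uint64_t blk[16])
{
    int pos = 0;
    for (int i = 0; i < 16; i++) {
        switch ((header >> (2 * i)) & 0x3) {
        case 0x0:                        /* 00: all-zero chunk       */
            blk[i] = 0;
            break;
        case 0x1: {                      /* 01: MS zero, LS stored   */
            uint32_t ls; memcpy(&ls, in + pos, 4); pos += 4;
            blk[i] = (uint64_t)ls;
            break;
        }
        case 0x2: {                      /* 10: LS zero, MS stored   */
            uint32_t ms; memcpy(&ms, in + pos, 4); pos += 4;
            blk[i] = (uint64_t)ms << 32;
            break;
        }
        default:                         /* 11: stored verbatim      */
            memcpy(&blk[i], in + pos, 8); pos += 8;
            break;
        }
    }
}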
Performance Optimization

Protocol thread occupancy is critical
– Two protocol cores
– Out-of-order NI scheduling to improve protocol core utilization
– Cached message buffer (filled with the writeback payload)
  – 16 uncached loads/stores to the message buffer are needed during compression if it is not cached
  – Caching requires invalidating the buffer contents at the end of compression (coherence issue)
  – Flushing dirty contents would occupy the datapath, so we allow only cached loads
– Compression ratio remains unaffected
Sketch

Background: Programmable Protocol Core
Directory Protocol Extensions
Compression/Decompression Algorithms
Simulation Results
Related Work and Summary
Storage Saving

(Bar chart: percentage of memory storage saved per application for Barnes, FFT, FFTW, LU, Ocean, Radix, and Water, on a 0% to 80% scale; visible data labels include 73%, 66%, 21%, and 16%.)
Slowdown

(Bar chart: normalized execution time, 1.00 to 1.60, for Barnes, FFT, FFTW, LU, Ocean, Radix, and Water under five configurations: 1PP, 2PP, 2PP+OOO NI, 2PP+OOO NI+CLS, and 2PP+OOO NI+CL; visible slowdown labels include 1%, 2%, 5%, 7%, 8%, 11%, and 15%.)
Memory Stall Cycles

(Chart: memory stall cycles per application.)
Protocol Core Occupancy

Dynamic instruction count and handler occupancy:

          w/o compression      w/ compression
Barnes    29.1 M (7.5 ns)      215.5 M (31.9 ns)
FFT       82.7 M (6.7 ns)      185.6 M (16.7 ns)
FFTW      177.8 M (10.5 ns)    417.6 M (22.7 ns)
LU        11.4 M (6.3 ns)      29.2 M (14.8 ns)
Ocean     376.6 M (6.7 ns)     1553.5 M (24.1 ns)
Radix     24.7 M (8.1 ns)      87.0 M (36.9 ns)
Water     62.4 M (5.5 ns)      137.3 M (8.8 ns)

Occupancy is still hidden under the fastest memory access (40 ns)
Sketch

Background: Programmable Protocol Core
Directory Protocol Extensions
Compression/Decompression Algorithms
Simulation Results
Related Work and Summary
Related Work

Dictionary-based
– IBM MXT
– X-Match
– X-RL
– Not well-suited for cache-block grain

Frequent pattern-based
– Applied to on-chip cache blocks

Zero-aware compression
– Applied to memory blocks

See the paper for more details
Summary

Explored memory compression and decompression as coherence protocol extensions in DSM multiprocessors

The compression-enabled handlers run on simple core(s) of a multi-core node

The protocol core occupancy increases significantly, but can still be hidden under the memory access latency

On seven scientific computing workloads, our best design saves 16% to 73% of memory while slowing down execution by at most 15%
Integrating Memory
Compression and Decompression
with Coherence Protocols in DSM
Multiprocessors
THANK YOU!
Lakshmana R Vittanala
Intel
Mainak Chaudhuri
IIT Kanpur