Decoupled Compressed Cache:
Exploiting Spatial Locality for Energy-Optimized Compressed Caching
Somayeh Sardashti and David A. Wood
University of Wisconsin-Madison
Communication vs. Computation
Communication costs roughly 200X more than computation [Keckler, Micro 2011].
Improving cache utilization is critical for energy efficiency!
Compressed Cache: Compress and Compact Blocks
+ Higher effective cache size
+ Small area overhead
+ Higher system performance
+ Lower system energy
Previous work limits compression effectiveness:
- Limited number of tags
- High internal fragmentation
- Energy expensive re-compaction
Decoupled Compressed Cache (DCC)
Saving system energy by improving LLC utilization through cache compression.
Key ideas: decoupled super-blocks, non-contiguous sub-blocks.
Decoupled Compressed Cache (DCC):
• Outperforms a 2X LLC at only 1.08X LLC area
• 14% higher performance
• 12% lower energy
Outline
• Motivation
• Compressed caching
• Our proposal: Decoupled Compressed Cache
• Experimental results
• Conclusions
Uncompressed Caching
An uncompressed cache uses a fixed one-to-one tag/data mapping between the tag array and the data array.
Compressed Caching
1. Compress cache blocks.
2. Compact compressed blocks to make room.
3. Add more tags to increase effective capacity.
Compression
(1) Compression: how to compress blocks?
• There are different compression algorithms; designing one is not the focus of this work, but the choice of algorithm matters!
• Example: the compressor shrinks a 64-byte block to 20 bytes.
Compression Potential
Compression Ratio = Original Size / Compressed Size
A high compression ratio potentially yields a large normalized effective cache capacity, at the cost of extra cycles to decompress.
We use C-PACK+Z for the rest of the talk!
(Figure: compression ratio vs. cycles to decompress for several algorithms; ratios range from 1.5 to 3.9.)
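The ratio definition above is easy to sanity-check with a quick calculation; a minimal sketch (the 64-byte/20-byte sizes echo the earlier example, and nothing here is from the paper's toolchain):

```python
def compression_ratio(original_size: int, compressed_size: int) -> float:
    """Compression Ratio = Original Size / Compressed Size."""
    return original_size / compressed_size

# The 64-byte block compressed to 20 bytes from the previous slide:
print(compression_ratio(64, 20))  # → 3.2
```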
Compaction
(2) Compaction: how to store and find blocks?
• Critical to achieving the compression potential.
• This work focuses on compaction.
Fixed-Sized Compressed Cache (FixedC) [Kim, WMPI '02; Yang, Micro '02] stores each compressed block in a fixed-size slot — internal fragmentation!
Compaction
(2) Compaction: how to store and find blocks?
Variable-Sized Compressed Cache (VSC) [Alameldeen & Wood, ISCA 2004] stores each compressed block in a variable number of data sub-blocks.
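The fragmentation difference between the two compaction schemes comes down to rounding up to different allocation granularities; a sketch, where the 32-byte FixedC slot and 16-byte VSC sub-block sizes are illustrative assumptions:

```python
import math

def allocated_bytes(compressed_size: int, granularity: int) -> int:
    """Space actually consumed: round the compressed block up to the
    allocation granularity; the excess is internal fragmentation."""
    return math.ceil(compressed_size / granularity) * granularity

# A block compressed down to 10 bytes:
print(allocated_bytes(10, 32))  # FixedC-style fixed slot → 32
print(allocated_bytes(10, 16))  # VSC-style sub-blocks   → 16
```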
Previous Compressed Caches
Normalized Effective Capacity = Number of Valid Blocks in the LLC / Max Number of (Uncompressed) Blocks
Potential: 3.9. Previous designs achieve only 1.7–3.1.
(Limit 1) Limited number of tags — high area overhead of adding 4X or more tags/metadata.
(Limit 2) Internal fragmentation — low cache capacity utilization.
(Limit 3) Energy-Expensive Re-Compaction
VSC requires energy-expensive re-compaction: when an update grows a block (e.g., B now needs 2 sub-blocks), the set must be re-compacted — up to 3X higher LLC dynamic energy!
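The re-compaction cost is visible in a toy model of one set: because VSC stores each block's sub-blocks contiguously, growing one block forces every later sub-block in the set to be rewritten. The layout and sizes below are hypothetical:

```python
def recompact(subblocks, owner_sizes, idx, new_data):
    """Grow block `idx` in a set whose blocks are stored contiguously.
    Returns the new layout and how many unrelated sub-blocks had to move."""
    start = sum(owner_sizes[:idx])     # first sub-block of block idx
    end = start + owner_sizes[idx]     # one past its last sub-block
    moved = len(subblocks) - end       # every later sub-block is rewritten
    return subblocks[:start] + list(new_data) + subblocks[end:], moved

# Set holds A (1 sub-block), B (1), C (2); updating B grows it to 2:
layout, moved = recompact(["A0", "B0", "C0", "C1"], [1, 1, 2], 1, ["B0", "B1"])
print(layout, moved)  # → ['A0', 'B0', 'B1', 'C0', 'C1'] 2
```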
Decoupled Compressed Cache
(1) Exploiting spatial locality
 – Low area overhead
(2) Decoupling the tag/data mapping
 – Eliminates energy-expensive re-compaction
 – Reduces internal fragmentation
(3) Co-DCC: dynamically co-compacting super-blocks
 – Further reduces internal fragmentation
(1) Exploiting Spatial Locality
Neighboring blocks co-reside in the LLC (89%).
(1) Exploiting Spatial Locality
DCC tracks LLC blocks at super-block granularity: one super-block tag covers up to four neighboring blocks, each with its own coherence state.
Example: a Quad (Q) super tag tracks blocks A, B, C, D; a Singleton (S) tracks only E.
Up to 4X blocks with low area overhead!
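Tracking at super-block granularity only changes how the block address is split; a sketch assuming 64-byte blocks and 4-block super-blocks (the field widths are assumptions, not the paper's exact layout):

```python
BLOCK_BITS = 6        # 64-byte cache blocks (assumed)
BLOCKS_PER_SUPER = 4  # a Quad: four consecutive blocks per super tag

def super_tag_and_blk_id(addr: int):
    """Split an address into (super-block tag, block position 0-3)."""
    block_number = addr >> BLOCK_BITS
    return block_number // BLOCKS_PER_SUPER, block_number % BLOCKS_PER_SUPER

# Four consecutive 64-byte blocks (A, B, C, D) share a single super tag:
ids = [super_tag_and_blk_id(a) for a in (0x1000, 0x1040, 0x1080, 0x10C0)]
print(ids)  # → [(16, 0), (16, 1), (16, 2), (16, 3)]
```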
(2) Decoupling the Tag/Data Mapping
DCC decouples the tag/data mapping to eliminate re-compaction: when an update grows block B, its new sub-blocks can be allocated anywhere in the set (flexible allocation).
(2) Decoupling the Tag/Data Mapping
Back pointers identify the owner block of each sub-block: each back-pointer entry stores the owner's Tag ID and Blk ID.
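The decoupled mapping can be sketched with an explicit back-pointer array: each data sub-block records its owner's (Tag ID, Blk ID), so a block's sub-blocks need not be contiguous. All values here are illustrative:

```python
# One back pointer per data sub-block: (tag_id, blk_id) of the owner,
# or None for a free sub-block.
back_pointers = [(1, 2), (0, 0), (1, 2), None, (1, 0)]

def find_subblocks(tag_id: int, blk_id: int):
    """Gather the (possibly non-contiguous) sub-blocks a block owns."""
    return [i for i, bp in enumerate(back_pointers) if bp == (tag_id, blk_id)]

# Block 2 of tag entry 1 occupies sub-blocks 0 and 2 -- not adjacent:
print(find_subblocks(1, 2))  # → [0, 2]
```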
(3) Co-Compacting Super-Blocks
Co-DCC dynamically co-compacts super-blocks: the compressed blocks of a super-block (e.g., quad Q: A, B, C, D) are packed together into shared sub-blocks, further reducing internal fragmentation.
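Co-compaction packs a super-block's compressed blocks back to back instead of rounding each one up to sub-block granularity; the begin-offset bookkeeping can be sketched as follows (the block sizes and 16-byte sub-block are illustrative):

```python
def co_compact(compressed_sizes):
    """Pack a super-block's blocks contiguously; return each block's
    begin offset and the total bytes consumed."""
    begins, offset = [], 0
    for size in compressed_sizes:
        begins.append(offset)
        offset += size
    return begins, offset

# Quad A, B, C, D compressed to 20, 10, 64, 12 bytes:
begins, used = co_compact([20, 10, 64, 12])
print(begins, used)  # → [0, 20, 30, 94] 106
# Rounding each block to 16B sub-blocks would instead need 32+16+64+16 = 128 bytes.
```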
Experimental Methodology
• Integrated DCC with the AMD Bulldozer cache.
 – We model the timing and allocation constraints of sequential regions at the LLC in detail.
 – No need for an alignment network.
• Verilog implementation and synthesis of the tag-match and sub-block selection logic.
 – One additional cycle of latency due to sub-block selection.
Experimental Methodology
• Full-system simulation with a simulator based on GEMS.
• Wide range of applications with different levels of cache sensitivity:
 – Commercial workloads: apache, jbb, oltp, zeus
 – SPEC-OMP: ammp, applu, equake, mgrid, wupwise
 – PARSEC: blackscholes, canneal, freqmine
 – SPEC 2006 mixes (m1–m8): bzip2; libquantum-bzip2; libquantum; gcc; astar-bwaves; cactus-mcf-milc-bwaves; gcc-omnetpp-mcf-bwaves-lbm-milc-cactus-bzip; omnetpp-lbm

Cores:        eight OOO cores, 3.2 GHz
L1I$/L1D$:    private, 32-KB, 8-way
L2$:          private, 256-KB, 8-way
L3$:          shared, 8-MB, 16-way, 8 banks
Main memory:  4GB, 16 banks, DDR3 with an 800 MHz bus
Effective LLC Capacity vs. Area
(Figure: normalized LLC area vs. normalized effective LLC capacity for the Baseline, FixedC, VSC, DCC, Co-DCC, and a 2X Baseline.)

Area overhead components:

Component             FixedC/VSC-2X   DCC     Co-DCC
Tag Array             6.3%            2.1%    11.3%
Back Pointer Array    0               4.4%    5.4%
(De-)Compressors      1.8%            1.8%    1.8%
Total Area Overhead   8.1%            8.3%    18.5%
(Co-)DCC Performance
(Figure: values normalized to the baseline across the evaluated configurations: 0.96, 0.95, 0.93, 0.90, 0.86.)
(Co-)DCC boosts system performance significantly.
(Co-)DCC Energy Consumption
(Figure: normalized system energy across the evaluated configurations: 0.96, 0.97, 0.93, 0.91, 0.88.)
(Co-)DCC reduces system energy by reducing the number of accesses to main memory.
Summary
• Analyzed the limits of compressed caching:
 – Limited number of tags
 – Internal fragmentation
 – Energy-expensive re-compaction
• Decoupled Compressed Cache improves the performance and energy of compressed caching via:
 – Decoupled super-blocks
 – Non-contiguous sub-blocks
• Co-DCC further reduces internal fragmentation.
• Practical designs [details in the paper].
Backup
• (De-)Compression overhead
• DCC data array organization with AMD Bulldozer
• DCC timing
• DCC lookup
• Applications
• Co-DCC design
• LLC effective capacity
• LLC miss rate
• Memory dynamic energy
• LLC dynamic energy
(De-)Compression Overhead

Parameter                Compressor   Decompressor
Pipeline Depth           6            2
Latency (cycles)         16           9
Area (mm²)               0.016        0.016
Power Consumption (mW)   25.84        19.01
DCC Data Array Organization
(Figure: the AMD Bulldozer data array is split into four sequential regions, SR0–SR3, with A- and B-phase flops; per-region set addresses let DCC read non-contiguous sub-blocks in a single access. In the example, A0 is uncompressed while B1 and C3 are each compressed to 2 sub-blocks.)
DCC Timing
DCC Lookup
1. Access the super tags and back pointers in parallel.
2. Find the matching back pointers.
3. Read the corresponding sub-blocks and decompress.
(Figure: reading block C of quad Q by matching its back pointers.)
Applications
Workloads fall into four sensitivity classes: sensitive to cache capacity and latency, sensitive to cache latency, cache insensitive, and sensitive to cache capacity.
SPEC 2006 mixes (m1–m8): bzip2; libquantum-bzip2; libquantum; gcc; astar-bwaves; cactus-mcf-milc-bwaves; gcc-omnetpp-mcf-bwaves-lbm-milc-cactus-bzip; omnetpp-lbm
Co-DCC Design
(Figure: a super-block's compressed blocks, e.g., A0, A1, A2, are packed contiguously across shared sub-blocks 0–7. The super-block tag entry holds, per block, a Cstate, a Comp bit, and a 7-bit Begin offset, plus the sharers and an END offset marking the end of the packed data.)
LLC Effective Cache Capacity
(Figure: normalized LLC effective capacity per workload and configuration.)
LLC Miss Rate
(Figure: normalized LLC miss rate per workload and configuration.)
Memory Dynamic Energy
(Figure: normalized memory dynamic energy per workload and configuration.)
LLC Dynamic Energy
(Figure: normalized LLC dynamic energy per workload and configuration.)