18-742
Parallel Computer Architecture
Lecture 11: Caching in Multi-Core Systems
Prof. Onur Mutlu and Gennady Pekhimenko
Carnegie Mellon University
Fall 2012, 10/01/2012
Review: Multi-core Issues in Caching
• How does the cache hierarchy change in a multi-core
system?
• Private cache: Cache belongs to one core
• Shared cache: Cache is shared by multiple cores
[Figure: two L2 organizations for four cores: private L2s (Core 0 to Core 3, each with its own L2 cache) vs. a shared L2 (Core 0 to Core 3 sharing one L2 cache), both connected to the DRAM memory controller]
Outline
• Multi-cores and Caching: Review
• Utility-based partitioning
• Cache compression
– Frequent value
– Frequent pattern
– Base-Delta-Immediate
• Main memory compression
– IBM MXT
– Linearly Compressed Pages (LCP)
Review: Shared Caches Between Cores
• Advantages:
– Dynamic partitioning of available cache space
• No fragmentation due to static partitioning
– Easier to maintain coherence
– Shared data and locks do not ping pong between caches
• Disadvantages
– Cores incur conflict misses due to other cores’ accesses
• Misses due to inter-core interference
• Some cores can destroy the hit rate of other cores
– What kind of access patterns could cause this?
– Guaranteeing a minimum level of service (or fairness) to each
core is harder (how much space, how much bandwidth?)
– High bandwidth is harder to obtain (N cores → N ports?)
Shared Caches: How to Share?
• Free-for-all sharing
– Placement/replacement policies are the same as a single core
system (usually LRU or pseudo-LRU)
– Not thread/application aware
– An incoming block evicts a block regardless of which threads
the blocks belong to
• Problems
– A cache-unfriendly application can destroy the performance
of a cache-friendly application
– Not all applications benefit equally from the same amount of
cache: free-for-all might prioritize those that do not benefit
– Reduced performance, reduced fairness
Problem with Shared Caches
[Figure: Processor Cores 1 and 2, each with a private L1 cache, share one L2 cache; thread t1 first runs alone on Core 1, then thread t2 starts on Core 2, and the two threads compete for the shared L2 space]
t2’s throughput is significantly reduced due to unfair cache sharing.
Controlled Cache Sharing
• Utility based cache partitioning
– Qureshi and Patt, “Utility-Based Cache Partitioning: A Low-Overhead, High-Performance, Runtime Mechanism to Partition Shared Caches,” MICRO 2006.
– Suh et al., “A New Memory Monitoring Scheme for Memory-Aware Scheduling
and Partitioning,” HPCA 2002.
• Fair cache partitioning
– Kim et al., “Fair Cache Sharing and Partitioning in a Chip Multiprocessor
Architecture,” PACT 2004.
• Shared/private mixed cache mechanisms
– Qureshi, “Adaptive Spill-Receive for Robust High-Performance Caching in CMPs,”
HPCA 2009.
Utility Based Shared Cache Partitioning
• Goal: Maximize system throughput
• Observation: Not all threads/applications benefit equally from caching
→ simple LRU replacement is not good for system throughput
• Idea: Allocate more cache space to applications that obtain the most
benefit from more space
• The high-level idea can be applied to other shared resources as well.
• Qureshi and Patt, “Utility-Based Cache Partitioning: A Low-Overhead,
High-Performance, Runtime Mechanism to Partition Shared Caches,”
MICRO 2006.
• Suh et al., “A New Memory Monitoring Scheme for Memory-Aware
Scheduling and Partitioning,” HPCA 2002.
Utility Based Cache Partitioning (I)
Utility U_a^b = Misses with a ways – Misses with b ways
[Figure: misses per 1000 instructions vs. number of ways allocated from a 16-way 1MB L2, for applications with low utility, high utility, and saturating utility]
Utility Based Cache Partitioning (II)
[Figure: MPKI vs. allocated cache ways for equake and vpr under LRU and under utility-based (UTIL) partitioning]
Idea: Give more cache to the application that benefits more from cache
Utility Based Cache Partitioning (III)
[Figure: two cores, each with its own I$ and D$, share an L2 cache backed by main memory; a utility monitor per core (UMON1, UMON2) feeds the partitioning algorithm (PA)]
Three components:
– Utility monitors (UMON), one per core
– Partitioning algorithm (PA) (see the allocation sketch below)
– Replacement support to enforce partitions
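To make the partitioning step concrete, here is a minimal C sketch of a greedy way-allocation loop driven by per-core utility curves. It is only an illustration of the idea (the names, core/way counts, and greedy policy are assumptions for this sketch), not the exact algorithm from the MICRO 2006 paper.

```c
/* Greedy way allocation driven by UMON-style utility curves.
 * misses[c][w] = misses core c would incur with w ways, as estimated by
 * its utility monitor; utility U_a^b = misses(a) - misses(b).           */
#define NUM_CORES 2
#define NUM_WAYS  16

void partition(const unsigned misses[NUM_CORES][NUM_WAYS + 1],
               unsigned alloc[NUM_CORES])
{
    for (int c = 0; c < NUM_CORES; c++)
        alloc[c] = 1;                         /* each core gets at least one way */

    for (int given = NUM_CORES; given < NUM_WAYS; given++) {
        int best = 0;
        unsigned best_gain = 0;
        for (int c = 0; c < NUM_CORES; c++) {
            unsigned w = alloc[c];
            /* utility of one more way: U_w^{w+1} = misses(w) - misses(w+1) */
            unsigned gain = misses[c][w] > misses[c][w + 1]
                          ? misses[c][w] - misses[c][w + 1] : 0;
            if (gain > best_gain) { best_gain = gain; best = c; }
        }
        alloc[best]++;                        /* next way goes to the biggest winner */
    }
}
```

Calling partition() with each core's miss curve yields a per-core way allocation that the replacement support can then enforce.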
Cache Capacity
• How to get more cache without making it physically larger?
• Idea: Data compression for on-chip caches
Base-Delta-Immediate Compression: Practical Data Compression for On-Chip Caches
Gennady Pekhimenko, Vivek Seshadri, Onur Mutlu, Todd C. Mowry, Phillip B. Gibbons*, Michael A. Kozuch*
Executive Summary
• Off-chip memory latency is high
– Large caches can help, but at significant cost
• Compressing data in cache enables larger cache at low
cost
• Problem: Decompression is on the execution critical path
• Goal: Design a new compression scheme that has
1. low decompression latency, 2. low cost, 3. high compression ratio
• Observation: Many cache lines have low dynamic range
data
• Key Idea: Encode cachelines as a base + multiple differences
• Solution: Base-Delta-Immediate compression with low
decompression latency and high compression ratio
– Outperforms three state-of-the-art compression mechanisms
Motivation for Cache Compression
Significant redundancy in data:
0x00000000 0x0000000B 0x00000003 0x00000004 …
How can we exploit this redundancy?
– Cache compression helps
– Provides effect of a larger cache without
making it physically larger
Background on Cache Compression
[Figure: the CPU and L1 cache hold uncompressed data; the L2 cache holds compressed lines, so an L2 hit passes through decompression before the line is returned, uncompressed, to the L1/CPU]
• Key requirements:
– Fast (low decompression latency)
– Simple (avoid complex hardware changes)
– Effective (good compression ratio)
Zero Value Compression
• Advantages:
– Low decompression latency
– Low complexity
• Disadvantages:
– Low average compression ratio
Shortcomings of Prior Work
Compression Mechanism | Decompression Latency | Complexity | Compression Ratio
Zero                  | ✓                     | ✓          | ✗
Frequent Value Compression
• Idea: encode cache lines based on frequently
occurring values
• Advantages:
– Good compression ratio
• Disadvantages:
– Needs profiling
– High decompression latency
– High complexity
Shortcomings of Prior Work
Compression Mechanism | Decompression Latency | Complexity | Compression Ratio
Zero                  | ✓                     | ✓          | ✗
Frequent Value        | ✗                     | ✗          | ✓
Frequent Pattern Compression
• Idea: encode cache lines based on frequently
occurring patterns, e.g., half word is zero
• Advantages:
– Good compression ratio
• Disadvantages:
– High decompression latency (5-8 cycles)
– High complexity (for some designs)
Shortcomings of Prior Work
Compression Mechanism | Decompression Latency | Complexity | Compression Ratio
Zero                  | ✓                     | ✓          | ✗
Frequent Value        | ✗                     | ✗          | ✓
Frequent Pattern      | ✗                     | ✓/✗        | ✓
Shortcomings of Prior Work
Compression Mechanism    | Decompression Latency | Complexity | Compression Ratio
Zero                     | ✓                     | ✓          | ✗
Frequent Value           | ✗                     | ✗          | ✓
Frequent Pattern         | ✗                     | ✓/✗        | ✓
Our proposal: BΔI        | ✓                     | ✓          | ✓
Outline
• Motivation & Background
• Key Idea & Our Mechanism
• Evaluation
• Conclusion
Key Data Patterns in Real Applications
Zero Values: initialization, sparse matrices, NULL pointers
0x00000000 0x00000000 0x00000000 0x00000000 …
Repeated Values: common initial values, adjacent pixels
0x000000FF 0x000000FF 0x000000FF 0x000000FF …
Narrow Values: small values stored in a big data type
0x00000000 0x0000000B 0x00000003 0x00000004 …
Other Patterns: pointers to the same memory region
0xC04039C0 0xC04039C8 0xC04039D0 0xC04039D8 …
(A classification sketch follows below.)
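As a rough illustration of how lines could be bucketed into these classes, here is a small C sketch; the pattern definitions (e.g., "narrow" meaning every word fits in one byte) are assumptions for this sketch, not necessarily the exact criteria used for the measurement on the next slide.

```c
#include <stdbool.h>
#include <stdint.h>

enum pattern { PAT_ZERO, PAT_REPEATED, PAT_NARROW, PAT_OTHER };

/* Bucket a cache line of n 32-bit words into the pattern classes above. */
enum pattern classify_line(const uint32_t *w, int n)
{
    bool all_zero = true, all_same = true, all_narrow = true;
    for (int i = 0; i < n; i++) {
        if (w[i] != 0)    all_zero = false;    /* not an all-zero line        */
        if (w[i] != w[0]) all_same = false;    /* not a repeated-value line   */
        if (w[i] > 0xFF)  all_narrow = false;  /* some word needs > 1 byte    */
    }
    if (all_zero)   return PAT_ZERO;
    if (all_same)   return PAT_REPEATED;
    if (all_narrow) return PAT_NARROW;
    return PAT_OTHER;
}
```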
How Common Are These Patterns?
SPEC2006, databases, web workloads, 2MB L2 cache
[Figure: per-benchmark cache coverage (%) of zero values, repeated values, and other patterns; “Other Patterns” include Narrow Values]
43% of the cache lines belong to key patterns, on average
Key Data Patterns in Real Applications
Low Dynamic Range: differences between values are significantly smaller than the values themselves.
Key Idea: Base+Delta (B+Δ) Encoding
[Figure: a 32-byte uncompressed cache line holding eight 4-byte values (0xC04039C0, 0xC04039C8, 0xC04039D0, …, 0xC04039F8) is encoded as a 4-byte base (0xC04039C0) plus eight 1-byte deltas (0x00, 0x08, 0x10, …, 0x38), giving a 12-byte compressed cache line and saving 20 bytes]
✓ Fast decompression: vector addition
✓ Simple hardware: arithmetic and comparison
✓ Effective: good compression ratio
(A compression sketch follows below.)
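A minimal C sketch of the B+Δ encoding shown above, assuming eight 4-byte values per line, one 4-byte base, and signed 1-byte deltas; the function and variable names are illustrative.

```c
#include <stdbool.h>
#include <stdint.h>

/* Compress a 32-byte line (eight 4-byte values) with one base + 1-byte deltas.
 * Returns true when the line is compressible this way: 4 + 8 = 12 bytes.     */
bool bdelta_compress_4_1(const uint32_t line[8], uint32_t *base, int8_t delta[8])
{
    *base = line[0];                          /* base = first element        */
    for (int i = 0; i < 8; i++) {
        int64_t d = (int64_t)line[i] - (int64_t)*base;
        if (d < INT8_MIN || d > INT8_MAX)
            return false;                     /* some value too far from base */
        delta[i] = (int8_t)d;
    }
    return true;
}
```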
B+Δ: Compression Ratio
SPEC2006, databases, web workloads, 2MB L2 cache
Good average compression ratio (1.40)
But some benchmarks have low compression ratio
Can We Do Better?
• Uncompressible cache line (with a single base):
0x00000000 0x09A40178 0x0000000B 0x09A4A838
…
• Key idea:
Use more bases, e.g., two instead of one
• Pro:
– More cache lines can be compressed
• Cons:
– Unclear how to find these bases efficiently
– Higher overhead (due to additional bases)
B+Δ with Multiple Arbitrary Bases
[Figure: GeoMean compression ratio of B+Δ with 1, 2, 3, 4, 8, 10, and 16 arbitrary bases]
→ 2 bases is the best option based on evaluations
How to Find Two Bases Efficiently?
1. First base: the first element in the cache line
→ the Base+Delta part
2. Second base: an implicit base of 0
→ the Immediate part
Advantages over 2 arbitrary bases:
– Better compression ratio
– Simpler compression logic
→ Base-Delta-Immediate (BΔI) Compression
B+Δ (with two arbitrary bases) vs. BΔI
Average compression ratio is close, but BΔI is simpler
BΔI Implementation
• Decompressor Design
– Low latency
• Compressor Design
– Low cost and complexity
• BΔI Cache Organization
– Modest complexity
BΔI Decompressor Design
[Figure: the compressed cache line stores the base B0 and the deltas Δ0–Δ3; B0 is added to every delta in parallel (vector addition), producing the uncompressed values V0–V3]
(A decompression sketch follows below.)
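A C sketch of the decompression step for the same 8 x 4-byte case, extended with BΔI's implicit zero base: each delta is added either to B0 or to 0, selected by a per-element mask bit. The mask encoding is an assumption for illustration; in hardware the additions happen in parallel.

```c
#include <stdint.h>

/* Rebuild eight 4-byte values from base B0, eight 1-byte deltas, and a mask
 * saying which elements use the implicit zero base ("immediate" values).    */
void bdi_decompress(uint32_t b0, const int8_t delta[8], uint8_t zero_base_mask,
                    uint32_t line[8])
{
    for (int i = 0; i < 8; i++) {
        uint32_t base = ((zero_base_mask >> i) & 1) ? 0u : b0;
        line[i] = base + (uint32_t)(int32_t)delta[i];   /* base + sign-extended delta */
    }
}
```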
BΔI Compressor Design
[Figure: the 32-byte uncompressed cache line is fed in parallel to eight compression units (CUs): 8-byte B0 with 1-byte Δ, 8-byte B0 with 2-byte Δ, 8-byte B0 with 4-byte Δ, 4-byte B0 with 1-byte Δ, 4-byte B0 with 2-byte Δ, 2-byte B0 with 1-byte Δ, Zero, and Repeated Values. Each CU outputs a compression flag (CFlag) and a compressed cache line (CCL); compression selection logic, based on compressed size, emits the final compression flag and compressed cache line.]
(A sketch of the selection step follows below.)
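A C sketch of the selection step, assuming each compression unit reports a success flag and a compressed size; the encoding identifiers are illustrative, not the actual hardware encoding bits.

```c
#include <stdbool.h>
#include <stdint.h>

struct cu_result {
    bool     ok;        /* CFlag: did this unit compress the line?        */
    uint8_t  encoding;  /* which encoding (zero, repeated, 8B/1B, ...)    */
    uint16_t size;      /* compressed size in bytes                       */
};

/* Keep the smallest successful encoding; fall back to the 32-byte
 * uncompressed line if no unit succeeds or none beats 32 bytes.          */
struct cu_result select_encoding(const struct cu_result *cu, int n_units)
{
    struct cu_result best = { true, 0xF /* "uncompressed" */, 32 };
    for (int i = 0; i < n_units; i++)
        if (cu[i].ok && cu[i].size < best.size)
            best = cu[i];
    return best;
}
```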
BΔI Compression Unit: 8-byte B0, 1-byte Δ
[Figure: the 32-byte uncompressed line is treated as four 8-byte values V0–V3; the base B0 = V0 is subtracted from every value, and each difference Δ0–Δ3 is checked against the 1-byte range. If every element is within the 1-byte range, the unit outputs B0 and Δ0–Δ3 (Yes); otherwise it reports failure for this line (No).]
BΔI Cache Organization
[Figure: a conventional 2-way cache with 32-byte lines stores two tags (Tag0, Tag1) and two 32-byte data entries per set. BΔI keeps the data storage the same size but divides it into 8-byte segments (S0–S7 per set) and doubles the tag store (Tag0–Tag3 per set), with compression encoding bits (C) per tag.]
Twice as many tags; tags map to multiple adjacent segments
2.3% overhead for a 2MB cache
(A lookup sketch over the segmented data store follows below.)
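To illustrate how a tag in this organization could locate its data, here is a C sketch of a segmented data-store lookup; the field names, widths, and segment counts are assumptions for the sketch, not the exact hardware layout.

```c
#include <stdint.h>
#include <string.h>

#define SEG_BYTES    8
#define SEGS_PER_SET 8          /* 64 data bytes per set in this toy example */

struct bdi_tag {
    uint64_t tag;
    uint8_t  valid;
    uint8_t  encoding;          /* C: which BΔI encoding the line uses       */
    uint8_t  first_segment;     /* first 8-byte segment holding the line     */
    uint8_t  num_segments;      /* number of adjacent segments it occupies   */
};

/* Copy a hit line's compressed bytes out of the set's segment array. */
void read_compressed_line(const struct bdi_tag *t,
                          const uint8_t segments[SEGS_PER_SET][SEG_BYTES],
                          uint8_t *out)
{
    for (int s = 0; s < t->num_segments; s++)
        memcpy(out + s * SEG_BYTES, segments[t->first_segment + s], SEG_BYTES);
}
```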
Qualitative Comparison with Prior Work
• Zero-based designs
– ZCA [Dusser+, ICS’09]: zero-content augmented cache
– ZVC [Islam+, PACT’09]: zero-value cancelling
– Limited applicability (only zero values)
• FVC [Yang+, MICRO’00]: frequent value compression
– High decompression latency and complexity
• Pattern-based compression designs
– FPC [Alameldeen+, ISCA’04]: frequent pattern compression
• High decompression latency (5 cycles) and complexity
– C-pack [Chen+, T-VLSI Systems’10]: practical implementation of
FPC-like algorithm
• High decompression latency (8 cycles)
Outline
• Motivation & Background
• Key Idea & Our Mechanism
• Evaluation
• Conclusion
Methodology
• Simulator
– x86 event-driven simulator based on Simics
[Magnusson+, Computer’02]
• Workloads
– SPEC2006 benchmarks, TPC, Apache web server
– 1 – 4 core simulations for 1 billion representative
instructions
• System Parameters
– L1/L2/L3 cache latencies from CACTI [Thoziyoor+, ISCA’08]
– 4GHz, x86 in-order core, 512kB - 16MB L2, simple
memory model (300-cycle latency for row-misses)
Compression Ratio: BΔI vs. Prior Work
SPEC2006, databases, web workloads, 2MB L2
[Figure: per-benchmark compression ratio for ZCA, FVC, FPC, and BΔI across SPEC2006, database, and web workloads; BΔI averages 1.53 (GeoMean)]
BΔI achieves the highest compression ratio
Single-Core: IPC and MPKI
[Figure: normalized IPC and normalized MPKI vs. L2 cache size, baseline (no compression) vs. BΔI; IPC improvements of 8.1%, 4.9%, 5.1%, 5.2%, 3.6%, and 5.6%, and MPKI reductions of 16%, 24%, 21%, 13%, 19%, and 14% across the L2 sizes]
BΔI achieves the performance of a 2X-size cache
Performance improves due to the decrease in MPKI
Single-Core: Effect on Cache Capacity
Fixed L2 cache latency
[Figure: normalized IPC per benchmark for 512kB-2way, 512kB-4way-BΔI, 1MB-4way, 1MB-8way-BΔI, 2MB-8way, 2MB-16way-BΔI, and 4MB-16way configurations; annotated gaps of 2.3%, 1.7%, and 1.3%]
BΔI achieves performance close to the upper bound
Multi-Core Workloads
• Application classification based on:
– Compressibility: effective cache size increase
(Low Compr. (LC) < 1.40, High Compr. (HC) >= 1.40)
– Sensitivity: performance gain with more cache
(Low Sens. (LS) < 1.10, High Sens. (HS) >= 1.10; 512kB -> 2MB)
• Three classes of applications (see the classification sketch below):
– LCLS, HCLS, HCHS; no LCHS applications
• For 2-core runs: random mixes from each possible class pair (20 each, 120 total workloads)
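The classification sketch referenced above, in C; the thresholds follow the slide, while the function and enum names are illustrative.

```c
#include <stdbool.h>

enum wl_class { WL_LCLS, WL_HCLS, WL_HCHS, WL_LCHS };

/* Classify an application by its compressibility and its cache sensitivity
 * (speedup when the L2 grows from 512kB to 2MB).                           */
enum wl_class classify_workload(double compression_ratio, double speedup_512k_to_2m)
{
    bool high_compr = compression_ratio  >= 1.40;
    bool high_sens  = speedup_512k_to_2m >= 1.10;
    if (high_compr)
        return high_sens ? WL_HCHS : WL_HCLS;
    return high_sens ? WL_LCHS : WL_LCLS;   /* no LCHS applications were observed */
}
```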
Multi-Core: Weighted Speedup
[Figure: normalized weighted speedup of ZCA, FVC, FPC, and BΔI for the six class pairs (LCLS-LCLS, LCLS-HCLS, HCLS-HCLS, LCLS-HCHS, HCLS-HCHS, HCHS-HCHS), grouped into low-sensitivity and high-sensitivity pairs, plus the GeoMean; annotated BΔI improvements include 3.4%, 4.3%, 4.5%, 10.9%, 16.5%, and 18.0%]
If at least one application is sensitive, then the performance improvement is the highest
BΔI improves performance by 9.5% on average
Other Results in Paper
• Sensitivity study of having more than 2X tags
– Up to 1.98 average compression ratio
• Effect on bandwidth consumption
– 2.31X decrease on average
• Detailed quantitative comparison with prior work
• Cost analysis of the proposed changes
– 2.3% L2 cache area increase
Conclusion
• A new Base-Delta-Immediate compression mechanism
• Key insight: many cache lines can be efficiently
represented using base + delta encoding
• Key properties:
– Low latency decompression
– Simple hardware implementation
– High compression ratio with high coverage
• Improves cache hit ratio and performance of both single-core and multi-core workloads
– Outperforms state-of-the-art cache compression techniques: FVC and FPC
Linearly Compressed Pages:
A Main Memory Compression
Framework with
Low Complexity and Low Latency
Gennady Pekhimenko, Vivek Seshadri, Yoongu Kim,
Hongyi Xin, Onur Mutlu, Phillip B. Gibbons*,
Michael A. Kozuch*, Todd C. Mowry
Executive Summary
• Main memory is a limited shared resource
• Observation: Significant data redundancy
• Idea: Compress data in main memory
• Problem: How to avoid latency increase?
• Solution: Linearly Compressed Pages (LCP): fixed-size cache line granularity compression
1. Increases capacity (69% on average)
2. Decreases bandwidth consumption (46%)
3. Improves overall performance (9.5%)
Challenges in Main Memory Compression
1. Address Computation
2. Mapping and Fragmentation
3. Physically Tagged Caches
Address Computation
[Figure: in an uncompressed page, cache line Li (64B each) starts at address offset i*64 (L0 at 0, L1 at 64, L2 at 128, …, LN-1 at (N-1)*64); in a compressed page with variable-size lines, the offsets of L1, L2, …, LN-1 are unknown without the sizes of all preceding lines]
(A sketch contrasting the two address computations follows below.)
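The address-computation contrast, sketched in C under the assumption of a fixed per-line compressed size (the LCP approach introduced later) versus variable-size lines; function names are illustrative.

```c
#include <stdint.h>

/* With a fixed per-line compressed size c_bytes, the address of line i is a
 * single multiply-add, just like i * 64 in an uncompressed page.            */
uint64_t fixed_size_line_addr(uint64_t page_base, unsigned i, unsigned c_bytes)
{
    return page_base + (uint64_t)i * c_bytes;
}

/* With variable-size lines, computing the offset of line i requires summing
 * the sizes of all preceding lines.                                          */
uint64_t variable_size_line_addr(uint64_t page_base, unsigned i, const uint16_t *line_size)
{
    uint64_t offset = 0;
    for (unsigned k = 0; k < i; k++)
        offset += line_size[k];
    return page_base + offset;
}
```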
Mapping and Fragmentation
[Figure: a 4kB virtual page maps through virtual-to-physical translation to a physical page of unknown (compressed) size; compressed physical pages of varying sizes cause fragmentation]
Physically Tagged Caches
[Figure: the core issues a virtual address; address translation through the TLB produces the physical address needed to match the tags of the physically tagged L2 cache, so translation sits on the critical path of every access]
Shortcomings of Prior Work
Prior main memory compression mechanisms fall short on at least one of four criteria: access latency, decompression latency, complexity, and compression ratio.
– IBM MXT [IBM J.R.D. ’01]
– Robust Main Memory Compression [ISCA’05]
LCP, our proposal, targets all four: low access latency, low decompression latency, low complexity, and a high compression ratio.
Linearly Compressed Pages (LCP): Key Idea
[Figure: an uncompressed 4kB page (64 x 64B cache lines) is compressed 4:1 into 1kB of fixed-size compressed-line slots, followed by metadata (M, 64B) and exception storage (E) for incompressible lines]
(A read-path sketch follows below.)
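A C sketch of the read path implied by this layout: check the line's metadata, then either decompress its fixed-size slot or fetch the uncompressed copy from exception storage. The structure and field names are assumptions for the sketch, and decompress_line() is a stub standing in for the page's BDI/FPC decoder.

```c
#include <stdint.h>

struct lcp_meta {
    uint8_t exception;      /* 1 if line i is stored uncompressed           */
    uint8_t exception_idx;  /* slot index within the exception region       */
};

/* Stub: a real framework would run the page's compression algorithm here. */
static void decompress_line(const uint8_t *src, unsigned c_bytes, uint8_t out[64])
{
    for (unsigned b = 0; b < 64; b++)
        out[b] = (b < c_bytes) ? src[b] : 0;
}

/* Read cache line i from an LCP page with fixed compressed-line size c_bytes. */
void lcp_read_line(const uint8_t *compressed_region, unsigned i, unsigned c_bytes,
                   const struct lcp_meta *meta, const uint8_t *exception_region,
                   uint8_t out[64])
{
    if (meta[i].exception) {
        const uint8_t *src = exception_region + meta[i].exception_idx * 64;
        for (unsigned b = 0; b < 64; b++)      /* uncompressed copy */
            out[b] = src[b];
    } else {
        decompress_line(compressed_region + i * c_bytes, c_bytes, out);
    }
}
```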
LCP Overview
• Page Table entry extension (see the sketch below)
– compression type and size
– extended physical base address
• Operating System management support
– 4 memory pools (512B, 1kB, 2kB, 4kB)
• Changes to cache tagging logic
– physical page base address + cache line index (within a page)
• Handling page overflows
• Compression algorithms: BDI [PACT’12], FPC [ISCA’04]
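An illustrative C layout for the extended page table entry described above; the field widths and names are guesses for illustration, not the paper's exact encoding.

```c
#include <stdint.h>

/* Possible layout of the per-page LCP extension recorded by the OS. */
struct lcp_pte_ext {
    uint64_t phys_base : 40;   /* extended physical base address            */
    uint64_t c_type    : 2;    /* compression algorithm used for this page  */
    uint64_t c_size    : 2;    /* size class: 512B / 1kB / 2kB / 4kB pool   */
    uint64_t zero_page : 1;    /* TLB-visible all-zero-page bit             */
    uint64_t reserved  : 19;
};
```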
LCP Optimizations
• Metadata cache
– Avoids additional requests to metadata
• Memory bandwidth reduction
– Four adjacent compressed 64B cache lines can be fetched in 1 transfer instead of 4
• Zero pages and zero cache lines
– Handled separately in TLB (1-bit) and in metadata (1-bit per cache line)
• Integration with cache compression
– BDI and FPC
Methodology
• Simulator
– x86 event-driven simulators
• Simics-based [Magnusson+, Computer’02] for CPU
• Multi2Sim [Ubal+, PACT’12] for GPU
• Workloads
– SPEC2006 benchmarks, TPC, Apache web server,
GPGPU applications
• System Parameters
– L1/L2/L3 cache latencies from CACTI [Thoziyoor+, ISCA’08]
– 512kB - 16MB L2, simple memory model
Compression Ratio Comparison
SPEC2006, databases, web workloads, 2MB L2 cache
[Figure: GeoMean compression ratio for Zero Page, FPC, MXT, LZ, LCP (BDI), and LCP (BDI+FPC-fixed); values shown include 1.30, 1.59, 1.62, 1.69, 2.31, and 2.60]
LCP-based frameworks achieve compression ratios competitive with prior work
Bandwidth Consumption Decrease
SPEC2006, databases, web workloads, 2MB L2 cache
[Figure: normalized BPKI (lower is better) for FPC-cache, BDI-cache, FPC-memory, (None, LCP-BDI), (FPC, FPC), (BDI, LCP-BDI), and (BDI, LCP-BDI+FPC-fixed); GeoMean values shown include 0.92, 0.89, 0.63, 0.57, 0.55, 0.54, and 0.54]
LCP frameworks significantly reduce bandwidth (46%)
Performance Improvement
Cores | LCP-BDI | (BDI, LCP-BDI) | (BDI, LCP-BDI+FPC-fixed)
1     | 6.1%    | 9.5%           | 9.3%
2     | 13.9%   | 23.7%          | 23.6%
4     | 10.7%   | 22.6%          | 22.5%
LCP frameworks significantly improve performance
Conclusion
• A new main memory compression framework called LCP (Linearly Compressed Pages)
– Key idea: fixed size for compressed cache lines within a page and fixed compression algorithm per page
• LCP evaluation:
– Increases capacity (69% on average)
– Decreases bandwidth consumption (46%)
– Improves overall performance (9.5%)
– Decreases energy of the off-chip bus (37%)