CAMEO
A Cache-Like Memory Organization
for 3D memory systems
12/15/2014 MICRO
Cambridge, UK
Chiachen Chou, Georgia Tech
Aamer Jaleel, Intel
Moinuddin K. Qureshi, Georgia Tech
EXECUTIVE SUMMARY
• How to use Stacked DRAM: Cache or Memory?
• Cache: software-transparent, fine-grained data transfer, but sacrifices memory capacity
• Memory: larger memory capacity, but needs software support and coarse-grained data transfer
• CAMEO: software-transparent, fine-grained data transfer, and almost full memory capacity
• Results: CAMEO (78% speedup) outperforms both Cache (50%) and Two-Level Memory (50%)
MEMORY BANDWIDTH WALL
Computer systems face a memory bandwidth wall.
[Figure: stacked DRAM (e.g., High Bandwidth Memory, Hybrid Memory Cube) offers 2-8X the bandwidth of commodity DRAM at 0.5-1X the latency.]
Stacked DRAM helps overcome the bandwidth wall.
Courtesy: JEDEC, Intel
HYBRID MEMORY SYSTEM
[Figure: a hybrid memory system pairs 1-4 GB of stacked DRAM with 8-16 GB of commodity DRAM; stacked DRAM provides about 0.25X of the total capacity.]
How to use Stacked DRAM: Cache or Main Memory?
AGENDA
• Introduction
• Background
– Cache
– Two-Level Memory
• CAMEO
– Concept
– Implementation
• Methodology
• Results
• Summary
HARDWARE-MANAGED CACHE
[Figure: memory hierarchy, fast to slow: per-core L1$ and L2$, shared L3$, stacked DRAM cache, off-chip DRAM.]
Stacked DRAM is architected as a DRAM cache.
HARDWARE-MANAGED CACHE
[Figure: on an L3 miss, the stacked DRAM L4 cache is accessed; on an L4 miss, a 64B line is fetched from off-chip memory. The OS sees only the off-chip memory.]
Cache: software transparency and fine-grained data transfer, but no capacity benefit.
3D DRAM AS A CACHE
[Figure: 4GB stacked DRAM used as a cache (DRAM $) in front of 12GB commodity off-chip DRAM; the OS-visible capacity is 12GB rather than 16GB.]

                  Cache   TLM       CAMEO
Need OS Support   No      Yes       No
Data Transfer @   64B     4KB       64B
Memory Capacity   No 3D   Plus 3D   Plus 3D
TWO-LEVEL MEMORY (TLM)
[Figure: 4GB stacked DRAM and 12GB off-chip DRAM together form a 16GB OS-visible memory space below the L1/L2/L3 cache hierarchy.]
Stacked DRAM is architected as part of the OS-visible memory space (Two-Level Memory).
TWO-LEVEL MEMORY (NO MIGRATION)
[Figure: the OS statically maps 25% of pages to the 4GB stacked DRAM and 75% of pages to the 12GB off-chip DRAM.]
Static page mapping does not exploit locality.
TWO-LEVEL MEMORY (WITH MIGRATION)
[Figure: an L3 miss fetches 64B; with OS support, hot pages migrate to stacked DRAM at 4KB-page granularity.]
TLM: requires OS support and uses bandwidth inefficiently.
MOTIVATION
[Figure: speedup over a 12GB off-chip DRAM baseline when 4GB stacked DRAM is added as Cache (4+12), TLM (4+12), or DoubleUse (4+16), shown for Small Working Set (<12GB), Large Working Set (>12GB), and Overall workloads.]
MOTIVATION
[Figure: speedup of Cache (4+12), TLM (4+12), and DoubleUse (4+16) over the 12GB off-chip DRAM baseline, annotated: Cache performs poorly in Large WS (>12GB) workloads, as does TLM in Small WS (<12GB) workloads (annotated gap: 31%).]
OVERVIEW
[Figure: Cache places stacked DRAM as a DRAM $ outside the OS-visible memory space; TLM places stacked DRAM inside the OS-visible memory space alongside off-chip DRAM.]

                  Cache   TLM       Ideal
Need OS Support   No      Yes       No
Data Transfer @   64B     4KB       64B
Memory Capacity   No 3D   Plus 3D   Plus 3D
CAMEO: A CAche-Like MEmory Organization
[Figure: the OS sees the full 16GB (4GB stacked DRAM plus 12GB commodity DRAM); hardware migrates data between the two.]
SW gets full capacity; HW does data migration.
CAMEO: A CAche-Like MEmory Organization
[Figure: on an L3 miss, hardware swaps 64B lines between stacked memory and off-chip memory (fine-grained transfer).]
CAMEO transfers only 64B cache lines.
CAMEO – CONGRUENCE GROUP
[Figure: stacked memory (4GB) holds lines 0 to N-1; off-chip memory (12GB) holds lines N to 4N-1. Line A (index i in stacked memory) and lines B (N+i), C (2N+i), and D (3N+i) at the same offset in each off-chip region form a congruence group.]
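The mapping above fits in a few lines of code. This is an illustrative sketch, not the hardware design: the parameter values (64B lines, 4GB stacked region) come from the slides, but the function name and structure are assumptions.

```python
# Illustrative sketch of the congruence-group mapping described above.
# With 4GB of stacked DRAM and 64B lines, the stacked region holds N lines;
# lines i, N+i, 2N+i, and 3N+i form congruence group i.
LINE_SIZE = 64
N = (4 * 2**30) // LINE_SIZE          # number of lines that fit in stacked DRAM

def congruence_group(line_number):
    """Return (group index, slot) for a physical line number.
    Slot 0 is the stacked-DRAM location (A); slots 1-3 are off-chip (B, C, D)."""
    return line_number % N, line_number // N
```

For example, lines 7, N+7, 2N+7, and 3N+7 all belong to group 7, in slots 0 through 3 respectively.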
MIGRATION IN CONGRUENCE GROUP
Initial state: A B C D (line A resides in stacked DRAM).
Requests to B, B, and C:
• Request to B: swap lines A and B → B A C D
• Request to B: hit in stacked DRAM → B A C D
• Request to C: swap lines C and B → C A B D
Swapping changes a line's location and requires an indexing structure to keep track of locations.
LINE LOCATION TABLE (LLT)
Location Table for a Congruence Group
[Figure: group state C A B D, with the four physical locations labeled 00, 01, 10, 11.]

Request Line   Physical Location
A              01
B              10
C              00
D              11
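A tiny simulation of the swap policy and its LLT bookkeeping reproduces the migration example. This is a sketch with assumed names, not the hardware mechanism:

```python
# Sketch: one congruence group with its Line Location Table (LLT).
# Physical slot 0 is stacked DRAM; on a miss, the requested line is swapped in.
def access(slots, llt, line):
    """Serve a request for `line`, swapping it into stacked DRAM on a miss."""
    loc = llt[line]                    # current physical slot of the requested line
    if loc == 0:
        return "hit"                   # already resident in stacked DRAM
    resident = slots[0]                # line currently occupying stacked DRAM
    slots[0], slots[loc] = slots[loc], slots[0]
    llt[line], llt[resident] = 0, loc  # both lines changed location
    return "swap"

slots = ["A", "B", "C", "D"]           # physical slots; index 0 = stacked DRAM
llt = {"A": 0, "B": 1, "C": 2, "D": 3} # line -> physical slot
results = [access(slots, llt, r) for r in ("B", "B", "C")]
# slots ends as ["C", "A", "B", "D"]; llt matches the table: A->01, B->10, C->00, D->11
```

The final LLT state (A at 01, B at 10, C at 00, D at 11) is exactly the location table shown above.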
LINE LOCATION TABLE (LLT)
Size of Location Table per Congruence Group
• Log2(4) = 2 bits per line; 4 lines = 8 bits (1 byte) per group
• 64M groups → 64MB of LLT state
Storing the LLT in SRAM is impractical.
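The storage numbers check out arithmetically; here is a sanity-check sketch using the values from the slide:

```python
import math

slots_per_group = 4                                      # lines A, B, C, D
bits_per_entry = math.ceil(math.log2(slots_per_group))   # 2 bits to name a location
bytes_per_group = slots_per_group * bits_per_entry // 8  # 4 x 2 = 8 bits = 1 byte
groups = (4 * 2**30) // 64                # 4GB stacked DRAM / 64B lines = 64M groups
total_mb = groups * bytes_per_group // 2**20             # 64MB of LLT state in total
```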
LLT IN DRAM
• Storing the LLT in DRAM incurs serialization latency
• Optimize for the common case: hit in stacked DRAM
  – Co-locate the Line Location Table of each congruence group with the data in stacked DRAM (1.5% capacity loss)
[Figure: each 2KB stacked-DRAM row holds 31 LEAD (Location Entry And Data) units of 1-byte LLT plus 64-byte data, so an L3 miss that hits in stacked DRAM retrieves location and data together.]
AVOID LLT LOOKUP LATENCY FOR HIT
• Avoid LLT lookup latency on a stacked DRAM hit (lines in stacked memory)
  – Co-locate the Line Location Table of each congruence group with the data in stacked DRAM
[Figure: Addr → stacked DRAM → data; a hit completes in one access.]
Co-locating the LLT avoids lookup latency on hits.
AVOID LLT LOOKUP LATENCY FOR MISS
• Avoid LLT lookup latency on a stacked DRAM miss (lines in off-chip memory)
  – Use a Line Location Predictor to fetch data from the likely location in parallel
[Figure: the address is fed to the Line Location Predictor, which launches a parallel access to the predicted location (B, C, or D); LEAD verifies the location when both responses are ready.]
AVOID LLT LOOKUP LATENCY FOR MISS
• The Line Location Predictor (LLP) makes an M-ary prediction
• The LLP uses the instruction address and the last location to predict (64 bytes of state per core)
[Figure: Addr → Line Location Predictor → stacked, off-chip #1, off-chip #2, or off-chip #3.]

Predictor        Accuracy
Always Stacked   70%
LLP              92%
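One plausible reading of this predictor is a small per-core table, indexed by a hash of the missing load's instruction address, that remembers the last location observed for that instruction. The structure, sizes, and names below are assumptions for illustration; the paper's exact design may differ.

```python
# Hypothetical sketch of an M-ary Line Location Predictor (LLP).
# 256 entries x 2 bits = 64 bytes per core, matching the budget on the slide.
class LineLocationPredictor:
    def __init__(self, entries=256):
        self.table = [0] * entries    # 0 = stacked; 1-3 = off-chip regions

    def _index(self, pc):
        return (pc >> 2) % len(self.table)  # drop instruction-alignment bits

    def predict(self, pc):
        return self.table[self._index(pc)]  # predicted location for this load

    def update(self, pc, actual_location):
        self.table[self._index(pc)] = actual_location  # remember last location

llp = LineLocationPredictor()
llp.update(0x4008, 2)   # this load's line was last found in off-chip region #2
```

On the next miss from the same instruction, the parallel access is launched to the predicted location, and the LEAD entry verifies the prediction when both responses arrive.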
AVOIDING LLT LATENCY OVERHEAD
On a hit in stacked DRAM:
• Co-locate the LLT of each congruence group with the data in stacked DRAM
On a miss in stacked DRAM:
• Use the Line Location Predictor to fetch data from the likely location in parallel
[Figure: left, stacked line A co-located with the Line Location Table for off-chip lines B, C, D; right, Addr → Line Location Predictor → off-chip #1, #2, or #3.]
We co-locate the Line Location Table and use the Line Location Predictor to mitigate the latency overhead.
METHODOLOGY
[Figure: simulated system: CPU, stacked DRAM, commodity DRAM, SSD.]
Core and Chip:
• 3.2GHz 2-wide out-of-order cores
• 32 cores, 32MB 32-way shared L3 cache
METHODOLOGY

           Stacked DRAM                    Commodity DRAM
Capacity   4GB                             12GB
Bus        DDR 3.2GHz, 128-bit             DDR 1.6GHz, 64-bit
Latency    22ns                            44ns
Channels   16 channels, 16 banks/channel   8 channels, 8 banks/channel
METHODOLOGY
• SSD latency: 32 microseconds
• Baseline: 12GB off-chip DRAM
• Cache: Alloy Cache [MICRO'12]
• Two-Level Memory: page migration enabled
• SPEC2006 in rate mode; Small Working Set (<12GB) and Large Working Set (>12GB)
PERFORMANCE IMPROVEMENT
[Figure: per-benchmark speedup for Small WS workloads (milc, soplex, libq, xalanc, omnetpp, leslie, sphinx3, bzip2, dealII, astar, gcc, GMEAN). CAMEO is as good as Cache in Small WS apps.]
PERFORMANCE IMPROVEMENT
[Figure: per-benchmark speedup for Large WS workloads (mcf, lbm, Gems, bwaves, cactus, zeusmp, GMEAN) and overall. CAMEO outperforms TLM by 28% in Large WS apps.]
CAMEO outperforms both Cache and TLM, and comes very close to DoubleUse.
Thank You!
Backup slides
LINE LOCATION TABLE
Size of Location Table per Congruence Group
[Figure: a group of lines A, B, C, D; Log2(4) = 2 bits per line, 4 lines = 8 bits (1 byte).]

# Locations   Size
4             1 byte
6             2.5 bytes
8             3 bytes
POWER AND ENERGY
[Figure: power and energy-delay product (EDP) for Cache, TLM, and CAMEO, normalized to the baseline; annotated values: 34% and 14%.]