
15-740/18-740 Computer Architecture
Lecture 5: Project Example
Justin Meza, Yoongu Kim
Fall 2011, 9/21/2011

Reminder: Project Proposals
• Project proposals due NOON on Monday 9/26
• Two to three pages consisting of
  – Problem
  – Novelty
  – Idea
  – Hypothesis
  – Methodology
  – Plan
• All the details are in the project handout

Agenda for Today's Class
• Brief background on hybrid main memories
• Project example from Fall 2010
• Project pitches and feedback
• Q & A

Main Memory in Today's Systems
[Diagram: CPU — DRAM — HDD/SSD]

Main Memory in Today's Systems
[Diagram: DRAM serves as the main memory between the CPU and the HDD/SSD]

DRAM
• Pros
  – Low latency
  – Low cost
• Cons
  – Low capacity
  – High power
• Some new and important applications require HUGE capacity (in the terabytes)

Main Memory in Today's Systems
[Diagram: DRAM as main memory between the CPU and the HDD/SSD]

Hybrid Memory (Future Systems)
[Diagram: a hybrid main memory — DRAM (cache) plus new memories (high capacity) — between the CPU and the HDD/SSD]

Row Buffer Locality-Aware Hybrid Memory Caching Policies
Justin Meza, HanBin Yoon, Rachata Ausavarungnirun, Rachael Harding, Onur Mutlu

Motivation
• Two conflicting trends:
  1. ITRS predicts the end of DRAM scalability
  2. Workloads continue to demand more memory
• Want future memories to have
  – Large capacity
  – High performance
  – Energy efficiency
• Need scalable DRAM alternatives

Motivation
• Emerging memories can offer more scalability
• Phase change memory (PCM)
  – Projected to be 3−12× denser than DRAM
• However, cannot simply replace DRAM
  – Longer access latencies (4−12× DRAM)
  – Higher access energies (2−40× DRAM)
• Use DRAM as a cache to large PCM memory [Mohan, HPTS '09; Lee+, ISCA '09]

Phase Change Memory (PCM)
• Data stored in form of resistance
  – High current melts cell material
  – Rate of cooling determines stored resistance
  – Low current used to read cell contents

Projected PCM Characteristics (~2013)
At 32 nm:

                 DRAM          PCM               Relative to DRAM
Cell size        6 F²          0.5–2 F²          3–12× denser
Read latency     60 ns         300–800 ns        6–13× slower
Write latency    60 ns         1400 ns           24× slower
Read energy      1.2 pJ/bit    2.5 pJ/bit        2× more energy
Write energy     0.39 pJ/bit   16.8 pJ/bit       40× more energy
Durability       N/A           10⁶–10⁸ writes    Limited lifetime

[Mohan, HPTS '09; Lee+, ISCA '09]

Row Buffers and Locality
• Memory array organized in columns and rows
• Row buffers store contents of accessed row
• Row buffers are important for memory devices
  – Device slower than bus: need to buffer data
  – Fast accesses for data with spatial locality
  – DRAM: destructive reads
  – PCM: writes are costly: want to coalesce

Row Buffers and Locality
[Diagram: an address selects a row; the ROW DATA is brought into the row buffer. LOAD X misses in the row buffer; LOAD X+1 then hits.]

Key Idea
• Since DRAM and PCM both use row buffers:
  – Row buffer hit latency is the same in DRAM and PCM
  – Row buffer miss latency is small in DRAM
  – Row buffer miss latency is large in PCM
• Cache data in DRAM which
  – Frequently misses in the row buffer
  – Is reused many times
• → because the miss penalty is smaller in DRAM

Hybrid Memory Architecture
[Diagrams: the CPU's memory controller (with separate DRAM and PCM controllers and a tag store tracking 2 KB rows) sits in front of a low-density DRAM cache and a high-density PCM, sharing a memory channel. A LOAD X whose tag maps to DRAM is served by the DRAM cache; a LOAD Y whose tag maps to PCM is served by PCM.]
• How does data get migrated to DRAM? → Caching Policy

Methodology
• Simulated our system configurations
  – Collected program traces using a tool called Pin
  – Fed instruction trace information to a timing simulator modeling an OoO core and DDR3 memory
  – Migrated data at the row (2 KB) granularity
• Collected memory traces from a standard computer architecture benchmark suite
  – SPEC CPU2006
• Used an in-house simulator written in C#
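To make the setup concrete, here is a minimal C sketch of a trace-replay loop of the kind described above; the record layout, constants, and function names are illustrative assumptions, not the actual (C#) in-house simulator:

#include <stdint.h>
#include <stdio.h>

/* Hypothetical trace record: one memory request per entry of the Pin trace. */
typedef struct {
    uint64_t cycle;     /* cycle at which the request is issued    */
    uint64_t addr;      /* physical address touched by the request */
    int      is_write;  /* 1 = store, 0 = load                     */
} TraceRecord;

#define ROW_BYTES 2048                     /* 2 KB row migration granularity */
#define ROW_ID(addr) ((addr) / ROW_BYTES)  /* row that the address falls in  */

/* Skeleton of the trace-replay loop: each record is handed to a timing
 * model of the hybrid memory (not shown here). */
void replay_trace(FILE *trace)
{
    TraceRecord r;
    while (fread(&r, sizeof r, 1, trace) == 1) {
        uint64_t row = ROW_ID(r.addr);
        /* the timing model would decide: serve from DRAM cache or PCM,
         * update row buffer state, and possibly migrate this row */
        (void)row;
    }
}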
Conventional Caching

• Data is migrated when first accessed
• Simple, used for many caches

[Diagram: a load to row Z (tag store: Z → PCM) causes the row's data to be migrated from PCM to the DRAM cache over the shared memory channel → bus contention!]
• How does conventional caching perform in a hybrid main memory?
[Bar charts: per-benchmark IPC normalized to All DRAM, for No Caching (All PCM) and Conventional Caching.]
• Beneficial for some benchmarks
• Performance degrades due to bus contention
• Many row buffer hits: don't need to migrate data
• Want to identify data which misses in the row buffer and is reused

Problems with Conventional Caching
• Performs useless migrations
  – Migrates data which are not reused
  – Migrates data which hit in the row buffer
• Causes bus contention and DRAM pollution
  – Want to cache rows which are reused
  – Want to cache rows which miss in the row buffer

A Reuse-Aware Policy
• Keep track of the number of accesses to a row
• Cache row in DRAM when accesses ≥ A
  – Reset accesses every Q cycles
• Similar to CHOP [Jiang+, HPCA '10]
  – Cached "hot" (reused) pages in on-chip DRAM
  – To reduce off-chip bandwidth requirements
• We call this policy A-COUNT (a counter sketch follows below)
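A minimal C sketch of the A-COUNT bookkeeping, assuming a small per-row counter table in the memory controller; the table size, threshold value, and function names are illustrative placeholders, not the actual design:

#include <stdint.h>
#include <string.h>

#define NUM_ROWS  4096        /* illustrative: rows tracked by the statistics store */
static uint32_t access_count[NUM_ROWS]; /* per-row access counters                  */
static uint32_t A = 4;                  /* caching threshold, e.g. A-COUNT.4        */

/* Called on every access to a PCM row (row already mapped into [0, NUM_ROWS));
 * returns 1 if the row should be migrated to the DRAM cache under A-COUNT. */
int acount_should_cache(uint32_t row)
{
    access_count[row]++;
    return access_count[row] >= A;
}

/* Called once every Q cycles to age out stale statistics. */
void acount_reset(void)
{
    memset(access_count, 0, sizeof access_count);
}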
[Bar charts: IPC normalized to All DRAM for No Caching (All PCM), Conventional Caching, and A-COUNT.4.]
• Performs fewer migrations: reduces channel contention
• Too few migrations: too many accesses go to PCM
• Rows with many hits still needlessly migrated

Problems with Reuse-Aware Policy
• Agnostic of DRAM/PCM access latencies
  – May keep data which row buffer misses in PCM
  – Missed opportunity: could save cycles in DRAM

[Figure 2: Data placement affects service time. Data with frequent row buffer hits is serviced about as quickly in PCM as in DRAM, while data with frequent row buffer misses saves many cycles if placed in DRAM.]

Row Buffer Locality-Aware Policy
• Cache rows which benefit from being in DRAM
  – I.e., those with frequent row buffer misses
• Keep track of the number of misses to a row
• Cache row in DRAM when misses ≥ M
  – Reset misses every Q cycles
• We call this policy M-COUNT
[Bar charts: IPC normalized to All DRAM for No Caching (All PCM), Conventional Caching, A-COUNT.4, and M-COUNT.2.]
• Recognizes rows with many hits and does not migrate them
• Lots of data with just enough misses to get cached but little reuse after being cached → need to also track reuse

Combined Reuse/Locality Approach
• Cache rows with reuse and which frequently miss in the row buffer
  – Use A-COUNT as a predictor of future reuse and
  – M-COUNT as a predictor of future row buffer misses
• Cache row if accesses ≥ A and misses ≥ M
• We call this policy AM-COUNT (see the sketch below)
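A minimal C sketch of the combined check (M-COUNT is the same idea using only the miss counter); the struct layout and function names are assumptions for illustration:

#include <stdint.h>

/* Illustrative per-row statistics entry; the real statistics store is a
 * small set-associative structure (see the hardware-cost slide). */
typedef struct {
    uint32_t accesses;  /* total accesses this quantum    */
    uint32_t misses;    /* row buffer misses this quantum */
} RowStats;

/* Update statistics on each access; row_buffer_hit reflects whether the
 * request hit in the bank's open row buffer. */
void amcount_update(RowStats *s, int row_buffer_hit)
{
    s->accesses++;
    if (!row_buffer_hit)
        s->misses++;
}

/* AM-COUNT: migrate a PCM row to the DRAM cache only if it is both
 * reused (accesses >= A) and row-buffer-unfriendly (misses >= M). */
int amcount_should_cache(const RowStats *s, uint32_t A, uint32_t M)
{
    return s->accesses >= A && s->misses >= M;
}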
[Bar charts: IPC normalized to All DRAM for No Caching (All PCM), Conventional Caching, A-COUNT.4, M-COUNT.2, and AM-COUNT.4.2.]
• Reduces useless migrations
• And data with little reuse is kept out of DRAM

Dynamic Reuse/Locality Approach
• Previously mentioned policies require profiling
  – To determine the best A and M thresholds
• We propose a dynamic threshold policy
  – Performs a cost-benefit analysis every Q cycles
  – Simple hill-climbing algorithm to maximize benefit
  – (Side note: we simplify the problem slightly by just finding the best A threshold, because we observe that M = 2 performs the best for a given A.)

Cost-Benefit Analysis
• Each quantum, we measure the first-order costs and benefits of the current A threshold
  – Cost = cycles of bus contention due to migrations
  – Benefit = cycles saved at the banks by servicing a request in DRAM versus PCM
• Cost = Migrations × t_migration
• Benefit = Reads_DRAM × (t_read,PCM − t_read,DRAM) + Writes_DRAM × (t_write,PCM − t_write,DRAM)
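These two expressions translate directly into code; the following C sketch uses hypothetical counter and latency names and is only meant to mirror the formulas above:

#include <stdint.h>

/* Illustrative per-quantum counters (names are placeholders). */
typedef struct {
    uint64_t migrations;   /* rows migrated to DRAM this quantum */
    uint64_t dram_reads;   /* reads serviced by the DRAM cache   */
    uint64_t dram_writes;  /* writes serviced by the DRAM cache  */
} QuantumStats;

/* Cost = Migrations x t_migration (cycles of bus contention). */
uint64_t migration_cost(const QuantumStats *s, uint64_t t_migration)
{
    return s->migrations * t_migration;
}

/* Benefit = cycles saved at the banks by servicing requests in DRAM
 * instead of PCM (PCM latencies are larger, so the differences are positive). */
uint64_t migration_benefit(const QuantumStats *s,
                           uint64_t t_read_pcm,  uint64_t t_read_dram,
                           uint64_t t_write_pcm, uint64_t t_write_dram)
{
    return s->dram_reads  * (t_read_pcm  - t_read_dram) +
           s->dram_writes * (t_write_pcm - t_write_dram);
}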
Cost-Benefit Maximization Algorithm
Each quantum (10 million cycles):

Net = Benefit - Cost              // net benefit of the current A threshold
if Net < 0 then                   // too many migrations?
    A++                           // increase threshold
else                              // last A was beneficial
    if Net > PreviousNet then     // increasing benefit?
        A++                       // try next A
    else                          // decreasing benefit
        A--                       // too strict, reduce
    end
end
PreviousNet = Net
Dynamic Policy Performance

[Bar charts: IPC normalized to All DRAM for No Caching (All PCM), Conventional Caching, Best Static, and Dynamic.]
• 29% improvement over All PCM, within 18% of All DRAM

Evaluation Methodology/Metrics
• 16-core system
• Averaged across 100 randomly-generated workloads of varying working set size
  – LARGE = working set size > main memory size
• Weighted speedup (performance) = Σ_i IPC_i^together / IPC_i^alone
• Maximum slowdown (fairness) = max_i IPC_i^alone / IPC_i^together
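Both metrics are simple functions of the per-core IPC values; the following generic C sketch reflects the standard definitions rather than the simulator's own implementation:

#include <stddef.h>

/* Weighted speedup: sum over cores of IPC_together / IPC_alone. */
double weighted_speedup(const double *ipc_together, const double *ipc_alone, size_t n)
{
    double ws = 0.0;
    for (size_t i = 0; i < n; i++)
        ws += ipc_together[i] / ipc_alone[i];
    return ws;
}

/* Maximum slowdown: max over cores of IPC_alone / IPC_together. */
double maximum_slowdown(const double *ipc_together, const double *ipc_alone, size_t n)
{
    double ms = 0.0;
    for (size_t i = 0; i < n; i++) {
        double slowdown = ipc_alone[i] / ipc_together[i];
        if (slowdown > ms)
            ms = slowdown;
    }
    return ms;
}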
16-core Performance & Fairness

[Charts (Figure 8 of the technical report): weighted speedup, maximum slowdown, and harmonic speedup as a function of the fraction of LARGE benchmarks, for Conventional Caching, A-COUNT, AM-COUNT, and DAM-COUNT.]
• More contention → more benefit
• Dynamic policy can adjust to different workloads
Versus All PCM and All DRAM
• Compared to an All PCM main memory
  – 17% performance improvement
  – 21% fairness improvement
• Compared to an All DRAM main memory
  – Within 21% of performance
  – Within 53% of fairness
Robustness to System Configuration

[Figure 10: Number of cores. Normalized weighted speedup vs. sorted workload number for 2-, 4-, 8-, and 16-core systems.]

Implementation/Hardware Cost
• Requires a tag store in the memory controller
  – We currently assume 36 KB of storage per 16 MB of DRAM
  – We are investigating ways to mitigate this overhead
• Requires a statistics store
  – To keep track of accesses and misses (one possible layout is sketched below)
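For concreteness, one possible (hypothetical) layout of a set-associative statistics store entry is sketched below; the field widths and geometry are illustrative assumptions, since the slides only give total storage budgets:

#include <stdint.h>

/* Hypothetical statistics-store entry; field widths are illustrative and
 * do not reflect the storage budgets quoted above. */
typedef struct {
    uint32_t row_tag;   /* identifies which 2 KB row this entry tracks */
    uint16_t accesses;  /* accesses observed this quantum              */
    uint16_t misses;    /* row buffer misses observed this quantum     */
    uint8_t  lru;       /* replacement state within the set            */
} StatsEntry;

/* e.g., an assumed 2048-entry, 8-way store organized as 256 sets: */
#define STATS_WAYS 8
#define STATS_SETS 256
static StatsEntry stats_store[STATS_SETS][STATS_WAYS];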
Conclusions

• DRAM scalability is nearing its limit
  – Emerging memories (e.g., PCM) offer scalability
  – Problem: must address high latency and energy
• We propose a dynamic, row buffer locality-aware caching policy for hybrid memories
  – Cache rows which miss frequently in the row buffer
  – Cache rows which are reused many times
• 17/21% perf/fairness improvement vs. all PCM
• Within 21/53% perf/fairness of all DRAM system

Thank you! Questions?
Backup Slides
Related Work

[Figure 13: Related techniques. Weighted speedup of DIP, Probabilistic, Probabilistic+RBL, and DAM-COUNT.]
PCM Latency

[Figure 12: Effects of PCM latency. Weighted speedup as the PCM latency scaling factor varies from 1× to 8×.]

DRAM Cache Size

[Figure 11: Effects of DRAM size. Weighted speedup for 64 MB, 128 MB, 256 MB, and 512 MB DRAM cache sizes, for Conventional Caching and DAM-COUNT.]

Versus All DRAM and All PCM

[Figure 9: Versus DRAM/PCM. Weighted speedup, maximum slowdown, harmonic speedup, and power for All PCM, Conventional Caching, DAM-COUNT, and All DRAM.]

Performance vs. Statistics Store Size
[Bar charts: IPC normalized to All DRAM for an 8-way, LRU statistics store with 512 (0.2 KB), 1024 (0.4 KB), 2048 (0.8 KB), 4096 (1.6 KB), and ∞ entries.]
• Within ~1% of infinite storage with 200 B of storage

All DRAM 8 Banks
[Bar chart: IPC normalized to All DRAM with 8 banks, for No Caching (All PCM), Conventional Caching, Best Static, and Dynamic.]

All DRAM 16 Banks

[Bar chart: IPC normalized to All DRAM with 16 banks, for No Caching (All PCM), Conventional Caching, Best Static, and Dynamic.]

Simulation Parameters
Overview
• DRAM is reaching its scalability limits
  – Yet, memory capacity requirements are increasing
• Emerging memory devices offer scalability
  – Phase-change, resistive, ferroelectric, etc.
  – But, have worse latency/energy than DRAM
• We propose a scalable hybrid memory architecture
  – Use DRAM as a cache to phase change memory
  – Cache data based on row buffer locality and reuse

Methodology
• Core model
  – 3-wide issue with 128-entry instruction window
  – 32 KB L1 D-cache per core
  – 512 KB shared L2 cache per core
• Memory model
  – 16 MB DRAM / 512 MB PCM per core
    • Scaled based on workload trace size and access patterns to be smaller than the working set
  – DDR3 800 MHz, single channel, 8 banks per device
  – Row buffer hit: 40 ns
  – Row buffer miss: 80 ns (DRAM); 128, 368 ns (PCM)
  – Migrate data at 2 KB row granularity
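For reference, the parameters above could be collected into a single configuration record; this C sketch uses hypothetical names, and the interpretation of the two PCM miss latencies as read/write is an assumption:

/* Evaluated configuration gathered into one illustrative struct;
 * values come from the bullets above. */
typedef struct {
    int issue_width;           /* 3-wide issue                              */
    int instr_window;          /* 128-entry instruction window              */
    int l1d_kb_per_core;       /* 32 KB L1 D-cache per core                 */
    int l2_kb_per_core;        /* 512 KB shared L2 per core                 */
    int dram_mb_per_core;      /* 16 MB DRAM cache per core                 */
    int pcm_mb_per_core;       /* 512 MB PCM per core                       */
    int row_hit_ns;            /* 40 ns row buffer hit                      */
    int dram_row_miss_ns;      /* 80 ns DRAM row buffer miss                */
    int pcm_row_miss_ns_a;     /* 128 ns PCM row buffer miss (assumed read) */
    int pcm_row_miss_ns_b;     /* 368 ns PCM row buffer miss (assumed write)*/
    int row_kb;                /* 2 KB migration granularity                */
} SimConfig;

static const SimConfig kConfig = {
    3, 128, 32, 512, 16, 512, 40, 80, 128, 368, 2
};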
Outline

• Overview
• Motivation/Background
• Methodology
• Caching Policies
• Multicore Evaluation
• Conclusions

16-core Performance & Fairness

[Charts (Figure 8 of the technical report): weighted speedup, maximum slowdown, and harmonic speedup vs. fraction of LARGE benchmarks, for Conventional Caching, A-COUNT, AM-COUNT, and DAM-COUNT.]
• Distributing data benefits small working sets, too