Designing On-chip Memory
Systems for Throughput
Architectures
Ph.D. Proposal
Jeff Diamond
Advisor: Stephen Keckler
“We’ll be seeing a lot more than 2-4 cores per chip really quickly” – Bill Mark, 2005
AMD Trinity
NVIDIA Tegra 3
Intel Ivy Bridge
1
Introduction
The Problem
◦ Throughput Architectures
◦ Dissertation Goals
The Solution
◦ Modeling Throughput Performance
◦ Architectural Enhancements
Thread Scheduling
Cache Policies
◦ Methodology
Proposed Work
2
Key Features:
◦ Break single application into threads
Use explicit parallelism
◦ Optimize hardware for performance density
Not single thread performance
Benefits:
◦ Drop voltage, peak frequency
quadratic improvement in power efficiency
◦ Cores smaller, more energy efficient
Further amortize through multithreading, SIMD
Less need for out-of-order execution, register renaming, branch prediction, fast synchronization, low latency ALUs
3
Architecture Continuum:
◦ Multithreading
Large number of threads mask long latency
Small amount of cache primarily for bandwidth
◦ Caching
Large amounts of cache to reduce latency
Small number of threads
Can we get benefits of both?
POWER7: 4 threads/core, ~1 MB cache/thread
SPARC T4: 8 threads/core, ~80 KB/thread
GTX 580: 48 threads/core, ~2 KB/thread
4
Computation is cheap, data movement is expensive
◦ Hit in L1 cache, 2.5x power of 64-bit FMADD
◦ Move across chip, 50x power
◦ Fetch from DRAM, 320x power
Limited off-chip bandwidth
◦ Exponential growth in cores saturates BW
◦ Performance capped
DRAM latency currently hundreds of cycles
◦ Need hundreds of threads/core in flight to cover DRAM latency
5
Little’s Law
◦ The number of threads needed is proportional to average latency
◦ Opportunity cost in on-chip resources
Thread contexts
In flight memory accesses
Too many threads – negative feedback
◦ Adding threads to cover latency increases latency
◦ Slower register access, thread scheduling
◦ Reduced Locality
Reduces bandwidth and DRAM efficiency
Reduces effectiveness of caching
◦ Parallel starvation
6
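As a concrete illustration of the Little's Law point above, the sketch below computes roughly how many threads a core needs in flight for a given average latency; the issue rate and latencies are assumed placeholder values, not measured ones.

```cpp
#include <cstdio>

// Little's Law: concurrency = throughput x latency, so the threads a core
// needs in flight scale with the average instruction latency it must cover.
static double threads_needed(double target_ops_per_cycle, double avg_latency_cycles) {
    return target_ops_per_cycle * avg_latency_cycles;
}

int main() {
    // Assumed values: 70% arithmetic intensity, 4-cycle ALU ops,
    // 400-cycle DRAM-bound memory ops, target of 1 instruction/cycle.
    const double A = 0.7;
    const double l_avg = A * 4.0 + (1.0 - A) * 400.0;
    std::printf("average latency = %.1f cycles, threads needed = %.0f\n",
                l_avg, threads_needed(1.0, l_avg));
    return 0;
}
```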
Introduction
The Problem
◦ Throughput Architectures
◦ Dissertation Goals
The Solution
◦ Modeling Throughput Performance
◦ Architectural Enhancements
Thread Scheduling
Cache Policies
◦ Methodology
Proposed Work
7
Problem: Too Many Threads!
◦ Increase parallel efficiency, i.e.
Reduce the number of threads needed for a given level of performance
This improves throughput performance
Apply low latency caches
◦ Leverage upwards spiral
Difficult to mix multithreading and caching
Typically used just for bandwidth amplification
◦ Important ancillary factors
Thread scheduling
Instruction Scheduling (per thread parallelism)
8
Quantifying the impact of single thread performance on throughput performance
Developing a mathematical analysis of throughput performance
Building a novel hybrid-trace based simulation infrastructure
Demonstrating unique architectural enhancements in thread scheduling and cache policies
9
Introduction
The Problem
◦ Throughput Architectures
◦ Dissertation Goals
The Solution
◦ Modeling Throughput Performance
Cache Performance
The Valley
◦ Architectural Enhancements
Thread Throttling
Cache Policies
◦ Methodology
Proposed Work
10
Why take a mathematical approach?
◦ Be very precise about what we want to optimize
◦ Understand the relationships and sensitivities of throughput performance to:
Single thread performance
Cache improvements
Application characteristics
◦ Rapid evaluation of design space
◦ Suggest most fruitful architectural improvements
11
Modeling Throughput Performance
P_CHIP = total throughput performance
P_ST = single thread performance
N_T = total active threads
L_AVG = average instruction latency

P_CHIP = N_T × P_ST
P_ST ((Ins/Sec)/Thread) = ILP (Ins/Thread) / L_AVG (Sec)
Power_CHIP = E_AVG (Joules/Ins) × P_CHIP

How can caches help throughput performance?
12
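A minimal sketch that just encodes the identities above; the struct and field names are ours, and any numbers plugged in would be illustrative.

```cpp
// P_chip = N_T x P_ST, P_ST = ILP / L_avg, Power_chip = E_avg x P_chip.
struct ThroughputModel {
    double ilp;    // instructions per thread in flight (Ins/Thread)
    double l_avg;  // average instruction latency (Sec)
    double e_avg;  // average energy per instruction (Joules/Ins)

    double p_st() const { return ilp / l_avg; }                          // (Ins/Sec)/Thread
    double p_chip(double n_t) const { return n_t * p_st(); }             // Ins/Sec
    double power_chip(double n_t) const { return e_avg * p_chip(n_t); }  // Watts
};
```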
Area comparison:
◦ One FPU = 2-11 KB SRAM, or 8-40 KB eDRAM
FMADD (FPU):
◦ Active power: 20 pJ/op
◦ Leakage power: 1 W/mm²
SRAM:
◦ Active power: 50 pJ per L1 access, 1.1 nJ per L2 access
◦ Leakage power: 70 mW/mm²
Leakage power comparison:
◦ One FPU = ~64 KB SRAM / 256 KB eDRAM
Caches make loads 150x faster and 300x more energy efficient, and use 10-15x less power/mm² than FPUs
Key: How much cache does a thread need?
13
Ignore changes to DRAM latency & off-chip BW
◦ We will simulate these
Assume ideal caches (frequency)
What is the maximum performance benefit?
A = Arithmetic intensity of application (fraction of non-memory instructions)
N_T = Total active threads on chip
L = Latency
Latency improvement: ΔL(c) = (1 − A) · (L_miss − L_hit) · H(c)
Memory intensity: M = 1 − A
For power, replace L with E, the average energy per instruction
Qualitatively identical, but differences more dramatic
14
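As a quick worked instance of the latency-improvement expression above (all numbers assumed for illustration: A = 0.7, L_miss = 400 cycles, L_hit = 20 cycles, H(c) = 0.8):

```latex
\Delta L(c) = (1 - A)\,(L_{miss} - L_{hit})\,H(c)
            = 0.3 \times 380 \times 0.8 \approx 91 \text{ cycles}
```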
Hit rate depends on amount of cache, application working set
Store items used the most times
◦ This is the concept of “frequency”
Once we know an application’s memory access characteristics, we can model throughput performance
15
Define F(c) (access density), H(c) (hit rate), and P_ST(c):

H(c) = ∫₀^c F(c′) dc′

P_ST(c) = 1 / (L_NC − L_CI)
        = 1 / ((A · L_ALU + M · L_MISS) − M · (L_MISS − L_HIT) · H(c))
16
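A small sketch of the single-thread performance expression above as code; the default latencies are assumptions, and H is supplied by the caller (e.g., a measured hit-rate profile).

```cpp
#include <functional>

// P_ST(c) = 1 / (L_NC - L_CI)
//         = 1 / ((A*L_ALU + M*L_MISS) - M*(L_MISS - L_HIT)*H(c)),  M = 1 - A
double p_st(double c,                                   // cache per thread (fraction of working set)
            double A,                                   // arithmetic intensity
            const std::function<double(double)>& H,     // cumulative hit-rate profile H(c)
            double L_alu = 4.0, double L_hit = 20.0, double L_miss = 400.0) {
    const double M = 1.0 - A;
    const double l_avg = (A * L_alu + M * L_miss) - M * (L_miss - L_hit) * H(c);
    return 1.0 / l_avg;                                 // instructions per cycle per thread
}
```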
[Figure: Hit Rate(c) vs. cache per thread (fraction of working set); Hit Rate(N_T) and P_ST(N_T) vs. total active threads]
For a flat access profile:
F(c) = const = 1 / WorkingSet
H(c) = ∫ F(c) dc = c
H(N_T) = 1 / N_T, so ∂H/∂N_T = −1 / N_T²
P_ST(N_T) is a steep reciprocal
17
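Putting the flat-profile pieces together, the sketch below sweeps thread count and evaluates P_chip = N_T × P_ST with H(c) = c and c = total cache / (N_T × working set); the cache size, working set, and latencies are all assumed values, but the resulting curve exhibits the cache regime, the valley, and the MT regime.

```cpp
#include <algorithm>
#include <cstdio>

int main() {
    // Assumed parameters for illustration only.
    const double A = 0.7, M = 1.0 - A;
    const double L_alu = 4.0, L_hit = 20.0, L_miss = 400.0;
    const double cache_bytes = 1.0 * 1024 * 1024;   // 1 MB shared cache
    const double working_set = 64.0 * 1024;         // 64 KB working set per thread

    for (int n_t = 16; n_t <= 2048; n_t *= 2) {
        // Flat access profile: H(c) = c, where c is the resident fraction
        // of each thread's working set.
        const double c = std::min(1.0, cache_bytes / (n_t * working_set));
        const double h = c;
        const double l_avg = (A * L_alu + M * L_miss) - M * (L_miss - L_hit) * h;
        const double p_chip = n_t / l_avg;           // N_T x P_ST, in ins/cycle
        std::printf("N_T=%5d  H=%.3f  P_chip=%6.2f ins/cycle\n", n_t, h, p_chip);
    }
    return 0;
}
```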
Introduction
The Problem
◦ Throughput Architectures
◦ Dissertation Goals
The Solution
◦ Modeling Throughput Performance
Cache Performance
“The Valley”
◦ Architectural Enhancements
Thread Throttling
Cache Policies
◦ Methodology
Proposed Work
18
P_CHIP = N_T × P_ST: the product of N_T = 1/c and P_ST (flat access); high performance requires both factors
19
[Figure: throughput vs. active threads showing the cache regime, the valley, and the MT regime; valley width, with cache and no-cache curves]
20
Hong et al, 2009, 2010
◦ Simple, cacheless GPU models
◦ Used to predict “MT peak”
Guz et al, 2008, 2010
◦ Graphed throughput performance with assumed cache profile
◦ Identified “valley” structure
◦ Validated against PARSEC benchmarks
◦ No mathematical analysis
◦ Minimal focus on bandwidth limited regime
◦ CMP benchmarks
Galal et al, 2011
◦ Excellent mathematical analysis
◦ Focused on FPU+Register design
21
[Figure: the cache regime, valley, and MT regime with the valley width between operating points, shown alongside the hit-rate profile H(c)]
22
! " #$ #%&'( ) * &+,- ,. /
! " #$%&'%(&) *) +', - -
! " #$%&'/0 12 2
0 56) '57) +, ( - 8''. '9 9
1: : ) 88'; <='>?10
0 56) '57) +, ( - ', : +588': A%7
B CC#: A%7'D %+)
?) , - 'C+59 '2 ?10
E'=%FF'2 , FFGH'IJ 2 J >'<) G( 5&) H'34. .
! "#$%&' ( ) %$( &' *+,' *' - . &( /0*+,12-
!" #$ %&
!) #$ %&
!, #$ %&
- . / 0 #122344
) '
, ( (
"
*
' (
+'
" ( ( (
" *( ( (
.
34
3!
@4
. 444
@44
. ! 444
23
[Figure: Energy/Op and hit rate vs. total active threads, with the hit-rate profile H(c)]
24
Introduction
The Problem
◦ Throughput Architectures
◦ Dissertation Goals
The Solution
◦ Modeling Throughput Performance
Cache Performance
The Valley
◦ Architectural Enhancements
Thread Throttling
Cache Policies
◦ Methodology
Proposed Work
25
Have real time information:
◦ Arithmetic intensity
◦ Bandwidth utilization
◦ Current hit rate
Conservatively approximate locality
Approximate optimum operating points
Shut down / Activate threads to increase performance
◦ Concentrate power and overclock
◦ Clock off unused cache if no benefit
26
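The sketch below is one hypothetical way the runtime counters listed on this slide could drive a throttling decision; it is not the proposal's actual algorithm, and the thresholds and names are invented for illustration.

```cpp
// Per-epoch throttling decision driven by the counters named on this slide.
struct EpochStats {
    double arithmetic_intensity;    // fraction of non-memory instructions
    double bandwidth_utilization;   // fraction of peak off-chip bandwidth in use
    double hit_rate;                // current cache hit rate
};

enum class Action { AddThreads, RemoveThreads, Hold };

Action throttle(const EpochStats& s) {
    // Off-chip bandwidth saturated: extra threads cannot add throughput,
    // so back off and concentrate power (or overclock the remaining threads).
    if (s.bandwidth_utilization > 0.95) return Action::RemoveThreads;
    // Cache regime: hit rate is high, so keep the thread count low enough
    // that working sets stay resident.
    if (s.hit_rate > 0.80) return Action::Hold;
    // Otherwise latency is exposed and bandwidth is available: add threads.
    return Action::AddThreads;
}
```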
Many studies in CMP and GPU area scale back threads
◦ CMP – miss rates get too high
◦ GPU – off-chip bandwidth is saturated
◦ A single target that is simple to hit; throttling is unidirectional
Valley is much more complex
◦ Two points to hit
◦ Three different operating regimes
Mathematical analysis lets us approximate both points with as little as two samples
Both off-chip bandwidth and reciprocal of hit rate are nearly linear for a wide range of applications
27
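A sketch of the two-sample approximation: treating bandwidth utilization (and, separately, the reciprocal of hit rate) as roughly linear in thread count, two measurements fix a line that can be extrapolated to each regime boundary. The sample values and the saturation target below are assumptions.

```cpp
// Fit y = a + b*n through two samples (n1, y1), (n2, y2) and return the n
// at which y reaches the given target.
static double solve_linear(double n1, double y1, double n2, double y2, double target) {
    const double b = (y2 - y1) / (n2 - n1);
    const double a = y1 - b * n1;
    return (target - a) / b;
}

// Example: extrapolate the thread count at which off-chip bandwidth saturates
// (utilization = 1.0). The same helper applies to 1/hit_rate to estimate the
// edge of the cache regime.
const double mt_peak_threads =
    solve_linear(/*n1=*/128, /*y1=*/0.40, /*n2=*/256, /*y2=*/0.75, /*target=*/1.0);
```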
[Figure: throughput vs. active threads showing the cache regime, the valley width, and the MT regime, with cache and no-cache curves]
28
Introduction
The Problem
◦ Throughput Architectures
◦ Dissertation Goals
The Solution
◦ Modeling Throughput Performance
Cache Performance
The Valley
◦ Architectural Enhancements
Thread Throttling
Cache Policies (Indexing, replacement)
◦ Methodology
Proposed Work
29
Caches need to work like an LFU (least frequently used) cache
◦ Hard to implement in practice
Still very little cache per thread
◦ Policies make big differences for small caches
◦ Associativity a big issue for small caches
Cannot cache every line referenced
◦ Beyond “dead line” prediction
◦ Stream lines with lower reuse
30
Contribution – Odd Set Indexing
Conflict misses are a pathological issue
◦ Most often seen with power-of-2 strides
◦ Idea: map to 2^N − 1 sets/banks instead
True "Silver Bullet"
◦ Virtually eliminates conflict misses in every setting we've tried
Reduced scratchpad banks from 32 to 7 at the same level of bank conflicts
Fastest, most efficient implementation
◦ Adds just a few gate delays
◦ Logic area < 4% of a 32-bit integer multiplier
◦ Can still access the last bank
31
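One way a mod-(2^N − 1) index can be computed with only a few small adders is the digit-summing identity (2^N ≡ 1 mod 2^N − 1); the sketch below is our illustration of that trick, not necessarily the exact circuit in the proposal, and the 3-bit (7-set) default matches the scratchpad example above.

```cpp
#include <cstdint>

// Set index = addr mod (2^N - 1), computed by summing N-bit digits of the
// address (each digit add is small, so hardware needs only a few gate delays).
uint32_t odd_set_index(uint32_t addr, unsigned n_bits = 3) {
    const uint32_t mod = (1u << n_bits) - 1;     // e.g. n_bits = 3 -> 7 sets/banks
    uint32_t x = addr;
    while (x > mod) {
        uint32_t sum = 0;
        for (uint32_t v = x; v != 0; v >>= n_bits)
            sum += v & mod;                      // add the next N-bit digit
        x = sum;                                 // 2^N == 1 (mod 2^N - 1) keeps the residue
    }
    return (x == mod) ? 0 : x;                   // fold the value 2^N - 1 down to 0
}
```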
[Figure: PARSEC L2 with 64 threads, DM with prime banking]
32
A prime number of banks/sets was thought ideal
◦ No efficient implementation existed
◦ Mersenne primes not so convenient; the 2^N − 1 choices are 3, 7, 15, 31, 63, 127, 255, and not all are prime
We demonstrated that primality wasn't an issue
Yang, ISCA ‘92 - prime strides for vector computers
Showed 3x speedup
We get correct offset for free
Kharbutli, HPCA 04 – showed prime sets as hash function for caches worked well
Our implementation faster, more features
Couldn’t appreciate benefits for SPEC
33
Not all data should be cached
◦ Recent papers for LLC caches
◦ Hard drive cache algorithms
Frequency over Recency
◦ Frequency hard to implement
◦ ARC good compromise
Direct Mapping Replacement dominates
◦ Look for explicit approaches
◦ Priority Classes
◦ Epochs
34
Belady – solved it all
◦ Three hierarchies of methods
◦ Best utilized information of prior line usage
◦ Light on implementation details
Approximations
◦ Hallnor & Reinhardt, ISCA 2000
Generational Replacement
◦ Megiddo &amp; Modha, USENIX FAST 2003, ARC cache
ghost entries
recency and frequency groups
◦ Qureshi, 2006, 2007 – Adaptive Insertion policies
◦ Multi-queue, LRU-K, D-NUCA, etc.
35
Introduction
The Problem
◦ Throughput Architectures
◦ Dissertation Goals
The Solution
◦ Modeling Throughput Performance
Cache Performance
The Valley
◦ Architectural Enhancements
Thread Throttling
Cache Policies (Indexing, replacement)
◦ Methodology (Applications, Simulation)
Proposed Work
36
Initially studied regular HPC kernels/applications in CMP environment
◦ Dense Matrix Multiply
◦ Fast Fourier Transform
◦ Homme weather simulation
Added CUDA throughput benchmarks
◦ Parboil – old school MPI, coarse grained
◦ Rodinia – fine grained, varied
Benchmarks typical of historical GPGPU applications
Will add irregular benchmarks
◦ SparseMM, Adaptive Finite Elements, Photon mapping
37
[Figure: dynamic instruction mix per benchmark: operations, scratchpad accesses, cache accesses]
38
Most of the benchmarks should benefit:
◦ Small working sets
◦ Concentrated working sets
◦ Hit rate curves easy to predict
39
[Figure: typical concentration of locality; HWT working set per task, plotting working set size (bytes), reuse count, and hit rate]
40
[Figure: fraction of scratchpad accesses to a single address vs. 2-128 addresses]
41
Hybrid trace-based simulation flow (C++/CUDA; simulates a different architecture than the one traced):
C++/CUDA source → NVCC → PTX intermediate assembly listing → dynamic trace blocks with attachment points → modified Ocelot functional simulator with a custom trace module → compressed trace data → custom simulator
Goals: fast simulation; overcome compiler issues for a reasonable base case
42
Introduction
The Problem
◦ Throughput Architectures
◦ Dissertation Goals
The Solution
◦ Modeling Throughput Performance
Cache Performance
The Valley
◦ Architectural Enhancements
Thread Throttling
Cache Policies (Indexing, replacement)
◦ Methodology (Applications, Simulation)
Proposed Work
43
Looked at GEMM, FFT & Homme in CMP setting
◦ Learned implementation algorithms, alternative algorithms
◦ Expertise allows for credible throughput analysis
◦ Valuable Lessons in multithreading and caching
Dense Matrix Multiply
◦ Blocking to maximize arithmetic intensity
◦ Enough contexts to cover latency
Fast Fourier Transform
◦ Pathologically hard on memory system
◦ Communication & synchronization
HOMME – weather modeling
◦ Intra-chip scaling incredibly difficult
◦ Memory system performance variation
◦ Replacing data movement with computation
First author publications:
◦ PPoPP 2008, ISPASS 2011 (Best Paper)
44
Phase 2 – Benchmark Characterization
Memory access characteristics of Rodinia and Parboil benchmarks
Apply Mathematical Analysis
◦ Validate model
◦ Find optimum operating points for benchmarks
◦ Find optimum TA topology for benchmarks
NEARLY COMPLETE
45
Phase 3 – Evaluate Enhancements
Automatic Thread Throttling
Low latency hierarchical cache
Benefits of odd-sets/odd-banking
Benefits of explicit placement (Priority/Epoch)
NEED FINAL EVALUATION and explicit placement study
46
Study regular HPC applications in throughput setting
Add at least two irregular benchmarks
◦ Less likely to benefit from caching
◦ New opportunities for enhancement
Explore impact of future TA topologies
◦ Memory Cubes, TSV DRAM, etc.
47
Dissertation Goals:
◦ Quantify the degree to which single thread performance affects throughput performance for an important class of applications
◦ Improve parallel efficiency through thread scheduling, cache topology, and cache policies
Feasibility
◦ Regular Benchmarks show promising memory behavior
◦ Cycle accurate simulator nearly completed
48
Phase 1 – HPC applications – completed
Phase 2 – Mathematical model &amp; benchmark characterization
◦ MAY-JUNE
Phase 3 – Architectural Enhancements
◦ JULY-AUGUST
Phase 4 – Domain enhancement / new features
◦ September-November
49
50
51
52
53
[Figure: PNS working set per task, plotting working set size (bytes) and hit rate]
54
[Figure: priority scheduling with 8 warps; per-warp instruction share (Warp0 through Warp7)]
55
Introduction
◦ Throughput Architectures - The Problem
◦ Dissertation Overview
Modeling Throughput Performance
◦ Throughput
◦ Caches
◦ The Valley
Methodology
Architectural Enhancements
◦ Thread Scheduling
◦ Cache Policies
Odd-set/Odd-bank caches
Placement Policies
◦ Cache Topology
Dissertation Timeline
56
Modeling Throughput Performance
N_T = Total Active Threads
P_CHIP = Total Throughput Performance
P_ST = Single Thread Performance
L_AVG = Average Latency per Instruction

P_CHIP = N_T × P_ST
P_ST ((Ins/Sec)/Thread) = ILP (Ins/Thread) / L_AVG (Sec)
Power_CHIP = E_AVG (Joules/Ins) × P_CHIP
57
Looked at GEMM, FFT & Homme in CMP setting
◦ Learned implementation algorithms, alternative algorithms
◦ Expertise allows for credible throughput analysis
◦ Valuable Lessons in multithreading and caching
Dense Matrix Multiply
◦ Blocking to maximize arithmetic intensity
◦ Need enough contexts to cover latency
Fast Fourier Transform
◦ Pathologically hard on memory system
◦ Communication & synchronization
HOMME – weather modeling
◦ Intra-chip scaling incredibly difficult
◦ Memory system performance variation
◦ Replacing data movement with computation
Most significant publications:
◦ PPoPP 2008, ISPASS 2011 (Best Paper)
58
[Figure: 32-bank scratchpad, bank conflicts]
59
Introduction
◦ Throughput Architectures - The Problem
◦ Dissertation Overview
Modeling Throughput Performance
◦ Throughput
◦ Caches
◦ The Valley
Methodology
Architectural Enhancements
◦ Thread Scheduling
◦ Cache Policies
Odd-set/Odd-bank caches
Placement Policies
◦ Cache Topology
Dissertation Timeline
60
Computation is cheap, data movement is expensive:
! " #$#%&'( ! ) &*+%+, -
! " #$%&'%(&) *) +', - -
! " #$%&'/0 12 2
0 56) '57) +, ( - 8''. '9 9
1: : ) 88'; <='>?10
0 56) '57) +, ( - ', : +588': A%7
B CC#: A%7'D%+)
?) , - 'C+59 '2 ?10
* Bill Dally, IPDPS Keynote, 2011
. 444
@44
. ! 444
.
34
3!
@4
Exponential growth in cores saturates off-chip bandwidth
- Performance capped
Latency to off-chip DRAM now hundreds of cycles
- Need hundreds of threads per core to mask
61
Introduction
◦ Throughput Architectures - The Problem
◦ Dissertation Overview
Modeling Throughput Performance
◦ Throughput
◦ Caches
◦ The Valley
Methodology
Architectural Enhancements
◦ Thread Scheduling
◦ Cache Policies
Odd-set/Odd-bank caches
Placement Policies
◦ Cache Topology
Dissertation Timeline
62
Socket power economically capped
DARPA's UHPC Exascale Initiative:
◦ Supercomputers now power capped
◦ 10-20x power efficiency by 2017
◦ Supercomputing Moore’s Law:
Double power efficiency every year
“Post-PC” client era requires >20x power efficiency of desktop
Even Throughput Architectures aren’t efficient enough!
63
[Figure: BackProp CPI at 8 warps; series: CMP-L1, RELAXED-L1, FERMI-L1, CMP-DRAM, RELAXED-DRAM, FERMI-DRAM]
64
[Figure: memory accesses per instruction per benchmark: scratchpad vs. cache]
65
Introduction
◦ Throughput Architectures - The Problem
◦ Dissertation Overview
Modeling Throughput Performance
◦ Throughput
◦ Caches
◦ The Valley
Methodology
Architectural Enhancements
◦ Thread Scheduling
◦ Cache Policies
Odd-set/Odd-bank caches
Placement Policies
◦ Cache Topology
Dissertation Timeline
66
Mathematical Analysis
Architectural algorithms
Benchmark Characterization
Nearly finished full chip simulator
◦ Currently simulates one core at a time
Almost ready to publish 2 papers…
67
Benchmark Characterization
(May-June)
Latency Sensitivity with cache feedback, multiple blocks per core
Global caching, BW across cores
Validate mathematical model with benchmarks
Compiler Controls
68
Architectural Evaluation
(July-August)
Priority Thread Scheduling
Automatic Thread Throttling
Optimized Cache Topology
◦ Low latency / fast path
◦ Odd-set banking
◦ Explicit Epoch placement
69
Extending the Domain (Sep-Nov)
Extend benchmarks
◦ Port HPC applications/kernels to throughput environment
◦ Add at least two irregular applications
E.g. Sparse MM, Photon Mapping, Adaptive Finite Elements
Extend topologies, enhancements
◦ Explore design space of emerging architectures
◦ Examine optimizations beneficial to irregular applications
70
71
Mathematical Analysis of Throughput Performance
◦ Caching, saturated bandwidth, sensitivities to application characteristics, latency
Quantify Importance of Single Thread Latency
Demonstrate novel enhancements
◦ Valley based thread throttling
◦ Priority Scheduling
◦ Subcritical Caching Techniques
72
73
74
[Figure: FAL64 vs. DMPB64 vs. DM64 (percentage scale)]
75
[Figure: DRAM cycle utilization for base, 2/3/4 striped banks, and 256B line configurations]
76
[Figure: single task fraction of scratchpad]
77
78
79
Assume ideal caches
Ignore changes to DRAM latency &amp; off-chip BW
80