Orchestrated Scheduling and Prefetching for GPGPUs
Adwait Jog, Onur Kayiran, Asit Mishra, Mahmut Kandemir, Onur Mutlu, Ravi Iyer, Chita Das

Parallelize your code! Launch more threads!

[Figure: The warp scheduler sits among several latency-tolerance techniques: multi-threading, caching, prefetching, and main memory. Each has its own improvement lever: better cache replacement policies, a better prefetcher, and better memory scheduling policies (which must look deep into the future, if they can). Is the warp scheduler aware of these techniques? Prior work made the scheduler aware of caching (Cache-Conscious Scheduling, MICRO'12), multi-threading (Two-Level Scheduling, MICRO'11), and main memory (Thread-Block-Aware Scheduling, OWL, ASPLOS'13). Prefetching? That is the open question.]
Our Proposal

- Prefetch-Aware Warp Scheduler
- Goals:
  - Make a simple prefetcher more capable
  - Improve system performance by orchestrating scheduling and prefetching mechanisms
- 25% average IPC improvement over Prefetching + Conventional Warp Scheduling Policy
- 7% average IPC improvement over Prefetching + Best Previous Warp Scheduling Policy
Outline

- Proposal
- Background and Motivation
- Prefetch-aware Scheduling
- Evaluation
- Conclusions
High-Level View of a GPU

[Figure: Threads are grouped into warps (W), and warps into Cooperative Thread Arrays (CTAs), also called thread blocks. Each Streaming Multiprocessor (SM) contains a warp scheduler, L1 caches, a prefetcher, and ALUs. The SMs connect through an interconnect to a shared L2 cache and DRAM.]
Warp Scheduling Policy

- All warps have equal scheduling priority
  - Round-Robin (RR) execution
- Problem: warps stall at roughly the same time

[Figure: Timeline under RR. Warps W1-W8 execute Compute Phase (1) together, then all issue their DRAM requests (D1-D8) at about the same time. The SIMT core stalls until the requests return, after which W1-W8 begin Compute Phase (2).]
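To make the policy concrete, here is a minimal round-robin warp selection sketch (illustrative only, with hypothetical names; real hardware selects among ready warps every cycle):

```python
# Round-robin warp scheduler sketch (illustrative, not the talk's hardware).
# Each cycle, issue from the next ready warp after the last-issued warp.
class RoundRobinScheduler:
    def __init__(self, num_warps):
        self.num_warps = num_warps
        self.last = num_warps - 1  # so warp 0 is tried first

    def pick(self, ready):
        """ready: set of warp ids not currently stalled on memory."""
        for i in range(1, self.num_warps + 1):
            w = (self.last + i) % self.num_warps
            if w in ready:
                self.last = w
                return w
        return None  # every warp is stalled: the SIMT core idles

sched = RoundRobinScheduler(8)
print(sched.pick({0, 1, 2}))  # -> 0
print(sched.pick({1, 2}))     # -> 1
```

Because every warp gets equal priority, all warps make similar progress and reach their memory phase together, which is exactly the stall pattern in the figure above.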
Two-Level (TL) Scheduling

[Figure: Timelines comparing RR and TL. Under RR, W1-W8 compute, stall together on D1-D8, then compute again. Under TL, the warps are split into Group 1 (W1-W4) and Group 2 (W5-W8): Group 1 issues its DRAM requests (D1-D4) while Group 2 is still in Compute Phase (1), and Group 2's requests (D5-D8) overlap with Group 1's Compute Phase (2). Overlapping one group's memory latency with the other group's computation saves cycles relative to RR.]
[Figure: DRAM behavior of the two policies. Under RR, all eight warps access memory together: consecutive rows X, X+1, X+2, X+3 are served by Bank 1 while rows Y, Y+1, Y+2, Y+3 are served by Bank 2, giving high bank-level parallelism and high row buffer locality. Under TL, Group 1 (W1-W3) accesses Bank 1 while Bank 2 sits idle for a period, and later Group 2 (W4-W8) accesses Bank 2: row buffer locality stays high, but bank-level parallelism is low.]
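The two-level idea itself is simple to sketch (a minimal sketch with assumed structure; the MICRO'11 design is more involved and also manages fetch groups):

```python
# Two-level scheduling sketch (illustrative). Warps are partitioned into
# groups; only the active group competes for issue, and the scheduler
# switches groups once every warp in the active group has stalled.
class TwoLevelScheduler:
    def __init__(self, num_warps, group_size):
        self.groups = [list(range(g, min(g + group_size, num_warps)))
                       for g in range(0, num_warps, group_size)]
        self.active = 0

    def pick(self, ready):
        for _ in range(len(self.groups)):
            for w in self.groups[self.active]:  # first ready warp in group
                if w in ready:
                    return w
            # whole active group stalled -> switch to the next group
            self.active = (self.active + 1) % len(self.groups)
        return None

sched = TwoLevelScheduler(8, 4)  # Group 1: W0-W3, Group 2: W4-W7
print(sched.pick({4, 5}))        # Group 1 stalled -> issues from Group 2
```

Keeping one group computing while the other group waits on memory is what produces the overlapped timeline above, at the cost of serializing the groups' DRAM bursts.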
Warp Scheduler Perspective (Summary)

                     Forms Multiple   DRAM Bandwidth Utilization
  Warp Scheduler     Warp Groups?     Bank-Level Parallelism   Row Buffer Locality
  Round-Robin (RR)        ✖                    ✔                       ✔
  Two-Level (TL)          ✔                    ✖                       ✔
Evaluating RR and TL Schedulers

[Figure: IPC improvement factor with a perfect L1 cache (y-axis 0-7) across SSC, PVC, KMN, SPMV, BFSR, FFT, SCP, BLK, FWT, JPEG, and GMEAN. The geometric-mean gap to a perfect L1 is 2.20X under RR and 1.88X under TL. Can we further reduce this gap? Via prefetching?]
(1) Prefetching: Saves More Cycles

[Figure: Timelines comparing (A) TL without prefetching and (B) TL with prefetching. Without prefetching, Group 2's demands D5-D8 are issued only after Group 2's Compute Phase (1). With prefetching, prefetch requests P5-P8 for Group 2's data (addresses X, X+1, ..., Y+3) are issued while Group 1's demands D1-D4 are outstanding, so Group 2 can start Compute Phase (2) earlier, saving cycles beyond both RR and plain TL.]
(2) Prefetching: Improves DRAM Bandwidth Utilization

[Figure: With prefetching under TL, the bank idle period disappears: while Group 1's demands access rows X..X+3 in Bank 1, prefetch requests on behalf of Group 2 access rows Y..Y+3 in Bank 2. The result is both high bank-level parallelism and high row buffer locality, with no idle period.]
Challenge: Designing a Prefetcher

[Figure: To generate Group 2's addresses (Y, Y+1, ...) in Bank 2 from Group 1's accesses (X, X+1, ...) in Bank 1, the prefetcher must predict far-away addresses, which requires a sophisticated prefetcher.]
Our Goal

- Keep the prefetcher simple, yet get the performance benefits of a sophisticated prefetcher.
- To this end, we will design a prefetch-aware warp scheduling policy.
- Why? A simple prefetcher does not improve performance with existing scheduling policies.
Simple Prefetching + RR Scheduling

[Figure: Timeline under RR with a simple prefetcher. Demands D1, D3, D5, D7 are issued alongside prefetches P2, P4, P6, P8, but each prefetch fully overlaps the demand it was meant to hide (P2 overlaps with D2, P4 with D4: late prefetches). Compute Phase (2) starts no earlier than under plain RR, so no cycles are saved.]
Simple Prefetching + TL Scheduling

[Figure: The same problem recurs within each TL group: Group 1's prefetches P2, P4 overlap its own demands D2, D4 (late prefetches), and likewise for Group 2's P6, P8. TL still saves cycles over RR, but prefetching saves no additional cycles over plain TL, because the consecutive warps that could prefetch for each other are scheduled close together in time.]
Let's Try Prefetching Further Ahead

[Figure: One attempt with TL scheduling: make the simple prefetcher fetch X + 4 instead of X + 1, hoping to cover Group 2's data from Group 1's accesses. But Group 2's addresses (Y, Y+1, ...) may not be equal to X + 4: the result is useless prefetches (UP1-UP4) that waste bandwidth, while Bank 2 still sits idle for a period.]
[Figure: Timeline of far-ahead prefetching under TL. Group 1's demands D1-D4 are accompanied by useless prefetches U5-U8, and Group 2 must still issue its own demands D5-D8 after its Compute Phase (1). No cycles are saved over plain TL.]
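For concreteness, the "simple prefetcher" in question is of the next-line variety; a sketch with a configurable distance (assumed line size and names, not the talk's exact hardware) shows why distance alone cannot fix timeliness:

```python
# Next-line prefetcher sketch (illustrative). On a demand access to cache
# line X, prefetch line X + distance. With distance = 1 the prefetch is
# accurate but late under RR/TL; with distance = 4 it may be timely but
# useless, since the other group's addresses need not equal X + 4.
CACHE_LINE = 128  # bytes (assumed line size)

def next_line_prefetch(demand_addr, distance=1):
    line = demand_addr // CACHE_LINE
    return (line + distance) * CACHE_LINE

print(hex(next_line_prefetch(0x1000)))     # 0x1080: next line (late)
print(hex(next_line_prefetch(0x1000, 4)))  # 0x1200: 4 ahead (maybe useless)
```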
Warp Scheduler Perspective (Summary)

                     Forms Multiple   Simple Prefetcher   DRAM Bandwidth Utilization
  Warp Scheduler     Warp Groups?     Friendly?           Bank-Level Parallelism   Row Buffer Locality
  Round-Robin (RR)        ✖                 ✖                      ✔                      ✔
  Two-Level (TL)          ✔                 ✖                      ✖                      ✔
Our Goal

- Keep the prefetcher simple, yet get the performance benefits of a sophisticated prefetcher.
- To this end, we will design a prefetch-aware warp scheduling policy.
- A simple prefetcher does not improve performance with existing scheduling policies.

Prefetch-Aware (PA) Warp Scheduler + Simple Prefetcher ≈ Sophisticated Prefetcher
Prefetch-Aware (PA) Warp Scheduling

[Figure: Group formation under the three policies. Round-robin scheduling keeps all warps together; two-level scheduling places consecutive warps in the same group (Group 1 = W1-W4, Group 2 = W5-W8). Prefetch-aware scheduling instead forms groups from non-consecutive warps, e.g., Group 1 = {W1, W3, W5, W7} and Group 2 = {W2, W4, W6, W8}, so the warps that share spatial locality (consecutive addresses X, X+1, X+2, X+3 and Y, Y+1, Y+2, Y+3) land in different groups.]

- Non-consecutive warps are associated with one group.
- See paper for the generalized algorithm of the PA scheduler.
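The grouping idea can be sketched in a few lines (illustrative; the paper gives the generalized algorithm): assign consecutive warps to different groups, e.g., by warp-id parity.

```python
# Prefetch-aware grouping sketch (illustrative): consecutive warps go to
# different groups, so a warp in one group can prefetch the next line
# for its consecutive neighbor in the other group.
def pa_groups(num_warps, num_groups=2):
    groups = [[] for _ in range(num_groups)]
    for w in range(num_warps):
        groups[w % num_groups].append(w)
    return groups

print(pa_groups(8))  # [[0, 2, 4, 6], [1, 3, 5, 7]]
# vs. TL-style consecutive grouping: [[0, 1, 2, 3], [4, 5, 6, 7]]
```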
Simple Prefetching with PA Scheduling

- The reasoning behind non-consecutive warp grouping is that the groups can prefetch for each other using a simple prefetcher: Group 1's warps prefetch for Group 2's warps.

[Figure: Group 1 (W1, W3, W5, W7) issues demands to X, X+2 in Bank 1 and Y, Y+2 in Bank 2; the simple prefetcher fetches X+1, X+3, Y+1, Y+3 on behalf of Group 2 (W2, W4, W6, W8). When Group 2 runs later, its accesses are cache hits!]
Simple Prefetching with PA Scheduling (Timeline)

[Figure: (A) PA without prefetching behaves like TL: Group 1's demands D1, D3, D5, D7 overlap Group 2's Compute Phase (1), saving cycles over RR. (B) PA with prefetching additionally issues prefetches P2, P4, P6, P8 alongside Group 1's demands, so Group 2 can start Compute Phase (2) even earlier: cycles are saved over TL as well!]
DRAM Bandwidth Utilization

- 18% increase in bank-level parallelism
- 24% decrease in row buffer locality

[Figure: Under PA with simple prefetching, Bank 1 serves rows X..X+3 while Bank 2 concurrently serves rows Y..Y+3: high bank-level parallelism and high row buffer locality.]
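To make these two metrics concrete, here is one simple way to estimate them from a request trace (a sketch with assumed, simplified definitions; the paper's measurement methodology may differ):

```python
# Sketch of the two DRAM metrics (assumed, simplified definitions).
def row_buffer_locality(requests):
    """Fraction of requests hitting the currently open row of their bank.
    requests: list of (bank, row) pairs in service order."""
    open_row, hits = {}, 0
    for bank, row in requests:
        if open_row.get(bank) == row:
            hits += 1
        open_row[bank] = row
    return hits / len(requests)

def bank_level_parallelism(windows):
    """Average number of distinct banks busy per time window.
    windows: list of lists of bank ids with outstanding requests."""
    return sum(len(set(w)) for w in windows) / len(windows)

print(row_buffer_locality([(0, 5), (0, 5), (1, 9), (0, 5), (1, 9)]))  # 0.6
print(bank_level_parallelism([[0, 1], [0], [0, 1]]))                  # ~1.67
```

Interleaving the two groups' row streams across banks raises the first metric's denominator coverage (more banks busy at once) while splitting each row's accesses in time, which is why PA trades some row buffer locality for bank-level parallelism.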
Warp Scheduler Perspective (Summary)

                       Forms Multiple   Simple Prefetcher   DRAM Bandwidth Utilization
  Warp Scheduler       Warp Groups?     Friendly?           Bank-Level Parallelism   Row Buffer Locality
  Round-Robin (RR)          ✖                 ✖                      ✔                      ✔
  Two-Level (TL)            ✔                 ✖                      ✖                      ✔
  Prefetch-Aware (PA)       ✔                 ✔                      ✔                      ✔ (with prefetching)
Outline

- Proposal
- Background and Motivation
- Prefetch-aware Scheduling
- Evaluation
- Conclusions
Evaluation Methodology

- Evaluated on GPGPU-Sim, a cycle-accurate GPU simulator
- Baseline architecture:
  - 30 SMs, 8 memory controllers, crossbar connected
  - 1300 MHz, SIMT width = 8, max. 1024 threads/core
  - 32 KB L1 data cache, 8 KB texture and constant caches
  - L1 data cache prefetcher, GDDR3 @ 1100 MHz
- Applications chosen from:
  - MapReduce applications
  - Rodinia (heterogeneous applications)
  - Parboil (throughput-computing-focused applications)
  - NVIDIA CUDA SDK (GPGPU applications)
Spatial Locality Detector Based Prefetching

[Figure: A macro-block of four consecutive cache lines. Lines X and X+2 are demanded (D); the detector prefetches (P) the not-yet-accessed lines X+1 and X+3. D = Demand, P = Prefetch.]

- The prefetch-aware scheduler improves the effectiveness of this simple prefetcher.
- See paper for more details.
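A sketch of the detector's behavior follows (illustrative; the macro-block size and trigger condition are assumptions):

```python
# Spatial-locality-detector prefetch sketch (illustrative). Cache lines are
# grouped into macro-blocks of 4 lines; on a demand to a line, the lines of
# the same macro-block that have not been demanded become prefetch candidates.
MB_LINES = 4  # assumed macro-block size: 4 cache lines

def sld_prefetches(demanded_lines, new_demand):
    base = (new_demand // MB_LINES) * MB_LINES
    block = range(base, base + MB_LINES)
    return [l for l in block
            if l != new_demand and l not in demanded_lines]

seen = {8}                       # line X = 8 was demanded earlier
print(sld_prefetches(seen, 10))  # demand to X+2 -> [9, 11], i.e. X+1, X+3
```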
Improving Prefetching Effectiveness

[Figure: Three bar charts comparing RR+Prefetching, TL+Prefetching, and PA+Prefetching. Fraction of late prefetches: 89% (RR), 86% (TL), 69% (PA). Prefetch accuracy: 85% (RR), 89% (TL), 90% (PA). Reduction in L1D miss rates: 2% (RR), 4% (TL), 16% (PA).]
Performance Evaluation

[Figure: IPC normalized to RR scheduling (y-axis 0.5-3) across SSC, PVC, KMN, SPMV, BFSR, FFT, SCP, BLK, FWT, JPEG, and GMEAN. Geometric means: RR+Prefetching 1.01, TL 1.16, TL+Prefetching 1.19, PA 1.20, PA+Prefetching 1.26. See paper for additional results.]

- 25% IPC improvement over Prefetching + RR warp scheduling policy (commonly used)
- 7% IPC improvement over Prefetching + TL warp scheduling policy (best previous)
Conclusions

- Existing warp schedulers in GPGPUs cannot take advantage of simple prefetchers
  - Consecutive warps have good spatial locality and can prefetch well for each other
  - But existing schedulers schedule consecutive warps close by in time, so prefetches arrive too late
- We proposed prefetch-aware (PA) warp scheduling
  - Key idea: group consecutive warps into different groups
  - Enables a simple prefetcher to be timely, since warps in different groups are scheduled at separate times
- Evaluations show that PA warp scheduling improves performance over combinations of conventional (RR) and the best previous (TL) warp scheduling and prefetching policies
  - Better orchestrates warp scheduling and prefetching decisions
THANKS! QUESTIONS?

BACKUP
Effect of Prefetch-Aware Scheduling

[Figure: Percentage of DRAM requests (averaged over a group) with 1 miss, 2 misses, or 3-4 misses to a macro-block, under two-level vs. prefetch-aware scheduling (y-axis 0-60%). Requests with 3-4 misses to a macro-block are high-spatial-locality requests; under prefetch-aware scheduling, much of that spatial locality is recovered by prefetching.]
Working (With Two-Level Scheduling)

[Figure: Macro-blocks X and Y. All four lines of each macro-block (X..X+3 and Y..Y+3) are demand-fetched (D): high-spatial-locality requests, with no opportunity for the simple prefetcher.]
Working (With Prefetch-Aware Scheduling)

[Figure: Macro-blocks X and Y. Group 1 demands (D) lines X, X+2, Y, Y+2 (high-spatial-locality requests), while the simple prefetcher fetches (P) X+1, X+3, Y+1, Y+3. When Group 2 later demands those lines, they are cache hits.]

Effect on Row Buffer Locality

[Figure: Row buffer locality (y-axis 0-12) for TL, TL+Prefetching, PA, and PA+Prefetching across SSC, PVC, KMN, SPMV, BFSR, FFT, SCP, BLK, FWT, JPEG, and AVG. PA shows a 24% decrease in row buffer locality over TL.]
Effect on Bank-Level Parallelism

[Figure: Bank-level parallelism (y-axis 0-25) for RR, TL, and PA across SSC, PVC, KMN, SPMV, BFSR, FFT, SCP, BLK, FWT, JPEG, and AVG. PA shows an 18% increase in bank-level parallelism over TL.]
Simple Prefetching + RR Scheduling (DRAM View)

[Figure: Under RR with simple prefetching, all warps access memory together: W1-W3's requests hit rows X..X+3 in Bank 1 while W4-W8's requests hit rows Y..Y+3 in Bank 2, keeping both banks busy.]
Simple Prefetching with TL Scheduling (DRAM View)

[Figure: Under TL with simple prefetching, Group 1's requests (W1-W3, rows X..X+3) are serviced while Bank 2 is idle for a period; then Group 2's requests (W4-W8, rows Y..Y+3) are serviced while Bank 1 is idle for a period.]
CTA-Assignment Policy (Example)

[Figure: A multi-threaded CUDA kernel is divided into CTAs. CTA-1 and CTA-2 are assigned to SIMT Core-1, and CTA-3 and CTA-4 to SIMT Core-2; each core has its own warp scheduler, L1 caches, and ALUs.]
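A sketch of this CTA-to-core assignment (illustrative; real hardware hands out CTAs as cores free up resources, and capacities depend on register, shared-memory, and thread limits):

```python
# CTA-to-SIMT-core assignment sketch (illustrative). CTAs from the kernel
# grid are handed out to cores in order, up to each core's CTA capacity;
# extra CTAs wrap around, modeling later assignment waves.
def assign_ctas(num_ctas, num_cores, ctas_per_core):
    assignment = {c: [] for c in range(num_cores)}
    for cta in range(num_ctas):
        core = (cta // ctas_per_core) % num_cores  # fill cores in order
        assignment[core].append(cta)
    return assignment

print(assign_ctas(4, 2, 2))  # {0: [0, 1], 1: [2, 3]}: matches the figure
```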