A Variable Warp-Size Architecture
Timothy G. Rogers
Daniel R. Johnson
Mike O’Connor
Stephen W. Keckler
Contemporary GPU
Massively multithreaded: tens of thousands of threads concurrently executing on tens of Streaming Multiprocessors (SMs)


[Diagram: GPU with multiple SMs connected through an interconnect to L2 cache slices and memory controllers]
Contemporary Streaming Multiprocessor (SM)
1,000s of schedulable threads
Amortizes front-end and memory overheads by grouping threads into warps
The size of the warp is fixed by the architecture
[Diagram: SM with a front end (L1 I-cache, decode, warp control logic), a 32-wide warp datapath, and a memory unit]
Contemporary GPU Software
Regular, structured computation
Predictable memory access and control flow patterns
Can take advantage of HW amortization for increased performance and energy efficiency
Executes efficiently on a GPU today: graphics shaders, matrix multiply, ...
Forward-Looking GPU Software
Still massively parallel, but less structured: memory access and control flow patterns are less predictable
Less efficient on today's GPUs: molecular dynamics, raytracing, object classification, ...
Executes efficiently on a GPU today: graphics shaders, matrix multiply, ...
Divergence: Source of Inefficiency
Regular hardware that amortizes front-end and control overhead meets irregular software with many different control flow paths and less predictable memory accesses.
Branch divergence: an if (...) { ... } block taken by only some threads of a 32-wide warp can cut function unit utilization to 1/32.
Memory divergence: a single Load R1, 0(R2) may have to wait for 32 different cache lines from main memory.
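To make the two forms of divergence concrete, here is a minimal CUDA sketch; the kernel, data layout, and names are illustrative assumptions rather than one of the paper's workloads.

// Illustrative kernel: threads of one 32-thread warp can diverge in two ways.
__global__ void divergence_example(const int *keys, const float *table,
                                   float *out, int n)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid >= n) return;

    // Branch divergence: lanes whose key is odd take a different path than
    // their warp-mates, so the 32-wide datapath executes both paths serially
    // with part of the lanes masked off.
    float v;
    if (keys[tid] % 2 == 0)
        v = table[tid];
    else
        v = 0.5f * table[tid] + 1.0f;

    // Memory divergence: a data-dependent gather. The 32 lanes may touch up
    // to 32 different cache lines, and the whole warp waits for the slowest.
    out[tid] = v + table[keys[tid]];
}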
Irregular "Divergent" Applications: Perform Better with a Smaller Warp Size
Branch divergence: function unit utilization increases, and more threads can proceed concurrently.
Memory divergence: each Load R1, 0(R2) waits on fewer outstanding accesses to main memory.
Negative Effects of a Smaller Warp Size
Less front-end amortization: fetch/decode energy increases.
Negative performance effects: scheduling skew increases pressure on the memory system.
Regular "Convergent" Applications: Perform Better with a Wider Warp
GPU memory coalescing: with a 32-wide warp, one memory system request for Load R1, 0(R2) can service all 32 threads.
Smaller warps mean less coalescing: the same access becomes 8 redundant memory requests that no longer occur together.
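A hedged CUDA sketch of the coalescing point above; the arrays and the indirection array are illustrative assumptions.

__global__ void coalesced_vs_scattered(const float *in, const int *idx,
                                       float *out, int n)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid >= n) return;

    // Coalesced: lane i of a warp reads consecutive words, so the 32 loads
    // fall into one 128-byte line and a single memory request serves them.
    float a = in[tid];

    // Scattered: lane i reads a data-dependent location, so the same load
    // instruction can generate up to 32 separate requests. With 4-wide warps
    // even the coalesced access above is split into 8 requests that no longer
    // arrive at the memory system together.
    float b = in[idx[tid]];

    out[tid] = a + b;
}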
Performance vs. Warp Size
[Chart: IPC with warp size 4, normalized to warp size 32, across 165 applications, grouped into convergent, warp-size insensitive, and divergent applications]
Goals
Convergent applications: maintain wide-warp performance and front-end efficiency.
Warp-size insensitive applications: maintain front-end efficiency.
Divergent applications: gain small-warp performance.
Approach: set the warp size based on the executing application.
Sliced Datapath + Ganged Scheduling
Split the SM datapath into narrow slices; we extensively studied 4-thread slices.
Gang slice execution to gain the efficiencies of a wider warp.
Slices share an L1 I-cache and memory unit; each slice has its own front end and 4-wide datapath, so slices can execute independently.
[Diagram: baseline SM with one front end and a 32-wide warp datapath vs. a sliced SM in which a ganging unit drives several slice front ends, each with a 4-wide slice datapath]
Initial Operation
Slices begin execution in ganged mode, mirroring the baseline 32-wide warp system.
The ganging unit drives the slices, so instructions are fetched and decoded once per gang.
Question: when should the gang be broken?
Breaking Gangs on Control Flow Divergence
The ganging unit observes the (possibly different) PCs reported by each slice after a branch.
PCs common to more than one slice form a new gang.
Slices that follow a unique PC are transferred to independent control and are no longer driven by the ganging unit.
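A minimal host-side C++ sketch of the regrouping rule described above; the structures and names are assumptions for illustration, not the hardware implementation.

#include <cstdint>
#include <map>
#include <vector>

struct Slice { int id; uint64_t next_pc; };           // per-slice fetch state
struct Gang  { uint64_t pc; std::vector<int> slice_ids; };

// Assumed ganging-unit split rule: group the slices of a broken gang by the
// PC each wants to fetch next. PCs shared by more than one slice form new
// (smaller) gangs that still fetch/decode once; a slice with a unique PC is
// released to its own slice front end.
void split_gang(const std::vector<Slice>& slices,
                std::vector<Gang>& new_gangs,
                std::vector<int>& independent_slices)
{
    std::map<uint64_t, std::vector<int>> by_pc;
    for (const Slice& s : slices)
        by_pc[s.next_pc].push_back(s.id);

    for (const auto& entry : by_pc) {
        if (entry.second.size() > 1)
            new_gangs.push_back({entry.first, entry.second});
        else
            independent_slices.push_back(entry.second.front());
    }
}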
Breaking Gangs on Memory Divergence
The latency of accesses from each slice can differ, for example when one slice hits in the L1 while another goes to memory.
We evaluated several heuristics for breaking the gang when this occurs.
Gang Reformation
Performed opportunistically: the ganging unit checks for gangs or independent slices that are at the same PC and re-gangs them.
More details are in the paper.
Methodology
In-house, cycle-level streaming multiprocessor model:
1 in-order core
64KB L1 data cache
128KB L2 data cache (one SM's worth)
48KB shared memory
Texture memory unit
Limited-bandwidth memory system
Greedy-Then-Oldest (GTO) issue scheduler
Configurations
Warp Size 32 (WS 32)
Warp Size 4 (WS 4)
Inelastic Variable Warp Sizing (I-VWS): gangs break on control flow divergence and are not reformed
Elastic Variable Warp Sizing (E-VWS): like I-VWS, except gangs are opportunistically reformed
Studied 5 applications from each category; the paper explores many more configurations in detail.
Divergent Application Performance
[Chart: IPC normalized to warp size 32 for CoMD, GamePhysics, Lighting, ObjClassifier, Raytracing, and HMEAN-DIV under WS 32, WS 4, I-VWS (break on CF only), and E-VWS (break + reform)]
Divergent Application Fetch Overhead
Average fetches per cycle are used as a proxy for energy consumption.
[Chart: average fetches per cycle for CoMD, GamePhysics, Lighting, ObjClassifier, Raytracing, and AVG-DIV under WS 32, WS 4, I-VWS (break on CF only), and E-VWS (break + reform)]
Convergent Application Performance
[Chart: IPC normalized to warp size 32 for Game 1, Game 2, MatrixMultiply, FeatureDetect, Radix Sort, and HMEAN-CON under WS 32, WS 4, I-VWS (break on CF only), and E-VWS (break + reform)]
Warp-size insensitive applications are unaffected.
Convergent/Insensitive Application Fetch Overhead
[Chart: average fetches per cycle for the convergent applications (Game 1, Game 2, MatrixMultiply, FeatureDetect, Radix Sort, AVG-CON) and the warp-size insensitive applications (Image Proc., Game 3, Game 4, Convolution, FFT, AVG-WSI) under WS 32, WS 4, I-VWS (break on CF only), and E-VWS (break + reform)]
165 Application Performance
[Chart: IPC with warp size 4 and with I-VWS, normalized to warp size 32, across all 165 applications, grouped into convergent, warp-size insensitive, and divergent applications]
Related Work
Compaction/Formation: improves function unit utilization, but decreases thread-level parallelism; area cost.
Subdivision/Multipath: increases thread-level parallelism, but decreases function unit utilization; area cost.
Variable Warp Sizing: increases thread-level parallelism and improves function unit utilization; estimated area cost of 5% for 4-wide slices and 2.5% for 8-wide slices.
Conclusion
Explored the design space surrounding warp size and performance.
Vary the size of the warp to meet the demands of the workload:
35% performance improvement on divergent applications
No performance degradation on convergent applications
Narrow slices with ganged execution improve both SIMD efficiency and thread-level parallelism.
Questions?
Managing DRAM Latency Divergence in Irregular GPGPU Applications
Niladrish Chatterjee
Mike O'Connor
Gabriel H. Loh
Nuwan Jayasena
Rajeev Balasubramonian
SC 2014
Irregular GPGPU Applications
• Conventional GPGPU workloads access vector or matrix-based data
structures
• Predictable strides, large data parallelism
• Emerging Irregular Workloads
• Pointer-based data-structures & data-dependent memory accesses
• Memory Latency Divergence on SIMT platforms
Warp-aware memory scheduling to reduce DRAM latency
divergence
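An illustrative CUDA sketch of the kind of pointer-based, data-dependent access pattern meant here; the structure and kernel are assumptions, not one of the evaluated benchmarks.

// Each thread walks a linked structure, so consecutive lanes issue
// data-dependent loads to unrelated addresses: the coalescer finds little to
// merge, and one warp's requests scatter across DRAM rows and banks.
struct Node { int next; float val; };

__global__ void pointer_chase(const Node *nodes, const int *heads,
                              float *out, int n, int steps)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid >= n) return;

    int cur = heads[tid];
    float acc = 0.0f;
    for (int s = 0; s < steps; ++s) {
        acc += nodes[cur].val;   // address depends on the previous load
        cur  = nodes[cur].next;
    }
    out[tid] = acc;
}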
SIMT Execution Overview
[Diagram: threads are grouped into warps (Warp 1 .. Warp N) on a SIMT core; the warp scheduler issues warps to the SIMD lanes in lockstep, and a warp stalls on a memory access; each SIMT core's L1/memory port connects through the interconnect to memory partitions, each with an L2 slice and a memory controller driving GDDR5 channels]
Memory Latency Divergence
• The coalescer has limited efficacy in irregular workloads
• Partial hits in the L1 and L2 are the first source of latency divergence
• DRAM requests can have varied latencies, and the warp stalls for the last request: DRAM latency divergence
[Diagram: a load instruction from the 32 SIMD lanes passes through the access coalescing unit, then the L1, L2, and GDDR5]
GPU Memory Controller (GMC)
• Optimized for high throughput
• Harvests channel and bank parallelism: address mapping spreads cache lines across channels and banks
• Achieves a high row-buffer hit rate: deep queuing and aggressive reordering of requests for row-hit batching
• Not cognizant of the need to service requests from a warp together: it interleaves requests from different warps, leading to latency divergence
Warp-Aware Scheduling
[Diagram: two SMs issue loads for warps A and B; with baseline GMC scheduling the memory controller interleaves A's and B's requests, so both warps stall for a long time before their Use instructions; with warp-aware scheduling all of A's requests are serviced before B's, reducing the average memory stall time]
Impact of DRAM Latency Divergence
If all requests from a warp were returned in perfect sequence from the DRAM: ~40% improvement.
If there were only one request per warp: 5X improvement.
Key Idea
• Form batches of requests from each warp: a warp-group
• Schedule all requests from a warp-group together
• The scheduling algorithm arbitrates between warp-groups to minimize the average stall time of warps
Controller Design
[Diagram: microarchitecture of the proposed warp-aware memory controller]
Warp-Group Scheduling: Single Channel
• Each pending warp-group is assigned a priority that reflects the completion time of its last request, derived from the number of requests in the warp-group, the row hit/miss status of those requests, and the queuing delay in the command queues
• Higher priority goes to warp-groups with few requests, high spatial locality, and lightly loaded banks
• Priorities are updated dynamically
• The transaction scheduler picks the warp-group with the lowest run time: shortest-job-first based on actual service time
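A minimal C++ sketch of the shortest-job-first arbitration described above, using simplified bookkeeping; the cost weights and structure names are illustrative assumptions, not the paper's timing model.

#include <cstdint>
#include <vector>

// Assumed request/warp-group bookkeeping, loosely following the slide.
struct Request   { int bank; bool row_hit; };
struct WarpGroup {
    uint64_t warp_id;
    std::vector<Request> reqs;
};

// Estimate the service (completion) time of a warp-group: row hits are cheap,
// row misses pay a precharge+activate penalty, and busy banks add queuing delay.
double estimated_service_time(const WarpGroup& wg,
                              const std::vector<int>& bank_queue_depth)
{
    const double kRowHit = 1.0, kRowMiss = 3.0, kQueued = 0.5;   // illustrative weights
    double t = 0.0;
    for (const Request& r : wg.reqs)
        t += (r.row_hit ? kRowHit : kRowMiss)
           + kQueued * bank_queue_depth[r.bank];
    return t;
}

// Transaction scheduler: shortest-job-first over warp-groups, so the warp
// whose outstanding requests can be completed soonest is unblocked first.
const WarpGroup* pick_next(const std::vector<WarpGroup>& pending,
                           const std::vector<int>& bank_queue_depth)
{
    const WarpGroup* best = nullptr;
    double best_t = 0.0;
    for (const WarpGroup& wg : pending) {
        double t = estimated_service_time(wg, bank_queue_depth);
        if (!best || t < best_t) { best = &wg; best_t = t; }
    }
    return best;
}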
WG Scheduling
[Chart: latency divergence vs. bandwidth utilization for the GMC baseline, WG, and the ideal]
Multiple Memory Controllers
• Channel-level parallelism: a warp's requests are sent to multiple memory channels, with independent scheduling at each controller
• A subset of a warp's requests can be delayed at one or a few memory controllers
• Coordinate scheduling between controllers: prioritize a warp-group that has already been serviced at other controllers
• A coordination message is broadcast to the other controllers on completion of a warp-group
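A small sketch of how that cross-channel coordination could be folded into the priority; the boost factor and names are illustrative assumptions.

#include <cstdint>
#include <unordered_map>

// Each controller tracks, per warp, whether another channel has already
// finished that warp's warp-group, and favors those warps.
struct WarpGroupStatus {
    double base_service_time;          // shortest-job-first estimate at this channel
    bool   done_elsewhere = false;
};

// Called when a completion broadcast for warp_id arrives from another channel.
void on_completion_broadcast(std::unordered_map<uint64_t, WarpGroupStatus>& table,
                             uint64_t warp_id)
{
    auto it = table.find(warp_id);
    if (it != table.end())
        it->second.done_elsewhere = true;
}

// Effective priority key: lower is scheduled earlier. Warp-groups already
// serviced at other channels are boosted so one slow channel does not hold
// the warp back system-wide.
double priority_key(const WarpGroupStatus& s)
{
    const double kBoost = 0.5;   // illustrative boost factor
    return s.done_elsewhere ? s.base_service_time * kBoost : s.base_service_time;
}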
Warp-Group Scheduling: Multi-Channel
• As in the single-channel case, the priority table reflects the number of requests in each warp-group, the row hit/miss status of its requests, and the queuing delay in the command queues
• It additionally tracks the status of each warp-group in the other channels, using periodic messages about completed warp-groups
• The transaction scheduler picks the warp-group with the lowest run time
WG-M Scheduling
[Chart: latency divergence vs. bandwidth utilization for the GMC baseline, WG, WG-M, and the ideal]
Bandwidth-Aware Warp-Group Scheduling
• Warp-group scheduling negatively affects bandwidth utilization through a reduced row-hit rate
• Conflicting objectives: issue a row-miss request from the current warp-group vs. issue row-hit requests to maintain bus utilization
• Activate and precharge idle cycles can be hidden by row hits in other banks, so the row-miss request is delayed to find the right slot
Bandwidth-Aware Warp-Group Scheduling
• The minimum number of row hits needed in other banks to overlap (tRTP + tRP + tRCD) is determined by the GDDR timing parameters
• This minimum efficient row burst (MERB) is stored in a ROM looked up by the transaction scheduler
• More banks with pending row hits means a smaller MERB
• The row miss is scheduled after MERB row hits have been issued to the bank
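A minimal sketch of how the transaction scheduler might consult the MERB table; the ROM contents are placeholders, not values derived from GDDR5 timing parameters.

#include <vector>

// Illustrative MERB lookup: with more banks holding pending row hits, the
// precharge/activate latency of a row miss is easier to hide, so fewer row
// hits need to be issued first.
int merb_lookup(int banks_with_pending_row_hits)
{
    static const std::vector<int> kMerbRom = {12, 12, 6, 4, 3, 3, 2, 2, 2};  // placeholder values
    int idx = banks_with_pending_row_hits;
    if (idx >= static_cast<int>(kMerbRom.size()))
        idx = static_cast<int>(kMerbRom.size()) - 1;
    return kMerbRom[idx];
}

// Decide whether the current warp-group's row-miss request may issue now, or
// whether more row hits should be drained from other banks first.
bool may_issue_row_miss(int row_hits_issued_since_last_miss,
                        int banks_with_pending_row_hits)
{
    return row_hits_issued_since_last_miss >= merb_lookup(banks_with_pending_row_hits);
}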
WG-Bw Scheduling
[Chart: latency divergence vs. bandwidth utilization for the GMC baseline, WG, WG-M, WG-Bw, and the ideal]
Warp-Aware Write Draining
• Writes are drained in batches, starting at the high watermark; this can stall small warp-groups
• When the write queue reaches a threshold lower than the high watermark, only singleton warp-groups are drained
• Reduces write-induced latency
WG-W Scheduling
[Chart: latency divergence vs. bandwidth utilization for the GMC baseline, WG, WG-M, WG-Bw, WG-W, and the ideal]
Methodology
• GPGPU-Sim v3.1: cycle-accurate GPGPU simulator
    SM Cores:              30
    Max Threads/Core:      1024
    Warp Size:             32 threads/warp
    L1 / L2:               32KB / 128KB
    DRAM:                  6Gbps GDDR5
    DRAM Channels, Banks:  6 channels, 16 banks/channel
• USIMM v1.3: cycle-accurate DRAM simulator, modified to model the GMC baseline and GDDR5 timings
• Irregular and regular workloads from Parboil, Rodinia, Lonestar, and MARS
Performance Improvement
[Chart: IPC normalized to the GMC baseline for WG, WG-M, WG-Bw, and WG-W, on a 0-12% scale; WG and WG-M reduce latency divergence, while WG-Bw and WG-W restore bandwidth utilization]
Impact on Regular Workloads
• Effective coalescing gives high spatial locality within each warp-group
• WG scheduling behaves similarly to the GMC baseline: no performance loss
• WG-Bw and WG-W provide minor benefits
Energy Impact of Reduced Row Hit-Rate
• Scheduling row misses over row hits reduces the row-buffer hit rate by 16%
• In GDDR5, power consumption is dominated by I/O, so the increase in DRAM power is negligible compared to the execution speed-up
• Net improvement in system energy
[Chart: GDDR5 energy per bit (pJ/bit), broken down into I/O, row, column, control, DLL, and background components, for the baseline and WG-Bw]
Conclusions
• Irregular applications place new demands on the GPU’s memory
system
• Memory scheduling can alleviate the issues caused by latency
divergence
• Carefully orchestrating the scheduling of commands can help regain
the bandwidth lost by warp-aware scheduling
• Future techniques must also include the cache-hierarchy in reducing
latency divergence
Thanks!
Backup Slides
Additional backup slides: Performance Improvement (IPC), Average Warp Stall Latency, DRAM Latency Divergence, Bandwidth Utilization, Memory Controller Microarchitecture
Warp-Group Scheduling
• Every batch is assigned a priority score: the completion time of its longest request
• Higher priority goes to warp-groups with few requests, high spatial locality, and lightly loaded banks
• Priorities are updated after each warp-group is scheduled
• The warp-group with the lowest service time is selected: shortest-job-first based on actual service time, not the number of requests
Priority-Based Cache Allocation in Throughput Processors
Dong Li*†, Minsoo Rhu*§†, Daniel R. Johnson§, Mike O'Connor§†, Mattan Erez†, Doug Burger‡, Donald S. Fussell†, and Stephen W. Keckler§†
*First authors Li and Rhu have made equal contributions to this work and are listed alphabetically
Introduction
• Graphics Processing Units (GPUs)
• General-purpose throughput processors
• Massive thread hierarchy; warp = 32 threads
• Challenge: small caches + many threads → contention + cache thrashing
• Prior work: throttling thread-level parallelism (TLP)
• Problem 1: under-utilized system resources
• Problem 2: unexplored cache efficiency
Motivation: Case Study (CoMD)
[Chart: CoMD performance as TLP is throttled down from max TLP (default) to the best throttled configuration; throttling improves performance]
Observation 1: resources are under-utilized
Observation 2: cache efficiency is left unexplored
Motivation
• Throttling trades off cache efficiency against parallelism: # active warps = # warps allocating cache
• Idea: decouple cache efficiency from parallelism
[Chart: the plane of # warps allocating cache vs. # active warps; the baseline lies on the diagonal, leaving the rest of the space unexplored]
Our Proposal: Priority-Based Cache Allocation (PCAL)
Baseline: Standard Operation
Regular threads (Warp 0 through Warp n-1): all threads have full capabilities and operate as usual, with full access to the L1 cache
TLP Throttling: TLP is reduced to improve performance
Regular threads: have all capabilities and operate as usual (full access to the L1 cache)
Throttled threads: not runnable
Proposed Solution: PCAL
Three subsets of threads:
Regular threads: have all capabilities and operate as usual (full access to the L1 cache)
Non-polluting threads: runnable, but prevented from polluting the L1 cache
Throttled threads: not runnable
Proposed Solution: PCAL
Tokens for privileged cache access: a per-warp priority token bit grants capabilities or privileges to some threads
Threads with tokens are allowed to allocate and evict L1 cache lines
Threads without tokens can read and write, but not allocate or evict, L1 cache lines
Proposed Solution: PCAL
Enhanced scheduler, static operation: the HW implementation is simple; the warp scheduler manages per-warp token bits alongside its status bits, and the token bit is sent with each memory request
Token assignment policies: tokens are assigned at warp launch, released at termination, and re-assigned to the oldest token-less warp
Parameters are supplied by software prior to kernel launch
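A hedged sketch of the static PCAL bookkeeping described on the last few slides; the structures, the fill-policy hook, and the reassignment loop are illustrative assumptions, not the paper's hardware design.

#include <cstdint>
#include <vector>

// One priority token bit per warp; the bit travels with each memory request
// and decides whether an L1 miss may allocate (and evict) a line or must
// bypass the cache.
struct WarpState  { bool active; bool has_token; };
struct MemRequest { uint64_t addr; bool token; };

// Token re-assignment: when a warp finishes, hand its token to the oldest
// runnable warp that does not hold one (warps are indexed oldest-first here).
void release_and_reassign(std::vector<WarpState>& warps, int finished_warp)
{
    if (!warps[finished_warp].has_token) return;
    warps[finished_warp].has_token = false;
    for (WarpState& w : warps) {
        if (w.active && !w.has_token) { w.has_token = true; break; }
    }
}

// L1 fill policy on a miss: token holders allocate normally; token-less
// (non-polluting) warps get their data, but the fill bypasses the L1.
bool l1_fill_allocates(const MemRequest& req) { return req.token; }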
Two Optimization Strategies
Exploiting the 2-D (#Warps, #Tokens) search space:
1. Increasing TLP (ITLP): add non-polluting warps without hurting the regular warps
2. Maintaining TLP (MTLP): reduce #Tokens to increase cache efficiency without decreasing TLP
[Chart: the (# active warps, # warps allocating cache) plane; the baseline sits on the diagonal, ITLP moves into the under-utilized resource region, and MTLP moves into the unexplored cache-efficiency region]
Proposed Solution: Dynamic PCAL
• Decide parameters at run time
• Choose MTLP/ITLP based on resource-usage performance counters
• For MTLP: search #Tokens in parallel
• For ITLP: search #Warps sequentially
• Please refer to our paper for more details
Evaluation
Simulation: GPGPU-Sim, an open-source academic simulator, configured similar to Fermi (full chip)
Benchmarks: open-source GPGPU suites (Parboil, Rodinia, LonestarGPU, etc.); the selected subset shows sensitivity to cache size
MTLP Example (CoMD)
MTLP: reducing #Tokens while keeping #Warps = 6
[Chart: PCAL with MTLP compared against the best throttled configuration]
ITLP Example (Similarity-Score)
ITLP: increasing #Warps while keeping #Tokens = 2
[Chart: PCAL with ITLP compared against the best throttled configuration]
PCAL Results Summary
Baseline: best throttled results
Performance improvement: Static-PCAL: 17%, Dynamic-PCAL: 11%
Conclusion
The existing throttling approach may lead to two problems: resource under-utilization and unexplored cache efficiency.
We propose PCAL, a simple mechanism to rebalance cache efficiency and parallelism:
ITLP increases parallelism without hurting regular warps
MTLP alleviates cache thrashing while maintaining parallelism
Throughput improvement over the best throttled results: Static PCAL 17%, Dynamic PCAL 11%
Questions?
UNLOCKING BANDWIDTH FOR GPUS IN CC-NUMA SYSTEMS
Neha Agarwal*, David Nellans, Mike O'Connor, Stephen W. Keckler, Thomas F. Wenisch*
NVIDIA and University of Michigan*
(A major part of this work was done while Neha Agarwal was an intern at NVIDIA)
HPCA 2015
EVOLVING GPU MEMORY SYSTEM
[Diagram: GPU with 200 GB/s GDDR5 and CPU with 80 GB/s DDR4, connected today by 15.8 GB/s PCI-E and, on the roadmap, by an 80 GB/s cache-coherent NVLink interconnect]
CUDA 1.0-5.x: programmer-controlled copying to GPU memory (cudaMemcpy)
CUDA 6.0+ (current): unified virtual memory; run-time controlled copying gives better productivity
Future: a CPU-GPU cache-coherent, high-BW interconnect
How can the full BW be exploited while maintaining programmability?
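The programming-model contrast above can be illustrated with standard CUDA APIs (error handling omitted); this is a generic sketch, not code from the paper.

#include <cuda_runtime.h>

__global__ void scale(float *x, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= 2.0f;
}

// CUDA 1.0-5.x style: the programmer explicitly stages data into GPU memory.
void legacy_path(const float *host, int n) {
    float *dev;
    cudaMalloc((void**)&dev, n * sizeof(float));
    cudaMemcpy(dev, host, n * sizeof(float), cudaMemcpyHostToDevice);
    scale<<<(n + 255) / 256, 256>>>(dev, n);
    cudaDeviceSynchronize();
    cudaFree(dev);
}

// CUDA 6.0+ style: unified memory; the runtime/driver moves pages as needed.
void unified_path(int n) {
    float *buf;
    cudaMallocManaged((void**)&buf, n * sizeof(float));
    for (int i = 0; i < n; ++i) buf[i] = 1.0f;     // touched on the CPU
    scale<<<(n + 255) / 256, 256>>>(buf, n);       // touched on the GPU
    cudaDeviceSynchronize();
    cudaFree(buf);
}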
DESIGN GOALS & OPPORTUNITIES
Simple programming model: no need for explicit data copying
Exploit the full DDR + GDDR BW: an additional 30% BW via NVLink, crucial for BW-sensitive GPU apps
[Chart: total bandwidth (GB/s) available to Legacy CUDA, UVM over PCI-E, NVLink, and the CC-NUMA target; levels shown are 15.8, 80, 200, and 280 GB/s]
Design intelligent dynamic page migration policies to achieve both of these goals
BANDWIDTH UTILIZATION
[Diagram: GPU with 200 GB/s GDDR5 and CPU with 80 GB/s DDR4, connected by an 80 GB/s NVLink]
[Chart: total bandwidth (GB/s) vs. the percentage of pages migrated to GPU memory, showing GPU memory BW (200 GB/s), BW from cache coherence (80 GB/s), and total system memory BW (280 GB/s)]
Coherence-based accesses with no page migration waste GPU memory BW
BANDWIDTH UTILIZATION
Static Oracle: place data in the ratio of the memory bandwidths [ASPLOS'15]
Dynamic migration can exploit the full system memory BW
BANDWIDTH UTILIZATION
Excessive migration leads to under-utilization of DDR BW
Migrate pages for optimal BW utilization
CONTRIBUTIONS
Intelligent dynamic page migration:
Aggressively migrate pages to GDDR memory upon first touch
Pre-fetch neighbors of touched pages to reduce TLB shootdowns
Throttle page migrations when nearing peak BW
Dynamic page migration performs 1.95x better than no migration, comes within 28% of the static oracle performance, and is 6% better than Legacy CUDA
OUTLINE
Page migration techniques:
First-Touch page migration
Range-Expansion to save TLB shootdowns
BW balancing to stop excessive migrations
Results & conclusions
FIRST-TOUCH PAGE MIGRATION
Naive policy: migrate pages from CPU memory to GPU memory as they are touched
[Chart: throughput relative to no migration; First-Touch migration approaches Legacy CUDA]
First-Touch migration is cheap: no hardware counters are required
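A minimal sketch of the first-touch policy as it might look in a runtime/driver model; the structure and the hooks (migrate_to_gddr, shootdown_cpu_tlb) are hypothetical names used only for illustration.

#include <cstdint>
#include <unordered_set>

struct MigrationRuntime {
    std::unordered_set<uint64_t> gpu_resident_pages;   // pages already in GDDR

    // Hypothetical hooks into the memory-system model.
    void migrate_to_gddr(uint64_t /*page*/)   { /* copy page, update GPU page table (model detail omitted) */ }
    void shootdown_cpu_tlb(uint64_t /*page*/) { /* invalidate the stale CPU translation (model detail omitted) */ }

    // Called on a GPU access to a page that is not yet GDDR-resident.
    void on_gpu_touch(uint64_t vaddr)
    {
        const uint64_t page = vaddr >> 12;              // assume 4KB pages
        if (gpu_resident_pages.count(page)) return;     // already migrated
        migrate_to_gddr(page);                          // first touch: migrate now
        shootdown_cpu_tlb(page);                        // the costly part (next slide)
        gpu_resident_pages.insert(page);
    }
};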
PROBLEMS WITH FIRST-TOUCH MIGRATION
Migrating a page requires remapping it (V1: P1 -> P2) in both page tables, sending an invalidate to the CPU TLB, and waiting for the acknowledgement: a TLB shootdown
[Chart: throughput relative to no migration for first-touch migration, with and without the shootdown overhead]
TLB shootdowns may negate the benefits of page migration
How can pages be migrated without incurring the shootdown cost?
PRE-FETCH TO AVOID TLB SHOOTDOWNS
Intuition: hot virtual addresses are clustered; hot pages consume 80% of the BW
[Heat map: page access intensity across the virtual address space, from min to max]
Pre-fetch pages before they are accessed by the GPU: no shootdown cost for pre-fetched pages
MIGRATION USING RANGE-EXPANSION
Pre-fetch pages in a spatially contiguous range around the touched page (e.g., a range of 16)
Accessed pages require a shootdown; pre-fetched pages do not
[Chart: throughput relative to no migration, with and without shootdown overhead, for no range expansion vs. range 16]
Range-Expansion hides the TLB shootdown overhead
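A sketch of range expansion layered on the same kind of first-touch model, under the assumption that only pages the CPU may still have mapped need a shootdown; the names and the range value are illustrative.

#include <cstdint>
#include <unordered_set>

struct RangeExpansionRuntime {
    std::unordered_set<uint64_t> gpu_resident_pages;
    int range = 16;                                   // pages prefetched around a touch

    void migrate_to_gddr(uint64_t /*page*/, bool /*needs_shootdown*/) { /* model detail omitted */ }

    void on_gpu_touch(uint64_t vaddr)
    {
        const uint64_t page = vaddr >> 12;            // assume 4KB pages
        if (!gpu_resident_pages.count(page)) {
            migrate_to_gddr(page, /*needs_shootdown=*/true);   // touched: CPU may hold a TLB entry
            gpu_resident_pages.insert(page);
        }
        // Range expansion: pull in untouched neighbors without a shootdown.
        for (int d = -range; d <= range; ++d) {
            if (d == 0) continue;
            if (d < 0 && page < static_cast<uint64_t>(-d)) continue;   // avoid underflow
            const uint64_t neighbor = page + d;
            if (gpu_resident_pages.count(neighbor)) continue;
            migrate_to_gddr(neighbor, /*needs_shootdown=*/false);
            gpu_resident_pages.insert(neighbor);
        }
    }
};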
REVISITING BANDWIDTH UTILIZATION
[Chart: total bandwidth vs. the percentage of pages migrated to GPU memory; First-Touch migration plus Range-Expansion pushes quickly toward the excessive-migration region]
First-Touch + Range-Expansion aggressively unlocks GDDR BW
How can excessive page migrations be avoided?
BANDWIDTH BALANCING
[Chart: over time, the percentage of accesses served from GPU memory rises as pages are migrated; below the target (about 70%) GDDR is under-subscribed, while above it DDR is under-subscribed and migrations are excessive]
Throttle migrations when nearing peak BW
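A minimal sketch of the throttle decision, assuming the target fraction is derived from the ratio of the two memory bandwidths (200 GB/s GDDR5 vs. 80 GB/s DDR4, as in the talk); the structure and names are illustrative.

#include <cstdint>

struct BwBalancer {
    double gddr_bw = 200.0, ddr_bw = 80.0;

    // ~0.71: placing accesses in the ratio of the bandwidths uses both fully.
    double target_gpu_fraction() const { return gddr_bw / (gddr_bw + ddr_bw); }

    // accesses_from_gddr / total_accesses is measured over a recent window.
    bool allow_migration(uint64_t accesses_from_gddr, uint64_t total_accesses) const
    {
        if (total_accesses == 0) return true;
        double frac = double(accesses_from_gddr) / double(total_accesses);
        return frac < target_gpu_fraction();   // throttle once the target is reached
    }
};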
SIMULATION ENVIRONMENT
Simulator: GPGPU-Sim 3.x with a heterogeneous two-level memory
GPU-attached GDDR5: 200 GB/s, 8 channels
CPU-attached DDR4: 80 GB/s, 4 channels
GPU-CPU interconnect: 80 GB/s, with 100 GPU core cycles of additional latency
Workloads: Rodinia applications [Che, IISWC 2009] and DoE mini-apps [Villa, SC 2014]
RESULTS
[Chart: throughput relative to no migration for Legacy CUDA, First-Touch + Range Exp + BW Balancing, and the static oracle]
RESULTS
First-Touch + Range-Expansion + BW Balancing outperforms Legacy CUDA
RESULTS
Some workloads have streaming accesses with no reuse after migration
RESULTS
Dynamic page migration performs 1.95x better than no migration, comes within 28% of the static oracle performance, and is 6% better than Legacy CUDA
CONCLUSIONS
Developed migration policies without any programmer involvement:
First-Touch migration is cheap but incurs many TLB shootdowns
First-Touch + Range-Expansion unlocks the GDDR memory BW
BW balancing maximizes BW utilization and throttles excessive migrations
These 3 complementary techniques effectively unlock the full system BW
THANK YOU
EFFECTIVENESS OF RANGE-EXPANSION
Range-Expansion can save up to 45% of TLB shootdowns

Benchmark    | Execution Overhead of TLB Shootdowns | % Migrations Without Shootdown | Execution Runtime Saved
backprop     | 29.1%   | 26%    | 7.6%
bfs          | 6.7%    | 12%    | 0.8%
cns          | 2.4%    | 20%    | 0.5%
comd         | 2.02%   | 89%    | 1.8%
kmeans       | 4.01%   | 79%    | 3.17%
minife       | 3.6%    | 36%    | 1.3%
mummer       | 21.15%  | 13%    | 2.75%
needle       | 24.9%   | 55%    | 13.7%
pathfinder   | 25.9%   | 10%    | 2.6%
srad v1      | 0.5%    | 27%    | 0.14%
xsbench      | 2.1%    | 1%     | 0.02%
Average      | 11.13%  | 33.5%  | 3.72%

TABLE II: Effectiveness of range prefetching at avoiding TLB shootdowns and runtime savings under a 100-cycle TLB shootdown overhead.

Excerpt from the paper:

In the case of backprop, we can see that higher thresholds perform poorly compared to threshold 1. Thresholds above 64 are simply too high; most pages are not accessed this frequently and thus few pages are migrated, resulting in poor GDDR bandwidth utilization. Range expansion prefetches these low-touch pages to GDDR as well, recouping the performance losses of the higher threshold policies and making them perform similar to a first touch migration policy.

In such benchmarks, touches to neighboring pages occur close together in time, because the algorithms are designed to use blocked access to the key data structures to enhance locality. Thus, the prefetching effect of range expansion is only visible when a page is migrated upon first touch; by the second access to a page, all its neighbors have already been accessed at least once and there will be no savings from avoiding TLB shootdowns. On the other hand, in benchmarks such as needle, there is low temporal correlation among touches to neighboring pages. Even if a migration candidate is touched 64 or 128 times, some of its neighboring pages may not have been touched, and thus the prefetching effect of range expansion provides up to 42% performance improvement even at higher thresholds.

For minife, previously discussed in subsection III-B, the effect of prefetching via range expansion is to recoup some of the performance loss due to needless migrations. However, performance still falls short of the legacy memcpy approach, which, in effect, achieves perfect prefetching. Overuse of range expansion hurts performance in some cases. Under the first touch migration policy (threshold 1), using range expansion of 16, 64, and 128, the worst-case performance degradations are 2%, 3%, and 2.5% respectively. While not visible in the graph due to the stacked nature of Figure 7, they are included in the geometric mean calculations.

Overall, we observe that even with range expansion, higher-threshold policies do not significantly outperform the much simpler first-touch policy. With threshold 1, the average performance gain with range expansion of 128 is 1.85x. The best absolute performance is observed when using a threshold of 64 combined with a range expansion value of 64, providing 1.95x speedup. We believe that this additional ~5% speedup over first touch migration with aggressive range expansion is not worth the implementation complexity of tracking and differentiating all pages in the system. In the next section, we discuss how to recoup some of this performance for benchmarks such as bfs and xsbench, which benefit most from using a higher threshold.

V. BANDWIDTH BALANCING

In Section III, we showed that using a static threshold-based page migration policy alone could not ideally balance migrating enough pages to maximize GDDR bandwidth utilization while selectively moving only the hottest data. In Section IV, we showed that informed page prefetching using a low threshold and range expansion to exploit locality within an application's virtual address space matches or exceeds the performance of a simple threshold-based policy. Combining low threshold migration with aggressive prefetching drastically ...
DATA TRANSFER RATIO
Performance is low when the GDDR/DDR data-transfer ratio is away from the optimal
GPU WORKLOADS: BW SENSITIVITY
[Charts: relative throughput vs. aggregate DRAM bandwidth (0.12x to 3x, where 1x = 200 GB/s) and vs. additional DRAM delay in GPU core cycles (-99 to +100), for backprop, bfs, hotspot, kmeans, needle, pathfinder, comd, minife, btree, lava_MD, sc_gpu, xsbench, gaussian, lud_cuda, srad_v1, heartwall, mummergpu, and cns]
GPU workloads are highly BW sensitive
RESULTS
[Chart: throughput relative to no migration for Legacy CUDA, First-Touch alone, and the Oracle]
Dynamic page migration performs 1.95x better than no migration, comes within 28% of the static oracle performance, and is 6% better than Legacy CUDA