Introduction to Control Divergence
Lecture Slides and Figures contributed from sources as noted
(1)
Objective
• Understand the occurrence of control divergence and the concept of thread reconvergence
  – Also described as branch divergence and thread divergence
• Cover a basic thread reconvergence mechanism – PDOM
  – Set up discussion of further optimizations and advanced techniques
• Explore one approach for mitigating the performance degradation due to control divergence – dynamic warp formation
(2)
Reading
• W. Fung, I. Sham, G. Yuan, and T. Aamodt, “Dynamic Warp Formation: Efficient MIMD Control Flow on SIMD Graphics Hardware,” ACM TACO, June 2009
• W. Fung and T. Aamodt, “Thread Block Compaction for Efficient SIMT Control Flow,” International Symposium on High Performance Computer Architecture, 2011
• M. Rhu and M. Erez, “CAPRI: Prediction of Compaction-Adequacy for Handling Control-Divergence in GPGPU Architectures,” ISCA 2012
• Figures from M. Rhu and M. Erez, “Maximizing SIMD Resource Utilization on GPGPUs with SIMD Lane Permutation,” ISCA 2013
(3)
Handling Branches
• CUDA code:
  if(…) {…}   (true for some threads – taken)
  else {…}    (true for the others – not taken)
• What if threads take different branches?
• Branch divergence!
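As a concrete illustration, a minimal hypothetical CUDA kernel (not from the lecture) in which threads of the same warp take different branch directions:

    // Even-numbered threads take the if-path, odd threads the else-path,
    // so the two sides of each warp execute serially under complementary
    // active masks.
    __global__ void divergent_branch(int *out)
    {
        int tid = blockIdx.x * blockDim.x + threadIdx.x;
        if (tid % 2 == 0)        // true for some threads ("taken")
            out[tid] = 2 * tid;
        else                     // true for the others ("not taken")
            out[tid] = tid + 1;
    }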
(4)
Branch Divergence
• Occurs within a warp
• Branches lead to serialization of branch-dependent code
  – Performance issue: low warp utilization
[Figure: threads sit idle through the if(…){…} or else{…} path they do not take, until reconvergence]
(5)
Branch Divergence
[Figure: control flow graph – A branches to B and F; B branches to C and D, which reconverge at E; E and F reconverge at G – executed by a warp of four threads sharing a common PC]
• Different threads follow different control flow paths through the kernel code
• Thread execution is (partially) serialized
  – Subsets of threads that follow the same path execute in parallel
Courtesy of Wilson Fung, Ivan Sham, George Yuan, Tor Aamodt
Basic Idea
• Split: partition a warp
  – Two mutually exclusive thread subsets, each branching to a different target
  – Identify the subsets with two activity masks – effectively two warps
• Join: merge two subsets of a previously split warp
  – Reconverge the mutually exclusive sets of threads
• Orchestrate the correct execution for nested branches
• Note the long history of techniques in SIMD processors (see background in Fung et al.)
[Figure: a warp of threads T1–T4 at a common PC with activity mask 1001]
(7)
Thread Reconvergence
• Fundamental problem:
  – Merge threads with the same PC
  – How do we sequence the execution of threads? This can affect the ability to reconverge
• Question: when can threads productively reconverge?
• Question: when is the best time to reconverge?
[Figure: the example control flow graph]
(8)
Dominator
• Node d dominates node n if every path from the entry node to n must go through d
[Figure: the example control flow graph]
(9)
Immediate Dominator
• Node d immediately dominates node n if d dominates n and no other node dominates n between d and n
[Figure: the example control flow graph]
(10)
Post Dominator
• Node d post-dominates node n if every path from node n to the exit node must go through d
[Figure: the example control flow graph]
(11)
Immediate Post Dominator
• Node d immediately post-dominates node n if d post-dominates n and no other node post-dominates n between d and n
  – In the example CFG, E is the immediate post-dominator of the branch at B: the paths through C and D merge at E, before G
[Figure: the example control flow graph]
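Post-dominator sets can be computed with a standard iterative dataflow pass over the reversed CFG. A host-side C++ sketch (my own, not from the lecture), encoding node sets as bitsets:

    #include <cstdint>
    #include <vector>

    // pdom[v] is a bitset of the nodes that post-dominate v. Start from
    // "all nodes", then repeatedly intersect over v's successors and add
    // v itself until a fixed point is reached.
    std::vector<uint32_t> postDominators(
        const std::vector<std::vector<int>>& succ, int exitNode)
    {
        const int n = static_cast<int>(succ.size());
        const uint32_t all = (n >= 32) ? 0xffffffffu : ((1u << n) - 1u);
        std::vector<uint32_t> pdom(n, all);
        pdom[exitNode] = 1u << exitNode;  // the exit post-dominates only itself
        for (bool changed = true; changed; ) {
            changed = false;
            for (int v = 0; v < n; ++v) {
                if (v == exitNode) continue;
                uint32_t meet = all;
                for (int s : succ[v]) meet &= pdom[s];  // meet over successors
                uint32_t next = meet | (1u << v);       // v post-dominates itself
                if (next != pdom[v]) { pdom[v] = next; changed = true; }
            }
        }
        return pdom;  // bit d of pdom[v] set  <=>  d post-dominates v
    }

For the lecture's CFG (A → B, F; B → C, D; C, D → E; E, F → G), pdom[B] comes out as {B, E, G}; the immediate post-dominator of B is the nearest strict post-dominator, E.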
(12)
Baseline: PDOM
[Figure: the example CFG annotated with the warp's active masks – A/1111, B/1111, C/1001, D/0110, E/1111, G/1111 – alongside the per-warp reconvergence stack. Each entry holds (Reconv. PC, Next PC, Active Mask): the bottom entry (--, A, 1111) advances its Next PC as the warp executes; the divergent branch at B pushes (E, D, 0110) and (E, C, 1001), which execute in turn and pop at E. The warp of four threads with a common PC thus executes A, B, C, D, E, G serially over time.]
Courtesy of Wilson Fung, Ivan Sham, George Yuan, Tor Aamodt
Stack Entry
[Figure: the same reconvergence stack, with (Reconv. PC, Next PC, Active Mask) entries (--, E, 1111), (E, D, 0110), and (E, C, 1001) at the TOS]
• A stack entry is a specification of a group of active threads that will execute that basic block
• The natural nested structure of control flow enables the use of stack-based serialization
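A minimal sketch of such an entry (field names are mine, not from the paper):

    #include <cstdint>

    // One per-warp reconvergence-stack entry: the threads in activeMask
    // execute starting at nextPC; the entry is popped when the warp's
    // next PC reaches reconvPC.
    struct StackEntry {
        uint32_t reconvPC;    // RPC: immediate post-dominator of the branch
        uint32_t nextPC;      // where the active threads resume
        uint32_t activeMask;  // one bit per lane in the warp
    };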
(14)
More Complex Example
• Stack-based implementation for nested control flow
  – Stack entry RPC set to the IPDOM
• Reconvergence at the immediate post-dominator of the branch
From Fung et al., “Dynamic Warp Formation: Efficient MIMD Control Flow on SIMD Graphics Hardware,” ACM TACO, June 2009
(15)
Implementation
[Figure: SIMT core pipeline – I-Fetch, Decode, I-Buffer, Issue, RF, scalar pipelines, D-Cache, Writeback, with pending warps – and the per-warp reconvergence stack (Reconv. PC, Next PC, Active Mask) attached at the issue stage]
• GPGPU-Sim model:
  – Implement a per-warp stack at the issue stage
  – Acquire the active mask and PC from the TOS
  – Scoreboard check prior to issue
  – Register writeback updates the scoreboard and the ready bit in the instruction buffer
  – When RPC = Next PC, pop the stack
• Implications for instruction fetch?
From GPGPU-Sim Documentation: http://gpgpu-sim.org/manual/index.php/GPGPUSim_3.x_Manual#SIMT_Cores
(16)
Implementation (2)
• warpPC (next instruction) is compared to the reconvergence PC
• On a branch:
  – Can store the reconvergence PC as part of the branch instruction
  – The branch unit has the NextPC, TargetPC, and reconvergence PC to update the stack
• On reaching a reconvergence point:
  – Pop the stack
  – Continue fetching from the NextPC of the next entry on the stack
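A hedged sketch of these two updates, reusing the StackEntry type from the earlier slide (helper names and signatures are my assumptions):

    #include <vector>

    // On a divergent branch: point the current entry at the reconvergence
    // PC, then push one entry per thread subset (taken executes first).
    void onBranch(std::vector<StackEntry>& stack, uint32_t targetPC,
                  uint32_t fallthroughPC, uint32_t reconvPC, uint32_t takenMask)
    {
        StackEntry& tos = stack.back();
        uint32_t notTaken = tos.activeMask & ~takenMask;
        if (takenMask == 0 || notTaken == 0) {   // uniform: no divergence
            tos.nextPC = takenMask ? targetPC : fallthroughPC;
            return;
        }
        tos.nextPC = reconvPC;                   // resume here after the join
        stack.push_back({reconvPC, fallthroughPC, notTaken});
        stack.push_back({reconvPC, targetPC, takenMask});
    }

    // Each cycle: if the warp reached the TOS reconvergence PC, pop and
    // continue fetching from the entry below; otherwise advance the TOS PC.
    uint32_t nextFetchPC(std::vector<StackEntry>& stack, uint32_t warpPC)
    {
        if (stack.size() > 1 && warpPC == stack.back().reconvPC) {
            stack.pop_back();
            return stack.back().nextPC;
        }
        stack.back().nextPC = warpPC;
        return warpPC;
    }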
(17)
Can We Do Better?
• Warps are formed statically
• Key idea of dynamic warp formation:
  – Maintain a pool of warps and merge warps that are at the same PC
• At a high level, what are the requirements?
[Figure: the example control flow graph]
(18)
Compaction Techniques
• Can we re-form warps so as to increase utilization?
• Basic idea: compaction
  – Re-form warps with threads that follow the same control flow path
  – Increase utilization of warps
• Two basic types of compaction techniques:
  – Inter-warp compaction: group threads from different warps
  – Intra-warp compaction: group threads within a warp
    o Changing the effective warp size
(19)
Inter-Warp Thread Compaction
Techniques
Lecture Slides and Figures contributed from sources as noted
(20)
Goal
[Figure: six warps (Warp 0–5) reach the branch if(…){…} else{…}; in each warp some threads go taken and some not taken. Can we merge threads from different warps that chose the same direction?]
(21)
Reading
• W. Fung, I. Sham, G. Yuan, and T. Aamodt, “Dynamic Warp Formation: Efficient MIMD Control Flow on SIMD Graphics Hardware,” ACM TACO, June 2009
• W. Fung and T. Aamodt, “Thread Block Compaction for Efficient SIMT Control Flow,” International Symposium on High Performance Computer Architecture, 2011
(22)
DWF: Example
[Figure: warps x and y traverse the example CFG with per-block active masks – A: x/1111, y/1111; B: x/1110, y/0011; C: x/1000, y/0010; D: x/0110, y/0001; F: x/0001, y/1100; E: x/1110, y/0011; G: x/1111, y/1111. A new warp is created from scalar threads of both warp x and warp y executing at basic block D. The baseline (per-warp PDOM) timeline runs every block once per warp (A A B B C C D D E E F F G G); dynamic warp formation merges the partial warps at B, D, E, and G and finishes the same work in fewer issue slots.]
Courtesy of Wilson Fung, Ivan Sham, George Yuan, Tor Aamodt
How Does This Work?
• Criteria for merging:
  – Same PC
  – Complementary sets of active threads in each warp
  – Recall: many warps per thread block, all executing the same code
• What information do we need to merge two warps?
  – Thread IDs and PCs
• Ideally, how would you find and merge warps?
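A sketch of the merge test these criteria imply (struct and names are mine): two warp-pool entries are mergeable when they share a PC and no home lane is occupied in both:

    #include <cstdint>

    struct PoolWarp {
        uint32_t pc;             // common PC of the threads in this entry
        uint32_t occupiedLanes;  // one bit per lane that already holds a thread
    };

    bool canMerge(const PoolWarp& a, const PoolWarp& b)
    {
        // Same PC, and complementary lane occupancy (no lane conflict).
        return a.pc == b.pc && (a.occupiedLanes & b.occupiedLanes) == 0;
    }

    PoolWarp merge(const PoolWarp& a, const PoolWarp& b)
    {
        return { a.pc, a.occupiedLanes | b.occupiedLanes };
    }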
(24)
DWF: Microarchitecture Implementation
[Figure: DWF pipeline. On commit, a branch outcome updates two Warp Update Registers – taken (T) and not-taken (NT) – each holding thread IDs (TID x N) and a request PC, and identifying the occupied lanes. A hash into the PC-Warp LUT finds the warp-pool entry being formed for that PC (OCC marks available lanes, IDX points to the warp being formed); the warp allocator creates a new entry on a miss. The warp pool holds (PC, TID x N, Prio) entries that the issue logic / thread scheduler dispatches to the lanes – ALU 1–4, each with a register file slice RF 1–4 addressed by (TID, Reg#) – feeding the I-cache, decode, and commit/writeback.]
• Warps are formed dynamically in the warp pool
• After commit, check the PC-Warp LUT and either merge into or allocate a newly forming warp
Courtesy of Wilson Fung, Ivan Sham, George Yuan, Tor Aamodt
DWF: Microarchitecture Implementation
[Figure: worked example for the branch A: BEQ R2, B. Taken threads fill Warp Update Register T with request PC B; not-taken threads fill Warp Update Register NT with request PC C. The PC-Warp LUT steers each group into the warp-pool entries forming for B and C, updating the occupancy masks so that every thread keeps its home lane – no lane conflict – and the scheduler then issues the compacted warps to the lanes.]
Courtesy of Wilson Fung, Ivan Sham, George Yuan, Tor Aamodt
Resource Usage
• Ideally we would like a small number of unique PCs in progress at a time → minimizes overhead
• Warp divergence increases the number of unique PCs
  – Mitigate via warp scheduling
• Scheduling policies:
  – FIFO
  – Program counter: address variation as a measure of divergence
  – Majority/Minority: favor the most common PC vs. helping stragglers
  – Post-dominator (catch up)
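For instance, a majority policy can be sketched over the warp pool (reusing the PoolWarp sketch from earlier; the policy detail is my reading of the slide): issue from the PC where the most threads are waiting.

    #include <cstdint>
    #include <map>
    #include <vector>

    // Count waiting threads per unique PC and return the most populous PC;
    // warps at that PC are issued first, deferring minority stragglers.
    uint32_t pickMajorityPC(const std::vector<PoolWarp>& pool)
    {
        std::map<uint32_t, int> threadsAtPC;
        for (const PoolWarp& w : pool)
            threadsAtPC[w.pc] += __builtin_popcount(w.occupiedLanes);
        uint32_t bestPC = 0;
        int best = -1;
        for (const auto& [pc, count] : threadsAtPC)
            if (count > best) { best = count; bestPC = pc; }
        return bestPC;
    }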
(27)
Hardware Consequences
• Expose the implications that static warps have in the base design
  – Implications for register file access → lane-aware DWF
• Register bank conflicts
From Fung et al., “Dynamic Warp Formation: Efficient MIMD Control Flow on SIMD Graphics Hardware,” ACM TACO, June 2009
(28)
Relaxing Implications of Warps
• Thread swizzling
  – Essentially remap work to threads so as to create more opportunities for DWF → requires deep understanding of algorithm behavior and data sets
• Lane swizzling in hardware
  – Provide limited connectivity between register banks and lanes → avoiding full crossbars
(29)
Summary
• Control flow divergence is a fundamental performance limiter for SIMT execution
• Dynamic warp formation is one way to mitigate its effects
  – We will look at several others
• Must balance a complex set of effects:
  – Memory behaviors
  – Synchronization behaviors
  – Scheduler behaviors
(30)
Thread Block Compaction
W. Fung and T. Aamodt
HPCA 2011
Goal
• Overcome some of the disadvantages of dynamic warp formation:
  – Impact of scheduling
  – Breaking implicit synchronization
  – Reduction of memory coalescing opportunities
(32)
DWF Pathologies: Starvation
• Majority scheduling
  – Best performing
  – Prioritize the largest group of threads with the same PC
• Starvation
  – LOWER SIMD efficiency!
  – Tricky: variable memory latency
• Other warp scheduler?
  B: if (K > 10)
  C:   K = 10;
     else
  D:   K = 0;
  E: B = C[tid.x] + K;
[Figure: under majority scheduling the compacted C/E warps (threads 1 2 7 8 and 5 -- 11 12) run for 1000s of cycles while the minority D warps (threads 9 6 3 4 and -- 10 -- --) starve before finally reaching E.]
Wilson Fung, Tor Aamodt, Thread Block Compaction (33)
DWF Pathologies: Extra Uncoalesced Accesses
• Coalesced memory access = memory SIMD
  – 1st-order CUDA programmer optimization
• Not preserved by DWF
  E: B = C[tid.x] + K;
[Figure: without DWF, static warps (1 2 3 4), (5 6 7 8), (9 10 11 12) each touch one line (0x100, 0x140, 0x180) → #Acc = 3. With DWF, shuffled warps (1 2 7 12), (9 6 3 8), (5 10 11 4) touch up to three lines each → #Acc = 9. The L1 cache absorbs the redundant memory traffic, at the cost of L1$ port conflicts.]
Wilson Fung, Tor Aamodt, Thread Block Compaction (34)
DWF Pathologies: Implicit Warp Sync.
• Some CUDA applications depend on the lockstep execution of “static warps”
  – Warp 0: threads 0 … 31; Warp 1: threads 32 … 63; Warp 2: threads 64 … 95
From W. Fung, I. Sham, G. Yuan, and T. Aamodt, “Dynamic Warp Formation: Efficient MIMD Control Flow on SIMD Graphics Hardware,” ACM TACO, June 2009
Wilson Fung, Tor Aamodt
(35)
Performance Impact
From W. Fung and T. Aamodt, “Thread Block Compaction for Efficient SIMT Control Flow,” International Symposium on High Performance Computer Architecture, 2011
(36)
Thread Block Compaction
• Block-wide reconvergence stack
  – Regroup threads within a block
  – Converge before the immediate post-dominator
[Figure: the per-warp stacks of warps 0–2, each with (PC, RPC, AMask) entries – e.g., (E, --, 1111), (D, E, 0011/0100/1100), (C, E, 1100/1011/0011) – merge into one block-wide stack whose active mask spans all warps of the thread block; compacted warps U, T, X, Y issue from the common TOS entry.]
• Better reconvergence stack: likely convergence
• Robust
  – Avg. 22% speedup on divergent CUDA apps
  – No penalty on others
Wilson Fung, Tor Aamodt, Thread Block Compaction (37)
GPU Microarchitecture
[Figure: GPU microarchitecture. Multiple SIMT cores connect through an interconnection network to memory partitions, each with a last-level cache bank and an off-chip DRAM channel. Each SIMT core has a front end (fetch, decode, schedule, branch), a SIMD datapath, and a memory subsystem with shared memory, L1 D$, texture $, and constant $; warps signal Done (Warp ID).]
Wilson Fung, Tor Aamodt, Thread Block Compaction
More details in paper
Observation
• Compute kernels usually contain divergent and non-divergent (coherent) code segments
• Coalesced memory accesses usually occur in coherent code segments
  – DWF offers no benefit there
[Figure: execution alternates between static (coherent) warps and dynamic warps created by divergence; warps are reset to the static arrangement at coalesced LD/ST instructions and at the reconvergence point.]
Wilson Fung, Tor Aamodt, Thread Block Compaction (39)
Thread Block Compaction
• Run a thread block like a warp
  – The whole block moves between coherent and divergent code
  – Block-wide stack to track execution paths and reconvergence
• Barrier @ branch/reconvergence pt.
  – All available threads arrive at the branch
  – Insensitive to warp scheduling → fixes starvation and implicit warp sync
• Warp compaction
  – Regrouping with all available threads
  – If no divergence, gives the static warp arrangement → fixes extra uncoalesced memory accesses
Wilson Fung, Tor Aamodt, Thread Block Compaction (40)
Thread Block Compaction
A: K = A[tid.x];
B: if (K > 10)
C:   K = 10;
   else
D:   K = 0;
E: B = C[tid.x] + K;
[Figure: block-wide stack with (PC, RPC, Active Threads) entries – (A/E, --, threads 1–12), (D, E, threads 3 4 6 9 10), (C, E, threads 1 2 5 7 8 11 12). All three static warps (1 2 3 4), (5 6 7 8), (9 10 11 12) execute A and wait at the branch; the C path then issues as compacted warps (1 2 7 8), (5 -- 11 12), the D path as (9 6 3 4), (-- 10 -- --), and at E the static arrangement resumes.]
Wilson Fung, Tor Aamodt, Thread Block Compaction (41)
Thread Block Compaction
• Barrier at every basic block?! (Idle pipeline)
• Switch to warps from other thread blocks
  – Multiple thread blocks run on a core
  – Already done in most CUDA applications
[Figure: while block 0 waits at a branch for warp compaction, blocks 1 and 2 keep executing, hiding the barrier latency.]
Wilson Fung, Tor Aamodt, Thread Block Compaction (42)
High Level View
• DWF: warps are broken down every cycle, and threads are shepherded into new warps (PC-Warp LUT and warp pool)
• TBC: warps are broken down at potentially divergent points, and threads are compacted across the thread block
(43)
Microarchitecture Modification
• Per-warp stack → block-wide stack
• I-buffer + TIDs → warp buffer
  – Stores the dynamic warps
• New unit: thread compactor
  – Translates the active mask into compacted dynamic warps
• More detail in paper
[Figure: modified pipeline – fetch/I-cache feed the warp buffer and decode; the block-wide stack supplies the branch target PC, valid bits [1:N], and active mask to the thread compactor, which feeds issue (with scoreboard), the ALUs and register file, memory, and writeback (Done WID).]
Wilson Fung, Tor Aamodt, Thread Block Compaction (44)
Microarchitecture Modification (2)
[Figure: the thread compactor receives the number of compacted warps and their TIDs from the block-wide stack and writes them into the warp buffer.]
From W. Fung and T. Aamodt, “Thread Block Compaction for Efficient SIMT Control Flow,” International Symposium on High Performance Computer Architecture, 2011
(45)
Operation
[Figure: per-lane candidate buffers in the thread compactor; all threads mapped to the same lane compete for that lane.]
• Warp 2 arrives first, creating two target entries
• Subsequent warps update the active mask
• When the pending-warp count reaches zero, compact: a priority encoder picks one thread mapped to each lane
From W. Fung and T. Aamodt, “Thread Block Compaction for Efficient SIMT Control Flow,” International Symposium on High Performance Computer Architecture, 2011
(46)
Thread Compactor
• Convert the active mask from the block-wide stack into thread IDs in the warp buffer
• Array of priority encoders
[Figure: stack entry C (RPC E) with active threads 1 2 -- -- 5 -- 7 8 -- -- 11 12. Per-lane candidate lists (1, 5), (2), (7, 11), (8, 12) feed four priority encoders (P-Enc), which emit the compacted warps C: 1 2 7 8 and C: 5 -- 11 12 into the warp buffer.]
Wilson Fung, Tor Aamodt, Thread Block Compaction
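A software sketch of the priority-encoder array (data layout and names are mine): one candidate is drained per lane per step, so the fullest lane determines how many compacted warps come out.

    #include <vector>

    // candidates[lane] holds the TIDs whose home lane is 'lane' and that
    // are active in the current stack entry; -1 marks an empty slot.
    std::vector<std::vector<int>> compactWarps(
        std::vector<std::vector<int>> candidates)
    {
        const int lanes = static_cast<int>(candidates.size());
        std::vector<std::vector<int>> warps;
        for (;;) {
            std::vector<int> warp(lanes, -1);
            bool any = false;
            for (int l = 0; l < lanes; ++l) {
                if (!candidates[l].empty()) {        // the lane's priority
                    warp[l] = candidates[l].front(); // encoder picks one TID
                    candidates[l].erase(candidates[l].begin());
                    any = true;
                }
            }
            if (!any) break;
            warps.push_back(warp);
        }
        return warps;
    }

With the slide's candidate lists {1, 5}, {2}, {7, 11}, {8, 12} this yields (1, 2, 7, 8) and (5, --, 11, 12), matching the two compacted warps above.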
(47)
Likely-Convergence
• Immediate post-dominator: conservative
  – All paths from a divergent branch must merge there
• Convergence can happen earlier
  – When any two of the paths merge
  A: while (i < K) {
  B:   X = data[i];
       if (X == 0)
  C:     result[i] = Y;
       else if (X == 1)
  D:     break;          // rarely taken
  E:   i++;
     }
  F: return result[i];   // F is the iPDom of A
• Extended the reconvergence stack to exploit this
  – TBC: 30% speedup for ray tracing
Wilson Fung, Tor Aamodt, Thread Block Compaction; details in paper (48)
Likely-Convergence (2)
• NVIDIA uses a break instruction for loop exits
  – That handles the last example
• Our solution: likely-convergence points
[Table: the block-wide stack gains LPC (likely-convergence PC) and LPos fields alongside (PC, RPC, ActiveThds); entries whose threads reach the LPC merge inside the stack before the RPC – convergence!]
• This paper: only used to capture loop breaks
Wilson Fung, Tor Aamodt, Thread Block Compaction (49)
Likely-Convergence (3)
• Check: merge inside the stack when a warp reaches the likely-convergence point
From W. Fung and T. Aamodt, “Thread Block Compaction for Efficient SIMT Control Flow,” International Symposium on High Performance Computer Architecture, 2011
(50)
Likely-Convergence (4)
• Applies to both the per-warp stack (PDOM) and thread block compaction (TBC)
• Enables more thread grouping for TBC
• Side effect: reduces stack usage in some cases
Wilson Fung, Tor Aamodt, Thread Block Compaction (51)
Evaluation
• Simulation: GPGPU-Sim (2.2.1b)
  – ~Quadro FX5800 + L1 & L2 caches
• 21 benchmarks
  – All of GPGPU-Sim's original benchmarks
  – Rodinia benchmarks
  – Other important applications:
    o Face Detection from VisBench (UIUC)
    o DNA sequencing (MUMmer-GPU++)
    o Molecular dynamics simulation (NAMD)
    o Ray tracing from NVIDIA Research
Wilson Fung, Tor Aamodt, Thread Block Compaction (52)
Experimental Results
• 2 benchmark groups:
  – COHE = non-divergent CUDA applications
  – DIVG = divergent CUDA applications
[Figure: IPC relative to the baseline per-warp stack (0.6–1.3). DWF suffers serious slowdown from its pathologies; TBC shows no penalty on COHE and a 22% speedup on DIVG.]
Wilson Fung, Tor Aamodt, Thread Block Compaction (53)
Effect on Memory Traffic
• TBC does still generate some extra uncoalesced memory accesses (#Acc = 4 in the earlier example: compacted warps C: 1 2 7 8 and C: 5 -- 11 12 over lines 0x100, 0x140, 0x180), but the second access will hit the L1 cache
  – No change to overall memory traffic in/out of a core
[Figure: normalized memory stalls and memory traffic (normalized to baseline) for TBC-AGE, TBC-RRB, and DWF across the DIVG and COHE benchmarks (BFS2, FCDT, HOTSP, LPS, MUM, MUMpp, NAMD, NVRT, AES, BACKP, CP, DG, HRTWL, LIB, LKYT, MGST, NNC, RAY, STMCL, STO, WP); DWF reaches a 2.67x outlier.]
Wilson Fung, Tor Aamodt, Thread Block Compaction (54)
Conclusion
• Thread block compaction
  – Addresses some key challenges of DWF
  – One significant step closer to reality
• Benefits from advancements on the reconvergence stack
  – Likely-convergence points
  – Extensible: can integrate other stack-based proposals
Wilson Fung, Tor Aamodt, Thread Block Compaction
(55)
CAPRI: Prediction of Compaction-Adequacy for Handling Control-Divergence in GPGPU Architectures
M. Rhu and M. Erez
ISCA 2012
Goals
• Improve the performance of inter-warp compaction techniques
• Predict when branches diverge
  – Borrow philosophy from branch prediction
• Use the prediction to apply compaction only when it is beneficial
Issues with Thread Block Compaction
• TBC: warps are broken down at potentially divergent points and threads compacted across the thread block
  – Implicit barrier across warps to collect compaction candidates
• The barrier synchronization overhead cannot always be hidden
• When it works, it works well
[Figure: the example control flow graph]
Divergence Behavior: A Closer Look
[Figure: a compaction-ineffective branch inside a loop that executes a fixed number of times.]
Figure from M. Rhu and M. Erez, “CAPRI: Prediction of Compaction-Adequacy for Handling Control-Divergence in GPGPU Architectures,” ISCA 2012
(59)
Compaction-Resistant Branches
[Figure: control flow graph highlighting compaction-resistant branches.]
Figure from M. Rhu and M. Erez, “CAPRI: Prediction of Compaction-Adequacy for Handling Control-Divergence in GPGPU Architectures,” ISCA 2012
(60)
Compaction-Resistant Branches (2)
[Figure: control flow graph, continued.]
Figure from M. Rhu and M. Erez, “CAPRI: Prediction of Compaction-Adequacy for Handling Control-Divergence in GPGPU Architectures,” ISCA 2012
(61)
Impact of Ineffective Compaction
• Threads are shuffled around with no performance improvement
• However, this can lead to increased memory divergence!
Figure from M. Rhu and M. Erez, “CAPRI: Prediction of Compaction-Adequacy for Handling Control-Divergence in GPGPU Architectures,” ISCA 2012
(62)
Basic Idea
• Only stall and compact when there is a high probability of compaction success
• Otherwise allow warps to bypass the (implicit) barrier
• Compaction-adequacy predictor!
  – Think branch prediction!
(63)
Example: TBC vs. CAPRI
Bypassing enables increased overlap of memory references
Figure from M. Rhu and M. Erez, “CAPRI: Prediction of Compaction-Adequacy for Handling Control-Divergence in GPGPU Architectures,” ISCA 2012
(64)
CAPRI: Example
• No divergence → no stalling
• First divergence: stall and initialize the history; all other warps now stall
• Divergence with history available: predict, then update the prediction
  – All other warps follow (one prediction per branch)
  – CAPT (compaction-adequacy prediction table) updated
Figure from M. Rhu and M. Erez, “CAPRI: Prediction of Compaction-Adequacy for Handling Control-Divergence in GPGPU Architectures,” ISCA 2012
(65)
The Predictor
• Prediction uses the active masks of all warps
  – Need to understand what could have happened
• Actual compaction only uses the warps that actually stalled
• The fullest lane (the maximum number of available threads in any lane) determines the achievable compaction, i.e., the number of compacted warps
• Update the history predictor accordingly
Figure from M. Rhu and M. Erez, “CAPRI: Prediction of Compaction-Adequacy for Handling Control-Divergence in GPGPU Architectures,” ISCA 2012
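A hedged sketch of that adequacy test (names and the exact update policy are my assumptions): compaction cannot produce fewer warps than the fullest home lane, so compare that bound with the number of warps that would stall.

    #include <algorithm>
    #include <cstdint>
    #include <vector>

    // One active mask per warp arriving at the branch; lanes = warp width.
    // Returns true if compaction would yield fewer warps than simply
    // running all of them, i.e., the barrier is predicted to pay off.
    bool compactionAdequate(const std::vector<uint32_t>& activeMasks, int lanes)
    {
        std::vector<int> perLane(lanes, 0);
        for (uint32_t m : activeMasks)
            for (int l = 0; l < lanes; ++l)
                perLane[l] += (m >> l) & 1u;
        int compactedWarps = 0;           // fullest lane bounds the result
        for (int c : perLane)
            compactedWarps = std::max(compactedWarps, c);
        return compactedWarps < static_cast<int>(activeMasks.size());
    }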
(66)
Behavior
Figure from M. Rhu and M. Erez, “CAPRI: Prediction of Compaction-Adequacy for Handling Control-Divergence in GPGPU Architectures,” ISCA 2012
(67)
Impact of Implicit Barriers
• The idle cycle count helps us understand the negative effects of implicit barriers in TBC
Figure from M. Rhu and M. Erez, “CAPRI: Prediction of Compaction-Adequacy for Handling Control-Divergence in GPGPU Architectures,” ISCA 2012
(68)
Summary
• The synchronization overhead of thread block compaction can introduce performance degradation
• Some branches diverge more than others
• Apply TBC judiciously → predict when it is beneficial
• Effectively predict when inter-warp compaction is effective
(69)
M. Rhu and M. Erez, “Maximizing SIMD Resource Utilization on GPGPUs with SIMD Lane Permutation,” ISCA 2013
Goals
• Understand the limitations of compaction techniques and their proximity to ideal compaction
• Provide mechanisms to overcome these limitations and approach ideal compaction rates
Limitations of Compaction
Figure from M. Rhu and M. Erez, “Maximizing SIMD Resource Utilization on GPGPUs with SIMD Lane Permutation,” ISCA 2013
(72)
Mapping Threads to Lanes: Today
• Linearization of thread IDs within a block
• Modulo assignment of linear thread IDs to lanes
[Figure: the threads of block (1,1) in a 2x2 grid are linearized – (0,0,0), (0,0,1), (0,0,2), … – and dealt round-robin to an 8-wide SIMD datapath, each lane with its own register file slice (RF0 … RF n-1) feeding a scalar pipeline; consecutive warps (Warp 0, Warp 1, …) reuse the same lane assignment.]
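The conventional mapping, as a small CUDA sketch (the 32-wide warp is an assumption; the slide's figure draws 8 lanes):

    // Linearize the 3D thread index within the block, then assign lanes
    // by taking the linear ID modulo the warp width.
    __global__ void lane_mapping(int *lane_of)
    {
        int linearTid = threadIdx.x
                      + threadIdx.y * blockDim.x
                      + threadIdx.z * blockDim.x * blockDim.y;
        lane_of[linearTid] = linearTid % 32;  // modulo assignment to lanes
    }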
(73)
Data Dependent Branches
• Data-dependent control flow is less likely to produce lane conflicts
Figure from M. Rhu and M. Erez, “Maximizing SIMD Resource Utilization on GPGPUs with SIMD Lane Permutation,” ISCA 2013
(74)
Programmatic Branches
• Programmatic branches can be correlated to lane assignments (with modulo assignment)
• Program variables that behave like constants across threads can produce correlated branching behavior
Figure from M. Rhu and M. Erez, “Maximizing SIMD Resource Utilization on GPGPUs with SIMD Lane Permutation,” ISCA 2013
(75)
P-Branches vs. D-Branches
P-branches are the problem!
Figure from M. Rhu and M. Erez, “Maximizing SIMD Resource Utilization on GPGPUs with SIMD Lane Permutation,” ISCA 2013
(76)
Compaction Opportunities
• Lane reassignment can improve compaction opportunities
Figure from M. Rhu and M. Erez, “Maximizing SIMD Resource Utilization on GPGPUs with SIMD Lane Permutation,” ISCA 2013
(77)
Aligned Divergence
• Threads mapped to a lane tend to evaluate (programmatic) predicates the same way
  – Empirically, this is rarely exhibited by input-data-dependent control flow
• Compaction cannot help in the presence of lane conflicts
• ⇒ the performance of compaction mechanisms depends on both divergence patterns and lane conflicts
• We need to understand the impact of lane assignment
(78)
Impact of Lane Reassignment
Goal: improve “compactability”
Figure from M. Rhu and M. Erez, “Maximizing SIMD Resource Utilization on GPGPUs with SIMD Lane Permutation,” ISCA 2013
(79)
Random Permutations
• Does not always work well, but works well on average
• Better understanding of programs can lead to better permutation choices
Figure from M. Rhu and M. Erez, “Maximizing SIMD Resource Utilization on GPGPUs with SIMD Lane Permutation,” ISCA 2013
(80)
Mapping Threads to Lanes: New
[Figure: the same 8-wide SIMD datapath (per-lane register file slices RF0 … RF n-1 and scalar pipelines), but warps 0 … N-1 each use a different thread-to-lane assignment. SIMD width = 8.]
What criteria do we use for lane assignment?
Balanced Permutation
• Even warps: permutation within a half warp (XOR with evenWID/2)
• Odd warps: additionally swap the upper and lower halves
• Each lane has a single instance of a logical thread from each warp
Figure from M. Rhu and M. Erez, “Maximizing SIMD Resource Utilization on GPGPUs with SIMD Lane Permutation,” ISCA 2013
Balanced Permutation (2)
• Logical TID 0 of each warp is now assigned to a different lane
Figure from M. Rhu and M. Erez, “Maximizing SIMD Resource Utilization on GPGPUs with SIMD Lane Permutation,” ISCA 2013
(83)
Characteristics
• Vertical balance: each lane only has logical TIDs of distinct threads in a warp
• Horizontal balance: logical TID x in all of the warps is bound to different lanes
• This works when a CTA has fewer warps than the SIMD width: why?
• Note that random permutations achieve this only on average
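A hedged sketch of an XOR-based lane permutation consistent with these two properties (the paper's exact constants may differ):

    // Even warp 2k permutes within each half warp (XOR with k preserves the
    // half-warp bit while k < simdWidth/2); odd warp 2k+1 additionally swaps
    // the halves. XOR with a constant is a bijection, giving vertical
    // balance; distinct warps use distinct constants (k vs. k + width/2),
    // giving horizontal balance while #warps <= SIMD width.
    __host__ __device__ int permutedLane(int homeLane, int warpId, int simdWidth)
    {
        int lane = homeLane ^ (warpId >> 1);   // permute within a half warp
        if (warpId & 1)
            lane ^= simdWidth / 2;             // odd warps: swap halves
        return lane & (simdWidth - 1);
    }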
(84)
Impact on Memory Coalescing
• Modern GPUs do not require ordered requests
• Coalescing can occur across a set of requests → specific lane assignments do not affect coalescing behavior
• The increase in L1 miss rate is offset by the benefits of compaction
Figure from M. Rhu and M. Erez, “Maximizing SIMD Resource Utilization on GPGPUs with SIMD Lane Permutation,” ISCA 2013
(85)
Speedup of Compaction
• Can improve the compaction rate for divergence due to the majority of programmatic branches
Figure from M. Rhu and M. Erez, “Maximizing SIMD Resource Utilization on GPGPUs with SIMD Lane Permutation,” ISCA 2013
(86)
Compaction Rate vs. Utilization
Distinguish between compaction rate and utilization!
Figure from M. Rhu and M. Erez, “Maximizing SIMD Resource Utilization on GPGPUs with SIMD Lane Permutation,” ISCA 2013
(87)
Application of Balanced Permutation
• The permutation is applied when the warp is launched
• Maintained for the life of the warp
• Does not affect the baseline compaction mechanism
• Enable/disable SLP to preserve target-specific, programmer-implemented optimizations
[Figure: the SIMT core pipeline from before (I-Fetch, Decode, I-Buffer, Issue, RF, scalar pipelines, pending warps, D-Cache, Writeback); SLP only changes the thread-to-lane assignment at warp launch.]
(88)
Summary
• Structural hazards limit the performance improvements from inter-warp compaction
• Program behaviors produce correlated lane assignments today
• Remapping threads to lanes extends compaction opportunities
(89)
Summary: Inter-Warp Compaction
[Figure: the example control flow graph executed across two thread blocks]
• Program properties
• Co-design of applications, resource management, software, and microarchitecture
(90)