Thread Block Compaction for Efficient SIMT Control Flow

Wilson W. L. Fung
Tor M. Aamodt
University of British Columbia
HPCA-17 Feb 14, 2011

Graphics Processing Unit (GPU)

- Commodity many-core accelerator
  - SIMD hardware: compute bandwidth + efficiency
  - Non-graphics APIs: CUDA, OpenCL, DirectCompute
- Programming model: hierarchy of scalar threads
  - SIMD-ness not exposed at the ISA level
  - Scalar threads are grouped into warps that run in lockstep (Single-Instruction, Multiple-Thread)

[Figure: a grid of thread blocks; within a block, scalar threads 1-12 are grouped into warps.]

SIMT Execution Model

A: K = A[tid.x];        // A[] = {4,8,12,16}
B: if (K > 10)
C:     K = 10;
   else
D:     K = 0;
E: B = C[tid.x] + K;

Per-warp reconvergence stack after the divergent branch at B (4-thread warp):

  PC  RPC  Active Mask
  E   --   1111
  D   E    0011
  C   E    1100   <- top of stack

[Figure: over time, the warp executes A and B with all four threads active, then each side of the branch with only two threads active (masks 1100 and 0011), and finally reconverges at E with all four threads active.]

Only half the lanes are active on each side of the branch: 50% SIMD efficiency!
Branch divergence can be much worse; in some cases SIMD efficiency drops to about 20%.
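
As a concrete illustration, a minimal CUDA sketch of this running example might look like the following (kernel and parameter names are illustrative, not from the paper):

// Illustrative CUDA sketch of the running example above (names are
// hypothetical). Within a warp, the threads that take the if-side and the
// threads that take the else-side are executed one after the other under the
// reconvergence stack, with the inactive lanes masked off.
__global__ void divergent_example(const int *A, const int *Cvals, int *B)
{
    int tid = threadIdx.x;
    int K = A[tid];             // A: K = A[tid.x];
    if (K > 10)                 // B: potentially divergent branch
        K = 10;                 //    one side of the branch
    else
        K = 0;                  //    other side of the branch
    B[tid] = Cvals[tid] + K;    // E: immediate post-dominator (reconvergence)
}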

Dynamic Warp Formation (DWF)
(W. Fung et al., MICRO 2007)

[Figure: three static warps (threads 1-4, 5-8, 9-12) execute A and B fully populated. After the divergent branch, DWF packs threads from different warps that are on the same path into new dynamic warps, e.g. C {1,2,7,8} and C {5,--,11,12} instead of three half-empty C warps, and D warps are formed the same way; the warps then reconverge at E. Packing depends on warps arriving together, so reissue and memory latency can prevent it.]

SIMD efficiency rises to about 88%.

This Paper

- Identified DWF pathologies:
  - Greedy warp scheduling → starvation → lower SIMD efficiency
  - DWF breaks up coalesced memory accesses → lower memory efficiency (extreme case: 5X slowdown)
  - Some CUDA apps require lockstep execution of static warps → DWF breaks them
- Additional hardware to fix these issues? A simpler solution: Thread Block Compaction

Thread Block Compaction

The per-warp stacks of a thread block (Warp 0, Warp 1, Warp 2) are merged into one block-wide reconvergence stack whose entries carry a block-wide active mask:

  PC  RPC  Active Mask (Warp 0 | Warp 1 | Warp 2)
  E   --   1111 | 1111 | 1111
  D   E    0011 | 0100 | 1100
  C   E    1100 | 1011 | 0011

- Block-wide reconvergence stack
  - Regroup threads within a block
- Better reconvergence stack: likely convergence
  - Converge before the immediate post-dominator
- Robust
  - Avg. 22% speedup on divergent CUDA apps
  - No penalty on the others

Outline

- Introduction
- GPU Microarchitecture
- DWF Pathologies
- Thread Block Compaction
- Likely-Convergence
- Experimental Results
- Conclusion

GPU Microarchitecture

[Block diagram: multiple SIMT cores connected through an interconnection network to memory partitions; each memory partition contains a last-level cache bank and an off-chip DRAM channel. Each SIMT core has a SIMT front end (fetch, decode, schedule, branch) feeding a SIMD datapath and a memory subsystem with shared memory, L1 D-cache, texture cache, and constant cache; a Done (Warp ID) signal returns from the datapath to the front end.]

More details in the paper.

DWF Pathologies: Starvation

- Majority scheduling
  - Best performing scheduler in prior work
  - Prioritizes the largest group of threads at the same PC
- Starvation: the minority path keeps being deferred → LOWER SIMD efficiency!
- Use another warp scheduler?
  - Tricky: variable memory latency

B: if (K > 10)
C:     K = 10;
   else
D:     K = 0;
E: B = C[tid.x] + K;

[Figure: the majority dynamic warps C {1,2,7,8} and C {5,--,11,12} keep being prioritized into E, while the minority D warps ({9,6,3,4} and {--,10,--,--}) wait; the stragglers can be delayed for 1000s of cycles before they finally execute D and E.]
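
For clarity, here is a hedged, host-side sketch of the majority-scheduling policy described above (illustrative data structures only; the paper evaluates it as a hardware scheduler, not as code):

// Hedged sketch of majority scheduling (illustrative, not the paper's
// hardware). Ready warps are grouped by PC; the PC with the largest total
// thread count is issued first, so a small group stuck on the minority path
// can be deferred for a very long time while memory latency keeps refilling
// the majority group.
#include <map>
#include <vector>

struct WarpState {
    unsigned pc;           // current PC of the (dynamic) warp
    int      activeCount;  // number of active threads in it
    bool     ready;        // operands available, not stalled on memory
};

unsigned pick_majority_pc(const std::vector<WarpState> &warps)
{
    std::map<unsigned, int> threadsAtPC;
    for (const WarpState &w : warps)
        if (w.ready)
            threadsAtPC[w.pc] += w.activeCount;

    unsigned bestPC  = 0;   // fallback if nothing is ready
    int      bestCnt = -1;
    for (const auto &entry : threadsAtPC)
        if (entry.second > bestCnt) {
            bestCnt = entry.second;
            bestPC  = entry.first;
        }
    return bestPC;  // warps at bestPC issue ahead of the minority path
}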

DWF Pathologies: Extra Uncoalesced Accesses

- Coalesced memory access = memory SIMD
  - A first-order CUDA programmer optimization
- Not preserved by DWF

E: B = C[tid.x] + K;

Without DWF, the static warps {1,2,3,4}, {5,6,7,8}, {9,10,11,12} each touch one cache line (0x100, 0x140, 0x180): #Acc = 3. With DWF, dynamic warps such as {1,2,7,12}, {9,6,3,8}, {5,10,11,4} scatter the same loads across all three lines: #Acc = 9. The L1 cache absorbs some of the redundant memory traffic, but at the cost of L1 port conflicts.
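
To make the coalescing point concrete, here is a hedged CUDA sketch (kernel and array names are illustrative):

// Illustrative CUDA sketch (not from the paper): with static warps, lane i of
// a warp loads C[base + i], so the loads fall into a few contiguous segments
// and coalesce into a small number of memory accesses. If DWF packs threads
// 1, 2, 7, 12, ... into one warp, the very same line of code reads scattered
// words of C and is split into many separate accesses (#Acc = 9 vs. 3 above).
__global__ void coalesced_add(const int *C, int K, int *B)
{
    int tid = threadIdx.x + blockIdx.x * blockDim.x;
    B[tid] = C[tid] + K;   // E: B = C[tid.x] + K;  (memory SIMD)
}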

DWF Pathologies: Implicit Warp Sync.

- Some CUDA applications depend on the lockstep execution of "static warps"
  - Warp 0 = threads 0...31, Warp 1 = threads 32...63, Warp 2 = threads 64...95
  - E.g., the task queue in ray tracing below: every thread reads sharedTaskID[wid] right after lane 0 writes it, with no __syncthreads(); correctness relies on the warp running in lockstep (implicit warp sync), which DWF breaks by regrouping threads.

int wid = tid.x / 32;
if (tid.x % 32 == 0) {
sharedTaskID[wid] = atomicAdd(g_TaskID, 32);
}
my_TaskID = sharedTaskID[wid] + tid.x % 32;
ProcessTask(my_TaskID);

Observation

- Compute kernels usually contain both divergent and non-divergent (coherent) code segments
- Coalesced memory accesses usually occur in the coherent segments
  - DWF provides no benefit there

[Figure: static warps run the coherent segment with coalesced loads/stores, diverge into dynamic warps in the divergent segment, and could be reset to static (coherent) warps at the reconvergence point.]

Thread Block Compaction

- Run a thread block like a warp
  - The whole block moves between coherent and divergent code
  - A block-wide stack tracks execution paths and reconvergence
  => preserves implicit warp sync.
- Barrier at each branch / reconvergence point
  - All available threads arrive at the branch
  - Insensitive to warp scheduling
  => avoids starvation
- Warp compaction
  - Regroups using all available threads
  - If there is no divergence, it yields the static warp arrangement
  => avoids extra uncoalesced memory accesses

Thread Block Compaction (Walk-Through)

A: K = A[tid.x];
B: if (K > 10)
C:     K = 10;
   else
D:     K = 0;
E: B = C[tid.x] + K;

Block-wide stack after the branch at B (threads 1-12 in one block):

  PC  RPC  Active Threads
  E   --   1  2  3  4  5  6  7  8  9  10 11 12
  D   E    -- -- 3  4  -- 6  -- -- 9  10 -- --
  C   E    1  2  -- -- 5  -- 7  8  -- -- 11 12

[Figure: the three static warps {1,2,3,4}, {5,6,7,8}, {9,10,11,12} execute A and B; the C entry is compacted into warps {1,2,7,8} and {5,--,11,12}, the D entry into {9,6,3,4} and {--,10,--,--}; at the reconvergence point E the block returns to the static warps {1,2,3,4}, {5,6,7,8}, {9,10,11,12}.]
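
A hedged sketch of what one block-wide stack entry could hold for this walk-through (field names and the arrival-tracking detail are assumptions for illustration, not the paper's exact hardware layout):

// Illustrative data layout only: one reconvergence stack per thread block,
// whose entries carry a block-wide active mask (one bit per thread, stored
// as one 32-bit word per warp slot). A counter of warps still expected at
// the branch/reconvergence barrier is assumed here for bookkeeping.
#include <cstdint>

constexpr int kWarpSize      = 32;
constexpr int kWarpsPerBlock = 16;        // e.g., a 512-thread block

struct BlockWideStackEntry {
    uint32_t pc;                          // next PC for this path
    uint32_t rpc;                         // reconvergence PC (post-dominator)
    uint32_t activeMask[kWarpsPerBlock];  // which threads of the block take this path
    uint32_t warpsPending;                // warps yet to arrive at the barrier (assumed field)
};
// When the top entry becomes schedulable, the thread compactor repacks its
// activeMask into dense dynamic warps (e.g., C {1,2,7,8} and C {5,--,11,12}).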

Thread Block Compaction

- Barrier at every basic block?! (Idle pipeline)
- Solution: switch to warps from other thread blocks
  - Multiple thread blocks run on a core
  - Already done in most CUDA applications

[Figure: timeline in which Block 0 hits a branch and waits for warp compaction while warps from Block 1 and Block 2 keep the pipeline busy, after which Block 0 resumes execution.]

Microarchitecture Modification

- Per-warp stacks -> one block-wide stack
- I-buffer + TIDs -> warp buffer
  - Stores the dynamic warps
- New unit: thread compactor
  - Translates the active mask into compact dynamic warps
- More detail in the paper

[Figure: front end with fetch, I-cache, decode, warp buffer, scoreboard, and issue; the block-wide stack supplies the branch target PC and an active mask (valid[1:N]) to the thread compactor, which writes compacted warps into the warp buffer; the SIMD datapath (register file, ALUs, MEM) reports Done (WID) back to the front end.]

Likely-Convergence

- Reconverging at the immediate post-dominator is conservative
  - All paths from a divergent branch must merge there
- Convergence can happen earlier
  - Namely, whenever any two of the paths merge

A: while (i < K) {
B:   X = data[i];
     if (X == 0)
C:       result[i] = Y;
     else if (X == 1)
D:       break;            // rarely taken
E:   i++;
   }
F: return result[i];       // immediate post-dominator of the divergent branch

Because the rarely taken break skips E, the immediate post-dominator of the branch is pushed out to F, after the loop; yet nearly all threads merge again at E (i++) long before that.

- Extended the reconvergence stack to exploit this (details in the paper)
  - With TBC: 30% speedup for ray tracing

Evaluation

- Simulation: GPGPU-Sim (2.2.1b)
  - Modeled after a Quadro FX5800, plus L1 & L2 caches
- 21 benchmarks
  - All of the original GPGPU-Sim benchmarks
  - Rodinia benchmarks
  - Other important applications:
    - Face detection from VisBench (UIUC)
    - DNA sequencing (MUMmerGPU++)
    - Molecular dynamics simulation (NAMD)
    - Ray tracing from NVIDIA Research

Experimental Results

- 2 benchmark groups:
  - COHE = non-divergent CUDA applications
  - DIVG = divergent CUDA applications

[Bar chart: IPC of DWF and TBC relative to the per-warp-stack baseline (roughly 0.6 to 1.3) for the DIVG and COHE groups. DWF suffers serious slowdowns from its pathologies; TBC shows no penalty on COHE and a 22% speedup on DIVG.]

Conclusion

- Thread Block Compaction
  - Addresses some key challenges of DWF
  - One significant step closer to reality
- Benefits from advancements in the reconvergence stack
  - Likely-convergence points
  - Extensible: can integrate other stack-based proposals

Thank You!

Thread Compactor

- Converts an active mask popped from the block-wide stack into the thread IDs stored in the warp buffer
- Implemented as an array of priority encoders, one per lane

Example (entry C, active threads 1 2 -- -- 5 -- 7 8 -- -- 11 12): the candidate TIDs per lane are
  lane 0: 1, 5    lane 1: 2, --    lane 2: 7, 11    lane 3: 8, 12
Each cycle the per-lane priority encoders pick one pending TID, producing the compacted warps C {1,2,7,8} and C {5,--,11,12} in the warp buffer.
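
The behavior of this priority-encoder array can be mimicked in a few lines of host-side code; the sketch below is illustrative only (4-wide warps as in the example, simple vectors standing in for the hardware encoders):

// Illustrative software model of per-lane priority encoding (not RTL):
// candidates[lane] holds the TIDs that must execute in that lane, in the same
// lane position they had in their static warps. Each pass picks at most one
// TID per lane, forming one compacted dynamic warp.
#include <cstdio>
#include <vector>

constexpr int kLanes = 4;      // 4-wide warps, as in the slide example
constexpr int kNone  = -1;     // "--" (no thread)

std::vector<std::vector<int>> compact(std::vector<std::vector<int>> candidates)
{
    std::vector<std::vector<int>> warps;
    bool anyLeft = true;
    while (anyLeft) {
        anyLeft = false;
        std::vector<int> warp(kLanes, kNone);
        for (int lane = 0; lane < kLanes; ++lane) {
            if (!candidates[lane].empty()) {            // "priority encoder":
                warp[lane] = candidates[lane].front();  // pick first pending TID
                candidates[lane].erase(candidates[lane].begin());
                anyLeft = true;
            }
        }
        if (anyLeft) warps.push_back(warp);
    }
    return warps;
}

int main()
{
    // Columns from the slide: lane0 {1,5}, lane1 {2}, lane2 {7,11}, lane3 {8,12}
    auto warps = compact({{1, 5}, {2}, {7, 11}, {8, 12}});
    for (const auto &w : warps) {          // prints: 1 2 7 8  /  5 -- 11 12
        for (int t : w) t == kNone ? printf(" --") : printf(" %d", t);
        printf("\n");
    }
    return 0;
}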

Effect on Memory Traffic

- TBC still generates some extra uncoalesced memory accesses
  - In the earlier example, the compacted warps C {1,2,7,8} and C {5,--,11,12} touch the lines at 0x100, 0x140, and 0x180: #Acc = 4 instead of 3
  - The second access hits in the L1 cache, so the overall memory traffic in/out of a core does not change

[Per-benchmark charts over the 21 benchmarks in the DIVG and COHE groups: memory traffic and memory stalls for TBC (RRB and AGE variants), DWF, and the baseline, normalized to the baseline; the worst DWF case is annotated at 2.67x.]

Likely-Convergence (2)

- NVIDIA uses a break instruction for loop exits
  - That handles the previous example
- Our solution: likely-convergence points

[Stack walk-through: each entry is extended with a likely-convergence PC (LPC) and position (LPos) in addition to PC, RPC, and the active threads; when threads from two different paths reach the same likely-convergence entry, they are merged — convergence!]

- This paper uses likely-convergence points only to capture loop breaks

Likely-Convergence (3)

- Applies to both the per-warp stack (PDOM) and thread block compaction (TBC)
- Enables more thread grouping for TBC
- Side effect: reduces stack usage in some cases