Improving GPU Performance via
Large Warps and Two-Level Warp Scheduling
Veynu Narasiman
The University of Texas at Austin
Michael Shebanow
NVIDIA
Chang Joo Lee
Intel
Rustam Miftakhutdinov
The University of Texas at Austin
Onur Mutlu
Carnegie Mellon University
Yale N. Patt
The University of Texas at Austin
MICRO-44
December 6th, 2011
Porto Alegre, Brazil
Rise of GPU Computing
GPUs have become a popular platform for general purpose applications
New Programming Models
CUDA
ATI Stream Technology
OpenCL
Order of magnitude speedup over single-threaded CPU
How GPUs Exploit Parallelism
Multiple GPU cores (i.e., Streaming Multiprocessors)
Focus on a single GPU core
Exploit parallelism in 2 major ways:
Threads grouped into warps
Single PC per warp
Warps executed in SIMD fashion
Multiple warps concurrently executed
Round-robin scheduling
Helps hide long latencies
The Problem
Despite these techniques, computational resources can still be underutilized
Two reasons for this:
Branch divergence
Long latency operations
Branch Divergence
[Figure: control-flow graph illustrating branch divergence. Block A executes with active mask 1111, then a branch splits execution: the taken path B runs with mask 1001, the not-taken path C with mask 0110, and the paths reconverge at block D with mask 1111. Hardware tracks the current PC and current active mask, and a divergence stack of (Reconverge PC, Active Mask, Execute PC) entries, e.g. (D, 0110, C), records the path still to be executed.]
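To make the stack concrete, here is a minimal C sketch of the mechanism in the figure; the 4-thread warp, PC values, and struct layout are illustrative assumptions, not the actual hardware:

#include <stdio.h>

/* Sketch of a divergence-stack entry as in the figure: the
   reconvergence PC, the active mask of the not-yet-executed
   path, and that path's starting PC. */
typedef struct {
    unsigned reconverge_pc;
    unsigned active_mask;   /* one bit per thread; 4 threads here */
    unsigned execute_pc;
} StackEntry;

int main(void) {
    enum { PC_B = 1, PC_C = 2, PC_D = 3 };
    unsigned current_mask = 0xF;                 /* block A: 1111 */

    /* Divergent branch at the end of A: threads 0 and 3 take it. */
    unsigned taken = 0x9;                        /* B: 1001 */
    unsigned not_taken = current_mask & ~taken;  /* C: 0110 */

    StackEntry stack[8];
    int top = 0;

    /* Push the not-taken path to run later; execute the taken path now. */
    stack[top++] = (StackEntry){ PC_D, not_taken, PC_C };
    current_mask = taken;                        /* B runs with 1001 */

    /* When B reaches reconvergence point D, pop and execute C. */
    StackEntry e = stack[--top];
    current_mask = e.active_mask;                /* C runs with 0110 */
    printf("run PC %u with mask %X, reconverge at PC %u with mask F\n",
           e.execute_pc, current_mask, e.reconverge_pc);
    return 0;
}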
Long Latency Operations
[Figure: execution timeline for round-robin scheduling with 16 total warps. All warps compute together, then issue their memory requests (Req Warp 0, Req Warp 1, ..., Req Warp 15) back to back into the memory system, leaving the core idle until the data returns and all warps compute again.]
Computational Resource Utilization
[Figure: per-benchmark computational resource utilization (0% to 100% of cycles), broken down by the number of active SIMD lanes per cycle: 32 (good), 24 to 31, 16 to 23, 8 to 15, 1 to 7, and 0 (bad). Configuration: 32 warps, 32 threads per warp, SIMD width = 32, round-robin scheduling.]
Large Warp Microarchitecture (LWM)
Alleviates branch divergence
Fewer, but larger warps
Warp size much greater than SIMD width
Total thread count and SIMD width stay the same
Dynamically breaks down large warp into sub-warps
Can be executed on existing SIMD pipeline
Rearrange active mask as 2D structure
Number of columns = SIMD width
Search each column for an active thread to create a new sub-warp (see the sketch below)
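A minimal C sketch of this column search, assuming a 4x4 active mask (large warp of 16 threads, SIMD width 4); the data layout and loop structure are illustrative, not the decode-stage hardware:

#include <stdio.h>

#define ROWS 4   /* large warp size / SIMD width */
#define COLS 4   /* SIMD width */

int main(void) {
    /* The large warp's active mask viewed as a 2D grid,
       one column per SIMD lane. */
    int mask[ROWS][COLS] = {
        {1, 0, 1, 1},
        {0, 1, 0, 1},
        {1, 1, 0, 0},
        {0, 0, 1, 0},
    };

    int subwarp = 0, remaining = 1;
    while (remaining) {
        remaining = 0;
        printf("sub-warp %d:", subwarp++);
        for (int c = 0; c < COLS; c++) {
            /* Pull the first active thread in this column into the
               current sub-warp's lane c. */
            int picked = 0;
            for (int r = 0; r < ROWS; r++) {
                if (mask[r][c]) { picked = 1; mask[r][c] = 0; break; }
            }
            printf(" %d", picked);
            /* Any active threads left in this column? */
            for (int r = 0; r < ROWS; r++) remaining |= mask[r][c];
        }
        printf("\n");
    }
    return 0;
}

With the mask above, every column holds two active threads, so the search yields two fully packed sub-warps (1 1 1 1 each) instead of four sparsely filled ones.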
Large Warp Microarchitecture Example
[Figure: animated example of sub-warp creation in the decode stage. The large warp's active mask is arranged as a 2D grid, one column per SIMD lane; scanning down each column for an active thread produces fully packed sub-warps, e.g. sub-warp 0 mask 1 1 1 1.]
More Large Warp Microarchitecture
Divergence stack still used
Handled at the large warp level
How large should we make the warps?
More threads per warp means more potential for sub-warp creation
Too large a warp size can degrade performance
Re-fetch policy for conditional branches
Must wait until the last sub-warp finishes
Optimization for unconditional branch instructions
Don't create multiple sub-warps
Sub-warping always completes in a single cycle
Two Level Round Robin Scheduling
Split warps into equal-sized fetch groups
Create initial priority among the fetch groups
Round-robin scheduling among warps in the same fetch group
When all warps in the highest-priority fetch group are stalled
Rotate fetch group priorities
Highest-priority fetch group becomes the lowest
Warps reach stalling points at slightly different times
Better overlap of computation and memory latency (see the sketch below)
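A minimal C sketch of this selection policy, assuming simple per-warp stall flags; the function shape and bookkeeping are illustrative assumptions (the real mechanism is fetch-stage hardware):

#include <stdbool.h>
#include <stdio.h>

#define NUM_WARPS  16
#define GROUP_SIZE  8
#define NUM_GROUPS (NUM_WARPS / GROUP_SIZE)

static bool stalled[NUM_WARPS];        /* e.g., waiting on memory */
static int  next_in_group[NUM_GROUPS]; /* per-group round-robin pointer */
static int  top_group = 0;             /* highest-priority fetch group */

/* Round-robin among ready warps of the highest-priority group;
   when that whole group is stalled, rotate group priorities so
   the old top group becomes the lowest. */
int select_warp(void) {
    for (int g = 0; g < NUM_GROUPS; g++) {
        int group = (top_group + g) % NUM_GROUPS;
        for (int i = 0; i < GROUP_SIZE; i++) {
            int slot = (next_in_group[group] + i) % GROUP_SIZE;
            int warp = group * GROUP_SIZE + slot;
            if (!stalled[warp]) {
                next_in_group[group] = (slot + 1) % GROUP_SIZE;
                if (g != 0) top_group = group;  /* priority rotation */
                return warp;
            }
        }
    }
    return -1;  /* every warp is stalled */
}

int main(void) {
    /* Fetch group 0 (warps 0-7) stalls on memory: priority rotates
       and fetch moves on to group 1. */
    for (int w = 0; w < 8; w++) stalled[w] = true;
    printf("next warp to fetch: %d\n", select_warp());  /* prints 8 */
    return 0;
}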
Round Robin vs Two Level Round Robin
[Figure: two execution timelines. Top: round-robin scheduling, 16 total warps; all warps compute, then requests Req Warp 0 through Req Warp 15 queue in the memory system while the core idles, then all warps compute again. Bottom: two-level round-robin scheduling, 2 fetch groups of 8 warps each; group 0 computes and issues Req Warp 0 through Req Warp 7, group 1 computes while those requests are serviced, then group 0 computes again while group 1's requests (Req Warp 8 through Req Warp 15) are outstanding. Overlapping one group's computation with the other group's memory latency saves cycles.]
More on Two Level Scheduling
What should the fetch group size be?
Enough warps to keep the pipeline busy in the absence of long latency stalls
Too small: uneven progression of warps in the same fetch group, which destroys data locality among warps
Too large: reduces the benefits of two-level scheduling, since more warps stall at the same time
Not just for hiding memory latency
Complex instructions (e.g., sine, cosine, sqrt)
Two-level scheduling lets warps arrive at such instructions at slightly different times
Combining LWM and Two Level Scheduling
4 large warps, 256 threads each
Fetch group size = 1 large warp
Problematic for applications with few long latency stalls
No stalls → no fetch group priority changes
A single large warp can be starved
Branch re-fetch policy for large warps → bubbles in the pipeline
Timeout-invoked fetch group priority change (sketched below)
32K-instruction timeout period
Alleviates starvation
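A minimal C sketch of the timeout, assuming a hypothetical per-fetched-instruction hook; only the 32K period comes from the slide:

#include <stdio.h>

#define NUM_GROUPS 4                 /* e.g., 4 large warps, fetch group size 1 */
#define TIMEOUT_INSTRUCTIONS 32768   /* 32K-instruction period from the slide */

static int  top_group = 0;
static long insts_since_switch = 0;

/* Called once per fetched instruction (illustrative hook): if the
   fetch group priority has not changed for 32K instructions, force
   a rotation so one large warp cannot starve the others. */
void note_fetched_instruction(int priority_changed_naturally) {
    if (priority_changed_naturally) { insts_since_switch = 0; return; }
    if (++insts_since_switch >= TIMEOUT_INSTRUCTIONS) {
        top_group = (top_group + 1) % NUM_GROUPS;
        insts_since_switch = 0;
    }
}

int main(void) {
    for (long i = 0; i < TIMEOUT_INSTRUCTIONS; i++)
        note_fetched_instruction(0);
    printf("top group after timeout: %d\n", top_group);  /* prints 1 */
    return 0;
}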
Methodology
Simulated a single GPU core with 1024 thread contexts, divided into 32 warps of 32 threads each
Scalar Front End: 1-wide fetch and decode; 4KB single-ported I-cache; round-robin scheduling
SIMD Back End: in-order, 5 stages, 32 parallel SIMD lanes
Register File and On-Chip Memories: 64KB register file; 128KB, 4-way D-cache with 128B line size; 128KB, 32-banked private memory
Memory System: open-row, first-come first-served scheduling; 8 banks with a 4KB row buffer per bank; 100-cycle row hit latency, 300-cycle row conflict latency; 32 GB/s memory bandwidth
Overall IPC Results
[Figure: IPC per benchmark for Baseline, TBC (Thread Block Compaction), LWM, 2Lev, and LWM+2Lev; y-axes range from 0 to 35 IPC (0.0 to 0.6 for the low-IPC benchmarks).]
LWM+2Lev improves performance by 19.1% over baseline and by 11.5% over TBC
IPC and Computational Resource Utilization
[Figures: IPC and computational resource utilization for blackjack and for histogram under baseline, LWM, 2LEV, and LWM+2LEV. Utilization is broken down by active SIMD lanes per cycle: 32, 24 to 31, 16 to 23, 8 to 15, 1 to 7, and 0.]
Conclusion
For maximum performance, the computational resources on GPUs must be effectively utilized
Branch divergence and long latency operations cause them to be underutilized or unused
We proposed two mechanisms to alleviate this:
Large Warp Microarchitecture for branch divergence
Two-level scheduling for long latency operations
Combined, they improve performance by 19.1% over traditional GPU cores
Increase the scope of applications that can run efficiently on a GPU
Questions?