Improving GPU Performance via
Large Warps and Two-Level Warp Scheduling
Veynu Narasiman
The University of Texas at Austin
Michael Shebanow
NVIDIA
Chang Joo Lee
Intel
Rustam Miftakhutdinov
The University of Texas at Austin
Onur Mutlu
Carnegie Mellon University
Yale N. Patt
The University of Texas at Austin
MICRO-44
December 6th, 2011
Porto Alegre, Brazil
Rise of GPU Computing
GPUs have become a popular platform for general purpose applications
New Programming Models
CUDA
ATI Stream Technology
OpenCL
Order of magnitude speedup over single-threaded CPU
How GPUs Exploit Parallelism
Multiple GPU cores (i.e., Streaming Multiprocessors)
Focus on a single GPU core
Exploit parallelism in 2 major ways:
Threads grouped into warps
Single PC per warp
Warps executed in SIMD fashion
Multiple warps concurrently executed
Round-robin scheduling
Helps hide long latencies
The Problem
Despite these techniques, computational resources can still be underutilized
Two reasons for this:
Branch divergence
Long latency operations
Branch Divergence
[Figure: control-flow graph illustrating branch divergence. Block A executes with active mask 1111, then a branch splits execution: the taken path B runs with mask 1001, the not-taken path C with mask 0110, and the paths reconverge at block D with mask 1111. Hardware tracks the current PC and current active mask, and a divergence stack of (Reconverge PC, Active Mask, Execute PC) entries, e.g. (D, 0110, C), records the path still to be executed.]
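To make the stack concrete, here is a minimal C sketch of the mechanism in the figure; the 4-thread warp, PC values, and struct layout are illustrative assumptions, not the actual hardware:

#include <stdio.h>

/* Sketch of a divergence-stack entry as in the figure: the
   reconvergence PC, the active mask of the not-yet-executed
   path, and that path's starting PC. */
typedef struct {
    unsigned reconverge_pc;
    unsigned active_mask;   /* one bit per thread; 4 threads here */
    unsigned execute_pc;
} StackEntry;

int main(void) {
    enum { PC_B = 1, PC_C = 2, PC_D = 3 };
    unsigned current_mask = 0xF;                 /* block A: 1111 */

    /* Divergent branch at the end of A: threads 0 and 3 take it. */
    unsigned taken = 0x9;                        /* B: 1001 */
    unsigned not_taken = current_mask & ~taken;  /* C: 0110 */

    StackEntry stack[8];
    int top = 0;

    /* Push the not-taken path to run later; execute the taken path now. */
    stack[top++] = (StackEntry){ PC_D, not_taken, PC_C };
    current_mask = taken;                        /* B runs with 1001 */

    /* When B reaches reconvergence point D, pop and execute C. */
    StackEntry e = stack[--top];
    current_mask = e.active_mask;                /* C runs with 0110 */
    printf("run PC %u with mask %X, reconverge at PC %u with mask F\n",
           e.execute_pc, current_mask, e.reconverge_pc);
    return 0;
}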
Long Latency Operations
[Figure: execution timeline for round-robin scheduling with 16 total warps. All warps compute together, then issue their memory requests (Req Warp 0, Req Warp 1, ..., Req Warp 15) back to back into the memory system, leaving the core idle until the data returns and all warps compute again.]
Computational Resource Utilization
[Figure: per-benchmark computational resource utilization (0% to 100% of cycles), broken down by the number of active SIMD lanes per cycle: 32 (good), 24 to 31, 16 to 23, 8 to 15, 1 to 7, and 0 (bad). Configuration: 32 warps, 32 threads per warp, SIMD width = 32, round-robin scheduling.]
Large Warp Microarchitecture (LWM)
Alleviates branch divergence
Fewer, but larger warps
Warp size much greater than SIMD width
Total thread count and SIMD width stay the same
Dynamically breaks down large warp into sub-warps
Can be executed on existing SIMD pipeline
Rearrange active mask as 2D structure
Number of columns = SIMD width
Search each column for an active thread to create a new sub-warp (see the sketch below)
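A minimal C sketch of this column search, assuming a 4x4 active mask (large warp of 16 threads, SIMD width 4); the data layout and loop structure are illustrative, not the decode-stage hardware:

#include <stdio.h>

#define ROWS 4   /* large warp size / SIMD width */
#define COLS 4   /* SIMD width */

int main(void) {
    /* The large warp's active mask viewed as a 2D grid,
       one column per SIMD lane. */
    int mask[ROWS][COLS] = {
        {1, 0, 1, 1},
        {0, 1, 0, 1},
        {1, 1, 0, 0},
        {0, 0, 1, 0},
    };

    int subwarp = 0, remaining = 1;
    while (remaining) {
        remaining = 0;
        printf("sub-warp %d:", subwarp++);
        for (int c = 0; c < COLS; c++) {
            /* Pull the first active thread in this column into the
               current sub-warp's lane c. */
            int picked = 0;
            for (int r = 0; r < ROWS; r++) {
                if (mask[r][c]) { picked = 1; mask[r][c] = 0; break; }
            }
            printf(" %d", picked);
            /* Any active threads left in this column? */
            for (int r = 0; r < ROWS; r++) remaining |= mask[r][c];
        }
        printf("\n");
    }
    return 0;
}

With the mask above, every column holds two active threads, so the search yields two fully packed sub-warps (1 1 1 1 each) instead of four sparsely filled ones.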
Large Warp Microarchitecture Example
[Figure: animated example of sub-warp creation in the decode stage. The large warp's active mask is arranged as a 2D grid, one column per SIMD lane; scanning down each column for an active thread produces fully packed sub-warps, e.g. sub-warp 0 mask 1 1 1 1.]
More Large Warp Microarchitecture
Divergence stack still used
Handled at the large warp level
How large should we make the warps?
More threads per warp means more potential for sub-warp creation
Too large a warp size can degrade performance
Re-fetch policy for conditional branches
Must wait until the last sub-warp finishes
Optimization for unconditional branch instructions
Don't create multiple sub-warps
Sub-warping always completes in a single cycle
Two Level Round Robin Scheduling
Split warps into equal-sized fetch groups
Create initial priority among the fetch groups
Round-robin scheduling among warps in the same fetch group
When all warps in the highest-priority fetch group are stalled
Rotate fetch group priorities
Highest-priority fetch group becomes the lowest
Warps reach stalling points at slightly different times
Better overlap of computation and memory latency (see the sketch below)
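A minimal C sketch of this selection policy, assuming simple per-warp stall flags; the function shape and bookkeeping are illustrative assumptions (the real mechanism is fetch-stage hardware):

#include <stdbool.h>
#include <stdio.h>

#define NUM_WARPS  16
#define GROUP_SIZE  8
#define NUM_GROUPS (NUM_WARPS / GROUP_SIZE)

static bool stalled[NUM_WARPS];        /* e.g., waiting on memory */
static int  next_in_group[NUM_GROUPS]; /* per-group round-robin pointer */
static int  top_group = 0;             /* highest-priority fetch group */

/* Round-robin among ready warps of the highest-priority group;
   when that whole group is stalled, rotate group priorities so
   the old top group becomes the lowest. */
int select_warp(void) {
    for (int g = 0; g < NUM_GROUPS; g++) {
        int group = (top_group + g) % NUM_GROUPS;
        for (int i = 0; i < GROUP_SIZE; i++) {
            int slot = (next_in_group[group] + i) % GROUP_SIZE;
            int warp = group * GROUP_SIZE + slot;
            if (!stalled[warp]) {
                next_in_group[group] = (slot + 1) % GROUP_SIZE;
                if (g != 0) top_group = group;  /* priority rotation */
                return warp;
            }
        }
    }
    return -1;  /* every warp is stalled */
}

int main(void) {
    /* Fetch group 0 (warps 0-7) stalls on memory: priority rotates
       and fetch moves on to group 1. */
    for (int w = 0; w < 8; w++) stalled[w] = true;
    printf("next warp to fetch: %d\n", select_warp());  /* prints 8 */
    return 0;
}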
Round Robin vs Two Level Round Robin
[Figure: two execution timelines. Top: round-robin scheduling, 16 total warps; all warps compute, then requests Req Warp 0 through Req Warp 15 queue in the memory system while the core idles, then all warps compute again. Bottom: two-level round-robin scheduling, 2 fetch groups of 8 warps each; group 0 computes and issues Req Warp 0 through Req Warp 7, group 1 computes while those requests are serviced, then group 0 computes again while group 1's requests (Req Warp 8 through Req Warp 15) are outstanding. Overlapping one group's computation with the other group's memory latency saves cycles.]
More on Two Level Scheduling
What should the fetch group size be?
Enough warps to keep the pipeline busy in the absence of long latency stalls
Too small: uneven progression of warps in the same fetch group, which destroys data locality among warps
Too large: reduces the benefits of two-level scheduling, since more warps stall at the same time
Not just for hiding memory latency
Complex instructions (e.g., sine, cosine, sqrt)
Two-level scheduling lets warps arrive at such instructions at slightly different times
Combining LWM and Two Level Scheduling
4 large warps, 256 threads each
Fetch group size = 1 large warp
Problematic for applications with few long latency stalls
No stalls → no fetch group priority changes
A single large warp can be starved
Branch re-fetch policy for large warps → bubbles in the pipeline
Timeout-invoked fetch group priority change (sketched below)
32K-instruction timeout period
Alleviates starvation
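A minimal C sketch of the timeout, assuming a hypothetical per-fetched-instruction hook; only the 32K period comes from the slide:

#include <stdio.h>

#define NUM_GROUPS 4                 /* e.g., 4 large warps, fetch group size 1 */
#define TIMEOUT_INSTRUCTIONS 32768   /* 32K-instruction period from the slide */

static int  top_group = 0;
static long insts_since_switch = 0;

/* Called once per fetched instruction (illustrative hook): if the
   fetch group priority has not changed for 32K instructions, force
   a rotation so one large warp cannot starve the others. */
void note_fetched_instruction(int priority_changed_naturally) {
    if (priority_changed_naturally) { insts_since_switch = 0; return; }
    if (++insts_since_switch >= TIMEOUT_INSTRUCTIONS) {
        top_group = (top_group + 1) % NUM_GROUPS;
        insts_since_switch = 0;
    }
}

int main(void) {
    for (long i = 0; i < TIMEOUT_INSTRUCTIONS; i++)
        note_fetched_instruction(0);
    printf("top group after timeout: %d\n", top_group);  /* prints 1 */
    return 0;
}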
Methodology
Simulated a single GPU core with 1024 thread contexts, divided into 32 warps of 32 threads each
Scalar Front End: 1-wide fetch and decode; 4KB single-ported I-cache; round-robin scheduling
SIMD Back End: in-order, 5 stages, 32 parallel SIMD lanes
Register File and On-Chip Memories: 64KB register file; 128KB, 4-way D-cache with 128B line size; 128KB, 32-banked private memory
Memory System: open-row, first-come first-served scheduling; 8 banks with a 4KB row buffer per bank; 100-cycle row hit latency, 300-cycle row conflict latency; 32 GB/s memory bandwidth
Overall IPC Results
[Figure: IPC per benchmark for Baseline, TBC (Thread Block Compaction), LWM, 2Lev, and LWM+2Lev; y-axes range from 0 to 35 IPC (0.0 to 0.6 for the low-IPC benchmarks).]
LWM+2Lev improves performance by 19.1% over baseline and by 11.5% over TBC
IPC and Computational Resource Utilization
[Figures: IPC and computational resource utilization for blackjack and for histogram under baseline, LWM, 2LEV, and LWM+2LEV. Utilization is broken down by active SIMD lanes per cycle: 32, 24 to 31, 16 to 23, 8 to 15, 1 to 7, and 0.]
Conclusion
For maximum performance, the computational resources on GPUs must be effectively utilized
Branch divergence and long latency operations cause them to be underutilized or unused
We proposed two mechanisms to alleviate this:
Large Warp Microarchitecture for branch divergence
Two-level scheduling for long latency operations
Combined, they improve performance by 19.1% over traditional GPU cores
Increase the scope of applications that can run efficiently on a GPU
Questions?