
Improving GPU Performance via Large Warps and Two-Level Warp Scheduling

Veynu Narasiman

The University of Texas at Austin

Michael Shebanow

NVIDIA

Chang Joo Lee

Intel

Rustam Miftakhutdinov

The University of Texas at Austin

Onur Mutlu

Carnegie Mellon University

Yale N. Patt

The University of Texas at Austin

MICRO-44

December 6th, 2011

Porto Alegre, Brazil

Rise of GPU Computing

 GPUs have become a popular platform for general purpose applications

 New Programming Models

 CUDA

 ATI Stream Technology

 OpenCL

 Order of magnitude speedup over single-threaded CPU

How GPUs Exploit Parallelism

 Multiple GPU cores (i.e., Streaming Multiprocessors)

 Focus on a single GPU core

 Exploit parallelism in 2 major ways:

 Threads grouped into warps

Single PC per warp

Warps executed in SIMD fashion

 Multiple warps concurrently executed

 Round-robin scheduling

 Helps hide long latencies (a sketch of this baseline follows below)
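The sketch below is a minimal model of the baseline just described, not the authors' simulator: each warp is a group of threads sharing one PC and an active mask, and the fetch stage picks warps in round-robin order, skipping stalled warps. The names Warp and round_robin_fetch are invented for illustration.

    # Minimal sketch of the baseline described above (illustrative only; the
    # Warp class and round_robin_fetch are invented names, not the real design).

    class Warp:
        """A group of threads sharing one PC, executed in SIMD fashion."""
        def __init__(self, warp_id, num_threads=32):
            self.warp_id = warp_id
            self.pc = 0
            self.active_mask = [1] * num_threads   # one bit per SIMD lane
            self.stalled = False                   # e.g., waiting on memory

    def round_robin_fetch(warps, last_fetched):
        """Return the index of the next non-stalled warp after last_fetched."""
        n = len(warps)
        for offset in range(1, n + 1):
            idx = (last_fetched + offset) % n
            if not warps[idx].stalled:
                return idx
        return None   # every warp is stalled, so the core idles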

The Problem

 Despite these techniques, computational resources can still be underutilized

 Two reasons for this:

 Branch divergence

 Long latency operations

Branch Divergence

[Diagram: a branch at block A (active mask 1111) diverges; the taken path B executes with mask 1001, the not-taken path C with mask 0110, and the paths reconverge at block D with mask 1111. A per-warp divergence stack records, for each pending path, the reconvergence PC, the active mask, and the execute PC (here D / 0110 / C on top of D / 1111 / D), and the current PC and current active mask are updated from the top entry. A code sketch follows below.]
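The sketch below mimics the stack behavior in the diagram above; it is a simplified model, not the exact hardware design. Masks are 0/1 lists, and the function names are invented for illustration.

    # Illustrative per-warp divergence stack (simplified; masks are 0/1 lists).

    def diverge(stack, reconv_pc, current_mask, taken_pc, taken_mask, not_taken_pc):
        """Push entries for a divergent branch and return the path to run first.
        Each entry mirrors the diagram: (reconvergence PC, active mask, execute PC)."""
        not_taken_mask = [c & (1 - t) for c, t in zip(current_mask, taken_mask)]
        # Bottom entry: once both paths finish, resume at the reconvergence
        # point with the full mask.
        stack.append((reconv_pc, current_mask, reconv_pc))
        # Top entry: the not-taken path, executed after the taken path completes.
        stack.append((reconv_pc, not_taken_mask, not_taken_pc))
        return taken_pc, taken_mask

    # Example from the diagram: A (mask 1111) branches, taken -> B (1001),
    # not taken -> C (0110), reconverging at D (1111).
    stack = []
    pc, mask = diverge(stack, "D", [1, 1, 1, 1], "B", [1, 0, 0, 1], "C")
    # Executes B with mask 1001; popping the stack later yields C with 0110,
    # and finally D with the full mask 1111.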

Long Latency Operations

[Timeline (round-robin scheduling, 16 total warps): all warps compute together, then warps 0 through 15 issue their memory requests and stall at nearly the same time, leaving the core idle until the requests return and all warps compute again.]

Computational Resource Utilization

[Chart per benchmark: fraction of cycles in which 32 (good), 24 to 31, 16 to 23, 8 to 15, 1 to 7, or 0 (bad) of the SIMD lanes are active. Configuration: 32 warps, 32 threads per warp, SIMD width = 32, round-robin scheduling.]

Large Warp Microarchitecture (LWM)

 Alleviates branch divergence

 Fewer, but larger warps

 Warp size much greater than SIMD width

 Total thread count and SIMD-width stay the same

 Dynamically breaks down large warp into sub-warps

 Can be executed on existing SIMD pipeline

 Rearrange active mask as 2D structure

 Number of columns = SIMD width

 Search each column for an active thread to create new sub-warp

Large Warp Microarchitecture Example

[Animation: in the decode stage, the large warp's two-dimensional active mask is scanned one column at a time; one active thread is picked from each column to form sub-warp 0 with mask 1 1 1 1, and the process repeats until every active thread has been packed into a sub-warp. A code sketch of this packing follows below.]
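As a rough software model of the packing shown above, the sketch below treats the large warp's active mask as a rows-by-SIMD-width grid and pulls one active thread from each column per sub-warp. In hardware one sub-warp is formed per cycle; here all sub-warps are computed at once for clarity, and create_sub_warps is an invented name.

    # Sketch of dynamic sub-warping (illustrative, not RTL).

    def create_sub_warps(active_mask_2d):
        """active_mask_2d: rows x SIMD-width grid of 0/1 bits for a large warp.
        Returns one SIMD-width-wide mask per sub-warp."""
        rows, width = len(active_mask_2d), len(active_mask_2d[0])
        remaining = [row[:] for row in active_mask_2d]   # bits cleared as threads are packed
        sub_warps = []
        while any(any(row) for row in remaining):
            sub_mask = [0] * width
            for col in range(width):                     # one thread per column (SIMD lane)
                for row in range(rows):
                    if remaining[row][col]:
                        remaining[row][col] = 0
                        sub_mask[col] = 1
                        break
            sub_warps.append(sub_mask)
        return sub_warps

    # Example: a 16-thread large warp on a 4-wide SIMD pipeline.
    mask = [[1, 0, 1, 1],
            [1, 1, 0, 0],
            [0, 1, 1, 0],
            [1, 0, 0, 1]]
    print(create_sub_warps(mask))   # 3 sub-warps (two full, one partial)
                                    # instead of 4 partially filled rows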

More Large Warp Microarchitecture

 Divergence stack still used

 Handled at the large warp level

 How large should we make the warps?

 More threads per warp → more potential for sub-warp creation

 Too large a warp size can degrade performance

 Re-fetch policy for conditional branches

 Must wait till last sub-warp finishes

 Optimization for unconditional branch instructions

 Don't create multiple sub-warps

 Sub-warping always completes in a single cycle

Two Level Round Robin Scheduling

 Split warps into equal sized fetch groups

 Create initial priority among the fetch groups

 Round-robin scheduling among warps in same fetch group

 When all warps in the highest priority fetch group are stalled

 Rotate fetch group priorities

 Highest priority fetch group becomes the lowest priority

 Warps reach stalling points at slightly different times

 Better overlap of computation and memory latency (a scheduler sketch follows below)
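A minimal sketch of this two-level policy, assuming warp objects with a stalled flag like the earlier Warp sketch; the TwoLevelScheduler class and its methods are invented names, not the hardware interface.

    # Sketch of two-level round-robin scheduling (illustrative only).

    class TwoLevelScheduler:
        def __init__(self, warps, fetch_group_size):
            self.groups = [warps[i:i + fetch_group_size]
                           for i in range(0, len(warps), fetch_group_size)]
            self.group_priority = list(range(len(self.groups)))   # index 0 = highest
            self.last_fetched = [-1] * len(self.groups)

        def next_warp(self):
            """Return the next warp to fetch from, or None if every warp is stalled."""
            for _ in range(len(self.group_priority)):
                g = self.group_priority[0]                 # highest-priority fetch group
                group = self.groups[g]
                for offset in range(1, len(group) + 1):    # round-robin inside the group
                    idx = (self.last_fetched[g] + offset) % len(group)
                    if not group[idx].stalled:
                        self.last_fetched[g] = idx
                        return group[idx]
                # All warps in the highest-priority group are stalled:
                # rotate so this group becomes the lowest priority.
                self.group_priority.append(self.group_priority.pop(0))
            return None

    # Usage with the earlier Warp sketch (any object with a .stalled flag works):
    #   sched = TwoLevelScheduler([Warp(i) for i in range(32)], fetch_group_size=8)
    #   w = sched.next_warp()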

Round Robin vs Two Level Round Robin

[Timelines comparing the two policies. Round-robin scheduling (16 total warps): all warps compute together, then requests for warps 0 through 15 are issued at once and the core sits idle while the memory system services them. Two-level round-robin scheduling (2 fetch groups, 8 warps each): fetch group 0 computes and issues requests for warps 0 through 7 while fetch group 1 is still computing; group 1's requests for warps 8 through 15 then overlap with group 0's next compute phase, saving cycles.]

More on Two Level Scheduling

 What should the fetch group size be?

 Enough warps to keep pipeline busy in the absence of long latency stalls

 Too small

Uneven progression of warps in the same fetch group

Destroys data locality among warps

 Too large

 Reduces benefits of two-level scheduling

 More warps stall at the same time

 Not just for hiding memory latency

 Complex instructions (e.g., sine, cosine, sqrt, etc.)

 Two-level scheduling allows warps to arrive at such instructions at slightly different points in time

Combining LWM and Two Level Scheduling

 4 large warps, 256 threads each

 Fetch group size = 1 large warp

 Problematic for applications with few long latency stalls

 No stalls → no fetch group priority changes

Single large warp starved

Branch re-fetch policy for large warps → bubbles in pipeline

 Timeout invoked fetch group priority change

 32K instruction timeout period

 Alleviates starvation (a sketch of this timeout follows below)
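A sketch of the timeout-invoked priority change, under the stated assumption that a counter tracks instructions fetched since the last rotation and forces a rotation every 32K instructions. TimeoutRotator and its method names are invented, and it reuses the group_priority list from the scheduler sketch above.

    # Sketch of the timeout-invoked fetch group priority change (illustrative;
    # assumes the counter tracks instructions fetched since the last rotation).

    TIMEOUT_INSTRUCTIONS = 32 * 1024

    class TimeoutRotator:
        def __init__(self, scheduler, timeout=TIMEOUT_INSTRUCTIONS):
            self.scheduler = scheduler        # e.g., the TwoLevelScheduler sketch above
            self.timeout = timeout
            self.since_rotation = 0

        def on_instruction_fetched(self):
            self.since_rotation += 1
            if self.since_rotation >= self.timeout:
                # Rotate even though no stall occurred, so a never-stalling
                # large warp cannot starve the others of fetch bandwidth.
                pr = self.scheduler.group_priority
                pr.append(pr.pop(0))
                self.since_rotation = 0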

Methodology

Simulated a single GPU core with 1024 thread contexts divided into 32 warps, each with 32 threads

Scalar Front End: 1-wide fetch and decode; 4KB single-ported I-Cache; round-robin scheduling

SIMD Back End: in order, 5 stages, 32 parallel SIMD lanes

Register File and On-Chip Memories: 64KB register file; 128KB, 4-way D-Cache with 128B line size; 128KB, 32-banked private memory

Memory System: open-row, first-come first-serve scheduling; 8 banks, 4KB row buffer per bank; 100-cycle row hit latency, 300-cycle row conflict latency; 32 GB/s memory bandwidth
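For reference, the sketch below collects the simulated configuration from the table above into a plain Python dictionary; the key names are invented, and the values are taken directly from the slide.

    # Simulated configuration (values from the table above; key names invented).
    GPU_CORE_CONFIG = {
        "thread_contexts": 1024,                    # 32 warps x 32 threads
        "scalar_front_end": {
            "fetch_decode_width": 1,
            "icache": "4KB, single ported",
            "scheduling": "round-robin",
        },
        "simd_back_end": {
            "pipeline": "in order, 5 stages",
            "simd_lanes": 32,
        },
        "register_file_and_on_chip_memories": {
            "register_file": "64KB",
            "dcache": "128KB, 4-way, 128B lines",
            "private_memory": "128KB, 32 banks",
        },
        "memory_system": {
            "scheduling": "open row, first-come first-serve",
            "banks": 8,
            "row_buffer_per_bank": "4KB",
            "row_hit_latency_cycles": 100,
            "row_conflict_latency_cycles": 300,
            "bandwidth": "32 GB/s",
        },
    }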

Overall IPC Results

[Charts: IPC per benchmark for Baseline, TBC, LWM, 2Lev, and LWM+2Lev.]

LWM+2Lev improves performance by 19.1% over baseline and by 11.5% over TBC

IPC and Computational Resource Utilization

[Charts for blackjack and histogram: IPC and computational resource utilization (fraction of cycles with 32, 24 to 31, 16 to 23, 8 to 15, 1 to 7, or 0 active lanes) for baseline, LWM, 2LEV, and LWM+2LEV.]

Conclusion

 For maximum performance, the computational resources on GPUs must be effectively utilized

 Branch divergence and long latency operations cause them to be underutilized or unused

 We proposed two mechanisms to alleviate this

 Large Warp Microarchitecture for branch divergence

 Two-level scheduling for long latency operations

 Improves performance by 19.1% over traditional GPU cores

 Increases scope of applications that can run efficiently on a GPU

 Questions
