GPU VSIPL: High Performance VSIPL Implementation for GPUs Andrew Kerr, Dan Campbell*,

advertisement

GPU VSIPL: High Performance

VSIPL Implementation for GPUs

Andrew Kerr, Dan Campbell*,

Mark Richards, Mike Davis andrew.kerr@gtri.gatech.edu

, dan.campbell@gtri.gatech.edu

, mark.richards@ece.gatech.edu

, mike.davis@gtri.gatech.edu

High Performance Embedded Computing (HPEC)

Workshop

Distribution Statement (A): Approved for public release; distribution is unlimited

24 September 2008

This work was supported in part by DARPA and AFRL under contracts FA8750-06-1-0012 and FA8650-07-C-

7724. The opinions expressed are those of the authors.

GTRI_B‹#›

1

Signal Processing on Graphics Processors

• GPUs original role: turn 3-D polygons into 2D pixels…

• …Which also makes them cheap & plentiful source of FLOPs

• Leverages volume & competition in entertainment industry

Primary role highly parallel, very regular

• Typically <$500 drop-in addition to standard PC

• Outstripping CPU capacity, and growing more quickly

• Peak theoretical ~1TFlop

Power draw: 280GTX = 200W Q6600 = 100W

• Still making improvements in market app with more parallelism, so growth continues

GTRI_B‹#›

2

1000

100

10

GPU/CPU Performance Growth

CPU & GPU Capacity Growth

1171

914

244

122

60

85

24

12

13

ATI NVIDIA Intel x86 "Moore's Law"

GTRI_B‹#›

3

GPGPU (Old) Concept of Operations

“A=B+C” void main(float2 tc0 : TEXCOORD0, out float4 col : COLOR, uniform samplerRECT B, uniform samplerRECT C)

{ col = texRECT (B, tc0) + texRECT (C, tc0);

}

• Arrays  Textures

• Render polygon with the same pixel dimensions as output texture

• Execute with fragment program to perform desired calculation

• Move data from output buffer to desired texture

Now we have computecentric programming models…

… But they require expertise to fully exploit

GTRI_B‹#›

4

VSIPL - Vector Signal Image Processing Library

• Portable API for linear algebra, image & signal processing

Originally sponsored by DARPA in mid ’90s

Targeted embedded processors – portability primary aim

Open standard, Forum-based

Initial API approved April 2000

• Functional coverage

Vector, Matrix, Tensor

Basic math operations, linear algebra, solvers, FFT, FIR/IIR, bookkeeping, etc

GTRI_B‹#›

5

VSIPL & GPU: Well Matched

• VSIPL is great for exploiting GPUs

High level API with good coverage for dense linear algebra

Allows non experts to benefit from hero programmers

Explicit memory access controls

API precision flexibility

• GPUs are great for VSIPL

Improves prototyping by speeding algorithm testing

Cheap addition allows more engineers access to HPC

Large speedups without needing explicit parallelism at application level

GTRI_B‹#›

6

GPU-VSIPL Implementation

• Full, compliant implementation of VSIPL Core-Lite Profile

Fully encapsulated CUDA backend

Leverages CUFFT library

All VSIPL functions accelerated

Core Lite Profile:

Single precision floating point, some basic integer

Vector & Sxalar, complex & real support

Basic elementwise, FFT, FIR, histogram, RNG, support

Full list: http://www.vsipl.org/coreliteprofile.pdf

• Also, some matrix support, including vsip_fftm_f

GTRI_B‹#›

7

CUDA Programming & Optimization

CUDA Programming Model

Grid

Block

Block

Block

CUDA Optimization Considerations

Datapath

Thread

Datapath

Thread

Memory

Shared

Memory

Register

File

Maximize occupancy to hide memory latency

Keep lots of threads in flight

Carefully manage memory access to allow coalesce & avoid conflicts

Avoid slow operations (e.g integer multiply for indexing)

Minimize synch barriers

Careful loop unrolling

Hoist loop invariants

Reduce register use for greater occupancy

Device Memory

• “GPU Performance Assessment with the HPEC Challenge ” – Thursday PM

CPU

Dispatch

Host Memory

GTRI_B‹#›

8

10

1

0,1

GPU VSIPL Speedup: Unary

1000

100 nVidia 8800GTX

Vs

Intel Q6600

320X

20X

Vector Length vcos vexp vlog vsqrt vmag vsq cvconj

GTRI_B‹#›

9

GPU VSIPL Speedup: Binary

100 nVidia 8800GTX

Vs

Intel Q6600

10 25X

40X

1

0,1

0,01 vmul vadd vdiv

Vector Length vsub cvadd cvsub cvmul cvjmul

GTRI_B‹#›

10

0,1

0,01

GPU VSIPL Speedup: FFT

100

10 nVidia 8800GTX

Vs

Intel Q6600

83X

39X

1

Vector Length

Real to Complex Complex to Complex

GTRI_B‹#›

11

GPU VSIPL Speedup: FIR

Filter Length 1024

10000

1000 nVidia 8800GTX

Vs

Intel Q6600

157X

100

10

1

32768

GPU (r)

65536

GPU (c)

131072 262144 524288

Input Vector Length

TASP (r) TASP (c) Speedup (r)

1048576 2097152

Speedup (c)

GTRI_B‹#›

12

Application Example: Range Doppler Map

• Simple Range/Doppler data visualization application demo

• Intro app for new VSIPL programmer

• 59x Speedup TASP  GPU-VSIPL

• No changes to source code

Section

Admit

Baseband

Zeropad

Fast time FFT

Multiply

Fast Time FFT -1

Slow time FFT, 2x CT log10 |.| 2

Release

Total:

9800GX2

Time (ms)

8.88

67.77

23.18

47.25

8.11

48.59

12.89

22.2

54.65

293.52

Q6600

Time (ms) Speedup

0 0

1872.3

110.71

5696.3

33.92

28

5

121

4

5729.04

3387

470.15

0

17299.42

118

263

21

0

59

GTRI_B‹#›

13

GPU-VSIPL: Future Plans

• Expand matrix support

Move toward full Core Profile

More linear algebra/solvers

VSIPL++

Double precision support

GTRI_B‹#›

14

Conclusions

GPUs are fast, cheap signal processors

VSIPL is a portable, intuitive means to exploit GPUs

GPU-VSIPL allows easy access to GPU performance without becoming an expert CUDA/GPU programer

• 10-100x speed improvement possible with no code change

• Not yet released, but unsupported previews may show up at: http://gpu-vsipl.gtri.gatech.edu

GTRI_B‹#›

15

Download