GPU VSIPL: High Performance VSIPL Implementation for GPUs Andrew Kerr, Dan Campbell*,

GPU VSIPL: High Performance

VSIPL Implementation for GPUs

Andrew Kerr, Dan Campbell*,

Mark Richards, Mike Davis andrew.kerr@gtri.gatech.edu

, dan.campbell@gtri.gatech.edu

, mark.richards@ece.gatech.edu

, mike.davis@gtri.gatech.edu

High Performance Embedded Computing (HPEC)

Workshop

Distribution Statement (A): Approved for public release; distribution is unlimited

24 September 2008

This work was supported in part by DARPA and AFRL under contracts FA8750-06-1-0012 and FA8650-07-C-

7724. The opinions expressed are those of the authors.

GTRI_B‹#›

1

Signal Processing on Graphics Processors

• GPUs original role: turn 3-D polygons into 2D pixels…

• …Which also makes them cheap & plentiful source of FLOPs

• Leverages volume & competition in entertainment industry

•

Primary role highly parallel, very regular

• Typically <$500 drop-in addition to standard PC

• Outstripping CPU capacity, and growing more quickly

•

• Peak theoretical ~1TFlop

Power draw: 280GTX = 200W Q6600 = 100W

• Still making improvements in market app with more parallelism, so growth continues

GTRI_B‹#›

2

1000

100

10

GPU/CPU Performance Growth

CPU & GPU Capacity Growth

1171

914

244

122

60

85

24

12

13

ATI NVIDIA Intel x86 "Moore's Law"

GTRI_B‹#›

3

GPGPU (Old) Concept of Operations

“A=B+C” void main(float2 tc0 : TEXCOORD0, out float4 col : COLOR, uniform samplerRECT B, uniform samplerRECT C)

{ col = texRECT (B, tc0) + texRECT (C, tc0);

}

• Arrays  Textures

• Render polygon with the same pixel dimensions as output texture

• Execute with fragment program to perform desired calculation

• Move data from output buffer to desired texture

Now we have computecentric programming models…

… But they require expertise to fully exploit

GTRI_B‹#›

4

VSIPL - Vector Signal Image Processing Library

•

•

•

•

• Portable API for linear algebra, image & signal processing

Originally sponsored by DARPA in mid ’90s

Targeted embedded processors – portability primary aim

Open standard, Forum-based

Initial API approved April 2000

• Functional coverage

•

•

Vector, Matrix, Tensor

Basic math operations, linear algebra, solvers, FFT, FIR/IIR, bookkeeping, etc

GTRI_B‹#›

5

VSIPL & GPU: Well Matched

• VSIPL is great for exploiting GPUs

•

•

•

•

High level API with good coverage for dense linear algebra

Allows non experts to benefit from hero programmers

Explicit memory access controls

API precision flexibility

• GPUs are great for VSIPL

•

•

•

Improves prototyping by speeding algorithm testing

Cheap addition allows more engineers access to HPC

Large speedups without needing explicit parallelism at application level

GTRI_B‹#›

6

GPU-VSIPL Implementation

•

• Full, compliant implementation of VSIPL Core-Lite Profile

•

Fully encapsulated CUDA backend

•

•

Leverages CUFFT library

All VSIPL functions accelerated

Core Lite Profile:

•

•

•

•

Single precision floating point, some basic integer

Vector & Sxalar, complex & real support

Basic elementwise, FFT, FIR, histogram, RNG, support

Full list: http://www.vsipl.org/coreliteprofile.pdf

• Also, some matrix support, including vsip_fftm_f

GTRI_B‹#›

7

CUDA Programming & Optimization

CUDA Programming Model

Grid

Block

Block

Block

CUDA Optimization Considerations

Datapath

Thread

Datapath

Thread

Memory

Shared

Memory

Register

File

•

•

•

•

•

•

•

•

Maximize occupancy to hide memory latency

Keep lots of threads in flight

Carefully manage memory access to allow coalesce & avoid conflicts

Avoid slow operations (e.g integer multiply for indexing)

Minimize synch barriers

Careful loop unrolling

Hoist loop invariants

Reduce register use for greater occupancy

Device Memory

• “GPU Performance Assessment with the HPEC Challenge ” – Thursday PM

CPU

Dispatch

Host Memory

GTRI_B‹#›

8

10

1

0,1

GPU VSIPL Speedup: Unary

1000

100 nVidia 8800GTX

Vs

Intel Q6600

320X

20X

Vector Length vcos vexp vlog vsqrt vmag vsq cvconj

GTRI_B‹#›

9

GPU VSIPL Speedup: Binary


Vs

Intel Q6600

10 25X

40X

1

0,1

0,01 vmul vadd vdiv

Vector Length vsub cvadd cvsub cvmul cvjmul

GTRI_B‹#›

10

0,1

0,01

GPU VSIPL Speedup: FFT

100


Vs

Intel Q6600

83X

39X

1

Vector Length

Real to Complex Complex to Complex

GTRI_B‹#›

11

GPU VSIPL Speedup: FIR

Filter Length 1024

10000

1000 nVidia 8800GTX

Vs

Intel Q6600

157X

100

10

1

32768

GPU (r)

65536

GPU (c)

131072 262144 524288

Input Vector Length

TASP (r) TASP (c) Speedup (r)

1048576 2097152

Speedup (c)

GTRI_B‹#›

12

Application Example: Range Doppler Map

• Simple Range/Doppler data visualization application demo

• Intro app for new VSIPL programmer

• 59x Speedup TASP  GPU-VSIPL

• No changes to source code

Section

Admit

Baseband

Zeropad

Fast time FFT

Multiply

Fast Time FFT -1

Slow time FFT, 2x CT log10 |.| 2

Release

Total:

9800GX2

Time (ms)

8.88

67.77

23.18

47.25

8.11

48.59

12.89

22.2

54.65

293.52

Q6600

Time (ms) Speedup

0 0

1872.3

110.71

5696.3

33.92

28

5

121

4

5729.04

3387

470.15

0

17299.42

118

263

21

0

59

GTRI_B‹#›

13

GPU-VSIPL: Future Plans

•

•

•

•

• Expand matrix support

Move toward full Core Profile

More linear algebra/solvers

VSIPL++

Double precision support

GTRI_B‹#›

14

Conclusions

•

•

•

GPUs are fast, cheap signal processors

VSIPL is a portable, intuitive means to exploit GPUs

GPU-VSIPL allows easy access to GPU performance without becoming an expert CUDA/GPU programer

• 10-100x speed improvement possible with no code change

• Not yet released, but unsupported previews may show up at: http://gpu-vsipl.gtri.gatech.edu

GTRI_B‹#›

15

GPU VSIPL: High Performance VSIPL Implementation for GPUs Andrew Kerr, Dan Campbell*,

GPU VSIPL: High Performance

VSIPL Implementation for GPUs

Related documents

Products

Support

GPU VSIPL: High Performance VSIPL Implementation for GPUs Andrew Kerr, Dan Campbell*,