Andrew Kerr, Dan Campbell*,
Mark Richards, Mike Davis andrew.kerr@gtri.gatech.edu
, dan.campbell@gtri.gatech.edu
, mark.richards@ece.gatech.edu
, mike.davis@gtri.gatech.edu
High Performance Embedded Computing (HPEC)
Workshop
Distribution Statement (A): Approved for public release; distribution is unlimited
24 September 2008
This work was supported in part by DARPA and AFRL under contracts FA8750-06-1-0012 and FA8650-07-C-
7724. The opinions expressed are those of the authors.
GTRI_B‹#›
1
Signal Processing on Graphics Processors
• GPUs original role: turn 3-D polygons into 2D pixels…
• …Which also makes them cheap & plentiful source of FLOPs
• Leverages volume & competition in entertainment industry
•
Primary role highly parallel, very regular
• Typically <$500 drop-in addition to standard PC
• Outstripping CPU capacity, and growing more quickly
•
• Peak theoretical ~1TFlop
Power draw: 280GTX = 200W Q6600 = 100W
• Still making improvements in market app with more parallelism, so growth continues
GTRI_B‹#›
2
1000
100
10
GPU/CPU Performance Growth
CPU & GPU Capacity Growth
1171
914
244
122
60
85
24
12
13
ATI NVIDIA Intel x86 "Moore's Law"
GTRI_B‹#›
3
GPGPU (Old) Concept of Operations
“A=B+C” void main(float2 tc0 : TEXCOORD0, out float4 col : COLOR, uniform samplerRECT B, uniform samplerRECT C)
{ col = texRECT (B, tc0) + texRECT (C, tc0);
}
• Arrays Textures
• Render polygon with the same pixel dimensions as output texture
• Execute with fragment program to perform desired calculation
• Move data from output buffer to desired texture
Now we have computecentric programming models…
… But they require expertise to fully exploit
GTRI_B‹#›
4
VSIPL - Vector Signal Image Processing Library
•
•
•
•
• Portable API for linear algebra, image & signal processing
Originally sponsored by DARPA in mid ’90s
Targeted embedded processors – portability primary aim
Open standard, Forum-based
Initial API approved April 2000
• Functional coverage
•
•
Vector, Matrix, Tensor
Basic math operations, linear algebra, solvers, FFT, FIR/IIR, bookkeeping, etc
GTRI_B‹#›
5
VSIPL & GPU: Well Matched
• VSIPL is great for exploiting GPUs
•
•
•
•
High level API with good coverage for dense linear algebra
Allows non experts to benefit from hero programmers
Explicit memory access controls
API precision flexibility
• GPUs are great for VSIPL
•
•
•
Improves prototyping by speeding algorithm testing
Cheap addition allows more engineers access to HPC
Large speedups without needing explicit parallelism at application level
GTRI_B‹#›
6
GPU-VSIPL Implementation
•
• Full, compliant implementation of VSIPL Core-Lite Profile
•
Fully encapsulated CUDA backend
•
•
Leverages CUFFT library
All VSIPL functions accelerated
Core Lite Profile:
•
•
•
•
Single precision floating point, some basic integer
Vector & Sxalar, complex & real support
Basic elementwise, FFT, FIR, histogram, RNG, support
Full list: http://www.vsipl.org/coreliteprofile.pdf
• Also, some matrix support, including vsip_fftm_f
GTRI_B‹#›
7
CUDA Programming & Optimization
CUDA Programming Model
Grid
Block
Block
Block
CUDA Optimization Considerations
Datapath
Thread
Datapath
Thread
Memory
Shared
Memory
Register
File
•
•
•
•
•
•
•
•
Maximize occupancy to hide memory latency
Keep lots of threads in flight
Carefully manage memory access to allow coalesce & avoid conflicts
Avoid slow operations (e.g integer multiply for indexing)
Minimize synch barriers
Careful loop unrolling
Hoist loop invariants
Reduce register use for greater occupancy
Device Memory
• “GPU Performance Assessment with the HPEC Challenge ” – Thursday PM
CPU
Dispatch
Host Memory
GTRI_B‹#›
8
10
1
0,1
GPU VSIPL Speedup: Unary
1000
100 nVidia 8800GTX
Vs
Intel Q6600
320X
20X
Vector Length vcos vexp vlog vsqrt vmag vsq cvconj
GTRI_B‹#›
9
GPU VSIPL Speedup: Binary
100 nVidia 8800GTX
Vs
Intel Q6600
10 25X
40X
1
0,1
0,01 vmul vadd vdiv
Vector Length vsub cvadd cvsub cvmul cvjmul
GTRI_B‹#›
10
0,1
0,01
GPU VSIPL Speedup: FFT
100
10 nVidia 8800GTX
Vs
Intel Q6600
83X
39X
1
Vector Length
Real to Complex Complex to Complex
GTRI_B‹#›
11
GPU VSIPL Speedup: FIR
Filter Length 1024
10000
1000 nVidia 8800GTX
Vs
Intel Q6600
157X
100
10
1
32768
GPU (r)
65536
GPU (c)
131072 262144 524288
Input Vector Length
TASP (r) TASP (c) Speedup (r)
1048576 2097152
Speedup (c)
GTRI_B‹#›
12
Application Example: Range Doppler Map
• Simple Range/Doppler data visualization application demo
• Intro app for new VSIPL programmer
• 59x Speedup TASP GPU-VSIPL
• No changes to source code
Section
Admit
Baseband
Zeropad
Fast time FFT
Multiply
Fast Time FFT -1
Slow time FFT, 2x CT log10 |.| 2
Release
Total:
9800GX2
Time (ms)
8.88
67.77
23.18
47.25
8.11
48.59
12.89
22.2
54.65
293.52
Q6600
Time (ms) Speedup
0 0
1872.3
110.71
5696.3
33.92
28
5
121
4
5729.04
3387
470.15
0
17299.42
118
263
21
0
59
GTRI_B‹#›
13
GPU-VSIPL: Future Plans
•
•
•
•
• Expand matrix support
Move toward full Core Profile
More linear algebra/solvers
VSIPL++
Double precision support
GTRI_B‹#›
14
Conclusions
•
•
•
GPUs are fast, cheap signal processors
VSIPL is a portable, intuitive means to exploit GPUs
GPU-VSIPL allows easy access to GPU performance without becoming an expert CUDA/GPU programer
• 10-100x speed improvement possible with no code change
• Not yet released, but unsupported previews may show up at: http://gpu-vsipl.gtri.gatech.edu
GTRI_B‹#›
15