Parallelizing and Optimizing Programs
for GPU Acceleration using CUDA
Martin Burtscher
Department of Computer Science
CUDA Optimization Tutorial
 Martin Burtscher
 burtscher@txstate.edu
 http://www.cs.txstate.edu/~burtscher/
 Tutorial slides
 http://www.cs.txstate.edu/~burtscher/tutorials/COT5/slides.pptx
High-end CPU-GPU Comparison

                          Intel Xeon E5-2687W    Nvidia Kepler GTX 680
  Cores                   8 (superscalar)        1536 (simple)
  Active threads          2 per core             ~11 per core
  Frequency               3.1 GHz                1.0 GHz
  Peak performance (SP)   397 GFlop/s            3090 GFlop/s
  Peak mem. bandwidth     51 GB/s                192 GB/s
  Maximum power           150 W                  195 W*
  Price                   $1900                  $500*
  Release date            March 2012             March 2012

  *entire card
GPU Advantages
 Performance
 8x as many operations executed per second
 Main memory bandwidth
 4x as many bytes transferred per second
 Cost-, energy-, and size-efficiency
 29x as much performance per dollar
 6x as much performance per watt
 11x as much performance per area
(based on peak values)
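For reference, the first four ratios can be reproduced directly from the peak values in the comparison table (the per-area figure additionally requires the die areas, which are not listed above):

  3090 / 397 GFlop/s ≈ 7.8 ≈ 8x as many operations per second
  192 / 51 GB/s ≈ 3.8 ≈ 4x as many bytes per second
  (3090 GFlop/s / $500) / (397 GFlop/s / $1900) ≈ 29x performance per dollar
  (3090 GFlop/s / 195 W) / (397 GFlop/s / 150 W) ≈ 6x performance per watt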
GPU Disadvantages
 Clearly, we should be using GPUs all the time
 So why aren’t we?
 GPUs can only execute some types of code fast
 Need lots of data parallelism, data reuse, & regularity
 GPUs are harder to program and tune than CPUs
 In part because of poor tool support
 In part because of their architecture
 In part because of poor support for irregular codes
Outline (Part I)
 Introduction
 GPU programming
 N-body example
 Porting and tuning
 Other considerations
 Conclusions
CUDA Programming Model
 Non-graphics programming
   Uses GPU as massively parallel co-processor
   (CPU and GPU connected via the PCI-Express bus)
 SIMT (single-instruction multiple-threads) model
   Thousands of threads needed for full efficiency
 C/C++ with extensions
   Function launch: calling functions on GPU
   Memory management: GPU memory allocation, copying data to/from GPU
   Declaration qualifiers: device, shared, local, etc.
   Special instructions: barriers, fences, etc.
   Keywords: threadIdx, blockIdx
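A minimal sketch (not from the original slides) illustrating the extensions listed above: a kernel qualifier, the built-in thread keywords, a launch configuration, and explicit memory management. The kernel and variable names are illustrative only.

#include <cuda_runtime.h>

__global__ void ScaleKernel(int n, float factor, float *data) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;  // built-in keywords
  if (i < n) data[i] *= factor;
}

int main() {
  const int n = 1 << 20;
  float *h = (float *)malloc(sizeof(float) * n);   // CPU (host) buffer
  for (int i = 0; i < n; i++) h[i] = 1.0f;
  float *d;
  cudaMalloc((void **)&d, sizeof(float) * n);      // GPU (device) buffer
  cudaMemcpy(d, h, sizeof(float) * n, cudaMemcpyHostToDevice);
  ScaleKernel<<<(n + 255) / 256, 256>>>(n, 2.0f, d);  // m blocks, 256 threads each
  cudaMemcpy(h, d, sizeof(float) * n, cudaMemcpyDeviceToHost);
  cudaFree(d);
  free(h);
  return 0;
}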
Calling GPU Kernels
 Kernels are functions that run on the GPU
 Callable by CPU code
 CPU can continue processing while GPU runs kernel
KernelName<<<m, n>>>(arg1, arg2, ...);
 Launch configuration (programmer selectable)
   GPU spawns m blocks with n threads (i.e., m*n threads total) that run a
   copy of the same function
 Normal function parameters: passed conventionally
   Different address space, so never pass CPU pointers
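The launch is asynchronous, as noted above: it returns immediately so the CPU can overlap its own work with the kernel. A hedged, self-contained sketch of that behavior (the kernel and names are illustrative only):

#include <cstdio>
#include <cuda_runtime.h>

__global__ void BusyKernel(int n, int *out) {
  // trivially occupy the GPU so the overlap below is observable
  int sum = 0;
  for (int i = 0; i < n; i++) sum += i;
  if (threadIdx.x == 0 && blockIdx.x == 0) *out = sum;
}

int main() {
  int *d;
  cudaMalloc((void **)&d, sizeof(int));
  BusyKernel<<<4, 256>>>(1 << 20, d);   // launch returns to the CPU at once
  printf("CPU keeps working while the kernel runs\n");  // overlapped CPU work
  cudaDeviceSynchronize();              // block here before using GPU results
  cudaFree(d);
  return 0;
}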
GPU Architecture
 GPUs consist of Streaming Multiprocessors (SMs)
   1 to 30 SMs per chip (run blocks)
 SMs contain Processing Elements (PEs)
   8, 32, or 192 PEs per SM (run threads)
 (Figure, adapted from NVIDIA: each SM has its own shared memory; all SMs
 share the chip-wide global memory)
Block Scalability
 Hardware can assign blocks to SMs in any order
   A kernel with enough blocks scales across GPUs
   Not all blocks may be resident at the same time
 (Figure, adapted from NVIDIA: the same 8-block kernel runs on a GPU with
 2 SMs two blocks at a time, and on a GPU with 4 SMs four blocks at a time)
GPU Memories
 Separate from CPU memory
   CPU can access GPU's global & constant mem. via PCIe bus
   Requires slow explicit transfer
 Visible GPU memory types
   Registers (per thread)
   Local mem. (per thread)
   Shared mem. (per block): software-controlled cache
   Global mem. (per kernel)
   Constant mem. (read only)
 Slow communication between blocks
 (Figure, adapted from NVIDIA: per-thread registers, per-block shared
 memory (SRAM), device-wide global + local memory (DRAM), and cached
 constant memory (DRAM); the CPU reaches global and constant memory
 over the PCIe bus)
SM Internals (Fermi and Kepler)
 Caches
 Software-controlled shared memory
 Hardware-controlled incoherent L1 data cache
 64 kB combined size, can be split 16/48, 32/32, 48/16
 Synchronization support
 Fast hardware barrier within block (__syncthreads())
 Fence instructions: memory consistency & coherency
 Special operations
 Thread voting (warp-based reduction operations)
Block and Thread Allocation Limits
 Blocks assigned to SMs
   Until first limit reached
 Threads assigned to PEs
 Hardware limits
   8/16 active blocks/SM
   1024, 1536, or 2048 resident threads/SM
   512 or 1024 threads/blk
   16k, 32k, or 64k regs/SM
   16 kB or 48 kB shared memory per SM
   2^16-1 or 2^31-1 blks/kernel
 (Figure, adapted from NVIDIA: blocks and their threads t0 t1 t2 ... tm
 are mapped onto the SMs and their shared memory)
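As a worked example of how these limits interact (using the numbers above): with 256 threads per block on a CC 2.0 device, the 1536-resident-thread limit allows 1536 / 256 = 6 blocks per SM, so the 8-block limit never binds; with only 64 threads per block, the 8-block limit binds first and just 8 * 64 = 512 threads are resident, well below the maximum.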
Warp-based Execution
 32 contiguous threads form a warp
 Execute same instruction in same cycle (or disabled)
 Warps are scheduled out-of-order with respect to
each other to hide latencies
 Thread divergence
 Some threads in warp jump to different PC than others
 Hardware runs subsets of warp until they re-converge
 Results in reduction of parallelism (performance loss)
Thread Divergence
 Non-divergent code (all 32 threads of the warp, thread IDs 0 to 31, take
 the same branch)

if (threadID >= 32) {
  some_code;
} else {
  other_code;
}

 Divergent code (threads 0-12 and 13-31 of the same warp take different
 branches, so each subset runs while the other is disabled)

if (threadID >= 13) {
  some_code;
} else {
  other_code;
}

 (Adapted from NVIDIA)
Parallel Memory Accesses
 Coalesced main memory access (16/32x faster)
   Under some conditions, HW combines multiple (half) warp memory
   accesses into a single coalesced access
     CC 1.3: 64-byte aligned 64-byte line (any permutation)
     CC 2.x+3.0: 128-byte aligned 128-byte line (cached)
 Bank-conflict-free shared memory access (16/32)
   No superword alignment or contiguity requirements
     CC 1.3: 16 different banks per half warp or same word
     CC 2.x+3.0: 32 different banks + 1-word broadcast each
Coalesced Main Memory Accesses
 (Figures from NVIDIA: an access pattern that results in a single coalesced
 access, and patterns that result in one or two coalesced accesses)
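A hedged sketch (not from the slides) contrasting a coalesced, stride-1 access pattern with a strided one. In CoalescedCopy, consecutive threads of a warp touch consecutive words of one aligned line, so the hardware can combine them into a single transaction; in StridedCopy, each thread touches a word in a different line, so the warp needs many separate transactions.

__global__ void CoalescedCopy(int n, const float *in, float *out) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;  // stride-1 across the warp
  if (i < n) out[i] = in[i];
}

__global__ void StridedCopy(int n, int stride, const float *in, float *out) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) out[i] = in[(long long)i * stride % n];  // scattered across lines
}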
Outline
 Introduction
 GPU programming
 N-body example
 Porting and tuning
 Other considerations
 Conclusions
N-body Simulation
 Time evolution of physical system
   System consists of bodies
   "n" is the number of bodies
   Bodies interact via pair-wise forces
 Many systems can be modeled in this way
   Star/galaxy clusters (gravitational force)
   Particles (electric force, magnetic force)
Simple N-body Algorithm
 Algorithm
Initialize body masses, positions, and velocities
Iterate over time steps {
Accumulate forces acting on each body
Update body positions and velocities based on force
}
Output result
 More sophisticated n-body algorithms exist
 Barnes Hut algorithm (covered in Part II)
 Fast Multipole Method (FMM)
Key Loops (Pseudo Code)

bodySet = ...;                         // input
for timestep do {                      // sequential
  foreach Body b1 in bodySet {         // O(n^2) parallel
    foreach Body b2 in bodySet {
      if (b1 != b2) {
        b1.addInteractionForce(b2);
      }
    }
  }
  foreach Body b in bodySet {          // O(n) parallel
    b.Advance();
  }
}
// output result
Force Calculation C Code
struct Body {
float mass, posx, posy, posz; // mass and 3D position
float velx, vely, velz, accx, accy, accz; // 3D velocity & accel
} *body;
for (i = 0; i < nbodies; i++) {
. . .
for (j = 0; j < nbodies; j++) {
if (i != j) {
dx = body[j].posx - px; // delta x
dy = body[j].posy - py; // delta y
dz = body[j].posz - pz; // delta z
dsq = dx*dx + dy*dy + dz*dz; // distance squared
dinv = 1.0f / sqrtf(dsq + epssq); // inverse distance
scale = body[j].mass * dinv * dinv * dinv; // scaled force
ax += dx * scale; // accumulate x contribution of accel
ay += dy * scale; az += dz * scale; // ditto for y and z
}
}
. . .
Outline
 Introduction
 GPU programming
 N-body example
 Porting and tuning
 Other considerations
 Conclusions
N-body Algorithm Suitability for GPU
 Lots of data parallelism
 Force calculations are independent
 Should be able to keep SMs and PEs busy
 Sufficient memory access regularity
 All force calculations access body data in same order*
 Should have lots of coalesced memory accesses
 Sufficient code regularity
 All force calculations are identical*
 There should be little thread divergence
 Plenty of data reuse
 O(n2) operations on O(n) data
 CPU/GPU transfer time is insignificant
C to CUDA Conversion
 Two CUDA kernels
   Force calculation
   Advance position and velocity
 Benefits
   Force calculation requires over 99.9% of runtime
     Primary target for acceleration
   Advancing kernel unimportant to runtime
     But allows keeping data on GPU during entire simulation
     Minimizes GPU/CPU transfers
C to CUDA Conversion
__global__ void ForceCalcKernel(int nbodies, struct Body *body, ...) {
. . .
}
__global__ void AdvancingKernel(int nbodies, struct Body *body, ...) {
. . .
}
Indicates GPU kernel that CPU can call
int main(...) {
Separate address spaces, need two pointers
Body *body, *bodyl;
Allocate memory on GPU
. . .
cudaMalloc((void**)&bodyl, sizeof(Body)*nbodies);
cudaMemcpy(bodyl, body, sizeof(Body)*nbodies, cuda…HostToDevice);
for (timestep = ...) {
ForceCalcKernel<<<1, 1>>>(nbodies, bodyl, ...); Copy CPU data to GPU
AdvancingKernel<<<1, 1>>>(nbodies, bodyl, ...);
}
cudaMemcpy(body, bodyl, sizeof(Body)*nbodies, cuda…DeviceToHost);
cudaFree(bodyl);
Copy GPU data back to CPU
. . .
Call GPU kernel with 1 block
and 1 thread per block
}
CUDA Optimization Tutorial
26
Evaluation Methodology
 Systems and compilers
   CC 1.3: Quadro FX 5800, nvcc 3.2
     30 SMs, 240 PEs, 1.3 GHz, 30720 resident threads
   CC 2.0: Tesla C2050, nvcc 3.2
     14 SMs, 448 PEs, 1.15 GHz, 21504 resident threads
   CC 3.0: GeForce GTX 680, nvcc 4.2
     8 SMs, 1536 PEs, 1.0 GHz, 16384 resident threads
 Inputs and metric
   1k, 10k, or 100k star clusters (Plummer model)
   Median runtime of three experiments, excluding I/O
1-Thread Performance
 Problem size
   CC 1.3: n=10000, step=1
   CC 2.0: n=10000, step=1
   CC 3.0: n=3000, step=1
 Slowdown rel. to CPU
   CC 1.3: 72.4
   CC 2.0: 36.7
   CC 3.0: 68.1
   (Note: comparing different GPUs to different CPUs)
 Performance
   1 thread is one to two orders of magnitude slower on GPU than CPU
 Reasons
   No caches (CC 1.3)
   Not superscalar
   Slower clock frequency
   No SMT latency hiding
Using N Threads
 Approach
   Eliminate outer loop
   Instantiate n copies of inner loop, one per body
 Threading
   Blocks can only hold 512 or 1024 threads
     Up to 8/16 blocks can be resident in an SM at a time
     SM can hold 1024, 1536, or 2048 threads
     We use 256 threads per block (works for all of our GPUs)
   Need multiple blocks
     Last block may not need all of its threads
Using N Threads

__global__ void ForceCalcKernel(int nbodies, struct Body *body, ...) {
  i = threadIdx.x + blockIdx.x * blockDim.x; // compute i (replaces outer loop)
  if (i < nbodies) { // in case last block is only partially used
    for (j = ...) {
      . . .
    }
  }
}
__global__ void AdvancingKernel(int nbodies, ...) // same changes

#define threads 256
int main(...) {
  . . .
  int blocks = (nbodies + threads - 1) / threads; // compute block cnt
  for (timestep = ...) {
    ForceCalcKernel<<<blocks, threads>>>(nbodies, bodyl, ...);
    AdvancingKernel<<<blocks, threads>>>(nbodies, bodyl, ...);
  }
}
N Thread Speedup
 Relative to 1 GPU thread
   CC 1.3: 7781 (240 PEs)
   CC 2.0: 6495 (448 PEs)
   CC 3.0: 12150 (1536 PEs)
 Relative to 1 CPU thread
   CC 1.3: 107.5
   CC 2.0: 176.7
   CC 3.0: 176.2
 Performance
   Speedup much higher than number of PEs (32, 14.5, and 7.9 times)
     Due to SMT latency hiding
 Per-core performance
   CPU core delivers up to 4.4, 5, and 8.7 times as much performance as a
   GPU core (PE)
Using Scalar Arrays
 Data structure conversion
   Arrays of structs are bad for coalescing
   Bodies' elements (e.g., mass fields) are not adjacent
 Optimize data structure
   Use multiple scalar arrays, one per field (need 10)
   Results in code bloat but often much better speed
 (Figure: the same bodies stored as structs in an array vs. as one scalar
 array per field)
Using Scalar Arrays

__global__ void ForceCalcKernel(int nbodies, float *mass, ...) {
  // change all "body[k].blah" to "blah[k]"
}
__global__ void AdvancingKernel(int nbodies, float *mass, ...) {
  // change all "body[k].blah" to "blah[k]"
}

int main(...) {
  float *mass, *posx, *posy, *posz, *velx, *vely, *velz, *accx, *accy, *accz;
  float *massl, *posxl, *posyl, *poszl, *velxl, *velyl, *velzl, ...;
  mass = (float *)malloc(sizeof(float) * nbodies); // etc.
  . . .
  cudaMalloc((void**)&massl, sizeof(float)*nbodies); // etc.
  cudaMemcpy(massl, mass, sizeof(float)*nbodies, cudaMemcpyHostToDevice); // etc.
  for (timestep = ...) {
    ForceCalcKernel<<<blocks, threads>>>(nbodies, massl, posxl, ...);
    AdvancingKernel<<<blocks, threads>>>(nbodies, massl, posxl, ...);
  }
  cudaMemcpy(mass, massl, sizeof(float)*nbodies, cudaMemcpyDeviceToHost); // etc.
  . . .
}
Scalar Array Speedup
 Problem size
   CC 1.3: n=100000, step=1
   CC 2.0: n=100000, step=1
   CC 3.0: n=300000, step=1
 Speedup relative to struct
   CC 1.3: 0.83
   CC 2.0: 0.96
   CC 3.0: 0.82
 Performance
   Threads access same memory locations, not adjacent ones
     Always combined but not really coalesced access
   Slowdowns may be due to DRAM page/TLB misses
 Scalar arrays
   Still needed (see later)
Constant Kernel Parameters
 Kernel parameters
 Lots of parameters due to scalar arrays
 All but one parameter never change their value
 Constant memory
 “Pass” parameters only once
 Copy them into GPU’s constant memory
 Performance implications
 Reduced parameter passing overhead
 Constant memory has hardware cache
Constant Kernel Parameters

__constant__ int nbodiesd;
__constant__ float dthfd, epssqd;
__constant__ float *massd, *posxd, ...;

__global__ void ForceCalcKernel(int step) {
  // rename affected variables (add "d" to name)
}
__global__ void AdvancingKernel() {
  // rename affected variables (add "d" to name)
}

int main(...) {
  . . .
  cudaMemcpyToSymbol(massd, &massl, sizeof(void *)); // etc.
  . . .
  for (timestep = ...) {
    ForceCalcKernel<<<blocks, threads>>>(step);
    AdvancingKernel<<<blocks, threads>>>();
  }
  . . .
}
Constant Mem. Parameter Speedup
 Problem size
   CC 1.3: n=1000, step=10000
   CC 2.0: n=1000, step=10000
   CC 3.0: n=3000, step=10000
 Speedup
   CC 1.3: 1.015
   CC 2.0: 1.016
   CC 3.0: 0.971
 Performance
   Minimal perf. impact
   May be useful for very short kernels that are often invoked
 Benefit
   Less shared memory used on CC 1.3 devices
Using the RSQRTF Instruction
 Slowest kernel operation
 Computing one over the square root is very slow
 GPU has slightly imprecise but fast 1/sqrt instruction
(frequently used in graphics code to calculate inverse
of distance to a point)
 IEEE floating-point accuracy compliance
 CC 1.x is not entirely compliant
 CC 2.x and above are compliant but also offer faster
non-compliant instructions
Using the RSQRT Instruction
for (i = 0; i < nbodies; i++) {
. . .
for (j = 0; j < nbodies; j++) {
if (i != j) {
dx = body[j].posx - px;
dy = body[j].posy - py;
dz = body[j].posz - pz;
dsq = dx*dx + dy*dy + dz*dz;
dinv = rsqrtf(dsq + epssq); // replaces: dinv = 1.0f / sqrtf(dsq + epssq);
scale = body[j].mass * dinv * dinv * dinv;
ax += dx * scale;
ay += dy * scale;
az += dz * scale;
}
}
. . .
}
RSQRT Speedup
 Problem size
   CC 1.3: n=100000, step=1
   CC 2.0: n=100000, step=1
   CC 3.0: n=300000, step=1
 Speedup
   CC 1.3: 0.99
   CC 2.0: 1.83
   CC 3.0: 1.64
 Performance
   No change for CC 1.3
     Compiler automatically uses the less precise RSQRTF as most FP ops
     are not fully precise anyhow
   83% speedup for CC 2.0
     Over entire application
     Compiler defaults to precise instructions
     Explicit use of RSQRTF indicates imprecision is okay
Using 2 Loops to Avoid If Statement
 “if (i != j)” causes thread divergence
 Break loop into two loops to avoid if statement
for (j = 0; j < nbodies; j++) {
if (i != j) {
dx = body[j].posx - px;
dy = body[j].posy - py;
dz = body[j].posz - pz;
dsq = dx*dx + dy*dy + dz*dz;
dinv = rsqrtf(dsq + epssq);
scale = body[j].mass * dinv * dinv * dinv;
ax += dx * scale;
ay += dy * scale;
az += dz * scale;
}
}
Using 2 Loops to Avoid If Statement
for (j = 0; j < i; j++) {
dx = body[j].posx - px;
dy = body[j].posy - py;
dz = body[j].posz - pz;
dsq = dx*dx + dy*dy + dz*dz;
dinv = rsqrtf(dsq + epssq);
scale = body[j].mass * dinv * dinv * dinv;
ax += dx * scale;
ay += dy * scale;
az += dz * scale;
}
for (j = i+1; j < nbodies; j++) {
dx = body[j].posx - px;
dy = body[j].posy - py;
dz = body[j].posz - pz;
dsq = dx*dx + dy*dy + dz*dz;
dinv = rsqrtf(dsq + epssq);
scale = body[j].mass * dinv * dinv * dinv;
ax += dx * scale;
ay += dy * scale;
az += dz * scale;
}
Loop Duplication Speedup
 Problem size
   CC 1.3: n=100000, step=1
   CC 2.0: n=100000, step=1
   CC 3.0: n=300000, step=1
 Speedup
   CC 1.3: 0.55
   CC 2.0: 1.00
   CC 3.0: 1.00
 Performance
   No change for CC 2.0 & 3.0
     Divergence moved to loop
   45% slowdown for CC 1.3
     Unclear why
 Discussion
   Not a useful optimization
   Code bloat
   A little divergence is okay (only 1 in 3125 iterations)
Blocking using Shared Memory
 Code is memory bound
 Each warp streams in all bodies’ masses and positions
 Block inner loop
 Read block of mass & position info into shared mem
 Requires barriers (fast hardware barrier within SM)
 Advantage
 A lot fewer main memory accesses
 Remaining main memory accesses are fully coalesced
(due to usage of scalar arrays)
Blocking using Shared Memory

__shared__ float posxs[THREADS], posys[THREADS], poszs[THREADS], masss[THREADS];
j = 0;
for (j1 = 0; j1 < nbodiesd; j1 += THREADS) { // first part of loop
  idx = tid + j1;
  if (idx < nbodiesd) { // each thread copies 4 words (fully coalesced)
    posxs[tid] = posxd[idx]; posys[tid] = posyd[idx];
    poszs[tid] = poszd[idx]; masss[tid] = massd[idx];
  }
  __syncthreads(); // wait for all copying to be done
  bound = min(nbodiesd - j1, THREADS);
  for (j2 = 0; j2 < bound; j2++, j++) { // second part of loop
    if (i != j) {
      dx = posxs[j2] - px; dy = posys[j2] - py; dz = poszs[j2] - pz;
      dsq = dx*dx + dy*dy + dz*dz;
      dinv = rsqrtf(dsq + epssqd);
      scale = masss[j2] * dinv * dinv * dinv;
      ax += dx * scale; ay += dy * scale; az += dz * scale;
    }
  }
  __syncthreads(); // wait for all force calculations to be done
}
Blocking Speedup
 Problem size
   CC 1.3: n=100000, step=1
   CC 2.0: n=100000, step=1
   CC 3.0: n=300000, step=1
 Speedup
   CC 1.3: 3.7
   CC 2.0: 1.1
   CC 3.0: 1.6
 Performance
   Great speedup for CC 1.3
   Some speedup for the others
     They have a hardware data cache
 Discussion
   Very important optimization for memory-bound code
   Even with an L1 cache
Loop Unrolling
 CUDA compiler
 Generally good at unrolling loops with fixed bounds
 Does not unroll inner loop of our example code
 Use pragma to unroll (and pad arrays)
#pragma unroll 8
for (j2 = 0; j2 < bound; j2++, j++) {
if (i != j) {
dx = posxs[j2] - px; dy = posys[j2] - py; dz = poszs[j2] - pz;
dsq = dx*dx + dy*dy + dz*dz;
dinv = rsqrtf(dsq + epssqd);
scale = masss[j2] * dinv * dinv * dinv;
ax += dx * scale; ay += dy * scale; az += dz * scale;
}
}
Loop Unrolling Speedup
 Problem size
   CC 1.3: n=100000, step=1
   CC 2.0: n=100000, step=1
   CC 3.0: n=300000, step=1
 Speedup
   CC 1.3: 1.07
   CC 2.0: 1.16
   CC 3.0: 1.07
 Performance
   Noticeable speedup on all three GPUs
 Discussion
   Can be useful
   May increase register usage, which may lower the maximum number of
   threads per block and result in slowdown
CC 2.0 Absolute Performance
 Problem size
   n=100000, step=1
 Runtime
   612 ms
 FP operations
   326.7 GFlop/s
 Main mem throughput
   1.035 GB/s
 Not peak performance
   Only 32% of 1030 GFlop/s
     Peak assumes an FMA every cycle
   3 sub (1c), 3 fma (1c), 1 rsqrt (8c), 3 mul (1c), 3 fma (1c) = 20c for 20 Flop
   63% of realistic peak of 515.2 GFlop/s
     Assumes no non-FP operations
   With int ops = 31c for 20 Flop
   99% of actual peak of 330.45 GFlop/s
Eliminating the If Statement
 Algorithmic optimization
   Potential softening parameter avoids division by zero
   If statement is not necessary and can be removed
     Eliminates thread divergence

for (j2 = 0; j2 < bound; j2++, j++) {
  // "if (i != j)" removed: epssqd keeps the i == j term harmless
  dx = posxs[j2] - px; dy = posys[j2] - py; dz = poszs[j2] - pz;
  dsq = dx*dx + dy*dy + dz*dz;
  dinv = rsqrtf(dsq + epssqd);
  scale = masss[j2] * dinv * dinv * dinv;
  ax += dx * scale; ay += dy * scale; az += dz * scale;
}
If Elimination Speedup
 Problem size
   CC 1.3: n=100000, step=1
   CC 2.0: n=100000, step=1
   CC 3.0: n=300000, step=1
 Speedup
   CC 1.3: 1.38
   CC 2.0: 1.54
   CC 3.0: 1.64
 Performance
   Large speedup on all three GPUs
 Discussion
   No thread divergence
   Allows the compiler to schedule the code much better
Rearranging Terms
 Generated code is suboptimal
   Compiler does not emit as many fused multiply-add (FMA) instructions
   as it could
   Rearrange terms in expressions to help the compiler
     Need to check the generated assembly code

for (j2 = 0; j2 < bound; j2++, j++) {
  dx = posxs[j2] - px; dy = posys[j2] - py; dz = poszs[j2] - pz;
  dsq = dx*dx + (dy*dy + (dz*dz + epssqd)); // replaces: dsq = dx*dx + dy*dy + dz*dz;
  dinv = rsqrtf(dsq);                       // replaces: dinv = rsqrtf(dsq + epssqd);
  scale = masss[j2] * dinv * dinv * dinv;
  ax += dx * scale; ay += dy * scale; az += dz * scale;
}
FMA Speedup
 Problem size
   CC 1.3: n=100000, step=1
   CC 2.0: n=100000, step=1
   CC 3.0: n=300000, step=1
 Speedup
   CC 1.3: 1.03
   CC 2.0: 1.05
   CC 3.0: 1.06
 Performance
   Small speedup on all three GPUs
 Discussion
   Seemingly needless transformations can make a difference
Higher Unroll Factor
 Problem size
   CC 1.3: n=100000, step=1
   CC 2.0: n=100000, step=1
   CC 3.0: n=300000, step=1
 Unroll 128 times
   Avoid looping overhead
   Now that there are no ifs
 Speedup
   CC 1.3: 1.01
   CC 2.0: 1.04
   CC 3.0: 0.93
 Performance
   Little speedup/slowdown
 Discussion
   Carefully choose unroll factor (manually tune)
Compiler Flags
 Problem size
   CC 1.3: n=100000, step=1
   CC 2.0: n=100000, step=1
   CC 3.0: n=300000, step=1
 -use_fast_math
   "-ftz=true" suffices (flush denormals to zero)
   Makes SP FP operations faster except on CC 1.3
 Speedup
   CC 1.3: 1.00
   CC 2.0: 1.18
   CC 3.0: 1.15
 Performance
   Significant speedup
 Discussion
   Use faster but less precise operations when prudent
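As an illustration, the flags could be applied on the nvcc command line like this (the source file name nbody.cu is hypothetical; -O3, -arch=sm_20, and -ftz=true also appear in the evaluation setup later in this tutorial):

nvcc -O3 -arch=sm_20 -ftz=true nbody.cu -o nbody
nvcc -O3 -arch=sm_20 -use_fast_math nbody.cu -o nbody   (faster, less precise)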
Final Absolute Performance
 CC 2.0 Fermi GTX 480
   Problem size: n=100000, step=1
   Runtime: 296.1 ms
   FP operations
     675.6 GFlop/s (SP)
     66% of peak performance
     261.1 GFlop/s (DP)
   Main mem throughput: 2.139 GB/s
 CC 3.0 Kepler GTX 680
   Problem size: n=300000, step=1
   Runtime: 1073 ms
   FP operations
     1677.6 GFlop/s (SP)
     54% of peak performance
     88.7 GFlop/s (DP)
   Main mem throughput: 5.266 GB/s
Outline
 Introduction
 GPU programming
 N-body example
 Porting and tuning
 Other considerations
 Conclusions
Things to Consider
 Minimize PCIe transfers
 Implementing entire algorithm on GPU, even some
slow serial code sections, might be overall win
 Can stream data to/from GPU while computing
 Locks and synchronization
 Lightweight locks & fast barriers possible within SM
 Slow across different SMs
 L1 data caches are not coherent
 Use volatile & fences to avoid deadlocks
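A hedged sketch (not from the slides) of the volatile-plus-fence pattern mentioned above for handing a flag between blocks through global memory, since the L1 caches are not coherent across SMs. The kernel and variable names are illustrative, and the sketch assumes both blocks are resident at the same time.

__device__ volatile int flagd = 0;   // volatile: always read/write memory

__global__ void HandoffSketch(volatile float *datad) {
  if (blockIdx.x == 0 && threadIdx.x == 0) {
    datad[0] = 42.0f;            // produce the data
    __threadfence();             // make it visible device-wide before the flag
    flagd = 1;                   // then publish the flag
  } else if (blockIdx.x == 1 && threadIdx.x == 0) {
    while (flagd == 0) ;         // spin until the flag is published
    datad[1] = datad[0] + 1.0f;  // safe to consume after the fence/flag pair
  }
}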
Warp-based Execution
 Problem
   Thread divergence: loop-exiting threads wait for the other threads in
   the warp to also exit
   In the first version, "ready[k] = 1" is not executed until all threads
   in the warp are done with the loop
   Possible deadlock

// wrong on GPU, correct on CPU
do {
  cnt = 0;
  if (ready[i] != 0) cnt++;
  if (ready[j] != 0) cnt++;
} while (cnt < 2);
ready[k] = 1;

// correct
do {
  cnt = 0;
  if (ready[i] != 0) cnt++;
  if (ready[j] != 0) cnt++;
  if (cnt == 2) ready[k] = 1;
} while (cnt < 2);
Hybrid Execution
 CPU always needed for program launch and I/O
 CPU much faster on serial program segments
 GPU 10 times faster than CPU on parallel code
   Running 10% of the problem on the CPU is hardly worthwhile
   Complicates programming and requires data transfer
     Best CPU data structure is often not best for GPU
 PCIe bandwidth much lower than GPU bandwidth
   1.6 to 6.5 GB/s versus 192 GB/s
   But can send data while CPU & GPU are computing
 Merging CPU and GPU on same die (e.g., AMD's Fusion APU) makes
 finer-grain switching possible
Outline
 Introduction
 GPU programming
 N-body example
 Porting and tuning
 Other considerations
 Conclusions
Summary and Conclusions (Part I)
 Step-by-step porting and tuning of CUDA code
 Example: n-body simulation
 GPUs have very powerful hardware
 But only exploitable with some codes
 Even harder to program and optimize than CPU hardware
Parallelizing and Optimizing Programs for
GPU Acceleration using CUDA (Part II)
Martin Burtscher
Department of Computer Science
Mapping Regular Code to GPUs
 Regular codes
   Operate on array- and matrix-based data structures
   Exhibit mostly strided memory access patterns
   Have relatively predictable control flow (control flow behavior is
   mainly determined by input size)
   Largely independent computations
 Many regular codes are easy to port to GPUs
   E.g., matrix codes executing many ops/word
     Dense matrix operations (level 2 and 3 BLAS)
     Stencil codes (PDE solvers)
Mapping Irregular Code to GPUs
 Irregular codes
   Build, traverse, and update dynamic data structures
   (trees, graphs, linked lists, priority queues, etc.)
   Exhibit pointer-chasing memory access patterns
   Have complex control flow (control flow behavior depends on input
   values and changes dynamically)
 Many important scientific programs are irregular
   E.g., n-body simulation, data clustering, SAT solving, social networks,
   discrete-event simulation, meshing, ...
 Need case studies on how to best map irregular codes
Example: N-body Simulation
 Irregular Barnes Hut algorithm
 Repeatedly builds unbalanced tree and performs
complex traversals on it
 Our implementation
 Designed for GPUs (not just port of CPU code)
 First GPU implementation of entire BH algorithm
 Results
 GPU is 21 times faster than CPU (6 cores) on this code
Outline
 Introduction
 Barnes Hut algorithm
 CUDA implementation
 Experimental results
 Conclusions
Barnes Hut Idea
 Precise force calculation
   Requires O(n^2) operations (O(n^2) body pairs)
   Computationally intractable for large n
 Barnes and Hut (1986)
   Algorithm to approximately compute forces
     Bodies' initial position & velocity are also approximate
   Requires only O(n log n) operations
   Idea is to "combine" far away bodies
   Error should be small because force ∝ 1/distance^2
Barnes Hut Algorithm
 Set bodies' initial position and velocity
 Iterate over time steps
   1. Compute bounding box around bodies
   2. Subdivide space until at most one body per cell
      Record this spatial hierarchy in an octree
   3. Compute mass and center of mass of each cell
   4. Compute force on bodies by traversing octree
      Stop traversal path when encountering a leaf (body) or an internal
      node (cell) that is far enough away
   5. Update each body's position and velocity
Build Tree (Level 1)
 (Figure: 2D illustration of the bodies and the root cell)
 Compute bounding box around all bodies → tree root
Build Tree (Level 2)
 (Figure: the root cell split into subcells)
 Subdivide space until at most one body per cell
Build Tree (Level 3)
 (Figure: crowded subcells split again)
 Subdivide space until at most one body per cell
Build Tree (Level 4)
 (Figure: further subdivision of crowded subcells)
 Subdivide space until at most one body per cell
Build Tree (Level 5)
 (Figure: final subdivision; every cell now holds at most one body)
 Subdivide space until at most one body per cell
Compute Cells' Center of Mass
 (Figure: two cells highlighted with their centers of mass)
 For each internal cell, compute sum of mass and weighted average of
 position of all bodies in subtree; example shows two cells only
Compute Forces
 (Figure: the tree with one body highlighted in green)
 Compute force, for example, acting upon green body
Compute Force (short distance)
 (Figure: nearby bodies are visited individually)
 Scan tree depth first from left to right; green portion already completed
Compute Force (down one level)
 (Figure: a cell whose center of mass is too close)
 Red center of mass is too close, need to go down one level
Compute Force (long distance)
 (Figure: a distant cell whose center of mass suffices)
 Blue center of mass is far enough away
Compute Force (skip subtree)
 (Figure: the distant cell's subtree is not visited)
 Therefore, entire subtree rooted in the blue cell can be skipped
Pseudocode

bodySet = ...
foreach timestep do {
  bounding_box = new Bounding_Box();
  foreach Body b in bodySet {
    bounding_box.include(b);
  }
  octree = new Octree(bounding_box);
  foreach Body b in bodySet {
    octree.Insert(b);
  }
  cellList = octree.CellsByLevel();
  foreach Cell c in cellList {
    c.Summarize();
  }
  foreach Body b in bodySet {
    b.ComputeForce(octree);
  }
  foreach Body b in bodySet {
    b.Advance();
  }
}
Complexity and Parallelism

bodySet = ...
foreach timestep do {                  // O(n log n) + ordered sequential
  bounding_box = new Bounding_Box();
  foreach Body b in bodySet {          // O(n) parallel reduction
    bounding_box.include(b);
  }
  octree = new Octree(bounding_box);
  foreach Body b in bodySet {          // O(n log n) top-down tree building
    octree.Insert(b);
  }
  cellList = octree.CellsByLevel();
  foreach Cell c in cellList {         // O(n) + ordered bottom-up traversal
    c.Summarize();
  }
  foreach Body b in bodySet {          // O(n log n) fully parallel
    b.ComputeForce(octree);
  }
  foreach Body b in bodySet {          // O(n) fully parallel
    b.Advance();
  }
}
Outline
 Introduction
 Barnes Hut algorithm
 CUDA implementation
 Experimental results
 Conclusions
Efficient GPU Code
 Large amounts of data parallelism
 Coalesced main memory accesses
 Little thread divergence
 Relatively little synchronization between blocks
 Little CPU/GPU data transfer
 Efficient use of shared memory
Main BH Implementation Challenges
 Uses irregular tree-based data structure
   Initially little parallelism
   Little coalescing
   Load imbalance
 Complex recursive traversals
   Recursion not well supported
   Lots of thread divergence
 Memory-bound pointer-chasing operations
   Not enough computation to hide latency
Six GPU Kernels

Read initial data and transfer to GPU
for each timestep do {
  1. Compute bounding box around bodies (not irregular)
  2. Build hierarchical decomposition, i.e., octree
  3. Summarize body information in internal octree nodes
  4. Approximately sort bodies by spatial location (optional)
  5. Compute forces acting on each body with help of octree
  6. Update body positions and velocities (not irregular)
}
Transfer result from GPU and output
Global Optimizations
 Make code iterative (recursion not supported*)
 Keep data on GPU between kernel calls
 Use array elements instead of heap nodes
   One aligned array per field for coalesced accesses
 (Figure: objects on the heap vs. objects in an array vs. fields in
 separate arrays)
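A hedged sketch (not the actual data structure) of the two layouts contrasted above, for an octree node. The struct layout and the field-array names are assumptions for illustration, although the massd/posxd naming convention follows the kernel listings earlier in this tutorial.

// Pointer-based heap node (typical CPU style): one allocation per node,
// fields of one node adjacent, pointer chasing to reach children.
struct TreeNode {
  float mass, posx, posy, posz;
  struct TreeNode *child[8];
};

// GPU style: one statically allocated, aligned array per field; a node is
// just an index, so the same field of neighboring nodes is adjacent (good
// for coalescing) and "pointers" become integer indices.
#define MAXNODES (1 << 20)   // illustrative capacity
__device__ float massd[MAXNODES];
__device__ float posxd[MAXNODES], posyd[MAXNODES], poszd[MAXNODES];
__device__ int   childd[MAXNODES * 8];   // 8 child indices per node, -1 = empty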
Global Optimizations (cont.)
 Maximize thread count (round down to warp size)
 Maximize resident block count (all SMs filled)
 Pass kernel parameters through constant memory
 Use special allocation order
 Alias arrays (56 B/node)
 Use index arithmetic
 Persistent blocks & threads
 Unroll loops over children
 (Figure: bodies b0-ba occupy fixed positions at the front of the node
 arrays while cells c0-c5 are allocated from the back, growing toward
 the bodies)
Kernel 1: Bounding Box (Regular)
 Reduction operation
 Optimizations
   Fully coalesced
   Fully cached
   No bank conflicts
   Minimal divergence
   Built-in min and max
   2 red/mem, 6 red/bar
   Bodies load balanced
   512*3 threads per SM
 (Figure: warps reduce main-memory data into shared memory over several
 barrier-separated rounds)
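A hedged sketch (not the actual kernel) of the shared-memory reduction pattern described above, reduced to one dimension: each block computes the minimum x coordinate of its chunk and writes one value per block; a second, single-block launch (or the CPU) would combine the per-block results. The real kernel reduces min and max of x, y, and z together and merges the per-block results on the GPU itself.

#define THREADS 512

__global__ void MinX(int n, const float *posxd, float *blockmind) {
  __shared__ float smin[THREADS];
  int i = threadIdx.x + blockIdx.x * blockDim.x;
  smin[threadIdx.x] = (i < n) ? posxd[i] : posxd[0];  // pad with a valid value
  __syncthreads();
  // tree reduction in shared memory: log2(THREADS) barrier-separated steps
  for (int s = THREADS / 2; s > 0; s >>= 1) {
    if (threadIdx.x < s)
      smin[threadIdx.x] = fminf(smin[threadIdx.x], smin[threadIdx.x + s]);
    __syncthreads();
  }
  if (threadIdx.x == 0) blockmind[blockIdx.x] = smin[0];  // one result per block
}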
Kernel 2: Build Octree (Irregular)
 Top-down tree building
 Optimizations
   Only lock leaf "pointers"
   Lock-free fast path
   Light-weight lock release
   No re-traverse after lock acquire failure
   Combined memory fence
   Re-compute position during traversal
   Separate init kernels
   512*3 threads per SM
Kernel 2: Build Octree (cont.)

// initialize
cell = find_insertion_point(body); // no locks, cache cell
child = get_insertion_index(cell, body);
if (child != locked) { // skip atomic if already locked
  if (child == null) { // fast path (frequent)
    if (null == atomicCAS(&cell[child], null, body)) { // lock-free insertion
      // move on to next body
    }
  } else {
    if (child == atomicCAS(&cell[child], child, lock)) { // acquire lock
      // build subtree with new and existing body
      flag = true;
    }
  }
}
__syncthreads(); // optional barrier
__threadfence(); // make data visible
if (flag) {
  cell[child] = new_subtree; // insert subtree and release lock
  // move on to next body
}
Kernel 3: Summarize Subtrees (Irreg.)
 Bottom-up tree traversal
 Optimizations
   Scan avoids deadlock
   Use mass as flag + fence
   No locks, no atomics
   Use wait-free first pass
   Cache the ready info
   Piggyback on traversal
     Count bodies in subtrees
   No parent "pointers"
   128*6 threads per SM
 (Figure: cells are scanned opposite to their allocation order, so children
 are reached before their parents)
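A hedged sketch (not the actual kernel) of the "mass as ready flag" idea listed above. The array layout is an assumption for this sketch: childd holds 8 child indices per cell (-1 = empty slot), cell masses start at -1.0f meaning "not summarized yet", and body masses are >= 0. Unready cells are simply left for another pass; the real code instead keeps retrying within one launch and also tracks body counts and the y/z positions.

__global__ void SummarizeOnePass(int firstcell, int ncells, const int *childd,
                                 volatile float *massd, volatile float *posxd) {
  for (int cell = firstcell + threadIdx.x + blockIdx.x * blockDim.x;
       cell < firstcell + ncells; cell += gridDim.x * blockDim.x) {
    if (massd[cell] >= 0.0f) continue;           // already summarized
    float m = 0.0f, px = 0.0f;
    bool ready = true;
    for (int i = 0; i < 8; i++) {
      int ch = childd[cell * 8 + i];
      if (ch >= 0) {
        float cm = massd[ch];                    // volatile read, not cached in L1
        if (cm < 0.0f) { ready = false; break; } // child not summarized yet
        m += cm;
        px += posxd[ch] * cm;
      }
    }
    if (ready) {
      posxd[cell] = px / m;                      // write center of mass first
      __threadfence();                           // make it visible to other SMs
      massd[cell] = m;                           // setting mass marks the cell ready
    }
  }
}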
Kernel 4: Sort Bodies (Irregular)
 Top-down tree traversal
 Optimizations (similar to Kernel 3)
   Scan avoids deadlock
   Use data field as flag
   No locks, no atomics
   Use counts from Kernel 3
   Piggyback on traversal
     Move nulls to back
   Throttle warps with optional barrier
   64*6 threads per SM
 (Figure: allocation order vs. scan direction, as in Kernel 3)
Kernel 5: Force Calculation (Irregular)
 Multiple prefix traversals
 Optimizations
   Group similar work together
     Uses sorting to minimize size of prefix union in each warp
   Early out (nulls in back)
   Traverse whole union to avoid divergence (warp voting)
   Lane 0 controls iteration stack for entire warp (fits in shmem)
   Minimize volatile accesses
   Use fast 1/sqrtf instruction
   Cache tree-level-based data
   256*5 threads per SM
Architectural Support
 Coalesced memory accesses & lockstep execution
 All threads in warp read same tree node at same time
 Only one mem access per warp instead of 32 accesses
 Warp-based execution
 Enables data sharing in warps w/o synchronization
 RSQRTF instruction
 Quickly computes good approximation of 1/sqrtf(x)
 Warp voting instructions
 Quickly perform reduction operations within a warp
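A hedged sketch (not the actual force kernel) of the warp-voting idea listed above: __all() is nonzero only if the predicate holds for every active thread of the warp, so the whole warp can agree to approximate a cell (rather than descend into it) without any extra synchronization. Here dsq is the squared distance from this thread's body to the cell and cutoffsq the opening criterion; both names are illustrative.

__device__ bool WarpApprovesApproximation(float dsq, float cutoffsq) {
  return __all(dsq >= cutoffsq) != 0;   // unanimous vote across the warp
}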
Kernel 6: Advance Bodies (Regular)
 Straightforward streaming
 Optimizations
   Fully coalesced, no divergence
   Load balanced, 1024*1 threads per SM
 (Figure: warps stream body data from main memory, update it, and write
 it back)
Outline
 Introduction
 Barnes Hut algorithm
 CUDA implementation
 Experimental results
 Conclusions
 (Figure: runtime per timestep versus number of bodies, 10,000 to
 10,000,000, for CPUbh, GPUbh, and GPUsq)
Evaluation Methodology
 Implementations
   CUDA/GPU: Barnes Hut and O(n^2) algorithms
   OpenMP/CPU: Barnes Hut algorithm (derived from CUDA)
   Pthreads/CPU: Barnes Hut algorithm (SPLASH-2 suite)
 Systems and compilers
   nvcc 4.0 (-O3 -arch=sm_20 -ftz=true*)
   GeForce GTX 480, 1.4 GHz, 15 SMs, 32 cores per SM
   gcc 4.1.2 (-O3 -fopenmp* -ffast-math*)
   Xeon X5690, 3.46 GHz, 6 cores, 2 threads per core
 Inputs and metric
   5k, 50k, 500k, and 5M star clusters (Plummer model)
   Best runtime of three experiments, excluding I/O
Nodes Touched per Activity (5M Input)
 Kernel "activities"
   K1: pair reduction
   K2: tree insertion
   K3: bottom-up step
   K4: top-down step
   K5: prefix traversal
   K6: integration step

 Neighborhood size     min      avg      max
   kernel 1              1      2.0        2
   kernel 2              2     13.2       22
   kernel 3              2      4.1        9
   kernel 4              2      4.1        9
   kernel 5            818  4,117.0    6,315
   kernel 6              1      1.0        1

 Max tree depth ≤ 22
 Cells have 3.1 children
 Prefix ≤ 6,315 nodes (≤ 0.1% of 7.4 million)
 BH algorithm & sorting to minimize the union work well
Available Amorphous Data Parallelism
 Almost every "round" has lots of activities without data dependencies
 that can be processed in parallel
 (Figure: available parallelism per kernel and round for the 5,000, 50,000,
 500,000, and 5,000,000 body inputs, covering bounding box, tree building,
 summarization, sorting, force calculation, and integration)
Runtime Comparison
 GPU BH inefficiency
   5k input too small for 5,760 to 23,040 threads
 BH vs. O(n^2) algorithm
   O(n^2) faster with fewer than about 15k bodies
 GPU (5M input)
   21.1x faster than OpenMP
   23.2x faster than Pthreads
 (Figure: runtime per timestep [ms] versus number of bodies, 5,000 to
 5,000,000, for GPU CUDA, GPU O(n^2), CPU Pthreads, and CPU OpenMP)
Kernel Performance for 5M Input
 $200 GPU delivers 228 GFlop/s on irregular code

               kernel 1  kernel 2  kernel 3  kernel 4  kernel 5  kernel 6  BarnesHut     O(n^2)
 GFlop/s           71.6       5.8       2.5       n/a     240.6      33.5      228.4      897.0
 GB/s             142.9      26.8      10.6      12.8       8.0     133.9        8.8        2.8
 runtime [ms]       0.4      44.6      28.0      14.2    1641.2       2.2     1730.6   557421.5

 GPU chip is 2.7 to 23.5 times faster than CPU chip

 Runtime [ms], non-compliant fast single-precision version:
               kernel 1  kernel 2  kernel 3  kernel 4  kernel 5  kernel 6
 X5690 CPU          5.5     185.7      75.8      52.1  38,540.3      16.4
 GTX 480 GPU        0.4      44.6      28.0      14.2   1,641.2       2.2
 CPU/GPU           13.1       4.2       2.7       3.7      23.5       7.3

 Runtime [ms], IEEE 754-compliant double-precision version:
               kernel 1  kernel 2  kernel 3  kernel 4  kernel 5  kernel 6
 X5690 CPU         10.3     193.1     101.0      51.6  47,706.4      33.1
 GTX 480 GPU        0.8      46.7      31.0      14.2   5,177.1       4.2
 CPU/GPU           12.7       4.1       3.3       3.6       9.2       7.9

 GPU hardware is better suited for BH than CPU hardware
   But difficult and very time consuming to program
Kernel Speedups
 Optimizations that are generally applicable (speedup by input size)

            avoid     rsqrtf   recalc.   thread    full
            volatile  instr.   data      voting    multithreading
 50,000     1.14x     1.43x    0.99x     2.04x     20.80x
 500,000    1.19x     1.47x    1.32x     2.49x     27.99x
 5,000,000  1.18x     1.46x    1.69x     2.47x     28.85x

 Optimizations for irregular kernels (speedup by input size)

            throttling  wait-free  combined   sorting of  sync'ed
            barrier     pre-pass   mem fence  bodies      execution
 50,000     0.97x       1.02x      1.54x      3.60x       6.23x
 500,000    1.03x       1.21x      1.57x      6.28x       8.04x
 5,000,000  1.04x       1.31x      1.50x      8.21x       8.60x
Outline
 Introduction
 Barnes Hut algorithm
 CUDA implementation
 Experimental results
 Conclusions
Optimization Summary
 Reduce main memory accesses
 Share data within warp, combine memory fences &
traversals, re-compute data, avoid volatile accesses
 Minimize thread divergence
 Group similar work together, force synchronicity
 Implement entire algorithm on and for GPU
 Avoid data transfers & data structure inefficiencies,
wait-free pre-pass, scan entire prefix union
Optimization Summary (cont.)
 Exploit hardware features
 Fast synchronization & thread startup, special instrs.,
coalesced memory accesses, even lockstep execution
 Use light-weight locking and synchronization
 Minimize locks, reuse fields, and use fence + store ops
 Maximize parallelism
 Parallelize every step within and across SMs
CPU/GPU Implementation Comparison
 Irregular CPU code
   Dynamically (incrementally) allocated shared data structures
   Structure-based shared data structures
   Logical lock-based implementation
   Global/local worklists
   Recursive or iterative implementation
 Irregular GPU code
   Statically (wholly) allocated shared data structures
   Multiple-array-based shared data structures
   Lock-free implementation
   (Implicit) local worklists
   Iterative implementation
Useful GPU Hardware Features
 Wide parallelism
   Great for exploiting large amounts of parallelism
 Massive multithreading
   Ideal for hiding latency of irregular mem. accesses
 Fast thread startup
   Essential when launching thousands of threads
 Shared memory
   Fast data sharing
   Useful for local worklists
 HW support for reduction and synchronization
   Makes otherwise costly operations very fast
 Coalesced accesses
   Memory access combining is useful in irregular codes
 Lockstep execution
   Can share data without explicit synchronization
   Allows consolidating iteration stacks
Challenges with GPUs
 Warp-based execution
   Often requires sorting of work or algorithm change
 Data structure layout
   Best layout for CPU differs from best layout for GPU
   SoA can be tedious to code and deal with (parameter passing)
 Separate memory space
   Slow transfers
   Pack/unpack data
 Incoherent L1 caches
   May need to explicitly manage data (fences)
 Poor recursion support
   Need to make code iterative and maintain explicit iteration stacks
 Thread and block counts
   Hierarchy complicates implementation
   Optimal counts have to be (auto-)tuned
Running Irregular Algorithms on GPUs
 Mandatory
   Need vast amounts of data parallelism
   Can do large chunks of computation on GPU
 Very Important
   Cautious implementation
   Easy to express iteratively
   Has statically known range of neighborhoods
 Important
   Scheduling is independent of previous activities
   Easy to sort activities by similarity (if needed)
 Beneficial
   DS can be expressed through fixed arrays
   Uses local worklists that can be statically populated
   DS size (or bound) can be determined based on input
Conclusions
 Irregularity does not necessarily prevent high-
performance on GPUs
 Entire Barnes Hut algorithm implemented on GPU
 Builds and traverses unbalanced octree
 GPU is 21.1 times (float) and 9.1 times (double)
faster than high-end 6-core Xeon
 Code directly for GPU, do not merely adjust CPU code
 Requires different data and code structures
 Benefits from different algorithmic modifications
Acknowledgments
 Hardware
 NVIDIA Corp. and Intel Corp.
 Funding
 NVIDIA Corp. and Texas State University
 OpenMP code
 Ricardo Alves (Universidade do Minho, Portugal)
 Collaborator
 Keshav Pingali (University of Texas at Austin)
CUDA Optimization Tutorial
 Martin Burtscher
 burtscher@txstate.edu
 http://www.cs.txstate.edu/~burtscher/
 Barnes Hut CUDA code
 http://www.gpucomputing.net/?q=node/1314
 Tutorial slides
 http://www.cs.txstate.edu/~burtscher/tutorials/COT5/slides.pptx