Parallelizing and Optimizing Programs for GPU Acceleration using CUDA
Martin Burtscher, Department of Computer Science
Contact: burtscher@txstate.edu, http://www.cs.txstate.edu/~burtscher/
Tutorial slides: http://www.cs.txstate.edu/~burtscher/tutorials/COT5/slides.pptx

High-end CPU-GPU Comparison (both released March 2012)
                         Xeon E5-2687W      Kepler GTX 680
  Cores                  8 (superscalar)    1536 (simple)
  Active threads         2 per core         ~11 per core
  Frequency              3.1 GHz            1.0 GHz
  Peak performance (SP)  397 GFlop/s        3090 GFlop/s
  Peak mem. bandwidth    51 GB/s            192 GB/s
  Maximum power          150 W              195 W*
  Price                  $1900              $500*
  *entire card

GPU Advantages
Performance: 8x as many operations executed per second.
Main memory bandwidth: 4x as many bytes transferred per second.
Cost-, energy-, and size-efficiency: 29x as much performance per dollar, 6x as much performance per watt, 11x as much performance per area (based on peak values).

GPU Disadvantages
Clearly, we should be using GPUs all the time. So why aren't we?
GPUs can only execute some types of code fast: they need lots of data parallelism, data reuse, and regularity.
GPUs are harder to program and tune than CPUs, in part because of poor tool support, in part because of their architecture, and in part because of poor support for irregular codes.

Outline (Part I)
Introduction, GPU programming, N-body example, Porting and tuning, Other considerations, Conclusions

CUDA Programming Model
Non-graphics programming: uses the GPU as a massively parallel co-processor attached to the CPU over the PCI-Express bus.
C/C++ with extensions: function launch (calling functions on the GPU), memory management (GPU memory allocation, copying data to/from the GPU), declaration qualifiers (device, shared, local, etc.), special instructions (barriers, fences, etc.), and keywords (threadIdx, blockIdx).
SIMT (single-instruction multiple-threads) model: thousands of threads are needed for full efficiency.

Calling GPU Kernels
Kernels are functions that run on the GPU, callable by CPU code; the CPU can continue processing while the GPU runs the kernel.
  KernelName<<<m, n>>>(arg1, arg2, ...);
Launch configuration (programmer selectable): the GPU spawns m blocks with n threads each (i.e., m*n threads total) that run a copy of the same function.
Normal function parameters are passed conventionally, but the GPU uses a different address space, so CPU pointers should never be passed.

GPU Architecture
GPUs consist of Streaming Multiprocessors (SMs): 1 to 30 SMs per chip (run blocks).
SMs contain Processing Elements (PEs): 8, 32, or 192 PEs per SM (run threads).
(Figure adapted from NVIDIA: several SMs, each with its own shared memory, all connected to global memory.)

Block Scalability
Hardware can assign blocks to SMs in any order; a kernel with enough blocks scales across GPUs.
Not all blocks may be resident at the same time.
(Figure adapted from NVIDIA: the same 8-block kernel runs 2 blocks at a time on a 2-SM GPU and 4 at a time on a 4-SM GPU.)

GPU Memories
(Figure adapted from NVIDIA: per-block shared memory and per-thread registers on the GPU; global, local, and constant memory in DRAM.)
GPU memory is separate from CPU memory.
The CPU can access the GPU's global & constant mem.
via PCIe bus Requires slow explicit transfer Block (1, 0) Shared Memory (SRAM) Registers Registers Shared Memory (SRAM) Registers Registers Visible GPU memory types Thread (0, 0) Thread (1, 0) Registers (per thread) Local mem. (per thread) Shared mem. (per block) Software-controlled cache Global mem. (per kernel) Constant mem. (read only) CUDA Optimization Tutorial C P U Thread (0, 0) Thread (1, 0) Global + Local Memory (DRAM) Constant Memory (DRAM, cached) Adapted from NVIDIA Slow communic. between blocks 11 SM Internals (Fermi and Kepler) Caches Software-controlled shared memory Hardware-controlled incoherent L1 data cache 64 kB combined size, can be split 16/48, 32/32, 48/16 Synchronization support Fast hardware barrier within block (__syncthreads()) Fence instructions: memory consistency & coherency Special operations Thread voting (warp-based reduction operations) CUDA Optimization Tutorial 12 Block and Thread Allocation Limits Blocks assigned to SMs Until first limit reached Threads assigned to PEs t0 t1 t2 … tm SM 0 SM 1 MT IU MT IU PE PE Blocks t0 t1 t2 … tm Shared Memory 8/16 active blocks/SM 1024, 1536, or 2048 Blocks Shared Memory Hardware limits resident threads/SM 512 or 1024 threads/blk 16k, 32k, or 64k regs/SM 16 kB or 48 kB shared memory per SM 216-1 or 231-1 blks/kernel Adapted from NVIDIA CUDA Optimization Tutorial 13 Warp-based Execution 32 contiguous threads form a warp Execute same instruction in same cycle (or disabled) Warps are scheduled out-of-order with respect to each other to hide latencies Thread divergence Some threads in warp jump to different PC than others Hardware runs subsets of warp until they re-converge Results in reduction of parallelism (performance loss) CUDA Optimization Tutorial 14 Thread Divergence Non-divergent code if (threadID >= 32) { some_code; } else { other_code; } Thread ID: 0123… 31 Divergent code if (threadID >= 13) { some_code; } else { other_code; } Thread ID: 0123… 31 disabled disabled Adapted from NVIDIA CUDA Optimization Tutorial Adapted from NVIDIA 15 Parallel Memory Accesses Coalesced main memory access (16/32x faster) Under some conditions, HW combines multiple (half) warp memory accesses into a single coalesced access CC 1.3: 64-byte aligned 64-byte line (any permutation) CC 2.x+3.0: 128-byte aligned 128-byte line (cached) Bank-conflict-free shared memory access (16/32) No superword alignment or contiguity requirements CC 1.3: 16 different banks per half warp or same word CC 2.x+3.0 : 32 different banks + 1-word broadcast each CUDA Optimization Tutorial 16 Coalesced Main Memory Accesses single coalesced access NVIDIA CUDA Optimization Tutorial one and two coalesced accesses* NVIDIA 17 Outline Introduction GPU programming N-body example Porting and tuning Other considerations NASA/JPL-Caltech/SSC Conclusions CUDA Optimization Tutorial 18 N-body Simulation Time evolution of physical system System consists of bodies “n” is the number of bodies Bodies interact via pair-wise forces RUG Many systems can be modeled in this way Star/galaxy clusters (gravitational force) Particles (electric force, magnetic force) Cornell CUDA Optimization Tutorial 19 Simple N-body Algorithm Algorithm Initialize body masses, positions, and velocities Iterate over time steps { Accumulate forces acting on each body Update body positions and velocities based on force } Output result More sophisticated n-body algorithms exist Barnes Hut algorithm (covered in Part II) Fast Multipole Method (FMM) CUDA Optimization Tutorial 20 Key Loops (Pseudo Code) bodySet = 
...;  // input
  for timestep do {                        // sequential
    foreach Body b1 in bodySet {           // O(n^2) parallel
      foreach Body b2 in bodySet {
        if (b1 != b2) { b1.addInteractionForce(b2); }
      }
    }
    foreach Body b in bodySet {            // O(n) parallel
      b.Advance();
    }
  }
  // output result

Force Calculation C Code
  struct Body {
    float mass, posx, posy, posz;              // mass and 3D position
    float velx, vely, velz, accx, accy, accz;  // 3D velocity & acceleration
  } *body;

  for (i = 0; i < nbodies; i++) {
    . . .
    for (j = 0; j < nbodies; j++) {
      if (i != j) {
        dx = body[j].posx - px;                     // delta x
        dy = body[j].posy - py;                     // delta y
        dz = body[j].posz - pz;                     // delta z
        dsq = dx*dx + dy*dy + dz*dz;                // distance squared
        dinv = 1.0f / sqrtf(dsq + epssq);           // inverse distance
        scale = body[j].mass * dinv * dinv * dinv;  // scaled force
        ax += dx * scale;                           // accumulate x contribution of accel
        ay += dy * scale;  az += dz * scale;        // ditto for y and z
      }
    }
    . . .

Outline
Introduction, GPU programming, N-body example, Porting and tuning, Other considerations, Conclusions

N-body Algorithm Suitability for GPU
Lots of data parallelism: the force calculations are independent, so we should be able to keep the SMs and PEs busy.
Sufficient memory access regularity: all force calculations access the body data in the same order*, so there should be lots of coalesced memory accesses.
Sufficient code regularity: all force calculations are identical*, so there should be little thread divergence.
Plenty of data reuse: O(n^2) operations on O(n) data, so the CPU/GPU transfer time is insignificant.

C to CUDA Conversion
Two CUDA kernels: the force calculation, and advancing the positions and velocities.
Benefits: the force calculation requires over 99.9% of the runtime and is therefore the primary target for acceleration.
The advancing kernel is unimportant to the runtime, but it allows us to keep the data on the GPU during the entire simulation, which minimizes GPU/CPU transfers.

C to CUDA Conversion (code)
  // __global__ indicates a GPU kernel that the CPU can call
  __global__ void ForceCalcKernel(int nbodies, struct Body *body, ...) { . . . }
  __global__ void AdvancingKernel(int nbodies, struct Body *body, ...) { . . . }

  int main(...) {
    Body *body, *bodyl;  // separate address spaces, need two pointers
    . . .
    cudaMalloc((void**)&bodyl, sizeof(Body)*nbodies);                  // allocate memory on GPU
    cudaMemcpy(bodyl, body, sizeof(Body)*nbodies, cuda…HostToDevice);  // copy CPU data to GPU
    for (timestep = ...) {
      // call the GPU kernels with 1 block and 1 thread per block (for now)
      ForceCalcKernel<<<1, 1>>>(nbodies, bodyl, ...);
      AdvancingKernel<<<1, 1>>>(nbodies, bodyl, ...);
    }
    cudaMemcpy(body, bodyl, sizeof(Body)*nbodies, cuda…DeviceToHost);  // copy GPU data back to CPU
    cudaFree(bodyl);
    . . .
  }
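The fragments above come straight from the slides. As a point of reference, here is one way they might fit together into a minimal, compilable initial port; the problem size, time step, softening constant, and the simple Euler update are illustrative placeholders, not the tutorial's actual values or integrator:

  #include <cstdio>
  #include <cstdlib>
  #include <cuda_runtime.h>

  struct Body {
    float mass, posx, posy, posz;              // mass and 3D position
    float velx, vely, velz, accx, accy, accz;  // 3D velocity and acceleration
  };

  __global__ void ForceCalcKernel(int nbodies, Body *body, float epssq) {
    for (int i = 0; i < nbodies; i++) {        // still sequential: 1 block, 1 thread
      float px = body[i].posx, py = body[i].posy, pz = body[i].posz;
      float ax = 0.0f, ay = 0.0f, az = 0.0f;
      for (int j = 0; j < nbodies; j++) {
        if (i != j) {
          float dx = body[j].posx - px, dy = body[j].posy - py, dz = body[j].posz - pz;
          float dsq = dx*dx + dy*dy + dz*dz;
          float dinv = 1.0f / sqrtf(dsq + epssq);
          float scale = body[j].mass * dinv * dinv * dinv;
          ax += dx * scale;  ay += dy * scale;  az += dz * scale;
        }
      }
      body[i].accx = ax;  body[i].accy = ay;  body[i].accz = az;
    }
  }

  __global__ void AdvancingKernel(int nbodies, Body *body, float dt) {
    for (int i = 0; i < nbodies; i++) {        // simple Euler update, for illustration only
      body[i].velx += body[i].accx * dt;  body[i].vely += body[i].accy * dt;  body[i].velz += body[i].accz * dt;
      body[i].posx += body[i].velx * dt;  body[i].posy += body[i].vely * dt;  body[i].posz += body[i].velz * dt;
    }
  }

  int main() {
    const int nbodies = 10000;                 // illustrative problem size
    const float dt = 0.025f, epssq = 0.05f;    // illustrative time step and softening
    Body *body = (Body *)malloc(sizeof(Body) * nbodies);
    Body *bodyl;
    // ... initialize masses, positions, and velocities on the CPU ...
    cudaMalloc((void **)&bodyl, sizeof(Body) * nbodies);
    cudaMemcpy(bodyl, body, sizeof(Body) * nbodies, cudaMemcpyHostToDevice);
    for (int timestep = 0; timestep < 100; timestep++) {
      ForceCalcKernel<<<1, 1>>>(nbodies, bodyl, epssq);
      AdvancingKernel<<<1, 1>>>(nbodies, bodyl, dt);
    }
    cudaMemcpy(body, bodyl, sizeof(Body) * nbodies, cudaMemcpyDeviceToHost);
    cudaFree(bodyl);
    printf("first body position: %f %f %f\n", body[0].posx, body[0].posy, body[0].posz);
    free(body);
    return 0;
  }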
Evaluation Methodology
Systems and compilers:
CC 1.3: Quadro FX 5800, nvcc 3.2; 30 SMs, 240 PEs, 1.3 GHz, 30720 resident threads.
CC 2.0: Tesla C2050, nvcc 3.2; 14 SMs, 448 PEs, 1.15 GHz, 21504 resident threads.
CC 3.0: GeForce GTX 680, nvcc 4.2; 8 SMs, 1536 PEs, 1.0 GHz, 16384 resident threads.
Inputs and metric: 1k, 10k, or 100k star clusters (Plummer model); median runtime of three experiments, excluding I/O.

1-Thread Performance
Problem sizes (one per GPU): n=10000, n=10000, and n=3000, each with step=1.
Slowdown relative to CPU: CC 1.3: 72.4; CC 2.0: 36.7; CC 3.0: 68.1. (Note: comparing different GPUs to different CPUs.)
Performance: 1 thread is one to two orders of magnitude slower on the GPU than on the CPU.
Reasons: no caches (CC 1.3), not superscalar, slower clock frequency, no SMT latency hiding.

Using N Threads
Approach: eliminate the outer loop and instantiate n copies of the inner loop, one per body.
Threading: blocks can only hold 512 or 1024 threads, up to 8/16 blocks can be resident in an SM at a time, and an SM can hold 1024, 1536, or 2048 threads.
We use 256 threads per block (works for all of our GPUs), so we need multiple blocks; the last block may not need all of its threads.

Using N Threads (code)
  __global__ void ForceCalcKernel(int nbodies, struct Body *body, ...) {
    // outer loop eliminated; compute i from the thread and block index instead
    i = threadIdx.x + blockIdx.x * blockDim.x;
    if (i < nbodies) {  // in case the last block is only partially used
      for (j = ...) { . . . }
    }
  }
  __global__ void AdvancingKernel(int nbodies, ...)  // same changes

  #define threads 256
  int main(...) {
    . . .
    int blocks = (nbodies + threads - 1) / threads;  // compute block count
    for (timestep = ...) {
      ForceCalcKernel<<<blocks, threads>>>(nbodies, bodyl, ...);
      AdvancingKernel<<<blocks, threads>>>(nbodies, bodyl, ...);
    }
  }

N Thread Speedup
Relative to 1 GPU thread: CC 1.3: 7781 (240 PEs); CC 2.0: 6495 (448 PEs); CC 3.0: 12150 (1536 PEs).
The speedup is much higher than the number of PEs (32, 14.5, and 7.9 times) due to SMT latency hiding.
Relative to 1 CPU thread: CC 1.3: 107.5; CC 2.0: 176.7; CC 3.0: 176.2.
Per-core performance: a CPU core delivers up to 4.4, 5, and 8.7 times as much performance as a GPU core (PE).

Using Scalar Arrays
Data structure conversion: arrays of structs are bad for coalescing because the bodies' elements (e.g., the mass fields) are not adjacent.
Optimize the data structure: use multiple scalar arrays, one per field (we need 10).
This results in code bloat but often much better speed (structs in array -> scalar arrays).

Using Scalar Arrays (code)
  __global__ void ForceCalcKernel(int nbodies, float *mass, ...) {
    // change all "body[k].blah" to "blah[k]"
  }
  __global__ void AdvancingKernel(int nbodies, float *mass, ...) {
    // change all "body[k].blah" to "blah[k]"
  }

  int main(...) {
    float *mass, *posx, *posy, *posz, *velx, *vely, *velz, *accx, *accy, *accz;
    float *massl, *posxl, *posyl, *poszl, *velxl, *velyl, *velzl, ...;
    mass = (float *)malloc(sizeof(float) * nbodies);  // etc
    . . .
    cudaMalloc((void**)&massl, sizeof(float)*nbodies);  // etc
    cudaMemcpy(massl, mass, sizeof(float)*nbodies, cuda…HostToDevice);  // etc
    for (timestep = ...) {
      ForceCalcKernel<<<blocks, threads>>>(nbodies, massl, posxl, ...);
      AdvancingKernel<<<blocks, threads>>>(nbodies, massl, posxl, ...);
    }
    cudaMemcpy(mass, massl, sizeof(float)*nbodies, cuda…DeviceToHost);  // etc
    . . .
} CUDA Optimization Tutorial 33 Scalar Array Speedup Problem size n=100000, step=1 n=100000, step=1 n=300000, step=1 Performance Threads access same memory locations, not adjacent ones Always combined but not really coalesced access Relative to struct Slowdowns may be due to CC 1.3: 0.83 DRAM page/TLB misses CC 2.0: 0.96 CC 3.0: 0.82 CUDA Optimization Tutorial Scalar arrays Still needed (see later) 34 Constant Kernel Parameters Kernel parameters Lots of parameters due to scalar arrays All but one parameter never change their value Constant memory “Pass” parameters only once Copy them into GPU’s constant memory Performance implications Reduced parameter passing overhead Constant memory has hardware cache CUDA Optimization Tutorial 35 Constant Kernel Parameters __constant__ int nbodiesd; __constant__ float dthfd, epssqd, float *massd, *posxd, ...; __global__ void ForceCalcKernel(int step) { // rename affected variables (add “d” to name) } __global__ void AdvancingKernel() { // rename affected variables (add “d” to name) } int main(...) { . . . cudaMemcpyToSymbol(massd, &massl, sizeof(void *)); // etc . . . for (timestep = ...) { ForceCalcKernel<<<1, 1>>>(step); AdvancingKernel<<<1, 1>>>(); } . . . } CUDA Optimization Tutorial 36 Constant Mem. Parameter Speedup Problem size Performance n=1000, step=10000 Minimal perf. impact n=1000, step=10000 May be useful for very n=3000, step=10000 Speedup CC 1.3: 1.015 CC 2.0: 1.016 CC 3.0: 0.971 CUDA Optimization Tutorial short kernels that are often invoked Benefit Less shared memory used on CC 1.3 devices 37 Using the RSQRTF Instruction Slowest kernel operation Computing one over the square root is very slow GPU has slightly imprecise but fast 1/sqrt instruction (frequently used in graphics code to calculate inverse of distance to a point) IEEE floating-point accuracy compliance CC 1.x is not entirely compliant CC 2.x and above are compliant but also offer faster non-compliant instructions CUDA Optimization Tutorial 38 Using the RSQRT Instruction for (i = 0; i < nbodies; i++) { . . . for (j = 0; j < nbodies; j++) { if (i != j) { dx = body[j].posx - px; dy = body[j].posy - py; dz = body[j].posz - pz; dsq = dx*dx + dy*dy + dz*dz; dinv = 1.0f / sqrtf(dsq + epssq); dinv = rsqrtf(dsq + epssq); scale = body[j].mass * dinv * dinv * dinv; ax += dx * scale; ay += dy * scale; az += dz * scale; } } . . . 
} CUDA Optimization Tutorial 39 RSQRT Speedup Problem size n=100000, step=1 n=100000, step=1 Performance No change for CC 1.3 n=300000, step=1 Speedup CC 1.3: 0.99 CC 2.0: 1.83 CC 3.0: 1.64 CUDA Optimization Tutorial Compiler automatically uses less precise RSQRTF as most FP ops are not fully precise anyhow 83% speedup for CC 2.0 Over entire application Compiler defaults to precise instructions Explicit use of RSQRTF indicates imprecision okay 40 Using 2 Loops to Avoid If Statement “if (i != j)” causes thread divergence Break loop into two loops to avoid if statement for (j = 0; j < nbodies; j++) { if (i != j) { dx = body[j].posx - px; dy = body[j].posy - py; dz = body[j].posz - pz; dsq = dx*dx + dy*dy + dz*dz; dinv = rsqrtf(dsq + epssq); scale = body[j].mass * dinv * dinv * dinv; ax += dx * scale; ay += dy * scale; az += dz * scale; } } CUDA Optimization Tutorial 41 Using 2 Loops to Avoid If Statement for (j = 0; j < i; j++) { dx = body[j].posx - px; dy = body[j].posy - py; dz = body[j].posz - pz; dsq = dx*dx + dy*dy + dz*dz; dinv = rsqrtf(dsq + epssq); scale = body[j].mass * dinv * dinv * dinv; ax += dx * scale; ay += dy * scale; az += dz * scale; } for (j = i+1; j < nbodies; j++) { dx = body[j].posx - px; dy = body[j].posy - py; dz = body[j].posz - pz; dsq = dx*dx + dy*dy + dz*dz; dinv = rsqrtf(dsq + epssq); scale = body[j].mass * dinv * dinv * dinv; ax += dx * scale; ay += dy * scale; az += dz * scale; } CUDA Optimization Tutorial 42 Loop Duplication Speedup Problem size n=100000, step=1 n=100000, step=1 n=300000, step=1 Performance No change for 2.0 & 3.0 45% slowdown for CC 1.3 Speedup CC 1.3: 0.55 CC 2.0: 1.00 CC 3.0: 1.00 CUDA Optimization Tutorial Divergence moved to loop Unclear why Discussion Not a useful optimization Code bloat A little divergence is okay (only 1 in 3125 iterations) 43 Blocking using Shared Memory Code is memory bound Each warp streams in all bodies’ masses and positions Block inner loop Read block of mass & position info into shared mem Requires barriers (fast hardware barrier within SM) Advantage A lot fewer main memory accesses Remaining main memory accesses are fully coalesced (due to usage of scalar arrays) CUDA Optimization Tutorial 44 Blocking using Shared Memory __shared__ float posxs[threads], posys[threads], poszs[…], masss[…]; j = 0; for (j1 = 0; j1 < nbodiesd; j1 += THREADS) { // first part of loop idx = tid + j1; if (idx < nbodiesd) { // each thread copies 4 words (fully coalesced) posxs[id] = posxd[idx]; posys[id] = posyd[idx]; poszs[id] = poszd[idx]; masss[id] = massd[idx]; } __syncthreads(); // wait for all copying to be done bound = min(nbodiesd - j1, THREADS); for (j2 = 0; j2 < bound; j2++, j++) { // second part of loop if (i != j) { dx = posxs[j2] – px; dy = posys[j2] – py; dz = poszs[j2] - pz; dsq = dx*dx + dy*dy + dz*dz; dinv = rsqrtf(dsq + epssqd); scale = masss[j2] * dinv * dinv * dinv; ax += dx * scale; ay += dy * scale; az += dz * scale; } } __syncthreads(); // wait for all force calculations to be done } CUDA Optimization Tutorial 45 Blocking Speedup Problem size Performance n=100000, step=1 Great speedup for CC 1.3 n=100000, step=1 Some speedup for others n=300000, step=1 Speedup CC 1.3: 3.7 CC 2.0: 1.1 CC 3.0: 1.6 CUDA Optimization Tutorial Has hardware data cache Discussion Very important optimization for memory bound code Even with L1 cache 46 Loop Unrolling CUDA compiler Generally good at unrolling loops with fixed bounds Does not unroll inner loop of our example code Use pragma to unroll (and pad arrays) #pragma unroll 8 for (j2 
= 0; j2 < bound; j2++, j++) { if (i != j) { dx = posxs[j2] – px; dy = posys[j2] – py; dz = poszs[j2] - pz; dsq = dx*dx + dy*dy + dz*dz; dinv = rsqrtf(dsq + epssqd); scale = masss[j2] * dinv * dinv * dinv; ax += dx * scale; ay += dy * scale; az += dz * scale; } } CUDA Optimization Tutorial 47 Loop Unrolling Speedup Problem size Performance n=100000, step=1 Noticeable speedup n=100000, step=1 All three GPUs n=300000, step=1 Discussion Can be useful Speedup CC 1.3: 1.07 CC 2.0: 1.16 CC 3.0: 1.07 CUDA Optimization Tutorial May increase register usage, which may lower maximum number of threads per block and result in slowdown 48 CC 2.0 Absolute Performance Problem size n=100000, step=1 Runtime 612 ms FP operations 326.7 GFlop/s Main mem throughput 1.035 GB/s Not peak performance Only 32% of 1030 GFlop/s Peak assumes FMA every cycle 3 sub (1c), 3 fma (1c), 1 rsqrt (8c), 3 mul (1c), 3 fma (1c) = 20c for 20 Flop 63% of realistic peak of 515.2 GFlop/s Assumes no non-FP operations With int ops = 31c for 20 Flop 99% of actual peak of 330.45 GFlop/s CUDA Optimization Tutorial 49 Eliminating the If Statement Algorithmic optimization Potential softening parameter avoids division by zero If-statement is not necessary and can be removed Eliminates thread divergence for (j2 = 0; j2 < bound; j2++, j++) { if (i != j) { dx = posxs[j2] – px; dy = posys[j2] – py; dz = poszs[j2] - pz; dsq = dx*dx + dy*dy + dz*dz; dinv = rsqrtf(dsq + epssqd); scale = masss[j2] * dinv * dinv * dinv; ax += dx * scale; ay += dy * scale; az += dz * scale; } } CUDA Optimization Tutorial 50 If Elimination Speedup Problem size Performance n=100000, step=1 Large speedup n=100000, step=1 All three GPUs n=300000, step=1 Discussion Speedup CC 1.3: 1.38 CC 2.0: 1.54 CC 3.0: 1.64 CUDA Optimization Tutorial No thread divergence Allows compiler to schedule code much better 51 Rearranging Terms Generated code is suboptimal Compiler does not emit as many fused multiply-add (FMA) instructions as it could Rearrange terms in expressions to help compiler Need to check generated assembly code for (j2 = 0; j2 < bound; j2++, j++) { dx = posxs[j2] – px; dy = posys[j2] – py; dz = poszs[j2] - pz; dsq = dx*dx + dy*dy + dz*dz; dinv = rsqrtf(dsq + epssqd); dsq = dx*dx + (dy*dy + (dz*dz + epssqd)); dinv = rsqrtf(dsq); scale = masss[j2] * dinv * dinv * dinv; ax += dx * scale; ay += dy * scale; az += dz * scale; } CUDA Optimization Tutorial 52 FMA Speedup Problem size Performance n=100000, step=1 Small speedup n=100000, step=1 All three GPUs n=300000, step=1 Discussion Speedup CC 1.3: 1.03 CC 2.0: 1.05 Seemingly needless transformations can make a difference CC 3.0: 1.06 CUDA Optimization Tutorial 53 Higher Unroll Factor Problem size Unroll 128 times n=100000, step=1 Avoid looping overhead n=100000, step=1 Now that there are no ifs n=300000, step=1 Performance Speedup CC 1.3: 1.01 CC 2.0: 1.04 CC 3.0: 0.93 CUDA Optimization Tutorial Little speedup/slowdown Discussion Carefully choose unroll factor (manually tune) 54 Compiler Flags Problem size -use_fast_math n=100000, step=1 “-ftz=true” suffices n=100000, step=1 (flush denormals to zero) Makes SP FP operations faster except on CC 1.3 n=300000, step=1 Speedup CC 1.3: 1.00 CC 2.0: 1.18 CC 3.0: 1.15 Performance Significant speedup Discussion Use faster but less precise operations when prudent CUDA Optimization Tutorial 55 Final Absolute Performance CC 2.0 Fermi GTX 480 Problem size n=100000, step=1 Runtime 296.1 ms FP operations 675.6 GFlop/s (SP) 66% of peak performance 261.1 GFlops/s (DP) Main mem throughput 
2.139 GB/s CUDA Optimization Tutorial CC 3.0 Kepler GTX 680 Problem size n=300000, step=1 Runtime 1073 ms FP operations 1677.6 GFlop/s (SP) 54% of peak performance 88.7 GFlops/s (DP) Main mem throughput 5.266 GB/s 56 Outline Introduction GPU programming N-body example Porting and tuning Other considerations Conclusions CUDA Optimization Tutorial gamedsforum.ca Thepcreport.net 57 Things to Consider Minimize PCIe transfers Implementing entire algorithm on GPU, even some slow serial code sections, might be overall win Can stream data to/from GPU while computing Locks and synchronization Lightweight locks & fast barriers possible within SM Slow across different SMs L1 data caches are not coherent Use volatile & fences to avoid deadlocks CUDA Optimization Tutorial 58 Warp-based Execution // wrong on GPU, correct on CPU do { cnt = 0; if (ready[i] != 0) cnt++; if (ready[j] != 0) cnt++; } while (cnt < 2); ready[k] = 1; // correct do { cnt = 0; if (ready[i] != 0) cnt++; if (ready[j] != 0) cnt++; if (cnt == 2) ready[k] = 1; } while (cnt < 2); CUDA Optimization Tutorial Problem Thread divergence Loop exiting threads wait for other threads in warp to also exit “ready[k] = 1” is not executed until all threads in warp are done with loop Possible deadlock 59 Hybrid Execution CPU always needed for program launch and I/O CPU much faster on serial program segments GPU 10 times faster than CPU on parallel code Running 10% of problem on CPU is hardly worthwhile Complicates programming and requires data transfer Best CPU data structure is often not best for GPU PCIe bandwidth much lower than GPU bandwidth 1.6 to 6.5 GB/s versus 192 GB/s But can send data while CPU & GPU are computing Merging CPU and GPU on same die (e.g., AMD’s Fusion APU) makes finer grain switching possible CUDA Optimization Tutorial 60 Outline Introduction GPU programming N-body example Porting and tuning Other considerations Conclusions CUDA Optimization Tutorial gamedsforum.ca 61 Summary and Conclusions (Part I) Step-by-step porting and tuning of CUDA code Example: n-body simulation GPUs have very powerful hardware But only exploitable with some codes Even harder to program and optimize than CPU hardware CUDA Optimization Tutorial 62 Parallelizing and Optimizing Programs for GPU Acceleration using CUDA (Part II) Martin Burtscher Department of Computer Science Mapping Regular Code to GPUs Regular codes Operate on array- and matrix-based data structures Exhibit mostly strided memory access patterns Have relatively predictable control flow (control flow behavior is mainly determined by input size) Largely independent computations Many regular codes are easy to port to GPUs E.g., matrix codes executing many ops/word Dense matrix operations (level 2 and 3 BLAS) Stencil codes (PDE solvers) CUDA Optimization Tutorial LLNL 64 Mapping Irregular Code to GPUs Irregular codes Build, traverse, and update dynamic data structures FSU (trees, graphs, linked lists, priority queues, etc.) 
Exhibit pointer-chasing memory access patterns.
Have complex control flow (control flow behavior depends on input values and changes dynamically).
Many important scientific programs are irregular: e.g., n-body simulation, data clustering, SAT solving, social networks, discrete-event simulation, meshing, …
We need case studies on how to best map irregular codes to GPUs.

Example: N-body Simulation
Irregular Barnes Hut algorithm: repeatedly builds an unbalanced tree and performs complex traversals on it.
Our implementation is designed for GPUs (not just a port of CPU code) and is the first GPU implementation of the entire BH algorithm.
Results: the GPU is 21 times faster than the CPU (6 cores) on this code.

Outline
Introduction, Barnes Hut algorithm, CUDA implementation, Experimental results, Conclusions

Barnes Hut Idea
Precise force calculation requires O(n^2) operations (O(n^2) body pairs), which is computationally intractable for large n.
Barnes and Hut (1986): an algorithm to approximately compute forces (the bodies' initial position & velocity are also approximate) that requires only O(n log n) operations.
The idea is to "combine" far away bodies; the error should be small because force ~ 1/distance^2.

Barnes Hut Algorithm
Set the bodies' initial position and velocity, then iterate over time steps:
1. Compute bounding box around bodies.
2. Subdivide space until at most one body per cell; record this spatial hierarchy in an octree.
3. Compute mass and center of mass of each cell.
4. Compute force on bodies by traversing the octree; stop each traversal path when encountering a leaf (body) or an internal node (cell) that is far enough away.
5. Update each body's position and velocity.

Build Tree (Levels 1-5)
(Figures: the bounding box around all bodies forms the tree root; space is then repeatedly subdivided until there is at most one body per cell.)

Compute Cells' Center of Mass
(Figure: for each internal cell, compute the sum of the masses and the weighted average of the positions of all bodies in its subtree; the example shows two cells only.)

Compute Forces
(Figure: compute the force, for example, acting upon the green body by scanning the tree depth first from left to right.)
Compute Force (short distance): the green portion of the tree has already been completed.
Compute Force (down one level): the red center of mass is too close, so we need to go down one level.
Compute Force (long distance): the blue center of mass is far enough away.
Compute Force (skip subtree): therefore, the entire subtree rooted in the blue cell can be skipped.
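To make the traversal sketched in these figures concrete, here is a minimal sequential (CPU-style) version of the depth-first force walk with the usual opening-angle test (a node is "far enough away" when size/distance < theta); the Node layout, field names, and the theta parameter are illustrative assumptions, not the GPU implementation described in the rest of this part:

  #include <stddef.h>
  #include <math.h>

  struct Node {                        // either an internal cell or a leaf (body)
    float mass, posx, posy, posz;      // total mass and center of mass
    float size;                        // side length of the cell (0 for a body)
    struct Node *child[8];             // all NULL for leaves
  };

  // accumulate the acceleration that "node" exerts on the body at (px, py, pz)
  void addForce(struct Node *node, float px, float py, float pz,
                float epssq, float theta, float *ax, float *ay, float *az) {
    float dx = node->posx - px, dy = node->posy - py, dz = node->posz - pz;
    float dsq = dx*dx + dy*dy + dz*dz;
    if (node->child[0] == NULL || node->size * node->size < theta * theta * dsq) {
      // leaf, or cell far enough away: use its center of mass directly;
      // epssq (softening) also keeps the body's interaction with itself harmless
      float dinv = 1.0f / sqrtf(dsq + epssq);
      float scale = node->mass * dinv * dinv * dinv;
      *ax += dx * scale;  *ay += dy * scale;  *az += dz * scale;
    } else {
      // cell too close: go down one level and visit all existing children
      for (int c = 0; c < 8; c++) {
        if (node->child[c] != NULL) {
          addForce(node->child[c], px, py, pz, epssq, theta, ax, ay, az);
        }
      }
    }
  }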
Pseudocode
  bodySet = ...
  foreach timestep do {
    bounding_box = new Bounding_Box();
    foreach Body b in bodySet { bounding_box.include(b); }
    octree = new Octree(bounding_box);
    foreach Body b in bodySet { octree.Insert(b); }
    cellList = octree.CellsByLevel();
    foreach Cell c in cellList { c.Summarize(); }
    foreach Body b in bodySet { b.ComputeForce(octree); }
    foreach Body b in bodySet { b.Advance(); }
  }

Complexity and Parallelism
The same pseudocode, annotated loop by loop:
  foreach timestep do { ... }                              // O(n log n) + ordered sequential
  foreach Body b in bodySet { bounding_box.include(b); }   // O(n) parallel reduction
  foreach Body b in bodySet { octree.Insert(b); }          // O(n log n) top-down tree building
  foreach Cell c in cellList { c.Summarize(); }            // O(n) + ordered bottom-up traversal
  foreach Body b in bodySet { b.ComputeForce(octree); }    // O(n log n) fully parallel
  foreach Body b in bodySet { b.Advance(); }               // O(n) fully parallel

Outline
Introduction, Barnes Hut algorithm, CUDA implementation, Experimental results, Conclusions

Efficient GPU Code
Large amounts of data parallelism, coalesced main memory accesses, little thread divergence, relatively little synchronization between blocks, little CPU/GPU data transfer, efficient use of shared memory.

Main BH Implementation Challenges
Uses an irregular tree-based data structure: initially little parallelism, little coalescing, load imbalance.
Complex recursive traversals: recursion is not well supported, lots of thread divergence.
Memory-bound pointer-chasing operations: not enough computation to hide latency.

Six GPU Kernels
Read initial data and transfer to GPU; for each timestep do {
1. Compute bounding box around bodies (not irregular)
2. Build hierarchical decomposition, i.e., octree
3. Summarize body information in internal octree nodes
4. Approximately sort bodies by spatial location (optional)
5. Compute forces acting on each body with help of octree
6. Update body positions and velocities (not irregular)
}; transfer result from GPU and output.

Global Optimizations
Make the code iterative (recursion is not supported*).
Keep the data on the GPU between kernel calls.
Use array elements instead of heap nodes: one aligned array per field for coalesced accesses (objects on heap -> objects in array -> fields in arrays); a small sketch of this array-based layout follows below, after which the list of global optimizations continues.
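A small illustration of the layout change just mentioned; the array names, the MAXNODES bound, and the -1 "empty" sentinel are illustrative assumptions, not the actual implementation:

  /* Pointer-based node, as a CPU implementation might declare it
     (pointer chasing; the fields of different nodes are far apart in memory): */
  struct Node {
    float mass, posx, posy, posz;
    struct Node *child[8];
  };
  /* child = node->child[k]; */

  /* GPU-friendly alternative: one aligned scalar array per field and integer
     indices instead of pointers: */
  #define MAXNODES (1 << 20)                 /* bodies + cells, sized from the input */
  __device__ float massd[MAXNODES], posxd[MAXNODES], posyd[MAXNODES], poszd[MAXNODES];
  __device__ int   childd[8 * MAXNODES];     /* 8 child indices per cell, -1 = empty */
  /* child = childd[8 * node + k];  index arithmetic replaces pointer chasing, a node
     is just an int, and neighboring threads reading massd[] touch one contiguous array */

The remaining global optimizations, continued: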
Maximize thread count (round down to warp size) Maximize resident block count (all SMs filled) Pass kernel parameters through constant memory Use special allocation order c0 Alias arrays (56 B/node) Use index arithmetic Persistent blocks & threads Unroll loops over children CUDA Optimization Tutorial c2 b5 ba c4 c3 b3 b6 b2 c1 b7 b1 bodies (fixed) b0 b1 b2 b3 b4 b5 b6 b7 b8 b9 ba b0 c5 b8 b4 b9 cell allocation direction ... c5 c4 c3 c2 c1 c0 88 Kernel 1: Bounding Box (Regular) main memory ... Optimizations threads warp 1 warp 2 warp 3 warp 4 Fully coalesced shared memory barrier threads warp 1 warp 2 No bank conflicts shared memory barrier threads warp 1 barrier t1 t2 shared memory barrier threads shared memory Minimal divergence Built-in min and max shared memory threads Fully cached t1 Reduction operation CUDA Optimization Tutorial 2 red/mem, 6 red/bar Bodies load balanced 512*3 threads per SM 89 Kernel 2: Build Octree (Irregular) Optimizations Only lock leaf “pointers” Top-down tree building Lock-free fast path Light-weight lock release No re-traverse after lock acquire failure Combined memory fence Re-compute position during traversal Separate init kernels 512*3 threads per SM CUDA Optimization Tutorial * 90 Kernel 2: Build Octree (cont.) // initialize cell = find_insertion_point(body); // no locks, cache cell child = get_insertion_index(cell, body); if (child != locked) { // skip atomic if already locked if (child == null) { // fast path (frequent) if (null == atomicCAS(&cell[child], null, body)) { // lock-free insertion // move on to next body } } else { if (child == atomicCAS(&cell[child], child, lock)) { // acquire lock // build subtree with new and existing body flag = true; } } } __syncthreads(); // optional barrier __threadfence(); // make data visible if (flag) { cell[child] = new_subtree; // insert subtree and releases lock // move on to next body } CUDA Optimization Tutorial 91 Kernel 3: Summarize Subtrees (Irreg.) Optimizations Bottom-up tree traversal Scan avoids deadlock Use mass as flag + fence 4 3 1 2 3 No locks, no atomics Use wait-free first pass 4 Cache the ready info Piggyback on traversal allocation direction ... 3 4 1 2 3 4 scan direction CUDA Optimization Tutorial Count bodies in subtrees No parent “pointers” 128*6 threads per SM 92 Kernel 4: Sort Bodies (Irregular) Optimizations Top-down tree traversal 4 3 1 2 3 (Similar to Kernel 3) Scan avoids deadlock Use data field as flag 4 Use counts from Kernel 3 Piggyback on traversal allocation direction ... 
(Figure: cell array with allocation direction and scan direction, as for Kernel 3.)
No locks, no atomics.
Move nulls to the back.
Throttle warps with an optional barrier.
64*6 threads per SM.

Kernel 5: Force Calculation (Irregular)
Optimizations: multiple prefix traversals; group similar work together; use the sorting to minimize the size of the prefix union in each warp; early out (nulls are in the back); traverse the whole union to avoid divergence (warp voting); lane 0 controls the iteration stack for the entire warp (fits in shared memory); minimize volatile accesses; use the fast 1/sqrtf instruction; cache tree-level-based data. (A simplified sketch of this warp-level traversal appears just before the runtime comparison below.)
256*5 threads per SM.

Architectural Support
Coalesced memory accesses & lockstep execution: all threads in a warp read the same tree node at the same time, so only one memory access per warp is needed instead of 32 accesses.
Warp-based execution: enables data sharing within warps without synchronization.
RSQRTF instruction: quickly computes a good approximation of 1/sqrtf(x).
Warp voting instructions: quickly perform reduction operations within a warp.

Kernel 6: Advance Bodies (Regular)
Optimizations: fully coalesced, no divergence; load balanced, 1024*1 threads per SM; straightforward streaming from and to main memory.

Outline
Introduction, Barnes Hut algorithm, CUDA implementation, Experimental results, Conclusions
(Figure: runtime per timestep [s] versus number of bodies for CPUbh, GPUbh, and GPUsq.)

Evaluation Methodology
Implementations: CUDA/GPU: Barnes Hut and O(n^2) algorithms; OpenMP/CPU: Barnes Hut algorithm (derived from CUDA); Pthreads/CPU: Barnes Hut algorithm (SPLASH-2 suite).
Systems and compilers: nvcc 4.0 (-O3 -arch=sm_20 -ftz=true*) on a GeForce GTX 480, 1.4 GHz, 15 SMs, 32 cores per SM; gcc 4.1.2 (-O3 -fopenmp* -ffast-math*) on a Xeon X5690, 3.46 GHz, 6 cores, 2 threads per core.
Inputs and metric: 5k, 50k, 500k, and 5M star clusters (Plummer model); best runtime of three experiments, excluding I/O.

Nodes Touched per Activity (5M Input)
Kernel "activities" and neighborhood sizes (min / avg / max):
  K1: pair reduction       1 /     2.0 /     2
  K2: tree insertion       2 /    13.2 /    22
  K3: bottom-up step       2 /     4.1 /     9
  K4: top-down step        2 /     4.1 /     9
  K5: prefix traversal   818 / 4,117.0 / 6,315
  K6: integration step     1 /     1.0 /     1
Max tree depth ≤ 22; cells have 3.1 children on average.
A prefix contains ≤ 6,315 nodes (≤ 0.1% of 7.4 million), so the BH algorithm & sorting to minimize the union work well.

Available Amorphous Data Parallelism
(Figure: available parallelism per kernel and round for the 5,000- to 5,000,000-body inputs, covering bounding box, tree building, summarization, sorting, force calculation, and integration.)
Almost every "round" has lots of activities without data dependencies that can be processed in parallel.
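Before turning to the results, here is a heavily simplified sketch of the warp-synchronous traversal idea used in the force-calculation kernel, written in the pre-Volta warp-synchronous style the tutorial targets (newer CUDA versions would use __all_sync and __syncwarp). The array names, the per-warp stack layout, the leaf test, and the launch assumptions are illustrative; the real kernel additionally caches tree-level data, throttles warps, and handles arbitrary body counts:

  #define THREADS 256                      // illustrative block size (multiple of the warp size)
  #define STACKSIZE 64                     // illustrative per-warp traversal stack depth
  #define MAXNODES (1 << 20)               // illustrative capacity for bodies + cells

  __device__ int   rootd, nbodiesd;                              // set up by the tree-build kernels
  __device__ int   childd[8 * MAXNODES];                         // 8 child indices per cell, -1 = empty
  __device__ float massd[MAXNODES], sized[MAXNODES];             // node mass and cell size
  __device__ float posxd[MAXNODES], posyd[MAXNODES], poszd[MAXNODES];
  __device__ float accxd[MAXNODES], accyd[MAXNODES], acczd[MAXNODES];
  __constant__ float epssqd, thetasqd;                           // softening and opening angle squared

  __global__ void WarpTraversalKernel() {
    __shared__ volatile int stack[THREADS / 32][STACKSIZE];      // one traversal stack per warp
    int lane = threadIdx.x % 32;
    int warp = threadIdx.x / 32;
    int i = threadIdx.x + blockIdx.x * blockDim.x;               // one body per thread
    // for simplicity, assume the launch spawns exactly nbodiesd threads
    float px = posxd[i], py = posyd[i], pz = poszd[i];
    float ax = 0.0f, ay = 0.0f, az = 0.0f;

    int top = 0;
    if (lane == 0) stack[warp][0] = rootd;                       // lane 0 maintains the warp's stack
    while (top >= 0) {
      int node = stack[warp][top];                               // all 32 lanes read the same node together
      float dx = posxd[node] - px, dy = posyd[node] - py, dz = poszd[node] - pz;
      float dsq = dx*dx + dy*dy + dz*dz;
      int leaf = (node < nbodiesd);                              // bodies occupy the low indices (assumption)
      // warp vote: leaf, or cell far enough away for *every* body handled by this warp?
      if (__all(leaf || (sized[node] * sized[node] < thetasqd * dsq))) {
        float dinv = rsqrtf(dsq + epssqd);                       // softening also neutralizes the self-interaction
        float scale = massd[node] * dinv * dinv * dinv;
        ax += dx * scale;  ay += dy * scale;  az += dz * scale;
        top--;                                                   // the whole warp is done with this subtree
      } else {
        top--;                                                   // otherwise replace the cell by its children
        for (int c = 0; c < 8; c++) {
          int ch = childd[8 * node + c];
          if (ch >= 0) {                                         // same value in all lanes, so no divergence
            top++;
            if (lane == 0) stack[warp][top] = ch;
          }
        }
      }
    }
    accxd[i] = ax;  accyd[i] = ay;  acczd[i] = az;
  }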
Runtime Comparison
(Figure: runtime per timestep [ms] versus number of bodies, 5,000 to 5,000,000, for GPU CUDA, GPU O(n^2), CPU OpenMP, and CPU Pthreads.)
GPU BH inefficiency: the 5k input is too small for the GPU (5,760 to 23,040 threads).
BH vs. O(n^2) algorithm: O(n^2) is faster with fewer than about 15k bodies.
GPU (5M input): 21.1x faster than OpenMP, 23.2x faster than Pthreads.

Kernel Performance for 5M Input
The $200 GPU delivers 228 GFlops/s on irregular code:
                 kernel 1  kernel 2  kernel 3  kernel 4  kernel 5  kernel 6  BarnesHut    O(n^2)
  Gflops/s           71.6       5.8       2.5       n/a     240.6      33.5      228.4     897.0
  GB/s              142.9      26.8      10.6      12.8       8.0     133.9        8.8       2.8
  runtime [ms]        0.4      44.6      28.0      14.2    1641.2       2.2     1730.6  557421.5

The GPU chip is 2.7 to 23.5 times faster than the CPU chip (per-kernel runtimes in ms):
  non-compliant fast single-precision version:
                 kernel 1  kernel 2  kernel 3  kernel 4  kernel 5  kernel 6
  X5690 CPU           5.5     185.7      75.8      52.1  38,540.3      16.4
  GTX 480 GPU         0.4      44.6      28.0      14.2   1,641.2       2.2
  CPU/GPU            13.1       4.2       2.7       3.7      23.5       7.3
  IEEE 754-compliant double-precision version:
                 kernel 1  kernel 2  kernel 3  kernel 4  kernel 5  kernel 6
  X5690 CPU          10.3     193.1     101.0      51.6  47,706.4      33.1
  GTX 480 GPU         0.8      46.7      31.0      14.2   5,177.1       4.2
  CPU/GPU            12.7       4.1       3.3       3.6       9.2       7.9
The GPU hardware is better suited for BH than the CPU hardware, but it is difficult and very time consuming to program.

Kernel Speedups
Optimizations that are generally applicable (speedup for 50,000 / 500,000 / 5,000,000 bodies):
  avoid volatile         1.14x / 1.19x / 1.18x
  rsqrtf instr.          1.43x / 1.47x / 1.46x
  recalc. data           0.99x / 1.32x / 1.69x
  thread voting          2.04x / 2.49x / 2.47x
  full multithreading   20.80x / 27.99x / 28.85x
Optimizations for irregular kernels:
  throttling barrier     0.97x / 1.03x / 1.04x
  wait-free pre-pass     1.02x / 1.21x / 1.31x
  combined mem fence     1.54x / 1.57x / 1.50x
  sorting of bodies      3.60x / 6.28x / 8.21x
  sync'ed execution      6.23x / 8.04x / 8.60x

Outline
Introduction, Barnes Hut algorithm, CUDA implementation, Experimental results, Conclusions

Optimization Summary
Reduce main memory accesses: share data within a warp, combine memory fences & traversals, re-compute data, avoid volatile accesses.
Minimize thread divergence: group similar work together, force synchronicity.
Implement the entire algorithm on and for the GPU: avoid data transfers & data structure inefficiencies, wait-free pre-pass, scan the entire prefix union.

Optimization Summary (cont.)
Exploit hardware features Fast synchronization & thread startup, special instrs., coalesced memory accesses, even lockstep execution Use light-weight locking and synchronization Minimize locks, reuse fields, and use fence + store ops Maximize parallelism Parallelize every step within and across SMs CUDA Optimization Tutorial 106 CPU/GPU Implementation Comparison Irregular CPU code Dynamically (incrementally) allocated shared data structures Structure-based shared data structures Logical lock-based implementation Global/local worklists Recursive or iterative implementation CUDA Optimization Tutorial Irregular GPU code Statically (wholly) allocated shared data structures Multiple-array-based shared data structures Lock-free implementation (Implicit) local worklists Iterative implementation 107 Useful GPU Hardware Features Wide parallelism Great for exploiting large amounts of parallelism Massive multithreading Ideal for hiding latency of irregular mem. accesses Fast thread startup Essential when launching thousands of threads Shared memory Fast data sharing Useful for local worklists CUDA Optimization Tutorial HW support for reduction and synchronization Makes otherwise costly operations very fast Coalesced accesses Memory access combining is useful in irregular codes Lockstep execution Can share data without explicit synchronization Allows to consolidate iteration stacks 108 Challenges with GPUs Warp-based execution Often requires sorting of work or algorithm change Data structure layout Best layout for CPU differs from best layout for GPU SoA can be tedious to code and deal with (parameter passing) Separate memory space Slow transfers Pack/unpack data CUDA Optimization Tutorial Incoherent L1 caches May need to explicitly manage data (fences) Poor recursion support Need to make code iterative and maintain explicit iteration stacks Thread and block counts Hierarchy complicates implementation Optimal counts have to be (auto-)tuned 109 Running Irregular Algorithms on GPUs Mandatory Important Need vast amounts of data Scheduling is independent parallelism Can do large chunks of computation on GPU of previous activities Easy to sort activities by similarity (if needed) Very Important Beneficial Cautious implementation DS can be expressed Easy to express iteratively Has statically known range through fixed arrays Uses local worklists that can be statically populated of neighborhoods DS size (or bound) can be determined based on input CUDA Optimization Tutorial 110 Conclusions Irregularity does not necessarily prevent high- performance on GPUs Entire Barnes Hut algorithm implemented on GPU Builds and traverses unbalanced octree GPU is 21.1 times (float) and 9.1 times (double) faster than high-end 6-core Xeon Code directly for GPU, do not merely adjust CPU code Requires different data and code structures Benefits from different algorithmic modifications CUDA Optimization Tutorial 111 Acknowledgments Hardware NVIDIA Corp. and Intel Corp. Funding NVIDIA Corp. and Texas State University OpenMP code Ricardo Alves (Universidade do Minho, Portugal) Collaborator Keshav Pingali (University of Texas at Austin) CUDA Optimization Tutorial 112 CUDA Optimization Tutorial Martin Burtscher burtscher@txstate.edu http://www.cs.txstate.edu/~burtscher/ Barnes Hut CUDA code http://www.gpucomputing.net/?q=node/1314 Tutorial slides http://www.cs.txstate.edu/~burtscher/tutorials/COT5/slides.pptx CUDA Optimization Tutorial 113