Harnessing the Power of GPUs for Non-Graphics Applications
Michael Boyer
Department of Computer Science, University of Virginia
Advisor: Kevin Skadron
© 2010 Michael Boyer

Outline
• GPU architecture
• Programming GPUs using CUDA
• Case study: Leukocyte Tracking
• Current work: CPU-GPU Task Sharing

Graphics Processors
• Graphics Processing Units (GPUs) are designed specifically for graphics rendering applications
(Image courtesy of GameSpot)

Graphics Applications
• Graphics applications involve applying the same operation to many pieces of data
• Application characteristics:
– Massively parallel
– Only aggregate performance matters

CPU vs. GPU: Architectural Difference 1
(Diagram: a CPU core with fetch/decode logic, branch predictor, register file, out-of-order logic, memory pre-fetcher, and data cache, next to a GPU core reduced to fetch/decode logic, a register file, and execution units)
• Avoid structures that only improve single-thread performance

CPU vs. GPU: Architectural Difference 2
(Diagram: a single GPU fetch/decode unit feeding a thread group with multiple register files and execution units)
• Amortize the overhead of control logic across multiple execution units (SIMD processing)

CPU vs. GPU: Architectural Difference 3
(Diagram: a GPU core multiplexing four thread groups across its execution units)
• Use multiple groups of threads to keep execution units busy and hide memory latency

CPU vs. GPU: Architectural Difference 4
(Diagram: a CPU with a few large cores next to a GPU with roughly 30 small cores)
• Replicate cores to leverage more parallelism

CPU vs. GPU: Architectural Differences
• Summary: take advantage of abundant parallelism
– Lots of threads, so focus on aggregate performance
– Parallelism in space:
• SIMD processing in each core
• Many independent SIMD cores across the chip
– Parallelism in time:
• Multiple SIMD groups in each core

CPU vs. GPU: Peak Performance

Processor Type   Product                      Throughput (GFLOPs)   Memory Bandwidth (GB/s)   Cost
CPU              Intel Xeon W5590 (Nehalem)   107                   32                        $1,700
GPU              AMD Radeon HD 5870           2,720                 154                       $450

• Note that these are peak numbers
• What we really care about is performance on real-world applications
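A note on where these peak throughput numbers come from (my own back-of-the-envelope accounting using the vendors' published core counts and clock rates, not figures from the talk): peak GFLOPs is roughly cores × clock × floating-point operations per cycle.

    Intel Xeon W5590:   4 cores × 3.33 GHz × 8 single-precision FLOPs/cycle   ≈ 107 GFLOPs
    AMD Radeon HD 5870: 1,600 stream processors × 0.85 GHz × 2 FLOPs/cycle    ≈ 2,720 GFLOPs

Both are theoretical ceilings that assume every execution unit does useful floating-point work every cycle, which real applications rarely achieve.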
General-Purpose Computing on GPUs
• Lots of recent interest in using GPUs to run non-graphics applications (GPGPU)
• Why GPUs? Why now?
– Recent increases in performance via parallelism
– Recent increases in programmability
– Ubiquity in multiple market segments
• Old approach: graphics languages
• New approach: GPGPU languages
– OpenCL, CUDA

CUDA
• Programming model for running general-purpose applications on NVIDIA GPUs
• Extension to the C programming language
• GPU is a co-processor:
– Main program runs on the CPU
– Large computations (kernels) are offloaded to the GPU
– CPU and GPU have separate memory, so data must be transferred back and forth

CUDA: Typical Program Structure
void function(…) {
    Allocate memory on the GPU
    Transfer input data to the GPU
    Launch kernel on the GPU
    Transfer output data to the CPU
}

__global__ void kernel(…) {
    Code executed on the GPU goes here…
}
(Diagram: the CPU with its memory on one side, the GPU with its memory on the other, and data moving between them)

CUDA: Typical Program Transformation
for (i = 0; i < N; i++) {
    Process array element i
}

Body of loop becomes body of kernel:

__global__ void kernel(…) {
    Determine this thread’s value of i
    Process array element i
}

CUDA Kernel
• Scalar program invoked across many threads
– Typically one thread per data element
• Overall computation decomposed into a grid of thread blocks
– Thread blocks are independent and cannot communicate (with some exceptions)
– Threads within the same block can communicate
(Diagram: a grid of five thread blocks)

Simple Example: Vector Addition
C = A + B
A:  1  2  3  4  5  6  7  8
B:  9 10 11 12 13 14 15 16
C: 10 12 14 16 18 20 22 24

C Code
float *CPU_add_vectors(float *A, float *B, int N) {
    // Allocate memory for the result
    float *C = (float *) malloc(N * sizeof(float));

    // Compute the sum
    for (int i = 0; i < N; i++) {
        C[i] = A[i] + B[i];
    }

    // Return the result
    return C;
}

CUDA Kernel
// GPU kernel that computes the vector sum C = A + B
// (each thread computes a single value of the result)
__global__ void kernel(float *A, float *B, float *C, int N) {
    // Determine which element this thread is computing
    int i = blockDim.x * blockIdx.x + threadIdx.x;

    // Compute a single element of the result vector
    if (i < N) {
        C[i] = A[i] + B[i];
    }
}

CUDA Host Code
float *GPU_add_vectors(float *A_CPU, float *B_CPU, int N) {
    // Allocate GPU memory for the inputs and the result
    int vector_size = N * sizeof(float);
    float *A_GPU, *B_GPU, *C_GPU;
    cudaMalloc((void **) &A_GPU, vector_size);
    cudaMalloc((void **) &B_GPU, vector_size);
    cudaMalloc((void **) &C_GPU, vector_size);

    // Transfer the input vectors to GPU memory
    cudaMemcpy(A_GPU, A_CPU, vector_size, cudaMemcpyHostToDevice);
    cudaMemcpy(B_GPU, B_CPU, vector_size, cudaMemcpyHostToDevice);

    // Execute the kernel to compute the vector sum on the GPU
    int num_blocks = ceil((double) N / (double) THREADS_PER_BLOCK);
    kernel <<< num_blocks, THREADS_PER_BLOCK >>> (A_GPU, B_GPU, C_GPU, N);

    // Transfer the result vector from the GPU to the CPU
    float *C_CPU = (float *) malloc(vector_size);
    cudaMemcpy(C_CPU, C_GPU, vector_size, cudaMemcpyDeviceToHost);

    return C_CPU;
}

Example Program Output
./vector_add 50,000,000
GPU:
    Transfer to GPU:   0.236 sec
    Kernel execution:  0.005 sec
    Transfer from GPU: 0.152 sec
    Total:             0.404 sec
CPU:                   0.136 sec

Execution: GPU outperformed CPU by 27.2x
Overall:   CPU outperformed GPU by 2.97x

Vector addition does not do enough work per memory operation to justify offloading it to the GPU!
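The slides show the timing output but not the driver that produced it; the following is a minimal sketch of such a driver, not the author's actual test harness. It assumes the CPU_add_vectors and GPU_add_vectors functions from the previous slides are compiled into the same program (with THREADS_PER_BLOCK defined alongside GPU_add_vectors) and uses POSIX gettimeofday for wall-clock timing; it only measures total GPU time, whereas the slide breaks that down into transfer and kernel components.

    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/time.h>

    // Functions shown on the previous slides
    float *CPU_add_vectors(float *A, float *B, int N);
    float *GPU_add_vectors(float *A_CPU, float *B_CPU, int N);

    // Current wall-clock time in seconds
    static double now(void) {
        struct timeval tv;
        gettimeofday(&tv, NULL);
        return tv.tv_sec + tv.tv_usec / 1.0e6;
    }

    int main(int argc, char **argv) {
        int N = (argc > 1) ? atoi(argv[1]) : 50000000;

        // Allocate and initialize the input vectors
        float *A = (float *) malloc(N * sizeof(float));
        float *B = (float *) malloc(N * sizeof(float));
        for (int i = 0; i < N; i++) { A[i] = (float) i; B[i] = 2.0f * i; }

        // Time the GPU version (the final cudaMemcpy inside GPU_add_vectors
        // blocks, so this includes all transfers and the kernel execution)
        double start = now();
        float *C_GPU = GPU_add_vectors(A, B, N);
        double gpu_time = now() - start;

        // Time the CPU version
        start = now();
        float *C_CPU = CPU_add_vectors(A, B, N);
        double cpu_time = now() - start;

        printf("GPU total: %.3f sec   CPU: %.3f sec\n", gpu_time, cpu_time);

        free(A); free(B); free(C_GPU); free(C_CPU);
        return 0;
    }

Even at 50 million elements, the arithmetic is so cheap relative to the hundreds of megabytes of transfers that the CPU wins overall, which is exactly the point the slide makes.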
Case Study: Leukocyte Tracking

Leukocyte Tracking
• Important for evaluating inflammatory drugs
• Velocity measured by tracking leukocytes through multiple frames
• Current approaches:
– Manual analysis: 1 minute of video in tens of hours
– MATLAB: 1 minute of video in 5 hours

Goal: Leverage CUDA and a GPU to accelerate leukocyte tracking to near real-time speeds

Acceleration
1. Translation: convert MATLAB code to C
2. Parallelization:
– OpenMP for the multi-core CPU
– CUDA for the GPU
• Experimental setup:
– CPU: 3.2 GHz quad-core Intel Core 2 Extreme X9770
– GPU: NVIDIA GeForce GTX 280 (PCIe 2.0)

Tracking Algorithm
• Inputs: video frame, location of cells in the previous frame
• Output: location of cells in the current frame
• For each cell:
– Extract sub-image near the cell’s old location
– Compute MGVF matrix over the sub-image (99.8% of the runtime)
– Evolve active contour using the MGVF matrix

Computing the MGVF Matrix
• Motion Gradient Vector Flow
• MGVF matrix is approximated via an iterative solution procedure
(Figure: sub-image near a cell and the resulting MGVF matrix)

MGVF Pseudo-code
MGVF = normalized sub-image gradient
do {
    Compute eight matrices based on current MGVF
    Compute Heaviside function across each matrix
    Update MGVF matrix
    Compute convergence criterion
} while (! converged)

Naïve CUDA Implementation
(Speedup over MATLAB: C 2.0x, C + OpenMP 7.7x, Naïve CUDA 0.8x)
• Kernel is called ~50,000 times per frame
• Amount of work per call is small
• Runtime dominated by CUDA overheads:
– Memory allocation, memory copying, kernel call overhead

Kernel Overhead
• Kernel calls are not cheap!
– Overhead of one kernel call: 9 microseconds
– Overhead of one CPU function call: 3 nanoseconds
– A kernel call is 3,000 times more expensive
• Heaviside kernel:
– 27% of kernel runtime due to computation
– 73% of kernel runtime due to kernel overhead

Lesson 1: Reduce Kernel Overhead
• Increase the amount of work per kernel call
– Decrease the total number of kernel calls
– Amortize the overhead of each kernel call across more computation
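To make Lesson 1 concrete, here is an illustrative sketch (not the leukocyte code) of the difference between launching several tiny element-wise kernels and fusing them into one; step1 through step3 simply stand in for small per-element operations such as the Heaviside evaluation.

    // Illustrative only: three tiny element-wise kernels, each launched separately.
    __global__ void step1(float *x, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) x[i] = 2.0f * x[i];
    }
    __global__ void step2(float *x, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) x[i] = x[i] + 1.0f;
    }
    __global__ void step3(float *x, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) x[i] = x[i] * x[i];
    }

    // Naive host code: pays the kernel call overhead three times per pass.
    void pass_naive(float *x_gpu, int n, int blocks, int threads) {
        step1<<<blocks, threads>>>(x_gpu, n);
        step2<<<blocks, threads>>>(x_gpu, n);
        step3<<<blocks, threads>>>(x_gpu, n);
    }

    // Fused kernel: the same work in a single launch, with the intermediate
    // values kept in registers instead of written back to global memory.
    __global__ void fused_steps(float *x, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) {
            float v = 2.0f * x[i];   // step 1
            v = v + 1.0f;            // step 2
            x[i] = v * v;            // step 3
        }
    }

    void pass_fused(float *x_gpu, int n, int blocks, int threads) {
        fused_steps<<<blocks, threads>>>(x_gpu, n);
    }

With ~50,000 launches per frame, cutting the number of launches matters as much as speeding up the kernels themselves.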
Larger Kernel Implementation
MGVF = normalized sub-image gradient
do {
    Compute eight matrices based on current MGVF
    Compute Heaviside function across each matrix
    Update MGVF matrix
    Compute convergence criterion
} while (! converged)

Larger Kernel Implementation
(Speedup over MATLAB: C 2.0x, C + OpenMP 7.7x, Naïve CUDA 0.8x, Larger Kernel 6.3x)
Percentage of runtime: memory allocation 71%, memory copying 15%, kernel execution 9%

Memory Allocation Overhead
(Figure: time per call in microseconds versus megabytes allocated per call, for malloc (CPU memory) and cudaMalloc (GPU memory), on log-log axes)

Lesson 2: Reduce Memory Management Overhead
• Reduce the number of memory allocations
– Allocate memory once and reuse it throughout the application
– If the memory size is not known a priori, estimate and only reallocate if the estimate is too small

Reduced Allocation Implementation
(Speedup over MATLAB: C 2.0x, C + OpenMP 7.7x, Naïve CUDA 0.8x, Larger Kernel 6.3x, Reduced Allocation 25.4x)
Percentage of runtime: memory allocation 3%, memory copying 56%, kernel execution 31%

Memory Transfer Overhead
(Figure: transfer time in milliseconds versus megabytes per transfer, for CPU-to-GPU and GPU-to-CPU transfers, on log-log axes)

Lesson 3: Reduce Memory Transfer Overhead
• If the CPU operates on values produced by the GPU:
– Move the operation to the GPU
– May improve performance even if the operation itself is slower on the GPU
(Figure: timeline showing that performing the operation on the GPU eliminates the memory transfers before and after it)

GPU Reduction Implementation
MGVF = normalized sub-image gradient
do {
    Compute eight matrices based on current MGVF
    Compute Heaviside function across each matrix
    Update MGVF matrix
    Compute convergence criterion
} while (! converged)

GPU Reduction Implementation
(Speedup over MATLAB: C 2.0x, C + OpenMP 7.7x, Naïve CUDA 0.8x, Larger Kernel 6.3x, Reduced Allocation 25.4x, GPU Reduction 60.7x)
Percentage of runtime: memory allocation 7%, memory copying 1%, kernel execution 80%
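The slides do not show the reduction kernel itself, so the following is only a sketch of the kind of kernel that could evaluate the convergence criterion on the GPU: each block reduces the absolute difference between successive MGVF iterates to a single partial sum, so that only one float per block (rather than the whole matrix) ever needs to be combined or copied. The kernel name, the block size, and the exact criterion are my assumptions.

    #define RED_THREADS 256   // launch with exactly RED_THREADS (a power of two) threads per block

    // Each block reduces a chunk of |new - old| into one partial sum. The
    // per-block results can be summed by a second, much smaller kernel or by the CPU.
    __global__ void convergence_reduce(const float *mgvf_new, const float *mgvf_old,
                                       float *block_sums, int n) {
        __shared__ float partial[RED_THREADS];

        int tid = threadIdx.x;
        int i = blockIdx.x * blockDim.x + threadIdx.x;

        // Load one element's contribution (zero-pad past the end of the matrix)
        partial[tid] = (i < n) ? fabsf(mgvf_new[i] - mgvf_old[i]) : 0.0f;
        __syncthreads();

        // Tree reduction in shared memory
        for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
            if (tid < stride) {
                partial[tid] += partial[tid + stride];
            }
            __syncthreads();
        }

        // Thread 0 writes this block's partial sum to global memory
        if (tid == 0) {
            block_sums[blockIdx.x] = partial[0];
        }
    }

Keeping this step on the GPU is what drops memory copying from 56% of the runtime to about 1% in the breakdown above.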
Persistent Thread Block
MGVF = normalized sub-image gradient
do {
    Compute eight matrices based on current MGVF
    Compute Heaviside function across each matrix
    Update MGVF matrix
    Compute convergence criterion
} while (! converged)

Persistent Thread Block
• Problem: need a global memory fence
– Multiple thread blocks compute the MGVF matrix
– Thread blocks cannot communicate with each other
– So each iteration requires a separate kernel call
• Solution: compute the entire matrix in one thread block
– An arbitrary number of iterations can be computed in a single kernel call

Persistent Thread Block: Example
(Figure: in the canonical CUDA approach there is a 1-to-1 mapping between thread blocks and elements of the MGVF matrix, so nine blocks cover a 3x3 matrix; with a persistent thread block, a single block covers the whole matrix)

Persistent Thread Block: Example
(Figure: for a single cell, the canonical CUDA approach spreads its thread blocks across SMs 1 through 9, while the persistent thread block approach keeps the cell's entire computation on one SM)
SM = Streaming Multiprocessor (GPU core)

Lesson 4: Avoid Global Memory Fences
• Confine dependent computations to a single thread block
– Execute an iterative algorithm until convergence in a single kernel call
– Only efficient if there are multiple independent computations

Persistent Thread Block Implementation
(Speedup over MATLAB: C 2.0x, C + OpenMP 7.7x, Naïve CUDA 0.8x, Larger Kernel 6.3x, Reduced Allocation 25.4x, GPU Reduction 60.7x, Persistent Thread Block 211.3x, about 27x over the OpenMP version)

Absolute Performance
• Frames per second (FPS): MATLAB 0.11, C 0.22, C + OpenMP 0.83, CUDA 21.6

Video Example

Conclusions
• CUDA overheads can be significant bottlenecks
• CUDA provides enormous performance improvements for leukocyte tracking
– 200x over MATLAB
– 27x over OpenMP
• Processing time reduced from > 4.5 hours to < 1.5 minutes
• Real-time analysis feasible in the near future

M. Boyer, D. Tarjan, S. T. Acton, and K. Skadron. "Accelerating Leukocyte Tracking using CUDA: A Case Study in Leveraging Manycore Coprocessors." In Proceedings of the International Parallel and Distributed Processing Symposium (IPDPS), May 2009.

Current Work: CPU-GPU Task Sharing

CPU-GPU Task Sharing
• The offloading decision is generally considered to be binary: a task runs either on the CPU or on the GPU

CPU-GPU Task Sharing
• The offload decision does not need to be binary!
– Dividing a task between the CPU and GPU can provide improved performance over either device alone

Theoretical Performance
(Figure: performance normalized to the best device without sharing, versus the ratio of GPU to CPU performance from 0.01 to 100, for the GPU alone, the CPU alone, CPU+GPU with equal sharing, and CPU+GPU with optimal sharing)
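The curves in the Theoretical Performance figure can be reproduced with a simple idealized model (my formulation, assuming perfectly divisible work and no sharing overhead). Let the CPU have throughput 1 and the GPU throughput r:

    Best single device:   performance = max(1, r)
    Equal sharing:        each device gets half the work, so the task finishes when
                          the slower device does: performance = 2 × min(1, r)
    Optimal sharing:      give the GPU the fraction r / (1 + r) of the work, so both
                          devices finish together: performance = 1 + r

    Normalized to the best single device:
      equal sharing:    2 × min(1, r) / max(1, r)
      optimal sharing:  (1 + r) / max(1, r)

Both sharing curves peak at 2x when r = 1 and fall back toward 1x as either device dominates, which is why task sharing is most attractive when the CPU and GPU are comparably fast.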
Research Goal
1. Given an input program written in CUDA or OpenCL, automatically generate a program that can execute on the CPU and GPU concurrently
2. Automatically determine the best division of work:
– When beneficial, share work between the CPU and GPU
– Otherwise, execute on the CPU or GPU exclusively
– The optimal decision can change at runtime:
• With different inputs
• With contention

Proposed System
(Diagram: OpenCL code → source-to-source translation framework → modified OpenCL code → OpenCL compiler → CPU/GPU binary)
• The translation framework transforms all GPU memory allocations, memory transfers, and kernel launches into a form supporting concurrent CPU-GPU execution

Potential Problems
• One version of the kernel for multiple devices
– Optimizations for the GPU may hurt performance on the CPU and vice versa
• Possible (but rare) for thread blocks to communicate with each other
– Do we try to support this?
• Statically predicting data access patterns can be hard (or even impossible for some applications)

Data Sharing
• If we cannot predict data access patterns statically, then the CPU and the GPU must have a consistent view of memory
(Figure: computation (1) and data transfer (2) between the CPU and GPU)

Data Sharing (2)
• If we can predict data access patterns statically, then we can minimize the data transfer overhead
(Figure: computation (1) and data transfer (2) between the CPU and GPU)

Preliminary Results (HotSpot)
(Figure: execution time in seconds (0 to 12) versus the percentage of the computation run on the GPU (0 to 100%), for static analysis, dynamic analysis, and no sharing)
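As a hand-written illustration of the kind of code such a framework might generate (the kernel, the names, and the fixed split fraction are my own simplifications, not the framework's actual output), the sketch below runs the first portion of an element-wise computation on the GPU and the rest on the CPU with OpenMP, transferring only the GPU's share of the data:

    #include <omp.h>
    #include <cuda_runtime.h>

    __global__ void scale_gpu(float *x, int n, float a) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) x[i] *= a;
    }

    // Process x[0..n) with a fraction of the work on the GPU and the rest on the CPU.
    void scale_shared(float *x, int n, float a, float gpu_fraction) {
        int n_gpu = (int)(n * gpu_fraction);   // elements handled by the GPU

        float *x_gpu = NULL;
        if (n_gpu > 0) {
            // Transfer only the GPU's portion of the data
            cudaMalloc((void **) &x_gpu, n_gpu * sizeof(float));
            cudaMemcpy(x_gpu, x, n_gpu * sizeof(float), cudaMemcpyHostToDevice);

            int threads = 256;
            int blocks = (n_gpu + threads - 1) / threads;
            scale_gpu<<<blocks, threads>>>(x_gpu, n_gpu, a);   // asynchronous launch
        }

        // Meanwhile, the CPU processes its portion using OpenMP
        #pragma omp parallel for
        for (int i = n_gpu; i < n; i++) {
            x[i] *= a;
        }

        if (n_gpu > 0) {
            // Copy the GPU's results back (this blocks until the kernel finishes)
            cudaMemcpy(x, x_gpu, n_gpu * sizeof(float), cudaMemcpyDeviceToHost);
            cudaFree(x_gpu);
        }
    }

Choosing gpu_fraction well is exactly the division-of-work problem described above, and the HotSpot results show how sensitive the total execution time is to that choice.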
Conclusions
• GPUs are designed to provide good performance on graphics workloads
– But they have evolved to support any workload with abundant parallelism
• GPUs can provide large performance improvements
– But we need to take the overheads involved into account to take full advantage
• Allowing the CPU and GPU to work together can provide an even larger performance improvement

Acknowledgements
• Funding provided by:
– NSF grant IIS-0612049
– SRC grant 1607.001
– NVIDIA research grant
– GRC AMD/Mahboob Kahn Ph.D. fellowship
• Equipment donated by NVIDIA

BACKUP

3D Rendering APIs
• High-level abstractions for rendering geometry
(Diagram: Graphics Application → Vertex Program → Rasterization → Fragment Program → Display; courtesy of D. Luebke, NVIDIA)

CUDA: Abstractions
1. Kernel function
– Mapped onto a grid of thread blocks
2. Scratchpad memory
– For sharing data within a thread block
3. Barrier synchronization
– For synchronizing within a thread block

Kernel Function
__global__ void kernel(int *in, int *out) {
    // Determine this thread’s index
    int i = threadIdx.x;

    // Add one to the input value
    out[i] = in[i] + 1;
}

Grid of Thread Blocks
• Grid: 2-dimensional, ≤ 4.3 billion blocks
• Thread block: 3-dimensional, ≤ 512 threads

Launching a Kernel
int num_threads = ...;
int threads_per_block = 256;

// Determine how many thread blocks are needed
// (using either of the two methods shown below)
int num_blocks = ceil((double) num_threads / (double) threads_per_block);
int num_blocks = (num_threads + threads_per_block - 1) / threads_per_block;

// Make structures for grid and thread block dimensions
dim3 grid(num_blocks, 1);
dim3 thread_block(threads_per_block, 1, 1);

// Launch the kernel
kernel <<< grid, thread_block >>> (in, out);

Scratchpad Memory
• Each multiprocessor has 16 KB of software-controlled shared memory
• Variables declared “__shared__” get mapped into this memory
• Values can only be shared among threads within the same thread block

Scratchpad Memory: Example
__global__ void kernel() {
    int i = threadIdx.x;

    // Compute some function
    int v = foo(i);

    // Write the value into shared memory
    __shared__ int values[THREADS_PER_BLOCK];
    values[i] = v;

    // Use the shared values
    ...
}

Barrier Synchronization
• __syncthreads() function
• Each thread waits for all other threads in the thread block
• All values written by every thread are now visible to all other threads

Barrier Synchronization: Example
__global__ void kernel(float *out, int N) {
    int i = threadIdx.x;
    __shared__ int values[THREADS_PER_BLOCK];
    values[i] = foo(i);

    // Wait to ensure all values have been written
    __syncthreads();

    // Compute the average of two values
    out[i] = (values[i] + values[(i + 1) % N]) / 2.0f;
}

CUDA Overheads
• Driver initialization: 0.14 seconds
• Kernel launch: 13 μs
• GPU memory allocation and deallocation: orders of magnitude slower than on the CPU
• Memory transfer: 15 μs + 1 ms/MB
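A quick back-of-the-envelope check (my arithmetic, combining these overheads with the earlier vector addition measurements):

    One 50,000,000-element vector = 50,000,000 × 4 bytes ≈ 191 MB
    Estimated transfer time ≈ 15 μs + 191 × 1 ms ≈ 0.19 sec per vector

That is the same order of magnitude as the measured transfers (0.236 sec for the two inputs, 0.152 sec for the result), which is why they swamp the 0.005 sec of kernel execution. The 13 μs launch cost, by contrast, is negligible for one large kernel but adds up quickly when a kernel is launched ~50,000 times per frame, as in the naïve leukocyte implementation.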
Acceleration using CUDA
(Diagram: the CPU-side program allocates GPU memory, transfers input data, launches the kernel, transfers the results back, and frees GPU memory, while the CUDA kernel runs on the GPU)
• Step 1: Determine which code to offload to the GPU as a CUDA kernel
• Step 2: Write the CPU-side CUDA code
• Step 3: Write and optimize the GPU kernel

Performance Issues
• Branch divergence
• Memory coalescing
• Key concept: warp
– Group of threads that execute concurrently
– In current hardware, the warp size is 32 threads

Branch Divergence
• Remember: the hardware is SIMD
• What if threads in the same warp follow two different paths?
• Solution: the entire warp executes both paths
– Unneeded values are simply ignored
– Performance can suffer with many divergent branches

Memory Coalescing
• Threads in the same half-warp access memory together
• If all threads access successive memory locations:
– All of the accesses are combined (coalesced)
– Result: significantly improved memory performance
• Otherwise:
– Each thread accesses memory separately
– Result: significantly reduced memory performance

Memory Coalescing: Examples
(Figure: coalesced accesses, where consecutive threads touch consecutive memory locations, versus a non-coalesced access pattern)

Parallelization Granularity
(Figure: the CPU and its memory alongside the GPU and its memory, illustrating the granularity at which work and data are split between them)

Kernel Overhead Revisited
• Overhead depends on the calling pattern:
– One at a time (synchronous): 9 microseconds
– Back-to-back (asynchronous): 3 microseconds
(Diagram: a synchronous sequence alternates kernel calls and memory transfers, with implicit synchronization at each transfer; an asynchronous sequence issues kernel calls back-to-back)

Lesson 1 Revisited: Reduce Kernel Overhead
• Increase the amount of work per kernel call
– Decrease the total number of kernel calls
– Amortize the overhead of each kernel call across more computation
• Launch kernels back-to-back
– Kernel calls are asynchronous: avoid explicit or implicit synchronization between kernel calls
– Overlap kernel execution on the GPU with driver access on the CPU
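As a small illustration of the last two bullets (a sketch with an invented per-step kernel, not code from the talk), the two loops below do the same work, but the first synchronizes after every launch while the second queues the launches back-to-back and waits only once:

    #include <cuda_runtime.h>

    // A trivial per-step kernel standing in for real work (illustrative only)
    __global__ void process(float *x, int n, int step) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) x[i] += (float) step;
    }

    void run_steps(float *x_gpu, int n, int num_steps) {
        int threads_per_block = 256;
        int num_blocks = (n + threads_per_block - 1) / threads_per_block;

        // Synchronous pattern: waiting after every launch exposes the full
        // per-launch overhead (about 9 us each in the measurements above)
        for (int step = 0; step < num_steps; step++) {
            process<<<num_blocks, threads_per_block>>>(x_gpu, n, step);
            cudaDeviceSynchronize();
        }

        // Asynchronous pattern: launches are queued back-to-back (about 3 us
        // each) and the CPU waits only once at the end
        for (int step = 0; step < num_steps; step++) {
            process<<<num_blocks, threads_per_block>>>(x_gpu, n, step);
        }
        cudaDeviceSynchronize();
    }

In the asynchronous version the CPU can issue the next launch while the previous kernel is still executing on the GPU, overlapping driver work on the CPU with kernel execution on the GPU.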