CUDA and GPU Training: Sessions 1 & 2
April 16 & 23, 2012
University of Georgia CUDA Teaching Center

UGA CUDA Teaching Center
UGA, through the efforts of Professor Thiab Taha, has been selected by NVIDIA as a 2011-2012 CUDA Teaching Center.
Presenters: Jennifer Rouan, Shan-ho Tsai, John Kerry
Visit us at http://cuda.uga.edu

Workshop Outline
Introduction to GPUs and CUDA
CUDA Programming Concepts
Georgia Advanced Computing Resource Center (GACRC)
"My First CUDA Program" – hands-on programming project

A Little Bit of GPU Background
Graphics processing unit (GPU) evolution has been driven by market demand for high-quality, real-time 3D graphics in computer applications, especially video games.
Microsoft's DirectX 10 API (2006) introduced a geometry shading stage, which demanded an increase in operation rate, particularly for floating-point operations.
Nvidia's GeForce 8800 GPU (2006) introduced unified processors, mapping three separate graphics stages (vertex shading, geometry processing, and pixel processing) onto a single array of processors.
Scientists recognized the raw performance potential of this hardware and developed general-purpose GPU computing (GPGPU).

Graphics Pipeline
Logical pipeline: vertex shading, geometry processing, pixel processing
Physical implementation: a loop through a unified array of processors
Load balancing makes maximum use of the hardware.
All design and optimization effort can be dedicated to one piece of hardware.

Compute Unified Device Architecture (CUDA)
Nvidia created CUDA to facilitate the development of parallel programs on GPUs (2007).
The CUDA language is ANSI C extended with a small number of keywords for labeling data-parallel functions (kernels) and their associated data.
Because Nvidia technology benefits from massive economies of scale in the gaming market, CUDA-enabled cards are very inexpensive for the performance they provide.

Hardware Summary: CPUs and GPUs
Central processing units (CPUs) are optimized to complete a large variety of sequential tasks very quickly.
Graphics processing units (GPUs) are optimized to do one thing: perform floating-point operations on a large amount of data at one time.
Compared to CPUs, GPUs dedicate very little chip area to memory in exchange for more computing cores on the chip.
[Figure: schematic comparison of CPU and GPU chip area devoted to memory versus computing cores]
Why Program Massively Parallel Processors?
Potential to bring to the mass market applications that have so far been considered supercomputing applications, or "superapplications" [Kirk 2010], such as biology research, image processing, and 3D imaging and visualization.
Many of today's medical imaging applications still run on microprocessor clusters and special-purpose hardware, and could achieve size and cost improvements on a GPU.
Market demand for even better user interfaces and still more realistic gaming is not going to go away.

Speed Tests on UGA Equipment
Equipment:
Barracuda: Nvidia GeForce GTX 480 GPU (480 cores)
Z-Cluster: Nvidia Tesla C2075 GPU (448 cores)
R-Cluster: Nvidia Tesla S1070 GPU (240 cores)
CPU only (serial): Intel dual-processor, quad-core Xeon CPU
Testing:
Multiply two square matrices of single-precision floating-point numbers, ranging from 16 x 16 to 8192 x 8192.
Time to move data from host to device and back is included in the GPU timing.
Five rounds of tests were conducted and the results averaged.

Small Problem Size
Depending on the hardware configuration, the overhead of copying the data may overwhelm the performance improvement of the GPU.
[Chart: time in milliseconds (ms) versus matrix size (N x N) for N = 16 to 256; series: Barracuda, Z-Cluster, R-Cluster, CPU Only]

Medium Problem Size
The GPU advantage becomes apparent as the matrix size increases.
[Chart: time in seconds (s) versus matrix size (N x N) for N = 256 to 4096; series: Barracuda, Z-Cluster, R-Cluster, CPU Only]

Large Problem Size
The GPUs can still finish in a matter of seconds a job that takes several hours on the CPU.
[Chart: time in hours (h) versus matrix size (N x N) for N = 4096 to 8192; series: Barracuda, Z-Cluster, R-Cluster, CPU Only]

Speedup Summary
Speed-up = CPU time / GPU time
[Chart: speed-up relative to the CPU-only run versus matrix size (N x N) for N = 16 to 8192; series: Barracuda, Z-Cluster, R-Cluster]

CUDA Computing System
A CUDA computing system consists of a host (CPU) and one or more devices (GPUs).
The portions of the program that can be evaluated in parallel are executed on the device.
The host handles the serial portions and the transfer of execution and data to and from the device.

CUDA Program Source Code
A CUDA program is a unified source code encompassing both host and device code. Convention: program_name.cu
NVIDIA's compiler (nvcc) separates the host and device code at compilation.
The host code is compiled by the host's standard C compiler.
The device code is further compiled by nvcc for execution on the GPU.
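To make the compilation model concrete, here is a minimal sketch (not from the original slides) of what a unified .cu source file can look like; the file name, kernel name, and compile command shown in the comments are illustrative assumptions.

    // example.cu -- compiled with something like:  nvcc -o example example.cu
    #include <stdio.h>

    // Device code: nvcc compiles this part for the GPU
    __global__ void do_nothing(void) { }

    // Host code: nvcc hands this part to the host's standard C compiler
    int main(void)
    {
        do_nothing<<<1, 1>>>();      // launch a grid of one block of one thread
        cudaDeviceSynchronize();     // wait for the kernel to finish
        printf("kernel finished\n");
        return 0;
    }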
CUDA Program Execution
Execution of a CUDA program begins on the host CPU.
When a kernel function (or simply "kernel") is launched, execution is transferred to the device and a massive "grid" of lightweight threads is spawned.
When all threads of a kernel have finished executing, the grid terminates and control of the program returns to the host until another kernel is launched.
[Figure: timeline of serial host code alternating with parallel kernel grids on the device]

CUDA Program Structure example

    int main(void)
    {
        float *a_h, *a_d;                 // pointers to host and device arrays
        const int N = 10;                 // number of elements in array
        size_t size = N * sizeof(float);  // size of array in memory

        // allocate memory on host and device for the array
        // initialize array on host (a_h)
        // copy array a_h to allocated device memory location (a_d)

        // kernel invocation code -- have the device perform
        // the parallel operations

        // copy a_d from the device memory back to a_h
        // free allocated memory on device and host
    }

Data Movement and Memory Management
In CUDA, host and device have separate memory spaces.
To execute a kernel, the program must allocate memory on the device and transfer data from the host to the device.
After kernel execution, the program needs to transfer the resultant data back to the host memory and free the device memory.
C functions: malloc(), free()
CUDA functions: cudaMalloc(), cudaMemcpy(), and cudaFree()

Data Movement example

    int main(void)
    {
        float *a_h, *a_d;
        const int N = 10;
        size_t size = N * sizeof(float);            // size of array in memory
        int i;

        a_h = (float *)malloc(size);                // allocate array on host
        cudaMalloc((void **) &a_d, size);           // allocate array on device
        for (i = 0; i < N; i++) a_h[i] = (float)i;  // initialize array
        cudaMemcpy(a_d, a_h, size, cudaMemcpyHostToDevice);

        // kernel invocation code

        cudaMemcpy(a_h, a_d, sizeof(float)*N, cudaMemcpyDeviceToHost);
        cudaFree(a_d); free(a_h);                   // free allocated memory
    }

Execution Parameters and Kernel Launch
A kernel is invoked by the host program with execution parameters surrounded by '<<<' and '>>>', as in:

    function_name <<< grid_dim, block_dim >>> (arg1, arg2);

At kernel launch, a "grid" is spawned on the device. A grid consists of a one- or two-dimensional array of "blocks". In turn, a block consists of a one-, two-, or three-dimensional array of "threads".
Grid and block dimensions are passed to the kernel function at invocation as execution parameters.

Execution Parameters and Kernel Launch
gridDim and blockDim are CUDA built-in variables of type dim3, essentially a C struct with three unsigned integer fields: x, y, and z.
Since a grid is at most two-dimensional, gridDim.z is ignored but should be set to 1 for clarity.

    dim3 grid_d(n_blocks, 1, 1);       // host code
    dim3 block_d(block_size, 1, 1);
    function_name <<< grid_d, block_d >>> (arg1, arg2);

For one-dimensional grids and blocks, scalar values can be used instead of the dim3 type.

Execution Parameters and Kernel Launch

    dim3 grid_dim(2, 2, 1);    // a 2 x 2 grid of blocks
    dim3 block_dim(4, 2, 2);   // each block is a 4 x 2 x 2 array of threads

Limits on gridDim and blockDim
The maximum size of a block (blockDim.x * blockDim.y * blockDim.z) is 512 threads, regardless of dimension.
You cannot increase the number of allowed threads by adding another dimension.
Since a block is limited to 512 threads, one block per grid will usually not be sufficient.
The values of gridDim.x and gridDim.y can range from 1 to 65,535.
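As a bridge between these multidimensional launch parameters and the one-dimensional indexing used in the examples that follow, here is a minimal sketch (not from the original slides) of how a kernel launched with a two-dimensional grid of three-dimensional blocks can compute a unique global thread ID; the kernel name is illustrative and the linearization order shown is one common convention.

    __global__ void show_global_id(int *out)
    {
        // linear index of this block within the 2-D grid
        int blockId = blockIdx.y * gridDim.x + blockIdx.x;

        // linear index of this thread within its 3-D block
        int threadInBlock = threadIdx.z * (blockDim.x * blockDim.y)
                          + threadIdx.y * blockDim.x
                          + threadIdx.x;

        // unique ID across the whole grid
        int id = blockId * (blockDim.x * blockDim.y * blockDim.z) + threadInBlock;

        out[id] = id;   // out is assumed to have one slot per thread
    }

With dim3 grid_dim(2, 2, 1) and dim3 block_dim(4, 2, 2) as above, this numbers all 4 x 16 = 64 threads from 0 to 63.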
Kernel Invocation example

    int main(void)
    {
        float *a_h, *a_d;
        const int N = 10;
        size_t size = N * sizeof(float);            // size of array in memory
        int i;

        a_h = (float *)malloc(size);                // allocate array on host
        cudaMalloc((void **) &a_d, size);           // allocate array on device
        for (i = 0; i < N; i++) a_h[i] = (float)i;  // initialize array
        cudaMemcpy(a_d, a_h, size, cudaMemcpyHostToDevice);

        int block_size = 4;                         // set up execution parameters
        int n_blocks = N/block_size + (N % block_size == 0 ? 0 : 1);
        square_array <<< n_blocks, block_size >>> (a_d, N);

        cudaMemcpy(a_h, a_d, sizeof(float)*N, cudaMemcpyDeviceToHost);
        cudaFree(a_d); free(a_h);                   // free allocated memory
    }

Kernel Functions
A kernel function specifies the code to be executed by all threads in parallel – an instance of single-program, multiple-data (SPMD) parallel programming.
A kernel function declaration is a C function extended with one of three keywords: __device__, __global__, or __host__.

    Declaration                        Executed on the:    Only callable from the:
    __device__ float DeviceFunc()      device              device
    __global__ void  KernelFunc()      device              host
    __host__   float HostFunc()        host                host

CUDA Thread Organization
Since all threads of a grid execute the same code, they rely on a two-level hierarchy of coordinates to distinguish themselves: blockIdx and threadIdx.
The example code fragment:

    ID = blockIdx.x * blockDim.x + threadIdx.x;

will yield a unique ID for every thread across all blocks of a grid.

Kernel Function and Threading example
CUDA kernel function:

    __global__ void square_array(float *a, int N)
    {
        int idx = blockIdx.x * blockDim.x + threadIdx.x;
        if (idx < N) a[idx] = a[idx] * a[idx];
    }

Compare with the serial C version:

    void square_array(float *a, int N)
    {
        int i;
        for (i = 0; i < N; i++) a[i] = a[i] * a[i];
    }

CUDA Device Memory Types
Global memory and constant memory can be accessed by both the host and the device. Constant memory serves read-only data to the device at high bandwidth. Global memory is read-write and has a longer latency.
Registers, local memory, and shared memory are accessible only to the device. Registers and local memory are available only to their own thread. Shared memory is accessible to all threads within the same block.

CUDA Memory Model
[Figure: CUDA memory model diagram showing the host and the device memory spaces described above]

CUDA Variable Type Qualifiers

    Variable Declaration                    Memory      Scope     Lifetime
    automatic variables (non-array)         Register    Thread    Kernel
    automatic array variables               Local       Thread    Kernel
    __device__ __shared__   int var1;       Shared      Block     Kernel
    __device__              int var2;       Global      Grid      Application
    __device__ __constant__ int var3;       Constant    Grid      Application

Synchronization
Threads in the same block can synchronize using __syncthreads().
When used in an if-then-else construction, all threads must branch onto the same path or they will wait on each other forever; the __syncthreads() in the if branch is distinct from the __syncthreads() in the else branch.
Threads in different blocks cannot perform barrier synchronization with each other. This is a major constraint, but it comes as part of a big scalability trade-off.

Transparent Scalability
CUDA's synchronization constraint allows blocks to be executed in any order relative to each other, providing transparent scalability.
The exact same code can be executed on devices with different execution resources; the execution time is roughly inversely proportional to the available resources.
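To illustrate __syncthreads() together with shared memory, here is a minimal sketch (not from the original slides) that reverses a small array within a single block; the kernel name and the fixed 256-thread block size are illustrative assumptions.

    __global__ void reverse_in_block(float *a, int n)
    {
        __shared__ float s[256];          // one slot per thread in the block
        int t = threadIdx.x;

        if (t < n) s[t] = a[t];           // stage the data in shared memory
        __syncthreads();                  // wait until every thread has written its slot
        if (t < n) a[t] = s[n - 1 - t];   // now safe to read another thread's slot
    }

Note that the call to __syncthreads() sits outside the if statements, so every thread in the block reaches the same barrier. Launch it as, for example, reverse_in_block<<<1, 256>>>(a_d, N) with N at most 256.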
CUDA Atomic Operations
Race condition review: a race condition occurs when two or more concurrently running threads access a shared data item and the result depends on the order of execution.
A deposit operation consists of three steps: load the balance, add the amount, store the balance.
Desired action: starting balance 100; deposit 10; deposit 200; final balance 310.
Possible problem: both threads load 100; one adds 10 and stores 110, the other adds 200 and stores 300; the final balance is 300 and one deposit is lost.
We use "atomics" to solve this problem.

CUDA Atomic Operations
Race conditions are exceptionally problematic in massively parallel programs, where thousands of threads access data simultaneously.
CUDA provides many atomic functions for integers, including atomicAdd(), atomicSub(), atomicExch(), atomicMin(), and atomicMax().
Atomic operations can create bottlenecks that collapse your parallel program into a serial program and significantly degrade performance.
Use them sparingly and use them wisely.

Example of using a CUDA atomic wisely to find a global maximum
Naïve approach – every thread contends for the single global location:

    __global__ void global_max(int* values, int* gl_max)
    {
        int i = threadIdx.x + blockDim.x * blockIdx.x;
        int val = values[i];
        atomicMax(gl_max, val);
    }

Better idea – keep per-region maxima so that only regional "winners" touch the global location:

    __global__ void global_max(int* values, int* gl_max,
                               int* regional_maxes, int num_regions)
    {
        // int i, val, and a region index as before
        if (atomicMax(&regional_maxes[region], val) < val) {
            atomicMax(gl_max, val);
        }
    }
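For completeness, here is a minimal, self-contained sketch (not from the original slides) of one way the "better idea" might be fleshed out; the choice of one region per block, wrapped modulo num_regions, is an illustrative assumption, and gl_max and regional_maxes are assumed to be initialized on the host to a very small value such as INT_MIN before the launch.

    __global__ void global_max(int *values, int *gl_max,
                               int *regional_maxes, int num_regions)
    {
        int i      = threadIdx.x + blockDim.x * blockIdx.x;   // launch is assumed to cover values exactly
        int val    = values[i];
        int region = blockIdx.x % num_regions;                // all threads in a block share a region

        // Most threads lose the regional race, so very few ever touch the
        // single global location and contention on gl_max stays low.
        if (atomicMax(&regional_maxes[region], val) < val) {
            atomicMax(gl_max, val);
        }
    }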
Georgia Advanced Computing Resource Center (GACRC)
GACRC resources
Requesting an account
Setting up the user environment (e.g., path variables)
Compiling a CUDA program using nvcc
Creating a submission shell script
Submitting a job to the queue

My First CUDA Program
squares.c and squares.cu start out as identical serial C programs.
Edit squares.cu with CUDA keywords to port it to a parallel program (leave squares.c clean to refer back to if necessary).
Compile with the makefile.
Create a submission shell script.
Submit the job to the queue.

More CUDA Training Resources
University of Georgia CUDA Teaching Center: http://cuda.uga.edu
Nvidia training and education site: http://developer.nvidia.com/cuda-education-training
Stanford University course on iTunes U: http://itunes.apple.com/us/itunesu/programming-massively-parallel/id384233322
University of Illinois: http://courses.engr.illinois.edu/ece498/al/Syllabus.html
University of California, Davis: https://smartsite.ucdavis.edu/xsl-portal/site/1707812c4009-4d91-a80e-271bde5c8fac/page/de40f2cc-40d9-4b0f-a2d3-e8518bd0266a
University of Wisconsin: http://sbel.wisc.edu/Courses/ME964/2011/me964Spring2011.pdf
University of North Carolina at Charlotte: http://coitweb.uncc.edu/~abw/ITCS6010S11/index.html

References
Kirk, D., & Hwu, W. (2010). Programming Massively Parallel Processors: A Hands-on Approach, 1–75.
Tarjan, D. (2010). Introduction to CUDA. Stanford University on iTunes U.
Atallah, M. J. (Ed.). (1998). Algorithms and Theory of Computation Handbook. Boca Raton, FL: CRC Press.
von Neumann, J. (1945). First Draft of a Report on the EDVAC. Contract No. W-670-ORD-4926, U.S. Army Ordnance Department and University of Pennsylvania.
Sutter, H., & Larus, J. (2005). Software and the concurrency revolution. ACM Queue, 3(7), 54–62.
Stratton, J. A., Stone, S. S., & Hwu, W. W. (2008). MCUDA: An efficient implementation of CUDA kernels for multi-core CPUs. Edmonton, Canada.
Vandenbout, D. (2008). My First CUDA Program. http://llpanorama.wordpress.com/2008/05/21/my-first-cuda-program/