Training Program on GPU Programming with CUDA
31st July, 7th Aug, 14th Aug 2011
CUDA Teaching Center @ UoM

Day 1, Session 2: CUDA Programming Model, CUDA Threads
Sanath Jayasena

Outline for Day 1 Session 2
• Data Parallelism
• CUDA Program Structure
• Memory Model & Data Transfer (Brief)
• Kernel Functions & Threading
  – (Discussion with Example: Matrix Multiplication)

Data Parallelism
• Data Parallelism
  – A problem/program property
  – Many arithmetic operations can be safely performed on the data structures simultaneously
  – Example: matrix multiplication (below)
• CUDA devices can exploit data parallelism to accelerate execution of applications

Example: Matrix Multiplication, P = M · N
[Figure: square matrices M, N and P, each of size Width x Width]
• Each element of P is computed as the dot product between a row of M and a column of N
• All elements of P can be computed independently and simultaneously

CUDA Program Structure
• A CUDA program consists of one or more phases executed on either the host (CPU) or a device (GPU), supplied as a single source code
• Little or no data parallelism → host code
  – ANSI C, compiled with a standard compiler
• Significant data parallelism → device code
  – Extended ANSI C to specify kernels and data structures
• The NVIDIA C compiler (nvcc) separates the two during compilation

Execution of a CUDA Program
[Figure: execution alternating between serial host code and two kernel-launched grids of device threads]
• Execution starts with the host (CPU)
• When a kernel is invoked, execution moves to the device (GPU)
  – A large number of threads are generated
  – Grid: the collection of all threads generated by a kernel
  – (The figure above shows two grids of threads)
• Once all threads in a grid complete execution, the grid terminates and execution continues on the host

Example: Matrix Multiplication
A simple CUDA host code skeleton for matrix multiplication:

    int main(void) {
        // 1. Allocate and initialize matrices M, N, P
        //    I/O to read the input matrices M and N
        ...
        // 2. M * N on the device
        MatrixMulOnDevice(M, N, P, Width);
        // 3. I/O to write the output matrix P
        //    Free matrices M, N, P
        ...
        return 0;
    }

CUDA Device Memory Model
• Host and devices have separate memory spaces
  – E.g., hardware cards with their own DRAM
• To execute a kernel on a device
  – Need to allocate memory on the device
  – Transfer data: host memory → device memory
• After device execution
  – Transfer results: device memory → host memory
  – Free device memory that is no longer needed
[Figure: host memory and device memory shown as separate spaces]

CUDA API: Memory Management
[Figure: the device memory allocation/deallocation API, cudaMalloc and cudaFree]
• Example

    float *Md;
    int size = Width * Width * sizeof(float);

    cudaMalloc((void**)&Md, size);   // allocate size bytes of device memory
    ...
    cudaFree(Md);                    // release the device memory

CUDA API: Data Transfer
[Figure: the host-device data transfer API, cudaMemcpy]

Example: Matrix Multiplication
[Figure: the original slide showed the host-side MatrixMulOnDevice() function; a reconstruction sketch follows]
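The code image on the slide above did not survive extraction. What follows is a minimal sketch of what MatrixMulOnDevice() would look like, assembled from the memory-management and data-transfer calls just shown; the exact body is an assumption, not the deck's original code. The names Md, Nd, Pd and the single-block launch follow the surrounding slides, and MatrixMulKernel is the kernel reconstructed later in this session.

    void MatrixMulOnDevice(float* M, float* N, float* P, int Width)
    {
        int size = Width * Width * sizeof(float);
        float *Md, *Nd, *Pd;

        // Allocate device memory and copy the input matrices to the device
        cudaMalloc((void**)&Md, size);
        cudaMemcpy(Md, M, size, cudaMemcpyHostToDevice);
        cudaMalloc((void**)&Nd, size);
        cudaMemcpy(Nd, N, size, cudaMemcpyHostToDevice);

        // Allocate device memory for the result
        cudaMalloc((void**)&Pd, size);

        // Launch the kernel: a single block of Width x Width threads,
        // matching the execution configuration shown later in the deck
        dim3 dimBlock(Width, Width, 1);
        dim3 dimGrid(1, 1, 1);
        MatrixMulKernel<<<dimGrid, dimBlock>>>(Md, Nd, Pd, Width);

        // Copy the result back to the host and free device memory
        cudaMemcpy(P, Pd, size, cudaMemcpyDeviceToHost);
        cudaFree(Md); cudaFree(Nd); cudaFree(Pd);
    }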
Kernel Functions & Threading
• A kernel function specifies the code to be executed by all threads of a parallel phase
  – All threads of a parallel phase execute the same code → single-program multiple-data (SPMD), a popular programming style for parallel computing
• Need a mechanism to
  – Allow threads to distinguish themselves
  – Direct each thread to the specific part of the data it is supposed to work on

Kernel Functions & Threading (cont.)
• The keywords threadIdx.x and threadIdx.y
  – The thread indices of a thread
  – Allow a thread to identify itself at runtime (by accessing hardware registers associated with it)
• A thread can be referred to as Thread(threadIdx.x, threadIdx.y)
• Thread indices reflect a multi-dimensional organization of threads

Example: Matrix Multiplication Kernel
[Figure: the original slide showed the MatrixMulKernel() device code; a reconstruction sketch follows. See the section after it for more details on accessing the data relevant to each thread.]
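As with the host function, the kernel image was lost in extraction. Below is a minimal sketch of a single-block matrix multiplication kernel consistent with the deck's description (each thread uses its threadIdx to compute one element of Pd); treat it as an assumption about the original slide, not the slide itself.

    __global__ void MatrixMulKernel(float* Md, float* Nd, float* Pd, int Width)
    {
        // 2-D thread index within the (single) block
        int tx = threadIdx.x;
        int ty = threadIdx.y;

        // Each thread computes one element of Pd: the dot product of
        // row ty of Md and column tx of Nd
        float Pvalue = 0;
        for (int k = 0; k < Width; ++k)
            Pvalue += Md[ty * Width + k] * Nd[k * Width + tx];

        // Matrices are stored as 1-D arrays in row-major order
        Pd[ty * Width + tx] = Pvalue;
    }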
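Looking ahead to the grid/block discussion below: the deck notes that a single block caps the kernel at 1024 threads, and the closing exercise asks for runs with different grid/block organizations. As a forward-looking sketch that is not from the slides, one common way to combine blockIdx and threadIdx so the kernel scales beyond one block is:

    __global__ void MatrixMulKernelMultiBlock(float* Md, float* Nd, float* Pd, int Width)
    {
        // Global row/column indices, combining block and thread coordinates
        int Row = blockIdx.y * blockDim.y + threadIdx.y;
        int Col = blockIdx.x * blockDim.x + threadIdx.x;

        // Guard against threads outside the matrix when Width is not
        // a multiple of the block dimensions
        if (Row < Width && Col < Width) {
            float Pvalue = 0;
            for (int k = 0; k < Width; ++k)
                Pvalue += Md[Row * Width + k] * Nd[k * Width + Col];
            Pd[Row * Width + Col] = Pvalue;
        }
    }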
Thread Indices & Accessing Data Relevant to a Thread
[Figure: Md, Nd and Pd with one thread's row and column highlighted, and Pd laid out in memory as a 1-D row-major array: element (ty, tx) is at offset ty * Width + tx]
• Each thread uses tx and ty to identify the relevant row of Md, the column of Nd, and the element of Pd in the for loop
• E.g., Thread(2,3) will perform the dot product between row 2 of Md and column 3 of Nd and write the result into element (2,3) of Pd

Threading & Grids
• When a kernel is invoked/launched, it is executed as a grid of parallel threads
• A CUDA thread grid can have millions of lightweight GPU threads per kernel invocation
  – To fully utilize the hardware, enough threads are required → large data parallelism is required
• Threads in a grid have a two-level hierarchy
  – A grid consists of 1 or more thread blocks
  – All blocks in a grid have the same number of threads

CUDA Thread Organization
[Figure: a grid organized as a 2-D array of thread blocks, each block a multi-dimensional array of threads]

Threading with Grids & Blocks
• Each thread block has a unique 2-D coordinate given by the CUDA keywords blockIdx.x and blockIdx.y
  – All blocks must have the same structure and number of threads
• Each block has a 3-D array of threads, up to a maximum of 1024 threads in total
  – Coordinates of threads in a block are defined by the indices threadIdx.x, threadIdx.y, threadIdx.z
  – (Not all applications will use all 3 dimensions)

Our Example: Matrix Multiplication
• The kernel shown earlier (Example: Matrix Multiplication Kernel)
  – Can only use one thread block
  – The block is organized as a 2-D array of threads
• The code can compute a product matrix Pd of only up to 1024 elements
  – As a block can have a maximum of 1024 threads
  – Each thread computes one element of Pd
  – Is this sufficient / acceptable?

Our Example: Matrix Multiplication (cont.)
• When the host code invokes the kernel, it sets the grid and block dimensions by passing them as parameters
• Example

    // Set up the execution configuration
    dim3 dimBlock(16, 16, 1);   // Width = 16, as an example
    dim3 dimGrid(1, 1, 1);      // the last 1 is ignored

    // Launch the device computation threads!
    MatrixMulKernel<<<dimGrid, dimBlock>>>(Md, Nd, Pd, 16);

Here is an Exercise…
• Implement Matrix Multiplication
  – Execute it with different matrix dimensions using (a) CPU only, (b) GPUs, and (c) GPUs with different grid/block organizations
• Fill in a table like the following:

    Dimensions (M, N)            CPU time (s)   GPU time (s)   Speedup
    [400,800], [400,400]
    [800,1600], [800,800]
    ...
    [2400,4800], [2400,4800]

Conclusion
• We discussed the CUDA Programming Model and CUDA Thread Basics
  – Data Parallelism
  – CUDA Program Structure
  – Memory Model & Data Transfer (briefly)
  – Kernel Functions & Threading
  – (Discussion with Example: Matrix Multiplication)

References for this Session
• Chapter 2 of: D. Kirk and W. Hwu, Programming Massively Parallel Processors, Morgan Kaufmann, 2010
• Chapters 4-5 of: E. Kandrot and J. Sanders, CUDA by Example, Addison-Wesley, 2010
• Chapter 2 of: NVIDIA CUDA C Programming Guide, v3.2/4.0, NVIDIA Corp., 2010-2011