CUDA Programming

Floating Point Operations for the CPU and the GPU

Memory Bandwidth for the CPU and the GPU

GPU Devotes More Transistors to Data Processing

Hillis’ Thesis ’85 (back to the future!)
(Figure labels: piece of silicon, sequential computer, parallel computer)
• Proposed “The Connection Machine”, with a massive number of processors, each with a small memory, operating in SIMD mode.
• The CM-1 and CM-2 machines from Thinking Machines Corporation (TMC) were examples of this architecture, with 32K-128K processors.

CUDA Supports Various Languages or Application Programming Interfaces

Automatic Scalability
A multithreaded program is partitioned into blocks of threads that execute independently from each other, so that a GPU with more cores will automatically execute the program in less time than a GPU with fewer cores.
• NVIDIA GPUs have a number of multiprocessors, each of which executes in parallel with the others.
• On Tesla, each multiprocessor has a group of 8 stream processors; a Fermi multiprocessor has two groups of 16 stream processors.
• A core refers to a stream processor. The high-end Tesla accelerators have 30 multiprocessors, for a total of 240 cores.
• A high-end Fermi has 16 multiprocessors, for a total of 512 cores.
• Each core can execute a sequential thread, but the cores execute in what NVIDIA calls SIMT (Single Instruction, Multiple Thread) fashion: all cores in the same group execute the same instruction at the same time, much like classical SIMD processors.
• SIMT handles conditionals somewhat differently than SIMD, though the effect is much the same: some cores are disabled for conditional operations.

Compute Capability
• The compute capability of a device is defined by a major revision number and a minor revision number.
• Devices with the same major revision number are of the same core architecture.
• The major revision number of devices based on the Fermi architecture is 2.
• Prior devices are all of compute capability 1.x (their major revision number is 1).
• The minor revision number corresponds to an incremental improvement to the core architecture, possibly including new features.

CUDA-Enabled Devices with Compute Capability, Number of Multiprocessors, and Number of CUDA Cores

Features and Technical Specifications

Grid of Thread Blocks

Memory Hierarchy

Programming Model
• Heterogeneous programming
• Serial code executes on the host while parallel code executes on the device.

CUDA Thread Organization
• A thread block can have up to 512 threads on devices of compute capability 1.x (up to 1024 on 2.x devices).

Kernels
• CUDA C extends C by allowing the programmer to define C functions, called kernels, that, when called, are executed N times in parallel by N different CUDA threads, as opposed to only once like regular C functions.
• A kernel is defined using the __global__ declaration specifier. The number of CUDA threads that execute the kernel for a given call is specified using the new <<<…>>> execution configuration syntax.
• Each thread that executes the kernel is given a unique thread ID that is accessible within the kernel through the built-in threadIdx variable.

Vector Addition Example
    // Kernel definition
    __global__ void VecAdd(float* A, float* B, float* C)
    {
        int i = threadIdx.x;   // one thread per vector element
        C[i] = A[i] + B[i];
    }

    int main()
    {
        ...
        // Kernel invocation with N threads
        VecAdd<<<1, N>>>(A, B, C);
    }

Thread Hierarchy
• threadIdx is a 3-component vector, so that threads can be identified using a one-dimensional, two-dimensional, or three-dimensional thread index, forming a one-dimensional, two-dimensional, or three-dimensional thread block.
• For a one-dimensional block of size Dx, the thread ID of a thread of index (x) is x.
• For a two-dimensional block of size (Dx, Dy), the thread ID of a thread of index (x, y) is (x + y * Dx).
• For a three-dimensional block of size (Dx, Dy, Dz), the thread ID of a thread of index (x, y, z) is (x + y * Dx + z * Dx * Dy).

Matrix Addition Example Using 1 Block
    // Kernel definition
    __global__ void MatAdd(float A[N][N], float B[N][N], float C[N][N])
    {
        int i = threadIdx.x;
        int j = threadIdx.y;
        C[i][j] = A[i][j] + B[i][j];
    }

    int main()
    {
        ...
        // Kernel invocation with one block of N * N * 1 threads
        int numBlocks = 1;
        dim3 threadsPerBlock(N, N);
        MatAdd<<<numBlocks, threadsPerBlock>>>(A, B, C);
    }

Matrix Addition Example Using Multiple Blocks
    // Kernel definition
    __global__ void MatAdd(float A[N][N], float B[N][N], float C[N][N])
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        int j = blockIdx.y * blockDim.y + threadIdx.y;
        if (i < N && j < N)
            C[i][j] = A[i][j] + B[i][j];
    }

    int main()
    {
        ...
        // Kernel invocation (N is assumed to be a multiple of 16)
        dim3 threadsPerBlock(16, 16);
        dim3 numBlocks(N / threadsPerBlock.x, N / threadsPerBlock.y);
        MatAdd<<<numBlocks, threadsPerBlock>>>(A, B, C);
    }

Overview of the CUDA Device Memory Model

CUDA API Functions for Device Global Memory Management

CUDA API Functions for Data Transfer Between Memories

Matrix-Matrix Multiplication Example
    void MatrixMulOnDevice(float* M, float* N, float* P, int Width)
    {
        int size = Width * Width * sizeof(float);
        float *Md, *Nd, *Pd;   // device copies of M, N and P

        // 1. Load M and N to device memory
        cudaMalloc((void**)&Md, size);
        cudaMemcpy(Md, M, size, cudaMemcpyHostToDevice);
        cudaMalloc((void**)&Nd, size);
        cudaMemcpy(Nd, N, size, cudaMemcpyHostToDevice);

        // Allocate P on the device
        cudaMalloc((void**)&Pd, size);

        // 2. Kernel invocation code (to be shown later)
        …

        // 3. Read P from the device
        cudaMemcpy(P, Pd, size, cudaMemcpyDeviceToHost);

        // Free device matrices
        cudaFree(Md);
        cudaFree(Nd);
        cudaFree(Pd);
    }

Kernel Function
    // Matrix multiplication kernel: thread specification
    __global__ void MatrixMulKernel(float* Md, float* Nd, float* Pd, int Width)
    {
        // 2D thread ID
        int tx = threadIdx.x;
        int ty = threadIdx.y;

        // Pvalue stores the Pd element that is computed by the thread
        float Pvalue = 0;
        for (int k = 0; k < Width; ++k) {
            float Mdelement = Md[ty * Width + k];
            float Ndelement = Nd[k * Width + tx];
            Pvalue += Mdelement * Ndelement;
        }

        // Write the result to device memory; each thread writes one element
        Pd[ty * Width + tx] = Pvalue;
    }

Host Code that Launches a Kernel
    // Set up the execution configuration
    dim3 dimBlock(WIDTH, WIDTH);
    dim3 dimGrid(1, 1);

    // Launch the device computation threads
    MatrixMulKernel<<<dimGrid, dimBlock>>>(Md, Nd, Pd, WIDTH);

Block / Grid Definition Examples
    dim3 dimBlock(4, 2, 2);
    dim3 dimGrid(2, 2, 1);
    KernelFunction<<<dimGrid, dimBlock>>>(…);

    dim3 dimBlock(16, 16, 1);
    dim3 dimGrid(100, 1, 1);
    KernelFunction<<<dimGrid, dimBlock>>>(…);
• Note: the dimensions can be given as the contents of variables; they do not need to be compile-time constants.

Matrix-Matrix Multiplication on Multiple Blocks
A simple example of using multiple blocks to calculate Pd:
    __global__ void MatrixMulKernel(float* Md, float* Nd, float* Pd, int Width)
    {
        // Calculate the row index of the Pd element and M
        int Row = blockIdx.y * TILE_WIDTH + threadIdx.y;
        // Calculate the column index of Pd and N
        int Col = blockIdx.x * TILE_WIDTH + threadIdx.x;

        float Pvalue = 0;
        // Each thread computes one element of the block sub-matrix
        for (int k = 0; k < Width; ++k)
            Pvalue += Md[Row * Width + k] * Nd[k * Width + Col];

        Pd[Row * Width + Col] = Pvalue;
    }

CUDA Memories

CUDA Variable Type Qualifiers

Small Matrix Multiplication Example
• Assume 2x2 blocks.
• Observation: thread(0,0) and thread(1,0) both access row 0 of Md.
• Both threads access these Md elements from the global memory.
• If thread(0,0) and thread(1,0) collaborate so that these Md elements are loaded from global memory only once, the total number of accesses to the global memory is reduced by half.
• In general, every Md and Nd element is accessed exactly twice during the execution of block(0,0). Therefore, if all four threads collaborate in their accesses to global memory, the traffic to the global memory is reduced by half.

Global memory accesses performed by threads in block(0,0)
• The potential reduction of global memory traffic in matrix multiplication is proportional to the dimension of the blocks used.
• With NxN blocks, the potential reduction of global memory traffic is a factor of N.

Tiling Md and Nd to utilize shared memory
• Have the threads collaboratively load Md and Nd elements into the shared memory before they individually use these elements in their dot product calculation.
• The shared memory is quite small, so one must be careful not to exceed its capacity when loading these Md and Nd elements.
• This can be accomplished by dividing the Md and Nd matrices into smaller tiles.
• The size of these tiles is chosen so that they can fit into the shared memory.
• For simplicity, the tile dimensions can be chosen to be equal to the block dimensions.

Calculation of the matrix indices in tiled multiplication

Execution phases of a tiled matrix multiplication algorithm (block size=2)

Tiled Matrix Multiplication Kernel using shared memories
    __global__ void MatrixMulKernel(float* Md, float* Nd, float* Pd, int Width)
    {
        __shared__ float Mds[TILE_WIDTH][TILE_WIDTH];
        __shared__ float Nds[TILE_WIDTH][TILE_WIDTH];

        int bx = blockIdx.x;  int by = blockIdx.y;
        int tx = threadIdx.x; int ty = threadIdx.y;

        // Identify the row and column of the Pd element to work on
        int Row = by * TILE_WIDTH + ty;
        int Col = bx * TILE_WIDTH + tx;

        float Pvalue = 0;
        // Loop over the Md and Nd tiles required to compute the Pd element
        // (Width is assumed to be a multiple of TILE_WIDTH)
        for (int m = 0; m < Width / TILE_WIDTH; ++m) {
            // Collaborative loading of Md and Nd tiles into shared memory
            Mds[ty][tx] = Md[Row * Width + (m * TILE_WIDTH + tx)];
            Nds[ty][tx] = Nd[(m * TILE_WIDTH + ty) * Width + Col];
            __syncthreads();   // wait until the whole tile is loaded

            for (int k = 0; k < TILE_WIDTH; ++k)
                Pvalue += Mds[ty][k] * Nds[k][tx];
            __syncthreads();   // wait before the tile is overwritten
        }
        Pd[Row * Width + Col] = Pvalue;
    }

Thread Assignment in GeForce-8 Series GPU Devices
• In the GeForce-8 series hardware, the execution resources are organized into Streaming Multiprocessors (SMs). For example, the GeForce 8800GTX implementation has 16 SMs.
• Up to 8 blocks can be assigned to each SM as long as there are enough resources to satisfy the needs of all the blocks. With 16 SMs in a GeForce 8800GTX processor, up to 128 blocks can be simultaneously assigned to SMs.
• Most grids contain many more than 128 blocks.
• The run-time system maintains a list of blocks that need to execute and assigns new blocks to SMs as they complete the execution of blocks previously assigned to them.
• In the GeForce 8800GTX design, up to 768 threads can be assigned to each SM.
• This could be in the form of 3 blocks of 256 threads each, 6 blocks of 128 threads each, etc.
• With 16 SMs in the GeForce 8800GTX, there can be up to 12,288 threads simultaneously residing in SMs for execution.

Thread Scheduling
• Thread scheduling is implementation specific.
• In the GeForce 8800GTX, once a block is assigned to an SM, it is further divided into 32-thread units called warps.
• Warps are the unit of thread scheduling in SMs.
• Each warp consists of 32 threads of consecutive threadIdx values: threads 0 through 31 form the first warp, threads 32 through 63 the second warp, and so on.
• Example: if each block has 256 threads, each block has 256/32 = 8 warps. With three blocks in each SM, there are 8*3 = 24 warps in each SM. This is in fact the maximum number of warps that can reside in each SM in the GeForce 8800GTX, since there can be no more than 768 threads in each SM, which amounts to 768/32 = 24 warps.
• SMs are designed such that only one of these warps is actually executed by the hardware at any point in time.

Warp Based Thread Scheduling
• Why do we need so many warps if only one of them can execute at any point in time?
• This is how these processors efficiently execute long-latency operations such as accesses to the global memory.
• When an instruction executed by the threads in a warp needs to wait for the result of a previously initiated long-latency operation, the warp is placed into a waiting area.
• One of the other resident warps that is no longer waiting for results is selected for execution.
• If more than one warp is ready for execution, a priority mechanism is used to select one for execution.
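The warp arithmetic in the example above can be checked with a few lines of host-side C. This is only a sketch under stated assumptions: the limits are the GeForce 8800GTX figures quoted in the text (32 threads per warp, at most 768 threads and 8 blocks per SM), the function names are hypothetical helpers rather than CUDA API calls, and register and shared memory limits are ignored.

```c
#include <assert.h>

/* Limits quoted in the text for the GeForce 8800GTX (assumed here). */
enum { WARP_SIZE = 32, MAX_THREADS_PER_SM = 768, MAX_BLOCKS_PER_SM = 8 };

/* Warps needed by one block; a partially filled warp still counts as a warp. */
int warps_per_block(int threads_per_block)
{
    assert(threads_per_block > 0);
    return (threads_per_block + WARP_SIZE - 1) / WARP_SIZE;
}

/* Blocks that can reside on one SM, considering only the thread and block limits. */
int resident_blocks_per_sm(int threads_per_block)
{
    assert(threads_per_block > 0);
    int by_threads = MAX_THREADS_PER_SM / threads_per_block;
    return by_threads < MAX_BLOCKS_PER_SM ? by_threads : MAX_BLOCKS_PER_SM;
}

/* Warps resident on one SM for a given block size. */
int resident_warps_per_sm(int threads_per_block)
{
    return resident_blocks_per_sm(threads_per_block) * warps_per_block(threads_per_block);
}
```

With blocks of 256 or 128 threads the sketch gives 24 resident warps per SM, matching the text. With blocks of 64 threads the 8-block limit is reached first, leaving only 16 resident warps, which is one reason very small blocks can leave an SM underused.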
Other Models
    Compute Capability                         1.0     1.1     1.2     1.3
    Threads / Warp                             32      32      32      32
    Warps / Multiprocessor                     24      24      32      32
    Threads / Multiprocessor                   768     768     1024    1024
    Thread Blocks / Multiprocessor             8       8       8       8
    Shared Memory / Multiprocessor (bytes)     16384   16384   16384   16384
    Register File Size (32-bit registers)      8192    8192    16384   16384

Divergence in Execution
• At any point in time, the hardware selects and executes one warp.
• An instruction is run for all threads in the same warp before moving to the next instruction.
• This style of execution is motivated by hardware cost constraints: it allows the cost of fetching and processing an instruction to be amortized among a large number of threads.
• It works well when all threads within a warp follow the same control flow path when working on their data.
• For an if-then-else construct, the execution works well when either all threads execute the then part or all execute the else part.
• When threads within a warp take different control flow paths, that is, when some threads execute the then part and others execute the else part, the simple execution style no longer works well. In such a situation, the execution of the warp requires multiple passes through these divergent paths.
• One pass is needed for the threads that follow the then part and another pass for those that follow the else part. These passes are sequential to each other and thus add to the execution time.
• When threads in the same warp follow different paths of control flow, we say that these threads diverge in their execution.
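The two-pass behaviour described above can be illustrated with a small host-side simulation in C. This is a simplified model, not a description of the actual hardware: the function run_warp_if_else and the even/odd branch condition are made up for illustration, and one bit of a 32-bit mask stands for each thread (lane) of a warp.

```c
#include <assert.h>
#include <stdint.h>

enum { WARP_SIZE = 32 };

/* Simulates one warp executing
 *     if (x[lane] % 2 == 0) x[lane] *= 2; else x[lane] += 1;
 * Lanes that do not take the current path are masked off.
 * Returns the number of sequential passes the warp needed (1 or 2). */
int run_warp_if_else(int x[WARP_SIZE])
{
    uint32_t then_mask = 0;
    for (int lane = 0; lane < WARP_SIZE; ++lane)
        if (x[lane] % 2 == 0)
            then_mask |= (uint32_t)1 << lane;
    uint32_t else_mask = ~then_mask;

    int passes = 0;
    if (then_mask != 0) {            /* pass for the lanes taking the then part */
        ++passes;
        for (int lane = 0; lane < WARP_SIZE; ++lane)
            if (then_mask & ((uint32_t)1 << lane))
                x[lane] *= 2;
    }
    if (else_mask != 0) {            /* pass for the lanes taking the else part */
        ++passes;
        for (int lane = 0; lane < WARP_SIZE; ++lane)
            if (else_mask & ((uint32_t)1 << lane))
                x[lane] += 1;
    }
    return passes;
}
```

When every lane holds an even value the model needs a single pass; as soon as even and odd values are mixed within the warp it needs two passes, one after the other.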
All Reduce (total on all threads)
    // Butterfly all-reduce: after the loop every A[i] holds the total.
    // Assumes a single block of N threads, with N a power of two.
    __global__ void sumall(int* A, unsigned int N)
    {
        for (N = N >> 1; N; N = N >> 1) {
            // Add own value and the value of the partner whose index differs in bit N
            int temp = A[threadIdx.x] + A[threadIdx.x ^ N];
            __syncthreads();   // all reads finish before any write
            A[threadIdx.x] = temp;
            __syncthreads();   // all writes finish before the next round
        }
    }

Memory Coalescing
• If an application can make use of data from multiple, consecutive locations before moving on to other locations, the DRAMs can supply the data at a much higher rate than if a truly random sequence of locations were accessed.
• To achieve anywhere close to the advertised 86.4 GB/s global memory bandwidth of the G80, a kernel must arrange its data accesses so that each request to the DRAMs is for a large number of consecutive DRAM locations.
• NVIDIA hardware detects whether the threads access consecutive global memory locations.
• When they do, the hardware combines, or coalesces, all these accesses into a consolidated access to the DRAMs that requests all consecutive locations involved.

Matrix Layout

Memory Coalescing
Global memory accesses will be coalesced into a single access (1 transaction) if:
– The size of the memory element accessed by each thread is either 4, 8, or 16 bytes
– The elements form a contiguous block of memory
– The Nth element is accessed by the Nth thread in the half-warp
– The address of the first element is aligned to 16 times the element’s size
A half-warp of 16 threads can coordinate global memory accesses into a single transaction.

Examples
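The four coalescing conditions listed above can be written down as a host-side check in C. This is a sketch for reasoning about access patterns, not something the CUDA runtime provides: is_coalesced is a hypothetical helper, addresses are modelled as plain 64-bit integers, and addr[k] is taken to be the address accessed by thread k of a half-warp.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum { HALF_WARP = 16 };

/* Returns 1 if the 16 accesses of a half-warp satisfy the conditions listed
 * in the text for coalescing into a single transaction, 0 otherwise. */
int is_coalesced(const uint64_t addr[HALF_WARP], size_t elem_size)
{
    /* Element size must be 4, 8 or 16 bytes. */
    if (elem_size != 4 && elem_size != 8 && elem_size != 16)
        return 0;
    /* First address must be aligned to 16 times the element size. */
    if (addr[0] % (HALF_WARP * elem_size) != 0)
        return 0;
    /* Thread k must access the k-th element of one contiguous block. */
    for (int k = 0; k < HALF_WARP; ++k)
        if (addr[k] != addr[0] + (uint64_t)k * elem_size)
            return 0;
    return 1;
}
```

For 4-byte elements, 16 consecutive addresses starting at 1024 pass the check, while the same pattern starting at 1028 (misaligned) or with a stride of 8 bytes (not contiguous) fails it.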