CUDA All you wanted to know about it, but was afraid to ask! Paulo Ivson Netto Santos Waldemar Celes Filho Nov 2007 CUDA is aimed at GPGPU What is GPGPU ? General Purpose computation using GPU – Applications other than 3D graphics – GPU accelerates critical path of application Data parallel algorithms leverage GPU attributes – Large data arrays, streaming throughput – Fine-grain SIMD parallelism – Floating point (FP) computation Applications – see //GPGPU.org – Game effects (FX) physics, image processing – Physical modeling, computational engineering, matrix algebra, convolution, correlation, sorting, etc, etc © David Kirk/NVIDIA and Wen-mei W. Hwu, 2007 ECE 498AL, University of Illinois, Urbana-Champaign Importance of Data Parallelism GPUs are designed for graphics – Highly parallel tasks Data-parallel processing – GPUs architecture is ALU-heavy Multiple pipelines, multiple ALUs per pipe – Large memory latency – HUGE memory bandwidth – Hide memory latency (with more computation) © David Kirk/NVIDIA and Wen-mei W. Hwu, 2007 ECE 498AL, University of Illinois, Urbana-Champaign CPU vs GPU Design Strategies and Tactics CPU Strategy: Make a few threads run fast – Tactics – minimize latency Big Cache – build for hit Instruction/Data Prefetch Speculative Execution limited by “perimeter” – communication bandwidth GPU Strategy: Make many threads run fast – Tactics – maximize throughput Small Cache – build for miss Parallelism (1000s of threads) Pipelining limited by “area” – compute capability © David Kirk/NVIDIA and Wen-mei W. Hwu, 2007 ECE 498AL, University of Illinois, Urbana-Champaign What a GPU looks like? from graphics point of view GeForce 7800 GTX Parallelism 8 Vertex Engines Triangle Setup/Raster Z-Cull Shader Instruction Dispatch Fragment Crossbar Memory Partition © David Kirk/NVIDIA and Wen-mei W. Hwu, 2007 ECE 498AL, University of Illinois, Urbana-Champaign Memory Partition 24 Pixel Shaders 16 Raster Operation Pipelines Memory Partition Memory Partition G80 replaces the pipeline model The future of GPUs is programmable processing So – build the architecture around the processor Host Data Assembler Setup / Rstr / ZCull SP SP SP TF SP SP TF L1 TF L1 SP © David Kirk/NVIDIA and Wen-mei W. Hwu, 2007 ECE 498AL, University of Illinois, Urbana-Champaign SP SP TF L1 L1 SP SP TF L1 L2 FB Pixel Thread Issue SP TF L2 FB SP SP TF L1 L2 FB SP Geom Thread Issue SP TF L1 L2 FB SP L1 L2 FB Thread Processor Vtx Thread Issue L2 FB Work Distribution for Graphics Vertices are serially distributed to all the SM’s SPA processes vertices in parallel Vertices are serially gathered from the SM’s – And sent to Primitive Setup Pixels are serially distributed in parallel tiles SPA processes pixels in parallel Pixels are sent to ROP/FB © David Kirk/NVIDIA and Wen-mei W. Hwu, 2007 ECE 498AL, University of Illinois, Urbana-Champaign G80 vs. G7x GeForce 7 GeForce 8800 SM3 SM4 8 128* 6ppc 32ppc 24 128* Up to 32ppc Up to 192ppc Memory Bandwidth 51 GB/sec 96 GB/sec Compressed Bandwidth 204 GB/sec 768 GB/sec Shader Model Vertex Shaders HDR Texture Filtering Dedicated Shader Pipes ROP Processing © David Kirk/NVIDIA and Wen-mei W. Hwu, 2007 ECE 498AL, University of Illinois, Urbana-Champaign Common GPGPU Constraints Dealing with graphics API – Working with the corner cases of the graphics API Addressing modes – Limited texture size/dimension Shader capabilities – Limited outputs Instruction sets – Lack of Integer & bit ops Communication limited – Between pixels – Scatter a[i] = p © David Kirk/NVIDIA and Wen-mei W. 
Hwu, 2007 ECE 498AL, University of Illinois, Urbana-Champaign Just what is CUDA anyway? “Compute Unified Device Architecture” General purpose programming model – User kicks off batches of threads on the GPU – GPU is viewed as a dedicated super-threaded co-processor Targeted software stack – Compute oriented drivers, language, and tools Driver for loading computation programs into GPU – – – – – – Standalone driver - optimized for computation Interface designed for compute - graphics free API Data sharing with OpenGL buffer objects Guaranteed maximum download & readback speeds Explicit GPU memory management Debugging support on the CPU! © David Kirk/NVIDIA and Wen-mei W. Hwu, 2007 ECE 498AL, University of Illinois, Urbana-Champaign CUDA/G80 Advantage Over Dual Core 20x CUDA Performance 197x 47x 10x Rigid Body Physics Solver Matrix Numerics Wave Equation BLAS1: 60+ GB/s BLAS3: 100+ GFLOPS FDTD: 1.2 Gcells/s FFT: 52 GFLOPS Biological Sequence Match SSEARCH: 5.2 Gcells/s (GFLOPS as defined by benchFFT) © David Kirk/NVIDIA and Wen-mei W. Hwu, 2007 ECE 498AL, University of Illinois, Urbana-Champaign Finance Black Scholes: 4.7 GOptions/s GPU: A Highly Multithreaded Coprocessor The GPU is viewed as a compute device that: – Is a coprocessor to the CPU or host – Has its own DRAM (device memory) – Runs many threads in parallel Identify data-parallel portions of an application Execute them on the device as kernels – Which run in parallel on many threads Differences between GPU and CPU threads – GPU threads are extremely lightweight Very little creation overhead – GPU needs 1000s of threads for full efficiency Multi-core CPU needs only a few © David Kirk/NVIDIA and Wen-mei W. Hwu, 2007 ECE 498AL, University of Illinois, Urbana-Champaign Thread Batching: Grids and Blocks Grid of thread blocks – Corresponds to one kernel – All threads access global memory Device Grid 1 Kernel 1 Thread block – A batch of threads that can cooperate with each other – Share data through a low latency shared memory – Barrier synchronization for hazard-free shared memory accesses Host Threads from different blocks cannot cooperate © David Kirk/NVIDIA and Wen-mei W. Hwu, 2007 ECE 498AL, University of Illinois, Urbana-Champaign Block (0, 0) Block (1, 0) Block (2, 0) Block (0, 1) Block (1, 1) Block (2, 1) Grid 2 Kernel 2 Block (1, 1) Thread Thread Thread Thread Thread (0, 0) (1, 0) (2, 0) (3, 0) (4, 0) Thread Thread Thread Thread Thread (0, 1) (1, 1) (2, 1) (3, 1) (4, 1) Thread Thread Thread Thread Thread (0, 2) (1, 2) (2, 2) (3, 2) (4, 2) Courtesy: NVDIA Block and Thread IDs Threads and blocks have IDs – Each thread can decide what data to work on – Block ID: 1D or 2D – Thread ID: 1D, 2D, or 3D Multidimensional data – Image processing – Solving PDEs on volumes – … © David Kirk/NVIDIA and Wen-mei W. 
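The block/thread indexing described above is easiest to see in code. Below is a minimal sketch (not from the slides; brighten, img, and gain are illustrative names) of a kernel in which each thread combines its block ID and thread ID to locate the one pixel of a 2D image it is responsible for.

__global__ void brighten(float* img, int width, int height, float gain)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;   // column this thread handles
    int y = blockIdx.y * blockDim.y + threadIdx.y;   // row this thread handles
    if (x < width && y < height)                     // guard the partial blocks at the image edges
        img[y * width + x] *= gain;
}

On the host, a 2D grid of 16x16-thread blocks sized as dim3 grid((width + 15) / 16, (height + 15) / 16) would cover the whole image.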
Hwu, 2007 ECE 498AL, University of Illinois, Urbana-Champaign Device Grid 1 Block (0, 0) Block (1, 0) Block (2, 0) Block (0, 1) Block (1, 1) Block (2, 1) Block (1, 1) Thread Thread Thread Thread Thread (0, 0) (1, 0) (2, 0) (3, 0) (4, 0) Thread Thread Thread Thread Thread (0, 1) (1, 1) (2, 1) (3, 1) (4, 1) Thread Thread Thread Thread Thread (0, 2) (1, 2) (2, 2) (3, 2) (4, 2) Courtesy: NVDIA CUDA Device Memory Overview Each thread can: (Device) Grid – – – – – R/W per-thread registers R/W per-thread local memory R/W per-block shared memory R/W per-grid global memory Read only per-grid constant memory – Read only per-grid texture memory The host can R/W global, constant, and texture memories Host Block (0, 0) Shared Memory Registers Registers Shared Memory Registers Registers Thread (0, 0) Thread (1, 0) Thread (0, 0) Thread (1, 0) Local Memory Local Memory Global Memory Constant Memory Texture Memory © David Kirk/NVIDIA and Wen-mei W. Hwu, 2007 ECE 498AL, University of Illinois, Urbana-Champaign Block (1, 0) Local Memory Local Memory Global, Constant, and Texture Memories (Device) Grid Global memory Block (0, 0) – Communicating data between host and device – Visible to all threads Shared Memory Registers Texture and Constant memories – Read-only data initialized by host – Visible to all threads © David Kirk/NVIDIA and Wen-mei W. Hwu, 2007 ECE 498AL, University of Illinois, Urbana-Champaign Block (1, 0) Host Registers Shared Memory Registers Registers Thread (0, 0) Thread (1, 0) Thread (0, 0) Thread (1, 0) Local Memory Local Memory Local Memory Global Memory Constant Memory Texture Memory Courtesy: NVDIA Local Memory A Common Programming Pattern Local and global memory reside in DRAM – Much slower access than shared memory Profitable way of performing computation – Block data and computation to take advantage of fast shared memory – Partition data into data subsets that fit into shared memory – Handle each data subset with one thread block by: Loading the subset from global memory to shared memory, using multiple threads to exploit memory-level parallelism Performing the computation on the subset from shared memory; each thread can efficiently multi-pass over any data element Copying results from shared memory to global memory © David Kirk/NVIDIA and Wen-mei W. Hwu, 2007 ECE 498AL, University of Illinois, Urbana-Champaign A Common Programming Pattern Texture and Constant memory also reside in device memory (DRAM) – Much slower access than shared memory – But… cached! – Highly efficient access for read-only data Carefully divide data according to access patterns – – – – – R/O no structure constant memory R/O array structured texture memory R/W shared within Block shared memory R/W registers spill to local memory R/W inputs/results global memory © David Kirk/NVIDIA and Wen-mei W. Hwu, 2007 ECE 498AL, University of Illinois, Urbana-Champaign That’s it! Or not... so many things still missing! How to code? 1. • API, SDK, etc How does it actually work in the GPU? 2. • HW details that make all the difference How to get the best of it? 3. • Tips and tricks to get those GFLOPs! © David Kirk/NVIDIA and Wen-mei W. Hwu, 2007 ECE 498AL, University of Illinois, Urbana-Champaign CUDA API Extended C Declspecs – global, device, shared, local, constant __device__ float filter[N]; __global__ void convolve (float *image) __shared__ float region[M]; ... Keywords region[threadIdx] = image[i]; __syncthreads() ... 
– threadIdx, blockIdx Intrinsics – __syncthreads Runtime API – Memory, symbol, execution management Function launch © David Kirk/NVIDIA and Wen-mei W. Hwu, 2007 ECE 498AL, University of Illinois, Urbana-Champaign image[j] = result; } // Allocate GPU memory void *myimage = cudaMalloc(bytes) // 100 blocks, 10 threads per block convolve<<<100, 10>>> (myimage); { Extended C Integrated source (foo.cu) cudacc EDG C/C++ frontend Open64 Global Optimizer GPU Assembly CPU Host Code foo.s foo.cpp OCG gcc / cl G80 SASS foo.sass © David Kirk/NVIDIA and Wen-mei W. Hwu, 2007 ECE 498AL, University of Illinois, Urbana-Champaign CUDA Device Memory Allocation cudaMalloc() (Device) Grid – Allocates the device Global Memory – Requires two parameters Block (0, 0) Shared Memory Address of a pointer to the allocated object Size of allocated object cudaFree() – Frees object from device Global Memory © David Kirk/NVIDIA and Wen-mei W. Hwu, 2007 ECE 498AL, University of Illinois, Urbana-Champaign Host Block (1, 0) Shared Memory Register s Register s Register s Register s Thread (0, 0) Thread (1, 0) Thread (0, 0) Thread (1, 0) Local Memor y Local Memor y Local Memor y Local Memor y Global Memory Constant Memory Texture Memory CUDA Device Memory Allocation Code example: – Allocate 256 * 256 single precision float array – Use “d” suffix to indicate device data structure float* elementsd; int size = 256 * 256 * sizeof(float); cudaMalloc( (void**)&dataOnDevice, size ); cudaFree( dataOnDevice ); © David Kirk/NVIDIA and Wen-mei W. Hwu, 2007 ECE 498AL, University of Illinois, Urbana-Champaign CUDA Host-Device Data Transfer (Device) Grid cudaMemcpy() – – Block (0, 0) Memory data transfer Requires four parameters 1. 2. 3. 4. Shared Memory Pointer to destination Pointer to source Number of bytes copied Type of transfer – – – – Host to Host Host to Device Device to Host Device to Device Host Shared Memory Register s Register s Register s Register s Thread (0, 0) Thread (1, 0) Thread (0, 0) Thread (1, 0) Local Memor y Local Memor y Local Memor y Local Memor y Global Memory Constant Memory Texture Memory © David Kirk/NVIDIA and Wen-mei W. Hwu, 2007 ECE 498AL, University of Illinois, Urbana-Champaign Block (1, 0) CUDA Host-Device Data Transfer (cont.) Code example: – – – – Transfer a 64 * 64 single precision float array elements is in host memory elementsd is in device memory cudaMemcpyHostToDevice and cudaMemcpyDeviceToHost are symbolic constants cudaMemcpy( elementsd, elements, size, cudaMemcpyHostToDevice ); cudaMemcpy( elements, elementsd, size, cudaMemcpyDeviceToHost ); © David Kirk/NVIDIA and Wen-mei W. Hwu, 2007 ECE 498AL, University of Illinois, Urbana-Champaign CUDA Function Declarations Executed on the: Only callable from the: __device__ float DeviceFunc() device device __global__ void KernelFunc() device host host host __host__ float HostFunc() __global__ defines a kernel function – Must return void __device__ and __host__ can be used together __host__ is optional © David Kirk/NVIDIA and Wen-mei W. Hwu, 2007 ECE 498AL, University of Illinois, Urbana-Champaign CUDA Function Declarations __device__ functions cannot have their address taken For functions executed on the device: – No recursion – No static variable declarations inside the function – No variable number of arguments © David Kirk/NVIDIA and Wen-mei W. 
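To tie the allocation, transfer, and function-qualifier pieces above together, here is a minimal end-to-end sketch (names such as runSquare and squareAll are illustrative, not from the slides): allocate device memory, copy data in, launch a __global__ kernel that calls a __device__ helper, and copy the result back.

__device__ float square(float x) { return x * x; }      // callable only from device code

__global__ void squareAll(float* data, int n)            // callable from the host, returns void
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] = square(data[i]);
}

void runSquare(float* hostData, int n)
{
    int size = n * sizeof(float);
    float* deviceData;
    cudaMalloc((void**)&deviceData, size);                           // allocate device global memory
    cudaMemcpy(deviceData, hostData, size, cudaMemcpyHostToDevice);  // host -> device
    squareAll<<<(n + 255) / 256, 256>>>(deviceData, n);              // 256 threads per block
    cudaMemcpy(hostData, deviceData, size, cudaMemcpyDeviceToHost);  // device -> host
    cudaFree(deviceData);
}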
Hwu, 2007 ECE 498AL, University of Illinois, Urbana-Champaign Calling a Kernel – Thread Creation Kernel functions are called with an execution configuration __global__ void KernelFunc(...); dim3 DimGrid(100, 50); // 5000 thread blocks dim3 DimBlock(4, 8, 8); // 256 threads per block size_t SharedMemBytes = 64; // 64 bytes of shared memory KernelFunc<<< DimGrid, DimBlock, SharedMemBytes >>>(...); Calls to a kernel function are asynchronous – But only one kernel active at a time per GPU – Implicit synchronizations Second kernel launch Memory read backs – Explicit synchronizations cudaThreadSynchronize() © David Kirk/NVIDIA and Wen-mei W. Hwu, 2007 ECE 498AL, University of Illinois, Urbana-Champaign Some Additional API Features math functions, thread and block ids, etc Application Programming Interface The API is an extension to the C programming language It consists of: – Language extensions To target portions of the code for execution on the device – A runtime library split into: A common component providing built-in vector types and a subset of the C runtime library in both host and device codes A host component to control and access one or more devices from the host A device component providing device-specific functions © David Kirk/NVIDIA and Wen-mei W. Hwu, 2007 ECE 498AL, University of Illinois, Urbana-Champaign Language Extensions: Built-in Variables dim3 gridDim; – Dimensions of the grid in blocks – Grids are at most 2D! gridDim.z is unused dim3 blockDim; – Dimensions of the block in threads dim3 blockIdx; – Block index within the grid dim3 threadIdx; – Thread index within the block © David Kirk/NVIDIA and Wen-mei W. Hwu, 2007 ECE 498AL, University of Illinois, Urbana-Champaign Common Runtime Component Provides: – Built-in vector types – A subset of the C runtime library supported in both host and device codes © David Kirk/NVIDIA and Wen-mei W. Hwu, 2007 ECE 498AL, University of Illinois, Urbana-Champaign Built-in Vector Types [u]char[1..4], [u]short[1..4], [u]int[1..4], [u]long[1..4], float[1..4] – Structures accessed with x, y, z, w fields: uint4 param; int y = param.y; dim3 – Based on uint3 – Used to specify dimensions © David Kirk/NVIDIA and Wen-mei W. Hwu, 2007 ECE 498AL, University of Illinois, Urbana-Champaign Mathematical Functions pow, sqrt, cbrt, hypot exp, exp2, expm1 log, log2, log10, log1p sin, cos, tan, asin, acos, atan, atan2 sinh, cosh, tanh, asinh, acosh, atanh ceil, floor, trunc, round Etc. – When executed on the host, a given function uses the C runtime implementation if available – These functions are only supported for scalar types, not vector types © David Kirk/NVIDIA and Wen-mei W. Hwu, 2007 ECE 498AL, University of Illinois, Urbana-Champaign Host Runtime Component Provides functions to deal with: – – – Device management (including multi-device systems) Memory management Error handling Initializes the first time a runtime function is called A host thread can invoke a kernel on only one device – Multiple host threads required to run on multiple devices © David Kirk/NVIDIA and Wen-mei W. Hwu, 2007 ECE 498AL, University of Illinois, Urbana-Champaign Memory Management Device memory allocation – cudaMalloc(), cudaFree() Memory copy from host to device, device to host, device to device – cudaMemcpy(), cudaMemcpy2D(), cudaMemcpyToSymbol(), cudaMemcpyFromSymbol() Memory addressing – cudaGetSymbolAddress() – Used to transfer data to constant memory © David Kirk/NVIDIA and Wen-mei W. 
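As a concrete illustration of the execution configuration and the asynchronous launch behavior described above, here is a small sketch (scale and launchScale are illustrative names) that rounds the grid size up to cover all n elements and then waits explicitly for the kernel.

__global__ void scale(float* data, int n, float s)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= s;
}

void launchScale(float* dataOnDevice, int n, float s)
{
    dim3 dimBlock(256);                                   // 256 threads per block
    dim3 dimGrid((n + dimBlock.x - 1) / dimBlock.x);      // round up so every element is covered
    size_t sharedBytes = 0;                               // no dynamic shared memory needed here
    scale<<<dimGrid, dimBlock, sharedBytes>>>(dataOnDevice, n, s);   // returns immediately (asynchronous)
    cudaThreadSynchronize();                              // host blocks until the kernel has finished
}

A blocking cudaMemcpy of the results back to the host would also wait for the kernel, per the implicit-synchronization note above.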
Hwu, 2007 ECE 498AL, University of Illinois, Urbana-Champaign Device Mathematical Functions Some mathematical functions (e.g. sin(x)) have a less accurate, but faster device-only version (e.g. __sin(x)) – – – – __pow __log, __log2, __log10 __exp __sin, __cos, __tan © David Kirk/NVIDIA and Wen-mei W. Hwu, 2007 ECE 498AL, University of Illinois, Urbana-Champaign Device Synchronization Function void __syncthreads(); Synchronizes all threads in a block Once all threads have reached this point, execution resumes normally Avoid RAW/WAR/WAW hazards when accessing shared or global memory Allowed in conditional constructs only if the conditional is uniform across the entire thread block © David Kirk/NVIDIA and Wen-mei W. Hwu, 2007 ECE 498AL, University of Illinois, Urbana-Champaign Graphics Interoperability one last API bit... Overview Interface to exchange data between OpenGL / D3D and CUDA without reading it back to the host Buffer objects can be mapped into the CUDA address space and then used as global memory – Textures can be accessed by casting them to buffer objects Data can be accessed as any other global data in the device code Useful for – – – – Frame post-processing Visualization Physical Simulation … © David Kirk/NVIDIA and Wen-mei W. Hwu, 2007 ECE 498AL, University of Illinois, Urbana-Champaign OpenGL Interoperability Mapping GL buffer object to CUDA cudaError_t cudaGLMapBufferObject( unsigned int bufobj, void **Ptr, cudaContext_t ctxt = def) Unmapping GL buffer object from CUDA cudaError_t cudaGLUnmapBufferObject( unsigned int bufobj, cudaContext_t ctxt = def) © David Kirk/NVIDIA and Wen-mei W. Hwu, 2007 ECE 498AL, University of Illinois, Urbana-Champaign OpenGL Interoperability Example (from simpleGL in the SDK) float *dptr; cudaGLMapBufferObject( vbo, (void**) &dptr); dim3 grid( 1, 1, 1); dim3 block( num_threads, 1, 1); kernel<<< grid, block>>>(dptr); cudaGLUnmapBufferObject( vbo ); © David Kirk/NVIDIA and Wen-mei W. Hwu, 2007 ECE 498AL, University of Illinois, Urbana-Champaign Practical Code Example AKA: breaking the inertia with a simple, illustrative (= useless) example Matrix Multiplication Illustrates the basic features of – Global Memory usage – Memory transfer API – Thread allocation – Local, register usage – Thread ID usage – Only example, not efficient! i.e. Leave shared memory usage for later © David Kirk/NVIDIA and Wen-mei W. Hwu, 2007 ECE 498AL, University of Illinois, Urbana-Champaign A Matrix Data Type NOT part of CUDA – 2D matrix – single precision float elements – width * height elements – data elements allocated and attached to elements © David Kirk/NVIDIA and Wen-mei W. Hwu, 2007 ECE 498AL, University of Illinois, Urbana-Champaign typedef struct { int width; int height; float* elements; } Matrix; Square Matrix Multiplication P = M * N of size WIDTH x WIDTH Without blocking One thread handles one element of P P WIDTH M WIDTH N WIDTH © David Kirk/NVIDIA and Wen-mei W. Hwu, 2007 ECE 498AL, University of Illinois, Urbana-Champaign WIDTH Step 1: Matrix Data Transfers // Allocate the host memory M where we will copy to device Matrix AllocateMatrix(const int height, const int width, float initVal) { Matrix M; M.width = MATRIX_SIZE; M.height = MATRIX_SIZE; int size = MATRIX_SIZE * MATRIX_SIZE * sizeof(float); M.elements = (float*) malloc(size); for (int i = 0; i < height; i++) { for (int j = 0; j < width; j++) { M.elements[i*width + j] = initVal; } } return M; } © David Kirk/NVIDIA and Wen-mei W. 
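As a concrete illustration of the __syncthreads() barrier and the uniform-conditional rule described a few slides back, here is a minimal block-wide sum sketch (blockSum and BLOCK are illustrative names; the kernel assumes it is launched with BLOCK threads per block).

#define BLOCK 256

__global__ void blockSum(const float* in, float* blockTotals)
{
    __shared__ float buf[BLOCK];
    int tid = threadIdx.x;
    buf[tid] = in[blockIdx.x * BLOCK + tid];
    __syncthreads();                           // all loads visible before any thread reads buf

    for (int stride = BLOCK / 2; stride > 0; stride /= 2) {
        if (tid < stride)
            buf[tid] += buf[tid + stride];
        __syncthreads();                       // placed outside the if: uniform across the block
    }
    if (tid == 0)
        blockTotals[blockIdx.x] = buf[0];      // one partial sum per thread block
}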
Hwu, 2007 ECE 498AL, University of Illinois, Urbana-Champaign Step 2: Validation Method // Matrix multiplication on the (CPU) host in double precision // For simplicity, we will assume that all dimensions are equal void MatrixMulOnHost(const Matrix M, const Matrix N, Matrix P) { for (int i = 0; i < M.height; ++i) for (int j = 0; j < N.width; ++j) { double sum = 0; for (int k = 0; k < M.width; ++k) { double a = M.elements[i * M.width + k]; double b = N.elements[k * N.width + j]; sum += a * b; } P.elements[i * N.width + j] = sum; } } © David Kirk/NVIDIA and Wen-mei W. Hwu, 2007 ECE 498AL, University of Illinois, Urbana-Champaign Multiply Using One Thread Block N Grid 1 One Block of threads compute matrix P Block 1 2 4 – Each thread computes one element of P Each thread – Loads a row of matrix M – Loads a column of matrix N – Perform one multiply and addition for each pair of M and N elements – Compute to off-chip memory access ratio close to 1:1 (not very high) 2 Thread (2, 2) Size of matrix limited by the number of threads allowed in a thread block © David Kirk/NVIDIA and Wen-mei W. Hwu, 2007 ECE 498AL, University of Illinois, Urbana-Champaign 6 3 2 5 4 48 MATRIX_SIZE M P Step 3: Host-side Main Code int main(void) { // Allocate and initialize the matrices Matrix M = AllocateMatrix(MATRIX_SIZE, MATRIX_SIZE, 1); Matrix N = AllocateMatrix(MATRIX_SIZE, MATRIX_SIZE, 1); Matrix P = AllocateMatrix(MATRIX_SIZE, MATRIX_SIZE, 0); // M * N on the device MatrixMulOnDevice(M, N, P); // Free matrices FreeMatrix(M); FreeMatrix(N); FreeMatrix(P); return 0; } © David Kirk/NVIDIA and Wen-mei W. Hwu, 2007 ECE 498AL, University of Illinois, Urbana-Champaign Step 4: Host-side Code // Matrix multiplication on the device void MatrixMulOnDevice(const Matrix M, const Matrix N, Matrix P) { // Load M and N to the device Matrix Md = AllocateDeviceMatrix(M); CopyToDeviceMatrix(Md, M); Matrix Nd = AllocateDeviceMatrix(N); CopyToDeviceMatrix(Nd, N); // Allocate P on the device Matrix Pd = AllocateDeviceMatrix(P); CopyToDeviceMatrix(Pd, P); // Clear memory © David Kirk/NVIDIA and Wen-mei W. Hwu, 2007 ECE 498AL, University of Illinois, Urbana-Champaign Step 4: Host-side Code (cont.) // Setup the execution configuration dim3 dimBlock(MATRIX_SIZE, MATRIX_SIZE); dim3 dimGrid(1, 1); // Launch the device computation threads! MatrixMulKernel<<<dimGrid, dimBlock>>>(Md, Nd, Pd); // Read P from the device CopyFromDeviceMatrix(P, Pd); // Free device matrices FreeDeviceMatrix(Md); FreeDeviceMatrix(Nd); FreeDeviceMatrix(Pd); } © David Kirk/NVIDIA and Wen-mei W. Hwu, 2007 ECE 498AL, University of Illinois, Urbana-Champaign Step 5: Device-side Kernel // Matrix multiplication kernel – thread specification __global__ void MatrixMulKernel(Matrix M, Matrix N, Matrix P) { // 2D Thread ID int tx = threadIdx.x; int ty = threadIdx.y; // Pvalue is used to store the element of the matrix that is computed by the thread float Pvalue = 0; for (int k = 0; k < MATRIX_SIZE; ++k) { float Melement = M.elements[ty * M.width + k]; float Nelement = N.elements[k * N.width + tx]; Pvalue += Melement * Nelement; } // Write the matrix to device memory; each thread writes one element P.elements[ty * P.width + tx] = Pvalue; } © David Kirk/NVIDIA and Wen-mei W. Hwu, 2007 ECE 498AL, University of Illinois, Urbana-Champaign Step 3: Some Loose Ends // Allocate a device matrix of same size as M. 
Matrix AllocateDeviceMatrix(const Matrix M) { Matrix Mdevice = M; int size = M.width * M.height * sizeof(float); cudaMalloc((void**)&Mdevice.elements, size); return Mdevice; } // Free a device matrix. void FreeDeviceMatrix(Matrix M) { cudaFree(M.elements); } void FreeMatrix(Matrix M) { free(M.elements); } © David Kirk/NVIDIA and Wen-mei W. Hwu, 2007 ECE 498AL, University of Illinois, Urbana-Champaign Step 3: Some Loose Ends // Copy a host matrix to a device matrix. void CopyToDeviceMatrix(Matrix Mdevice, const Matrix Mhost) { int size = Mhost.width * Mhost.height * sizeof(float); cudaMemcpy(Mdevice.elements, Mhost.elements, size, cudaMemcpyHostToDevice); } // Copy a device matrix to a host matrix. void CopyFromDeviceMatrix(Matrix Mhost, const Matrix Mdevice) { int size = Mdevice.width * Mdevice.height * sizeof(float); cudaMemcpy(Mhost.elements, Mdevice.elements, size, cudaMemcpyDeviceToHost); } © David Kirk/NVIDIA and Wen-mei W. Hwu, 2007 ECE 498AL, University of Illinois, Urbana-Champaign Performance Results (??) Core 2 Duo 2.4GHz vs 8800 GTS 640MB Matrix size = 16x16 1 block of 256 threads Host processing time: 0.005550 (ms) Device processing time: 0.398564 (ms) I told you it was an illustrative example! © David Kirk/NVIDIA and Wen-mei W. Hwu, 2007 ECE 498AL, University of Illinois, Urbana-Champaign Performance Results Of course, we can try cheating and make 12 multiplications in parallel Matrix size = 16x16 12 blocks of 256 threads each Host processing time: 0.062140 (ms) Device processing time: 0.396850 (ms) Hmm... since it is still illustrative, lets experiment a little more! © David Kirk/NVIDIA and Wen-mei W. Hwu, 2007 ECE 498AL, University of Illinois, Urbana-Champaign Matrix Multiplication Shared Memory __global__ void matrixMulSimpleKernelShared( float* m, float* n, float* p ) { const int tx = threadIdx.x; const int ty = threadIdx.y; float sum = 0; __shared__ float MMs[BLOCK_SIZE][BLOCK_SIZE]; __shared__ float NNs[BLOCK_SIZE][BLOCK_SIZE]; MMs[ty][tx] = m[ty*BLOCK_SIZE + tx]; NNs[ty][tx] = n[ty*BLOCK_SIZE + tx]; __syncthreads(); for( int k = 0; k < BLOCK_SIZE; ++k ) { sum += MMs[ty][k] * NNs[k][tx]; } p[ty*BLOCK_SIZE + tx] = sum; } © David Kirk/NVIDIA and Wen-mei W. Hwu, 2007 ECE 498AL, University of Illinois, Urbana-Champaign Matrix Multiplication Constant Memory __constant__ float Mc[BLOCK_SIZE*BLOCK_SIZE]; __constant__ float Nc[BLOCK_SIZE*BLOCK_SIZE]; __global__ void matrixMulSimpleKernelConstant( float* m, float* n, float* p ) { const int tx = threadIdx.x; const int ty = threadIdx.y; float sum = 0; for( int k = 0; k < BLOCK_SIZE; ++k ) { const float a = Mc[ty*BLOCK_SIZE + k]; const float b = Nc[k*BLOCK_SIZE + tx]; sum += a * b; } p[ty*BLOCK_SIZE + tx] = sum; } © David Kirk/NVIDIA and Wen-mei W. Hwu, 2007 ECE 498AL, University of Illinois, Urbana-Champaign Matrix Multiplication Constant Memory (host) int byteTotal = BLOCK_SIZE*BLOCK_SIZE*sizeof(float); cudaMemcpyToSymbol( Mc, m, byteTotal ) ; cudaMemcpyToSymbol( Nc, n, byteTotal ) ; then call kernel © David Kirk/NVIDIA and Wen-mei W. Hwu, 2007 ECE 498AL, University of Illinois, Urbana-Champaign Matrix Multiplication Texture Memory texture<float, 2> texM; texture<float, 2> texN; __global__ void matrixMulSimpleKernelTexture( float* p ) { const int tx = threadIdx.x; const int ty = threadIdx.y; float sum = 0; for( int k = 0; k < BLOCK_SIZE; ++k ) { const float a = tex2D( texM, k, ty ); const float b = tex2D( texN, tx, k ); sum += a * b; } p[ty*BLOCK_SIZE + tx] = sum; } © David Kirk/NVIDIA and Wen-mei W. 
Hwu, 2007 ECE 498AL, University of Illinois, Urbana-Champaign Matrix Multiplication Texture Memory (host) // Allocate arrays for texture access cudaArray* mArray = NULL; cudaArray* nArray = NULL; cudaChannelFormatDesc channelDesc = cudaCreateChannelDesc<float>(); cudaMallocArray( &mArray, &channelDesc, BLOCK_SIZE, BLOCK_SIZE ); cudaMallocArray( &nArray, &channelDesc, BLOCK_SIZE, BLOCK_SIZE ); // Bind the arrays to the textures cudaBindTextureToArray( texM, mArray ) ; cudaBindTextureToArray( texN, nArray ) ; // Set M texture parameters texM.addressMode[0] = cudaAddressModeClamp; texM.addressMode[1] = cudaAddressModeClamp; texM.filterMode = cudaFilterModePoint; texM.normalized = false; // Set N texture parameters texN.addressMode[0] = cudaAddressModeClamp; texN.addressMode[1] = cudaAddressModeClamp; texN.filterMode = cudaFilterModePoint; texN.normalized = false; © David Kirk/NVIDIA and Wen-mei W. Hwu, 2007 ECE 498AL, University of Illinois, Urbana-Champaign Matrix Multiplication Texture Memory (host, pt 2) int byteTotal = BLOCK_SIZE*BLOCK_SIZE*sizeof(float); cudaMemcpyToArray( mArray, 0, 0, m, byteTotal, cudaMemcpyHostToDevice ); cudaMemcpyToArray( nArray, 0, 0, n, byteTotal, cudaMemcpyHostToDevice ); then call kernel © David Kirk/NVIDIA and Wen-mei W. Hwu, 2007 ECE 498AL, University of Illinois, Urbana-Champaign Performance Results? About the same Constant memory seems faster (about 0.01ms) Not really any difference, still slower than CPU We will see the proper way of doing Matrix Multiplication in a few slides! © David Kirk/NVIDIA and Wen-mei W. Hwu, 2007 ECE 498AL, University of Illinois, Urbana-Champaign Useful Information on Tools like...err... DEBUGGING! Compilation Any source file containing CUDA language extensions must be compiled with nvcc nvcc is a compiler driver – Works by invoking all the necessary tools and compilers like cudacc, g++, cl, ... nvcc can output: – Either C code – That must then be compiled with the rest of the application using another tool Or object code directly © David Kirk/NVIDIA and Wen-mei W. Hwu, 2007 ECE 498AL, University of Illinois, Urbana-Champaign Linking Any executable with CUDA code requires two dynamic libraries: – The CUDA runtime library (cudart) – The CUDA core library (cuda) © David Kirk/NVIDIA and Wen-mei W. Hwu, 2007 ECE 498AL, University of Illinois, Urbana-Champaign Debugging Using the Device Emulation Mode An executable compiled in device emulation mode (nvcc deviceemu) runs completely on the host using the CUDA runtime – No need of any device and CUDA driver – Each device thread is emulated with a host thread When running in device emulation mode, one can: – Use host native debug support (breakpoints, inspection, etc.) – Access any device-specific data from host code and vice-versa – Call any host function from device code (e.g. printf) and viceversa – Detect deadlock situations caused by improper usage of __syncthreads © David Kirk/NVIDIA and Wen-mei W. Hwu, 2007 ECE 498AL, University of Illinois, Urbana-Champaign Device Emulation Mode Pitfalls Emulated device threads execute sequentially, so simultaneous accesses of the same memory location by multiple threads could produce different results. 
Dereferencing device pointers on the host or host pointers on the device can produce correct results in device emulation mode, but will generate an error in device execution mode Results of floating-point computations will slightly differ because of: – Different compiler outputs, instruction sets – Use of extended precision for intermediate results There are various options to force strict single precision on the host © David Kirk/NVIDIA and Wen-mei W. Hwu, 2007 ECE 498AL, University of Illinois, Urbana-Champaign CUDA SDK CUBLAS – Blas level 1, 2 and 3 ready-to-use functions CUFFT – Discrete Fast Fourier Transform – API similar to popular FFTWin CUDPP (still beta) – Parallel primitives – Prefix sum, sort, reduction, etc Full support for clusters running Rocks © David Kirk/NVIDIA and Wen-mei W. Hwu, 2007 ECE 498AL, University of Illinois, Urbana-Champaign Hardware now, take a breath... G80 Thread Computing Pipeline The future of GPUs is programmable processing So – build the architecture around the processor Host Input Assembler Setup / Rstr / ZCull SP SP SP TF SP SP TF L1 TF L1 SP © David Kirk/NVIDIA and Wen-mei W. Hwu, 2007 ECE 498AL, University of Illinois, Urbana-Champaign SP SP TF L1 L1 SP SP TF L1 L2 FB Pixel Thread Issue SP TF L2 FB SP SP TF L1 L2 FB SP Geom Thread Issue SP TF L1 L2 FB SP L1 L2 FB Thread Processor Vtx Thread Issue L2 FB G80 Thread Computing Pipeline Processors execute computing threads Alternative operating mode specifically for computing Host Input Assembler Thread Execution Manager Parallel Data Cache Parallel Data Cache Parallel Data Cache Parallel Data Cache Parallel Data Cache Parallel Data Cache Parallel Data Cache Parallel Data Cache Texture Texture Texture Texture Texture Texture Texture Texture Texture Load/store Load/store Load/store Load/store Global Memory © David Kirk/NVIDIA and Wen-mei W. Hwu, 2007 ECE 498AL, University of Illinois, Urbana-Champaign Load/store Load/store GeForce 8800 Series Technical Specs Maximum number of threads per block: 512 Maximum size of each dimension of a grid: 65,535 Number of streaming multiprocessors (SM): – – GeForce 8800 GTX: 16 @ 1.35 GHz GeForce 8800 GTS: 12 @ 1.2 GHz Device memory: – – GeForce 8800 GTX: 768 MB GeForce 8800 GTS: 640 MB Shared memory per multiprocessor: 16KB divided in 16 banks Constant memory: 64 KB Warp size: 32 threads (16 Warps/Block) © David Kirk/NVIDIA and Wen-mei W. Hwu, 2007 ECE 498AL, University of Illinois, Urbana-Champaign What is the GPU Good at? Data-parallel processing – Same computation executed on many data elements in parallel – Low control flow overhead With high SP floating point arithmetic intensity – Many calculations per memory access – Currently need high floating point to integer ratio © David Kirk/NVIDIA and Wen-mei W. Hwu, 2007 ECE 498AL, University of Illinois, Urbana-Champaign What is the GPU Good at? High floating-point arithmetic intensity + Many data elements = Memory access latency can be hidden with calculations instead of big caches Still need to avoid bandwidth saturation! © David Kirk/NVIDIA and Wen-mei W. Hwu, 2007 ECE 498AL, University of Illinois, Urbana-Champaign Drawbacks of (legacy) GPGPU Model how it was done prior to CUDA Hardware Limitations Memory accesses are done as pixels – Only gather: can read data from other pixels Control Cache ALU ALU ALU ... Control Cache ALU ALU ALU ... … … d0 d1 (Each d3 d4 write d5 to one d7 d2 shader d6 No scatter: can only pixel) DRAM – Control Cache DRAM ALU ALU ALU ... 
Control Cache Less programming flexibility d0 © David Kirk/NVIDIA and Wen-mei W. Hwu, 2007 ECE 498AL, University of Illinois, Urbana-Champaign d1 d2 d3 ALU ALU ALU ... … d4 d5 d6 d7 … Hardware Limitations Applications can easily be limited by DRAM memory bandwidth Control Cache DRAM ALU ALU ALU ... d0 d1 d2 d3 Control Cache ALU ALU ALU ... d4 d5 d6 d7 … Waste of computation power due to data starvation © David Kirk/NVIDIA and Wen-mei W. Hwu, 2007 ECE 498AL, University of Illinois, Urbana-Champaign But with CUDA what we can do differently Scatter CUDA provides generic DRAM memory addressing – Gather: Control Cache DRAM ALU ALU ALU ... d0 d1 d2 d3 Control Cache ALU ALU ALU ... … d4 d5 d6 d7 … – And scatter: no longer limited to write one pixel Control Cache DRAM ALU ALU ALU ... d0 d1 d2 d3 Control Cache ALU ALU ALU ... … d4 d5 d6 d7 … More programming flexibility © David Kirk/NVIDIA and Wen-mei W. Hwu, 2007 ECE 498AL, University of Illinois, Urbana-Champaign On-Chip Shared Memory CUDA enables access to a parallel on-chip shared memory for efficient inter-thread data sharing Control Cache Shared memory DRAM ALU ALU ALU ... d0 d1 d2 d3 d0 d1 d2 d3 Control Cache Shared memory ALU ALU ALU ... … d4 d5 d6 d7 d4 d5 d6 d7 Big memory bandwidth savings © David Kirk/NVIDIA and Wen-mei W. Hwu, 2007 ECE 498AL, University of Illinois, Urbana-Champaign … Memory Model & Hardware how does the CUDA memory model work in the hardware? CUDA Memory Spaces Grid Each thread can: – Read/write per-thread registers – Read/write per-thread local memory – Read/write per-block shared memory – Read/write per-grid global memory – Read only per-grid constant memory – Read only per-grid texture memory The host can read/write global, constant, and texture memory Host Block (0, 0) Shared Memory Registers Registers Shared Memory Registers Registers Thread (0, 0) Thread (1, 0) Thread (0, 0) Thread (1, 0) Local Memory Local Memory Global Memory Constant Memory Texture Memory © David Kirk/NVIDIA and Wen-mei W. Hwu, 2007 ECE 498AL, University of Illinois, Urbana-Champaign Block (1, 0) Local Memory Local Memory Hardware Implementation The local, global, constant, and texture spaces are regions of device memory Each multiprocessor has: – – A set of 32-bit registers per processor On-chip shared memory – Multiprocessor N Multiprocessor 2 Multiprocessor 1 Shared Memory Registers Processor 1 Registers Processor 2 A read-only constant cache – Where the shared memory space resides Device To speed up access to the constant memory space To speed up access to the texture memory space © David Kirk/NVIDIA and Wen-mei W. Hwu, 2007 ECE 498AL, University of Illinois, Urbana-Champaign … Instruction Unit Processor M Constant Cache Texture Cache A read-only texture cache Registers Device memory Global, constant, texture memories Memory Summary Memory Location Cached Access Local Off-chip No Read/write One thread Shared On-chip N/A - resident Read/write All threads in a block Global Off-chip No Read/write All threads + host Constant Off-chip Yes Read All threads + host Texture Off-chip Yes Read All threads + host © David Kirk/NVIDIA and Wen-mei W. 
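To make the memory summary above concrete, here is a sketch showing where each kind of variable is declared; the names are illustrative and the kernel assumes a block of 128 threads.

__constant__ float coeff[16];        // per-grid constant memory: cached, read-only in kernels,
                                     // filled from the host with cudaMemcpyToSymbol
__device__   float results[1024];    // per-grid global memory: off-chip, read/write

__global__ void whereThingsLive(const float* input)   // input points into global memory
{
    __shared__ float tile[128];      // per-block shared memory: on-chip, read/write
    int   tid = threadIdx.x;         // scalars like tid and x normally live in registers
    float x   = input[tid];
    // Large per-thread arrays that do not fit in registers are spilled to per-thread
    // local memory, which is off-chip like global memory.

    tile[tid] = x * coeff[0];
    __syncthreads();
    results[tid] = tile[tid];
}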
Hwu, 2007 ECE 498AL, University of Illinois, Urbana-Champaign Who Access Times Register – Shared Memory – DRAM, cached, 1…10s…100s of cycles, depending on cache locality Texture Memory – DRAM, no cache - *slow* Constant Memory – DRAM, no cache - *slow* Global Memory – dedicated HW - single cycle Local Memory – dedicated HW - single cycle DRAM, cached, 1…10s…100s of cycles, depending on cache locality Instruction Memory (invisible) – DRAM, cached © David Kirk/NVIDIA and Wen-mei W. Hwu, 2007 ECE 498AL, University of Illinois, Urbana-Champaign Processing Model & Hardware now that we can reach the data, how do we actually process it? CUDA: A Set of SIMD Multiprocessors A set of 16 multiprocessors Each multiprocessor – – Multiprocessor N Multiprocessor 2 Multiprocessor 1 At each clock cycle – A set of 8 processors (32-bit) Single Instruction Multiple Data architecture (shared instruction unit) Device The multiprocessor executes the same instruction on a group of threads called a warp The number of threads in a warp is the warp size © David Kirk/NVIDIA and Wen-mei W. Hwu, 2007 ECE 498AL, University of Illinois, Urbana-Champaign Processor 1 Processor 2 … Instruction Unit Processor M Threads, Warps, Blocks There are (up to) 32 threads in a Warp – Only <32 when there are fewer than 32 total threads There are (up to) 16 Warps in a Block Each Block (and thus, each Warp) executes on a single SM G80 has 16 SMs At least 16 Blocks required to “fill” the device More is better – If resources (registers, thread space, shared memory) allow, more than 1 Block can occupy each SM © David Kirk/NVIDIA and Wen-mei W. Hwu, 2007 ECE 498AL, University of Illinois, Urbana-Champaign Execution Model (review) Each thread block of a grid is split into warps, each gets executed by one multiprocessor (SM) – The way a block is split into warps is always the same Each thread block is executed by one multiprocessor – Each warp contains threads of consecutive, increasing thread indices with the first warp containing thread 0 So that the shared memory space resides in the on-chip shared memory A multiprocessor can execute multiple blocks concurrently – – Shared memory and registers are partitioned among the threads of all concurrent blocks So, decreasing shared memory usage (per block) and register usage (per thread) increases number of blocks that can run concurrently © David Kirk/NVIDIA and Wen-mei W. Hwu, 2007 ECE 498AL, University of Illinois, Urbana-Champaign Matrix: Reloaded a better approach to matrix multiplication Recall: Matrix Multiplication Device-Side Kernel Function M N WIDTH for (int k = 0; k < M.width; ++k) { float Melement = M.elements[ty * M.pitch + k]; float Nelement = Nd.elements[k * N.pitch + tx]; Pvalue += Melement * Nelement; } // Write the matrix to device memory; // each thread writes one element P.elements[ty * P.pitch + tx] = Pvalue; P WIDTH ty tx WIDTH © David Kirk/NVIDIA and Wen-mei W. Hwu, 2007 ECE 498AL, University of Illinois, Urbana-Champaign WIDTH Idea: Use Shared Memory to reuse global memory data Each input element is read by WIDTH threads Load each element into Shared Memory Several threads use the local version Drastically reduce the memory bandwidth – Tiled algorithms © David Kirk/NVIDIA and Wen-mei W. 
Hwu, 2007 ECE 498AL, University of Illinois, Urbana-Champaign Tiled Multiply Using Thread Blocks bx 0 1 2 tx BLOCK_SIZE bsize-1 N BLOCK_SIZE One thread computes one element of Psub Assume that the dimensions of M and N are multiples of BLOCK_SIZE and square shape M P by 0 1 2 1 ty Psub bsize-1 BLOCK_SIZE BLOCK_SIZE BLOCK_SIZE WIDTH WIDTH 2 WIDTH 0 BLOCK_SIZE 012 WIDTH One block computes one square sub-matrix Psub of size BLOCK_SIZE Shared Memory Usage Each SMP has 16KB shared memory – Each Thread Block uses 2*256*4B = 2KB of shared memory. – Potentially up to 8 Thread Blocks actively executing – For BLOCK_SIZE = 16, this allows up to 8*512 = 4,096 pending loads In practice, there will probably be up to half of this due to scheduling to make use of SPs. – The next BLOCK_SIZE 32 would lead to 2*32*32*4B= 8KB shared memory usage per Thread Block, allowing only up to 2 Thread Blocks active at the same time © David Kirk/NVIDIA and Wen-mei W. Hwu, 2007 ECE 498AL, University of Illinois, Urbana-Champaign First-order Size Considerations Each Thread Block should have a minimal of 192 threads – BLOCK_SIZE of 16 gives 16*16 = 256 threads A minimal of 32 Thread Blocks – A 1024*1024 P Matrix gives 64*64 = 4096 Thread Blocks Each thread block perform – 2*256 = 512 float loads from global memory – for 256 * (2*16) = 8,192 mul/add operations – Memory bandwidth no longer a limiting factor © David Kirk/NVIDIA and Wen-mei W. Hwu, 2007 ECE 498AL, University of Illinois, Urbana-Champaign Kernel Execution Configuration // Setup computation // Matrix size could be 1024 dim3 dimBlock( BLOCK_SIZE, BLOCK_SIZE ); dim3 dimGrid( MATRIX_SIZE/BLOCK_SIZE, MATRIX_SIZE/BLOCK_SIZE ); // Launch device kernel matrixMulKernel<<< dimGrid, dimBlock >>>( md, nd, pd ); For very large N and M dimensions, one will need to add another level of blocking and execute the second-level blocks sequentially (several kernels); © David Kirk/NVIDIA and Wen-mei W. Hwu, 2007 ECE 498AL, University of Illinois, Urbana-Champaign Kernel Code: initialization __global__ void matrixMulKernel( float* m, float* n, float* p ) { // Block and thread index int bx = blockIdx.x; int by = blockIdx.y; int tx = threadIdx.x; int ty = threadIdx.y; // Index of the first and last sub-matrix of A processed by the block int mBegin = MATRIX_SIZE * BLOCK_SIZE * by; int mEnd = mBegin + MATRIX_SIZE - 1; // Step size used to iterate through the sub-matrices of A int mStep = BLOCK_SIZE; // Index of the first and last sub-matrix of B processed by the block int nBegin = BLOCK_SIZE * bx; int nStep = BLOCK_SIZE * MATRIX_SIZE; // sum is used to store the element of the block sub-matrix that is computed by the thread float sum = 0; // Declaration of the shared memory arrays Ms and Ns used to store the sub-matrices of A and B __shared__ float Ms[BLOCK_SIZE][BLOCK_SIZE]; __shared__ float Ns[BLOCK_SIZE][BLOCK_SIZE]; © David Kirk/NVIDIA and Wen-mei W. 
Hwu, 2007 ECE 498AL, University of Illinois, Urbana-Champaign Tiled Multiply Using Thread Blocks bx 0 1 tx bsize-1 BLOCK_SIZE 012 N BLOCK_SIZE One thread computes one element of Psub Assume that the dimensions of M and N are multiples of BLOCK_SIZE and square shape M WIDTH One block computes one square sub-matrix Psub of size BLOCK_SIZE P by 0 1 2 1 ty Psub bsize-1 BLOCK_SIZE BLOCK_SIZE BLOCK_SIZE WIDTH WIDTH 2 WIDTH 0 BLOCK_SIZE 2 Kernel Code: main computation // Loop over all the sub-matrices of A and B required to compute the block sub-matrix for( int a = mBegin, b = nBegin; a <= mEnd; a+=mStep, b+=nStep ) { // Load the matrices from device memory to shared memory; // each thread loads one element of each matrix MS(ty, tx) = m[a + MATRIX_SIZE * ty + tx]; NS(ty, tx) = n[b + MATRIX_SIZE * ty + tx]; // Synchronize to make sure the matrices are loaded __syncthreads(); // Multiply the two matrices together; each thread computes one element of the block sub-matrix for( int k = 0; k < BLOCK_SIZE; ++k ) sum += MS(ty, k) * NS(k, tx); // Synchronize to make sure that the preceding computation is done before // loading two new sub-matrices of A and B in the next iteration __syncthreads(); } // Write the block sub-matrix to device memory; each thread writes one element int c = MATRIX_SIZE * BLOCK_SIZE * by + BLOCK_SIZE * bx; p[c + MATRIX_SIZE*ty + tx] = sum; } © David Kirk/NVIDIA and Wen-mei W. Hwu, 2007 ECE 498AL, University of Illinois, Urbana-Champaign This code should run at about 45 GFLOPS Performance Results 6 4.838223 Processing Time (ms) 5 4 3 CPU GPU 2 1 0.241846 0.251588 0.26058 0.480575 0.301256 0.29736 0 0 0.00553520 0.03742840 0.125462 60 80 Matrix Size (NxN) © David Kirk/NVIDIA and Wen-mei W. Hwu, 2007 ECE 498AL, University of Illinois, Urbana-Champaign 100 120 140 Performance Results GPU block sizes Size = 4: 314 ms Size = 8: 156 ms Size = 16: 65 ms Matrix: 1024 x 1024 CPU: 5420 ms GPU: 65 ms 6000 Processing Time (ms) 5000 4000 3000 CPU GPU 2000 1000 0 0 200 400 600 Matrix Size (NxN) © David Kirk/NVIDIA and Wen-mei W. Hwu, 2007 ECE 498AL, University of Illinois, Urbana-Champaign 800 1000 1200 Performance Results Important note: CPU code not optimized! GPU seems almost 100x faster than CPU But – Even if we can get a 2x speed-up in CPU (!) – GPU would still be about 50x faster! Do the math! – Fastest Core 2 Duo has peak at ~10 GFLOPs – Previous CUDA code is ~45 GFLOPs © David Kirk/NVIDIA and Wen-mei W. Hwu, 2007 ECE 498AL, University of Illinois, Urbana-Champaign Matrix Multiplication: Verdict No need to code and optimize this! Use CUBLAS – CUDA Blas level 1, 2 and 3 library – Ready for use, optimized like hell! © David Kirk/NVIDIA and Wen-mei W. Hwu, 2007 ECE 498AL, University of Illinois, Urbana-Champaign HW Architecture everybody take a deep breath now...... Streaming Processor Array (SPA) TPC TPC TPC Texture Processor Cluster TPC TPC TPC TPC Streaming Multiprocessor Instruction L1 SM TPC Data L1 Instruction Fetch/Dispatch Shared Memory TEX SP SM SP SP SP SFU © David Kirk/NVIDIA and Wen-mei W. 
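The tiled kernel above is split across several slides and relies on helper macros (MS, NS) defined elsewhere. For reference, the sketch below is one self-contained way to express the same tiling idea, assuming square matrices whose side MATRIX_SIZE is a multiple of BLOCK_SIZE and that both are compile-time constants.

__global__ void matrixMulTiled(const float* m, const float* n, float* p)
{
    __shared__ float Ms[BLOCK_SIZE][BLOCK_SIZE];
    __shared__ float Ns[BLOCK_SIZE][BLOCK_SIZE];

    int row = blockIdx.y * BLOCK_SIZE + threadIdx.y;   // row of P this thread owns
    int col = blockIdx.x * BLOCK_SIZE + threadIdx.x;   // column of P this thread owns
    float sum = 0.0f;

    for (int t = 0; t < MATRIX_SIZE / BLOCK_SIZE; ++t) {
        // Each thread stages one element of the current M tile and one of the N tile.
        Ms[threadIdx.y][threadIdx.x] = m[row * MATRIX_SIZE + t * BLOCK_SIZE + threadIdx.x];
        Ns[threadIdx.y][threadIdx.x] = n[(t * BLOCK_SIZE + threadIdx.y) * MATRIX_SIZE + col];
        __syncthreads();                               // tiles fully loaded before use

        for (int k = 0; k < BLOCK_SIZE; ++k)
            sum += Ms[threadIdx.y][k] * Ns[k][threadIdx.x];
        __syncthreads();                               // done with the tiles before overwriting them
    }
    p[row * MATRIX_SIZE + col] = sum;                  // one result element per thread
}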
Hwu, 2007 ECE 498AL, University of Illinois, Urbana-Champaign SFU SP SP SP SP Texture Processor Cluster (TPC) TPC SM0 T e x t u r e L 1 C a c h e Instruction Fetch Instruction L 1 Cache Instruction Decode T e x t u r e Shared Memory S F U SP0 R R SP 4 SP1 R R SP 5 SP2 R R SP 6 SP3 R R SP 7 S F U L 2 Constant L1 Cache SM1 Instruction Fetch U n i t I & C Instruction L 1 Cache Instruction Decode Shared Memory S F U SP0 R R SP 4 SP1 R R SP 5 SP2 R R SP 6 SP3 R R SP 7 Constant L 1 Cache Load/Store © David Kirk/NVIDIA and Wen-mei W. Hwu, 2007 ECE 498AL, University of Illinois, Urbana-Champaign S F U C a c h e Texture-Processor Cluster (TPC) – Texture unit, L1 texture cache – 2 Streaming Multiprocessors (SM) – 8 FP MAD / clock – L2 Instruction & Data Caches Memory and Texture access – Texture, load/store interfaces – Registers decouple latency Streaming Multiprocessor (SM) Streaming Multiprocessor (SM) – 8 Streaming Processors (SP) – 2 Super Function Units (SFU) Multi-threaded instruction dispatch – 1 to 768 threads active – SIMD instruction per 16/32 threads – Cover latency of texture/memory loads Hot clock 1.35 GHz – 20+ GFLOPS local register file (RFn) 16 KB shared memory DRAM texture and memory access © David Kirk/NVIDIA and Wen-mei W. Hwu, 2007 ECE 498AL, University of Illinois, Urbana-Champaign Streaming Multiprocessor (SM) Instruction Fetch Instruction L 1 Cache L1 Fill Thread / Instruction Dispatch Work Shared Memory S F U Control SP0 RF0 RF4 SP4 SP1 RF1 RF5 SP5 SP2 RF2 RF6 SP6 SP3 RF3 RF7 SP7 Results S F U Load Texture Constant L1 Cache Load from Memory L1 Fill Store to Store to Memory Streaming Processor (SP) One scalar ALU – Serves as datapath for 1 thread of a warp – Each SM has 8 SP – Each SM has 2 SFU Threads – A warp instruction can issue every clock – Need ~8 warps to typically saturate the MAD/SFU pipes © David Kirk/NVIDIA and Wen-mei W. Hwu, 2007 ECE 498AL, University of Illinois, Urbana-Champaign SM Instruction Buffer Fetch one warp instruction/cycle – – from instruction L1 cache into any instruction buffer slot Issue one “ready-to-go” warp instruction/cycle – – from any warp - instruction buffer slot operand scoreboarding used to prevent hazards I$ L1 Multithreaded Instruction Buffer R F C$ L1 Shared Mem Operand Select Issue selection based on round-robin/age of warp SM broadcasts SIMD instruction to 32 Threads of a Warp © David Kirk/NVIDIA and Wen-mei W. Hwu, 2007 ECE 498AL, University of Illinois, Urbana-Champaign MAD SFU Scoreboarding All operands of all instructions in the Instruction Buffer are scoreboarded – prevents hazards – cleared instructions are eligible for issue Decoupled Memory/Processor pipelines – any thread can continue to issue instructions until scoreboarding prevents issue – allows Memory/Processor ops to proceed in shadow of Memory/Processor ops © David Kirk/NVIDIA and Wen-mei W. Hwu, 2007 ECE 498AL, University of Illinois, Urbana-Champaign Branching Conditional branch to label, subroutine call SM schedules each Warp independently SM executes 32 threads of a Warp as a SIMD instruction – SM enables/disables sets of threads when branches diverge Synchronization – Re-converge diverged threads in a Warp Barrier Synchronization – CUDA uses barrier instruction to synchronize multiple Warps in a Thread Block © David Kirk/NVIDIA and Wen-mei W. 
Hwu, 2007 ECE 498AL, University of Illinois, Urbana-Champaign SM Register File Register File (RF) – 32 KB – Provides 4 operands/clock TEX pipe can also read/write RF – 2 SMs share 1 TEX Load/Store pipe can also read/write RF I$ L1 Multithreaded Instruction Buffer R F Shared Mem Operand Select MAD © David Kirk/NVIDIA and Wen-mei W. Hwu, 2007 ECE 498AL, University of Illinois, Urbana-Champaign C$ L1 SFU Constants Immediate address constants Indexed address constants Constants stored in memory, and cached on chip – L1 per SM I$ L1 Multithreaded Instruction Buffer R F C$ L1 Shared Mem Operand Select MAD © David Kirk/NVIDIA and Wen-mei W. Hwu, 2007 ECE 498AL, University of Illinois, Urbana-Champaign SFU Shared Memory Each SM has 16 KB of Shared Memory – 16 banks of 32bit words CUDA uses Shared Memory as shared storage visible to all threads in a thread block – read and write access Not used explicitly for pixel shader programs – we dislike pixels talking to each other © David Kirk/NVIDIA and Wen-mei W. Hwu, 2007 ECE 498AL, University of Illinois, Urbana-Champaign I$ L1 Multithreaded Instruction Buffer R F C$ L1 Shared Mem Operand Select MAD SFU Execution Pipes Scalar MAD pipe – – – FMUL,FADD,FMAD integer ops, conversions One instruction/clock Scalar SFU pipe – one instruction/4 clocks – R F C$ L1 Shared Mem also supports FMUL, MOV TEX pipe (external to SM, shared by all SM’s in a TPC) LD/ST pipe – Multithreaded Instruction Buffer RCP,RSQ,LG2,EX2,SIN,COS – I$ L1 thread register spill to memory, used for indexable registers CUDA has both global and local memory access through LD/ST © David Kirk/NVIDIA and Wen-mei W. Hwu, 2007 ECE 498AL, University of Illinois, Urbana-Champaign Operand Select MAD SFU Texture (Memory read) Clustering/Batching Use another independent Texture memory read to hide Texture memory latency – Instead of: – – – – TEX 0 (long latency) Dependent MATH 0 TEX 1 (long latency) Dependent MATH 1 Do: – – – – Use same thread to help hide own latency TEX 0 (long latency) TEX 1 (long latency - hidden) MATH 0 MATH 1 Compiler handles this! – But, you must have enough non-dependent LDs and Math © David Kirk/NVIDIA and Wen-mei W. Hwu, 2007 ECE 498AL, University of Illinois, Urbana-Champaign Load/Store (Memory read/write) Clustering/Batching Use LD to hide LD latency (non-dependent LD ops only) – Instead of: – – – – LD 0 (long latency) Dependent MATH 0 LD 1 (long latency) Dependent MATH 1 Do: – – – – Use same thread to help hide own latency LD 0 (long latency) LD 1 (long latency - hidden) MATH 0 MATH 1 Compiler handles this! – But, you must have enough non-dependent LDs and Math © David Kirk/NVIDIA and Wen-mei W. Hwu, 2007 ECE 498AL, University of Illinois, Urbana-Champaign Performance Issues the real gems of how to code efficient algorithms in CUDA (and G80+ in general!) CUDA Instruction Performance Instruction cycles (per warp) = sum of – – – Operand read cycles Instruction execution cycles (both memory access and FP) Result update cycles Therefore instruction throughput depends on – – – Nominal instruction throughput Memory latency Memory bandwidth © David Kirk/NVIDIA and Wen-mei W. 
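A minimal sketch of the load-clustering pattern above (axpyPair and its arguments are illustrative): both independent loads are issued before either result is consumed, so the second load's latency overlaps the first one's.

__global__ void axpyPair(const float* x, const float* y, float* out, float a)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;

    // Neither load depends on the other, so the compiler is free to keep them
    // batched ahead of the math that uses their results.
    float xv = x[i];        // LD 0 (long latency)
    float yv = y[i];        // LD 1 (latency hidden behind LD 0)

    out[i] = a * xv + yv;   // MATH that consumes both loads
}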
Hwu, 2007 ECE 498AL, University of Illinois, Urbana-Champaign Maximizing Instruction Throughput Minimize use of low-throughput instructions – Maximize use of high-bandwidth memory – – – – Will cover specifics later Maximize use of shared memory Maximize locality and synchrony of cached accesses Minimize accesses to (uncached) global and local memory Maximize coalescing of global memory accesses Optimize performance by overlapping memory accesses with HW computation – High arithmetic intensity programs – i.e. high ratio of math to memory transactions Many concurrent threads © David Kirk/NVIDIA and Wen-mei W. Hwu, 2007 ECE 498AL, University of Illinois, Urbana-Champaign Arithmetic Instruction Throughput int and float add, shift, min, max and float mul, mad: 2 cycles per warp – int multiply (*) is by default 32-bit – requires multiple cycles / warp Use __mul24() / __umul24() intrinsics for 2-cycle 24-bit int multiply Integer divide and modulo are expensive – – – Compiler will convert literal power-of-2 divides to shifts Be explicit in cases where compiler can’t tell that divisor is a power of 2! Useful trick: foo % n == foo & (n-1) if n is a power of 2 © David Kirk/NVIDIA and Wen-mei W. Hwu, 2007 ECE 498AL, University of Illinois, Urbana-Champaign Arithmetic Instruction Throughput Reciprocal, reciprocal square root, sin/cos, log, exp: 8 cycles per warp – – These are the versions prefixed with “__” Examples:__rcp(), __sin(), __exp() Other functions are combinations of the above – – y / x == rcp(x) * y == 10 cycles per warp sqrt(x) == rcp(rsqrt(x)) == 16 cycles per warp © David Kirk/NVIDIA and Wen-mei W. Hwu, 2007 ECE 498AL, University of Illinois, Urbana-Champaign Runtime Math Library There are two types of runtime math operations – __func(): direct mapping to hardware ISA – func() : compile to multiple instructions Fast but low accuracy (see prog. guide for details) Examples: __sin(x), __exp(x), __pow(x,y) Slower but higher accuracy (5 ulp or less) Examples: sin(x), exp(x), pow(x,y) The -use_fast_math compiler option forces every func() to compile to __func() © David Kirk/NVIDIA and Wen-mei W. Hwu, 2007 ECE 498AL, University of Illinois, Urbana-Champaign Inside the Hardware Need ~8 warps to typically saturate the MAD/SFU pipes Avoid many SFU calls (RCP,RSQ,SIN,etc) Optimized for FMUL operations SMs share instruction slots by integer ops, loads, stores, etc and floating point operations – The more floating point you fit, the more flops you get Keep instruction workload constant: spread fmads, memory fetches, etc © David Kirk/NVIDIA and Wen-mei W. Hwu, 2007 ECE 498AL, University of Illinois, Urbana-Champaign Make your program float-safe! Future hardware will have double precision support – – – G80 is single-precision only Double precision will have additional performance cost Careless use of double or undeclared types may run more slowly on G80+ Important to be float-safe (be explicit whenever you want single precision) to avoid using double precision where it is not needed – Add ‘f’ specifier on float literals: – foo = bar * 0.123; foo = bar * 0.123f; // double assumed // float explicit Use float version of standard library functions foo = sin(bar); foo = sinf(bar); © David Kirk/NVIDIA and Wen-mei W. 
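A small sketch combining the integer tips above: __mul24 for index arithmetic whose operands stay below 2^24, and a bitwise AND in place of a modulo by a power of two. WARP_SIZE is an assumed constant (32 on G80); interleave is an illustrative name.

#define WARP_SIZE 32

__global__ void interleave(float* data, int n)
{
    // 24-bit multiply is sufficient: block and thread indices are far below 2^24.
    int i = __mul24(blockIdx.x, blockDim.x) + threadIdx.x;

    if (i < n) {
        int lane = i & (WARP_SIZE - 1);    // same as i % WARP_SIZE, since WARP_SIZE is a power of 2
        data[i] += (float)lane;
    }
}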
Hwu, 2007 ECE 498AL, University of Illinois, Urbana-Champaign // double assumed // single precision explicit Deviations from IEEE-754 Addition and Multiplication are IEEE 754 compliant – However, often combined into multiply-add (FMAD) – Maximum 0.5 ulp error Intermediate result is truncated Division is non-compliant (2 ulp) Not all rounding modes are supported Denormalized numbers are not supported No mechanism to detect floating-point exceptions © David Kirk/NVIDIA and Wen-mei W. Hwu, 2007 ECE 498AL, University of Illinois, Urbana-Champaign GPU Floating Point Features G80 SSE IBM Altivec Cell SPE Precision IEEE 754 IEEE 754 IEEE 754 IEEE 754 Rounding modes for FADD and FMUL Round to nearest and round to zero All 4 IEEE, round to nearest, zero, inf, inf Round to nearest only Round to zero/truncate only Denormal handling Flush to zero Supported, 1000’s of cycles Supported, 1000’s of cycles Flush to zero NaN support Yes Yes Yes No Overflow and Infinity support Yes, only clamps to max norm Yes Yes No, infinity Flags No Yes Yes Some Square root Software only Hardware Software only Software only Division Software only Hardware Software only Software only Reciprocal estimate accuracy 24 bit 12 bit 12 bit 12 bit Reciprocal sqrt estimate accuracy 23 bit 12 bit 12 bit 12 bit log2(x) and 2^x estimates accuracy 23 bit No 12 bit No © David Kirk/NVIDIA and Wen-mei W. Hwu, 2007 ECE 498AL, University of Illinois, Urbana-Champaign How thread blocks are partitioned Thread blocks are partitioned into warps – – Partitioning is always the same – – Thread IDs within a warp are consecutive and increasing Warp 0 starts with Thread ID 0 Thus you can use this knowledge in control flow (Covered next) However, DO NOT rely on any ordering between warps – If there are any dependencies between threads, you must __syncthreads() to get correct results © David Kirk/NVIDIA and Wen-mei W. Hwu, 2007 ECE 498AL, University of Illinois, Urbana-Champaign Branching and Divergence “This way! No, that way!” SIMD Operation The SM is a multithreaded SIMD machine – SIMD allows overhead of fetch-decode-schedule to be amortized across many threads (the threads of a warp) – Implication is that higher percentage of SM area is computation units => better perf/area than say a multicore CPU However, only works if threads are truly “coherent” (in lock step executing the exact same instructions on different data sets) – Branches represent opportunities for thread “divergence” – When threads of a warp diverge, we lose a degree of SIMD – Loss of SIMD increases exponentially for each divergence until all threads of a warp are executed one-at-a-time – At that point, for a warp size of W, we operate at efficiency 1/W © David Kirk/NVIDIA and Wen-mei W. Hwu, 2007 ECE 498AL, University of Illinois, Urbana-Champaign Coding for Performance in the Face of Branches If you have to branch… – Do it as little as possible – The “divergence distance” in a shader should be as small as possible The distance between a branch that can diverge and an instruction which “resyncs” a divergence – a join point – SM provides means in the ISA to converge a set of threads at some common point © David Kirk/NVIDIA and Wen-mei W. 
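A sketch of keeping the divergence distance short (clampAndScale and its arguments are illustrative): threads diverge only for the small branch-specific assignments and re-converge before the common work, so the warp spends as little time as possible in serialized paths.

__global__ void clampAndScale(float* data, int n, float lo, float hi, float s)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;                 // only the last, partial warp can diverge here

    float v = data[i];

    // Divergent region kept as small as possible: one assignment per path.
    if (v < lo)      v = lo;
    else if (v > hi) v = hi;

    // All threads of the warp are back together for the shared work.
    data[i] = v * s;
}

Short branches like these are also good candidates for the instruction predication discussed below.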
Control Flow Instructions
The main performance concern with branching is divergence
– Threads within a single warp take different paths
– The different execution paths must be serialized
Avoid divergence when the branch condition is a function of the thread ID
– Example with divergence:
   if (threadIdx.x > 2) { }
   Branch granularity < warp size
– Example without divergence:
   if (threadIdx.x / WARP_SIZE > 2) { }
   Branch granularity is a whole multiple of the warp size
© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007 ECE 498AL, University of Illinois, Urbana-Champaign

Instruction Predication in G80
Comparison instructions set condition codes (CC)
Instructions can be predicated to write results only when the CC meets a criterion (CC != 0, CC >= 0, etc.)
The compiler may replace branches with instruction predication
– It tries to predict whether a branch condition is likely to produce many divergent warps
– If guaranteed not to diverge: only predicates if < 4 instructions
– If not guaranteed: only predicates if < 7 instructions
ALL predicated instructions take execution cycles
– Those with false conditions don't write their output or invoke memory loads and stores
– This saves branch instructions, so it can be cheaper than serializing divergent paths
© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007 ECE 498AL, University of Illinois, Urbana-Champaign

Memory Performance
Memory Bandwidth

Memory Instructions
Memory instructions take 2 cycles per warp to issue
– Global and local memory loads / stores (not cached)
– Constant and texture loads (cached)
– Shared memory reads / writes
Example:
   __shared__ float shared[];
   __device__ float global[];
   shared[threadIdx.x] = global[threadIdx.x];
– 2 cycles to issue the read from global (device) memory, 2 cycles to issue the write to shared memory
– 200-300 cycles to actually read a float from global (device) memory
– But this can be hidden by scheduling independent math instructions, or even other loads / stores, if there are enough active threads
© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007 ECE 498AL, University of Illinois, Urbana-Champaign

Memory Bandwidth
Effective bandwidth depends on access patterns
Minimize device memory accesses
– Much lower bandwidth than on-chip shared memory
Common CUDA kernel structure (a minimal sketch follows this part):
1. Load data from global memory to shared memory
2. __syncthreads()
3. Process the data in shared memory with many threads
4. __syncthreads() (if needed)
5. Store results from shared memory to global memory
Notes:
– Steps 2-4 may be repeated, looped, etc.
– Step 4 is not necessary if there is no dependence of the stored data on other threads
© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007 ECE 498AL, University of Illinois, Urbana-Champaign

Global, Local and Shared Memory
Local and global device memory are not cached on GeForce 8800 Series GPUs
– High latency, but launching more threads hides it
– Important to minimize accesses and optimize coherence
– Coalesce global memory accesses (more later)
Shared memory is on-chip, with very high bandwidth
– Low latency
– Like a user-managed per-multiprocessor cache
– But be careful to avoid bank conflicts (more later)
© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007 ECE 498AL, University of Illinois, Urbana-Champaign
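To make the five-step structure and the "user-managed cache" idea concrete, here is a minimal sketch (an assumed example, not from the slides) that stages a tile of data in shared memory, processes it, and writes it back; the kernel name, tile size, and the trivial smoothing step are illustrative assumptions.

    #define TILE 256   // threads per block; assumed for this sketch

    // Hypothetical kernel: each block stages TILE floats in shared memory,
    // smooths each element with its neighbors, and writes the result back.
    __global__ void smoothTile(float *out, const float *in, int n)
    {
        __shared__ float tile[TILE];

        int i = blockIdx.x * TILE + threadIdx.x;

        // 1. Load from global to shared memory (coalesced: thread N reads base + N)
        tile[threadIdx.x] = (i < n) ? in[i] : 0.0f;

        // 2. Make the whole tile visible to every thread in the block
        __syncthreads();

        // 3. Process in shared memory; neighbors come from the fast on-chip copy
        float left   = (threadIdx.x > 0)        ? tile[threadIdx.x - 1] : tile[threadIdx.x];
        float right  = (threadIdx.x < TILE - 1) ? tile[threadIdx.x + 1] : tile[threadIdx.x];
        float result = 0.25f * left + 0.5f * tile[threadIdx.x] + 0.25f * right;

        // 4. No second __syncthreads() needed: the stored value depends only on
        //    data this thread already read (step 4 is optional, as noted above)
        // 5. Store back to global memory (again coalesced)
        if (i < n)
            out[i] = result;
    }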
Texture and Constant Memory
The texture partition is cached
– Uses the same texture cache as graphics
– Optimized for 2D spatial locality
– Best performance when the threads of a warp read locations that are close together in 2D
Constant memory is cached
– 2 cycles per address read within a single warp
– Total cost 2 cycles if all threads in a warp read the same address
– Total cost 32 cycles if all threads read different addresses
© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007 ECE 498AL, University of Illinois, Urbana-Champaign

Registers
Register reads are generally "free"
But delays can be caused by
– Register read-after-write dependencies
– Register memory bank conflicts
Register bank conflicts are minimized by the thread scheduler
– No programmer control
– No need to pack data into float4 or int4 types
© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007 ECE 498AL, University of Illinois, Urbana-Champaign

Data Transfers
Device-to-host memory bandwidth is much lower than device-to-device bandwidth
Minimize transfers
– Intermediate data structures can be allocated, operated on, and deallocated without ever copying them to host memory
Group transfers
– One large transfer is much better than many small ones
© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007 ECE 498AL, University of Illinois, Urbana-Champaign

Memory Performance
Hiding Latencies

Highlights So Far
Whenever you make a memory access, do as many computations as possible to hide the latency!
– The same computation executed on many data elements in parallel
– Low control flow overhead
– Many calculations per memory access
© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007 ECE 498AL, University of Illinois, Urbana-Champaign

Texture (Memory Read) Clustering/Batching
Use another independent texture memory read to hide texture memory latency
– The same thread helps hide its own latency
Instead of:
– TEX 0 (long latency)
– Dependent MATH 0
– TEX 1 (long latency)
– Dependent MATH 1
Do:
– TEX 0 (long latency)
– TEX 1 (long latency - hidden)
– MATH 0
– MATH 1
The compiler handles this!
– But you must have enough non-dependent reads and math
© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007 ECE 498AL, University of Illinois, Urbana-Champaign

Load/Store (Memory Read/Write) Clustering/Batching
Use an LD to hide LD latency (non-dependent LD ops only)
– The same thread helps hide its own latency
Instead of:
– LD 0 (long latency)
– Dependent MATH 0
– LD 1 (long latency)
– Dependent MATH 1
Do:
– LD 0 (long latency)
– LD 1 (long latency - hidden)
– MATH 0
– MATH 1
The compiler handles this!
– But you must have enough non-dependent LDs and math (see the sketch after this part)
© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007 ECE 498AL, University of Illinois, Urbana-Champaign

Load Groups
FP:LD ratio ranges from 8:1 to 32:1
– Mimics the FP:TEX ratio => FP:LD ratio
– Higher ratios imply less memory bandwidth is needed to keep the FP units busy
Need high data re-use for memory operands
– Make use of the on-chip shared memory as a type of SW-controlled cache, for higher data reuse rates
Larger "LD groups"
– Code programs to dispatch multiple loads before the first "use" of a load result
© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007 ECE 498AL, University of Illinois, Urbana-Champaign
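As referenced above, a minimal sketch (an illustration, not from the slides) of the load-grouping idea: the two independent global loads are issued back to back before either result is used, so the second load's latency overlaps the first's. The names and the simple computation are assumptions.

    // Hypothetical kernel: each thread combines one element from two arrays.
    __global__ void combine(float *out, const float *a, const float *b, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) {
            // Dispatch both independent loads before the first use of either
            // result, so their long latencies overlap (the compiler will
            // usually schedule independent loads this way on its own).
            float x = a[i];            // LD 0 (long latency)
            float y = b[i];            // LD 1 (latency hidden behind LD 0)
            out[i] = x * y + 1.0f;     // MATH only after both loads are in flight
        }
    }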
Memory Access Strategies
Coalescing Global Memory Access

Coalesced Loads and Stores
Local and global (__device__) memory are not cached on G80
– Important to minimize accesses and optimize coherence
If the per-thread memory accesses for a single warp form a contiguous range of addresses, the accesses will be coalesced into a single access
– Coalesced accesses are much faster than non-coalesced ones
– Non-coalesced accesses are serialized
Thread N within a warp should access address BaseAddress + size * N (a minimal sketch follows at the end of this part)
– size is the byte size of each memory block read/written: 4, 8, or 16
– BaseAddress must be aligned to 16 * size
© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007 ECE 498AL, University of Illinois, Urbana-Champaign

Memory Access Strategies
Shared Memory Bank Conflicts

Parallel Memory Architecture
In a parallel machine, many threads access memory
– Therefore, memory is divided into banks
– Essential to achieve high bandwidth
Each bank can service one address per cycle
– A memory can service as many simultaneous accesses as it has banks
Multiple simultaneous accesses to a bank result in a bank conflict
– Conflicting accesses are serialized
[Figure: shared memory organized as Banks 0-15]
© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007 ECE 498AL, University of Illinois, Urbana-Champaign

Bank Addressing Examples
No bank conflicts
– Linear addressing, stride == 1 (thread i hits bank i)
No bank conflicts
– Random 1:1 permutation (each thread hits a distinct bank)
© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007 ECE 498AL, University of Illinois, Urbana-Champaign

Bank Addressing Examples
2-way bank conflicts
– Linear addressing, stride == 2 (threads i and i+8 hit the same bank)
8-way bank conflicts
– Linear addressing, stride == 8 (eight threads hit the same bank)
[Figure: thread-to-bank mappings for the four addressing patterns above]
© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007 ECE 498AL, University of Illinois, Urbana-Champaign

How addresses map to banks on G80
Each bank has a bandwidth of 32 bits per clock cycle
Successive 32-bit words are assigned to successive banks
G80 has 16 banks
– So bank = (32-bit word address) % 16
– Same as the size of a half-warp
No bank conflicts between different half-warps, only within a single half-warp
© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007 ECE 498AL, University of Illinois, Urbana-Champaign

Shared Memory Bank Conflicts
Shared memory is as fast as registers if there are no bank conflicts
The fast case:
– If all threads of a half-warp access different banks, there is no bank conflict
– If all threads of a half-warp access the identical address, there is no bank conflict (broadcast)
The slow case:
– Bank conflict: multiple threads in the same half-warp access the same bank
– The accesses must be serialized
– Cost = maximum number of simultaneous accesses to a single bank
© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007 ECE 498AL, University of Illinois, Urbana-Champaign
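As referenced above, a minimal sketch (an assumed example, not from the slides) of a coalesced access pattern: thread N of each warp reads BaseAddress + 4*N, i.e. consecutive floats. The commented-out strided variant is shown only for contrast.

    // Hypothetical kernel: copy with a coalesced access pattern.
    __global__ void copyCoalesced(float *out, const float *in, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) {
            // Coalesced: thread N of each warp touches BaseAddress + size * N
            // (size = 4 bytes for float), a contiguous, aligned range.
            out[i] = in[i];

            // Non-coalesced (for contrast only, and assuming a large enough
            // input): a stride of 16 floats scatters the warp's addresses
            // and the accesses are serialized.
            // out[i] = in[i * 16];
        }
    }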
Linear Addressing
Given:
   __shared__ float shared[256];
   float foo = shared[baseIndex + s * threadIdx.x];
This is only bank-conflict-free if s shares no common factors with the number of banks
– 16 banks on G80, so s must be odd
[Figure: thread-to-bank mapping for s=1 (conflict-free) and s=3 (conflict-free)]
© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007 ECE 498AL, University of Illinois, Urbana-Champaign

Data Types and Bank Conflicts
This has no conflicts if the type of shared is 32 bits:
   foo = shared[baseIndex + threadIdx.x];
But not if the data type is smaller
– 4-way bank conflicts:
   __shared__ char shared[];
   foo = shared[baseIndex + threadIdx.x];
– 2-way bank conflicts:
   __shared__ short shared[];
   foo = shared[baseIndex + threadIdx.x];
© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007 ECE 498AL, University of Illinois, Urbana-Champaign

Structs and Bank Conflicts
Struct assignments compile into as many memory accesses as there are struct members:
   struct vector { float x, y, z; };
   struct myType { float f; char c; };
   __shared__ struct vector vectors[64];
   __shared__ struct myType myTypes[64];
This has no bank conflicts; the struct size is 3 words
– 3 accesses per thread, contiguous banks (no common factor with 16)
   struct vector v = vectors[baseIndex + threadIdx.x];
This has 2-way bank conflicts; the struct size is 5 bytes (2 accesses per thread)
   struct myType m = myTypes[baseIndex + threadIdx.x];
© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007 ECE 498AL, University of Illinois, Urbana-Champaign

Common Bank Conflict Patterns (1)
Each thread loads 2 elements into shared memory
– 2-way-interleaved loads result in 2-way bank conflicts:
   int tid = threadIdx.x;
   shared[2*tid]     = global[2*tid];
   shared[2*tid + 1] = global[2*tid + 1];
– Better not to interleave:
   shared[tid]              = global[tid];
   shared[tid + blockDim.x] = global[tid + blockDim.x];
© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007 ECE 498AL, University of Illinois, Urbana-Champaign

Common Bank Conflict Patterns (2)
Operating on a 2D array of floats in shared memory, e.g. image processing
Example: a 16x16 block
– Each thread processes a row
– So the threads in a block access the columns simultaneously
– 16-way bank conflicts: all rows start at bank 0
Solution 1) pad the rows (a minimal sketch follows this part)
– Add one float to the end of each row, so consecutive rows start at different banks
Solution 2) transpose before processing
– Suffer bank conflicts during the transpose
– But possibly save them later
[Figure: bank indices for the 16x16 block, without padding (every row starts at bank 0) and with one-float padding (each row starts one bank later than the previous row)]
© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007 ECE 498AL, University of Illinois, Urbana-Champaign
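As referenced under Solution 1 above, a minimal sketch (an assumed example, not from the slides) of padding a 16x16 shared-memory tile with one extra float per row; the kernel name and the simple row-sum computation are illustrative, and a (16,16) block is assumed.

    #define TILE_DIM 16   // 16x16 block, matching the example above

    // Hypothetical kernel: stage one 16x16 tile, then let each thread of the
    // first half-warp sum one row of the tile.
    __global__ void rowSums(float *rowSum, const float *in)
    {
        // The "+ 1" pads each row by one float, so row r starts at bank
        // (17*r) % 16, a different bank for every row. Without the padding
        // all rows start at bank 0.
        __shared__ float tile[TILE_DIM][TILE_DIM + 1];

        int x = threadIdx.x;   // column
        int y = threadIdx.y;   // row

        // Coalesced load: consecutive threads in x read consecutive global floats.
        tile[y][x] = in[y * TILE_DIM + x];
        __syncthreads();

        // Each thread of the first half-warp processes one ROW, as in the slide:
        // at loop step c the 16 threads all touch column c of their own rows.
        // Unpadded, those addresses are 16*row + c and all map to bank c % 16
        // (a 16-way conflict); padded, they map to (17*row + c) % 16, i.e. 16
        // different banks.
        if (y == 0) {
            int row = x;   // thread x owns row x
            float sum = 0.0f;
            for (int c = 0; c < TILE_DIM; ++c)
                sum += tile[row][c];
            rowSum[row] = sum;
        }
    }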
Resource Management and Occupancy
Scarce Resources

Performance Variables
Shorter programs are overhead limited; longer programs are instruction-rate limited
– Must have enough threads per thread block: at least 192, more is better
– Must have enough thread blocks: at least 32, more is better
Register file (RF) load balancing
– RF space is commonly in high demand
© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007 ECE 498AL, University of Illinois, Urbana-Champaign

Performance Variables (2)
Compiler quality is important for good performance
– Minimize register usage in CUDA programs: reduces spilling to memory
– Interleave non-dependent FP/DATA ops: maximizes the issue rate
– Cluster non-dependent texture and memory reads: decreases program latency
© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007 ECE 498AL, University of Illinois, Urbana-Champaign

Resource Management and Occupancy
Optimizing Threads per Block

Optimizing threads per block
Given the total threads in a grid, choose the block size and number of blocks to maximize utilization of the device
– Enough threads per block to keep the machine busy
– Enough blocks to avoid idle multiprocessors during syncs
   If multiple blocks exist that aren't all waiting at a __syncthreads(), the machine can stay busy
   Blocks per multiprocessor > 2 increases efficiency and helps cover memory latency
   Blocks stream through the machine in pipeline fashion
– Per-block resources (shared memory and registers) at most half of the total available
   So multiple blocks can coexist in a multiprocessor
© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007 ECE 498AL, University of Illinois, Urbana-Champaign

Optimizing threads per block
Choose threads per block as a multiple of the warp size
– Avoid wasting computation on under-populated warps
More threads per block == fewer registers per thread
– Kernel invocations can fail if too many registers are used
Heuristics
– Minimum: 64 threads per block, and only if there are multiple concurrent blocks
– 192 or 256 threads is a better choice: usually still enough registers to compile and invoke successfully
Blocks per grid > 100 to scale to future devices
– 1000 blocks per grid will scale across multiple generations
– Of course this depends on the problem size
© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007 ECE 498AL, University of Illinois, Urbana-Champaign

Optimizing threads per block
Only up to 8 thread blocks can be active on a multiprocessor
– If your block has only 8 threads, the maximum number of active threads is 8x8 = 64 (occupancy = 64/768 = 1/12)
A compute-intensive kernel with few threads per block may still be faster
– 64 threads/block with loop unrolling was nearly 2x faster than 256 threads/block without unrolling
© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007 ECE 498AL, University of Illinois, Urbana-Champaign
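A minimal host-side sketch (an assumed example, not from the slides) that applies the heuristics above: 256 threads per block (a multiple of the warp size, within the 512-thread limit) and enough blocks to cover the data; the kernel and function names are illustrative.

    #include <cuda_runtime.h>

    // Hypothetical kernel declared elsewhere.
    __global__ void process(float *data, int n);

    void launchProcess(float *d_data, int n)
    {
        // 256 threads/block: a multiple of the 32-thread warp size and large
        // enough to keep a multiprocessor busy.
        const int threadsPerBlock = 256;

        // Enough blocks to cover all n elements; for reasonably large n this
        // also satisfies the "> 100 blocks per grid" guideline.
        int blocksPerGrid = (n + threadsPerBlock - 1) / threadsPerBlock;

        process<<<blocksPerGrid, threadsPerBlock>>>(d_data, n);
    }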
Parameterize Your Application
Parameterization helps adaptation to different GPUs
GPUs vary in many ways
– Number of multiprocessors
– Shared memory size
– Register file size
– Threads per block
– Memory bandwidth
You can even make apps self-tuning (like FFTW)
– An "experiment" mode discovers and saves the optimal configuration
(A sketch of querying these parameters at runtime follows at the end of this part.)
© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007 ECE 498AL, University of Illinois, Urbana-Champaign

Design Evaluation
Key questions to ask
– How many threads can be supported?
– How many threads are needed?
– How are the data structures shared?
– Is there enough work in each thread between synchronizations to make parallel execution worthwhile?
© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007 ECE 498AL, University of Illinois, Urbana-Champaign

Design Evaluation
Example: Matrix Multiplication
– Each thread likely needs at least 32 FP operations between synchronizations to make parallel processing worthwhile
– At least 192 threads are needed in a block to fully utilize the hardware
– The M and N sub-blocks of each thread group must fit into the 16 KB of shared memory
– The design will likely end up with about 16x16 sub-blocks given all the constraints
– The minimal matrix size is around 1K elements in each dimension to make parallel execution worthwhile
© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007 ECE 498AL, University of Illinois, Urbana-Champaign

Conclusion
CUDA and the GeForce 8800 can achieve great results on data-parallel computations if you use a few performance strategies
– Optimize for memory locality
– Size thread blocks to maximize multiprocessor utilization and reduce memory stalls
– Ensure memory addresses are coalesced
– Avoid shared memory bank conflicts
© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007 ECE 498AL, University of Illinois, Urbana-Champaign

General Guidelines
Block count should be at least double the number of SMs: 16 x 2 = 32
– Number of blocks per grid at least 100
Shared memory allocated per block should be at most half of the total available shared memory: 16 KB / 2 = 8 KB
Threads per block should be a multiple of 64
– Ideally each SM has more than 192 threads
– A minimum of 64 threads only if there are lots of concurrent blocks
– Usually, 192 or 256 threads per block is good
– The maximum is 512 threads per block
– Use a block granularity of 16x16 (16x4 is OK)
Maximize arithmetic intensity
– Hide memory access latencies
Branches
– May not be worth it if the instruction count is 5 or less
– Pre-compute values if possible
– Avoid branches when the result may be pre-determined
– Make neighboring threads follow the same execution path!
© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007 ECE 498AL, University of Illinois, Urbana-Champaign

Possible Performance Hotspots (to get 100% occupancy)
1. registers <= 10
2. threads/block mod 32 == 0
3. warps/block is a divisor of 24
4. shared mem/block <= 16 KB * (warps/block) / 24 (any alignment constraint?)
Constant memory does not come into it, as it is the same for all blocks.
Then run N * 16 * 24 / (warps/block) blocks, assuming they all execute for the same time.
Since only a few thread counts satisfy those requirements, we can summarize:
Max registers: 10
Threads per block / Max shared mem per block (bytes)
– 96 / 2048
– 128 / 2730
– 192 / 4096
– 256 / 5461
– 384 / 8192
© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007 ECE 498AL, University of Illinois, Urbana-Champaign
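As referenced in the parameterization slide above, a minimal host-side sketch (an assumed example, not from the slides) that queries the device properties the guidelines depend on and derives a launch budget from them instead of hard-coding G80 values; the function name is illustrative.

    #include <cstdio>
    #include <cuda_runtime.h>

    void printLaunchParameters()
    {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, 0);   // device 0; error checking omitted

        printf("Multiprocessors:        %d\n", prop.multiProcessorCount);
        printf("Shared memory / block:  %zu bytes\n", prop.sharedMemPerBlock);
        printf("Registers / block:      %d\n", prop.regsPerBlock);
        printf("Max threads / block:    %d\n", prop.maxThreadsPerBlock);
        printf("Warp size:              %d\n", prop.warpSize);

        // Apply the deck's guidelines using the queried values rather than the
        // G80 constants (16 SMs, 16 KB shared memory, 512 threads/block).
        int    minBlocks    = 2 * prop.multiProcessorCount;   // at least 2x the SM count
        size_t sharedBudget = prop.sharedMemPerBlock / 2;     // at most half per block
        printf("Suggested minimum blocks per grid: %d\n", minBlocks);
        printf("Suggested shared memory budget per block: %zu bytes\n", sharedBudget);
    }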
References
CUDA home page
– http://developer.nvidia.com/object/cuda.html
Official CUDA forum
– http://forums.nvidia.com/index.php?showforum=62
University of Illinois Parallel Computing Course
– http://courses.ece.uiuc.edu/ece498/al/
– Presentations
– Pod-casts
– NVIDIA Chief Scientist David Kirk was present at all classes!
– Source of inspiration for the slides shown
© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007 ECE 498AL, University of Illinois, Urbana-Champaign