GPU Tutorial: How To Program for GPUs
Krešimir Ćosić, University of Split, Croatia

High Performance Computing with GPUs: An Introduction
Krešimir Ćosić <cosic.kresimir@gmail.com>, Thursday, August 12th, 2010.
LSST All Hands Meeting 2010, Tucson, AZ

Overview
• CUDA
• Hardware architecture
• Programming model
• Convolution on GPU

CUDA
• 'Compute Unified Device Architecture'
  - Hardware and software architecture for issuing and managing computations on the GPU
  - Massively parallel architecture: over 8000 threads in flight is common
• C for CUDA (C++ for CUDA)
  - C/C++ language with some additions and restrictions
• Enables GPGPU: 'General-Purpose computing on GPUs'

GPU: a multithreaded coprocessor
[Diagram: one SM built from SPs plus its shared memory, sitting above the global memory on the device]
• SP: scalar processor ('CUDA core')
  - Executes one thread
• SM: streaming multiprocessor
  - 32 SPs (or 16, 48 or more)
  - Fast local 'shared memory' (shared between the SPs), 16 KiB (or 64 KiB)
• Global memory (on device)

GPU: SMs
• 30 SMs on GT200, 14-16 SMs on Fermi
• For example, the GTX 480: 15 SMs x 32 cores = 480 cores on one GPU
• GDDR global memory on the device: 512 MiB - 6 GiB

How To Program For GPUs
• Parallelization
  - Decomposition to threads
• Memory
  - Shared memory, global memory

Important Things To Keep In Mind
• Avoid divergent branches
  - Threads of a single SM must be executing the same code
  - Code that branches heavily and unpredictably will execute slowly
• Threads should be as independent as possible
  - Synchronization and communication can be done efficiently only between threads of a single multiprocessor

How To Program For GPUs (recap)
• Parallelization
  - Enormous processing power
  - Decomposition to threads
  - Avoid divergence
• Memory
  - Shared memory, global memory
• Thread communication
  - Synchronization, no interdependencies
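As a side note (not from the talk), the SM count, shared memory size, and global memory size of a particular card can be queried at run time with the CUDA runtime API. A minimal sketch using cudaGetDeviceProperties:

    #include <cstdio>
    #include <cuda_runtime.h>

    int main()
    {
        int deviceCount = 0;
        cudaGetDeviceCount(&deviceCount);

        for (int dev = 0; dev < deviceCount; dev++)
        {
            cudaDeviceProp prop;
            cudaGetDeviceProperties(&prop, dev);

            printf("Device %d: %s\n", dev, prop.name);
            printf("  Multiprocessors (SMs):   %d\n", prop.multiProcessorCount);
            printf("  Shared memory per block: %zu bytes\n", prop.sharedMemPerBlock);
            printf("  Global memory:           %zu bytes\n", prop.totalGlobalMem);
            printf("  Compute capability:      %d.%d\n", prop.major, prop.minor);
        }
        return 0;
    }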
Programming model

Thread blocks
[Diagram: BLOCK 1 containing a 2x3 arrangement of threads, THREAD (0,0) ... THREAD (1,2)]
• Threads are grouped in thread blocks
  - typically 128, 192 or 256 threads in a block
• One thread block executes on one SM
  - All of its threads share the SM's 'shared memory'
  - 32 threads are executed simultaneously (a 'warp')

Thread blocks
[Diagram: a grid of blocks, BLOCK 0 ... BLOCK 8]
• Blocks execute on SMs
  - execute in parallel
  - execute independently!
• Blocks form a GRID
• Thread ID is unique within a block
• Block ID is unique within the grid

Code that executes on the GPU: Kernels
• Kernel:
  - a simple C function
  - executes on the GPU
  - executes in parallel, as many times as there are threads
• The keyword __global__ tells the compiler to make a function a kernel
  (and compile it for the GPU instead of the CPU)

Convolution
To get one pixel of the output image:
  - multiply the mask (pixelwise) with the image at the corresponding position
  - sum the products

    __global__ void Convolve(float* img, int imgW, int imgH,
                             float* filt, int filtW, int filtH,
                             float* out)
    {
        const int nThreads = blockDim.x * gridDim.x;            // total threads in the grid
        const int idx = blockIdx.x * blockDim.x + threadIdx.x;  // globally unique thread ID

        const int outW = imgW - filtW + 1;
        const int outH = imgH - filtH + 1;
        const int nPixels = outW * outH;

        // Each thread processes every nThreads-th output pixel
        for (int curPixel = idx; curPixel < nPixels; curPixel += nThreads)
        {
            int x = curPixel % outW;
            int y = curPixel / outW;

            float sum = 0;
            for (int filtY = 0; filtY < filtH; filtY++)
                for (int filtX = 0; filtX < filtW; filtX++)
                {
                    int sx = x + filtX;
                    int sy = y + filtY;
                    sum += img[sy * imgW + sx] * filt[filtY * filtW + filtX];
                }
            out[y * outW + x] = sum;
        }
    }

Setup and data transfer
• cudaMalloc - allocate memory on the GPU (global memory)
• cudaMemcpy - transfer data to and from the GPU (global memory)
• GPU is the 'device', CPU is the 'host'
• Kernel call syntax: kernel<<<grid, block>>>(arguments)
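Before the full convolution example, here is a minimal illustrative sketch (not from the talk; the names Scale, ScaleOnGPU and dataGPU are made up) showing the __global__ qualifier, per-thread indexing, and the <<<grid, block>>> launch syntax:

    // Toy kernel: scale an array in global memory by a constant factor.
    __global__ void Scale(float* data, int n, float factor)
    {
        int idx = blockIdx.x * blockDim.x + threadIdx.x;  // globally unique thread ID
        if (idx < n)                                      // guard: grid may have more threads than elements
            data[idx] *= factor;
    }

    // Host-side launch: enough blocks of 128 threads to cover n elements.
    // dataGPU is assumed to be a device pointer previously allocated with cudaMalloc.
    void ScaleOnGPU(float* dataGPU, int n, float factor)
    {
        dim3 block(128);
        dim3 grid((n + block.x - 1) / block.x);
        Scale<<<grid, block>>>(dataGPU, n, factor);
    }

The host code below follows the same pattern for the Convolve kernel.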
    int main()
    {
        ...
        float* img ...
        int imgW, imgH ...

        float* imgGPU;
        cudaMalloc((void**)&imgGPU, imgW * imgH * sizeof(float));
        cudaMemcpy(imgGPU,                           // Destination
                   img,                              // Source
                   imgW * imgH * sizeof(float),      // Size in bytes
                   cudaMemcpyHostToDevice);          // Direction

        float* filter ...
        int filterW, filterH ...

        float* filterGPU;
        cudaMalloc((void**)&filterGPU, filterW * filterH * sizeof(float));
        cudaMemcpy(filterGPU,                          // Destination
                   filter,                             // Source
                   filterW * filterH * sizeof(float),  // Size in bytes
                   cudaMemcpyHostToDevice);            // Direction

        int resultW = imgW - filterW + 1;
        int resultH = imgH - filterH + 1;
        float* result = (float*)malloc(resultW * resultH * sizeof(float));

        float* resultGPU;
        cudaMalloc((void**)&resultGPU, resultW * resultH * sizeof(float));

        /* Call the GPU kernel */
        dim3 block(128);
        dim3 grid(30);
        Convolve<<<grid, block>>>(imgGPU, imgW, imgH,
                                  filterGPU, filterW, filterH,
                                  resultGPU);

        cudaMemcpy(result,                             // Destination
                   resultGPU,                          // Source
                   resultW * resultH * sizeof(float),  // Size in bytes
                   cudaMemcpyDeviceToHost);            // Direction

        cudaThreadExit();
        ...
    }

Speedup
• Linear combination of 3 filters, each sized 15x15
• Image size: 2k x 2k
• CPU: Core 2 @ 2.0 GHz (1 core): 6.58 s (0.89 Mpixels/s)
• GPU: Tesla S1070 (GT200), 30 SMs, 240 CUDA cores, 1.3 GHz: 0.21 s (27.99 Mpixels/s)
• 31 times faster!

CUDA capabilities (compute capability)
• 1.0  GeForce 8800 Ultra/GTX/GTS
• 1.1  GeForce 9800 GT/GTX, GTS 250: + atomic instructions ...
• 1.2  GeForce GT 220
• 1.3  Tesla S1070, C1060, GeForce GTX 275, 285: + double precision (slow) ...
• 2.0  Tesla C2050, GeForce GTX 480, 470: + ECC, L1 and L2 cache, faster IMUL, faster atomics, faster double precision on Tesla cards ...

CUDA essentials
developer.nvidia.com/object/cuda_3_1_downloads.html
Download:
• Driver
• Toolkit (compiler nvcc)
• SDK (examples)
• CUDA Programming Guide (recommended)

Other tools
• 'Emulator'
  - executes on the CPU
  - slow
• Simple profiler
• cuda-gdb (Linux) - on-device debugger
• Parallel Nsight (Vista)

Logical thread hierarchy
• Thread ID - unique within a block
• Block ID - unique within the grid
• To get a globally unique thread ID: combine the block ID and the thread ID
• Threads can access both shared and global memory
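For the 2D case suggested by the thread-block diagrams, combining the block ID and thread ID looks like the following sketch (not from the talk; the kernel name Fill2D and the pointer outGPU are made up for illustration):

    // Combine the 2D block index and 2D thread index into global (x, y) coordinates.
    __global__ void Fill2D(float* out, int width, int height, float value)
    {
        int x = blockIdx.x * blockDim.x + threadIdx.x;
        int y = blockIdx.y * blockDim.y + threadIdx.y;
        if (x < width && y < height)
            out[y * width + x] = value;   // 'out' lives in global memory
    }

    // Example launch: 16x16 threads per block, enough blocks to cover the image.
    //   dim3 block(16, 16);
    //   dim3 grid((width + 15) / 16, (height + 15) / 16);
    //   Fill2D<<<grid, block>>>(outGPU, width, height, 0.0f);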