Lecture 8: CUDA

CUDA
- A scalable parallel programming model for GPUs and multicore CPUs
- Provides facilities for heterogeneous programming
- Allows the GPU to be used both as a graphics processor and as a computing processor

Pollack's Rule
- Performance increase is roughly proportional to the square root of the increase in complexity: performance ∝ √complexity
- Power consumption increases roughly linearly with complexity: power consumption ∝ complexity

CUDA SPMD (Single Program Multiple Data) Programming Model
- The programmer writes code for a single thread; the GPU runs many thread instances in parallel
- Extends C and C++

CUDA: Three Key Abstractions
- A hierarchy of thread groups
- Shared memories
- Barrier synchronization
CUDA provides fine-grained data parallelism and thread parallelism nested within coarse-grained data parallelism and task parallelism.

CUDA
- Kernel: a function designed to be executed by many threads
- Thread block: a set of concurrent threads that can cooperate among themselves through barrier synchronization and shared-memory access
- Grid: a set of thread blocks that execute the same kernel function

CUDA

__global__ void mykernel(int a, ...)
{
    ...
}

main()
{
    ...
    nblocks = N/512;                      // max. 512 threads per block
    mykernel<<<nblocks, 512>>>(aa, ...);
    ...
}

CUDA
- Thread management is performed by hardware
- Max. 512 threads per block
- The number of blocks can exceed the number of processors
- Blocks execute independently and in any order
- Threads of a block can communicate through shared memory
- Atomic memory operations exist on the global memory

CUDA Memory Types
- Local memory: private to a thread
- Shared memory: shared by all threads of the block (__shared__)
- Device memory: shared by all threads of an application (__device__)

CUDA

__global__ void mykernel(float* a, ...)
{
    ...
}

main()
{
    ...
    int nbytes = N*sizeof(float);
    float* ha = (float*)malloc(nbytes);
    float* da = 0;
    cudaMalloc((void**)&da, nbytes);
    cudaMemcpy(da, ha, nbytes, cudaMemcpyHostToDevice);
    mykernel<<<N/blocksize, blocksize>>>(da, ...);
    cudaMemcpy(ha, da, nbytes, cudaMemcpyDeviceToHost);
    cudaFree(da);
    ...
}
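The skeleton above can be filled out into a complete program. The following is a minimal, compilable sketch of the full allocate / copy / launch / copy-back pattern; the kernel vec_add, the problem size N, and the block size are illustrative choices, not part of the lecture:

#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>

#define N 4096
#define blocksize 512                              // max. 512 threads per block

__global__ void vec_add(float *a, float *b, float *c)
{
    int i = blockIdx.x*blockDim.x + threadIdx.x;
    if (i < N)                                     // guard against a partial last block
        c[i] = a[i] + b[i];
}

int main(void)
{
    int nbytes = N*sizeof(float);
    float *ha = (float*)malloc(nbytes);
    float *hb = (float*)malloc(nbytes);
    float *hc = (float*)malloc(nbytes);
    for (int i = 0; i < N; i++) { ha[i] = (float)i; hb[i] = 2.0f*i; }

    float *da, *db, *dc;
    cudaMalloc((void**)&da, nbytes);
    cudaMalloc((void**)&db, nbytes);
    cudaMalloc((void**)&dc, nbytes);

    cudaMemcpy(da, ha, nbytes, cudaMemcpyHostToDevice);
    cudaMemcpy(db, hb, nbytes, cudaMemcpyHostToDevice);

    vec_add<<<N/blocksize, blocksize>>>(da, db, dc);   // 8 blocks of 512 threads

    cudaMemcpy(hc, dc, nbytes, cudaMemcpyDeviceToHost);
    printf("hc[100] = %f\n", hc[100]);                 // expect 300.0

    cudaFree(da); cudaFree(db); cudaFree(dc);
    free(ha); free(hb); free(hc);
    return 0;
}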
CUDA Synchronization
- Barrier: threads wait until all threads in the block arrive at the barrier (__syncthreads())
- A thread reaching the barrier increments the barrier count and the scheduler marks the thread as waiting
- When all threads have arrived at the barrier, the scheduler releases all waiting threads

CUDA

__global__ void shift_reduce(int *inp, int N, int *tot)
{
    unsigned int tid = threadIdx.x;
    unsigned int i = blockIdx.x*blockDim.x + threadIdx.x;
    __shared__ int x[blocksize];
    x[tid] = (i < N) ? inp[i] : 0;
    __syncthreads();
    for (int s = blockDim.x/2; s > 0; s = s/2) {
        if (tid < s)
            x[tid] += x[tid + s];
        __syncthreads();
    }
    if (tid == 0)
        atomicAdd(tot, x[tid]);
}

CUDA SPMD (Single Program Multiple Data) Programming Model
- All threads execute the same program
- Threads coordinate with barrier synchronization
- Threads of a block express fine-grained data parallelism and thread parallelism
- Independent blocks of a grid express coarse-grained data parallelism
- Independent grids express coarse-grained task parallelism

CUDA Scheduler
- Hardware management and scheduling of threads and thread blocks
- The scheduler has minimal runtime overhead

CUDA Multithreading
- Memory and texture fetch latency requires hundreds of processor clock cycles
- While one thread is waiting for a load or texture fetch, the processor can execute another thread
- Thousands of independent threads can keep many processors busy

GPU Multiprocessor Architecture
- Lightweight thread creation
- Zero-overhead thread scheduling
- Fast barrier synchronization
- Each thread has its own private registers, private per-thread memory, PC, and thread execution state
- Supports very fine-grained parallelism

GPU Multiprocessor Architecture
Each SP core
- contains scalar integer and floating-point units
- is hardware multithreaded and supports up to 64 threads
- is pipelined and executes one instruction per thread per clock
- has a large register file (RF) of 1024 32-bit registers; the registers are partitioned among the assigned threads
Programs declare their register demand and the compiler optimizes register allocation. For example: (a) 32 registers per thread gives 256 threads per block, (b) fewer registers per thread allow more threads, or (c) more registers per thread allow fewer threads.

Single Instruction Multiple Thread (SIMT)
- SIMT: a processor architecture that applies one instruction to multiple independent threads in parallel
- Warp: the set of parallel threads that execute the same instruction together in a SIMT architecture
- The warp size is 32 threads (4 threads per SP, executed in 4 clock cycles)
- Threads in a warp start at the same program address, but they can branch and execute independently
- Individual threads may be inactive due to independent branching

SIMT Warp Execution
- There are 4 thread lanes per SP
- An issued warp instruction executes in 4 processor cycles
- The instruction scheduler selects a warp every 4 clocks
- The controller:
  - collects thread programs into warps
  - allocates a warp
  - allocates registers for the warp's threads (it can start a warp only when it can allocate the requested register count)
  - starts warp execution
  - frees the registers when all threads exit

Streaming Processor (SP)
- Has 1024 32-bit registers (RF)
- Performs 32-bit and 64-bit integer operations: arithmetic, comparison, conversion, and logic operations
- Performs 32-bit floating-point operations: add, multiply, min, max, multiply-add, etc.

SFU (Special Function Unit)
- Pipelined unit
- Generates one 32-bit floating-point function result per cycle: square root, sin, cos, 2^x, log2 x

CUDA Memory System
- Global memory: external DRAM
- Shared memory: on chip
- Per-thread local memory: external DRAM
- Constant memory: in external DRAM, cached in shared memory
- Texture memory: on chip
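These memory spaces map onto declaration qualifiers in CUDA C. The following minimal sketch (the names scale_factor, dbuf, and scale are illustrative, not from the lecture) shows constant, device (global), shared, and register-resident data in one small program; __constant__ and __device__ variables are initialized from the host with cudaMemcpyToSymbol():

#include <stdio.h>
#include <cuda_runtime.h>

__constant__ float scale_factor;   // constant memory: read-only in kernels, cached
__device__   float dbuf[256];      // device (global) memory: visible to all threads of the application

__global__ void scale(float *out)  // assumes a single block of 256 threads
{
    __shared__ float tmp[256];     // shared memory: one copy per thread block
    int tid = threadIdx.x;         // per-thread scalars such as tid normally live in registers
    tmp[tid] = dbuf[tid] * scale_factor;
    __syncthreads();               // make the shared-memory writes visible to the whole block
    out[tid] = tmp[tid];
}

int main(void)
{
    float h[256], s = 2.0f;
    for (int i = 0; i < 256; i++) h[i] = (float)i;

    cudaMemcpyToSymbol(scale_factor, &s, sizeof(float));
    cudaMemcpyToSymbol(dbuf, h, sizeof(h));

    float *dout;
    cudaMalloc((void**)&dout, sizeof(h));
    scale<<<1, 256>>>(dout);
    cudaMemcpy(h, dout, sizeof(h), cudaMemcpyDeviceToHost);
    printf("h[10] = %f\n", h[10]);   // expect 20.0
    cudaFree(dout);
    return 0;
}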
Project
Performance Measurement, Evaluation and Prediction of Multicore and GPU Systems
- Multicore systems
  - CPU performance (instruction execution time, pipelining, etc.)
  - Cache performance
  - Performance using algorithmic structures
- GPU systems (NVIDIA CUDA)
  - GPU core performance (instruction execution time, pipelining, etc.)
  - Global and shared memory performance
  - Performance using algorithmic structures
  - GPU performance in the MATLAB environment
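For the GPU measurements in this project, kernel execution time is usually taken with CUDA events rather than host timers, because kernel launches are asynchronous with respect to the CPU. A minimal sketch follows; the kernel busy and the sizes are placeholders, not part of the project specification:

#include <stdio.h>
#include <cuda_runtime.h>

__global__ void busy(float *a, int n)      // placeholder kernel under measurement
{
    int i = blockIdx.x*blockDim.x + threadIdx.x;
    if (i < n) a[i] = a[i]*2.0f + 1.0f;
}

int main(void)
{
    const int n = 1 << 20;
    float *da;
    cudaMalloc((void**)&da, n*sizeof(float));
    cudaMemset(da, 0, n*sizeof(float));

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start, 0);               // timestamp before the launch
    busy<<<(n + 255)/256, 256>>>(da, n);
    cudaEventRecord(stop, 0);                // timestamp after the launch
    cudaEventSynchronize(stop);              // wait until the kernel and the stop event complete

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);  // elapsed GPU time in milliseconds
    printf("kernel time: %.3f ms\n", ms);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(da);
    return 0;
}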