Introduction to CUDA (1 of n*)
Joseph Kider, University of Pennsylvania
CIS 565 - Spring 2011
* Where n is 2 or 3

Agenda
- GPU architecture review
- CUDA
  - First of two or three dedicated classes

Acknowledgements
- Many slides are from Kayvon Fatahalian's "From Shader Code to a Teraflop: How GPU Shader Cores Work":
  http://bps10.idav.ucdavis.edu/talks/03-fatahalian_gpuArchTeraflop_BPS_SIGGRAPH2010.pdf
- David Kirk and Wen-mei Hwu's UIUC course:
  http://courses.engr.illinois.edu/ece498/al/

GPU Architecture Review
- GPUs are: parallel, multithreaded, many-core
- GPUs have: tremendous computational horsepower and high memory bandwidth

GPU Architecture Review
- GPUs are specialized for compute-intensive, highly parallel computation (graphics!)
- Transistors are devoted to processing, not to data caching and flow control

GPU Architecture Review: Transistor Usage
Image from: http://developer.download.nvidia.com/compute/cuda/3_2_prod/toolkit/docs/CUDA_C_Programming_Guide.pdf

[A series of GPU shader-core architecture slides reproduced from: http://bps10.idav.ucdavis.edu/talks/03-fatahalian_gpuArchTeraflop_BPS_SIGGRAPH2010.pdf]

Threading Hardware in G80

Sources
- Slides by ECE 498 AL: Programming Massively Parallel Processors, Wen-mei Hwu
- John Nickolls, NVIDIA

Fixed-Function Pipeline
[Diagram: the 3D application or game issues 3D API commands (OpenGL or Direct3D) across the CPU-GPU boundary (AGP/PCIe); the GPU front end consumes the command, data, and vertex index streams; vertex processing feeds primitive assembly, rasterization and interpolation, fragment processing, and raster operations into the frame buffer]

Programmable Pipeline
[Diagram: the same pipeline, with pre-transformed vertices handled by a programmable vertex processor and pre-transformed fragments handled by a programmable fragment processor]

Unified Programmable Pipeline
[Diagram: the same pipeline, with a single unified vertex, fragment, and geometry processor in place of the separate programmable stages]
General Diagram (6800/NV40)

TurboCache
- Uses PCI-Express bandwidth to render directly to system memory
- Card needs less memory
- Performance boost while lowering cost
- TurboCache Manager dynamically allocates from main memory
- Local memory used to cache data and to deliver peak performance when needed

NV40 Vertex Processor
- An NV40 vertex processor can execute one vector operation (up to four FP32 components), one scalar FP32 operation, and one texture access per clock cycle

NV40 Fragment Processors
- Early termination from mini z-buffer and z-buffer checks; the surviving sets of 4 pixels (quads) are passed on to the fragment units

Why the NV40 Series Was Better
- Massive parallelism
- Scalability: lower-end products have fewer pixel pipes and fewer vertex shader units
- Computation power: 222 million transistors
- First to support Microsoft's DirectX 9 Shader Model 3.0 spec
- Dynamic branching in pixel shaders

Dynamic Branching
- Helps detect whether a pixel needs shading
- Instruction flow is handled in groups of pixels
- Specify branch granularity (the number of consecutive pixels that take the same branch)
- Better distribution of blocks of pixels between the different quad engines

General Diagram (7800/G70)
General Diagram (7800/G70)
General Diagram (6800/NV40)

GeForce Go 7800 - Power Issues
- Power consumption and package are the same as the 6800 Ultra chip, so notebook designers do not have to change much about their thermal designs
- Dynamic clock scaling can run the engine, memory, and pixel clocks as slow as 16 MHz
- Heavier use of clock gating than the desktop version
- Runs at lower voltages than any other mobile performance part
- Regardless, you won't get much battery-based runtime for a 3D game

GeForce 7800 GTX Parallelism
[Diagram: 8 vertex engines; Z-cull; triangle setup/raster; shader instruction dispatch; 24 pixel shaders; fragment crossbar; 16 raster operation pipelines; 4 memory partitions]

G80 - Graphics Mode
- The future of GPUs is programmable processing
- So: build the architecture around the processor
[Diagram: host, input assembler, setup/raster/Z-cull; vertex, geometry, and pixel thread issue feed an array of streaming processors (SP) with texture filtering (TF) units, L1 and L2 caches, and frame buffer (FB) partitions, managed by a thread processor]

G80 CUDA Mode - A Device Example
- Processors execute computing threads
- New operating mode / HW interface for computing
[Diagram: host, input assembler, and thread execution manager feeding processor groups, each with a parallel data cache and texture unit; load/store paths to global memory]

Why Use the GPU for Computing?
- The GPU has evolved into a very flexible and powerful processor:
  - It's programmable using high-level languages
  - It supports 32-bit floating point precision
  - It offers lots of GFLOPS
- GPU in every PC and workstation
[Chart: GFLOPS over time for G80 = GeForce 8800 GTX, G71 = GeForce 7900 GTX, G70 = GeForce 7800 GTX, NV40 = GeForce 6800 Ultra, NV35 = GeForce FX 5950 Ultra, NV30 = GeForce FX 5800]

What is Behind such Evolution?
- The GPU is specialized for compute-intensive, highly data-parallel computation (exactly what graphics rendering is about)
- So, more transistors can be devoted to data processing rather than data caching and flow control
[Diagram: CPU die dominated by control logic and cache next to DRAM, versus GPU die dominated by ALUs next to DRAM]
- The fast-growing video game industry exerts strong economic pressure for constant innovation

What is (Historical) GPGPU?
- General-purpose computation using the GPU and graphics API in applications other than 3D graphics
- GPU accelerates the critical path of the application
- Data-parallel algorithms leverage GPU attributes:
  - Large data arrays, streaming throughput
  - Fine-grain SIMD parallelism
  - Low-latency floating point (FP) computation
- Applications: see GPGPU.org

Previous GPGPU Constraints
- Dealing with the graphics API: working with the corner cases of the graphics API
- Addressing modes: limited texture size/dimension
- Shader capabilities: limited outputs
- Instruction sets: lack of integer and bit operations
- Communication limited: between pixels; scatter (a[i] = p)
[Diagram: per-thread / per-shader / per-context view of a fragment program with input, temp, and output registers, texture, constants, memory, and frame buffer]

An Example of Physical Reality Behind CUDA
- CPU (host)
- GPU with local DRAM (device)

Arrays of Parallel Threads
- A CUDA kernel is executed by an array of threads
- All threads run the same code (SPMD)
- Each thread has an ID that it uses to compute memory addresses and make control decisions (see the kernel sketch below)

    float x = input[threadID];
    float y = func(x);
    output[threadID] = y;

Thread Blocks: Scalable Cooperation
- Divide the monolithic thread array into multiple blocks
- Threads within a block cooperate via shared memory, atomic operations, and barrier synchronization
- Threads in different blocks cannot cooperate
[Diagram: Thread Block 0 through Thread Block N - 1, each running the same per-thread code shown above]

Thread Batching: Grids and Blocks
- A kernel is executed as a grid of thread blocks
  - All threads share data memory space
- A thread block is a batch of threads that can cooperate with each other by:
  - Synchronizing their execution, for hazard-free shared memory accesses
  - Efficiently sharing data through a low-latency shared memory
- Two threads from two different blocks cannot cooperate
[Diagram: host launches Kernel 1 as Grid 1 and Kernel 2 as Grid 2 on the device; each grid is a 2D array of blocks, and each block is a 2D array of threads]
Courtesy: NVIDIA

Block and Thread IDs
- Threads and blocks have IDs, so each thread can decide what data to work on
  - Block ID: 1D or 2D
  - Thread ID: 1D, 2D, or 3D
- Simplifies memory addressing when processing multidimensional data
  - Image processing
  - Solving PDEs on volumes
  - ...
Courtesy: NVIDIA
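The per-thread code in the "Arrays of Parallel Threads" slide above can be dropped into a complete kernel. The following is a minimal sketch, not code from the slides: the kernel name, the stand-in for func(), and the 1D index arithmetic are assumptions, but the pattern — every thread runs the same code and uses its ID to pick its data — is the SPMD model described above.

    // Minimal SPMD kernel sketch: every thread runs the same code and uses its
    // ID to decide which element to process. (Names here are illustrative.)
    __global__ void funcKernel(const float* input, float* output, int n)
    {
        int threadID = blockIdx.x * blockDim.x + threadIdx.x;  // global thread ID
        if (threadID < n)                                      // guard for a partial last block
        {
            float x = input[threadID];
            float y = 2.0f * x;        // stand-in for func(x)
            output[threadID] = y;
        }
    }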
CUDA Device Memory Space Overview
- Each thread can:
  - R/W per-thread registers
  - R/W per-thread local memory
  - R/W per-block shared memory
  - R/W per-grid global memory
  - Read only per-grid constant memory
  - Read only per-grid texture memory
- The host can R/W global, constant, and texture memories
[Diagram: a device grid of blocks; each block has shared memory plus per-thread registers and local memory; the grid shares global, constant, and texture memories, which the host can also access]

Global, Constant, and Texture Memories (Long-Latency Accesses)
- Global memory
  - Main means of communicating R/W data between host and device
  - Contents visible to all threads
- Texture and constant memories
  - Constants initialized by host
  - Contents visible to all threads
Courtesy: NVIDIA

Block IDs and Thread IDs
- Each thread uses IDs to decide what data to work on
  - Block ID: 1D or 2D
  - Thread ID: 1D, 2D, or 3D
- Simplifies memory addressing when processing multidimensional data
  - Image processing
[Figure 3.2: an example of CUDA thread organization - the host launches Kernel 1 as Grid 1 and Kernel 2 as Grid 2; Block (1, 1) is expanded into a 3D arrangement of threads]
Courtesy: NVIDIA

CUDA Memory Model Overview
- Global memory
  - Main means of communicating R/W data between host and device
  - Contents visible to all threads
  - Long-latency access
- We will focus on global memory for now

Parallel Computing on a GPU
- 8-series GPUs deliver 25 to 200+ GFLOPS on compiled parallel C applications
- Available in laptops, desktops, and clusters (GeForce 8800, Tesla D870, Tesla S870)
- GPU parallelism is doubling every year
- Programming model scales transparently
- Programmable in C with CUDA
- Single-Program Multiple-Data (SPMD)

CUDA - Integrated CPU + GPU Application C Program
- Serial C code executes on the CPU
- Parallel kernel C code executes on GPU thread blocks

    // CPU serial code
    ...
    // GPU parallel kernel, launched as Grid 0 of thread blocks
    KernelA<<< nBlk, nTid >>>(args);
    ...
    // CPU serial code
    ...
    // GPU parallel kernel, launched as Grid 1 of thread blocks
    KernelB<<< nBlk, nTid >>>(args);

Grids and Blocks
- A kernel is executed as a grid of thread blocks
  - All threads share the global memory space
- A thread block is a batch of threads that can cooperate with each other by:
  - Synchronizing their execution using a barrier
  - Efficiently sharing data through a low-latency shared memory
- Two threads from two different blocks cannot cooperate
[Figure 3.2: an example of CUDA thread organization]
Courtesy: NVIDIA
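As a concrete sketch of this CPU/GPU interleaving (not the slides' code — the kernels, sizes, and data are made up), serial host code and kernel launches can alternate like this; each <<< nBlk, nTid >>> launch runs as its own grid on the GPU:

    #include <cuda_runtime.h>

    // Two trivial kernels standing in for KernelA and KernelB from the slide.
    __global__ void KernelA(float* data) { data[blockIdx.x * blockDim.x + threadIdx.x] += 1.0f; }
    __global__ void KernelB(float* data) { data[blockIdx.x * blockDim.x + threadIdx.x] *= 2.0f; }

    int main()
    {
        const int nBlk = 4, nTid = 128;                // grid and block sizes (arbitrary here)
        float* d_data = NULL;
        cudaMalloc((void**)&d_data, nBlk * nTid * sizeof(float));
        cudaMemset(d_data, 0, nBlk * nTid * sizeof(float));

        // ... serial C code on the CPU ...
        KernelA<<< nBlk, nTid >>>(d_data);             // parallel kernel runs as Grid 0 on the GPU
        cudaDeviceSynchronize();
        // ... more serial C code on the CPU ...
        KernelB<<< nBlk, nTid >>>(d_data);             // parallel kernel runs as Grid 1 on the GPU
        cudaDeviceSynchronize();

        cudaFree(d_data);
        return 0;
    }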
CUDA Thread Block
- Programmer declares the (thread) block:
  - Block size: 1 to 512 concurrent threads
  - Block shape: 1D, 2D, or 3D
  - Block dimensions in threads
- All threads in a block execute the same thread program
- Threads share data and synchronize while doing their share of the work (see the sketch after these hardware slides)
- Threads have thread ID numbers within the block
[Diagram: a thread block with thread IDs 0, 1, 2, 3, ..., m, all running the same thread program]
Courtesy: John Nickolls, NVIDIA

GeForce-8 Series HW Overview
[Diagram: the Streaming Processor Array (SPA) is built from Texture Processor Clusters (TPCs); each TPC contains two Streaming Multiprocessors (SMs) and a TEX unit; each SM has an instruction L1, a data L1, instruction fetch/dispatch, shared memory, 8 SPs, and 2 SFUs]

CUDA Processor Terminology
- SPA: Streaming Processor Array (variable across the GeForce 8-series; 8 TPCs in the GeForce 8800)
- TPC: Texture Processor Cluster (2 SMs + TEX)
- SM: Streaming Multiprocessor (8 SPs)
  - Multithreaded processor core
  - Fundamental processing unit for a CUDA thread block
- SP: Streaming Processor

Streaming Multiprocessor (SM)
- Multithreaded instruction dispatch
  - 1 to 512 threads active
  - Shared instruction fetch per 32 threads
  - Covers latency of texture/memory loads
- 8 Streaming Processors (SP)
- 2 Special Function Units (SFU)
- 20+ GFLOPS
- 16 KB shared memory
- Texture and global memory access
[Diagram: SM with instruction L1, data L1, instruction fetch/dispatch, shared memory, 8 SPs, and 2 SFUs]

G80 Thread Computing Pipeline
- Processors execute computing threads
- Alternative operating mode specifically for computing
- Generates thread grids based on kernel calls
[Diagram: the graphics-mode pipeline (host, input assembler, setup/raster/Z-cull, vertex/geometry/pixel thread issue) replaced by a host, input assembler, and thread execution manager feeding the SP array; parallel data caches and texture units per cluster; load/store paths to global memory; thread processor]

Thread Life Cycle in HW
- The grid is launched on the SPA
- Thread blocks are serially distributed to all the SMs
  - Potentially more than one thread block per SM
- Each SM launches warps of threads
  - 2 levels of parallelism
- The SM schedules and executes warps that are ready to run
- As warps and thread blocks complete, resources are freed and the SPA can distribute more thread blocks
[Diagram: host launches Grid 1 (Kernel 1) and Grid 2 (Kernel 2); each grid is a 2D array of blocks, each block a 2D array of threads]

SM Executes Blocks
[Diagram: two SMs, each with an MT issue unit, SPs, and shared memory, sharing a TEX unit, texture L1, L2, and memory]
- Threads are assigned to SMs at block granularity
  - Up to 8 blocks per SM, as resources allow
  - An SM in G80 can take up to 768 threads
    - Could be 256 (threads/block) * 3 blocks
    - Or 128 (threads/block) * 6 blocks, etc.
- Threads run concurrently
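Tying the thread-block abstraction to this hardware: all threads of a block run on one SM and share its 16 KB shared memory, cooperating through it and through barrier synchronization. The following is a minimal sketch of that cooperation; the kernel and variable names are made up, and it assumes a block size of at most 256 threads.

    // Sketch: threads in one block cooperate through shared memory and a barrier.
    // Each thread stores one element into shared memory, synchronizes, then reads
    // an element written by another thread of the same block.
    __global__ void reverseBlock(const float* in, float* out)
    {
        __shared__ float tile[256];                  // per-block shared memory (assumes blockDim.x <= 256)
        int t = threadIdx.x;

        tile[t] = in[blockIdx.x * blockDim.x + t];   // each thread writes one slot
        __syncthreads();                             // barrier: all writes now visible to the block

        // Safe to read data written by other threads in the same block.
        out[blockIdx.x * blockDim.x + t] = tile[blockDim.x - 1 - t];
    }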
Thread Scheduling/Execution
- Each thread block is divided into 32-thread warps
  - This is an implementation decision, not part of the CUDA programming model
- Warps are the scheduling units in an SM
- If 3 blocks are assigned to an SM and each block has 256 threads, how many warps are there in the SM?
  - Each block is divided into 256/32 = 8 warps
  - There are 8 * 3 = 24 warps
[Diagram: Block 1 warps and Block 2 warps, each warp covering threads t0, t1, t2, ..., t31, feeding the SM (instruction L1, data L1, instruction fetch/dispatch, shared memory, 8 SPs, 2 SFUs)]

SM Warp Scheduling
- SM hardware implements zero-overhead warp scheduling
  - Warps whose next instruction has its operands ready for consumption are eligible for execution
  - Eligible warps are selected for execution on a prioritized scheduling policy
  - All threads in a warp execute the same instruction when selected
- 4 clock cycles are needed to dispatch the same instruction for all threads in a warp in G80
[Diagram: SM multithreaded warp scheduler timeline - warp 8 instruction 11, warp 1 instruction 42, warp 3 instruction 95, ..., warp 8 instruction 12, warp 3 instruction 96]

SM Instruction Buffer - Warp Scheduling
- Fetch one warp instruction/cycle from the instruction L1 cache into any instruction buffer slot
- Issue one "ready-to-go" warp instruction/cycle from any warp-instruction buffer slot
  - Operand scoreboarding is used to prevent hazards
  - Issue selection is based on round-robin/age of warp
[Diagram: I$ L1, multithreaded instruction buffer, register file (RF), C$ L1, shared memory, operand select, MAD, SFU]

Scoreboarding
- All register operands of all instructions in the instruction buffer are scoreboarded
  - An instruction becomes ready after the needed values are deposited
  - Prevents hazards
  - Cleared instructions are eligible for issue
- Decoupled memory/processor pipelines
  - Any thread can continue to issue instructions until scoreboarding prevents issue
  - Allows memory/processor ops to proceed in the shadow of other waiting memory/processor ops
[Diagram: warp issue timeline with stalls - TB1 W1 stall, TB2 W1 stall, TB3 W2 stall; TB = thread block, W = warp]

Granularity Considerations
- For matrix multiplication, should I use 4x4, 8x8, 16x16, or 32x32 tiles?
  - For 4x4, we have 16 threads per block. Since each SM can take up to 768 threads, the thread capacity allows 48 blocks. However, each SM can only take up to 8 blocks, so there will be only 128 threads in each SM! There are 8 warps, but each warp is only half full.
  - For 8x8, we have 64 threads per block. Since each SM can take up to 768 threads, it could take up to 12 blocks. However, each SM can only take up to 8 blocks, so only 512 threads will go into each SM! There are 16 warps available for scheduling in each SM; each warp spans four slices in the y dimension.
  - For 16x16, we have 256 threads per block. Since each SM can take up to 768 threads, it can take up to 3 blocks and achieve full capacity, unless other resource considerations overrule. There are 24 warps available for scheduling in each SM; each warp spans two slices in the y dimension.
  - For 32x32, we would have 1024 threads per block, more than the 512-thread limit per block, so this configuration is not allowed.
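The same reasoning can be written down as a small host-side C sketch (not from the slides); the limits used — 768 threads and 8 blocks per SM, 512 threads per block, 32-thread warps — are the G80 numbers quoted in the surrounding slides.

    #include <stdio.h>

    int main()
    {
        const int maxThreadsPerSM = 768, maxBlocksPerSM = 8;
        const int maxThreadsPerBlock = 512, warpSize = 32;
        const int tiles[] = {4, 8, 16, 32};

        for (int i = 0; i < 4; ++i) {
            int threadsPerBlock = tiles[i] * tiles[i];
            if (threadsPerBlock > maxThreadsPerBlock) {
                printf("%dx%d: %d threads/block exceeds the %d-thread block limit\n",
                       tiles[i], tiles[i], threadsPerBlock, maxThreadsPerBlock);
                continue;
            }
            int blocks = maxThreadsPerSM / threadsPerBlock;        // limited by thread capacity...
            if (blocks > maxBlocksPerSM) blocks = maxBlocksPerSM;  // ...and by the 8-block limit
            int warpsPerBlock = (threadsPerBlock + warpSize - 1) / warpSize;  // warps never span blocks
            printf("%dx%d: %d blocks/SM, %d threads/SM, %d warps/SM\n",
                   tiles[i], tiles[i], blocks, blocks * threadsPerBlock, blocks * warpsPerBlock);
        }
        return 0;
    }

This reproduces the slide's conclusions: 4x4 gives 128 threads (8 half-full warps), 8x8 gives 512 threads (16 warps), and only 16x16 fills the SM with 768 threads (24 warps).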
Memory Hardware in G80

CUDA Device Memory Space: Review
- Each thread can:
  - R/W per-thread registers
  - R/W per-thread local memory
  - R/W per-block shared memory
  - R/W per-grid global memory
  - Read only per-grid constant memory
  - Read only per-grid texture memory
- The host can R/W global, constant, and texture memory
[Diagram: a device grid of blocks; each block has shared memory plus per-thread registers and local memory; the grid shares global, constant, and texture memories]

Parallel Memory Sharing
- Local memory: per-thread
  - Private per thread
  - Auto variables, register spill
- Shared memory: per-block
  - Shared by threads of the same block
  - Inter-thread communication
- Global memory: per-application
  - Shared by all threads
  - Inter-grid communication (grids are sequential in time)

SM Memory Architecture
[Diagram: two SMs, each with an MT issue unit, SPs, and shared memory, assigned blocks and sharing a TEX unit, texture L1, L2, and memory]
- Threads in a block share data and results
  - In global memory and shared memory
  - Synchronize at a barrier instruction
- Per-block shared memory allocation keeps data close to the processor
Courtesy: John Nickolls, NVIDIA

SM Register File
- Register File (RF): 32 KB (8K entries) for each SM in G80
- The TEX pipe can also read/write the RF (2 SMs share 1 TEX)
- The load/store pipe can also read/write the RF
[Diagram: I$ L1, multithreaded instruction buffer, RF, C$ L1, shared memory, operand select, MAD, SFU]

Programmer View of Register File
- There are 8192 registers in each SM in G80
  - This is an implementation decision, not part of CUDA
  - Registers are dynamically partitioned across all blocks assigned to the SM
  - Once assigned to a block, a register is not accessible by threads in other blocks
[Diagram: the register file partitioned among 3 blocks versus 4 blocks]

Matrix Multiplication Example
- If each block has 16x16 threads and each thread uses 10 registers, how many threads can run on each SM?
  - Each block requires 10 * 256 = 2560 registers
  - 8192 = 3 * 2560 + change
  - So three blocks can run on an SM as far as registers are concerned
- How about if each thread increases its register use by 1?
  - Each block now requires 11 * 256 = 2816 registers
  - 8192 < 2816 * 3, so only two blocks fit per SM
- (A small sketch reproducing this arithmetic appears after this group of slides.)

More on Dynamic Partitioning
- Dynamic partitioning gives more flexibility to compilers/programmers
  - One can run a smaller number of threads that require many registers each, or a large number of threads that require few registers each
  - This allows for finer-grain threading than traditional CPU threading models
  - The compiler can trade off between instruction-level parallelism and thread-level parallelism

Let's program this thing!

GPU Computing History
- 2001/2002 - researchers see the GPU as a data-parallel coprocessor
  - The GPGPU field is born
- 2007 - NVIDIA releases CUDA
  - CUDA - Compute Unified Device Architecture
  - GPGPU shifts to GPU Computing
- 2008 - Khronos releases the OpenCL specification

CUDA Abstractions
- A hierarchy of thread groups
- Shared memories
- Barrier synchronization

CUDA Terminology
- Host - typically the CPU
  - Code written in ANSI C
- Device - typically the GPU (data-parallel)
  - Code written in extended ANSI C
- Host and device have separate memories
- CUDA program
  - Contains both host and device code

CUDA Terminology
- Kernel - data-parallel function
  - Invoking a kernel creates lightweight threads on the device
  - Threads are generated and scheduled with hardware
- Does a kernel remind you of a shader in OpenGL?
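Returning to the register-partitioning arithmetic in the "Matrix Multiplication Example" slide above, here is a small host-side C sketch (not from the slides) that reproduces it for G80's 8192 registers per SM and 16x16-thread blocks:

    #include <stdio.h>

    int main()
    {
        const int registersPerSM = 8192, threadsPerBlock = 16 * 16, maxBlocksPerSM = 8;

        for (int regsPerThread = 10; regsPerThread <= 11; ++regsPerThread) {
            int regsPerBlock = regsPerThread * threadsPerBlock;   // 2560 or 2816
            int blocks = registersPerSM / regsPerBlock;           // blocks whose registers fit
            if (blocks > maxBlocksPerSM) blocks = maxBlocksPerSM;
            printf("%d regs/thread -> %d regs/block -> %d block(s)/SM (%d threads)\n",
                   regsPerThread, regsPerBlock, blocks, blocks * threadsPerBlock);
        }
        return 0;
    }

With 10 registers per thread, three blocks (768 threads) fit; adding just one register per thread drops this to two blocks (512 threads).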
CUDA Kernels
- Executed N times in parallel by N different CUDA threads
- Thread ID
- Declaration specifier
- Execution configuration

CUDA Program Execution
Image from: http://courses.engr.illinois.edu/ece498/al/textbook/Chapter2-CudaProgrammingModel.pdf

Thread Hierarchies
- Grid - one or more thread blocks
  - 1D or 2D
- Block - array of threads
  - 1D, 2D, or 3D
  - Each block in a grid has the same number of threads
  - Each thread in a block can synchronize and access shared memory
Image from: http://courses.engr.illinois.edu/ece498/al/textbook/Chapter2-CudaProgrammingModel.pdf

Thread Hierarchies
- Block - 1D, 2D, or 3D
- Example: index into a vector, matrix, or volume

Thread Hierarchies
- Thread ID: scalar thread identifier
- Thread index: threadIdx
- 1D: thread ID == thread index
- 2D with size (Dx, Dy): thread ID of index (x, y) == x + y Dx
- 3D with size (Dx, Dy, Dz): thread ID of index (x, y, z) == x + y Dx + z Dx Dy

Thread Hierarchies
- 2D index
- 1 thread block
- 2D block

Thread Hierarchies
- Thread block
  - Group of threads
    - G80 and GT200: up to 512 threads
    - Fermi: up to 1024 threads
  - Reside on the same processor core
  - Share the memory of that core
Image from: http://courses.engr.illinois.edu/ece498/al/textbook/Chapter2-CudaProgrammingModel.pdf

Thread Hierarchies
- Block index: blockIdx
- Dimension: blockDim
  - 1D or 2D

Thread Hierarchies
- 16x16 threads per block
- 2D thread block

Thread Hierarchies
- Example: N = 32
  - 16x16 threads per block (independent of N)
    - threadIdx in ([0, 15], [0, 15])
  - 2x2 thread blocks in the grid
    - blockIdx in ([0, 1], [0, 1])
    - blockDim = 16
  - i = [0, 1] * 16 + [0, 15]

Thread Hierarchies
- Thread blocks execute independently
  - In any order: parallel or series
  - Scheduled in any order by any number of cores
  - Allows code to scale with core count
Image from: http://developer.download.nvidia.com/compute/cuda/3_2_prod/toolkit/docs/CUDA_C_Programming_Guide.pdf

Thread Hierarchies
- Threads in a block
  - Share (limited) low-latency memory
  - Synchronize execution to coordinate memory accesses
    - __syncthreads()
    - Barrier - threads in a block wait until all threads reach it
    - Lightweight
Image from: http://courses.engr.illinois.edu/ece498/al/textbook/Chapter2-CudaProgrammingModel.pdf

CUDA Memory Transfers
Image from: http://courses.engr.illinois.edu/ece498/al/textbook/Chapter2-CudaProgrammingModel.pdf

CUDA Memory Transfers
- The host can transfer to/from the device
  - Global memory
  - Constant memory

CUDA Memory Transfers
- cudaMalloc(): allocates global memory on the device
- cudaFree(): frees device memory
[Code walkthrough: cudaMalloc() takes a pointer to device memory and a size in bytes]
Code from: http://courses.engr.illinois.edu/ece498/al/textbook/Chapter2-CudaProgrammingModel.pdf

CUDA Memory Transfers
- cudaMemcpy()
  - Memory transfer
    - Host to host
    - Host to device
    - Device to host
    - Device to device
  - Host <-> device global memory
- Does this remind you of VBOs in OpenGL?
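Putting cudaMalloc(), cudaMemcpy(), and cudaFree() together, a minimal round trip looks roughly like this. This is a sketch, not the slides' code; the array name and size are arbitrary.

    #include <cuda_runtime.h>
    #include <stdlib.h>

    int main()
    {
        const int n = 1024;
        size_t size = n * sizeof(float);

        float* h_data = (float*)malloc(size);       // host memory
        // ... fill h_data ...
        float* d_data = NULL;                       // pointer to device memory
        cudaMalloc((void**)&d_data, size);          // allocate device global memory (size in bytes)

        cudaMemcpy(d_data, h_data, size, cudaMemcpyHostToDevice);  // host -> device
        // ... launch kernels that read/write d_data ...
        cudaMemcpy(h_data, d_data, size, cudaMemcpyDeviceToHost);  // device -> host

        cudaFree(d_data);                           // free device memory
        free(h_data);
        return 0;
    }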
CUDA Memory Transfers
- cudaMemcpy()
  - Memory transfer
    - Host to host
    - Host to device
    - Device to host
    - Device to device
  - Host <-> device global memory
[The slide is repeated to highlight each transfer direction in turn]
- Note: plain cudaMemcpy() host-device transfers are blocking; asynchronous transfers use cudaMemcpyAsync()

CUDA Memory Transfers
- Host to device
  - Destination (device), source (host)
[Diagram: host, device, global memory]
Code from: http://courses.engr.illinois.edu/ece498/al/textbook/Chapter2-CudaProgrammingModel.pdf

Matrix Multiply
- P = M * N
- Assume M and N are square for simplicity
- Is this data-parallel?
Image from: http://courses.engr.illinois.edu/ece498/al/textbook/Chapter2-CudaProgrammingModel.pdf

Matrix Multiply
- 1,000 x 1,000 matrix
- 1,000,000 dot products
- Each: 1,000 multiplies and 1,000 adds

Matrix Multiply: CPU Implementation

    void MatrixMulOnHost(float* M, float* N, float* P, int width)
    {
        for (int i = 0; i < width; ++i)
            for (int j = 0; j < width; ++j)
            {
                float sum = 0;
                for (int k = 0; k < width; ++k)
                {
                    float a = M[i * width + k];
                    float b = N[k * width + j];
                    sum += a * b;
                }
                P[i * width + j] = sum;
            }
    }

Code from: http://courses.engr.illinois.edu/ece498/al/lectures/lecture3%20cuda%20threads%20spring%202010.ppt

Matrix Multiply: CUDA Skeleton
Code from: http://courses.engr.illinois.edu/ece498/al/textbook/Chapter2-CudaProgrammingModel.pdf

Matrix Multiply Step 1
- Add CUDA memory transfers to the skeleton

Matrix Multiply: Data Transfer
- Allocate input
- Allocate output
- Read back from device
- Does this remind you of GPGPU with GLSL?
Code from: http://courses.engr.illinois.edu/ece498/al/textbook/Chapter2-CudaProgrammingModel.pdf

Matrix Multiply Step 2
- Implement the kernel in CUDA C

Matrix Multiply: CUDA Kernel
- Accessing a matrix, so using a 2D block
- Each thread computes one output element
- Where did the two outer for loops in the CPU implementation go?
Code from: http://courses.engr.illinois.edu/ece498/al/textbook/Chapter2-CudaProgrammingModel.pdf
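The slides show the kernel as an image; a sketch consistent with their description (a single 2D thread block, each thread computing one element of Pd) looks roughly like this. The Md/Nd/Pd names follow the convention used in the next slides, but this is not the slides' exact code.

    // Sketch of the simple matrix-multiply kernel described above: one 2D thread
    // block, each thread computes one element of Pd.
    __global__ void MatrixMulKernel(float* Md, float* Nd, float* Pd, int width)
    {
        int row = threadIdx.y;   // 2D thread index within the single block
        int col = threadIdx.x;

        float sum = 0;
        for (int k = 0; k < width; ++k)                         // the two outer CPU loops are gone:
            sum += Md[row * width + k] * Nd[k * width + col];   // the thread grid replaces them

        Pd[row * width + col] = sum;   // each thread writes only its own element, so no locks are needed
    }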
Matrix Multiply: CUDA Kernel
- No locks or synchronization, why?
Code from: http://courses.engr.illinois.edu/ece498/al/textbook/Chapter2-CudaProgrammingModel.pdf

Matrix Multiply Step 3
- Invoke the kernel in CUDA C

Matrix Multiply: Invoke Kernel
- One block with width by width threads
Code from: http://courses.engr.illinois.edu/ece498/al/textbook/Chapter2-CudaProgrammingModel.pdf

Matrix Multiply
- One block of threads computes matrix Pd
  - Each thread computes one element of Pd
- Each thread
  - Loads a row of matrix Md
  - Loads a column of matrix Nd
  - Performs one multiply and one addition for each pair of Md and Nd elements
  - Compute to off-chip memory access ratio close to 1:1 (not very high)
- Size of matrix limited by the number of threads allowed in a thread block
[Diagram: Grid 1 containing Block 1; Thread (2, 2) computes one Pd element from a row of Md and a column of Nd; matrices are WIDTH x WIDTH]
© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2009, ECE 498AL Spring 2010, University of Illinois, Urbana-Champaign
Slide from: http://courses.engr.illinois.edu/ece498/al/lectures/lecture2%20cuda%20spring%2009.ppt

Matrix Multiply
- What is the major performance problem with our implementation?
- What is the major limitation?
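For completeness, here is a sketch of that invocation, consistent with the description above and with the hypothetical MatrixMulKernel sketched earlier (again, not the slides' exact code):

    // Sketch: launch a single block of width x width threads.
    void MatrixMulOnDevice(float* Md, float* Nd, float* Pd, int width)
    {
        dim3 dimGrid(1, 1);              // one thread block
        dim3 dimBlock(width, width);     // width x width threads in that block
        MatrixMulKernel<<<dimGrid, dimBlock>>>(Md, Nd, Pd, width);
    }

With a single block of width x width threads and G80's 512-threads-per-block limit, this only works for width <= 22, which is exactly the limitation the last slide asks about; every thread also reads its row and column straight from global memory, hinting at the performance problem.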