Getting Started with GPU Computing
Dan Negrut, Assistant Professor
Simulation-Based Engineering Lab, Dept. of Mechanical Engineering, University of Wisconsin-Madison
San Diego, August 30, 2009

Acknowledgements
- Colleagues helping to organize the GPU Workshop: Sara McMains, Krishnan Suresh, Roshan D'Souza
- Wen-mei W. Hwu
- NVIDIA Corporation
- My students: Hammad Mazhar, Toby Heyn

Acknowledgements: Financial Support [Dan Negrut]
- NSF
- NVIDIA Corporation
- British Aerospace Engineering (BAE), Land Division
- Argonne National Lab

Overview
- Parallel computing: why, and why now? (15 mins)
- GPU programming: the democratization of parallel computing (60 mins)
  - NVIDIA's CUDA, a facilitator of GPU computing
  - Comments on the execution configuration and execution model
  - The memory layout
  - Gauging resource utilization
  - IDE support
- Comments on GPU computing (15 mins)
  - Sources of information
  - Beyond CUDA

Scientific Computing: A Change of Tide...
- A paradigm shift is taking place in scientific computing
- Moving from sequential to parallel data processing
- Triggered by changes in the microprocessor industry

CPU: Three Walls to Serial Performance
- Memory Wall
- Instruction Level Parallelism (ILP) Wall
- Power Wall
- Source: the excellent article "The Many-Core Inflection Point for Mass Market Computer Systems" by John L. Manferdelli, Microsoft Corporation, http://www.ctwatch.org/quarterly/articles/2007/02/the-many-core-inflection-point-for-mass-market-computer-systems/

Memory Wall
- There is a growing disparity between CPU speed and the speed of memory access outside the CPU chip
- S. Cray: "Anyone can build a fast CPU. The trick is to build a fast system."

Memory Wall
- The processor is often data starved (idle) due to latency and limited communication bandwidth beyond chip boundaries
- From 1986 to 2000, CPU speed improved at an annual rate of 55% while memory access speed improved by only 10% per year
- Some fixes:
  - A strong push for ever-growing caches, to improve the average time needed to fetch or write data
  - Hyper-Threading Technology (HTT)

The Power Wall
- "Power, and not manufacturing, limits traditional general purpose microarchitecture improvements" (F. Pollack, Intel Fellow)
- Leakage power dissipation gets worse as gates get smaller, because gate dielectric thicknesses must proportionately decrease
- [Chart: power density (W/cm^2) versus process technology (um), from the i386 and i486 through the Pentium, Pentium Pro, Pentium II, Pentium III, Pentium 4, and Core Duo, trending toward the power density of a nuclear reactor; adapted from F. Pollack (MICRO'99)]

The Power Wall
- Power dissipation in clocked digital devices is proportional to the square of the clock frequency, imposing a natural limit on clock rates
- A significant increase in clock speed without heroic (and expensive) cooling is not possible; chips would simply melt.
The Power Wall
- Clock speed increased by a factor of 4,000 in less than two decades
- The ability of manufacturers to dissipate heat is limited though...
- Looking back at the last five years, clock rates have been pretty much flat
- Intel's Sandy Bridge microprocessor architecture (2010) is expected to go up to 4.0 GHz

The Bright Spot: Moore's Law
- 1965 paper: doubling of the number of transistors on integrated circuits every two years
- Moore himself wrote only about the density of components (or transistors) at minimum cost
- The increase in transistor count serves, to some extent, as a rough measure of computer processing performance
- http://news.cnet.com/Images-Moores-Law-turns-40/2009-1041_3-5649019.html

Micro2015: Evolving Processor Architecture, Intel® Developer Forum, March 2005
Intel's vision: an evolutionary, configurable architecture
- Large, scalar cores for high single-thread performance; scalar plus many-core for highly threaded workloads
- Dual core: symmetric multithreading
- Multi-core array: CMP with ~10 cores
- Many-core array: CMP with 10s-100s of low-power cores; scalar cores; capable of TFLOPS+; full system-on-chip; servers, workstations, embedded...
- CMP = "chip multi-processor"
- From a presentation by Paul Petersen, Sr. Principal Engineer, Intel

Putting things in perspective...
The way business has been run in the past, and how it will probably change:
- Old: rely exclusively on frequency increases. New: parallelism is the primary method of performance improvement.
- Old: for the commoner, don't bother parallelizing an application (after all, you get a meager speedup). New: no scientific computing application relies on single-core chips.
- Old: less than linear scaling for a multiprocessor is failure. New: sub-linear speedups are OK as long as you beat the sequential version.
Slide source: Berkeley View of the Landscape

Some numbers would be good...

GPU vs. CPU Flop Rate Comparison
- [Plot: flop rate over time for GPUs vs. CPUs; single precision rate shown for the GPU]
- Seymour Cray: "If you were plowing a field, which would you rather use: two strong oxen or 1024 chickens?"

Key Parameters: GPU (NVIDIA Tesla C1060) vs. CPU (Intel Core i7 975 Extreme)
- Processing cores: 240 (GPU) vs. 4 (CPU)
- Memory: 4 GB device memory (GPU) vs. 32 KB L1 cache per core, 256 KB L2 (I&D) cache per core, 8 MB L3 (I&D) cache shared by all cores (CPU)
- Clock speed: 1.33 GHz (GPU) vs. 3.20 GHz (CPU)
- Memory bandwidth: 102 GB/s (GPU) vs. 32.0 GB/s (CPU)
- Floating point operations/s: 933 x 10^9 single precision (GPU) vs. 70 x 10^9 double precision (CPU)

The GPU Hardware

GPU: Underlying Hardware
- NVIDIA nomenclature used below, reminiscent of the GPU's graphics mission
- The hardware is organized as follows: one Stream Processor Array (SPA) has a collection of Texture Processor Clusters (TPCs; ten of them on the C1060); each TPC has three Stream Multiprocessors (SMs); and each SM is made up of eight Streaming, or Scalar, Processors (SPs)
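These parameters can be read directly off whatever card is installed by querying the CUDA runtime. The following is a minimal sketch, not taken from the slides, that uses cudaGetDeviceProperties; on a Tesla C1060 it should report 30 multiprocessors, i.e., 240 scalar processors (the 8 SPs/SM figure is assumed, since the API does not report it on this hardware generation).

    #include <stdio.h>
    #include <cuda_runtime.h>

    int main(void) {
        int count = 0;
        cudaGetDeviceCount(&count);
        for (int dev = 0; dev < count; ++dev) {
            cudaDeviceProp prop;
            cudaGetDeviceProperties(&prop, dev);
            printf("Device %d: %s\n", dev, prop.name);
            printf("  Compute capability:      %d.%d\n", prop.major, prop.minor);
            // Each SM of this hardware generation contains 8 scalar processors (SPs).
            printf("  Stream Multiprocessors:  %d (%d SPs, assuming 8 SPs/SM)\n",
                   prop.multiProcessorCount, 8 * prop.multiProcessorCount);
            printf("  Global memory:           %.0f MB\n", prop.totalGlobalMem / (1024.0 * 1024.0));
            printf("  Shared memory per block: %d KB\n", (int)(prop.sharedMemPerBlock / 1024));
            printf("  Registers per SM:        %d\n", prop.regsPerBlock);
            printf("  Clock rate:              %.2f GHz\n", prop.clockRate / 1.0e6);
            printf("  Max threads per block:   %d\n", prop.maxThreadsPerBlock);
        }
        return 0;
    }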
NVIDIA Tesla C1060
- 240 scalar processors
- 4 GB device memory
- Memory bandwidth: 102 GB/s
- Clock rate: 1.3 GHz
- Approx. $1,250

Layout of Typical Hardware Architecture
- [Diagram: the CPU (the host) with its RAM, connected to the GPU with its local DRAM (the device)]

GPGPU Computing
- GPGPU computing: "General Purpose" GPU computing
- The GPU can be used for more than just graphics: the computational resources are there, and most of the time they are underutilized
- The GPU can be used to accelerate the data-parallel parts of an application

GPGPU: Pluses and Minuses
- A simple architecture optimized for compute-intensive tasks:
  - High-precision floating point arithmetic support
  - Large data arrays, streaming throughput
  - Fine-grain SIMD (Single Instruction Multiple Data) parallelism
  - Low-latency floating point (FP) computation
  - 32-bit IEEE 754 floating point
- However, the GPU could only be programmed through graphics library APIs

GPGPU: Pluses and Minuses [Cntd.]
- Dealing with the graphics API:
  - Addressing modes: limited texture size/dimension
  - Shader capabilities: limited outputs per thread, per shader, per context
  - Instruction sets: lack of integer and bit operations
  - Communication limited between pixels: only gather (one can read data from other pixels), but no scatter (one can only write to one pixel)
- [Diagram: the fragment-program pipeline, with input registers, textures, constants, temporary registers, output registers, and frame buffer memory]
- Summing up: mapping computational problems to the graphics rendering pipeline was tedious...

CUDA: Addressing the Minuses in GPGPU
- "Compute Unified Device Architecture"
- It represents a general purpose programming model: the user kicks off batches of threads on the GPU
- A targeted software stack: scientific-computing-oriented drivers, language, and tools
- A driver for loading computation programs onto the GPU:
  - Standalone driver, optimized for computation
  - Interface designed for compute (a graphics-free API)
  - Guaranteed maximum download and readback speeds
  - Explicit GPU memory management

The CUDA Execution Model

GPU Computing: The Basic Idea
- The GPU is linked to the CPU by a reasonably fast connection
- The idea is to use the GPU as a co-processor:
  - Farm out big parallelizable tasks to the GPU
  - Keep the CPU busy with the control of the execution and with "corner" tasks

GPU Computing: The Basic Idea [Cntd.]
- You have to copy data onto the GPU and later fetch the results back. For this to pay off, the data transfer should be overshadowed by the number crunching that draws on that data
- GPUs also work in asynchronous mode: the data transfer for a future task can happen while the GPU processes the current job
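The asynchronous mode mentioned above relies on CUDA streams and page-locked (pinned) host memory, and on hardware that supports overlapping copies with kernel execution. The snippet below is a minimal sketch of the idea, not code from the slides: the kernel works on the current job in one stream while the input of the next job is copied over in a second stream. The kernel name processJob, the buffer names, and the sizes are made up for illustration; cleanup calls are omitted.

    #include <cuda_runtime.h>

    __global__ void processJob(const float *in, float *out, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) out[i] = 2.0f * in[i];   // placeholder computation
    }

    int main(void) {
        const int N = 1 << 20;
        size_t bytes = N * sizeof(float);

        // Pinned host buffers are required for truly asynchronous copies.
        float *hostA, *hostB;
        cudaMallocHost((void**)&hostA, bytes);
        cudaMallocHost((void**)&hostB, bytes);
        // ... fill hostA (current job) and hostB (next job) with input here ...

        float *devInA, *devOutA, *devInB;
        cudaMalloc((void**)&devInA,  bytes);
        cudaMalloc((void**)&devOutA, bytes);
        cudaMalloc((void**)&devInB,  bytes);

        cudaStream_t s0, s1;
        cudaStreamCreate(&s0);
        cudaStreamCreate(&s1);

        // Current job: copy + kernel queued in stream s0...
        cudaMemcpyAsync(devInA, hostA, bytes, cudaMemcpyHostToDevice, s0);
        processJob<<<(N + 255) / 256, 256, 0, s0>>>(devInA, devOutA, N);

        // ...while the input of the next job is transferred in stream s1.
        cudaMemcpyAsync(devInB, hostB, bytes, cudaMemcpyHostToDevice, s1);

        cudaThreadSynchronize();   // wait for both streams (CUDA 2.3-era call)
        return 0;
    }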
Some Nomenclature...
- The HOST: your CPU, executing the "master" thread
- The DEVICE: the GPU card, connected to the HOST through a PCIe x16 connection
- The HOST (the master thread) calls the DEVICE to execute a KERNEL
- When calling the KERNEL, the HOST also has to inform the DEVICE how many threads should execute the KERNEL; this is called "defining the execution configuration"

Calling a Kernel Function, Details
- A kernel function must be called with an execution configuration:

    __global__ void KernelFoo(...);   // declaration

    dim3 DimGrid(100, 50);    // 5000 thread blocks
    dim3 DimBlock(4, 8, 8);   // 256 threads per block
    KernelFoo<<<DimGrid, DimBlock>>>(...arg list here...);

- Any call to a kernel function is asynchronous: by default, execution on the host does not wait for the kernel to finish

Example
- A host call of the form foo<<<100, 256>>>(...) instructs the GPU to execute the kernel "foo" using 25,600 threads
- Two arguments are passed down to each thread executing the kernel "foo"
- In this execution configuration, the host instructs the device that it is supposed to run 100 blocks, each having 256 threads in it
- The concept of a block is important, since it represents the entity that gets executed by an SM

30,000 Feet Perspective
- [Figures: how your C code looks, side by side with how the code gets executed on the hardware in heterogeneous computing]

More on the Execution Model
- There is a limitation on the number of blocks in a grid: the grid of blocks can be organized as a 2D structure, at most 65535 by 65535 blocks (that is, no more than 4,294,836,225 blocks for a kernel call)
- Threads in each block: the threads can be organized as a 3D structure (x, y, z); the total number of threads in each block cannot be larger than 512

Kernel Call Overhead
- How much time is burnt by the CPU calling the GPU? The values reported below are averages over 100,000 kernel calls.
- No arguments in the kernel call:
  - GT 8800 series, compute capability 1.1: 0.115305 milliseconds
  - Tesla C1060, compute capability 1.3: 0.088493 milliseconds
- Arguments present in the kernel call:
  - GT 8800 series, compute capability 1.1: 0.146812 milliseconds
  - Tesla C1060, compute capability 1.3: 0.116648 milliseconds

Languages Supported in CUDA
- Note that everything is done in C, yet minor extensions are needed to flag the fact that a function actually represents a kernel, that there are functions that will only run on the device, etc. This is called "C with extensions"
- FORTRAN is supported; an ongoing project with the Portland Group (PGI)
- There is support for C++ programming (operator overloading, for instance)

CUDA Function Declarations (the "C with extensions" part)

    Declaration                            Executed on the:   Only callable from the:
    __device__ float myDeviceFunc()        device             device
    __global__ void  myKernelFunc()        device             host
    __host__   float myHostFunc()          host               host

- __global__ defines a kernel function; it must return void
- For a full list, see the CUDA Reference Manual
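As a concrete illustration of the table above, here is a small, self-contained sketch (not taken from the slides) that combines the three qualifiers: a __device__ helper callable only from device code, a __global__ kernel launched from the host with an execution configuration, and a __host__ function wrapping the launch. The names (deviceSquare, squareKernel, launchSquare) are invented for the example.

    #include <cuda_runtime.h>

    // Callable from device code only.
    __device__ float deviceSquare(float x) {
        return x * x;
    }

    // Kernel: executed on the device, callable (launched) from the host; must return void.
    __global__ void squareKernel(float *data, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) data[i] = deviceSquare(data[i]);
    }

    // Executed on the host, callable from the host (__host__ is the default qualifier).
    __host__ void launchSquare(float *devData, int n) {
        dim3 dimBlock(256);              // 256 threads per block
        dim3 dimGrid((n + 255) / 256);   // enough blocks to cover n elements
        squareKernel<<<dimGrid, dimBlock>>>(devData, n);
        cudaThreadSynchronize();         // the launch is asynchronous; wait here
    }

    int main(void) {
        const int N = 1024;
        float *devData;
        cudaMalloc((void**)&devData, N * sizeof(float));
        launchSquare(devData, N);
        cudaFree(devData);
        return 0;
    }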
Block Execution Scheduling Issues

Who's Executing Here? [The Stream Multiprocessor (SM)]
- The SM represents the quantum of scalability on NVIDIA's architecture: my laptop has 4 SMs; the Tesla C1060 has 30 SMs
- A Stream Multiprocessor (SM) contains:
  - 8 Scalar Processors (SPs)
  - 2 Special Function Units (SFUs)
  - Multi-threaded instruction dispatch, with instruction fetch shared per 32 threads
  - 16 KB of shared memory plus a register file of 16K (16,384) registers
  - Instruction and data L1 caches, plus DRAM texture and memory access
- From 1 up to 1024 (!) threads can be active on one SM
- The SM is where a block lands for execution

Scheduling on the Hardware
- The grid is launched on the SPA
- Thread blocks are serially distributed to all the SMs; potentially more than one thread block per SM
- Each SM launches warps of threads; the SM schedules and executes the warps that are ready to run
- As warps and thread blocks complete, resources are freed and the SPA can launch the next block(s) in line
- NOTE: there are two levels of scheduling:
  - For running [desirably] a large number of blocks on a small number of SMs (16/14/etc.)
  - For running up to 32 warps of threads on the 8 SPs available on each SM
- [Diagram: the host launching Kernel 1 on Grid 1 and Kernel 2 on Grid 2; each grid is a 2D arrangement of blocks, and each block a 2D/3D arrangement of threads]

SM Executes Blocks
- Threads are assigned to SMs at block granularity: up to 8 blocks per SM (which doesn't mean you'll always get eight...)
- One SM can take up to 1024 threads, i.e., 32 warps; that could be 256 threads/block x 4 blocks, or 128 threads/block x 8 blocks, etc.
- Threads run concurrently, but time slicing is involved
- The SM assigns/maintains thread IDs and manages/schedules thread execution
- There is NO time slicing for block execution
- [Diagram: two SMs, each with a multithreaded instruction unit, SPs, and shared memory, sharing a texture fetch path (TF, texture L1, L2) to device memory]

Thread Scheduling/Execution
- Each thread block is divided into 32-thread warps; this is an implementation decision, not part of the CUDA programming model
- Warps are the basic scheduling units in the SM
- If 3 blocks are assigned to an SM and each block has 256 threads, how many warps are there in the SM? Each block is divided into 256/32 = 8 warps, so there are 8 * 3 = 24 warps
- At any point in time, only *one* of the 24 warps will be selected for instruction fetch and execution
[HK-UIUC]

SM Warp Scheduling
- The SM hardware implements zero-overhead warp scheduling:
  - Warps whose next instruction has its operands ready for consumption are eligible for execution
  - Eligible warps are selected for execution based on a prioritized scheduling policy
  - All threads in a warp execute the same instruction when selected
- 4 clock cycles are needed to dispatch the same instruction for all threads in a warp on G80
- Side comment: suppose your code has one global memory access every four instructions. A warp's four instructions take 4 x 4 = 16 cycles to issue, and 200/16 = 12.5, so a minimum of 13 warps is needed to fully tolerate a 200-cycle memory latency
- [Diagram: the SM multithreaded warp scheduler interleaving instructions from different warps over time, e.g., warp 8 instruction 11, warp 1 instruction 42, warp 3 instruction 35, ..., warp 8 instruction 12, warp 3 instruction 36]
[HK-UIUC]

Review: The CUDA Programming Model
- The GPU architecture paradigm: Single Instruction Multiple Data (SIMD)
- What's the overall software (application) development model? A CUDA application is an integrated CPU + GPU C program:
  - Serial C code executes on the CPU
  - Parallel kernel C code executes on the GPU, in thread blocks
- [Diagram: CPU serial code, then GPU parallel kernel KernelA<<<nBlkA, nTidA>>>(args) running on Grid 0, then more CPU serial code, then GPU parallel kernel KernelB<<<nBlkB, nTidB>>>(args) running on Grid 1]
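To make the review concrete: inside a kernel, each of the many parallel threads figures out which piece of the data it owns from its block and thread indices. The snippet below is a common idiom written for this text rather than code from the slides; the kernel name vecAdd and the array names are made up, and the host-side allocation and transfers follow the pattern covered a few slides later.

    #include <cuda_runtime.h>

    // Each thread processes one element: its global index is built from the
    // block index, the block size, and the thread index within the block.
    __global__ void vecAdd(const float *a, const float *b, float *c, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)              // guard: the grid may contain a few extra threads
            c[i] = a[i] + b[i];
    }

    // Host-side launch: pick a block size, then enough blocks to cover n elements.
    void launchVecAdd(const float *devA, const float *devB, float *devC, int n) {
        int threadsPerBlock = 256;
        int blocksPerGrid = (n + threadsPerBlock - 1) / threadsPerBlock;
        vecAdd<<<blocksPerGrid, threadsPerBlock>>>(devA, devB, devC, n);
    }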
The CPU Perspective of the GPU...
- The GPU is viewed as a compute device that:
  - Is a co-processor to the CPU, or host
  - Runs many threads in parallel
- Data-parallel portions of an application are executed on the device as kernels, which run in parallel on many threads
- When a kernel is invoked, you have to instruct the GPU how many threads are supposed to run it: you have to indicate the number of blocks of threads, and how many threads are in each block

Caveats [1]
- Flop rates for GPUs are reported for single precision operations
- Double precision is supported, but the rule of thumb is that you get about a 4X slowdown relative to single precision
- Also, some small deviations from IEEE 754 exist; for example, combining a multiplication and an addition in one operation is not compliant

Caveats [2]
- There is no synchronization between threads that live in different blocks
- If all threads need to synchronize, this is accomplished by getting out of the kernel and invoking another one; the average overhead for a kernel launch is roughly 90-110 microseconds (small...)
- IMPORTANT: global, constant, and texture memory spaces are persistent across successive kernel calls made by the same application

CUDA Memory Spaces

The Memory Space
- The memory space is the union of:
  - Registers
  - Shared memory
  - Device memory, which can be global memory, constant memory, or texture memory
- Remarks:
  - The constant memory is cached
  - The texture memory is cached
  - The global memory is NOT cached
  - Memory bandwidth to device memory: 102 GB/s

CUDA Runtime Partitioning of the Memory Space
- The device memory is split into global, constant, and texture memory
- Note the presence of local memory, which is virtual memory: if too many registers are needed for a computation, the data overflow is stored in local memory
- "Local" means that it is local, or specific, to one thread; in fact, local memory is part of the global memory, so access times for local memory are long
- [Diagram: the device grid with its blocks; each block has shared memory, and each thread has registers and local memory; the host accesses the global, constant, and texture memories]

CUDA Device Memory Space
- Each thread can:
  - Read/write its registers (thread level)
  - Read/write its local memory (thread level)
  - Read/write shared memory (block level)
  - Read/write global memory (grid level)
  - Read only constant memory (grid level)
  - Read only texture memory (grid level)
- The host can read/write the global, constant, and texture memories
- NOTE: the texture, constant, and global memory are persistent across kernels called by the same application
- [Same memory-hierarchy diagram as on the previous slide]
[HK-UIUC]

Access Times
- Registers: dedicated HW, single cycle
- Shared memory: dedicated HW, single cycle
- Local memory: DRAM, no cache, *slow*
- Global memory: DRAM, no cache, *slow*
- Constant memory: DRAM, cached; 1 to 10s to 100s of cycles, depending on cache locality
- Texture memory: DRAM, cached; 1 to 10s to 100s of cycles, depending on cache locality
- Instruction memory (invisible): DRAM, cached
[HK-UIUC]

Compute Capabilities, Things Change Fast...
- [Table summarizing the features of the different compute capabilities; credit: NVIDIA]
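In CUDA's "C with extensions", these memory spaces map onto variable qualifiers. The sketch below is illustrative rather than taken from the slides, and the names (filterCoeff, tile, scaleAndShift) are invented; it shows where typical declarations live and how the host fills constant memory.

    #include <cuda_runtime.h>

    #define TILE 256

    // Constant memory: read-only from kernels, written by the host, cached on chip.
    __constant__ float filterCoeff[4];

    __global__ void scaleAndShift(const float *in, float *out, int n) {
        // Shared memory: one copy per block, visible to all threads of the block.
        __shared__ float tile[TILE];

        // Automatic scalars such as i normally live in registers (per thread);
        // large per-thread arrays may be spilled to (slow) local memory.
        int i = blockIdx.x * blockDim.x + threadIdx.x;

        // Global -> shared (guarded), then synchronize the whole block.
        tile[threadIdx.x] = (i < n) ? in[i] : 0.0f;
        __syncthreads();

        if (i < n)
            out[i] = filterCoeff[0] * tile[threadIdx.x] + filterCoeff[1];
    }

    // Host side: constant memory is filled with cudaMemcpyToSymbol.
    void setupCoefficients(void) {
        float h_coeff[4] = {2.0f, 1.0f, 0.0f, 0.0f};
        cudaMemcpyToSymbol(filterCoeff, h_coeff, sizeof(h_coeff));
    }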
Most Common Programming Pattern [interacting with the device memory space]
- The sequence of steps most commonly used in GPU computing:
  - Step 1: the host allocates memory on the device
  - Step 2: the host copies data onto the device
  - Step 3: the host invokes a kernel that gets executed in parallel and which processes/uses data from the device memory for useful computation
  - Step 4: the host copies the results back from the device

CUDA Device Memory Allocation
- cudaMalloc(): allocates an object in the device global memory; requires two parameters:
  - The address of a pointer to the allocated object
  - The size of the allocated object
- cudaFree(): frees an object from the device global memory; takes the pointer to the freed object
- [Memory-hierarchy diagram, as before]
[HK-UIUC]

CUDA Host-Device Data Transfer
- cudaMemcpy(): memory data transfer; requires four parameters:
  - Pointer to destination
  - Pointer to source
  - Number of bytes copied
  - Type of transfer: host to host, host to device, device to host, or device to device
- Things happen over a PCIe 2.0 x16 connection: basically 8 GB/s each way
- [Memory-hierarchy diagram, as before]
[HK-UIUC]

CUDA Host-Device Data Transfer (cont.)
- Example: transfer "size" bytes; M is in host memory and Md is in device memory; cudaMemcpyHostToDevice and cudaMemcpyDeviceToHost are symbolic constants

    cudaMemcpy(Md.elements, M.elements, size, cudaMemcpyHostToDevice);
    cudaMemcpy(M.elements, Md.elements, size, cudaMemcpyDeviceToHost);
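Putting steps 1-4 and the two API calls together, here is a minimal end-to-end sketch, written for this text rather than lifted from the slides, that allocates device memory, copies an array over, runs a trivial kernel on it, and copies the result back; the kernel name doubleElements and the array names are made up.

    #include <stdio.h>
    #include <cuda_runtime.h>

    __global__ void doubleElements(float *d, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) d[i] = 2.0f * d[i];
    }

    int main(void) {
        const int N = 1024;
        size_t size = N * sizeof(float);

        float hostA[N];
        for (int i = 0; i < N; ++i) hostA[i] = (float)i;

        // Step 1: host allocates memory on the device.
        float *devA = NULL;
        cudaMalloc((void**)&devA, size);

        // Step 2: host copies data onto the device.
        cudaMemcpy(devA, hostA, size, cudaMemcpyHostToDevice);

        // Step 3: host invokes a kernel; 4 blocks of 256 threads cover the 1024 elements.
        doubleElements<<<(N + 255) / 256, 256>>>(devA, N);

        // Step 4: host copies the results back from the device.
        cudaMemcpy(hostA, devA, size, cudaMemcpyDeviceToHost);

        cudaFree(devA);
        printf("hostA[10] = %f (expected 20.0)\n", hostA[10]);
        return 0;
    }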
CUDA GPU Programming ~ Resource Management Considerations ~

What Do I Mean By "Resource Management"?
- The GPU is a resourceful device. What do you have to do to make sure you capitalize on these resources? In other words, how can you ensure that all the SPs are busy all the time?
- To fully exploit the GPU's potential, it is important to consider:
  - How many threads you decide to use
  - What memory requirements are associated with a thread
  - How much shared memory gets allocated/used by one block of threads

Resource Management - The Key Actors: Threads, Warps, Blocks
- A collection of 32 threads makes up a warp; a warp is something virtual, it is how the GPU groups threads together for execution
- A block has at most 512 threads, that is, 16 warps
- Threads are organized in a 3D fashion; each thread has a unique (Tx, Ty, Tz) thread ID
- Threads in a block get to use the shared memory together
- Each block of threads is executed on a single SM
- If you run an application with 100 blocks of threads and your GPU has 16 SMs (the GTX 8800, for instance), chances are each SM will get to execute about 6 or 7 blocks

Resource Management - The Key Actors: Threads, Warps, Blocks [Cntd.]
- A kernel is executed as a grid of blocks; the grid can have up to 65535 x 65535 blocks, and each block has a unique (Bx, By) ID
- The threads that belong to the *same* block can cooperate with each other by:
  - Synchronizing their execution, for hazard-free shared memory accesses
  - Efficiently sharing data through the low-latency shared memory
- Shared memory is allocated per block
- Threads from two different blocks cannot cooperate!!! This has important software design implications
- [Diagram: Kernel 1 launched on Grid 1 and Kernel 2 on Grid 2; a grid is a 2D array of blocks, and each block a 2D/3D array of threads]

Execution Model, Key Observations [1 of 2]
- Each block is executed on *one* Stream Multiprocessor (SM)
- There is no time slicing when executing a block of threads
- Each block is split into warps of threads, executed one at a time by the eight SPs of the SM (time slicing in warp execution is constantly going on)

Execution Model, Key Observations [2 of 2]
- A Stream Multiprocessor can execute multiple blocks concurrently
- Shared memory and registers are partitioned among the threads of all concurrent blocks; decreasing shared memory usage (per block) and register usage (per thread) increases the number of blocks that can run concurrently (very desirable)
- The shared memory "belongs" to the block, not to the threads (which merely use it...); the shared memory space resides in the on-chip shared memory and it "spans" (or encompasses) a thread block

Some Hard Constraints [1 of 2]
- Max number of warps that one SM can service simultaneously: 32 (on the latest generation of GPUs)
- Max number of blocks that one SM can process simultaneously: 8 (it has been like this for a while)

Some Hard Constraints [2 of 2]
- The number of registers available on each SM is limited: 16,384 registers on the latest NVIDIA hardware
- The amount of shared memory available to each SM is limited: 16 KB today

The Concept of Occupancy
- Ideally, you want 32 warps serviced at the same time by one SM; this keeps the SM busy and hides the latencies associated with memory access
- Examples:
  - Two blocks with 512 threads each running together on one SM: 100% occupancy
  - Four blocks of 256 threads each running on one SM: 100% occupancy
  - 16 blocks with 64 threads each: not good, since you cannot have more than 8 blocks running on an SM; effectively this scenario gives you 50% occupancy

The Concept of Occupancy [Cntd.]
- What prevents you from getting high occupancy? Many warps means many threads and possibly many blocks:
  - Many blocks means you cannot have too much shared memory allocated to each one of them (total amount of shared memory in one SM: 16 KB)
  - Many threads means you cannot have too many registers used by each thread (size of the register file in one SM: 16,384 registers)

Examples, Occupancy of HW
- Example 1: if each of your blocks gets assigned 20 KB of shared memory, the kernel will fail to launch; there is not enough memory on the SM to run a block
- Example 2: if each of your blocks uses 5 KB of shared memory, you can have three blocks running on one SM (there will be some shared memory that goes unused)
- Example 3: like Example 2 above, but now you have 512 threads per block and each thread uses 16 registers. Will one SM be able to handle 2 blocks?
  - Total number of registers: 512 x 2 x 16 = 16,384 out of the 16,384 are used, so OK
  - Number of warps: 2 blocks x 512 threads = 1024 threads = 32 warps, so OK on compute capability 1.3 hardware
  - You actually have 100% occupancy, maxed out on registers, and lots of shared memory left
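The reasoning in these examples can be mechanized. Below is a small host-side sketch, written for this text rather than taken from the deck, that applies the limits quoted above (8 blocks/SM, 32 warps/SM, 16 KB of shared memory, 16,384 registers) to estimate how many blocks fit on one SM and the resulting occupancy. The occupancy calculator spreadsheet described on the next slide does the same bookkeeping more carefully (allocation granularities, for instance, are ignored here).

    #include <stdio.h>

    /* Hard limits for compute capability 1.3 hardware, as quoted in the slides. */
    #define MAX_BLOCKS_PER_SM   8
    #define MAX_WARPS_PER_SM    32
    #define SHARED_MEM_PER_SM   (16 * 1024)   /* bytes     */
    #define REGISTERS_PER_SM    16384         /* registers */

    static int minInt(int a, int b) { return a < b ? a : b; }

    /* Rough occupancy estimate: blocks per SM limited by warps, registers,
       shared memory, and the 8-block cap; occupancy = resident warps / 32. */
    float estimateOccupancy(int threadsPerBlock, int regsPerThread, int sharedPerBlock) {
        int warpsPerBlock = (threadsPerBlock + 31) / 32;
        int byWarps = MAX_WARPS_PER_SM / warpsPerBlock;
        int byRegs  = REGISTERS_PER_SM / (threadsPerBlock * regsPerThread);
        int bySmem  = sharedPerBlock ? SHARED_MEM_PER_SM / sharedPerBlock : MAX_BLOCKS_PER_SM;
        int blocks  = minInt(minInt(byWarps, byRegs), minInt(bySmem, MAX_BLOCKS_PER_SM));
        return (float)(blocks * warpsPerBlock) / MAX_WARPS_PER_SM;
    }

    int main(void) {
        /* Example 3 above: 512 threads/block, 16 registers/thread, 5 KB of shared memory. */
        printf("Occupancy: %.0f%%\n", 100.0f * estimateOccupancy(512, 16, 5 * 1024));
        return 0;
    }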
Resource Utilization
- There is an "occupancy calculator" that can tell you what percentage of the HW gets utilized by your kernel
- It takes the form of an Excel spreadsheet and requires the following input: threads per block, registers per thread, and shared memory per block
- Google "occupancy calculator cuda" to access it

CUDA GPU Code Development

Code Development Support
- How do I compile? How do I link? How do I debug? How do I profile?

The CUDA Way: Extended C
- Declaration specifications: global, device, shared, local, constant
- Keywords: threadIdx, blockIdx
- Intrinsics: __syncthreads()
- Runtime API: for memory, symbol, and execution management
- Kernel launch syntax
- Example:

    __device__ float filter[N];

    __global__ void convolve(float *image) {
        __shared__ float region[M];
        ...
        region[threadIdx.x] = image[i];
        __syncthreads();
        ...
        image[j] = result;
    }

    // Allocate GPU memory
    void *myimage = cudaMalloc(bytes)

    // 100 blocks, 10 threads per block
    convolve<<<100, 10>>>(myimage);

[HK-UIUC]

Compiling CUDA
- nvcc is the compile driver for a C/C++ CUDA application; it invokes cudacc, gcc, cl, etc.
- It separates the CPU code from the device code and produces PTX (Parallel Thread eXecution) code, which is like assembly language, for example:

    ld.global.v4.f32  {$f1,$f3,$f5,$f7}, [$r9+0];
    mad.f32           $f1, $f5, $f3, $f1;

- A PTX-to-target compiler then generates the GPU target code (G80, ...)
[Courtesy NVIDIA]

More on the nvcc Compiler
File suffixes and how the nvcc compiler interprets them:
- .cu: CUDA source file, containing host and device code
- .cup: preprocessed CUDA source file, containing host code and device functions
- .c: C source file
- .cc, .cxx, .cpp: C++ source file
- .gpu: GPU intermediate file (device code only)
- .ptx: PTX intermediate assembly file (device code only)
- .cubin: CUDA device-only binary file

Compiling CUDA Extended C
- [Diagram of the nvcc compilation trajectory]
- http://sbel.wisc.edu/Courses/ME964/2008/Documents/nvccCompilerInfo.pdf

Gauging Memory Use on the GPU
- Compile with the "-keep" flag and investigate the resulting .cubin file; the lmem, smem, and reg fields report the local memory, shared memory, and registers used by each kernel. For example:

    architecture {sm_10}
    abiversion {1}
    modname {cubin}
    code {
        name = _Z21MatVecMulKernelShared6Matrix6VectorS0_
        lmem = 0
        smem = 1068
        reg = 8
        bar = 1
        const {
            segname = const
            segnum = 1
            offset = 0
            bytes = 8
            mem {
                0x000000ff 0x0000042c
            }
        }
        bincode {
            0x10004209 0x0023c780 0xa000000d 0x04000780
            0x1000c801 0x0423c780 0x301fce11 0xec300780
            ...
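Before turning to the emulation mode discussed next, a cheap debugging habit, not covered in the slides, is to check the error codes the runtime already returns: every cudaMalloc/cudaMemcpy returns a cudaError_t, and cudaGetLastError reports whether the most recent kernel launch failed (for example, because the execution configuration asked for too many threads or too much shared memory). The sketch below deliberately forces such a failure to show the mechanism; checkCuda and emptyKernel are invented names.

    #include <stdio.h>
    #include <stdlib.h>
    #include <cuda_runtime.h>

    // Abort with a readable message if a runtime call did not succeed.
    static void checkCuda(cudaError_t err, const char *what) {
        if (err != cudaSuccess) {
            fprintf(stderr, "CUDA error during %s: %s\n", what, cudaGetErrorString(err));
            exit(EXIT_FAILURE);
        }
    }

    __global__ void emptyKernel(void) { }

    int main(void) {
        float *devBuf = NULL;
        checkCuda(cudaMalloc((void**)&devBuf, 1024 * sizeof(float)), "cudaMalloc");

        // Deliberately illegal on the compute capability 1.3 hardware discussed
        // here: more than the 512-threads-per-block limit.
        emptyKernel<<<1, 1024>>>();
        checkCuda(cudaGetLastError(), "kernel launch");   // reports the invalid configuration

        checkCuda(cudaFree(devBuf), "cudaFree");
        return 0;
    }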
Debugging Using the Device Emulation Mode
- An executable compiled in device emulation mode (nvcc -deviceemu) runs entirely on the host using the CUDA runtime:
  - No device or CUDA driver is needed
  - Each device thread is emulated with a host thread
- In a Developer Studio project, select the "EmuDebug" or "EmuRelease" build configuration
- When running in device emulation mode, one can:
  - Use the host's native debug support (breakpoints, variable QuickWatch and edit, etc.)
  - Access any device-specific data from host code and vice versa
  - Call any host function from device code (e.g., printf) and vice versa
  - Detect deadlock situations caused by improper usage of __syncthreads

Device Emulation Mode Pitfalls [1/3]
- Emulated device threads execute sequentially, so simultaneous accesses of the same memory location by multiple threads could produce different results
[HK-UIUC]

Device Emulation Mode Pitfalls [2/3]
- Dereferencing device pointers on the host or host pointers on the device can produce correct results in device emulation mode, but will generate an error in device execution mode
[HK-UIUC]

Device Emulation Mode Pitfalls [3/3]
- Results of floating-point computations will differ slightly because of:
  - Different compiler outputs and instruction sets
  - The use of extended precision for intermediate results
- There are various options to force strict single precision on the host
[HK-UIUC]

Concluding Remarks

GPU Computing in Engineering
- Who stands to benefit in the engineering community? FEA, Monte Carlo, molecular dynamics, granular dynamics, image processing, agent-based modeling, ...
- Generally, any application that fits the SIMD paradigm

[Figure: reported speedups of 50x-150x across application areas (credit: NVIDIA Corporation): 146X medical imaging (U of Utah), 36X molecular dynamics (U of Illinois, Urbana), 18X video transcoding (Elemental Tech), 50X MATLAB computing (AccelerEyes), 100X astrophysics (RIKEN), 149X financial simulation (Oxford), 47X linear algebra (Universidad Jaime), 20X 3D ultrasound (Techniscan), 130X quantum chemistry (U of Illinois, Urbana), 30X gene sequencing (U of Maryland)]

A Word on HPC Beyond the GPU
- We are witnessing a very momentous transformation: the shift from sequential to parallel computing
- The support for parallel computing is very homogeneous in structure
- The GPU is not alone in this race to capitalize on parallel computing for scientific applications

Parallel Computing, SW Side...
- Other options for leveraging parallel computing in scientific applications:
  - Threads (POSIX, Windows)
  - OpenMP
  - The MPI standard (see the MPICH implementation)
  - Intel's Threading Building Blocks (TBB) library
  - The OpenCL standard for heterogeneous computing; AMD and NVIDIA have provided implementations, with Apple to follow up shortly

Parallel Computing, HW Side...
- Hardware options for HPC:
  - GPU (NVIDIA)
  - The "fusion" idea (Intel's Larrabee, AMD's Fusion)
  - Cell blades
  - Cluster computing (IBM's BlueGene/P, Q, ...)
  - Cloud computing

Sources of Information, GPU Computing
- Read, in this order:
  - NVIDIA CUDA Development Tools 2.3: Getting Started (short doc, July 09)
  - NVIDIA CUDA Programming Guide 2.3 (July 09)
  - NVIDIA CUDA C Programming Best Practices Guide 2.3 (short doc, July 09)
  - NVIDIA CUDA Reference Manual 2.3 (comprehensive, July 09)
- Lots of very good examples come with the CUDA SDK distribution:
  - More than 25 applications ready to compile/run
  - Makefiles available, ready for use
  - Lots of good code available for reuse, plus templates for applications
- Online material:
  - NVIDIA website: code available for many application fields
  - Libraries: thrust (http://code.google.com/p/thrust/), cudpp (http://gpgpu.org/developer/cudpp)
  - Course on GPU programming: http://sbel.wisc.edu/Courses/ME964/2008/index.htm

Conclusions
- We are in the middle of a shift to parallel computing; the hardware changes at a higher pace
- CUDA is a bright spot in a software landscape that is otherwise pretty bleak
- GPU computing is not the silver bullet, but for the right application the GPU can deliver amazing benefits at small time and financial investments
- In general, investing in parallel programming skills is bound to pay off

Thank You.
Review, Execution Model
- Move data to the device, launch the kernel, transfer the relevant data back to the host
- The kernel is a C function executed on the device
- Each thread executes the kernel; this happens in parallel

Review, Key Concepts
- Kernel = GPU program, executed by each parallel thread in a block
- Block = a 3D collection of threads that can cooperate in using the block's shared memory and can synchronize during execution
- Grid = 2D array of blocks of threads that execute a kernel
- Device = GPU = set of Stream Multiprocessors (30 SMs on the Tesla C1060)
- Stream Multiprocessor = 8 scalar processors + shared memory + registers

    Memory     Location   Cached            Access       Who
    Local      Off-chip   No                Read/write   One thread
    Shared     On-chip    N/A (resident)    Read/write   All threads in a block
    Global     Off-chip   No                Read/write   All threads + host
    Constant   Off-chip   Yes               Read         All threads + host
    Texture    Off-chip   Yes               Read         All threads + host

Off-chip means on the device; i.e., slow access time.

Performance Vision of the Future
- [Chart: performance over time, showing a growing gap between the frequency era (before ~2007) and the multi-core era; "SD" stands for software development; from a presentation by Paul Petersen, Sr. Principal Engineer, Intel]
- "Parallelism for Everyone": parallelism changes the game. A large percentage of people who provide applications are going to have to care about parallelism in order to match the capabilities of their competitors. Competitive pressures = demand for parallel applications.