GPU Architecture & Implications
David Luebke, NVIDIA Research

GPU Architecture
CUDA provides a parallel programming model. The Tesla GPU architecture implements it. This talk describes the characteristics, goals, and implications of that architecture.

G80 GPU Implementation: Tesla C870
681 million transistors, 470 mm2 in 90 nm CMOS
128 thread processors, 518 GFLOPS peak, 1.35 GHz processor clock
1.5 GB DRAM, 76 GB/s peak, 800 MHz GDDR3 clock, 384-pin DRAM interface
ATX form factor card, PCI Express x16, 170 W max with DRAM

Block Diagram Redux
G80 (launched Nov 2006)
128 thread processors execute kernel threads
Up to 12,288 parallel threads active
Per-block shared memory (PBSM) accelerates processing
[Block diagram: Host -> Input Assembler -> Thread Execution Manager -> groups of thread processors, each group with its own PBSM; load/store path to Global Memory]

Streaming Multiprocessor (SM)
Processing elements
8 scalar thread processors (SP)
32 GFLOPS peak at 1.35 GHz
8192 32-bit registers (32 KB); 1/2 MB of register file space in total across the chip!
usual ops: float, int, branch, ...
Hardware multithreading
up to 8 blocks resident at once
up to 768 active threads in total
Shared memory
16 KB on-chip memory, low-latency storage
shared amongst the threads of a block
supports thread communication

Goal: Scalability
Scalable execution
Program must be insensitive to the number of cores
Write one program for any number of SM cores
Program runs on any size GPU without recompiling
Hierarchical execution model
Decompose problem into sequential steps (kernels)
Decompose kernel into parallel blocks
Decompose block into parallel threads
Hardware distributes independent blocks to SMs as available

Blocks Run on Multiprocessors
Kernel launched by host; the hardware distributes its blocks across the device's array of multiprocessors (each with an MT issue unit, SPs, and shared memory), all connected to device memory; see the sketch below.
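To make the scalability point concrete, here is a minimal CUDA sketch, added for illustration rather than taken from the talk; the kernel name and sizes are hypothetical. The kernel is written purely in terms of blocks and threads, the grid size is derived from the problem size, and the hardware maps blocks onto however many SMs the particular GPU has.

// Hypothetical example: same kernel and launch code for any GPU size.
__global__ void scaleArray(float *data, float s, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;   // global thread index
    if (i < n)                                       // guard the partial last block
        data[i] *= s;
}

// Host side: only the grid dimension depends on the problem size n.
// int threadsPerBlock = 256;
// int blocks = (n + threadsPerBlock - 1) / threadsPerBlock;
// scaleArray<<<blocks, threadsPerBlock>>>(d_data, 2.0f, n);

Whether the GPU has 2 SMs or 16, the hardware simply schedules the independent blocks onto whatever multiprocessors are available.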
Goal: Easy to Program
Strategies:
Familiar programming language mechanics: C/C++ with small extensions
Simple parallel abstractions: simple barrier synchronization, shared memory semantics, hardware-managed hierarchy of threads

Hardware Multithreading
Hardware allocates resources to blocks
blocks need: thread slots, registers, shared memory
blocks don't run until resources are available
Hardware schedules threads
threads have their own registers
any thread not waiting for something can run
context switching is (basically) free: it can happen every cycle
Hardware relies on threads to hide latency
i.e., parallelism is necessary for performance

Goal: Performance per Millimeter
For GPUs, performance == throughput
Strategy: hide latency with computation, not cache
Heavy multithreading (already discussed by Kevin)
Implication: need many threads to hide latency
Occupancy: typically need 128 threads/SM minimum
Multiple thread blocks per SM help minimize the effect of barriers
Strategy: Single Instruction Multiple Thread (SIMT)
Balances performance with ease of programming

SIMT Thread Execution
Groups of 32 threads are formed into warps
always executing the same instruction
shared instruction fetch/dispatch
some threads become inactive when code paths diverge
hardware automatically handles divergence
Warps are the primitive unit of scheduling
pick 1 of 24 warps for each instruction slot
SIMT execution is an implementation choice
sharing control logic leaves more space for ALUs
largely invisible to the programmer
must understand it for performance, not correctness

SIMT Multithreaded Execution
Weaving, the original parallel thread technology, is about 10,000 years old
Warp: a set of 32 parallel threads that execute a SIMD instruction
[Figure: SM multithreaded instruction scheduler interleaving instructions from different warps over time]
SM hardware implements zero-overhead warp and thread scheduling
Each SM executes up to 768 concurrent threads, as 24 SIMD warps of 32 threads
Threads can execute independently
SIMD warps automatically diverge and converge when threads branch
Best efficiency and performance when the threads of a warp execute together
SIMT across threads (not just SIMD across data) gives easy single-thread scalar programming with SIMD efficiency

Memory Architecture
Direct load/store access to device memory
treated as the usual linear sequence of bytes (i.e., not pixels)
Texture & constant caches are read-only access paths
On-chip shared memory
shared amongst the threads of a block
important for communication amongst threads
provides low-latency temporary storage (~100x lower latency than DRAM)
[Figure: SM with instruction cache, MT issue unit, SPs, and shared memory; texture and constant caches; device memory; host memory over PCIe]
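As a concrete illustration of block-level communication through shared memory, here is a small sketch added for this writeup (not from the talk); the kernel name, array names, and sizes are hypothetical, and it assumes the launch uses exactly BLOCK_SIZE threads per block.

// Each thread stages one element in per-block shared memory, synchronizes,
// then reads a value written by a neighboring thread in the same block.
#define BLOCK_SIZE 256

__global__ void adjacentDiff(const float *in, float *out, int n)
{
    __shared__ float tile[BLOCK_SIZE];           // per-block shared memory
    int i = blockIdx.x * blockDim.x + threadIdx.x;

    if (i < n)
        tile[threadIdx.x] = in[i];               // each thread writes one element
    __syncthreads();                             // barrier: writes visible to whole block

    if (i < n) {
        // read the neighbor's element from low-latency shared memory instead of DRAM
        float left = (threadIdx.x > 0) ? tile[threadIdx.x - 1]
                                       : (i > 0 ? in[i - 1] : 0.0f);
        out[i] = tile[threadIdx.x] - left;
    }
}

The barrier is what makes the shared-memory semantics simple: after __syncthreads(), every thread in the block can safely read what any other thread in the block wrote.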
Myths of GPU Computing
Myth: GPUs layer normal programs on top of graphics. NO: CUDA compiles directly to the hardware.
Myth: GPU architectures are very wide (1000s) SIMD machines. NO: warps are 32-wide.
Myth: Branching is impossible or prohibitive. NOPE.
Myth: GPUs have 4-wide vector registers. NO: scalar thread processors.
Myth: GPUs are power-inefficient. NO: a 4-10x perf/W advantage, with up to 89x reported in some studies.
Myth: GPUs don't do real floating point. See the comparison below.

GPU Floating Point Features
Feature | G80 | SSE | IBM Altivec | Cell SPE
Precision | IEEE 754 | IEEE 754 | IEEE 754 | IEEE 754
Rounding modes for FADD and FMUL | Round to nearest and round to zero | All 4 IEEE modes: nearest, zero, +inf, -inf | Round to nearest only | Round to zero/truncate only
Denormal handling | Flush to zero | Supported, 1000s of cycles | Supported, 1000s of cycles | Flush to zero
NaN support | Yes | Yes | Yes | No
Overflow and infinity support | Yes, only clamps to max norm | Yes | Yes | No, infinity
Flags | No | Yes | Yes | Some
Square root | Software only | Hardware | Software only | Software only
Division | Software only | Hardware | Software only | Software only
Reciprocal estimate accuracy | 24 bit | 12 bit | 12 bit | 12 bit
Reciprocal sqrt estimate accuracy | 23 bit | 12 bit | 12 bit | 12 bit
log2(x) and 2^x estimates accuracy | 23 bit | No | 12 bit | No

Do GPUs Do Real IEEE FP?
G8x GPU FP is IEEE 754
Comparable to other processors / accelerators
More precise / usable in some ways
Less precise in other ways
GPU FP is getting better every generation
Double precision support shortly
Goal: best of class by 2009

Questions?
David Luebke dluebke@nvidia.com Applications & Sweet Spots GPU Computing Sweet Spots Applications: High arithmetic intensity: Dense linear algebra, PDEs, n-body, finite difference, … High bandwidth: Sequencing (virus scanning, genomics), sorting, database… Visual computing: Graphics, image processing, tomography, machine vision… © NVIDIA Corporation 2007 GPU Computing Example Markets Computational Geoscience Computational Chemistry Computational Medicine Computational Modeling Computational Science Computational Biology Computational Finance © NVIDIA Corporation 2007 Image Processing Applications - Condensed 3D image analysis Adaptive radiation therapy Acoustics Astronomy Audio Automobile vision Bioinfomatics Biological simulation Broadcast Cellular automata Computational Fluid Dynamics Computer Vision Cryptography CT reconstruction Data Mining Digital cinema/projections Electromagnetic simulation Equity training © NVIDIA Corporation 2007 Film Financial - lots of areas Languages GIS Holographics cinema Imaging (lots) Mathematics research Military (lots) Mine planning Molecular dynamics MRI reconstruction Multispectral imaging nbody Network processing Neural network Oceanographic research Optical inspection Particle physics Protein folding Quantum chemistry Ray tracing Radar Reservoir simulation Robotic vision/AI Robotic surgery Satellite data analysis Seismic imaging Surgery simulation Surveillance Ultrasound Video conferencing Telescope Video Visualization Wireless X-ray GPU Computing Sweet Spots From cluster to workstation The “personal supercomputing” phase change From lab to clinic From machine room to engineer, grad student desks From batch processing to interactive From interactive to real-time GPU-enabled clusters A 100x or better speedup changes the science Solve at different scales Direct brute-force methods may outperform cleverness New bottlenecks may emerge Approaches once inconceivable may become practical © NVIDIA Corporation 2007 New Applications Real-time options implied volatility engine Ultrasound imaging Swaption volatility cube calculator HOOMD Molecular Dynamics Manifold 8 GIS Also… Image rotation/classification Graphics processing toolbox Microarray data analysis Data parallel primitives © NVIDIA Corporation 2007 Astrophysics simulations SDK: Mandelbrot, computer vision Seismic migration The Future of GPUs GPU Computing drives new applications Reducing “Time to Discovery” 100x Speedup changes science and research methods New applications drive the future of GPUs and GPU Computing Drives new GPU capabilities Drives hunger for more performance Some exciting new domains: Vision, acoustic, and embedded applications Large-scale simulation & physics © NVIDIA Corporation 2007 Accuracy & Performance GPU Floating Point Features G80 SSE IBM Altivec Cell SPE Precision IEEE 754 IEEE 754 IEEE 754 IEEE 754 Rounding modes for FADD and FMUL Round to nearest and round to zero All 4 IEEE, round to nearest, zero, inf, -inf Round to nearest only Round to zero/truncate only Denormal handling Flush to zero Supported, 1000’s of cycles Supported, 1000’s of cycles Flush to zero NaN support Yes Yes Yes No Overflow and Infinity support Yes, only clamps to max norm Yes Yes No, infinity Flags No Yes Yes Some Square root Software only Hardware Software only Software only Division Software only Hardware Software only Software only Reciprocal estimate accuracy 24 bit 12 bit 12 bit 12 bit Reciprocal sqrt estimate accuracy 23 bit 12 bit 12 bit 12 bit log2(x) and 2^x estimates accuracy 23 bit No 12 bit 
No

Do GPUs Do Real IEEE FP?
G8x GPU FP is IEEE 754
Comparable to other processors / accelerators
More precise / usable in some ways
Less precise in other ways
GPU FP is getting better every generation
Double precision support shortly
Goal: best of class by 2009

CUDA Performance Advantages
Performance:
BLAS1: 60+ GB/sec
BLAS3: 127 GFLOPS
FFT: 52 benchFFT* GFLOPS
FDTD: 1.2 Gcells/sec
SSEARCH: 5.2 Gcells/sec
Black-Scholes: 4.7 GOptions/sec
VMD: 290 GFLOPS
How: leveraging shared memory, GPU memory bandwidth, GPU GFLOPS performance, and custom hardware intrinsics: __sinf(), __cosf(), __expf(), __logf(), ...
All benchmarks are compiled code!

GPGPU vs. GPU Computing
Problem: GPGPU
OLD: GPGPU tricks the GPU into general-purpose computing by casting the problem as graphics
Turn data into images ("texture maps")
Turn algorithms into image synthesis ("rendering passes")
Promising results, but:
Tough learning curve, particularly for non-graphics experts
Potentially high overhead of the graphics API
Highly constrained memory layout & access model
Need for many passes drives up bandwidth consumption

Solution: CUDA
NEW: GPU Computing with CUDA
CUDA = Compute Unified Device Architecture
Co-designed hardware & software for direct GPU computing
Hardware: fully general data-parallel architecture
general thread launch, global load-store, parallel data cache, scalar architecture, integers and bit operations, double precision (soon)
Software: program the GPU in C
scalable data-parallel execution/memory model
C with minimal yet powerful extensions

Graphics Programming Model
[Figure: Graphics application -> Vertex Program -> Rasterization -> Fragment Program -> Display]

Streaming GPGPU Programming: OpenGL Program to Add A and B
Start by creating a quad
"Programs" created with raster operations
Read textures as input to the OpenGL shader program
Write the answer to texture memory as a "color"
CPU reads texture memory for the results
All this just to do A + B

[Figure: the GPGPU model: Application -> Vertex Program -> Rasterization -> Pixel Program -> Display, with input, temp, and output registers, textures, and constants]
What's Wrong With GPGPU?
APIs are specific to graphics Application Vertex Program Rasterization Input Registers Limited texture size and dimension Fragment Program Limited instruction set No thread communication Fragment Program Texture Constants Temp Registers Limited local storage Display Output Registers Limited shader outputs © NVIDIA Corporation 2007 No scatter Building a Better Pixel Input Registers Texture Fragment Program Constants Registers Output Registers © NVIDIA Corporation 2007 Building a Better Pixel Thread Features Millions of instructions Full Integer and Bit instructions Thread Number No limits on branching, looping 1D, 2D, or 3D thread ID allocation Texture Thread Program Constants Registers Output Registers © NVIDIA Corporation 2007 Global Memory Features Fully general load/store to GPU memory Thread Number Untyped, not fixed texture types Pointer support Texture Thread Program Constants Registers Global Memory © NVIDIA Corporation 2007 Parallel Data Cache Features Dedicated on-chip memory Thread Number Shared between threads for inter-thread communication Explicitly managed As fast as registers Texture Thread Program Constants Registers Parallel Data Cache Global Memory © NVIDIA Corporation 2007 Example Algorithm - Fluids Goal: Calculate PRESSURE in a fluid Pressure = Sum of neighboring pressures Pn’ = P1 + P2 + P3 + P4 So the pressure for each particle is… Pressure1 = P1 + P2 + P3 + P4 Pressure2 = P3 + P4 + P5 + P6 Pressure depends on neighbors © NVIDIA Corporation 2007 Pressure3 = P5 + P6 + P7 + P8 Pressure4 = P7 + P8 + P9 + P10 Example Fluid Algorithm CPU Control Cache GPGPU DRAM P1 Pn’=P1+P2+P3+P4P2 P3 P4 Thread Execution Manager Control Pn’=P1+P2+P3+P4 P1, P2 P3, P4 ALU ALU Pn’=P1+P2+P3+P4 Control P1,P2 P3,P4 Control Pn’=P1+P2+P3+P4 Video Memory Single thread out of cache Parallel Data Cache Control ALU AL U GPU Computing with CUDA Control ALU ALU P1,P2 P3,P4 ALU Pn’=P1+P2+P3+P4 Shared Data P1 P2 P3 P4 P5 DRAM Pn’=P1+P2+P3+P4 Control Data/Computation Multiple passes through video memory ALU ALU Pn’=P1+P2+P3+P4 Program/Control © NVIDIA Corporation 2007 Parallel execution through cache Parallel Data Cache Bring the data closer to the ALU • • Addresses a fundamental problem of stream computing: The data are far from the FLOPS, video RAM latency is high Threads can only communicate their results through this high latency RAM GPGPU Control ALU Pn’=P1+P2+P3+P4 P1, P2 P3, P4 Control ALU Pn’=P1+P2+P3+P4 P1,P 2 P3,P 4 Video Memory Control ALU ALU P1,P2 P3,P4 Pn’=P1+P2+P3+P4 Multiple passes through video memory © NVIDIA Corporation 2007 Parallel Data Cache Bring the data closer to the ALU Thread Execution Manager Parallel Data Cache Control Stage computation for the parallel data cache Minimize trips to external memory Share values to minimize overfetch and computation Increases arithmetic intensity by keeping data close to the processors User managed generic memory, threads read/write arbitrarily ALU Pn’=P1+P2+P3+P4 Control ALU Pn’=P1+P2+P3+P4 Shared Data P1 P2 P3 P4 P5 DRAM Control ALU ALU Pn’=P1+P2+P3+P4 © NVIDIA Corporation 2007 Parallel execution through cache Streaming vs. 
GPU Computing Streaming GPGPU Gather in, Restricted write Memory is far from ALU No inter-element communication GPU Computing with CUDA ALU More general data parallel model CUDA Full Scatter / Gather PDC brings the data closer to the ALU App decides how to decompose the problem across threads Share and communicate between threads to solve problems efficiently ALU © NVIDIA Corporation 2007 GPU Design CPU/GPU Parallelism Moore’s Law gives you more and more transistors What do you want to do with them? CPU strategy: make the workload (one compute thread) run as fast as possible Tactics: – Cache (area limiting) – Instruction/Data prefetch – Speculative execution limited by “perimeter” – communication bandwidth …then add task parallelism…multi-core GPU strategy: make the workload (as many threads as possible) run as fast as possible Tactics: – Parallelism (1000s of threads) – Pipelining limited by “area” – compute capability © NVIDIA Corporation 2007 Background: Unified Design Unified Design Discrete Design Shader A Shader B ibuffer ibuffer ibuffer ibuffer Shader Core Shader C Shader D © NVIDIA Corporation 2007 obuffer obuffer obuffer obuffer Hardware Implementation: Collection of SIMT Multiprocessors Each multiprocessor is a set of SIMT thread processors Device Multiprocessor N Single Instruction Multiple Thread Multiprocessor 2 Each thread processor has: Multiprocessor 1 program counter, register file, etc. scalar data path read/write memory access Processor 1 Processor 2 … Instruction Unit Processor M Unit of SIMT execution: warp execute same instruction/clock Hardware handles thread scheduling and divergence transparently Warps enable a friendly data-parallel programming model! © NVIDIA Corporation 2007 Hardware Implementation: Memory Architecture Device Multiprocessor N The device has local device memory Can be read and written by the host and by the multiprocessors Each multiprocessor has: A set of 32-bit registers per processor on-chip shared memory A read-only constant cache A read-only texture cache © NVIDIA Corporation 2007 Multiprocessor 2 Multiprocessor 1 Shared Memory Registers Processor 1 Registers Processor 2 Registers … Instruction Unit Processor M Constant Cache Texture Cache Device memory Hardware Implementation: Memory Model Grid Each thread can: Read/write per-block onchip shared memory Read per-grid cached constant memory Read/write non-cached device memory: Per-grid global memory Per-thread local memory Read cached texture memory Block (0, 0) Shared Memory Shared Memory Registers Registers Registers Registers Thread (0, 0) Thread (1, 0) Thread (0, 0) Thread (1, 0) Local Memory Local Memory Local Memory Local Memory Global Memory Constant Memory Texture Memory © NVIDIA Corporation 2007 Block (1, 0) CUDA Programming CUDA SDK Libraries:FFT, BLAS,… Example Source Code Integrated CPU and GPU C Source Code NVIDIA C Compiler NVIDIA Assembly for Computing CUDA Driver Debugger Profiler GPU © NVIDIA Corporation 2007 CPU Host Code Standard C Compiler CPU CUDA: Features available to kernels Standard mathematical functions sinf, powf, atanf, ceil, etc. Built-in vector types float4, int4, uint4, etc. 
for dimensions 1..4
Texture accesses in kernels:
texture<float,2> my_texture;               // declare texture reference
float4 texel = texfetch(my_texture, u, v);

G8x CUDA = C with Extensions
Philosophy: provide the minimal set of extensions necessary to expose the hardware's power

Function qualifiers:
__global__ void MyKernel() { }
__device__ float MyDeviceFunc() { }

Variable qualifiers:
__constant__ float MyConstantArray[32];
__shared__   float MySharedArray[32];

Execution configuration:
dim3 dimGrid(100, 50);                 // 5000 thread blocks
dim3 dimBlock(4, 8, 8);                // 256 threads per block
MyKernel<<<dimGrid, dimBlock>>>(...);  // launch kernel

Built-in variables and functions valid in device code:
dim3 gridDim;           // grid dimension
dim3 blockDim;          // block dimension
dim3 blockIdx;          // block index
dim3 threadIdx;         // thread index
void __syncthreads();   // thread synchronization

CUDA: Runtime Support
Explicit memory allocation returns pointers to GPU memory: cudaMalloc(), cudaFree()
Explicit memory copy for host <-> device, device <-> device: cudaMemcpy(), cudaMemcpy2D(), ...
Texture management: cudaBindTexture(), cudaBindTextureToArray(), ...
OpenGL & DirectX interoperability: cudaGLMapBufferObject(), cudaD3D9MapVertexBuffer(), ...

Example: Adding Matrices with 2D Grids

CPU C program:
void addMatrix(float *a, float *b, float *c, int N)
{
    int i, j, index;
    for (i = 0; i < N; i++) {
        for (j = 0; j < N; j++) {
            index = i + j * N;
            c[index] = a[index] + b[index];
        }
    }
}

void main()
{
    .....
    addMatrix(a, b, c, N);
}

CUDA C program:
__global__ void addMatrix(float *a, float *b, float *c, int N)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int j = blockIdx.y * blockDim.y + threadIdx.y;
    int index = i + j * N;
    if (i < N && j < N)
        c[index] = a[index] + b[index];
}

void main()
{
    .....
    // allocate & transfer data to GPU
    dim3 dimBlk(blocksize, blocksize);
    dim3 dimGrd(N / dimBlk.x, N / dimBlk.y);
    addMatrix<<<dimGrd, dimBlk>>>(a, b, c, N);
}
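The "// allocate & transfer data to GPU" step is elided on the slide. Below is a sketch of what it might look like, using only the runtime calls listed above (cudaMalloc, cudaMemcpy, cudaFree); the helper name launchAddMatrix and the 16x16 block shape are assumptions, not part of the talk.

// Hypothetical host-side wrapper for the addMatrix kernel above.
#include <cuda_runtime.h>

void launchAddMatrix(const float *h_a, const float *h_b, float *h_c, int N)
{
    size_t bytes = (size_t)N * N * sizeof(float);
    float *d_a, *d_b, *d_c;
    cudaMalloc((void**)&d_a, bytes);
    cudaMalloc((void**)&d_b, bytes);
    cudaMalloc((void**)&d_c, bytes);

    cudaMemcpy(d_a, h_a, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, h_b, bytes, cudaMemcpyHostToDevice);

    dim3 dimBlk(16, 16);                          // 256 threads per block
    dim3 dimGrd((N + dimBlk.x - 1) / dimBlk.x,    // round the grid up so every
                (N + dimBlk.y - 1) / dimBlk.y);   // element is covered
    addMatrix<<<dimGrd, dimBlk>>>(d_a, d_b, d_c, N);

    cudaMemcpy(h_c, d_c, bytes, cudaMemcpyDeviceToHost);
    cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
}

Note that the slide's dimGrd(N/dimBlk.x, N/dimBlk.y) assumes N is a multiple of the block size; rounding up together with the kernel's in-bounds check removes that assumption.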
Example: Vector Addition Kernel
// Compute vector sum C = A+B
// Each thread performs one pair-wise addition
__global__ void vecAdd(float* A, float* B, float* C)
{
    int i = threadIdx.x + blockDim.x * blockIdx.x;
    C[i] = A[i] + B[i];
}

Example: Invoking the Kernel
__global__ void vecAdd(float* A, float* B, float* C);

void main()
{
    // Execute on N/256 blocks of 256 threads each
    vecAdd<<<N/256, 256>>>(d_A, d_B, d_C);
}

Example: Host Code for Memory
// allocate host (CPU) memory
float* h_A = (float*) malloc(N * sizeof(float));
float* h_B = (float*) malloc(N * sizeof(float));
… initialize h_A and h_B …

// allocate device (GPU) memory
float *d_A, *d_B, *d_C;
cudaMalloc((void**) &d_A, N * sizeof(float));
cudaMalloc((void**) &d_B, N * sizeof(float));
cudaMalloc((void**) &d_C, N * sizeof(float));

// copy host memory to device
cudaMemcpy(d_A, h_A, N * sizeof(float), cudaMemcpyHostToDevice);
cudaMemcpy(d_B, h_B, N * sizeof(float), cudaMemcpyHostToDevice);

// execute the kernel on N/256 blocks of 256 threads each
vecAdd<<<N/256, 256>>>(d_A, d_B, d_C);

A Quick Review
device = GPU = set of multiprocessors
multiprocessor = set of processors & shared memory
kernel = GPU program
grid = array of thread blocks that execute a kernel
thread block = group of SIMD threads that execute a kernel and can communicate via shared memory

Memory | Location | Cached | Access | Who
Local | Off-chip | No | Read/write | One thread
Shared | On-chip | N/A (resident) | Read/write | All threads in a block
Global | Off-chip | No | Read/write | All threads + host
Constant | Off-chip | Yes | Read | All threads + host
Texture | Off-chip | Yes | Read | All threads + host

Data-Parallel Programming: Scan Literature
Pre-Hibernation:
First proposed in APL by Iverson (1962)
Used as a data-parallel primitive in the Connection Machine (1990)
Feature of C* and CM-Lisp
Guy Blelloch used scan as a primitive for various parallel algorithms; his balanced-tree scan is used in the example here
Blelloch, 1990, "Prefix Sums and Their Applications"
Post-Democratization:
O(n log n) work GPU implementation by Daniel Horn (GPU Gems 2)
Applied to summed area tables by Hensley et al. (EG05)
O(n) work GPU scan by Sengupta et al. (EDGE06) and Greß et al. (EG06)
O(n) work & space GPU implementation by Harris et al. (2007), in the NVIDIA CUDA SDK and GPU Gems 3
Applied to radix sort, stream compaction, and summed area tables

Parallel Reduction Complexity
log(N) parallel steps; step S does N/2^S independent ops
Step complexity is O(log N)
For N = 2^D, the sum over S in [1..D] of 2^(D-S) gives N-1 operations
Work complexity is O(N); it is work-efficient, i.e. it does not perform more operations than a sequential algorithm
With P threads physically in parallel (P processors), time complexity is O(N/P + log N)
Compare to O(N) for sequential reduction
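For reference, here is a baseline block-level reduction that the "unrolling" optimizations below start from. It is a sketch written to match the names used in those snippets (data, t, bd) rather than code taken verbatim from the talk, and it assumes a power-of-two block size and a dynamic shared-memory launch.

// Each block reduces bd elements held in shared memory to one partial sum.
__global__ void reduce(int *g_idata, int *g_odata)
{
    extern __shared__ int data[];                 // bd ints, sized at launch time
    unsigned int t  = threadIdx.x;
    unsigned int bd = blockDim.x;
    unsigned int i  = blockIdx.x * bd + t;

    data[t] = g_idata[i];                         // load one element per thread
    __syncthreads();

    // log2(bd) steps; step s keeps s threads active, each doing one add
    for (unsigned int s = bd / 2; s > 0; s >>= 1) {
        if (t < s)
            data[t] += data[t + s];
        __syncthreads();                          // wait for every add in this step
    }

    if (t == 0)
        g_odata[blockIdx.x] = data[0];            // one partial sum per block
}

A launch would look like reduce<<<numBlocks, bd, bd * sizeof(int)>>>(d_in, d_partial); the per-block partial sums can then be reduced again or summed on the host.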
Unrolling Last Steps
Only one warp is active during the last few steps
Unroll them and remove the unneeded __syncthreads()

for (unsigned int s = bd/2; s > 32; s >>= 1)
{
    if (t < s) { data[t] += data[t + s]; }
    __syncthreads();
}
if (t < 32) data[t] += data[t + 32];
if (t < 16) data[t] += data[t + 16];
if (t <  8) data[t] += data[t +  8];
if (t <  4) data[t] += data[t +  4];
if (t <  2) data[t] += data[t +  2];
if (t <  1) data[t] += data[t +  1];

Unrolling the Loop Completely
When the block size is known at compile time, we can completely unroll the loop
It often is, since the maximum thread block size of 512 constrains us
Use templates...

#define STEP(d) if (t < (d)) data[t] += data[t+(d)];
#define SYNC    __syncthreads();

template <unsigned int bsize>
__global__ void d_reduce(int *g_idata, int *g_odata)
{
    ...
    if (bsize == 512) STEP(512) SYNC
    if (bsize >= 256) STEP(256) SYNC
    if (bsize >= 128) STEP(128) SYNC
    if (bsize >=  64) STEP(64)  SYNC
    if (bsize >=  32) { STEP(32) STEP(16) STEP(8) STEP(4) STEP(2) STEP(1) }
}

GPU Computing Motivation
Computing challenge: task computing vs. data computing

Extreme Growth in Raw Data
[Charts: Walmart transaction tracking in millions (source: Hedburg, CPI, Walmart); YouTube bandwidth growth (source: Alexa, YouTube 2006); BP oil and gas active data in terabytes (source: Jim Farnsworth, BP, May 2005); NOAA/NASA weather data in petabytes, projected 2002-2017 (source: John Bates, NOAA National Climate Center)]

Computational Horsepower
GPU is a massively parallel computation engine
High memory bandwidth (5-10x CPU)
High floating-point performance (5-10x CPU)

Benchmarking: CPU vs. GPU Computing
G80 vs.
Core2 Duo 2.66 GHz Measured against commercial CPU benchmarks when possible © NVIDIA Corporation 2007 “Free” Massively Parallel Processors It’s not science fiction, it’s just funded by them Asst Master Chief Harvard Success Stories Success Stories: Data to Design Acceleware EM Field simulation technology for the GPU 3D Finite-Difference and Finite-Element (FDTD) Modeling of: Cell phone irradiation MRI Design / Modeling Printed Circuit Boards Radar Cross Section (Military) 20X 700 600 500 Performance (Mcells/s) 400 200 100 Pacemaker with Transmit Antenna 10X 300 5X 1X 0 CPU © NVIDIA Corporation 2007 3.2 GHz 1 GPU 2 GPUs 4 GPUs EvolvedMachines 130X Speed up Simulate brain circuitry Sensory computing: vision, olfactory EvolvedMachines © NVIDIA Corporation 2007 Matlab: Language of Science 10X with MATLAB CPU+GPU Pseudo-spectral simulation of 2D Isotropic turbulence http://developer.nvidia.com/object/matlab_cuda.html http://www.amath.washington.edu/courses/571-winter-2006/matlab/FS_2Dturb.m © NVIDIA Corporation 2007 MATLAB Example: Advection of an elliptic vortex 256x256 mesh, 512 RK4 steps, Linux, MATLAB file http://www.amath.washington.edu/courses/571-winter-2006/matlab/FS_vortex.m Matlab 168 seconds Matlab with CUDA (single precision FFTs) 20 seconds © NVIDIA Corporation 2007 MATLAB Example: Pseudo-spectral simulation of 2D Isotropic turbulence 512x512 mesh, 400 RK4 steps, Windows XP, MATLAB file http://www.amath.washington.edu/courses/571-winter-2006/matlab/FS_2Dturb.m MATLAB 992 seconds MATLAB with CUDA (single precision FFTs) 93 seconds © NVIDIA Corporation 2007 NAMD/VMD Molecular Dynamics 240X speedup Computational biology © NVIDIA Corporation 2007 http://www.ks.uiuc.edu/Research/vmd/projects/ece498/lecture/ Molecular Dynamics Example Case study: molecular dynamics research at U. Illinois Urbana-Champaign (Scientist-sponsored) course project for CS 498AL: Programming Massively Parallel Multiprocessors (Kirk/Hwu) Next slides stolen from a nice description of problem, algorithms, and iterative optimization process available at: http://www.ks.uiuc.edu/Research/vmd/projects/ece498/lecture/ © NVIDIA Corporation 2007 © NVIDIA Corporation 2007 Molecular Modeling: Ion Placement Biomolecular simulations attempt to replicate in vivo conditions in silico. Model structures are initially constructed in vacuum Solvent (water) and ions are added as necessary for the required biological conditions Computational requirements scale with the size of the simulated structure © NVIDIA Corporation 2007 Evolution of Ion Placement Code First implementation was sequential Virus structure with 10^6 atoms would require 10 CPU days Tuned for Intel C/C++ vectorization+SSE, ~20x speedup Parallelized /w pthreads: high data parallelism = linear speedup Parallelized GPU accelerated implementation: 3 GeForce 8800GTX cards outrun ~300 Itanium2 CPUs! Virus structure now runs in 25 seconds on 3 GPUs! Further speedups should still be possible… © NVIDIA Corporation 2007 Multi-GPU CUDA Coulombic Potential Map Performance Host: Intel Core 2 Quad, 8GB RAM, ~$3,000 3 GPUs: NVIDIA GeForce 8800GTX, ~$550 each 32-bit RHEL4 Linux (want 64-bit CUDA!!) 
235 GFLOPS per GPU for current version of coulombic potential map kernel 705 GFLOPS total for multithreaded multi-GPU version © NVIDIA Corporation 2007 Three GeForce 8800GTX GPUs in a single machine, cost ~$4,650 Professor Partnership NVIDIA Professor Partnership Support faculty research & teaching efforts Small equipment gifts (1-2 GPUs) Significant discounts on GPU purchases Especially Quadro, Tesla equipment Useful for cost matching Research contracts Small cash grants (typically ~$25K gifts) Medium-scale equipment donations (10-30 GPUs) Easy Competitive Informal proposals, reviewed quarterly Focus areas: GPU computing, especially with an educational mission or component http://www.nvidia.com/page/professor_partnership.html © NVIDIA Corporation 2007