INTRODUCTION TO MASSIVELY PARALLEL COMPUTING (Unit 4)

TWO MAIN TRAJECTORIES
• Since 2003, the semiconductor industry has followed two main trajectories:
• Multicore: seeks to maintain the execution speed of sequential programs; reduces latency.
• Many-core: seeks to improve the execution throughput of parallel applications. Each heavily multithreaded core is much smaller, and several cores share control logic and an instruction cache.

CPUS AND GPUS HAVE FUNDAMENTALLY DIFFERENT DESIGN PHILOSOPHIES
(Diagram: a CPU devotes most of its die to control logic and cache around a few large ALUs, while a GPU devotes most of its die to many small ALUs; both attach to DRAM.)

MULTICORE CPU
• Optimized for sequential programs: sophisticated control logic allows instructions from a single thread to execute faster, and a large on-chip cache turns long-latency memory accesses into cache accesses, reducing the execution latency of each thread. However, the large cache (multiple megabytes), the low-latency arithmetic units, and the sophisticated operand delivery logic consume chip area and power.
• Latency-oriented design.
• Many applications are limited by the speed at which data can be moved from memory to the processor.
• The CPU must also satisfy requirements from legacy operating systems and I/O devices, which makes it harder to increase memory bandwidth; CPU memory bandwidth is usually about 1/6 that of a GPU.

MANY-CORE GPU
• Shaped by the fast-growing video game industry, which expects a massive number of floating-point calculations per video frame.
• The motive is to maximize the chip area and power budget dedicated to floating-point calculations. The solution is to optimize for the execution throughput of a massive number of threads. The design saves chip area and power by allowing pipelined memory channels and arithmetic operations to have long latency, and the area and power saved on memory and arithmetic allow designers to put more cores on a chip, increasing execution throughput.
• A large number of threads lets the hardware find work to do while some threads are waiting for long-latency memory accesses or arithmetic operations. Small caches are provided to help control bandwidth requirements, so multiple threads that access the same memory do not all need to go to DRAM.
• Throughput-oriented design: strives to maximize the total execution throughput of a large number of threads while allowing individual threads to take a potentially much longer time to execute.

CPU + GPU
• A GPU will not perform well on tasks that CPUs are designed to perform well. For programs with one or very few threads, CPUs with lower operation latencies achieve much higher performance than GPUs.
• When a program has a large number of threads, GPUs with higher execution throughput achieve much higher performance than CPUs. Many applications therefore use both, executing the sequential parts on the CPU and the numerically intensive parts on the GPU (see the SAXPY sketch below).

GPU ADOPTION
• The processors of choice must have a very large presence in the marketplace.
• 400 million CUDA-enabled GPUs are in use to date.
• Practical form factors and easy accessibility.
• Until 2006, parallel programs were run in data centers or on clusters. Actual clinical applications on MRI machines are based on a PC plus special hardware accelerators; GE and Siemens cannot sell racks into clinical settings. NIH once refused to fund parallel programming projects; today NIH funds research using GPUs.

WHY MASSIVELY PARALLEL PROCESSORS
• A quiet revolution and potential build-up.
• Calculation: 367 GFLOPS vs. 32 GFLOPS.
• Memory bandwidth: 86.4 GB/s vs. 8.4 GB/s.
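As a minimal sketch of this division of labor (CUDA C; the function names and the launch configuration are hypothetical, only the kernel syntax and built-in index variables are standard), the same SAXPY computation can be written as a sequential CPU loop and as a GPU kernel executed by thousands of threads:

    // Sequential version: one CPU thread walks the whole array (latency-oriented).
    void saxpy_cpu(int n, float a, const float *x, float *y) {
        for (int i = 0; i < n; ++i)
            y[i] = a * x[i] + y[i];
    }

    // Parallel version: each GPU thread handles one element (throughput-oriented).
    __global__ void saxpy_gpu(int n, float a, const float *x, float *y) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)                      // guard the extra threads in the last block
            y[i] = a * x[i] + y[i];
    }

    // Hypothetical launch: enough 256-thread blocks to cover all n elements.
    // saxpy_gpu<<<(n + 255) / 256, 256>>>(n, a, d_x, d_y);   // d_x, d_y are device pointers

The kernel contains no loop over the data set; the hardware supplies one thread per element, which is exactly the throughput-oriented style described above.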
• Until last year, GPUs were programmed only through graphics APIs.
• A GPU is in every PC and workstation: massive volume and potential impact.

ARCHITECTURE OF A CUDA-CAPABLE GPU
• The host feeds an input assembler and a thread execution manager, which distribute work across an array of streaming multiprocessors.
• Two streaming multiprocessors form a building block; each has a number of streaming processors that share control logic and an instruction cache.
• Each GPU comes with multiple gigabytes of DRAM (global memory). It offers high off-chip bandwidth, though with longer latency than typical system memory; the high bandwidth makes up for the longer latency in massively parallel applications.
• G80: 86.4 GB/s of memory bandwidth, plus communication bandwidth with the CPU of 4 GB/s up and 4 GB/s down (8 GB/s total).
(Diagram: streaming-multiprocessor pairs with their parallel data caches and texture units, load/store units, and global memory.)
• A good application runs 5,000 to 12,000 threads; a CPU supports 2 to 8 threads.

GT200 CHARACTERISTICS
• 1 TFLOPS peak performance (25 to 50 times that of current high-end microprocessors).
• 265 GFLOPS sustained for applications such as VMD.
• Massively parallel, 128 cores, 90 W.
• Massively threaded: sustains thousands of threads per application.
• 30 to 100 times speedup over high-end microprocessors on scientific and media applications such as medical imaging and molecular dynamics.

FUTURE APPS REFLECT A CONCURRENT WORLD
• Exciting applications in the future mass-computing market have traditionally been considered "supercomputing applications": molecular dynamics simulation, video and audio coding and manipulation, 3D imaging and visualization, consumer game physics, and virtual reality products.
• These "super-apps" represent and model the physical, concurrent world.
• Various granularities of parallelism exist, but the programming model must not hinder parallel implementation, and data delivery needs careful management.

STRETCHING TRADITIONAL ARCHITECTURES
• Traditional parallel architectures cover some super-applications: DSP, GPU, network apps, scientific computing.
• The game is to grow mainstream architectures "out" or domain-specific architectures "in"; CUDA is the latter.
(Diagram: traditional applications under current architecture coverage, new applications under domain-specific architecture coverage, with obstacles in between.)

TEXTURE MAPPING EXAMPLE
• Texture mapping example: painting a world-map texture image onto a globe object.

SOFTWARE EVOLUTION
• MPI: scales up to 100,000 nodes.
• CUDA: shared memory for parallel execution; programmers manage the data transfer between CPU and GPU and the detailed parallel code constructs (see the sketch below).
• OpenMP: shared memory; not able to scale beyond a couple of hundred cores because of thread-management overhead and cache coherence. Compilers do most of the automation in managing parallel execution.
• OpenCL (2009): Apple, Intel, AMD/ATI, and NVIDIA proposed a standard programming model. It defines language extensions and a run-time API; an application developed in OpenCL can run on any processor that supports the OpenCL language extensions and API without code modification.
• OpenACC (2011): compiler directives to specify loops and regions of code to offload from the CPU to the GPU. More like OpenMP.

SPEEDUP OF APPLICATIONS
(Chart: GPU speedup relative to CPU, kernel versus whole application, for H.264, LBM, RC5-72, FEM, RPES, PNS, SAXPY, TPACF, FDTD, MRI-Q, and MRI-FHD; the largest kernel speedups reach several hundred times, up to 457x.)
• GeForce 8800 GTX vs. 2.2 GHz Opteron 248.
• A 10x speedup in a kernel is typical, as long as the kernel can occupy enough parallel threads.
• 25x to 400x speedup if the function's data requirements and control flow suit the GPU and the application is optimized.
• "Need for Speed" seminar series organized by Patel and Hwu this semester.
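A minimal sketch of the explicit data management CUDA asks of the programmer (standard CUDA runtime calls; the wrapper function, variable names, and the saxpy_gpu kernel from the earlier sketch are hypothetical): the host allocates device memory, copies the inputs over, launches the kernel, and copies the result back.

    #include <cuda_runtime.h>

    // Host code: CUDA leaves CPU <-> GPU data movement to the programmer.
    void run_saxpy(int n, float a, const float *h_x, float *h_y) {
        float *d_x, *d_y;
        size_t bytes = n * sizeof(float);

        cudaMalloc(&d_x, bytes);                                // allocate global (device) memory
        cudaMalloc(&d_y, bytes);
        cudaMemcpy(d_x, h_x, bytes, cudaMemcpyHostToDevice);    // CPU -> GPU over PCI-Express
        cudaMemcpy(d_y, h_y, bytes, cudaMemcpyHostToDevice);

        saxpy_gpu<<<(n + 255) / 256, 256>>>(n, a, d_x, d_y);    // kernel from the earlier sketch

        cudaMemcpy(h_y, d_y, bytes, cudaMemcpyDeviceToHost);    // copy the result back: GPU -> CPU
        cudaFree(d_x);
        cudaFree(d_y);
    }

This is the burden that OpenMP-style compilers automate and that OpenACC directives try to hide; CUDA trades that convenience for explicit control.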
HISTORY OF GPU COMPUTING

GRAPHICS PIPELINE
• Scene transformations → lighting & shading → viewing transformations → rasterization.
• GPUs evolved as hardware and software algorithms evolved.

EARLY GRAPHICS
• Originally, there was no specialized graphics hardware.
• All processing was done in software on the CPU, and the results were transmitted to a frame buffer: first external frame buffers, later internal ones. (CPU → frame buffer → display.)

MORE DETAILED PIPELINE
• Geometry data → transform & lighting → culling, perspective divide, viewport mapping → rasterization → simple texturing → depth test → frame-buffer blending.
• Simple functionality was transferred to specialized hardware.

ADD MORE FUNCTIONALITY TO THE GPU
• The same pipeline stages, with progressively more of this simple functionality transferred to specialized hardware.

FIXED-FUNCTION GPU PIPELINE
• The pipeline is implemented in hardware; each stage does a fixed task, and the tasks are parameterized.
• Inflexible: fixed, parameterized functions.
• Vector-matrix operations (some parallelism).
(Diagram: scene transformations on the CPU; lighting & shading, viewing transformations, and rasterization on the GPU, feeding the frame buffer and display.)

TECHNOLOGY ADVANCES
• Hardware gets cheaper, smaller, and more powerful.
• Parallel architectures develop.
• Graphics processing gets more sophisticated (environment mapping, displacement mapping, sub-surface scattering).
• GPUs need more flexibility.

MAKE THE PIPELINE PROGRAMMABLE
• Make transform & lighting programmable: the vertex shader.
• Make the per-fragment stage programmable: the fragment shader (complex texturing; depth, alpha, and stencil tests; frame-buffer blending).

VERTEX SHADER
• Graphics systems convert everything to triangles.
• Vertices, normals, colors, and texture coordinates are passed to the GPU.
• The GPU performs vertex-based computations, independent of other vertices (sketched in CUDA terms below).
• Later, vertices are assembled into triangles.

FRAGMENT SHADER
• A fragment is a triangle clipped to a pixel; values are interpolated across it.
• Multiple textures; alpha, stencil, and depth tests; independent of other fragments.
• The result is blended with the contents of the frame buffer.
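The property the vertex shader slides emphasize is that every vertex is processed independently by the same program. A hedged illustration of that idea outside the graphics APIs (CUDA C, hypothetical kernel and type names): one thread per vertex, all applying the same 4x4 transform with no dependence on any other vertex.

    struct Vertex { float x, y, z, w; };

    // One thread per vertex; m points to a 4x4 row-major matrix in device memory.
    __global__ void transform_vertices(const float *m, const Vertex *in, Vertex *out, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= n) return;
        Vertex v = in[i];
        out[i].x = m[0]*v.x  + m[1]*v.y  + m[2]*v.z  + m[3]*v.w;
        out[i].y = m[4]*v.x  + m[5]*v.y  + m[6]*v.z  + m[7]*v.w;
        out[i].z = m[8]*v.x  + m[9]*v.y  + m[10]*v.z + m[11]*v.w;
        out[i].w = m[12]*v.x + m[13]*v.y + m[14]*v.z + m[15]*v.w;
    }

Real vertex shaders are written in shading languages such as HLSL or GLSL; the point of the sketch is only the independence of the per-vertex work, which is what makes it easy to parallelize.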
INTRODUCE PARALLELISM: ADD MULTIPLE UNITS
• Geometry data flows through multiple vertex shader units in parallel; then culling, perspective divide, and viewport mapping; rasterization; multiple fragment shader units in parallel; the alpha, depth, and stencil tests; and frame-buffer blending.

A FIXED-FUNCTION GPU PIPELINE
(Diagram: CPU → host interface → vertex control and vertex cache → VS/T&L → triangle setup → raster → shader → ROP → FBI, with a texture cache and frame buffer memory on the GPU.)

ANTI-ALIASING EXAMPLE
(Figure: the same triangle geometry rendered aliased and anti-aliased.)

PROGRAMMABLE VERTEX AND PIXEL PROCESSORS
(Diagram: a 3D application or game issues 3D API commands (OpenGL or Direct3D) on the CPU; across the CPU-GPU boundary, the GPU front end consumes the command and data stream; primitive assembly turns the vertex index stream and transformed vertices into assembled polygons, lines, and points; rasterization and interpolation produce fragments; raster operations write pixel updates to the framebuffer. A programmable vertex processor transforms the pre-transformed vertices, and a programmable fragment processor transforms the pre-transformed fragments.)
• An example of a separate vertex processor and fragment processor in a programmable graphics pipeline.

POWER
• GPUs have moved away from the traditional fixed-function 3D graphics pipeline toward a flexible, general-purpose computational engine.
• The raw computational power of a GPU dwarfs that of the most powerful CPU, and the gap is steadily widening.

NEXT: UNIFY THE SHADERS
• One set of shaders, allocated to either vertices or fragments.

UNIFIED GRAPHICS PIPELINE
(Diagram: host → data assembler → setup/raster/ZCull, with vertex, geometry, and pixel thread issue feeding a thread-processor array of SPs and texture-fetch units backed by L1 and L2 caches and frame-buffer partitions.)
(Diagram: the restricted input and output capabilities of a shader programming model; a fragment program sees per-thread input and output registers, per-shader temporary registers, and per-context textures and constants, plus frame-buffer memory.)

GPGPU
• Make the GPU more general: adapt certain types of programs to its pipelined, parallel architecture.
• A single GeForce 8800 chip achieves a sustained 330 billion floating-point operations per second (330 GFLOPS) on simple benchmarks.
• Cost-effective: graphics drives demand up, supply up, and prices down for GPUs.
• Finding uses in non-graphics applications.

SHADERS IN DIRECT3D
• DirectX 9: vertex shader, pixel shader.
• DirectX 10: vertex shader, geometry shader, pixel shader.
• DirectX 11: vertex shader, hull shader, domain shader, geometry shader, pixel shader, compute shader.
• Observation: all of these shaders require the same basic functionality, texturing (or data loads) and math ops.

GPU ARCHITECTURES
• Processing is highly data-parallel, and GPUs are highly multithreaded.
• Thread switching is used to hide memory latency, with less reliance on multi-level caches.
• Graphics memory is wide and high-bandwidth.
• Trend toward general-purpose GPUs: heterogeneous CPU/GPU systems, with the CPU for sequential code and the GPU for parallel code.
• Programming languages/APIs: DirectX, OpenGL; C for Graphics (Cg) and High Level Shader Language (HLSL); Compute Unified Device Architecture (CUDA).

EVOLUTION OF GPU STREAM PROCESSING

THE SUPERCOMPUTING REVOLUTION
• To see why GPUs have pulled ahead, we need to understand how CPUs and GPUs differ.
WHAT ACCOUNTS FOR THIS DIFFERENCE?
• Latency intolerance versus latency tolerance.
• Task parallelism versus data parallelism.
• Multithreaded cores versus SIMT (Single Instruction Multiple Thread) cores.
• Tens of threads versus tens of thousands of threads.

LATENCY (1)
• GPUs are designed for tasks that can tolerate latency.
• Example: graphics in a game (simplified scenario). The CPU generates frame 0, then frame 1, then frame 2, while the GPU sits idle at first and then renders frame 0, frame 1, and so on; there is latency between frame generation and rendering (on the order of milliseconds).
• To be efficient, GPUs must have high throughput, i.e., process millions of pixels in a single frame.

LATENCY (2)
• CPUs are designed to minimize latency. Example: mouse or keyboard input.
• Caches are needed to minimize latency, and CPUs are designed to maximize running operations out of cache: instruction pre-fetch, out-of-order execution, flow control.
• CPUs need a large cache; GPUs do not, so GPUs can dedicate more of the transistor area to computation horsepower.

CPU VERSUS GPU TRANSISTOR ALLOCATION
• GPUs can have more ALUs for the same-sized chip and can therefore run many more threads of computation.
(Diagram: the CPU spends its transistors on control logic and cache around a few ALUs; the GPU spends them on many ALUs; both attach to DRAM.)
• Modern GPUs run tens of thousands of threads concurrently.

MANAGING THREADS ON A GPU
• How do we avoid synchronization issues between so many threads? How do we dispatch, schedule, cache, and context-switch tens of thousands of threads? How do we program tens of thousands of threads?
• Design GPUs to run specific types of threads: threads that are independent of each other (no synchronization issues); SIMD (Single Instruction Multiple Data) threads, to minimize thread management; reduced hardware overhead for scheduling, caching, etc.; and programming in blocks of threads (e.g., one pixel shader per draw call, or per group of pixels).
• Which problems can be solved with this type of computation?

DATA-PARALLEL PROBLEMS
• Plenty of problems fall into this category (luckily): graphics, image and video processing, physics, scientific computing, and more.
• This type of parallelism is called data parallelism, and GPUs are the perfect solution for it. In fact, the more data there is, the more efficient GPUs become at these algorithms.
• Bonus: you can relatively easily add more processing cores to a GPU and increase the throughput.

STREAM PROCESSING
• What we just described: given a (typically large) set of data (a "stream"), run the same series of operations (a "kernel" or "shader") on all of the data (SIMD); see the sketch after this list.
• GPUs use various optimizations to improve throughput: some on-chip memory and local caches reduce bandwidth to external memory; groups of threads are batched to minimize incoherent memory access (bad access patterns lead to higher latency and/or thread stalls); and unnecessary operations are eliminated by exiting or killing threads, for example Z-culling and early-Z to kill pixels that will never be displayed.
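A hedged sketch of this stream-processing pattern (CUDA C, hypothetical kernel name): a fixed grid of threads sweeps over an arbitrarily large stream, every thread applying the same operation, with consecutive threads touching consecutive elements so that memory accesses stay coherent.

    // The same "kernel" is applied to every element of a large stream of data.
    // A grid-stride loop lets a bounded number of threads cover any n, and
    // consecutive threads read consecutive elements (coalesced access).
    __global__ void scale_stream(const float *in, float *out, float s, int n) {
        int stride = blockDim.x * gridDim.x;
        for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n; i += stride)
            out[i] = s * in[i];
    }

    // Hypothetical launch with tens of thousands of threads so the hardware
    // always has runnable work while other threads wait on memory:
    // scale_stream<<<256, 256>>>(d_in, d_out, 2.0f, n);   // 65,536 threads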
TO SUMMARIZE
• GPUs use stream processing to achieve high throughput.
• GPUs are designed to solve problems that tolerate high latencies.
• High latency tolerance → lower cache requirements.
• Less transistor area for cache → more area for computing units.
• More computing units → tens of thousands of SIMD threads and high throughput.
• GPUs win.
• Additionally, threads are managed by hardware → you are not required to write code for each thread and manage the threads yourself.
• It is easier to increase parallelism by adding more processors.
• So, the fundamental unit of a modern GPU is the stream processor.

G80 AND GT200 STREAMING PROCESSOR ARCHITECTURE

BUILDING A PROGRAMMABLE GPU
• The future of high-throughput computing is programmable stream processing.
• So build the architecture around unified, scalar stream-processing cores.
• The GeForce 8800 GTX (G80) was the first GPU architecture built with this new paradigm.

G80 REPLACES THE PIPELINE MODEL
• 128 unified streaming processors.
(Diagram: host → input assembler → setup/raster/ZCull, with vertex, geometry, and pixel thread issue feeding the unified SP array with texture-fetch units, L1 and L2 caches, and frame-buffer partitions.)

GT200 ADDS MORE PROCESSING POWER
(Diagram: host CPU and system memory → GPU host interface → input assembler and viewport/clip/setup/raster/ZCull; vertex, geometry, pixel, and compute work distribution feed the SM array, connected through an interconnection network to ROP/L2/DRAM partitions.)

8800 GTX (HIGH-END G80)
• 16 streaming multiprocessors, each containing 8 unified streaming processors: 128 in total.

GTX 280 (HIGH-END GT200)
• 30 streaming multiprocessors, each containing 8 unified streaming processors: 240 in total.

INSIDE A STREAMING MULTIPROCESSOR (SM)
• Scalar, register-based ISA.
• Multithreaded instruction unit: up to 1,024 concurrent threads, hardware thread scheduling, in-order issue.
• 8 SPs (thread processors): IEEE 754 32-bit floating point, 32-bit and 64-bit integer, 16K 32-bit registers.
• 2 SFUs (special function units): sin, cos, log, exp.
• Double-precision unit: IEEE 754 64-bit floating point, fused multiply-add.
• 16 KB shared memory.
(Diagram: I-cache, MT issue, C-cache, 8 SPs, 2 SFUs, a DP unit, and shared memory.)

MULTIPROCESSOR PROGRAMMING MODEL
• Workloads are partitioned into blocks of threads among multiprocessors: a block runs to completion, and a block doesn't run until resources are available.
• Allocation of hardware resources: shared memory is partitioned among blocks, and registers are partitioned among threads.
• Hardware thread scheduling: any thread not waiting for something can run, and context switching is free, every cycle.

MEMORY HIERARCHY OF G80 AND GT200
• An SM can directly access device memory (video memory): not cached, read and write; GT200: 140 GB/s peak.
• An SM can access device memory via the texture unit: cached, read-only, for textures and constants; GT200: 48 GTexels/s peak.
• On-chip shared memory is shared among the threads in an SM: important for communication among threads, and provides low-latency temporary storage; G80 and GT200: 16 KB per SM (see the code sketch below).

PERFORMANCE PER MILLIMETER
• For a GPU, performance == throughput.
• Caches are limited in the memory hierarchy.
• Strategy: hide latency with computation, not caches. Heavy multithreading: switch to another group of threads when the current group is waiting for a memory access.
• Implication: a large number of threads is needed to hide latency.
• Occupancy: typically at least 128 threads/SM; at most 1,024 threads/SM on GT200 (1,024 threads × 30 SMs = 30,720 threads in total).
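A hedged sketch of using that per-SM shared memory (CUDA C, hypothetical kernel name, assuming a 256-thread block): each block stages its slice of the input in shared memory, synchronizes, and then lets neighboring threads reuse each other's values instead of returning to DRAM.

    #define BLOCK 256   // must match the block size used at launch

    // Threads of a block cooperate through the SM's on-chip shared memory:
    // each element is read from global memory once, then reused by its neighbor.
    __global__ void neighbor_sum(const float *in, float *out, int n) {
        __shared__ float tile[BLOCK];
        int i = blockIdx.x * blockDim.x + threadIdx.x;

        tile[threadIdx.x] = (i < n) ? in[i] : 0.0f;   // one global read per thread
        __syncthreads();                              // wait until the whole block has loaded

        // The right-hand neighbor comes from shared memory, except at the block edge.
        float right = (threadIdx.x + 1 < blockDim.x) ? tile[threadIdx.x + 1]
                                                     : ((i + 1 < n) ? in[i + 1] : 0.0f);
        if (i < n)
            out[i] = tile[threadIdx.x] + right;
    }

Because shared memory is partitioned among the resident blocks of an SM, the amount a kernel declares directly limits how many blocks (and therefore how many latency-hiding threads) can run on that SM at once.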
SIMT THREAD EXECUTION
• Strategy: Single Instruction Multiple Thread (SIMT).
• Group 32 threads (vertices, pixels, or primitives) into warps; the threads in a warp execute the same instruction at a time, with shared instruction fetch/dispatch.
• Hardware automatically handles divergence (branches).
• Warps are the primitive unit of scheduling: pick 1 of 24 warps for each instruction slot.
• SIMT execution is an implementation choice: shared control logic leaves more space for ALUs, and it is largely invisible to the programmer.
(Diagram: I-cache, MT issue, C-cache, 8 SPs, 2 SFUs, a DP unit, and shared memory.)

SHADER BRANCHING PERFORMANCE
• G8x/G9x/GT200 branch granularity is 32 threads (1 warp): if the threads of a warp diverge, both sides of the branch execute on all 32.
• This is still more efficient than an architecture with a branch granularity of 48 threads.
(Chart: pixel-shader branching efficiency, number of coherent 4x4 tiles versus branch coherence, comparing G80's 32-pixel coherence with a 48-pixel-coherence architecture.)

G80 AND GT200 STREAMING PROCESSOR ARCHITECTURE: SUMMARY
• Executing in blocks maximally exploits data parallelism and minimizes incoherent memory access; adding more ALUs yields better performance.
• Data processing is performed in SIMT fashion: 32 threads are grouped into a warp, and the threads in a warp execute the same instruction at a time.
• Thread scheduling is handled automatically by hardware, and context switching is free (every cycle). Transparent scalability; easy to program.
• Memory latency is covered by a large number of in-flight threads.
• The caches are mainly used for read-only memory accesses (textures, constants).

GPU BEYOND GRAPHICS

ARCHITECTURE OF A GPU
• The same components as a typical CPU; however, there are more computing elements and more types of memory.
• Original GPUs had vertex and pixel shaders, built specifically for graphics; modern GPUs are slightly different.
• CUDA: Compute Unified Device Architecture.

COMPUTATIONAL ELEMENTS OF A GPU
• Streaming processor (SP): the core of the design, where all of the computation takes place.
• Streaming multiprocessor (SM): a group of streaming processors; in addition to the SPs, it also contains the special function units and load/store units, instruction schedulers, and complex control logic.

STREAMING MULTIPROCESSOR ARCHITECTURE
(Diagram.)

TYPES OF GPU MEMORY
• Global: DRAM; slowest performance.
• Texture: cached global memory, "bound" at runtime.
• Constant: cached global memory.
• Shared: local to a block of threads.

TERMINOLOGY
• Thread: the smallest grain of the hierarchy of device computation.
• Block: a group of threads.
• Grid: a group of blocks.
• Warp: a group of 32 threads that are executed simultaneously on the device.
• Kernel: the creator of a grid for GPU execution.

GRIDS, BLOCKS, AND THREADS
(See the indexing sketch at the end of this unit.)

CUDA MEMORY
• Shared memory: faster, per-block.
• Registers: fastest, per-thread.
• Global memory: slower, visible to the whole grid.
• Constant and texture memory: read-only, cached.

HETEROGENEOUS COMPUTING
• Host: the CPU and its memory.
• Device: the GPU and its memory.
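A hedged sketch tying the terminology together (CUDA C, hypothetical kernel and variable names): the host (CPU) launches a kernel on the device (GPU) as a grid of blocks of threads, and each thread derives which pixel it owns from its block and thread indices.

    // Device code: each thread computes its own (x, y) coordinate from the
    // grid / block / thread hierarchy and inverts one pixel.
    __global__ void invert_image(unsigned char *pixels, int width, int height) {
        int x = blockIdx.x * blockDim.x + threadIdx.x;
        int y = blockIdx.y * blockDim.y + threadIdx.y;
        if (x < width && y < height)
            pixels[y * width + x] = 255 - pixels[y * width + x];
    }

    // Host code: a grid of 16x16-thread blocks covering the whole image
    // (d_pixels is assumed to be a device pointer already filled by cudaMemcpy).
    // dim3 block(16, 16);
    // dim3 grid((width + 15) / 16, (height + 15) / 16);
    // invert_image<<<grid, block>>>(d_pixels, width, height);

On the device, consecutive x values fall into the same warp, so the 32 threads of a warp execute this kernel in lockstep and touch consecutive pixels in memory.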