Programming Massively Parallel Processors
Lecture Slides for Chapter 1: Introduction
© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2010
ECE 408, University of Illinois, Urbana-Champaign

Two Main Trajectories
• Since 2003, the semiconductor industry has followed two main trajectories:
  – Multicore: seeks to maintain the execution speed of sequential programs. Reduces latency.
  – Many-core: seeks to improve the execution throughput of parallel applications. Each heavily multithreaded core is much smaller, and multiple cores share control logic and an instruction cache.

CPUs and GPUs have fundamentally different design philosophies
[Figure: a CPU devotes most of its chip area to control logic and cache with a few large ALUs; a GPU devotes most of its area to many small ALUs. Both are backed by DRAM.]

Multicore CPU
• Optimized for sequential programs: sophisticated control logic lets instructions from a single thread execute faster, and large on-chip caches turn long-latency memory accesses into cache accesses, reducing the execution latency of each thread. However, the large cache memory (multiple megabytes), low-latency arithmetic units, and sophisticated operand delivery logic consume chip area and power.
  – Latency-oriented design

Multicore CPU
• Many applications are limited by the speed at which data can be moved from memory to the processor.
  – The CPU has to satisfy requirements from legacy operating systems and I/O devices, which makes it harder to increase memory bandwidth. CPU memory bandwidth is usually about 1/6 that of a GPU.

Many-core GPU
• Shaped by the fast-growing video game industry, which expects a massive number of floating-point calculations per video frame.
• The motivation is to maximize the chip area and power budget dedicated to floating-point calculations. The solution is to optimize for the execution throughput of a massive number of threads. The design saves chip area and power by allowing pipelined memory channels and arithmetic operations to have long latency. The reduced area and power spent on memory and arithmetic allow designers to put more cores on a chip, increasing execution throughput.

Many-core GPU
• A large number of threads lets the hardware find work to do while some threads are waiting for long-latency memory accesses or arithmetic operations. Small caches are provided to help control bandwidth requirements, so multiple threads that access the same memory location do not all need to go to DRAM.
  – Throughput-oriented design: strives to maximize the total execution throughput of a large number of threads while allowing individual threads to take a potentially much longer time to execute.

CPU + GPU
• GPUs will not perform well on tasks for which CPUs are designed to perform well. For programs with one or very few threads, CPUs with lower operation latencies can achieve much higher performance than GPUs.
• When a program has a large number of threads, GPUs with higher execution throughput can achieve much higher performance than CPUs. Many applications therefore use both, executing the sequential parts on the CPU and the numerically intensive parts on the GPU.
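The division of labor described above can be made concrete with a small sketch (not part of the original slides): the host CPU handles the sequential setup and data movement, while a CUDA kernel performs the numerically intensive loop across thousands of threads. The kernel name saxpy and the problem size are arbitrary choices for illustration.

#include <cstdio>
#include <cuda_runtime.h>

// GPU part: each thread computes one element of y = a*x + y.
__global__ void saxpy(int n, float a, const float *x, float *y) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
    if (i < n)                                      // guard against extra threads
        y[i] = a * x[i] + y[i];
}

int main() {
    const int n = 1 << 20;                 // 1M elements (arbitrary size)
    size_t bytes = n * sizeof(float);

    // Sequential part on the CPU: allocate and initialize the inputs.
    float *h_x = (float*)malloc(bytes);
    float *h_y = (float*)malloc(bytes);
    for (int i = 0; i < n; ++i) { h_x[i] = 1.0f; h_y[i] = 2.0f; }

    // Explicit data transfer into GPU global memory.
    float *d_x, *d_y;
    cudaMalloc(&d_x, bytes);
    cudaMalloc(&d_y, bytes);
    cudaMemcpy(d_x, h_x, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_y, h_y, bytes, cudaMemcpyHostToDevice);

    // Numerically intensive part on the GPU: launch thousands of threads.
    int threadsPerBlock = 256;
    int blocks = (n + threadsPerBlock - 1) / threadsPerBlock;
    saxpy<<<blocks, threadsPerBlock>>>(n, 2.0f, d_x, d_y);

    // Copy the result back and continue sequential work on the CPU.
    cudaMemcpy(h_y, d_y, bytes, cudaMemcpyDeviceToHost);
    printf("y[0] = %f\n", h_y[0]);          // expect 4.0

    cudaFree(d_x); cudaFree(d_y); free(h_x); free(h_y);
    return 0;
}

With n = 1M elements and 256 threads per block, the launch creates 4096 blocks, i.e. roughly a million threads, which is the scale at which the throughput-oriented GPU design pays off.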
GPU adoption
• The processors of choice must have a very large presence in the marketplace.
  – 400 million CUDA-enabled GPUs in use to date.
• Practical form factors and easy accessibility
  – Until 2006, parallel programs were run in data centers or on clusters. Actual clinical applications on MRI machines are based on a PC and special hardware accelerators: GE and Siemens cannot sell racks into clinical settings. NIH once refused to fund parallel programming projects; today NIH funds research using GPUs.

Why Massively Parallel Processor
• A quiet revolution and potential build-up
  – Calculation: 367 GFLOPS vs. 32 GFLOPS
  – Memory bandwidth: 86.4 GB/s vs. 8.4 GB/s
  – Until recently, programmed only through graphics APIs
• GPU in every PC and workstation: massive volume and potential impact

Architecture of a CUDA-capable GPU
[Figure: the host connects through an input assembler and thread execution manager to an array of streaming multiprocessors, each with a parallel data cache and texture/load-store units, all sharing off-chip global memory.]
• Two streaming multiprocessors form a building block; each has a number of streaming processors that share control logic and an instruction cache.
• Each GPU comes with multiple gigabytes of DRAM (global memory). It offers high off-chip bandwidth, though with longer latency than typical system memory; the high bandwidth makes up for the longer latency in massively parallel applications.
• G80: 86.4 GB/s of memory bandwidth, plus 8 GB/s of communication bandwidth with the CPU (4 GB/s in each direction).
• A good application runs 5,000 to 12,000 threads; CPUs support 2 to 8 threads. (A device-query sketch appears after the "Future Apps" slide below.)

GT200 Characteristics
• 1 TFLOPS peak performance (25-50 times that of current high-end microprocessors)
• 265 GFLOPS sustained for applications such as VMD
• Massively parallel, 128 cores, 90 W
• Massively threaded, sustains 1000s of threads per application
• 30-100x speedup over high-end microprocessors on scientific and media applications: medical imaging, molecular dynamics
  "I think they're right on the money, but the huge performance differential (currently 3 GPUs ~= 300 SGI Altix Itanium2s) will invite close scrutiny so I have to be careful what I say publicly until I triple check those numbers." — John Stone, VMD group, Physics, UIUC

Future Apps Reflect a Concurrent World
• Exciting applications in the future mass-computing market have traditionally been considered "supercomputing applications"
  – Molecular dynamics simulation, video and audio coding and manipulation, 3D imaging and visualization, consumer game physics, and virtual reality products
  – These "super-apps" represent and model a physical, concurrent world
• Various granularities of parallelism exist, but…
  – the programming model must not hinder parallel implementation
  – data delivery needs careful management
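The streaming-multiprocessor and global-memory organization described in the "Architecture of a CUDA-capable GPU" slide can be inspected at run time. The following sketch is not from the slides; it uses the CUDA runtime call cudaGetDeviceProperties to report the number of streaming multiprocessors and the amount of DRAM (global memory) on device 0.

#include <cstdio>
#include <cuda_runtime.h>

int main() {
    cudaDeviceProp prop;
    // Query device 0; a real application would also check the returned cudaError_t.
    cudaGetDeviceProperties(&prop, 0);

    printf("Device name:               %s\n", prop.name);
    printf("Streaming multiprocessors: %d\n", prop.multiProcessorCount);
    printf("Global memory (DRAM):      %.1f GB\n",
           prop.totalGlobalMem / (1024.0 * 1024.0 * 1024.0));
    printf("Max threads per block:     %d\n", prop.maxThreadsPerBlock);
    return 0;
}

The SM count reported here is what the thread execution manager distributes blocks across, which is why launching thousands of threads, far more than the handful a CPU supports, keeps the device busy.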
Stretching Traditional Architectures
• Traditional parallel architectures cover some super-applications
  – DSP, GPU, network applications, scientific computing
• The game is to grow mainstream architectures "out" or domain-specific architectures "in"
  – CUDA is the latter
[Figure: traditional applications fall within current architecture coverage; new applications lie in domain-specific architecture coverage, with obstacles between the two.]

Software Evolution
• MPI: scales up to 100,000 nodes.
• CUDA: shared memory for parallel execution. Programmers manage the data transfers between CPU and GPU and the detailed parallel code constructs.
• OpenMP: shared memory. Not able to scale beyond a couple of hundred cores due to thread management overhead and cache coherence. Compilers do most of the automation in managing parallel execution.
• OpenCL (2009): Apple, Intel, AMD/ATI, and NVIDIA proposed a standardized programming model. It defines language extensions and a run-time API. Applications developed in OpenCL can run on any processor that supports the OpenCL language extensions and API without code modification.
• OpenACC (2011): compiler directives that mark loops and regions of code to offload from the CPU to the GPU. More like OpenMP.
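To make the contrast concrete, here is a sketch of the directive style that OpenACC (like OpenMP) takes for the same kind of loop used earlier: the programmer annotates the loop, and the compiler generates the offload and data movement that a CUDA programmer writes by hand with cudaMalloc/cudaMemcpy. This is an illustrative sketch in C, assuming an OpenACC-capable compiler; the clause names follow the OpenACC standard, and without such a compiler the pragma is simply ignored and the loop runs on the CPU.

#include <stdio.h>

// y = a*x + y, offloaded with a compiler directive instead of an explicit kernel.
void saxpy_acc(int n, float a, const float *restrict x, float *restrict y) {
    // copyin: x is sent to the GPU; copy: y is sent to the GPU and copied back.
    #pragma acc parallel loop copyin(x[0:n]) copy(y[0:n])
    for (int i = 0; i < n; ++i)
        y[i] = a * x[i] + y[i];
}

int main(void) {
    enum { N = 1 << 20 };
    static float x[N], y[N];
    for (int i = 0; i < N; ++i) { x[i] = 1.0f; y[i] = 2.0f; }

    saxpy_acc(N, 2.0f, x, y);
    printf("y[0] = %f\n", y[0]);   /* expect 4.0 */
    return 0;
}

The trade-off matches the slide's framing: the directive version is shorter and more portable, while the explicit CUDA version gives the programmer detailed control over data transfers and the parallel code constructs.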