Programming of Massively Parallel Processors — Introduction
Based on slides by David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2009, and on those used by Sanjay Patel in ECE 498AL, Spring 2010, University of Illinois, Urbana-Champaign.
© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2009

Goals
• Learn how to program massively parallel processors and achieve
  – high performance
  – functionality and maintainability
  – scalability across future generations
• Acquire the technical knowledge required to achieve the above goals
  – principles and patterns of parallel programming
  – processor architecture features and constraints
  – programming APIs, tools, and techniques

Machine Problems
• Your own PCs running G80 emulators
  – Better debugging environment
  – Sufficient for the first couple of weeks
• NVIDIA boards
  – NVIDIA GPU cluster accounts at the Cyprus Institute Computation-based Science and Technology Research Center (CaSToRC); more on the hardware later
  – Much faster, but with less debugging support

Texts
1. Draft textbook by Prof. Hwu and Prof. Kirk, available at the UIUC course website: http://courses.ece.illinois.edu/ece498/al/Syllabus.html
2. NVIDIA, NVIDIA CUDA Programming Guide, NVIDIA, 2007 (reference book)

Tentative Planned Lectures
• Introduction / GPU computing and CUDA intro
• CUDA threading model
• CUDA memory model
• CUDA memory model, tiling
• CUDA performance

Why Massively Parallel Processing?
• A quiet revolution and potential build-up
  – Calculation: ~1 TFLOPS (many-core GPU) vs. ~100 GFLOPS (multi-core CPU)
  – Memory bandwidth: roughly 10× in the GPU's favor
  – A GPU in every PC: massive volume and potential impact
[Figure 1.1: the enlarging performance gap between many-core GPUs and multi-core CPUs. Courtesy: John Owens]

CPU & GPU: Different Design Philosophies
• CPU design is optimized for sequential code performance
  – Sophisticated control logic allows instructions from a single thread to execute in parallel or out of order
  – Large caches reduce access latencies to main memory
• GPU design is optimized for the execution of a massive number of threads
  – The hardware uses its large thread population to find work to do while some threads wait on long-latency memory accesses (no large caches)
  – Much more chip area is dedicated to floating-point calculation

G80 GPU Chip (2006)
• 16 streaming multiprocessors (SMs)
• Each SM has 8 streaming processors (SPs), for a total of 128 cores (16 × 8), at 90 W
  – supports up to 768 threads per SM
  – about 12,000 threads in total (16 × 768 = 12,288)
• Each SP has a multiply-add (MAD) unit and a MUL unit, plus access to special function units
• A pair of SMs forms a building block

More G80 Characteristics
• 367 GFLOPS peak performance (25-50 times that of current high-end microprocessors)
• 265 GFLOPS sustained for some applications
• 30-100× speedup over high-end microprocessors on scientific and media applications such as medical imaging and molecular dynamics
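To make these thread counts concrete, the sketch below shows how a CUDA program expresses that degree of parallelism: the host launches a grid of thread blocks, and the hardware distributes the blocks across the SMs. The kernel, its name, and the launch parameters are illustrative assumptions rather than code from the course materials; a block size of 256 threads gives 48 blocks, or 12,288 threads, matching the G80 figure above.

    #include <cuda_runtime.h>

    // Each thread scales one array element. With 48 blocks of 256
    // threads, this launch creates 12,288 threads -- roughly the
    // number a G80 keeps in flight at once. (Illustrative sketch.)
    __global__ void scale(float *data, float factor, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
        if (i < n)                                      // guard the array bound
            data[i] *= factor;
    }

    int main(void)
    {
        const int n = 12288;
        float *d_data;
        cudaMalloc(&d_data, n * sizeof(float));         // device allocation

        int threadsPerBlock = 256;
        int blocks = (n + threadsPerBlock - 1) / threadsPerBlock;  // 48 blocks
        scale<<<blocks, threadsPerBlock>>>(d_data, 2.0f, n);
        cudaDeviceSynchronize();                        // wait for the kernel

        cudaFree(d_data);
        return 0;
    }

The same source runs unchanged on GPUs with more SMs; the hardware simply schedules more blocks concurrently, which is the scalability goal listed at the start of this lecture.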
G80 Chip
[Block diagram: the host feeds an input assembler and a thread execution manager, which dispatch threads to the SM building blocks, each with its parallel data cache and texture unit; load/store units connect the processor array to global memory.]

Fermi (2010)
• ~1.5 TFLOPS single precision / ~800 GFLOPS double precision
• 230 GB/s DRAM bandwidth

Future Applications Reflect a Concurrent World
• Exciting applications in the future mass-computing market have traditionally been considered "supercomputing applications"
  – Molecular dynamics simulation, video and audio coding and manipulation, 3D imaging and visualization, consumer game physics, and virtual reality products
  – These "super-apps" represent and model a physical, concurrent world
• Various granularities of parallelism exist, but…
  – the programming model must not hinder parallel implementation
  – data delivery needs careful management

Speedup of Applications
[Bar chart: GPU speedup relative to CPU, for both kernels and full applications, on H.264, LBM, RC5-72, FEM, RPES, PNS, SAXPY, TPACF, FDTD, MRI-Q, and MRI-FHD; the fastest kernels reach speedups of 457×, 316×, 431×, 263×, 210×, and 79×.]
• GeForce 8800 GTX vs. a 2.2 GHz Opteron 248
• 10× speedup in a kernel is typical, as long as the kernel can occupy enough parallel threads
• 25× to 400× speedup if the function's data requirements and control flow suit the GPU and the application is optimized
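SAXPY, one of the kernels charted above, is small enough to show in full. This is a generic sketch assuming the standard definition y = a·x + y; it is not the benchmark implementation behind the measured speedups.

    // SAXPY: y = a*x + y, one thread per element.
    // A generic sketch, not the measured benchmark code.
    __global__ void saxpy(int n, float a, const float *x, float *y)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            y[i] = a * x[i] + y[i];   // one multiply-add per thread
    }

    // Launched with enough 256-thread blocks to cover all n elements:
    //   saxpy<<<(n + 255) / 256, 256>>>(n, 2.0f, d_x, d_y);

Because every thread works on independent data, the kernel can occupy as many threads as the GPU can schedule, which is exactly the property the "typical 10× kernel speedup" figure above depends on.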