Ubiquitous Parallelism: Are You Equipped to Code for Multi- and Many-Core Platforms?
Carlo del Mundo, Department of Electrical and Computer Engineering

Agenda
• Introduction/Motivation
• Why Parallelism? Why Now?
• Survey of Parallel Hardware
• CPUs vs. GPUs
• Conclusion
• How Can I Start?

Talk Goal
• Encourage undergraduates to answer the call of the era of parallelism
  • Education
  • Software engineering

Why Parallelism? Why Now?
• You have already been exposed to parallelism
  • Bit-level parallelism
  • Instruction-level parallelism
  • Thread-level parallelism

Why Parallelism? Why Now?
• Single-threaded performance has plateaued
• Silicon trends
  • Power consumption
  • Heat dissipation

Why Parallelism? Why Now?
• Power chart: dynamic power P = CV²f
• Heat chart: heat dissipation vs. feature size

Why Parallelism? Why Now?
• Issue: power and heat
• Good: cheaper to have more cores, but each core is slower
• Bad: breaks the hardware/software contract

Why Parallelism? Why Now?
• Hardware/software contract
  • Maintain backwards compatibility with existing code

Agenda: Survey of Parallel Hardware

Personal Mobile Device Space
• iPhone 5: 2 CPU cores / 3 GPU cores
• Galaxy S3: 4 CPU cores / 4 GPU cores

Desktop Space
• AMD Opteron 6272: 16 CPU cores
• Rare to have a "single core" CPU
• Clock speeds < 3.0 GHz
  • Power wall
  • Heat dissipation

Desktop Space
• AMD Radeon 7970: 2,048 general-purpose GPU cores
• Power efficient
• High performance
• Not all problems can be done on the GPU

Warehouse Space (HokieSpeed)
• Each node:
  • 2x Intel Xeon 5645 (6 cores each)
  • 2x NVIDIA C2050 (448 GPU cores each)
• 209 nodes
• Totals: 2,508 CPU cores and 187,264 GPU cores

All Spaces
• (overview figure spanning the three device classes)

Convergence in Computing
• Three classes: warehouse, desktop, personal mobile device
• Main criteria: power, performance, programmability

Agenda: CPUs vs. GPUs

What is a CPU?
• CPU ≈ SR-71 jet
  • Capacity: 2 passengers
  • Top speed: 2,200 mph

What is a GPU?
• GPU ≈ Boeing 747
  • Capacity: 605 passengers
  • Top speed: 570 mph

CPU vs. GPU
                        Capacity (passengers)   Speed (mph)   Throughput (passengers * mph)
  "CPU" (fighter jet)             2                2,200               4,400
  "GPU" (747)                   452                  555             250,860

CPU Architecture
• Latency oriented (speculation)

GPU Architecture
• (architecture diagram)

APU = CPU + GPU
• Accelerated Processing Unit
• Both CPU and GPU on the same die

CPUs, GPUs, APUs
• How do we handle parallelism?
• How do we extract performance?
• Can I just throw processors at a problem?

CPUs, GPUs, APUs
• Multi-threading (2-16 threads)
• Massive multi-threading (100,000+ threads)
• Depends on your problem

Agenda: Conclusion / How Can I Start?

How Can I Start?
• CUDA programming (see the vector-add sketch below)
• You most likely have a CUDA-enabled GPU if you have a recent NVIDIA card

How Can I Start?
• CPU or GPU programming
• Use OpenCL (your laptop could potentially run it)
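To make the CUDA starting point concrete, here is a minimal vector-add sketch. The kernel name (vecAdd), array size, and launch configuration are illustrative choices, not taken from the slides; it compiles with nvcc (e.g., nvcc vector_add.cu -o vector_add).

// vector_add.cu -- minimal CUDA sketch (illustrative names and sizes)
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Each thread computes one element: c[i] = a[i] + b[i]
__global__ void vecAdd(const float *a, const float *b, float *c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) c[i] = a[i] + b[i];
}

int main() {
    const int n = 1 << 20;                       // 1M elements (arbitrary)
    const size_t bytes = n * sizeof(float);

    // Allocate and fill host arrays
    float *ha = (float *)malloc(bytes);
    float *hb = (float *)malloc(bytes);
    float *hc = (float *)malloc(bytes);
    for (int i = 0; i < n; ++i) { ha[i] = 1.0f; hb[i] = 2.0f; }

    // Allocate device arrays and copy the inputs over
    float *da, *db, *dc;
    cudaMalloc(&da, bytes); cudaMalloc(&db, bytes); cudaMalloc(&dc, bytes);
    cudaMemcpy(da, ha, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(db, hb, bytes, cudaMemcpyHostToDevice);

    // Launch enough 256-thread blocks to cover all n elements
    const int threads = 256;
    const int blocks = (n + threads - 1) / threads;
    vecAdd<<<blocks, threads>>>(da, db, dc, n);

    // Copy the result back and spot-check it
    cudaMemcpy(hc, dc, bytes, cudaMemcpyDeviceToHost);
    printf("c[0] = %f (expect 3.0)\n", hc[0]);

    cudaFree(da); cudaFree(db); cudaFree(dc);
    free(ha); free(hb); free(hc);
    return 0;
}

The same pattern (copy inputs in, launch a grid of threads, copy results out) carries over to OpenCL, with the kernel written in OpenCL C and launched through its host API.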
How Can I Start?
• Undergraduate research
• Senior/grad courses:
  • CS 4234 – Parallel Computation
  • CS 5510 – Multiprocessor Programming
  • ECE 4504/5504 – Computer Architecture
  • CS 5984 – Advanced Computer Graphics

In Summary ...
• Parallelism is here to stay
• How does this affect you?
  • How fast is fast enough?
  • Are we content with current computer performance?

Thank you!
• Carlo del Mundo, Senior, Computer Engineering
• Website: http://filebox.vt.edu/users/cdel/
• E-mail: cdel@vt.edu
• Previous internships @

Appendix

Programming Models
• pthreads
• MPI
• CUDA
• OpenCL

pthreads
• A UNIX API to create and destroy threads

MPI
• A communications protocol
• "Send and receive" messages between nodes

CUDA
• Massive multithreading (100,000+ threads)
• Thread-level parallelism

OpenCL
• A heterogeneous programming model that targets several devices (CPUs, GPUs, APUs)

Comparisons
                       pthreads   MPI            CUDA          OpenCL
  Number of threads    2-16       --             100,000+      2 - 100,000+
  Platform             CPU only   Any platform   NVIDIA only   Any platform
  Productivity†        Easy       Medium         Hard          Hard
  Parallelism through  Threads    Messages       Threads       Threads
† Productivity is subjective and draws from my experiences

Parallel Applications
• Vector add
• Matrix multiplication

Vector Add
• (figure: element-wise addition of two vectors)

Vector Add
• Serial: loop N times, so N cycles†
• Parallel: assume you have N cores, so 1 cycle†
† Assumes 1 add = 1 cycle

Matrix Multiplication
• (figure: C = A × B, computed one output element at a time)

Matrix Multiplication
• Embarrassingly parallel
• Let L be the length of each side
• L² elements; each element requires L multiplies and L adds
• (see the CUDA sketch at the end of the appendix)

Performance
• Operations per second (FLOPS)
• Power (W)
• Throughput (# things / unit time)
• FLOPS/W

Puss In Boots
• "Renders that took hours now take minutes" (Ken Mueseth, Effects R&D Supervisor, DreamWorks Animation)

Computational Finance
• Black-Scholes: a PDE that governs the price of an option, essentially "eliminating" risk

Genome Sequencing
• Knowledge of the human genome can provide insights into new medicine and biotechnology
• E.g., genetic engineering, hybridization

Why Should You Care?
• Trend: CPU core counts double every 2 years
  • 2006: 2 cores, AMD Athlon 64 X2
  • 2010: 8-12 cores, AMD Magny-Cours
• Power wall

Then and Now
• Today's state-of-the-art hardware is yesterday's supercomputer
• 1998: Intel TFLOPS supercomputer, 1.8 trillion floating-point ops/sec (1.8 TFLOPS)
• 2008: AMD Radeon 4870 GPU x2, 2.4 trillion floating-point ops/sec (2.4 TFLOPS)
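As referenced above, here is a minimal CUDA sketch of the square matrix multiply from the appendix: one thread per output element, each doing L multiplies and L adds over L² elements. The kernel name, helper function, tile size, and matrix size are illustrative assumptions, not from the slides.

// matmul.cu -- illustrative CUDA sketch of square matrix multiply, C = A * B
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// One thread computes one element C[row][col]:
// L multiplies and L adds per element, L^2 elements in total.
__global__ void matMul(const float *A, const float *B, float *C, int L) {
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < L && col < L) {
        float sum = 0.0f;
        for (int k = 0; k < L; ++k)
            sum += A[row * L + k] * B[k * L + col];
        C[row * L + col] = sum;
    }
}

// Launch helper: one 16x16 block of threads per 16x16 tile of C
// (the tile size is an arbitrary choice).
void launchMatMul(const float *dA, const float *dB, float *dC, int L) {
    dim3 threads(16, 16);
    dim3 blocks((L + threads.x - 1) / threads.x,
                (L + threads.y - 1) / threads.y);
    matMul<<<blocks, threads>>>(dA, dB, dC, L);
}

int main() {
    const int L = 512;                              // arbitrary matrix size
    const size_t bytes = (size_t)L * L * sizeof(float);
    float *hA = (float *)malloc(bytes), *hC = (float *)malloc(bytes);
    for (int i = 0; i < L * L; ++i) hA[i] = 1.0f;   // fill A with ones; reuse it as B

    float *dA, *dB, *dC;
    cudaMalloc(&dA, bytes); cudaMalloc(&dB, bytes); cudaMalloc(&dC, bytes);
    cudaMemcpy(dA, hA, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(dB, hA, bytes, cudaMemcpyHostToDevice);

    launchMatMul(dA, dB, dC, L);
    cudaMemcpy(hC, dC, bytes, cudaMemcpyDeviceToHost);
    printf("C[0] = %f (expect %d)\n", hC[0], L);    // each element sums L ones

    cudaFree(dA); cudaFree(dB); cudaFree(dC); free(hA); free(hC);
    return 0;
}

A serial CPU version is the familiar triple-nested loop; the GPU version simply peels off the two outer loops and assigns each (row, col) pair to its own thread.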