Current Trends in High Performance Computing

Chokchai "Box" Leangsuksun, PhD
SWEPCO Endowed Professor*, Computer Science
Director, High Performance Computing Initiative
Louisiana Tech University
box@latech.edu
*The SWEPCO endowed professorship is made possible by the LA Board of Regents.

Outline
• What is HPC?
• Current trends
• More on PS3 and GPU computing
• Conclusion

Mainstream CPUs
• CPU clock speed has plateaued at 3-4 GHz
• Instead, more cores are packed into a single chip
– Dual/quad core is the norm today
– Manycore (GPGPU) is next
• Traditional applications won't get a free ride
• They must be converted to parallel computing (HPC, multithreading)
[Diagram from the "no free lunch" article in DDJ.]

New Trends in Computing
• Old & current: SMP, cluster
• Multicore computers
– Intel Core 2 Duo
– AMD Athlon 64 X2
• Many-core accelerators
– GPGPU, FPGA, Cell
• Many brains in one computer
• Performance no longer comes from increasing CPU frequency
• Harness many computers together – cluster computing

What is HPC?
• High Performance Computing – parallel computing, supercomputing
– Achieves the fastest possible computing outcome
– Subdivides a very large job into many pieces
– Enabled by multiple high-speed CPUs, networking, software, and programming paradigms
– Technologies that help solve non-trivial tasks in science, engineering, medicine, business, entertainment, and beyond
• Time to insight, time to discovery, time to market

Parallel Programming Concepts
• Conventional serial execution: the problem is represented as a series of instructions executed by a single CPU.
• Parallel execution: the problem is partitioned into multiple executable parts that are mutually exclusive and collectively exhaustive, represented as a partially ordered set exhibiting concurrency.
• Parallel computing takes advantage of concurrency to:
– Solve larger problems in less time
– Save wall-clock time
– Overcome memory constraints
– Utilize non-local resources
[Diagram: a problem as one instruction stream on one CPU vs. a problem decomposed into tasks running on multiple CPUs.]
Source: Thomas Sterling's introduction to HPC

HPC Applications and Major Industries
• Finite element modeling – auto/aero
• Fluid dynamics – auto/aero, consumer packaged goods manufacturers, process manufacturing, disaster preparedness (tsunami)
• Imaging – seismic & medical
• Finance & business – banks, brokerage houses (regression analysis, risk, options pricing, what-if analysis, ...); Wal-Mart uses HPC in its operations
• Molecular modeling – biotech and pharmaceuticals
Complex problems, large datasets, long runs.
This slide is from the Intel presentation "Technologies for Delivering Peak Performance on HPC and Grid Applications".

HPC Drives the Knowledge Economy

Life Science Problem – Protein Folding Example
• A molecular dynamics simulation of a protein folding problem takes a computing year in serial mode
• Excerpted from IBM David Klepacki's "The Future of HPC"
• Petaflop = a thousand trillion floating point operations per second

Disaster Preparedness – Example
• Project LEAD – severe weather (tornado) prediction, led by OU
– HPC & dynamic adaptation of weather forecasts
• Professor Seidel's LSU CCT
– Hurricane route prediction
– Emergency preparedness
– Accuracy of prediction: 1 square mile = $1M

HPC Accelerates a Product
• FE analysis on 1 CPU
– 1,000,000 elements
– Numerical processing for 1 element = 0.1 s
– One computer takes 100,000 s ≈ 27.7 hours
• With, say, 100 CPUs: ≈ 1,000 s ≈ 17 minutes
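This arithmetic is easy to sanity-check in plain C. The sketch below is an illustration added here (not from the original deck) and assumes ideal linear speedup with no communication or synchronization overhead, which real finite-element solvers rarely achieve:

    #include <stdio.h>

    int main(void) {
        const double elements = 1e6;       /* finite elements in the model */
        const double sec_per_elem = 0.1;   /* processing time per element  */
        const int cpus = 100;

        double serial = elements * sec_per_elem;  /* one-CPU runtime (s)   */
        double parallel = serial / cpus;          /* ideal linear speedup  */

        printf("1 CPU    : %.0f s = %.1f hours\n", serial, serial / 3600.0);
        printf("%d CPUs : %.0f s = %.1f minutes\n",
               cpus, parallel, parallel / 60.0);
        return 0;
    }

Compiled with any C compiler, this prints roughly 27.8 hours for the serial run versus 16.7 minutes on 100 CPUs, matching the slide's figures.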
Avian Flu Pandemic Modeled on a Supercomputer
• MIDAS (Models of Infectious Disease Agent Study) program
• The large-scale, stochastic simulation model examines the nationwide spread of a pandemic influenza virus strain
• A simulation starts with 2 passengers infected with avian flu arriving at LAX
• The simulation rolls out a city-by-city, census-tract-level picture of the spread of infection
• A synthetic population of 281 million people over the course of 180 days
• A very large-scale, complex, multi-variate simulation

Avian Flu Pandemic (90 days)
Timothy C. Germann, Kai Kadau, Catherine A. Macken (Los Alamos National Laboratory); Ira M. Longini Jr. (Emory University)
Source: www.lanl.gov

Avian Flu Pandemic (II)
• The results show that advance preparation of a modestly effective vaccine in large quantities appears preferable to waiting for the development of a well-matched vaccine that may come too late.
• The simulation models a synthetic population that matches U.S. census demographics and worker mobility data by randomly assigning the simulated individuals to households, workplaces, schools, and the like.
• The models serve as virtual laboratories to study how infectious diseases spread and which intervention strategies are more effective.
• Run on the Los Alamos supercomputer known as Pink, a 1,024-node (2,048-processor) LinuxBIOS/BProc cluster with 2 GB of memory per node.
Source: www.lanl.gov

Significant Indicators – Why HPC Now?
• Mainstream computers now have multiple cores (Intel or AMD)
– Over the past 1-2 years, CPU speed has flattened at 3+ GHz
– More CPUs in one chip – dual-core and multi-core chips
– Traditional software won't take advantage of these new processors
– Personal/desktop supercomputing
• Many real problems are highly computationally intensive
– The NSA uses supercomputing for data mining
– DOE – fusion, plasma, and energy-related work (including weaponry)
– HPC helps in many other important areas (nanotech, life science, etc.)
– Product design, ERM/inventory management
• Industry and government giants have recently embraced HPC
– Bush's State of the Union speech named 3 main S&T focuses, supercomputing among them
– Bill Gates' keynote speech at SC05 – Microsoft goes after HPC
– Google's search engine runs on some 100,000 nodes
– The PlayStation 3 is a personal supercomputing platform
– Hollywood (entertainment) is HPC-bound (Pixar uses more than 3,000 CPUs to render animation)

HPC Preparedness
• Build workforces that understand the HPC paradigm & its applications
– HPC/grid curriculum in IT/CS/CE/ICT
– Offer HPC-enabling tracks to other disciplines (engineering, life science, physics, computational chemistry, business, etc.)
– Train the business community
– Bring awareness to the public
• National and strategic policies
• Improved infrastructure

Pause Here
• Switch to a tour of the machine rooms – the clusters and our lab – to show what students will be using
• Collect students' info on a signup sheet for accounts on our clusters (azul, quadcore, GPU, and PS3)
• Intro to Linux
• Then continue with HPC 101

HPC 101

How to Run Applications Faster?
• There are 3 ways to improve performance:
– Work harder
– Work smarter
– Get more help
• Computer analogy:
– Use faster hardware
– Optimize the algorithms and techniques used to solve the computational task
– Use multiple computers to solve a particular task

Parallel Programming Concepts
[Diagram repeated: a problem decomposed into tasks, each running as an instruction stream on its own CPU; a code sketch follows below.]
Source: Thomas Sterling's introduction to HPC
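To make the decomposition idea concrete, here is a minimal OpenMP sketch in C (an illustration added here, not from the original deck). Each loop iteration is an independent piece of work; the runtime partitions the iteration space across the available cores and combines the partial sums:

    #include <stdio.h>
    #include <omp.h>

    #define N 1000000

    static double a[N], b[N];

    int main(void) {
        double sum = 0.0;

        /* OpenMP splits these N independent iterations across all cores;
           reduction(+:sum) merges each thread's partial sum at the end.  */
        #pragma omp parallel for reduction(+:sum)
        for (int i = 0; i < N; i++) {
            a[i] = i * 0.5;
            b[i] = i * 2.0;
            sum += a[i] * b[i];
        }

        printf("dot product = %g (up to %d threads)\n",
               sum, omp_get_max_threads());
        return 0;
    }

Build with a compiler's OpenMP flag (e.g. gcc -fopenmp); the same source still runs serially if the pragma is ignored, which is part of OpenMP's appeal on the SMP and multicore machines discussed here.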
HPC Objective
• High Performance Computing – parallel computing, supercomputing
– Achieve the fastest possible computing outcome
– Subdivide a very large job into many pieces
– Enabled by multiple high-speed CPUs, networking, software, and programming paradigms
– Technologies that help solve non-trivial scientific, engineering, medical, business, and entertainment tasks

Flynn's Taxonomy of Computer Architectures
• SISD – Single Instruction/Single Data
• SIMD – Single Instruction/Multiple Data
• MISD – Multiple Instruction/Single Data
• MIMD – Multiple Instruction/Multiple Data

Single Instruction/Single Data
• PU = processing unit
• Your desktop, before the spread of dual-core CPUs
Slide source: Wikipedia, Flynn's Taxonomy

Flavors of SISD
[Diagram of instruction streams.]

More on pipelining...

Single Instruction/Multiple Data
• Processors that execute the same instruction on multiple pieces of data, e.g. NVIDIA GPUs
Slide source: Wikipedia, Flynn's Taxonomy

Single Instruction/Multiple Data
• Each core runs the same set of instructions on different data
• Example: a GPGPU processes the pixels of an image in parallel
Slide source: Klimovitski & Macri, Intel

SISD versus SIMD
• Writing a compiler for SIMD architectures is VERY difficult (inter-thread communication complicates the picture)
Slide source: Ars Technica, PeakStream article

Multiple Instruction/Single Data
• Pipeline, e.g. the CMU Warp machine
Slide source: Wikipedia, Flynn's Taxonomy

Multiple Instruction/Multiple Data
• Multicore systems are based on the MIMD architecture plus a programming paradigm such as OpenMP or multithreading
Slide source: Wikipedia, Flynn's Taxonomy

Multiple Instruction/Multiple Data
• The sky is the limit: each PU is free to do as it pleases
• Can fall into either the shared memory or the distributed memory category

Current HPC Hardware
• Traditionally, HPC has adopted expensive parallel hardware:
– Massively Parallel Processors (MPP)
– Symmetric Multi-Processors (SMP)
– Cluster computers
• Recent trends in HPC:
– Multicore systems
– Heterogeneous computing with accelerator boards (GPGPU, FPGA)

HPC Cluster
• Log in
• Compile
• Submit jobs
• At least 2 network connections
• Run tasks

Parallel Programming Environments
• Parallel programming environments and tools:
– Threads (PCs, SMPs, NOWs): POSIX threads, Java threads
– MPI – on Linux, NT, and many supercomputers (see the sketch after this list)
– OpenMP (predominantly on SMPs)
– PVM (old)
– UPC, Co-array Fortran
– CUDA, Brook+, OpenCL
– Software DSMs (Shmem)
– Compilers
– RAD (rapid application development) tools
– Debuggers
– Performance analysis tools
– Visualization tools
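Of the environments above, MPI is the workhorse on clusters. The following minimal C program is an illustrative sketch (not from the original deck) of the standard pattern: every process (rank) computes on its own slice of the problem, and a reduction gathers the partial results:

    #include <stdio.h>
    #include <mpi.h>

    #define N 1000000

    int main(int argc, char **argv) {
        int rank, size;
        long long local = 0, total = 0;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);  /* this process's id   */
        MPI_Comm_size(MPI_COMM_WORLD, &size);  /* number of processes */

        /* Each rank sums a strided slice of 1..N. */
        for (long long i = rank + 1; i <= N; i += size)
            local += i;

        /* Combine the partial sums on rank 0. */
        MPI_Reduce(&local, &total, 1, MPI_LONG_LONG, MPI_SUM, 0,
                   MPI_COMM_WORLD);

        if (rank == 0)
            printf("sum of 1..%d = %lld (computed by %d processes)\n",
                   N, total, size);

        MPI_Finalize();
        return 0;
    }

Launched with, e.g., mpirun -np 8, the same binary runs on a laptop or across a cluster's compute nodes; only the process count changes.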
Recent Trends in HPC Hardware
• Multicore & manycore are here now
– Multiple CPUs in a single die
– Better power consumption
– Tightly coupled, and better for multithreading
• GPGPU
• These serve as building blocks for much larger systems
• New Top 500 HPC systems are clusters of multicore CPUs & GPGPUs

What Are HPC Systems?

Current Top 5 Systems

Shared vs. Distributed Memory

Shared Memory
• Global memory space, accessible by all processors
• Processors may have local memory to hold copies of some global memory
• Consistency of the copies is usually maintained by hardware (cache coherency)

Two Typical Classes of Shared Memory
• Uniform Memory Access (UMA):
– Equal access times, identical processors
– Typically represented by Symmetric Multi-Processor (SMP) machines or multicores
• Non-Uniform Memory Access (NUMA):
– Memory access times are not uniform; memory access across a link is slower
– Often made by physically linking two or more SMPs, or by heterogeneous computing

Advantages & Disadvantages
• A global address space is user-friendly
• Data sharing between tasks is fast
• The system may suffer from a lack of scalability: adding CPUs increases traffic on the shared memory-to-CPU path, especially in cache-coherent systems
• The programmer is responsible for correct synchronization
• Systems larger than an SMP need special-purpose components

Distributed Memory

Multicores
• Three multicore classifications:
– Homogeneous
– Heterogeneous
– Hybrid

Multicores (I)
• Homogeneous cores (a main CPU)
– All cores are identical
– A traditional multicore with a few cores
– Good for a few jumbo tasks; not as many tasks/threads as accelerators or GPUs
– E.g. Intel Core 2 Duo, i3, i5, i7; AMD
– Programming: multithreading/OpenMP

Multicores (II)
• Homogeneous cores as an accelerator or compute device
– Need a main CPU system; attached as processing units
– All cores are identical, and there are many of them
– Good for many SIMD tasks/threads
– E.g. NVIDIA GPGPU, ClearSpeed, FPGA
– Programming: library calls from a main program, or a language extension such as CUDA

Multicores (III)
• Heterogeneous cores
– The cores are NOT identical, yet all sit in one die
– Programming is more difficult
– See more in the PS3 presentation

Multicores (IV)
• Hybrid system
– A mix of host cores & accelerator cores
– A typical host can be anything from a desktop to a server system, e.g. Intel or AMD
– Accelerator: NVIDIA, ATI Stream, or FPGA
– The programming model is more complex
– Issue: memory bandwidth between the host and the devices

Introduction to Cell BE (PS3) Programming
HPCI: High Performance Computing Initiative

PS3 – an Awesome HPC System
• IBM Cell processor
• Affordable
• But currently not many tools

Cell BE Architecture
• PowerPC Processor Element (PPE)
– Main processor, 64-bit
– Also supports Vector/SIMD instructions
– Runs the OS and manages the SPEs
• Synergistic Processor Element (SPE)
– 128-bit RISC, SIMD processor
– 256 KB of local storage memory
– Uses DMA to transfer data between local storage and main memory
Picture ref: http://gamasutra.com/features/20060721/chow_01.shtml

Cell Programming
• IBM Cell SDK
• The main process runs on the PPE
• Threads run on the SPEs
• A PPE-centric programming paradigm
[Diagram: one PPE process spawning multiple SPE threads; a sketch follows below.]
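As a hedged sketch of this PPE-centric model, the C fragment below launches a single SPE thread using the libspe2 interface from the IBM Cell SDK (the usual pattern repeats this once per SPE). my_spu_program stands in for a hypothetical SPE binary embedded by the toolchain; details vary across SDK versions:

    #include <stdio.h>
    #include <pthread.h>
    #include <libspe2.h>

    /* Hypothetical SPE program image, embedded at link time by the
       Cell SDK toolchain (an assumption for this sketch).           */
    extern spe_program_handle_t my_spu_program;

    static void *run_spe(void *arg) {
        spe_context_ptr_t ctx = (spe_context_ptr_t)arg;
        unsigned int entry = SPE_DEFAULT_ENTRY;
        /* Blocks until the SPE program stops, hence one PPE pthread
           per SPE context.                                          */
        spe_context_run(ctx, &entry, 0, NULL, NULL, NULL);
        return NULL;
    }

    int main(void) {
        pthread_t thread;
        spe_context_ptr_t ctx = spe_context_create(0, NULL); /* one SPE */

        spe_program_load(ctx, &my_spu_program); /* code into local store */
        pthread_create(&thread, NULL, run_spe, ctx);
        pthread_join(thread, NULL);             /* wait for the SPE      */
        spe_context_destroy(ctx);

        printf("SPE thread finished\n");
        return 0;
    }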
GPGPU
General Purpose Graphics Processing Unit

Two Major Players: NVIDIA and ATI

Parallel Computing on a GPU
• NVIDIA GPU computing architecture
– Accessed via a HW device interface
– Found in laptops, desktops, workstations, and servers
• 8-series GPUs deliver 50 to 500 GFLOPS on compiled parallel C applications
• Tesla T10-based S1070: 1 to 4 TFLOPS
• GPU parallelism is outpacing Moore's law: more than doubling every year
• A GPGPU is a GPU that allows the user to run both graphics and non-graphics applications
• Tesla D870, GeForce 8800
© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007, ECE 498AL, University of Illinois at Urbana-Champaign

NVIDIA GeForce 8800 (G80)
• The eighth generation of NVIDIA's GeForce graphics cards
• A high-performance, CUDA-enabled GPGPU
• 128 cores
• 256-768 MB of memory, or 1.5 GB in Tesla
• High-speed memory bandwidth
• Supports Scalable Link Interface (SLI)

NVIDIA Tesla
• Features
– GPU computing for HPC
– No display ports; dedicated to computation
– For massively multithreaded computing
– Supercomputing performance

NVIDIA Tesla Cards
• C-series (card) = 1 GPU with 1.5 GB
• D-series (deskside unit) = 2 GPUs
• S-series (1U server) = 4 GPUs
• Note: 1 G80 GPU = 128 cores ≈ 500 GFLOPS; 1 T10 = 240 cores = 1 TFLOPS
This slide is from the NVIDIA CUDA tutorial. © David Kirk/NVIDIA and Wen-mei W. Hwu, 2007, ECE 498AL, University of Illinois at Urbana-Champaign

GPGPU Programming with CUDA
• CUDA (Compute Unified Device Architecture) is an SDK and API that allow a programmer to write C and Fortran programs that execute on a GPGPU
• Works with the NVIDIA G80 or later, and Tesla
• The GPGPU is viewed as a compute device
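To give a flavor of what "C on a compute device" looks like, here is a minimal, illustrative CUDA vector-addition program (an example added here, with error checking omitted). The host copies data to the GPU, launches one lightweight thread per element, and copies the result back:

    #include <stdio.h>
    #include <stdlib.h>
    #include <cuda_runtime.h>

    /* Kernel: each GPU thread adds one pair of elements (SIMD-style). */
    __global__ void vadd(const float *a, const float *b, float *c, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            c[i] = a[i] + b[i];
    }

    int main(void) {
        const int n = 1 << 20;
        size_t bytes = n * sizeof(float);

        float *a = (float *)malloc(bytes), *b = (float *)malloc(bytes),
              *c = (float *)malloc(bytes);
        for (int i = 0; i < n; i++) { a[i] = (float)i; b[i] = 2.0f * i; }

        float *da, *db, *dc;  /* device (GPU) copies of the arrays */
        cudaMalloc((void **)&da, bytes);
        cudaMalloc((void **)&db, bytes);
        cudaMalloc((void **)&dc, bytes);
        cudaMemcpy(da, a, bytes, cudaMemcpyHostToDevice);
        cudaMemcpy(db, b, bytes, cudaMemcpyHostToDevice);

        /* Launch enough 256-thread blocks to cover all n elements. */
        vadd<<<(n + 255) / 256, 256>>>(da, db, dc, n);

        cudaMemcpy(c, dc, bytes, cudaMemcpyDeviceToHost);
        printf("c[12345] = %.1f (expected %.1f)\n", c[12345], 3.0f * 12345);

        cudaFree(da); cudaFree(db); cudaFree(dc);
        free(a); free(b); free(c);
        return 0;
    }

Compiled with nvcc, the kernel runs across the GPU's cores in the SIMD fashion described in the Flynn's taxonomy slides.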
ATI Stream (1)

ATI 4870

ATI 4870 X2

Architecture of the ATI Radeon 4000 Series
This slide is from an ATI presentation.

Introduction to OpenCL
Toward a new approach in computing
Moayad Almohaishi

Introduction to OpenCL
• OpenCL stands for Open Computing Language
• It is the result of a consortium effort including Apple, NVIDIA, AMD, and others
• It is managed by the Khronos Group, which was responsible for OpenGL
• It took 6 months to come up with the specification

OpenCL
1. Royalty-free
2. Supports both task-parallel and data-parallel programming modes
3. Works with vendor-agnostic GPGPUs
4. Including multicore CPUs
5. Works on Cell processors
6. Supports handhelds and mobile devices
7. Based on the C language (C99)

OpenCL
• Can query the available devices and build a context from them
• Programmers can program more freely for any kind of device
• Applications are more reusable even if the hardware changes in the future

OpenCL Platform Model
• CPU + GPU platforms

Performance of GPGPU
Note: for comparison, a cluster of 30 dual-Xeon 2.8 GHz nodes has a peak performance of ~336 GFLOPS.
© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007, ECE 498AL, University of Illinois at Urbana-Champaign

Last Words
• An HPC or supercomputing system is not necessarily a giant filling a big machine room; it is accessible for Thais and may now be sitting next to your desk
• Computing is a necessity, and fast computing provides a competitive edge, especially in the knowledge economy
• New trends in HPC include GPGPU and a variety of multicore architectures
• We should prepare ourselves and strengthen our S&T, industry, and business community for this phenomenon (HPC goes mainstream) before it is too late

Back-up Slides

Cancer Gene-Mining
• Unsuccessful on a uniprocessor
• Our approach:
– Novel parallel gene-mining algorithms
– Input from microarrays
– Retains accuracy
– Significant (superlinear) speedup
• IBM p5 supercomputer (128-node PPC)
[Chart: time to run the algorithm (in seconds) vs. number of processors (13, 39, 65, 91), keeping the number of nodes fixed, across cancer datasets (mesothelioma, bladder, breast, renal, leukemia, prostate, lung, pancreas, colorectal, ovary, lymphoma, melanoma), comparing OvaMarker-based and GeneSetMine-based selection.]

Drug Delivery
• By Wu & Palmer, Louisiana Tech University; assisted by HPCI
• A study of microcapsules for drug delivery
• Computational fluid dynamics (CFD) methodology to model the generation of droplets or cores (using alginate and oil)
• Goal: a better understanding of the process parameters needed to generate cores of homogeneous size for the manufacturing of microcapsules

Droplet Generation: Experimental Procedure

Droplet Generation: Example Results
• Case 1
– Olive oil: density 930 kg/m³, viscosity 0.03 kg/(m·s)
– Alginate: density 1012 kg/m³, viscosity 0.2137 kg/(m·s)
• Case 2
– Phase 1: density 918 kg/m³, viscosity 0.084 kg/(m·s)
– Phase 2: density 998.2 kg/m³, viscosity 0.001003 kg/(m·s)
Source: Wu's thesis