CPE 779 Parallel Computing http://www1.ju.edu.jo/ecourse/abusufah/cpe779_Spr12/index.html Lecture 1: Introduction Walid Abu-Sufah University of Jordan CPE 779 Parallel Computing - Spring 2012 1 Acknowledgment: Collaboration This course is being offered in collaboration with • The IMPACT research group at the University of Illinois http://impact.crhc.illinois.edu/ • The Computation based Science and Technology Research Center (CSTRC) of the Cyprus Institute http://cstrc.cyi.ac.cy/ CPE 779 Parallel Computing - Spring 2012 2 Acknowledgment: Slides Some of the slides used in this course are based on slides by • Jim Demmel, University of California at Berkeley & Horst Simon, Lawrence Berkeley National Lab (LBNL) http://www.cs.berkeley.edu/~demmel/cs267_Spr12/ • Wen-mei Hwu of the University of Illinois and David Kurk, Nvidia Corporation http://courses.engr.illinois.edu/ece408/ece408_syll.html • Kathy Yelick, University of California at Berkeley http://www.cs.berkeley.edu/~yelick/cs194f07 CPE 779 Parallel Computing - Spring 2012 3 Course Motivation In the last few years: • Conventional sequential processors can not get faster • Previously clock speed doubled every 18 months • All computers will be parallel • >>> All programs will have to become parallel programs • Especially programs that need to run faster. CPE 779 Parallel Computing - Spring 2012 4 Course Motivation (continued) There will be a huge change in the entire computing industry • Previously the industry depended on selling new computers by running their users' programs faster without the users having to reprogram them. • Multi/ many core chips have started a revolution in the software industry CPE 779 Parallel Computing - Spring 2012 5 Course Motivation (continued) Large research activities to address this issue are underway • Computer companies: Intel, Microsoft, Nvidia, IBM, ..etc • Parallel programming is a concern for the entire computing industry. • Universities • Berkeley's ParLab (2008: $20 million grant) CPE 779 Parallel Computing - Spring 2012 6 Course Goals Part 1 (~4 weeks) • focus on the techniques that are most appropriate for multicore programming and the use of parallelism to improve program performance. Topics include • performance analysis and tuning • data techniques • shared data structures • load balancing. and task parallelism • synchronization CPE 779 Parallel Computing - Spring 2012 7 Course Goals (continued - I) Part 2 (~ 12 weeks) • Learn how to program massively parallel processors and achieve • high performance • functionality and maintainability • scalability across future generations • Acquire technical knowledge required to achieve the above goals • principles and patterns of parallel algorithms • processor architecture features and constraints • programming API, tools and techniques CPE 779 Parallel Computing - Spring 2012 8 Outline of rest of lecture all • Why powerful computers must use parallel processors Including your laptops and handhelds • Examples of Computational Science and Engineering (CSE) problems which require powerful computers Commercial problems too • Why writing (fast) parallel programs is hard But things are improving • Principles of parallel computing performance • Structure of the course CPE 779 Parallel Computing - Spring 2012 9 What is Parallel Computing? • Parallel computing: using multiple processors in parallel to solve problems (execute applications) more quickly than with a single processor • Examples of parallel machines: • A cluster computer that contains multiple PCs combined together with a high speed network • A shared memory multiprocessor (SMP*) by connecting multiple processors to a single memory system • A Chip Multi-Processor (CMP) contains multiple processors (called cores) on a single chip • Concurrent execution comes from the desire for performance • * Technically, SMP stands for “Symmetric Multi-Processor” CPE 779 Parallel Computing - Spring 2012 10 Units of Measure • High Performance Computing (HPC) units are: • Flop: floating point operation • Flops/s: floating point operations per second • Bytes: size of data (a double precision floating point number is 8) • Typical sizes are millions, billions, trillions… Mega: Giga: Tera: Peta: Exa: Zetta: Yotta: Mflop/s = 1006 flop/sec; Gflop/s = 1009 flop/sec; Tflop/s = 1012 flop/sec; Pflop/s = 1015 flop/sec; Eflop/s = 1018 flop/sec; Zflop/s = 1021 flop/sec; Yflop/s = 1024 flop/sec; Mbyte = 220 = 1048576 ~ 106 bytes Gbyte = 230 ~ 109 bytes Tbyte = 240 ~ 1012 bytes Pbyte = 250 ~ 1015 bytes Ebyte = 260 ~ 1018 bytes Zbyte = 270 ~ 1021 bytes Ybyte = 280 ~ 1024 bytes • Current fastest (public) machine ~ 11 Pflop/s • Up-to-date list at www.top500.org CPE 779 Parallel Computing - Spring 2012 11 all (2007) Why powerful computers are parallel CPE 779 Parallel Computing - Spring 2012 12 Technology Trends: Microprocessor Capacity 2X transistors/Chip Every 1.5 years Called “Moore’s Law” Microprocessors have become smaller, denser, and more powerful. Gordon Moore (co-founder of Intel) predicted in 1965 that the transistor density of semiconductor chips would double roughly every 18 months. Slide source: Jack Dongarra CPE 779 Parallel Computing - Spring 2012 13 Microprocessor Transistors / Clock (1970-2000) 10000000 1000000 Transistors (Thousands) 100000 Frequency (MHz) 10000 1000 100 10 1 0 1970 14 1975 1980 1985 1990 CPE 779 Parallel Computing - Spring 2012 1995 2000 Impact of Device Shrinkage • What happens when the feature size (transistor size) shrinks by a factor of x ? • Clock rate goes up by x because wires are shorter • actually less than x, because of power consumption • Transistors per unit area goes up by x2 • Die size also tends to increase • typically another factor of ~x • Raw computing power of the chip goes up by ~ x4 ! • typically x3 is devoted to either on-chip • parallelism: hidden parallelism such as ILP • locality: caches • So most programs x3 times faster, without changing them 15 CPE 779 Parallel Computing - Spring 2012 Power Density Limits Serial Performance – Dynamic power is proportional to V2fC – Increasing frequency (f) also increases supply voltage (V) cubic effect – Increasing cores increases capacitance (C) but only linearly – Save power by lowering clock speed Scaling clock speed (business as usual) will not work 10000 Sun’s Surface Source: Patrick Gelsinger, Shenkar Bokar, Intel Rocket 1000 Power Density (W/cm2) • Concurrent systems are more power efficient Nozzle Nuclear 100 Reactor Hot Plate 8086 10 4004 8008 8080 P6 8085 286 Pentium® 386 486 1 1970 1980 1990 2000 2010 Year • High performance serial processors waste power - Speculation, dynamic dependence checking, etc. burn power - Implicit parallelism discovery • More transistors, but not faster serial processors CPE 779 Parallel Computing - Spring 2012 16 Revolution in Processors 10000000 1000000 1000000 Transistors Transistors (Thousands) (Thousands) Transistors(MHz) (Thousands) Frequency Frequency (MHz) Power Cores (W) Cores 100000 100000 10000 10000 1000 1000 100 100 10 10 1 1 0 1970 • • • • 1975 1980 1985 1990 1995 2000 2005 Chip density is continuing increase ~2x every 2 years Clock speed is not Number of processor cores may double instead CPE 779 Parallel Computing - Spring 2012 Power is under control, no longer growing 2010 17 Parallelism in 2012? • These arguments are no longer theoretical • All major processor vendors are producing multicore chips • Every machine will soon be a parallel machine • To keep doubling performance, parallelism must double • Which (commercial) applications can use this parallelism? • Do they have to be rewritten from scratch? • Will all programmers have to be parallel programmers? • New software model needed • Try to hide complexity from most programmers – eventually • In the meantime, need to understand it • Computer industry betting on this big change, but does not have all the answers • Berkeley ParLab established to work on this 18 CPE 779 Parallel Computing - Spring 2012 Parallelism in 2012? • These arguments are no longer theoretical • All major processor vendors are producing multicore chips • Every machine will soon be a parallel machine • To keep doubling performance, parallelism must double • Which (commercial) applications can use this parallelism? • Do they have to be rewritten from scratch? • Will all programmers have to be parallel programmers? • New software model needed • Try to hide complexity from most programmers – eventually • In the meantime, need to understand it • Computer industry betting on this big change, but does not have all the answers • Berkeley ParLab established to work on this CPE 779 Parallel Computing - Spring 2012 19 Memory is Not Keeping Pace Technology trends against a constant or increasing memory per core • Memory density is doubling every three years; processor logic is every two • Storage costs (dollars/Mbyte) are dropping gradually compared to logic costs Cost of Computation vs. Memory Source: David Turek, IBM Source: IBM Question: Can you double concurrency without doubling memory? • Strong scaling: fixed problem size, increase number of processors • Weak scaling: grow problem size proportionally to number of processors 20 The TOP500 Project • Listing the 500 most powerful computers in the world • Yardstick: Rmax of Linpack • Solve Ax=b, dense problem, matrix is random • Dominated by dense matrix-matrix multiply • Update twice a year: • ISC’xy in June in Germany • SCxy in November in the U.S. • All information available from the TOP500 web site at: www.top500.org CPE 779 Parallel Computing - Spring 2012 21 38th List: The TOP10 Rank Site Manufacturer 1 RIKEN Advanced Institute for Computational Science Fujitsu 2 National SuperComputer Center in Tianjin NUDT 3 Oak Ridge National Laboratory Cray 4 National Supercomputing Centre in Shenzhen Dawning 5 GSIC, Tokyo Institute of Technology NEC/HP 6 DOE/NNSA/LANL/SNL Cray 7 NASA/Ames Research Center/NAS SGI 8 DOE/SC/ LBNL/NERSC 9 Commissariat a l'Energie Atomique (CEA) 10 DOE/NNSA/LANL Computer K Computer SPARC64 VIIIfx 2.0GHz, Tofu Interconnect Tianhe-1A NUDT TH MPP, Rmax Power [Pflops] [MW] Country Cores Japan 795,024 10.51 12.66 China 186,368 2.566 4.04 USA 224,162 1.759 6.95 China 120,640 1.271 2.58 Japan 73,278 1.192 1.40 USA 142,272 1.110 3.98 USA 111,104 1.088 4.10 USA 153,408 1.054 2.91 France 138.368 1.050 4.59 USA 122,400 22 1.042 2.34 Xeon 6C, NVidia, FT-1000 8C Jaguar Cray XT5, HC 2.6 GHz Nebulae TC3600 Blade, Intel X5650, NVidia Tesla C2050 GPU TSUBAME-2 HP ProLiant, Xeon 6C, NVidia, Linux/Windows Cielo Cray XE6, 8C 2.4 GHz Pleiades SGI Altix ICE 8200EX/8400EX Hopper Cray XE6, 6C 2.1 GHz Tera 100 Bull Bull bullx super-node S6010/S6030 Roadrunner CPE IBM 779 Parallel Computing - Spring 2012 BladeCenter QS22/LS21 Cray Performance Development 74.2 PFlop/s 100 Pflop/s 10.51 PFlop/s 10 Pflop/s 1 Pflop/s 100 Tflop/s SUM 50.9 TFlop/s 10 Tflop/s 1 Tflop/s N=1 1.17 TFlop/s 100 Gflop/s 59.7 GFlop/s N=500 10 Gflop/s 1 Gflop/s 100 Mflop/s 400 MFlop/s CPE 779 Parallel Computing - Spring 2012 23 Projected Performance Development 1 Eflop/s 100 Pflop/s 10 Pflop/s 1 Pflop/s SUM 100 Tflop/s 10 Tflop/s N=1 1 Tflop/s 100 Gflop/s N=500 10 Gflop/s 1 Gflop/s 100 Mflop/s CPE 779 Parallel Computing - Spring 2012 24 Core Count CPE 779 Parallel Computing - Spring 2012 25 Moore’s Law reinterpreted • Number of cores per chip can double every two years • Clock speed will not increase (possibly decrease) • Need to deal with systems with millions of concurrent threads • Need to deal with inter-chip parallelism as well as intra-chip parallelism CPE 779 Parallel Computing - Spring 2012 26 Outline all • Why powerful computers must be parallel processors Including your laptops and handhelds • Large CSE problems require powerful computers Commercial problems too • Why writing (fast) parallel programs is hard But things are improving • Structure of the course 27 CPE 779 Parallel Computing - Spring 2012 Drivers for Change • Continued exponential increase in computational power • Can simulate what theory and experiment can’t do • Continued exponential increase in experimental data • Moore’s Law applies to sensors too • Need to analyze all that data CPE 779 Parallel Computing - Spring 2012 28 Simulation: The Third Pillar of Science • Traditional scientific and engineering method: (1) Do theory or paper design (2) Perform experiments or build system Theory Experiment • Limitations: –Too difficult—build large wind tunnels –Too expensive—build a throw-away passenger jet –Too slow—wait for climate or galactic evolution –Too dangerous—weapons, drug design, climate experimentation Simulation • Computational science and engineering paradigm: (3) Use computers to simulate and analyze the phenomenon • Based on known physical laws and efficient numerical methods • Analyze simulation results with computational tools and methods beyond what is possible manually CPE 779 Parallel Computing - Spring 2012 29 Data Driven Science • Scientific data sets are growing exponentially - Ability to generate data is exceeding our ability to store and analyze - Simulation systems and some observational devices grow in capability with Moore’s Law • Petabyte (PB) data sets will soon be common: • Climate modeling: estimates of the next IPCC data is in 10s of petabytes • Genome: JGI alone will have .5 petabyte of data this year and double each year • Particle physics: LHC is projected to produce 16 petabytes of data per year • Astrophysics: LSST and others will produce 5 petabytes/year (via 3.2 Gigapixel camera) • Create scientific communities with “Science Gateways” to data 30 Some Particularly Challenging Computations • Science • • • • • Global climate modeling Biology: genomics; protein folding; drug design Astrophysical modeling Computational Chemistry Computational Material Sciences and Nanosciences • Engineering • • • • • Semiconductor design Earthquake and structural modeling Computation fluid dynamics (airplane design) Combustion (engine design) Crash simulation • Business • Financial and economic modeling • Transaction processing, web services and search engines • Defense • Nuclear weapons -- test by simulations • Cryptography CPE 779 Parallel Computing - Spring 2012 31 Economic Impact of HPC • Airlines: • System-wide logistics optimization systems on parallel systems. • Savings: approx. $100 million per airline per year. • Automotive design: • Major automotive companies use large systems (500+ CPUs) for: • CAD-CAM, crash testing, structural integrity and aerodynamics. • One company has 500+ CPU parallel system. • Savings: approx. $1 billion per company per year. • Semiconductor industry: • Semiconductor firms use large systems (500+ CPUs) for • device electronics simulation and logic validation • Savings: approx. $1 billion per company per year. • Energy • Computational modeling improved performance of current nuclear power plants, equivalent to building two new power plants. CPE 779 Parallel Computing - Spring 2012 32 $5B World Market in Technical Computing in 2004 1998 1999 2000 2001 2002 2003 100% 90% 80% 70% Other Technical Management and Support Simulation Scientific Research and R&D Mechanical Design/Engineering Analysis Mechanical Design and Drafting 60% Imaging 50% Geoscience and Geoengineering 40% Electrical Design/Engineering Analysis Economics/Financial 30% Digital Content Creation and Distribution 20% Classified Defense 10% Chemical Engineering 0% Biosciences Source: IDC 2004, from NRC Future of Supercomputing Report CPE 779 Parallel Computing - Spring 2012 33 Why writing (fast) parallel programs is hard CPE 779 Parallel Computing - Spring 2012 34 Principles of Parallel Computing • • • • • • Finding enough parallelism (Amdahl’s Law) Granularity Locality Load balance Coordination and synchronization Performance modeling All of these things makes parallel programming harder than sequential programming. CPE 779 Parallel Computing - Spring 2012 35 “Automatic” Parallelism in Modern Machines • Bit level parallelism • within floating point operations, etc. • Instruction level parallelism (ILP) • multiple instructions execute per clock cycle • Memory system parallelism • overlap of memory operations with computation • OS parallelism • multiple jobs run in parallel on commodity SMPs Limits to all of these -- for very high performance, need user to identify, schedule and coordinate parallel tasks CPE 779 Parallel Computing - Spring 2012 36 Finding Enough Parallelism: Amdahl’s Law T1 = execution time using 1 processor (serial execution time) Tp = execution time using P processors S = serial fraction of computation (i.e. fraction of computation which can only be executed using 1 processor) C = fraction of computation which could be executed by p processors Then S + C = 1 and Tp = S * T1+ (T1 * C)/P = (S + C/P)T1 Speedup = Ψ(p) = T1/Tp = 1/(S+C/P) <= 1/S • Maximum speedup (i.e. when P=∞), Smax = 1/S; example S=.05 , speedup max= 20 • Currently the fastest machine has 705K processors; 2nd fastest has ~186K processors +GPUs • Even if the parallel part speeds up perfectly performance is limited 37 by the sequential part Speedup Barriers: (a) Overhead of Parallelism • Given enough parallel work, overhead is a big barrier to getting desired speedup • Parallelism overheads include: • • • • cost of starting a thread or process cost of communicating shared data cost of synchronizing extra (redundant) computation • Each of these can be in the range of milliseconds (=millions of flops) on some systems • Tradeoff: Algorithm needs sufficiently large units of work to run fast in parallel (I.e. large granularity), but not so large that there is not enough parallel work CPE 779 Parallel Computing - Spring 2012 38 Speedup Barriers: (b) Working on Non Local Data Conventional Storage Hierarchy Proc Cache L2 Cache Proc Cache L2 Cache Proc Cache L2 Cache L3 Cache L3 Cache Memory Memory Memory potential interconnects L3 Cache • Large memories are slow, fast memories are small • Parallel processors, collectively, have large, fast cache • the slow accesses to “remote” data we call “communication” • Algorithm should do most work on local data CPE 779 Parallel Computing - Spring 2012 39 Processor-DRAM Gap (latency) Goal: find algorithms that minimize communication, not necessarily arithmetic CPU “Moore’s Law” 10 1 Time 40 µProc 60%/yr. Processor-Memory Performance Gap: (grows 50% / year) DRAM DRAM 7%/yr. 100 1980 1981 1982 1983 1984 1985 1986 1987 1988 1989 1990 1991 1992 1993 1994 1995 1996 1997 1998 1999 2000 Performance 1000 CPE 779 Parallel Computing - Spring 2012 Speedup Barriers: (c) Load Imbalance • Load imbalance occurs when some processors in the system are idle due to • insufficient parallelism (during that phase) • unequal size tasks • Algorithm needs to balance load CPE 779 Parallel Computing - Spring 2012 41 Outline all • Why powerful computers must be parallel processors Including your laptops and handhelds • Large CSE problems require powerful computers Commercial problems too • Why writing (fast) parallel programs is hard But things are improving • Structure of the course CPE 779 Parallel Computing - Spring 2012 42 Instructor (Sections 2 & 3) • Instructor: Dr. Walid Abu-Sufah • Office: CPE 10 • Email: abusufah@ju.edu.jo • Office Hours: Monday 11-12, Tuesday 12-1 and by appointment. • Course web site: http://www1.ju.edu.jo/ecourse/abusufah/cpe779_spr12/index. html CPE 779 Parallel Computing - Spring 2012 43 Prerequisite • CPE 432: Computer Design, and general C programming skills 44 CPE 779 Parallel Computing - Spring 2012 Grading Policy • Programming Assignments: 25% • Demo/knowledge: 25% • Functionality and Performance: 40% • Report: 35% • Project: 35% • Design Document: 25% • Project Presentation: 25% • Demo/Functionality/Performance/Report: 50% • Midterm: 15% • Final: 25 % 45 CPE 779 Parallel Computing - Spring 2012 Bonus Days • Each of you get five bonus days • A bonus day is a no-questions-asked one-day extension that can be used on most assignments • You can’t turn in multiple versions of a team assignment on different days; all of you must combine individual bonus days into one team bonus day. • You can use multiple bonus days on the same assignment • Weekends/holidays don’t count for the number of days of extension (Thursday-Sunday is one day extension) • Intended to cover illnesses, just needing more time, etc. CPE 779 Parallel Computing - Spring 2012 46 Using Bonus Days • Bonus days are automatically applied to late projects • Penalty for being late beyond bonus days is 10% of the possible points/day, again counting only • Things you can’t use bonus days on: • Final project design documents, final project presentations, final project demo, exam CPE 779 Parallel Computing - Spring 2012 47 Academic Honesty • You are allowed and encouraged to discuss assignments with other students in the class. Getting verbal advice/help from people who’ve already taken the course is also fine. • Any reference to assignments from web postings is unacceptable • Any copying of non-trivial code is unacceptable • Non-trivial = more than a line or so • Includes reading someone else’s code and then going off to write your own. CPE 779 Parallel Computing - Spring 2012 48 Academic Honesty (cont.) • Giving/receiving help on an exam is unacceptable • Penalties for academic dishonesty: • Zero on the assignment for the first occasion • Automatic failure of the course for repeat offenses CPE 779 Parallel Computing - Spring 2012 49 Team Projects • Work can be divided up between team members in any way that works for you • However, each team member will demo the final checkpoint of each project individually, and will get a separate demo grade • This will include questions on the entire design • Rationale: if you don’t know enough about the whole design to answer questions on it, you aren’t involved enough in the project CPE 779 Parallel Computing - Spring 2012 50 Text/Notes 1. D. Kirk and W. Hwu, “Programming Massively Parallel Processors – A Hands-on Approach,” Morgan Kaufman Publisher, 2010, ISBN 9780123814722 2. Cleve B. Moler, Numerical Computing with MATLAB, Society for Industrial Mathematics (January 1, 2004). Available for individual download at http://www.mathworks.com/moler/chapters.html 3. NVIDIA, NVidia CUDA C Programming Guide, version 4.0, NVidia, 2011 (reference book) 4. Lecture notes will be posted at the class web site CPE 779 Parallel Computing - Spring 2012 51 Rough List of Topics • Basics of computer architecture, memory hierarchies, performance • Parallel Programming Models and Machines • Shared Memory and Multithreading (OpenMP) • Distributed Memory and Message Passing (MPI) 52 CPE 779 Parallel Computing - Spring 2012 Rough List of Topics (continued) • Programming NVIDIA processors using CUDA • Introduction to CUDA C • CUDA Parallel Execution Model with Fermi Updates • CUDA Memory Model with Fermi Updates • Tiled Matrix-Matrix Multiplication • Debugging and Profiling, Introduction to Convolution • Convolution, Constant Memory and Constant Caching CPE 779 Parallel Computing 53 - Spring 2012 Rough List of Topics (continued) • Programming NVIDIA processors using CUDA (continued) • Tiled 2D Convolution • Parallel Computation Patterns - Reduction Trees • Memory Bandwidth • Parallel Computation Patterns - Prefix Sum (Scan) • Floating Point Considerations • Atomic Operations and Histogramming • Data Transfers and GMAC • Multi-GPU Programming in CUDA and GMAC • MPI and CUDA Programming CPE 779 Parallel Computing - Spring 2012 54 Rough List of Topics (continued) • Selected numerical computing topics (with MATLAB) • Linear Equations • Eigenvalues CPE 779 Parallel Computing - Spring 2012 55