CPE 779 Parallel Computing
http://www1.ju.edu.jo/ecourse/abusufah/cpe779_Spr10/index.html
Lecture 1: Introduction
Walid Abu-Sufah, University of Jordan

Acknowledgment: Collaboration
This course is offered in collaboration with
• The IMPACT research group at the University of Illinois http://impact.crhc.illinois.edu/
• The Universal Parallel Computing Research Center (UPCRC) at the University of Illinois http://www.upcrc.illinois.edu/
• The Computation-based Science and Technology Research Center (CSTRC) of the Cyprus Institute http://cstrc.cyi.ac.cy/

Acknowledgment: Slides
Some of the slides used in this course are based on slides by
• Kathy Yelick, University of California at Berkeley http://www.cs.berkeley.edu/~yelick/cs194f07
• Jim Demmel, University of California at Berkeley, and Horst Simon, Lawrence Berkeley National Lab (LBNL) http://www.cs.berkeley.edu/~demmel/cs267_Spr10/
• Wen-mei Hwu and Sanjay Patel of the University of Illinois and David Kirk, NVIDIA Corporation http://courses.ece.illinois.edu/ece498/al/

Course Motivation
In the last few years:
• Conventional sequential processors cannot get faster
  - Previously, clock speed doubled every 18 months
• All computers will be parallel
• Therefore, all programs will have to become parallel programs
  - Especially programs that need to run faster

Course Motivation (continued)
There will be a huge change in the entire computing industry
• Previously, the industry depended on selling new computers that ran users' existing programs faster without the users having to reprogram them.
• Multi-/many-core chips have started a revolution in the software industry.

Course Motivation (continued)
Large research activities are underway to address this issue
• Computer companies: Intel, Microsoft, NVIDIA, IBM, etc.
  - Parallel programming is a concern for the entire computing industry.
• Universities
  - Berkeley's ParLab (2008: $20 million grant)
  - The Universal Parallel Computing Research Center (UPCRC) at the University of Illinois (2008: $20 million grant)

Course Goals
• The purpose of this course is to teach students the skills needed to develop applications that can take advantage of on-chip parallelism.
• Part 1 (~4 weeks): focus on the techniques that are most appropriate for multicore architectures and on the use of parallelism to improve program performance. Topics include
  - performance analysis and tuning
  - data techniques
  - shared data structures
  - load balancing and task parallelism
  - synchronization

Course Goals (continued - I)
• Part 2 (~12 weeks): provide students with knowledge and hands-on experience in developing application software for massively parallel processors (100s or 1000s of cores)
  - Use NVIDIA GPUs and the CUDA programming language (a small sketch follows below).
  - To program these processors effectively, students will acquire in-depth knowledge of
    - data parallel programming principles
    - parallelism models
    - communication models
    - resource limitations of these processors.
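To give a first flavor of the data-parallel style that Part 2 builds toward, here is a minimal CUDA vector-addition sketch. It is not code from the course: the array names, problem size, and the 256-thread block size are illustrative choices; only the kernel/launch pattern itself is the standard CUDA idiom.

```cuda
#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>

// Each GPU thread computes one element of c = a + b (data parallelism).
__global__ void vecAdd(const float *a, const float *b, float *c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;   // global thread index
    if (i < n)                                       // guard the last partial block
        c[i] = a[i] + b[i];
}

int main(void) {
    const int n = 1 << 20;                           // illustrative problem size
    size_t bytes = n * sizeof(float);

    // Host arrays (error checking omitted to keep the sketch short)
    float *h_a = (float *)malloc(bytes), *h_b = (float *)malloc(bytes), *h_c = (float *)malloc(bytes);
    for (int i = 0; i < n; i++) { h_a[i] = 1.0f; h_b[i] = 2.0f; }

    // Device copies of the arrays
    float *d_a, *d_b, *d_c;
    cudaMalloc((void **)&d_a, bytes);
    cudaMalloc((void **)&d_b, bytes);
    cudaMalloc((void **)&d_c, bytes);
    cudaMemcpy(d_a, h_a, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, h_b, bytes, cudaMemcpyHostToDevice);

    // Launch one thread per element
    int threadsPerBlock = 256;                       // one common choice; tuning is a course topic
    int blocks = (n + threadsPerBlock - 1) / threadsPerBlock;
    vecAdd<<<blocks, threadsPerBlock>>>(d_a, d_b, d_c, n);

    cudaMemcpy(h_c, d_c, bytes, cudaMemcpyDeviceToHost);
    printf("c[0] = %f\n", h_c[0]);                   // expect 3.0

    cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
    free(h_a); free(h_b); free(h_c);
    return 0;
}
```

Compiling this with nvcc and experimenting with the block size is a small first taste of the data-parallel programming and performance-tuning topics listed above.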
Outline of rest of lecture
• Why powerful computers must use parallel processors
  - Including your laptops and handhelds
• Examples of Computational Science and Engineering (CSE) problems which require powerful computers
  - Commercial problems too
• Why writing (fast) parallel programs is hard
  - But things are improving
• Principles of parallel computing performance
• Structure of the course

What is Parallel Computing?
• Parallel computing: using multiple processors in parallel to solve problems (execute applications) more quickly than with a single processor
• Examples of parallel machines:
  - A cluster computer that contains multiple PCs combined with a high-speed network
  - A shared memory multiprocessor (SMP*), built by connecting multiple processors to a single memory system
  - A chip multiprocessor (CMP), which contains multiple processors (called cores) on a single chip
• Concurrent execution comes from the desire for performance
* Technically, SMP stands for "Symmetric Multi-Processor"

Units of Measure
• High Performance Computing (HPC) units are:
  - Flop: floating point operation
  - Flop/s: floating point operations per second
  - Bytes: size of data (a double-precision floating point number is 8 bytes)
• Typical sizes are millions, billions, trillions...
  - Mega:  Mflop/s = 10^6 flop/sec;   Mbyte = 2^20 = 1,048,576 ~ 10^6 bytes
  - Giga:  Gflop/s = 10^9 flop/sec;   Gbyte = 2^30 ~ 10^9 bytes
  - Tera:  Tflop/s = 10^12 flop/sec;  Tbyte = 2^40 ~ 10^12 bytes
  - Peta:  Pflop/s = 10^15 flop/sec;  Pbyte = 2^50 ~ 10^15 bytes
  - Exa:   Eflop/s = 10^18 flop/sec;  Ebyte = 2^60 ~ 10^18 bytes
  - Zetta: Zflop/s = 10^21 flop/sec;  Zbyte = 2^70 ~ 10^21 bytes
  - Yotta: Yflop/s = 10^24 flop/sec;  Ybyte = 2^80 ~ 10^24 bytes
• Current fastest (public) machine ~ 2.3 Pflop/s
• Up-to-date list at www.top500.org

High Performance Computing, HPC
• Parallel computers have been used for decades
• Mostly used in computational science, engineering, business, and defense
• Problems too large to solve on one processor; use 100s or 1000s
• Examples of challenging computations in science
  - Global climate modeling
  - Biology: genomics; protein folding; drug design
  - Astrophysical modeling
  - Computational chemistry
  - Computational materials science and nanoscience

High Performance Computing, HPC (continued)
• Examples of challenging computations in engineering
  - Semiconductor design
  - Earthquake and structural modeling
  - Computational fluid dynamics (airplane design)
  - Combustion (engine design)
  - Crash simulation
• Examples of challenging computations in business
  - Financial and economic modeling
  - Transaction processing, web services, and search engines
• Examples of challenging computations in defense
  - Nuclear weapons -- test by simulations
  - Cryptography

Economic Impact of HPC
• Airlines:
  - Logistics optimization systems run on parallel computers.
  - Savings: approx. $100 million per airline per year.
• Automotive design:
  - Major automotive companies use large systems (500+ CPUs) for CAD-CAM, crash testing, structural integrity, and aerodynamics.
  - One company has a 500+ CPU parallel system.
  - Savings: approx. $1 billion per company per year.
• Semiconductor industry:
  - Semiconductor firms use large systems (500+ CPUs) for device electronics simulation and logic validation.
  - Savings: approx. $1 billion per company per year.
• Securities industry (note: old data, 2007):
  - Savings: approx. $15 billion per year for U.S. home mortgages.
Why powerful computers are parallel

What is New in Parallel Computing Now?
• In the 80s/90s many companies "bet" on parallel computing and failed
  - Computers got faster too quickly for there to be a large market
• What is new now?
  - The entire computing industry has bet on parallelism
  - There is a desperate need for parallel programmers
• Let's see why...

Technology Trends: Microprocessor Capacity
• 2X transistors/chip every 1.5 years: called "Moore's Law"
• Microprocessors have become smaller, denser, and more powerful.
• Gordon Moore (co-founder of Intel) predicted in 1965 that the transistor density of semiconductor chips would double roughly every 18 months.
(Slide source: Jack Dongarra)

Microprocessor Transistors and Clock Rate
[Figure: growth in transistors per chip and increase in clock rate (MHz), 1970-2005, for processors from the i4004, i8080, and i8086 through the i80286, i80386, R2000/R3000, Pentium, and R10000]
• In 2002: Why bother with parallel programming? Just wait a year or two...

Limit #1: Power Density
• "Can soon put more transistors on a chip than can afford to turn on." -- Patterson '07
• Scaling clock speed (business as usual) will not work
[Figure: power density (W/cm²) of microprocessors from the 4004, 8008, 8080, 8085, 8086, 286, 386, and 486 through the Pentium and P6, 1970-2010, heading past a hot plate toward the levels of a nuclear reactor, a rocket nozzle, and the sun's surface. Source: Patrick Gelsinger, Intel]

Limit #2: Hidden Parallelism Tapped Out
• Application performance was increasing by 52% per year, as measured by the SpecInt benchmarks (from Hennessy and Patterson, Computer Architecture: A Quantitative Approach, 4th edition, 2006)
  - ~1/2 due to transistor density
  - ~1/2 due to architecture changes, e.g., Instruction Level Parallelism (ILP)
• VAX: 25%/year, 1978 to 1986
• RISC + x86: 52%/year, 1986 to 2002

Limit #2: Hidden Parallelism Tapped Out (continued)
• Superscalar (SS) designs were the state of the art; many forms of parallelism were not visible to the programmer
  - multiple instruction issue
  - dynamic scheduling: hardware discovers parallelism between instructions
  - speculative execution: look past predicted branches
  - non-blocking caches: multiple outstanding memory operations
• Unfortunately, these sources have been used up

Revolution is Happening Now
• Chip density is continuing to increase ~2x every 2 years
• Clock speed is not
• The number of processor cores may double instead
• There is little or no hidden parallelism (ILP) left to be found
• Parallelism must be exposed to and managed by software
(Source: Intel, Microsoft (Sutter), and Stanford (Olukotun, Hammond))

Parallelism in 2010?
• All major processor vendors are producing multicore chips
  - Every machine will soon be a parallel machine
  - To keep doubling performance, parallelism must double
• Which commercial applications can use this parallelism?
  - Do they have to be rewritten from scratch?
• Will all programmers have to be parallel programmers?
  - A new software model is needed
  - Try to hide complexity from most programmers -- eventually
  - In the meantime, we need to understand it
• The computer industry is betting on this big change, but it does not have all the answers
  - Berkeley's ParLab and the University of Illinois UPCRC were established to work on this

Moore's Law reinterpreted
• Number of cores per chip will double every two years
• Clock speed will not increase (possibly decrease)
• Need to deal with systems with millions of concurrent threads
• Need to deal with inter-chip parallelism as well as intra-chip parallelism
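To make the reinterpreted trend concrete, the following back-of-the-envelope relation is one way to state it; it is an illustration assuming the bullets above (flat per-core speed, cores doubling every two years), not a formula from the slides.

```latex
% Illustration only: with per-core speed flat and cores doubling every two years,
\[
  \mathrm{cores}(t) \;\approx\; \mathrm{cores}(t_0)\cdot 2^{(t-t_0)/2},
  \qquad
  \mathrm{perf}(t) \;\approx\; \mathrm{perf_{core}} \times \min\{\mathrm{cores}(t),\ p_{\mathrm{sw}}\},
\]
% where p_sw is the parallelism the program actually exposes. Under this
% assumption, a 4-core chip in 2010 becomes a ~128-core chip by 2020 (five
% doublings), so a program that scales to only 8 threads stops getting faster
% around 2012 even though the hardware keeps improving.
```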
Outline
• Why powerful computers must be parallel processors
  - Including your laptop
• Why writing (fast) parallel programs is hard
• Principles of parallel computing performance

Why writing (fast) parallel programs is hard

Principles of Parallel Computing
• Finding enough parallelism (Amdahl's Law)
• Granularity
• Locality
• Load balance
• Coordination and synchronization
• Performance modeling
All of these things make parallel programming harder than sequential programming.

Finding Enough Parallelism: Amdahl's Law
• T1 = execution time using 1 processor (serial execution time)
• Tp = execution time using P processors
• S = serial fraction of the computation (i.e., the fraction that can only be executed on 1 processor)
• C = fraction of the computation that can be executed by P processors
• Then S + C = 1 and
    Tp = S*T1 + (C*T1)/P = (S + C/P)*T1
    Speedup = Ψ(P) = T1/Tp = 1/(S + C/P) <= 1/S
• Maximum speedup (i.e., when P = ∞): Ψmax = 1/S; example: S = 0.05 gives Ψmax = 20
• Currently the fastest machine has ~224,000 processors
• Even if the parallel part speeds up perfectly, performance is limited by the sequential part

Speedup Barriers: (a) Overhead of Parallelism
• Given enough parallel work, overhead is a big barrier to getting the desired speedup
• Parallelism overheads include:
  - cost of starting a thread or process
  - cost of communicating shared data
  - cost of synchronizing
  - extra (redundant) computation
• Each of these can be in the range of milliseconds on some systems (= millions of flops)
• Tradeoff: the algorithm needs sufficiently large units of work to run fast in parallel (i.e., large granularity), but not so large that there is not enough parallel work
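The small sketch below, plain host-side C (it also compiles as part of a CUDA program), simply evaluates the speedup formula from the Amdahl's Law slide above for the lecture's example S = 0.05. The processor counts are illustrative, except the ~224,000 figure taken from the slide; the overhead costs just discussed would add further terms to Tp and are not modeled here.

```c
/* Minimal sketch: evaluate Speedup(P) = 1 / (S + C/P), with C = 1 - S,
 * for the lecture's example S = 0.05. Not course code; values are illustrative. */
#include <stdio.h>

static double amdahl_speedup(double S, int P) {
    double C = 1.0 - S;                       /* parallelizable fraction */
    return 1.0 / (S + C / (double)P);
}

int main(void) {
    const double S = 0.05;                    /* serial fraction from the slide's example */
    const int counts[] = { 2, 8, 64, 1024, 224000 };   /* last ~ the fastest machine cited */
    for (int i = 0; i < 5; i++) {
        int P = counts[i];
        printf("P = %6d   speedup = %6.2f   (ceiling 1/S = %.0f)\n",
               P, amdahl_speedup(S, P), 1.0 / S);
    }
    return 0;
}
```

Running it shows the point of the slide: even with ~224,000 processors the speedup stays just under the 1/S = 20 ceiling, because the sequential 5% dominates.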
Speedup Barriers: (b) Working on Non-Local Data
[Figure: conventional storage hierarchy -- each processor has its own cache, L2 cache, and L3 cache, backed by memory, with potential interconnects between the memories]
• Large memories are slow; fast memories are small
• Parallel processors, collectively, have large, fast caches
• The slow accesses to "remote" data are what we call "communication"
• The algorithm should do most of its work on local data

Speedup Barriers: (c) Load Imbalance
• Load imbalance occurs when some processors in the system are idle due to
  - insufficient parallelism (during that phase)
  - unequal size tasks
• The algorithm needs to balance the load

Course Mechanics
• Web page: http://www1.ju.edu.jo/ecourse/abusufah/cpe779_spr10/index.html
• Grading:
  - Five programming assignments
  - Final projects (proposals due Wednesday, April 21). A project could be
    - parallelizing an application
    - developing an application using CUDA/OpenCL
    - performance-model-driven tuning of a parallel application on multicore and/or GPU
  - Midterm: Wednesday, April 7; 5:30-6:45
  - Final: Thursday, May 27; 5:30-7:30

Rough List of Topics (for details, see Syllabus)
• Basics of computer architecture, memory hierarchies, performance
• Parallel programming models and machines
  - Shared memory and multithreading
  - Data parallelism, GPUs
• Parallel languages and libraries
  - OpenMP
  - CUDA
• General techniques
  - Load balancing, performance modeling and tools
• Applications

Reading Materials: Textbooks (Required)
• David B. Kirk and Wen-mei W. Hwu, Programming Massively Parallel Processors: A Hands-on Approach, Morgan Kaufmann (February 5, 2010). (Most chapters are available online in draft form; visit http://courses.ece.illinois.edu/ece498/al/Syllabus.html )
• Calvin Lin and Larry Snyder, "Principles of Parallel Programming", Addison-Wesley, 2009

Reading Materials: References
• Ian Foster, Designing and Building Parallel Programs, Addison-Wesley (available online at http://www.mcs.anl.gov/~itf/dbpp/ )
• Randima Fernando, GPU Gems: Programming Techniques, Tips and Tricks for Real-Time Graphics, Addison-Wesley Professional (2004) (available online on the NVIDIA Developer Site; see http://http.developer.nvidia.com/GPUGems/gpugems_part01.html )
• Grama, A., Gupta, A., Karypis, G., and Kumar, V., "Introduction to Parallel Computing", Second Edition, Addison-Wesley, 2003

Reading Materials: Tutorials
• See the course website for tutorials and other online resources