inst.eecs.berkeley.edu/~cs61c
UC Berkeley CS61C : Machine Structures
Lecture 39 – Software Parallel Computing
2006-12-04
Lecturer SOE Dan Garcia  www.cs.berkeley.edu/~ddgarcia
Thanks to Prof. Demmel for his CS267 slides (www.cs.berkeley.edu/~demmel/cs267_Spr05/) & Andy Carle for the 1st CS61C draft

Five straight big games! This one was supposed to be easy! Stanford showed up, and so did the winds, as Cal struggled to beat SU, a 1-10 team and the weakest opponent all year. The next game is 2006-12-28 in the Holiday Bowl in San Diego vs. 9-3 Texas A&M.
calbears.cstv.com/sports/m-footbl/recaps/120206aaa.html

Outline
• Large, important problems require powerful computers
• Why powerful computers must be distributed or parallel computers
• What are these?
• Principles of parallel computing performance

Simulation: The Third Pillar of Science
• Traditional scientific and engineering paradigm:
  • Do theory or paper design
  • Perform experiments or build the system
• Limitations:
  • Too difficult -- build large wind tunnels
  • Too expensive -- build a throw-away passenger jet
  • Too slow -- wait for climate or galactic evolution
  • Too dangerous -- weapons, drug design, climate models
• Computational science paradigm:
  • Use high performance computing (HPC) systems to simulate the phenomenon
  • Based on physical laws & efficient numerical methods

Example Applications
• Science
  • Global climate modeling
  • Biology: genomics; protein folding; drug design; malaria simulations
  • Astrophysical modeling
  • Computational chemistry, materials science and nanoscience
  • SETI@Home: Search for Extra-Terrestrial Intelligence
• Engineering
  • Semiconductor design
  • Earthquake and structural modeling
  • Fluid dynamics (airplane design)
  • Combustion (engine design)
  • Crash simulation
  • Computational game theory (e.g., chess databases)
• Business
  • Rendering computer graphic imagery (CGI), a la Pixar and ILM
  • Financial and economic modeling
  • Transaction processing, web services and search engines
• Defense
  • Nuclear weapons -- test by simulations
  • Cryptography

Performance Requirements
• Terminology
  • FLOP – FLoating point OPeration
  • Flops/second – the standard metric for expressing the computing power of a system
• Example: Global Climate Modeling (www.epm.ornl.gov/chammp/chammp.html)
  • Divide the world into a grid (e.g., 10 km spacing)
  • Solve fluid dynamics equations to determine what the air has done at each point every minute; this requires about 100 Flops per grid point per minute
  • This is an extremely simplified view of how the atmosphere works; to be maximally effective you need to simulate many additional systems on a much finer grid
[Figure: High Resolution Climate Modeling on NERSC-3 – P. Duffy, et al., LLNL]
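The arithmetic behind requirements like these is worth doing once by hand. Below is a minimal Python sketch (not from the slides) that turns a grid size and a per-point cost into a required sustained flop rate for real-time simulation. The Earth surface area, the vertical-level count, and the constant names are illustrative assumptions of mine; the slide only supplies the 10 km spacing and the ~100 flops per grid point per minute. With these particular assumptions the answer lands just under 1 Gflop/s, the same ballpark (within an order of magnitude) as the real-time figure on the next slide, which is presumably based on a more detailed grid.

# Back-of-envelope: required compute rate for a real-time global grid model.
# All constants below are illustrative assumptions, not the slide's exact model.

EARTH_SURFACE_KM2 = 5.1e8   # approximate surface area of the Earth
GRID_SPACING_KM   = 10      # one grid column every 10 km (from the slide)
VERTICAL_LEVELS   = 100     # assumed number of atmospheric layers
FLOPS_PER_POINT   = 100     # ~100 flops per grid point per simulated minute (from the slide)

columns     = EARTH_SURFACE_KM2 / GRID_SPACING_KM**2   # ~5.1e6 surface cells
grid_points = columns * VERTICAL_LEVELS                # ~5.1e8 grid points

work_per_sim_minute = grid_points * FLOPS_PER_POINT    # flops per simulated minute
realtime_rate = work_per_sim_minute / 60               # must finish in 60 wall-clock seconds

print(f"grid points:   {grid_points:.2e}")
print(f"required rate: {realtime_rate / 1e9:.2f} Gflops/s to keep up with real time")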
Performance Requirements (2)
• Computational requirements
  • To keep up with real time (i.e., simulate one minute per wall-clock minute): 8 Gflops/s
  • Weather prediction (7 days in 24 hours): 56 Gflops/s
  • Climate prediction (50 years in 30 days): 4.8 Tflops/s
  • Climate prediction experimentation (50 years in 12 hours): 288 Tflops/s
• Perspective
  • Pentium 4 1.4 GHz, 1 GB RAM, 4x100 MHz FSB: ~0.3 Gflops/s; effective climate prediction would take ~1233 years
• Reference: http://www.tc.cornell.edu/~lifka/Papers/SC2001.pdf

What Can We Do?
• Wait for our machines to get faster?
  • Moore's Law tells us things are getting better; why not stall for the moment?
  • Heat issues: many believe Moore's Law is on its last legs ... thus the push for multi-core (more on that later this week)!
  • Prohibitive costs: Rock's Law
    • The cost of building a semiconductor chip fabrication plant capable of producing chips in line with Moore's Law doubles every four years
    • In 2003, it cost $3 BILLION

Upcoming Calendar
• Week #15 (this week)
  • Mon: Parallel Computing in Software (this lecture)
  • Wed: Parallel Computing in Hardware (Scott)
  • Thu: I/O Networking lab & 61C Feedback Survey
  • Fri: LAST CLASS -- Summary, Review, & HKN Evals
• Week #16
  • Sun 2pm: Review, 10 Evans
  • FINAL EXAM: Thu 12-14 @ 12:30pm-3:30pm, 234 Hearst Gym
• Final exam: same rules as the midterm, except you get 2 double-sided handwritten review sheets (1 from your midterm, 1 new one) + green sheet [Don't bring backpacks]

Alternatively, Many CPUs at Once!
• Distributed computing
  • Many computers (either homogeneous or heterogeneous) that get "work units" from a "central dispatcher" and return when they're done to get more
  • "{ Grid, Cluster } Computing"
• Parallel computing
  • Multiple processors "all in one box" executing identical code on different parts of a problem to arrive at a unified (meaningful) solution
  • "Supercomputing"

Distributed Computing
• Let's "tie together" many disparate machines into one compute cluster
  • These could all be the same (easier) or very different machines (harder)
• Common themes
  • A "dispatcher" hands out jobs & collects results
  • "Workers" (get, process, return) until done
• Examples
  • SETI@Home, BOINC, render farms
  • Google clusters running MapReduce

Google's MapReduce
• Remember CS61A?
    (reduce + (map square '(1 2 3)))
  → (reduce + '(1 4 9))
  → 14
• We told you "the beauty of pure functional programming is that it's easily parallelizable"
  • Do you see how you could parallelize this?
  • What if the reduce function argument were associative? Would that help?
• Imagine 10,000 machines ready to help you compute anything you could cast as a MapReduce problem!
  • This is the abstraction Google is famous for authoring (but their reduce is not exactly the same as CS61A's reduce)
  • It hides a LOT of the difficulty of writing parallel code!
  • The system takes care of load balancing, dead machines, etc.
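To make the parallelism in that CS61A example concrete, here is a small Python sketch of the same idea. It is an illustration only (not Google's code): the square function, the pool of four local worker processes, and the 1..1000 input list are stand-ins. The map phase is trivially parallel because each element is squared independently; and because + is associative, the reduce phase could also be split into partial sums computed on different workers and combined at the end.

# Illustrative only: parallelizing (reduce + (map square '(1 2 3)))
# with a pool of local worker processes instead of 10,000 machines.
from functools import reduce
from multiprocessing import Pool
import operator

def square(x):
    return x * x

if __name__ == "__main__":
    data = list(range(1, 1001))           # stand-in for a big input list

    with Pool(processes=4) as pool:
        squared = pool.map(square, data)  # the "map" phase, done in parallel

    # The "reduce" phase: because + is associative, each worker could also
    # reduce its own chunk, and we would only combine the partial sums here.
    total = reduce(operator.add, squared)
    print(total)                          # sum of squares of 1..1000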
MapReduce Programming Model
• Input & output: each a set of key/value pairs
• The programmer specifies two functions:
  • map(in_key, in_value) -> list(out_key, intermediate_value)
    • Processes an input key/value pair
    • Produces a set of intermediate pairs
  • reduce(out_key, list(intermediate_value)) -> list(out_value)
    • Combines all intermediate values for a particular key
    • Produces a set of merged output values (usually just one)
• code.google.com/edu/parallel/mapreduce-tutorial.html

Example: Word Occurrences in Files

  map(String input_key, String input_value):
    // input_key : document name
    // input_value: document contents
    for each word w in input_value:
      EmitIntermediate(w, "1");

  reduce(String output_key, Iterator intermediate_values):
    // output_key : a word
    // output_values: a list of counts
    int result = 0;
    for each v in intermediate_values:
      result += ParseInt(v);
    Emit(AsString(result));

• This is just an example
  • Real code at Google is more complicated and in C++
  • Real code in Hadoop (open-source MapReduce) is in Java

Example: MapReduce Execution
• Input: seven small files (file1 ... file7) whose contents, concatenated, are: ah ah er ah if or or uh or ah if
• Map phase: each file is run through map, emitting one (word, "1") pair per word: ah:1, ah:1, er:1, ah:1, if:1, or:1, ...
• Shuffle: the intermediate pairs are grouped by key: ah:1,1,1,1  er:1  if:1,1  or:1,1,1  uh:1
• Reduce phase: each group is summed, giving 4 (ah), 1 (er), 2 (if), 3 (or), 1 (uh)

How Do I Experiment w/MapReduce?
1. Work for Google. ;-)
2. Use the open-source versions
  • Hadoop: map/reduce & distributed file system: lucene.apache.org/hadoop/
  • Nutch: crawler, parsers, index: lucene.apache.org/nutch/
  • Lucene Java: text search engine library: lucene.apache.org/java/docs/
3. Wait until 2007Fa (you heard it here 1st!) -- Google will be giving us:
  • A single rack: 40 dual-processor Xeons, 4 GB RAM, 600 GB disk
  • Also available as a VMware image
  • We're developing Scheme bindings for 61A
  • It will be available to students & researchers
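If you just want to play with the programming model before setting up Hadoop, the word-count pipeline from the previous slides can be simulated in ordinary Python on one machine. This is a single-process sketch of map, then group-by-key ("shuffle"), then reduce; the function names (map_fn, reduce_fn, mapreduce) are my own, and the way the example words are split across six files is illustrative, but the final counts match the execution example above.

# Single-process sketch of the MapReduce word-count pipeline:
# map emits (word, 1) pairs, the "shuffle" groups them by key,
# and reduce sums the counts for each word.
from collections import defaultdict

def map_fn(doc_name, doc_contents):
    for word in doc_contents.split():
        yield (word, 1)

def reduce_fn(word, counts):
    return (word, sum(counts))

def mapreduce(documents):
    intermediate = defaultdict(list)
    for name, contents in documents.items():      # map phase
        for key, value in map_fn(name, contents):
            intermediate[key].append(value)        # shuffle: group by key
    return dict(reduce_fn(k, v) for k, v in intermediate.items())  # reduce phase

if __name__ == "__main__":
    files = {"file1": "ah ah er", "file2": "ah", "file3": "if or",
             "file4": "or uh", "file5": "or", "file6": "ah if"}
    print(mapreduce(files))   # {'ah': 4, 'er': 1, 'if': 2, 'or': 3, 'uh': 1}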
Recent History of Parallel Computing
• Parallel computing as a field exploded in popularity in the mid-1990s
• This resulted in an "arms race" between universities, research labs, and governments to have the fastest supercomputer in the world
• LINPACK (solving a dense system of linear equations) is the benchmark
• Source: top500.org

Current Champions (Nov 2006)
• BlueGene/L -- eServer Blue Gene Solution (IBM), DOE / NNSA / LLNL, Livermore, CA, United States: 65,536 dual processors, 280.6 Tflops/s, 0.7 GHz PowerPC 440
• Red Storm -- Sandia / Cray Red Storm (Cray Inc.), NNSA / Sandia National Laboratories, Albuquerque, NM, United States: 26,544 processors, 101.4 Tflops/s, 2.4 GHz dual-core Opteron
• BGW -- eServer Blue Gene Solution (IBM), IBM TJ Watson Research Center, Yorktown Heights, NY, United States: 40,960 processors, 91.3 Tflops/s, 0.7 GHz PowerPC 440
• Source: top500.org

Synchronization (1)
• How do processors communicate with each other?
• How do processors know when to communicate with each other?
• How do processors know which other processor has the information they need?
• When you are done computing, which processor, or processors, have the answer?

Synchronization (2)
• Some of the logistical complexity of these operations is reduced by standard communication frameworks
  • Message Passing Interface (MPI)
• Sorting out the issue of who holds what data can be made easier with the use of explicitly parallel languages
  • Unified Parallel C (UPC)
  • Titanium (a parallel Java variant)
• Even with these tools, much of the skill and challenge of parallel programming is in resolving these problems

Parallel Processor Layout Options
• Shared memory: processors P1 ... Pn, each with its own cache ($), share a single memory over a common bus
• Distributed memory: nodes P0 ... Pn, each with its own local memory and network interface (NI), communicate over an interconnect

Parallel Locality
• We now have to expand our view of the memory hierarchy to include remote machines
  • Remote memory behaves like a very fast network
  • Bandwidth vs. latency becomes important
• Hierarchy (and what moves between levels):
  Regs ↔ (instr. operands) ↔ Cache ↔ (blocks) ↔ Memory ↔ (blocks) ↔ Remote Memory ↔ (large data blocks) ↔ Local and Remote Disk

Amdahl's Law
• Applications can almost never be completely parallelized
• Let s be the fraction of work done sequentially, so (1-s) is the fraction parallelizable, and P = number of processors:
  Speedup(P) = Time(1) / Time(P) ≤ 1 / ( s + ((1-s) / P) ), and as P → ∞, Speedup ≤ 1/s
• Even if the parallel portion of your application speeds up perfectly, your performance may be limited by the sequential portion

Parallel Overhead
• Given enough parallel work, these are the biggest barriers to getting the desired speedup
• Parallelism overheads include:
  • cost of starting a thread or process
  • cost of communicating shared data
  • cost of synchronizing
  • extra (redundant) computation
• Each of these can be in the range of milliseconds (i.e., many Mflops' worth of time) on some systems
• Tradeoff: the algorithm needs sufficiently large units of work to run fast in parallel (i.e., large granularity), but not so large that there is not enough parallel work

Load Imbalance
• Load imbalance is the time that some processors in the system are idle due to
  • insufficient parallelism (during that phase)
  • unequal-size tasks
• Examples of the latter
  • adapting to "interesting parts of a domain"
  • tree-structured computations
  • fundamentally unstructured problems
• Algorithms need to carefully balance load
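To see how quickly Amdahl's Law caps the benefit of adding processors, here is a tiny Python check of the bound from the Amdahl's Law slide above. The speedup function name and the 5% sequential fraction are my own choices for illustration.

# Amdahl's Law: Speedup(P) <= 1 / (s + (1 - s) / P)
def speedup(s, p):
    """Upper bound on speedup with sequential fraction s on p processors."""
    return 1.0 / (s + (1.0 - s) / p)

if __name__ == "__main__":
    s = 0.05                          # assume 5% of the work is inherently sequential
    for p in (1, 10, 100, 1000, 10**6):
        print(f"P = {p:>7}: speedup <= {speedup(s, p):6.2f}")
    # As P grows, the bound approaches 1/s = 20 no matter how many processors we add.

Even with only 5% sequential work, a million processors cannot push the speedup past 20x, which is why shrinking the sequential fraction and the overheads listed above matters more than raw processor count.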
Peer Instruction of Assumptions
1. Writing & managing SETI@Home is relatively straightforward; just hand out & gather data
2. Most parallel programs, when run on N (N big) identical supercomputer processors, will yield close to an N x performance increase
3. The majority of the world's computing power lives in supercomputer centers
Answer choices (T/F for statements 1, 2, 3 in order): 1: FFF  2: FFT  3: FTF  4: FTT  5: TFF  6: TFT  7: TTF  8: TTT

Peer Instruction Answer
1. FALSE: the heterogeneity of the machines, handling machines that fail, and machines that falsify data make it far from straightforward
2. FALSE: the combination of Amdahl's Law, overhead, and load imbalance takes its toll
3. FALSE: have you considered how many PCs + game devices exist? Not even close
(So the answer is 1: FFF)

Summary
• Parallel & distributed computing is a multi-billion dollar industry driven by interesting and useful scientific computing & business applications
• It is extremely unlikely that sequential computing will ever again catch up with the processing power of parallel or distributed systems
• Programming parallel systems can be extremely challenging (distributed computing less so, thanks to APIs), but it is built upon many of the concepts you've learned this semester in 61C