EECS 570 Lecture 1
Parallel Computer Architecture
Winter 2016
Prof. Thomas Wenisch
http://www.eecs.umich.edu/courses/eecs570/
Slides developed in part by Profs. Austin, Adve, Falsafi, Martin, Narayanasamy, Nowatzyk, Peh, and Wenisch of CMU, EPFL, MIT, UPenn, U-M, UIUC.
EECS 570 Lecture 1 Slide 1

Announcements
No discussion on Friday.
Online quizzes (Canvas) on the first readings are due Monday, 1:30pm.
Sign up for Piazza.
EECS 570 Lecture 1 Slide 2

Readings
For Monday 1/12 (quizzes due by 1:30pm):
❒ David Wood and Mark Hill. "Cost-Effective Parallel Computing," IEEE Computer, 1995.
❒ Mark Hill et al. "21st Century Computer Architecture." CCC White Paper, 2012.
For Wednesday 1/14:
❒ Seiler et al. "Larrabee: A Many-Core x86 Architecture for Visual Computing." SIGGRAPH 2008.
EECS 570 Lecture 1 Slide 3

EECS 570 Class Info
Instructor: Professor Thomas Wenisch
❒ URL: http://www.eecs.umich.edu/~twenisch
Research interests:
❒ Multicore / multiprocessor architecture & programmability
❒ Data center architecture, server energy-efficiency
❒ Accelerators for medical imaging, data analytics
GSI:
❒ Amlan Nayak (amlan@umich.edu)
Class info:
❒ URL: http://www.eecs.umich.edu/courses/eecs570/
❒ Canvas for reading quizzes & reporting grades
❒ Piazza for discussions & project coordination
EECS 570 Lecture 1 Slide 4

Meeting Times
Lecture
❒ MW 1:40pm – 3:00pm (1017 Dow)
Discussion
❒ F 1:40pm – 2:30pm (1200 EECS)
❒ Talk about programming assignments and projects
❒ Make-up lectures
❒ Keep the slot free, but we often won't meet
Office Hours
❒ Prof. Wenisch: M 3-4 (4620 CSE)
❒ Amlan: TBD (Location: TBD)
Q&A
❒ Fri 1:30-2:30 (Location: TBD) when there is no discussion
❒ Use Piazza for all technical questions
❒ Use e-mail sparingly
EECS 570 Lecture 1 Slide 5

Who Should Take 570?
Graduate students (& seniors interested in research):
1. Computer architects to be
2. Computer system designers
3. Those interested in computer systems
Required background:
❒ Computer architecture (e.g., EECS 470)
❒ C / C++ programming
EECS 570 Lecture 1 Slide 6

Grading
2 programming assignments: 5% & 10%
Reading quizzes: 10%
Midterm exam: 25%
Final exam: 25%
Final project: 25%
Attendance & participation count (your goal is for me to know who you are).
EECS 570 Lecture 1 Slide 7

Grading (Cont.)
Group studies are encouraged.
Group discussions are encouraged.
All programming assignments must be the result of individual work.
All reading quizzes must be done individually; questions/answers should not be posted publicly.
There is no tolerance for academic dishonesty. Please refer to the University Policy on cheating and plagiarism. Discussion and group studies are encouraged, but all submitted material must be the student's individual work (or, in the case of the project, individual group work).
EECS 570 Lecture 1 Slide 8

Some Advice on Reading…
If you carefully read every paper start to finish…
…you will never finish.
Learn to skim past details.
EECS 570 Lecture 1 Slide 9

Reading Quizzes
• You must take an online quiz for every paper
❒ Quizzes must be completed by class start via Canvas
• There will be 2 multiple-choice questions
❒ The questions are chosen randomly from a list
❒ You only have 5 minutes
❍ Not enough time to find the answer if you haven't read the paper
❒ You only get one attempt
• Some of the questions may be reused on the midterm/final
• The 4 lowest quiz grades (of about 40) will be dropped over the course of the semester (e.g., skip some if you are travelling)
❒ Retakes/retries/reschedules will not be given for any reason
EECS 570 Lecture 1 Slide 10

Final Project
• Original research on a topic related to the course
❒ Goal: a high-quality 6-page workshop paper by end of term
❒ 25% of overall grade
❒ Done in groups of 3
❒ Poster session: April 21, 10:30am-12:30pm (exam slot for 7:30am classes)
• See course website for timeline
• Available infrastructure
❒ FeS2 and M5 multiprocessor simulators
❒ GPGPU-Sim
❒ Pin
❒ Xeon Phi accelerators
• Suggested topic list will be distributed in a few weeks
• You may propose other topics if you convince me they are worthwhile
EECS 570 Lecture 1 Slide 11

Course Outline
Unit I – Parallel Programming Models and Applications
❒ Message passing, shared memory (pthreads and GPU)
❒ Scientific and commercial parallel applications
Unit II – Synchronization
❒ Synchronization, locks, and transactional memory
Unit III – Coherency and Consistency
❒ Snooping bus-based systems
❒ Directory-based distributed shared memory
❒ Memory models
Unit IV – Interconnection Networks
❒ On-chip and off-chip networks
Unit V – Modern & Unconventional Multiprocessors
❒ Simultaneous & speculative threading
EECS 570 Lecture 1 Slide 12

Parallel Computer Architecture
The Multicore Revolution
Why is it happening?
EECS 570 Lecture 1 Slide 13

If you want to make your computer faster, there are only two options:
1. Increase clock frequency
2. Execute two or more things in parallel
❒ Instruction-Level Parallelism (ILP)
❒ Programmer-specified explicit parallelism
(A small code sketch contrasting the two appears at the end of this segment, after Slide 18.)
EECS 570 Lecture 1 Slide 14

The ILP Wall
Olukotun et al., ASPLOS 1996
• 6-issue has higher IPC than 2-issue, but not by 3×
❒ Memory (I & D) and dependence (pipeline) stalls limit IPC
EECS 570 Lecture 1 Slide 15

Single-Thread Performance
[Figure: single-thread performance vs. year, 1985-2010; growth of ~52%/yr. gives way to ~15%/yr. Source: Hennessy & Patterson, Computer Architecture: A Quantitative Approach, 4th ed.]
Conclusion: can't scale MHz or issue width to keep selling chips.
Hence, multicore!
EECS 570 Lecture 1 Slide 16

The Power Wall
[Figure: transistors (100,000's), power (W), performance (GOPS), and efficiency (GOPS/W) vs. year, 1985-2020. Heat extraction limits power; the energy efficiency of operations limits what fits within that power.]
EECS 570 Lecture 1 Slide 17

The Power Wall
[Same figure, annotated: the power limit stagnates performance growth. The era of high-performance computing gives way to an era of energy-efficient computing, c. 2000.]
EECS 570 Lecture 1 Slide 18
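
Aside: the two options from Slide 14, in code. The sketch below is illustrative only and not from the slides: the first version is ordinary sequential code whose independent operations an out-of-order core overlaps on its own (ILP); the second exposes the parallelism explicitly with two threads, the style of programming this course is about. The function names and the two-way split are assumptions made for the example.

#include <cstddef>
#include <numeric>
#include <thread>
#include <vector>

// Option 1: sequential code; any parallelism is found by the hardware
// (out-of-order issue overlapping independent loads and adds).
double sum_ilp(const std::vector<double>& v) {
  return std::accumulate(v.begin(), v.end(), 0.0);
}

// Option 2: programmer-specified explicit parallelism; two threads each
// sum half of the array, and the partial sums are combined at the end.
double sum_explicit(const std::vector<double>& v) {
  std::ptrdiff_t mid = v.size() / 2;
  double lo = 0.0, hi = 0.0;
  std::thread t([&] { lo = std::accumulate(v.begin(), v.begin() + mid, 0.0); });
  hi = std::accumulate(v.begin() + mid, v.end(), 0.0);
  t.join();  // wait for the helper thread before reading lo
  return lo + hi;
}

Even this tiny example already shows the costs the rest of the lecture worries about: the work must be divided, the threads must synchronize (join), and whatever stays sequential is subject to Amdahl's law.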
Classic CMOS Dennard Scaling: the Science behind Moore's Law
Scaling: voltage V/α, oxide thickness t_OX/α
Results: power/circuit 1/α², power density ~constant
(P = C·V²·f)
Source: Future of Computing Performance: Game Over or Next Level?, National Academy Press, 2011
EECS 570 Lecture 1 Slide 19

Post-Classic CMOS Dennard Scaling
Where do we go: chips with higher power (no), smaller chips, dark silicon, or something else (?)
Post-Dennard CMOS scaling rule (classic → post-Dennard):
  Voltage: V/α → V
  Oxide thickness: t_OX/α
  Power/circuit: 1/α² → 1
  Power density: ~constant → α²
(P = C·V²·f)
EECS 570 Lecture 1 Slide 20

Leakage Killed Dennard Scaling
Leakage is:
• Exponential in the inverse of V_th
• Exponential in temperature
• Linear in device count
To switch well:
• must keep V_dd / V_th > 3 ➜ V_dd can't go down
EECS 570 Lecture 1 Slide 21

Multicore: Solution to Power-Constrained Design?
Power = C·V²·F, and F ∝ V
Scale the clock frequency to 80%, then add a second core.
Same power budget, but 1.6× performance! (A worked sketch of this arithmetic appears below, after Slide 28.)
But:
❒ Must parallelize the application
❒ Remember Amdahl's Law!
EECS 570 Lecture 1 Slide 22

What Is a Parallel Computer?
"A collection of processing elements that communicate and cooperate to solve large problems fast."
Almasi & Gottlieb, 1989
EECS 570 Lecture 1 Slide 23

Spectrum of Parallelism
Bit-level, pipelining (EECS 370) → ILP (EECS 470) → multithreading, multiprocessing (EECS 570) → distributed (EECS 591)
Why multiprocessing?
• Desire for performance
• Techniques from 370/470 are difficult to scale further
EECS 570 Lecture 1 Slide 24

Why Parallelism Now?
• These arguments are no longer theoretical
• All major processor vendors are producing multicore chips
❒ Every machine will soon be a parallel machine
❒ All programmers will be parallel programmers???
• New software model
❒ Want a new feature? Hide the "cost" by speeding up the code first
❒ All programmers will be performance programmers???
• Some of this may eventually be hidden in libraries, compilers, and high-level languages
❒ But a lot of work is needed to get there
• Big open questions:
❒ What will be the killer apps for multicore machines?
❒ How should the chips, languages, and OS be designed to make it easier for us to develop parallel programs?
EECS 570 Lecture 1 Slide 25

Multicore in Products
• "We are dedicating all of our future product development to multicore designs. … This is a sea change in computing." Paul Otellini, President, Intel (2005)
• All microprocessor companies switch to MP (2× cores / 2 yrs)

                      Intel's Nehalem-EX   Azul's Vega   nVidia's Tesla
  Processors/System            4                16              4
  Cores/Processor              8                48            448
  Threads/Processor            2                 1
  Threads/System              64               768           1792

EECS 570 Lecture 1 Slide 26

The Revolution Continues…
• Azul's Vega 3 7300 (May 2008): 54-core chip, 864 cores, 768 GB memory
• Blue Gene/Q Sequoia (2012): 16-core chip, 1.6 million cores, 1.6 PB memory
• Sun's Modular Datacenter ('08): 8-core chip, 8 threads/core, 816 cores / 160 sq. feet
• Lakeside Datacenter (Chicago): 1.1 million sq. feet, ~45 million threads
EECS 570 Lecture 1 Slide 27

Multiprocessors Are Here To Stay
• Moore's law is making the multiprocessor a commodity part
❒ 1B transistors on a chip: what to do with all of them?
❒ Not enough ILP to justify a huge uniprocessor
❒ Really big caches? Hit time (t_hit) increases, with diminishing returns on miss rate
• Chip multiprocessors (CMPs)
❒ Every computing device (even your cell phone) is now a multiprocessor
EECS 570 Lecture 1 Slide 28
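
Aside: the power-budget arithmetic from Slide 22, worked out. This is a minimal sketch assuming the slide's dynamic-power model P = C·V²·f with f ∝ V, so power scales roughly as f³ when frequency and voltage drop together; the variable names and the tiny program around the arithmetic are illustrative, not from the slides.

#include <cstdio>

int main() {
  // Scale the clock (and hence the voltage) of each core to 80%.
  double f = 0.8;
  double power_per_core = f * f * f;             // ~0.51x of the original core
  // Add a second, identical core.
  double two_core_power = 2.0 * power_per_core;  // ~1.02x: roughly the same power budget
  double two_core_perf  = 2.0 * f;               // 1.6x, if the work parallelizes
  std::printf("power: %.2fx, performance: %.2fx\n", two_core_power, two_core_perf);
  return 0;
}

The 1.6× only materializes if the application parallelizes; that caveat is exactly where Amdahl's law, covered later in the lecture, comes in.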
Parallel Programming Intro
EECS 570 Lecture 1 Slide 29

Motivation for MP Systems
• Classical reason for multiprocessing: more performance by using multiple processors in parallel
❒ Divide the computation among processors and allow them to work concurrently
❒ Assumption 1: There is parallelism in the application
❒ Assumption 2: We can exploit this parallelism
EECS 570 Lecture 1 Slide 30

Finding Parallelism
1. Functional parallelism
❒ Car: {engine, brakes, entertainment, nav, …}
❒ Game: {physics, logic, UI, render, …}
❒ Signal processing: {transform, filter, scaling, …}
2. Data parallelism
❒ Vector, matrix, DB table, pixels, …
3. Request parallelism
❒ Web, shared database, telephony, …
EECS 570 Lecture 1 Slide 31

Computational Complexity of (Sequential) Algorithms
• Model: each step takes unit time
• Determine the time (/space) required by the algorithm as a function of input size
EECS 570 Lecture 1 Slide 32

Sequential Sorting Example
• Given an array of size n
• MergeSort takes O(n log n) time
• BubbleSort takes O(n²) time
• But a BubbleSort implementation can sometimes be faster than a MergeSort implementation
• Why?
EECS 570 Lecture 1 Slide 33

Sequential Sorting Example
• Given an array of size n
• MergeSort takes O(n log n) time
• BubbleSort takes O(n²) time
• But a BubbleSort implementation can sometimes be faster than a MergeSort implementation
• The model is still useful
❒ It indicates the scalability of the algorithm for large inputs
❒ It lets us prove things like: a comparison-based sorting algorithm requires Ω(n log n) comparisons
EECS 570 Lecture 1 Slide 34

We need a similar model for parallel algorithms.
EECS 570 Lecture 1 Slide 35

Sequential Merge Sort
16 MB input (32-bit integers)
Sequential execution, in time order: Recurse(left) → Recurse(right) → Merge to scratch array → Copy back to input array
EECS 570 Lecture 1 Slide 36

Parallel Merge Sort (as a Parallel Directed Acyclic Graph)
16 MB input (32-bit integers)
Parallel execution: Recurse(left) and Recurse(right) run in parallel → Merge to scratch array → Copy back to input array
EECS 570 Lecture 1 Slide 37

Parallel DAG for Merge Sort (2-core)
[Figure: two sequential-sort nodes running in parallel, feeding a merge node.]
EECS 570 Lecture 1 Slide 38

Parallel DAG for Merge Sort (4-core)
[Figure.]
EECS 570 Lecture 1 Slide 39

Parallel DAG for Merge Sort (8-core)
[Figure.]
EECS 570 Lecture 1 Slide 40

The DAG Execution Model of a Parallel Computation
• Given an input, dynamically create a DAG
• Nodes represent sequential computation
❒ Weighted by the amount of work
• Edges represent dependencies:
❒ Node A → Node B means that B cannot be scheduled unless A is finished
EECS 570 Lecture 1 Slide 41

Sorting 16 Elements on Four Cores
[Figure: the merge sort DAG for a 16-element input.]
EECS 570 Lecture 1 Slide 42

Sorting 16 Elements on Four Cores (4-element arrays sorted in constant time)
[Figure: the same DAG annotated with node weights 1, 8, 1, 1, 1, 16, 1, 1, 8, 1.]
EECS 570 Lecture 1 Slide 43

Performance Measures
Given a graph G, a scheduler S, and P processors:
• T_P(S): time on P processors using scheduler S
• T_P: time on P processors using the best scheduler
• T_1: time on a single processor (sequential cost)
• T_∞: time assuming infinite resources
EECS 570 Lecture 1 Slide 44

Work and Depth
• T_1 = Work
❒ The total number of operations executed by a computation
• T_∞ = Depth
❒ The longest chain of sequential dependencies (critical path) in the parallel DAG
EECS 570 Lecture 1 Slide 45

T_∞ (Depth): Critical Path Length (Sequential Bottleneck)
EECS 570 Lecture 1 Slide 46

T_1 (Work): Time to Run Sequentially
EECS 570 Lecture 1 Slide 47
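
Aside: the parallel merge sort from the DAG slides, as code. This is a minimal C++ sketch, assuming std::async for the two independent recursive calls (the two DAG nodes with no edge between them) and a sequential std::merge plus copy-back, as in the slides; the function names, the cutoff parameter, and the small example input are illustrative. With a sequential merge, the work remains O(n log n) but the depth is O(n), which is the point the work/depth analysis later in the lecture makes.

#include <algorithm>
#include <cstddef>
#include <future>
#include <vector>

// Sort a[lo, hi) using scratch as the merge buffer. The top `cutoff`
// levels of recursion fork the left half onto another core.
void merge_sort(std::vector<int>& a, std::vector<int>& scratch,
                std::ptrdiff_t lo, std::ptrdiff_t hi, int cutoff) {
  if (hi - lo <= 1) return;
  std::ptrdiff_t mid = lo + (hi - lo) / 2;
  if (cutoff > 0) {
    // Recurse(left) and Recurse(right) are independent DAG nodes,
    // so they may run in parallel.
    auto left = std::async(std::launch::async, merge_sort, std::ref(a),
                           std::ref(scratch), lo, mid, cutoff - 1);
    merge_sort(a, scratch, mid, hi, cutoff - 1);
    left.wait();
  } else {
    merge_sort(a, scratch, lo, mid, 0);
    merge_sort(a, scratch, mid, hi, 0);
  }
  // Sequential merge into the scratch array, then copy back: the O(n)
  // critical path that limits the parallelism of this version.
  std::merge(a.begin() + lo, a.begin() + mid,
             a.begin() + mid, a.begin() + hi,
             scratch.begin() + lo);
  std::copy(scratch.begin() + lo, scratch.begin() + hi, a.begin() + lo);
}

int main() {
  std::vector<int> a = {5, 3, 8, 1, 9, 2, 7, 4};
  std::vector<int> scratch(a.size());
  merge_sort(a, scratch, 0, static_cast<std::ptrdiff_t>(a.size()), 2);
  return 0;
}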
Sorting 16 Elements on Four Cores (4-element arrays sorted in constant time)
[Figure: the same weighted DAG, node weights 1, 8, 1, 1, 1, 16, 1, 1, 8, 1.]
Work =
Depth =
EECS 570 Lecture 1 Slide 48

Some Useful Theorems
EECS 570 Lecture 1 Slide 49

Work Law
• "You cannot avoid work by parallelizing"
T_1 / P ≤ T_P
EECS 570 Lecture 1 Slide 50

Work Law
• "You cannot avoid work by parallelizing"
T_1 / P ≤ T_P
Speedup = T_1 / T_P
EECS 570 Lecture 1 Slide 51

Work Law
• "You cannot avoid work by parallelizing"
T_1 / P ≤ T_P
Speedup = T_1 / T_P
• Can speedup be more than 2 when we go from 1 core to 2 cores in practice?
EECS 570 Lecture 1 Slide 52

Depth Law
• More resources should make things faster
• You are limited by the sequential bottleneck
T_P ≥ T_∞
EECS 570 Lecture 1 Slide 53

Amount of Parallelism
Parallelism = T_1 / T_∞
EECS 570 Lecture 1 Slide 54

Maximum Speedup Possible
Speedup = T_1 / T_P ≤ T_1 / T_∞ = Parallelism
"Speedup is bounded above by available parallelism."
EECS 570 Lecture 1 Slide 55

Greedy Scheduler
• If more than P nodes can be scheduled, pick any subset of size P
• If fewer than P nodes can be scheduled, schedule them all
EECS 570 Lecture 1 Slide 56

Performance of the Greedy Scheduler
T_P(Greedy) ≤ T_1 / P + T_∞
Work law: T_1 / P ≤ T_P
Depth law: T_∞ ≤ T_P
EECS 570 Lecture 1 Slide 57

Greedy Is Optimal Within a Factor of 2
T_P ≤ T_P(Greedy) ≤ 2·T_P
Work law: T_1 / P ≤ T_P
Depth law: T_∞ ≤ T_P
EECS 570 Lecture 1 Slide 58

Work/Depth of Merge Sort (Sequential Merge)
• Work T_1: O(n log n)
• Depth T_∞: O(n)
❒ It takes O(n) time to merge n elements
• Parallelism:
❒ T_1 / T_∞ = O(log n) → really bad!
EECS 570 Lecture 1 Slide 59

Main Message
• Analyze the Work and Depth of your algorithm
• Parallelism is Work/Depth
• Try to decrease Depth
❒ the critical path
❒ a sequential bottleneck
• If you increase Depth
❒ better increase Work by a lot more!
EECS 570 Lecture 1 Slide 60

Amdahl's Law
• Sorting takes 70% of the execution time of a sequential program
• You replace the sorting algorithm with one that scales perfectly on multicore hardware
• How many cores do you need to get a 4× speedup on the program?
EECS 570 Lecture 1 Slide 61

Amdahl's Law, f = 70%
Speedup(f, c) = 1 / ((1 - f) + f / c)
f     = the parallel portion of execution
1 - f = the sequential portion of execution
c     = number of cores used
EECS 570 Lecture 1 Slide 62

Amdahl's Law, f = 70%
[Figure: speedup vs. #cores (1 to 16). The desired 4× speedup line sits above the speedup achieved with perfect scaling on 70% of the program.]
EECS 570 Lecture 1 Slide 63

Amdahl's Law, f = 70%
[Figure: same plot; the achieved speedup approaches the limit as c → ∞ of 1/(1 - f) = 3.33.]
EECS 570 Lecture 1 Slide 64

Amdahl's Law, f = 10%
[Figure: speedup vs. #cores (1 to 16) with perfect scaling on 10% of the program; the Amdahl's law limit is just 1.11×.]
EECS 570 Lecture 1 Slide 65

Amdahl's Law, f = 98%
[Figure: speedup vs. #cores (1 to 127) with perfect scaling on 98% of the program.]
EECS 570 Lecture 1 Slide 66

Lesson
• Speedup is limited by sequential code
• Even a small percentage of sequential code can greatly limit potential speedup
EECS 570 Lecture 1 Slide 67

Gustafson's Law
Any sufficiently large problem can be parallelized effectively.
Speedup(f, c) = f·c + (1 - f)
f     = the parallel portion of execution
1 - f = the sequential portion of execution
c     = number of cores used
Key assumption: f increases as problem size increases
EECS 570 Lecture 1 Slide 68
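
Aside: evaluating the two laws at the slide values. A minimal C++ sketch, assuming the formulas as given on the Amdahl and Gustafson slides; the helper names are illustrative.

#include <cstdio>

// Amdahl's law: fixed problem size; f is the parallelizable fraction.
double amdahl(double f, double c)    { return 1.0 / ((1.0 - f) + f / c); }

// Gustafson's law: the problem size grows with the core count c.
double gustafson(double f, double c) { return f * c + (1.0 - f); }

int main() {
  // Slide example: sorting is 70% of execution time and scales perfectly.
  for (double c : {2.0, 4.0, 16.0, 1e6}) {
    std::printf("Amdahl    f=0.70, c=%-8g speedup = %.2fx\n", c, amdahl(0.70, c));
  }
  // The limit as c grows is 1/(1 - f) = 3.33x, so no core count reaches 4x.
  std::printf("Gustafson f=0.70, c=16      speedup = %.1fx\n", gustafson(0.70, 16.0));
  return 0;
}

The contrast is the point of the two laws: with a fixed problem size the sequential 30% caps the speedup at 3.33×, but if the parallel fraction grows with the problem size, speedup keeps growing with the core count.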