EECS 570 Lecture 1 Parallel Computer Architecture

Winter 2016
Prof. Thomas Wenisch
http://www.eecs.umich.edu/courses/eecs570/
Slides developed in part by Profs. Austin, Adve, Falsafi, Martin, Narayanasamy, Nowatzyk, Peh, and Wenisch of CMU, EPFL, MIT, UPenn, U-M, and UIUC.
Announcements
No discussion on Friday.
Online quizzes (Canvas) on the first readings are due Monday, 1:30pm.
Sign up for Piazza.
Readings
For Monday 1/12 (quizzes due by 1:30pm):
❒ David Wood and Mark Hill. “Cost-Effective Parallel Computing,” IEEE Computer, 1995.
❒ Mark Hill et al. “21st Century Computer Architecture.” CCC White Paper, 2012.
For Wednesday 1/14:
❒ Seiler et al. “Larrabee: A Many-Core x86 Architecture for Visual Computing.” SIGGRAPH 2008.
EECS 570 Class Info
Instructor: Professor Thomas Wenisch
❒ URL: http://www.eecs.umich.edu/~twenisch
❒ Research interests:
❍ Multicore / multiprocessor architecture & programmability
❍ Data center architecture, server energy-efficiency
❍ Accelerators for medical imaging, data analytics
GSI:
❒ Amlan Nayak (amlan@umich.edu)
Class info:
❒ URL: http://www.eecs.umich.edu/courses/eecs570/
❒ Canvas for reading quizzes & reporting grades
❒ Piazza for discussions & project coordination
Meeting Times
Lecture
❒ MW 1:40pm – 3:00pm (1017 Dow)
Discussion
❒ F 1:40pm – 2:30pm (1200 EECS)
❒ Talk about programming assignments and projects
❒ Make-up lectures
❒ Keep the slot free, but we often won’t meet
Office Hours
❒ Prof. Wenisch: M 3-4 (4620 CSE)
❒ Amlan: TBD (Location: TBD)
Q&A
❒ Fri 1:30-2:30 (Location: TBD) when no discussion
❒ Use Piazza for all technical questions
❒ Use e-mail sparingly
Who Should Take 570?
Graduate students (& seniors interested in research):
1. Computer architects to be
2. Computer system designers
3. Those interested in computer systems
Required Background
❒ Computer Architecture (e.g., EECS 470)
❒ C / C++ programming
Grading
2 Prog. Assignments:  5% & 10%
Reading Quizzes:      10%
Midterm exam:         25%
Final exam:           25%
Final Project:        25%
Attendance & participation count
(your goal is for me to know who you are)
Grading (Cont.)
Group studies are encouraged.
Group discussions are encouraged.
All programming assignments must be the results of individual work.
All reading quizzes must be done individually; questions/answers should not be posted publicly.
There is no tolerance for academic dishonesty. Please refer to the University Policy on cheating and plagiarism. Discussion and group studies are encouraged, but all submitted material must be the student's individual work (or, in the case of the project, individual group work).
Some Advice on Reading…
If you carefully read every paper start to finish…
…you will never finish.
Learn to skim past details.
Reading Quizzes
• You must take an online quiz for every paper
❒ Quizzes must be completed by class start via Canvas
• There will be 2 multiple-choice questions
❒ The questions are chosen randomly from a list
❒ You only have 5 minutes
❍ Not enough time to find the answer if you haven’t read the paper
❒ You only get one attempt
• Some of the questions may be reused on the midterm/final
• 4 lowest quiz grades (of about 40) will be dropped over the course of the semester (e.g., skip some if you are travelling)
❒ Retakes/retries/reschedules will not be given for any reason
Final Project
• Original research on a topic related to the course
❒ Goal: a high-quality 6-page workshop paper by end of term
❒ 25% of overall grade
❒ Done in groups of 3
❒ Poster session: April 21, 10:30am–12:30pm (exam slot for 7:30am classes)
• See course website for timeline
• Available infrastructure
❒ FeS2 and M5 multiprocessor simulators
❒ GPGPUsim
❒ Pin
❒ Xeon Phi accelerators
• Suggested topic list will be distributed in a few weeks
❒ You may propose other topics if you convince me they are worthwhile
Course Outline
Unit I – Parallel Programming Models and Applications
❒ Message passing, shared memory (pthreads and GPU)
❒ Scientific and commercial parallel applications
Unit II – Synchronization
❒ Synchronization, locks, and transactional memory
Unit III – Coherency and Consistency
❒ Snooping bus-based systems
❒ Directory-based distributed shared memory
❒ Memory models
Unit IV – Interconnection Networks
❒ On-chip and off-chip networks
Unit V – Modern & Unconventional Multiprocessors
❒ Simultaneous & speculative threading
Parallel Computer Architecture
The Multicore Revolution
Why is it happening?
If you want to make your computer faster, there are only two options:
1. Increase clock frequency
2. Execute two or more things in parallel
❒ Instruction-Level Parallelism (ILP)
❒ Programmer-specified explicit parallelism
The ILP Wall
[Figure: IPC comparison from Olukotun et al., ASPLOS ’96]
• 6-issue has higher IPC than 2-issue, but not by 3x
❒ Memory (I & D) and dependence (pipeline) stalls limit IPC
Single-thread performance
[Figure: single-thread performance, 1985–2010; growth slows from 52%/yr. to 15%/yr. Source: Hennessy & Patterson, Computer Architecture: A Quantitative Approach, 4th ed.]
Conclusion: Can’t scale MHz or issue width to keep selling chips. Hence, multicore!
The Power Wall
[Figure: transistors (100,000’s), power (W), performance (GOPS), and efficiency (GOPS/W), 1985–2020. Limits on heat extraction stagnate performance growth; limits on the energy-efficiency of operations. The era of high-performance computing gives way to the era of energy-efficient computing c. 2000.]
Classic CMOS Dennard Scaling:
the Science behind Moore’s Law
Scaling:
❒ Voltage: V/α
❒ Oxide: tOX/α
Results (P = C V² f):
❒ Power/ckt: 1/α²
❒ Power density: ~constant
Source: Future of Computing Performance: Game Over or Next Level?, National Academy Press, 2011
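Filling in the arithmetic behind these numbers (a sketch using only the P = C V² f formula from the slide): under classic Dennard scaling, dimensions and voltage shrink by α, so capacitance falls to C/α while frequency rises to αf.

```latex
% Classic Dennard scaling of dynamic power, starting from P = C V^2 f:
% C -> C/alpha, V -> V/alpha, f -> alpha f
P' = \frac{C}{\alpha}\cdot\left(\frac{V}{\alpha}\right)^{2}\cdot(\alpha f)
   = \frac{C V^{2} f}{\alpha^{2}} = \frac{P}{\alpha^{2}}
% Circuit area shrinks by the same 1/alpha^2 factor,
% so power density P'/A' stays ~constant.
```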
Post-classic CMOS Dennard Scaling
TODO: chips w/ higher power (no), smaller chips ☹, dark silicon ☺, or other (?)
Post-Dennard CMOS Scaling Rule (shown alongside classic scaling):
Scaling:
❒ Voltage: V/α → V (no longer scales)
❒ Oxide: tOX/α
Results (P = C V² f):
❒ Power/ckt: 1/α² → 1
❒ Power density: ~constant → α²
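The same arithmetic with voltage held fixed (a sketch of the post-Dennard column above):

```latex
% Post-Dennard: V no longer scales; C -> C/alpha, f -> alpha f
P' = \frac{C}{\alpha}\cdot V^{2}\cdot(\alpha f) = C V^{2} f = P
% The same power now dissipates in 1/alpha^2 the area:
% power density grows as alpha^2.
```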
Leakage Killed Dennard Scaling
Leakage:
• Exponential in the inverse of Vth
• Exponential in temperature
• Linear in device count
To switch well:
• Must keep Vdd/Vth > 3 ➜ Vdd can’t go down
Multicore:
Solution to Power-constrained design?
Power = C V² F, and F ∝ V
❒ Scale clock frequency to 80% (voltage scales down with it)
❒ Now add a second core
➜ Same power budget, but 1.6x performance!
But:
❒ Must parallelize the application
❒ Remember Amdahl’s Law!
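Checking the 1.6x claim (a quick calculation assuming, as above, that voltage scales linearly with frequency):

```latex
\frac{P_{\text{2 cores @ 80\%}}}{P_{\text{1 core}}}
  = \frac{2\,C\,(0.8V)^{2}\,(0.8F)}{C\,V^{2}\,F} = 2\,(0.8)^{3} \approx 1.02
% ~the same power budget, while peak throughput is 2 x 0.8 = 1.6x
```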
What Is a Parallel Computer?
“A collection of processing elements that communicate and cooperate to solve large problems fast.”
— Almasi & Gottlieb, 1989
Spectrum of Parallelism
Bit-level · Pipelining (EECS 370) · ILP (EECS 470) · Multithreading · Multiprocessing (EECS 570) · Distributed (EECS 591)
Why multiprocessing?
• Desire for performance
• Techniques from 370/470 are difficult to scale further
Why Parallelism Now?
• These arguments are no longer theoretical
• All major processor vendors are producing multicore chips
❒ Every machine will soon be a parallel machine
❒ All programmers will be parallel programmers???
• New software model
❒ Want a new feature? Hide the “cost” by speeding up the code first
❒ All programmers will be performance programmers???
• Some parallelism may eventually be hidden in libraries, compilers, and high-level languages
❒ But a lot of work is needed to get there
• Big open questions:
❒ What will be the killer apps for multicore machines?
❒ How should chips, languages, and OSes be designed to make it easier for us to develop parallel programs?
Multicore in Products
• “We are dedicating all of our future product development to multicore designs. … This is a sea change in computing.” — Paul Otellini, President, Intel (2005)
• All microprocessor companies switch to MP (2X cores / 2 yrs)

                     Intel’s Nehalem-EX   Azul’s Vega   nVidia’s Tesla
Processors/System            4                16              4
Cores/Processor              8                48            448
Threads/Processor            2                 1              –
Threads/System              64               768           1792
Revolution Continues…
❒ Azul’s Vega 3 7300 (May 2008): 54-core chip, 864 cores, 768 GB memory
❒ Blue Gene/Q Sequoia (2012): 16-core chip, 1.6 million cores, 1.6 PB memory
❒ Sun’s Modular Datacenter (’08): 8-core chip, 8 threads/core, 816 cores / 160 sq. feet
❒ Lakeside Datacenter (Chicago): 1.1 million sq. feet, ~45 million threads
Multiprocessors Are Here To Stay
• Moore’s law is making the multiprocessor a commodity part
❒ 1B transistors on a chip; what to do with all of them?
❒ Not enough ILP to justify a huge uniprocessor
❒ Really big caches? Hit time (thit) increases, with diminishing %miss returns
• Chip multiprocessors (CMPs)
❒ Every computing device (even your cell phone) is now a multiprocessor
Parallel Programming Intro
Motivation for MP Systems
• Classical reason for multiprocessing: more performance by using multiple processors in parallel
❒ Divide computation among processors and allow them to work concurrently
❒ Assumption 1: There is parallelism in the application
❒ Assumption 2: We can exploit this parallelism
Finding Parallelism
1. Functional parallelism
❒ Car: {engine, brakes, entertain, nav, …}
❒ Game: {physics, logic, UI, render, …}
❒ Signal processing: {transform, filter, scaling, …}
2. Data parallelism
❒ Vector, matrix, DB table, pixels, …
3. Request parallelism
❒ Web, shared database, telephony, …
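As a concrete illustration of data parallelism (a minimal C++ sketch of my own, not course-provided code): each thread scales an independent slice of a pixel array, with no cross-slice dependencies.

```cpp
// Data-parallelism sketch (hypothetical example): scale every pixel
// of an image by dividing the array among threads, one slice each.
#include <functional>
#include <thread>
#include <vector>

void scale_slice(std::vector<float>& pixels, size_t begin, size_t end, float k) {
    for (size_t i = begin; i < end; ++i)
        pixels[i] *= k;   // independent work: no cross-slice dependencies
}

int main() {
    std::vector<float> pixels(1 << 20, 1.0f);
    const unsigned nthreads = 4;
    std::vector<std::thread> workers;
    size_t chunk = pixels.size() / nthreads;
    for (unsigned t = 0; t < nthreads; ++t) {
        size_t begin = t * chunk;
        size_t end = (t == nthreads - 1) ? pixels.size() : begin + chunk;
        workers.emplace_back(scale_slice, std::ref(pixels), begin, end, 0.5f);
    }
    for (auto& w : workers) w.join();   // wait for all slices to finish
}
```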
Computational Complexity of (Sequential)
Algorithms
• Model: each step takes unit time
• Determine the time (/space) required by the algorithm as a function of input size
Sequential Sorting Example
• Given an array of size n
• MergeSort takes O(n log n) time
• BubbleSort takes O(n²) time
• But a BubbleSort implementation can sometimes be faster than a MergeSort implementation
• Why?
Sequential Sorting Example
• Given an array of size n
• MergeSort takes O(n log n) time
• BubbleSort takes O(n²) time
• But a BubbleSort implementation can sometimes be faster than a MergeSort implementation
• The model is still useful
❒ Indicates the scalability of the algorithm for large inputs
❒ Lets us prove things like: comparison sorting requires at least Ω(n log n) comparisons
We need a similar model for parallel
algorithms
Sequential Merge Sort
[Figure: sequential execution timeline on a 16MB input (32-bit integers): Recurse(left); Recurse(right); Merge to scratch array; Copy back to input array]
Parallel Merge Sort
(as Parallel Directed Acyclic Graph)
[Figure: the same steps as a parallel DAG on a 16MB input (32-bit integers): Recurse(left) and Recurse(right) execute in parallel, then Merge to scratch array, then Copy back to input array]
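A minimal C++ sketch of this fork-join structure (my own illustration of the DAG above, not the course's reference implementation): the two recursive calls run in parallel, then the merge and copy-back run sequentially.

```cpp
// Fork-join parallel merge sort sketch (illustrative, not course code).
#include <algorithm>
#include <thread>
#include <vector>

void merge_sort(std::vector<int>& a, std::vector<int>& scratch,
                size_t lo, size_t hi, int depth) {
    if (hi - lo < 2) return;
    size_t mid = lo + (hi - lo) / 2;
    if (depth > 0) {
        // Recurse(left) and Recurse(right) are independent DAG nodes:
        // run the left half on a new thread, the right half here.
        std::thread left([&] { merge_sort(a, scratch, lo, mid, depth - 1); });
        merge_sort(a, scratch, mid, hi, depth - 1);
        left.join();                       // edge into the Merge node
    } else {
        merge_sort(a, scratch, lo, mid, 0);
        merge_sort(a, scratch, mid, hi, 0);
    }
    // Merge to scratch array, then copy back to the input array.
    std::merge(a.begin() + lo, a.begin() + mid,
               a.begin() + mid, a.begin() + hi, scratch.begin() + lo);
    std::copy(scratch.begin() + lo, scratch.begin() + hi, a.begin() + lo);
}

int main() {
    std::vector<int> a = {5, 2, 8, 1, 9, 3, 7, 4};
    std::vector<int> scratch(a.size());
    merge_sort(a, scratch, 0, a.size(), 2);   // depth 2 => up to 4 threads
}
```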
Parallel DAG for Merge Sort
(2-core)
[Figure: two sequential sorts run in parallel, followed by a sequential merge]
Parallel DAG for Merge Sort
(4-core)
Parallel DAG for Merge Sort
(8-core)
The DAG Execution Model of a
Parallel Computation
• Given an input, dynamically create a DAG
• Nodes represent sequential computation
❒ Weighted by the amount of work
• Edges represent dependencies:
❒ Node A → Node B means that B cannot be scheduled unless A is finished
Sorting 16 elements in four cores
(4 element arrays sorted in constant time)
[Figure: DAG with node weights 1, 8, 1, 1, 1, 16, 1, 1, 8, 1]
Performance Measures
• Given a graph G, a scheduler S, and P processors:
• TP(S): time on P processors using scheduler S
• TP: time on P processors using the best scheduler
• T1: time on a single processor (sequential cost)
• T∞: time assuming infinite resources
Work and Depth
• T1 = Work
❒ The total number of operations executed by a computation
• T∞ = Depth
❒ The longest chain of sequential dependencies (critical path) in the parallel DAG
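Both measures fall out of a single pass over a weighted DAG in topological order. Here is a small sketch (my own example graph and code, assuming nodes are already topologically numbered):

```cpp
// Compute Work (sum of node weights) and Depth (heaviest path) of a DAG.
// Illustrative sketch; nodes listed in topological order for simplicity.
#include <algorithm>
#include <cstdio>
#include <vector>

struct Node { int work; std::vector<int> succs; };

int main() {
    // Diamond DAG: 0 -> {1, 2} -> 3, unit work except node 2
    std::vector<Node> dag = {{1, {1, 2}}, {1, {3}}, {8, {3}}, {1, {}}};
    int work = 0;
    std::vector<int> finish(dag.size(), 0);   // heaviest path ending at node
    for (size_t i = 0; i < dag.size(); ++i) {
        work += dag[i].work;
        finish[i] += dag[i].work;
        for (int s : dag[i].succs)
            finish[s] = std::max(finish[s], finish[i]);
    }
    int depth = *std::max_element(finish.begin(), finish.end());
    printf("Work T1 = %d, Depth Tinf = %d\n", work, depth);  // 11 and 10
}
```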
T∞ (Depth): Critical Path Length
(Sequential Bottleneck)
T1 (Work): Time to Run Sequentially
Sorting 16 elements in four cores
(4 element arrays sorted in constant time)
[Figure: the same DAG, node weights 1, 8, 1, 1, 1, 16, 1, 1, 8, 1]
Work =
Depth =
Some Useful Theorems
Work Law
• “You cannot avoid work by parallelizing”
T1 / P ≤ TP
Speedup = T1 / TP
• Can speedup be more than 2 when we go from 1-core to 2-core in practice?
Depth Law
• More resources should make things faster
• You are limited by the sequential bottleneck
TP ≥ T∞
Amount of Parallelism
Parallelism = T1 / T∞
Maximum Speedup Possible
Speedup = T1 / TP ≤ T1 / T∞ = Parallelism
“Speedup is bounded above by available parallelism”
Greedy Scheduler
• If more than P nodes can be scheduled, pick any subset of size P
• If fewer than P nodes can be scheduled, schedule them all
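Here is a small event-driven simulation of this greedy rule (an illustrative sketch of my own, with a made-up diamond DAG, not course code):

```cpp
// Greedy scheduling sketch over a weighted DAG (illustrative).
// Each step, run up to P ready tasks; a task is ready when all
// predecessors have finished. Returns the simulated makespan T_P.
#include <cstdio>
#include <functional>
#include <queue>
#include <utility>
#include <vector>

struct Node { int work; std::vector<int> succs; int preds = 0; };

int greedy_schedule(std::vector<Node> dag, int P) {
    for (auto& n : dag)
        for (int s : n.succs) dag[s].preds++;
    std::vector<int> ready;
    for (int i = 0; i < (int)dag.size(); ++i)
        if (dag[i].preds == 0) ready.push_back(i);
    // min-heap of (finish time, node) for the tasks currently running
    using Ev = std::pair<int, int>;
    std::priority_queue<Ev, std::vector<Ev>, std::greater<Ev>> running;
    int now = 0, makespan = 0;
    while (!ready.empty() || !running.empty()) {
        // greedily start ready tasks while processors are free
        while (!ready.empty() && (int)running.size() < P) {
            int n = ready.back(); ready.pop_back();
            running.push({now + dag[n].work, n});
        }
        auto [t, n] = running.top(); running.pop();   // next completion
        now = makespan = t;
        for (int s : dag[n].succs)
            if (--dag[s].preds == 0) ready.push_back(s);
    }
    return makespan;
}

int main() {
    // Diamond DAG: 0 -> {1, 2} -> 3, unit work except node 2.
    // T1 = 11, Tinf = 10, so the bound below gives T_2 <= 15.5.
    std::vector<Node> dag(4);
    dag[0] = {1, {1, 2}}; dag[1] = {1, {3}};
    dag[2] = {8, {3}};    dag[3] = {1, {}};
    printf("T_2 = %d\n", greedy_schedule(dag, 2));   // prints T_2 = 10
}
```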
Performance of the Greedy Scheduler
TP(Greedy) ≤ T1 / P + T∞
Work law: T1 / P ≤ TP
Depth law: T∞ ≤ TP
Greedy is optimal within a factor of 2
TP ≤ TP(Greedy) ≤ 2 TP
Work law: T1 / P ≤ TP
Depth law: T∞ ≤ TP
❒ Follows from the greedy bound: TP(Greedy) ≤ T1 / P + T∞ ≤ TP + TP = 2 TP
Work/Depth of Merge Sort
(Sequential Merge)
• Work T1: O(n log n)
• Depth T∞: O(n)
❒ Takes O(n) time to merge n elements
• Parallelism:
❒ T1 / T∞ = O(log n) → really bad!
Main Message
• Analyze the Work and Depth of your algorithm
• Parallelism is Work/Depth
• Try to decrease Depth
❒ the critical path
❒ a sequential bottleneck
• If you increase Depth…
❒ …better increase Work by a lot more!
Amdahl’s law
• Sorting takes 70% of the execution time of a sequential program
• You replace the sorting algorithm with one that scales perfectly on multicore hardware
• How many cores do you need to get a 4x speedup on the program?
Amdahl’s law, 𝑓=70%
Speedup(f, c) = 1 / ((1 – f) + f / c)
f = the parallel portion of execution
1 – f = the sequential portion of execution
c = number of cores used
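Plugging f = 0.7 into the formula shows how the 4x question plays out (worked through here for concreteness):

```latex
\frac{1}{(1 - 0.7) + 0.7/c} = 4
\;\Rightarrow\; 0.3 + \frac{0.7}{c} = 0.25
\;\Rightarrow\; \frac{0.7}{c} = -0.05
```

No positive c satisfies this: the limit as c → ∞ is 1/(1 – f) = 1/0.3 ≈ 3.33x, so the 4x target is unreachable no matter how many cores you add, as the next two plots show.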
Amdahl’s law, 𝑓=70%
[Figure: speedup vs. #cores (1–16). The curve for perfect scaling on 70% of execution falls short of the desired 4x speedup.]
Amdahl’s law, 𝑓=70%
[Figure: the same plot annotated with the limit as c → ∞: 1/(1 – f) = 3.33.]
Amdahl’s law, 𝑓=10%
[Figure: speedup vs. #cores for f = 10%. Even with perfect scaling, speedup approaches the Amdahl’s law limit of just 1.11x.]
Amdahl’s law, 𝑓=98%
[Figure: speedup vs. #cores (1–127) for f = 98%. Speedup climbs toward the 1/(1 – f) = 50x limit.]
Lesson
• Speedup is limited by sequential code
• Even a small percentage of sequential code can greatly limit potential speedup
Gustafson’s Law
Any sufficiently large problem can be parallelized effectively.
Speedup(f, c) = f·c + (1 – f)
f = the parallel portion of execution
1 – f = the sequential portion of execution
c = number of cores used
Key assumption: f increases as problem size increases
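To contrast the two laws numerically (my own illustrative numbers, f = 0.98 and c = 100):

```latex
S_{\text{Amdahl}} = \frac{1}{0.02 + 0.98/100} \approx 33.6
\qquad
S_{\text{Gustafson}} = 0.98 \cdot 100 + 0.02 = 98.02
```

Amdahl holds the problem size fixed, so the 2% sequential part dominates at scale; Gustafson assumes the parallel fraction grows with the input, so the sequential part shrinks relative to total work.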