Amdahl’s Law in the Multicore Era Mark D. Hill and Michael R. Marty University of Wisconsin—Madison January 2010 @ Other UW IBM’s Dr. Thomas Puzak: Everyone knows Amdahl’s Law © 2010 Multifacet Project But quickly forgets it!University of Wisconsin-Madison HPCA 2007 Debate [IEEE Micro 11-12/2007] Single-Threaded vs. Multithreaded: Where Should We Focus? Yale Patt vs. Mark Hill w/ Joel Emer, moderator © 2010 Multifacet Project University of Wisconsin-Madison Executive Summary • Develop A Corollary to Amdahl’s Law – – – – Simple Model of Multicore Hardware Complements Amdahl’s software model Fixed chip resources for cores Core performance improves sub-linearly with resources • Research Implications (1) Need Dramatic Increases in Parallelism (No Surprise) • 99% parallel limits 256 (base) cores to speedup 72 • New Moore’s Law: Double Parallelism Every Two Years? (2) Many larger chips need increased core performance (3) HW/SW for asymmetric designs (one/few cores enhanced) (4) HW/SW for dynamic designs (serial parallel) 7/26/2016 4 Wisconsin Multifacet Project Outline • Multicore Motivation & Research Paper Trends • Recall Amdahl’s Law • A Model of Multicore Hardware • Symmetric Multicore Chips • Asymmetric Multicore Chips • Dynamic Multicore Chips • Caveats & Wrap Up 7/26/2016 5 Wisconsin Multifacet Project Technology & Moore’s Law Transistor 1947 Integrated Circuit 1958 (a.k.a. Chip) Moore’s Law 1964: # Transistors per Chip doubles every two years (or 18 months) Architects & Another Moore’s Law Microprocessor 1971 50M transistors ~2000 Popular Moore’s Law: Processor (core) performance doubles every two years Multicore Chip (a.k.a. Chip Multiprocesors) Why Multicore? • Power simpler structures • Memory Concurrent accesses to tolerate off-chip latency • Wires intra-core wires shorter • Complexity divide & conquer But More cores; NOT faster cores Will effective chip performance keep doubling every two years? Eight 4-way cores 2006 Virtuous Cycle, circa 1950 – 2005 (per Larus) Increased processor performance Larger, more feature-full software Slower programs Larger development teams Higher-level languages & abstractions World-Wide Software Market (per IDC): $212b (2005) $310b (2010) 7/26/2016 9 Wisconsin Multifacet Project Virtuous Cycle, 2005 – ??? X Increased processor performance Slower programs Larger, more feature-full software GAME OVER — NEXT LEVEL? Larger development teams Higher-level languages & abstractions Thread Level Parallelism & Multicore Chips World-Wide Software Market $212b (2005) ? 7/26/2016 10 Wisconsin Multifacet Project 100 90 80 70 60 SMP Bulge 50 Lead up What to Next? Multicore 40 30 20 10 0 1973 1974 1976 1977 1978 1979 1980 1981 1982 1983 1984 1985 1986 1987 1988 1989 1990 1991 1992 1993 1994 1995 1996 1997 1998 1999 2000 2001 2002 2003 2004 2005 2006 2007 Percent Multiprocessor Papers in ISCA How has Architecture Research Prepared? Source: Hill & Rajwar, The Rise & Fall of Multiprocessor Papers in ISCA, http://www.cs.wisc.edu/~markhill/mp2001.html (3/2001) 7/26/2016 11 Wisconsin Multifacet Project 70 60 50 0 Source: Hill, 2/2008 7/26/2016 12 Wisconsin Multifacet Project 2009 80 2008 1973 1974 1976 1977 1978 1979 1980 1981 1982 1983 1984 1985 1986 1987 1988 1989 1990 1991 1992 1993 1994 1995 1996 1997 1998 1999 2000 2001 2002 2003 2004 2005 2006 2007 Percent Multiprocessor Papers in ISCA Reacted? How has Architecture Research Prepared? 100 90 Will Architecture Research Overreact? Multicore Ramp 40 30 20 10 Lead up to Multicore 2007 2006 2005 2004 2003 2002 2001 2000 1999 1998 1997 1996 1995 1994 1993 1992 1991 1990 What Next? Gentle Multicore Ramp End of Small SMP Bulge? 1989 100% 90% 80% 70% 60% 50% 40% 30% 20% 10% 0% 1988 Percent Multiprocessor Papers What About PL/Compilers (PLDI) Research? PLDI Begins Source: Steve Jackson, 3/2008 7/26/2016 14 Wisconsin Multifacet Project 100% 90% 80% 70% 60% 50% Small SMP Bulge 40% Lead up What to Next? Multicore NO Multicore Ramp (Yet) 30% 20% 10% 0% 1967 1969 1971 1973 1975 1977 1979 1981 1983 1985 1987 1989 1991 1993 1994 1995 1996 1997 1999 1999 2000 2001 2002 2003 2004 2005 2006 2007 Percent Multiprocessor Papers What About Systems (SOSP/OSDI) Research? SOSP odd years only ODSI even & SOSP odd Source: Michael Swift, 3/2008 7/26/2016 15 Wisconsin Multifacet Project Outline • Multicore Motivation & Research Paper Trends • Recall Amdahl’s Law • A Model of Multicore Hardware • Symmetric Multicore Chips • Asymmetric Multicore Chips • Dynamic Multicore Chips • Caveats & Wrap Up 7/26/2016 16 Wisconsin Multifacet Project Recall Amdahl’s Law • Begins with Simple Software Assumption (Limit Arg.) – Fraction F of execution time perfectly parallelizable – No Overhead for – Scheduling – Communication – Synchronization, etc. – Fraction 1 – F Completely Serial • Time on 1 core = (1 – F) / 1 + F / 1 = 1 • Time on N cores = (1 – F) / 1 + F / N 7/26/2016 17 Wisconsin Multifacet Project Recall Amdahl’s Law [1967] 1 Amdahl’s Speedup = 1-F 1 + F N • For mainframes, Amdahl expected 1 - F = 35% – For a 4-processor speedup = 2 – For infinite-processor speedup < 3 – Therefore, stay with mainframes with one/few processors • Amdahl’s Law applied to Minicomputer to PC Eras • What about the Multicore Era? 7/26/2016 18 Wisconsin Multifacet Project Designing Multicore Chips Hard • Designers must confront single-core design options – – – – Instruction fetch, wakeup, select Execution unit configuation & operand bypass Load/queue(s) & data cache Checkpoint, log, runahead, commit. • As well as additional design degrees of freedom – – – – How many cores? How big each? Shared caches: levels? How many banks? Memory interface: How many banks? On-chip interconnect: bus, switched, ordered? 7/26/2016 19 Wisconsin Multifacet Project Want Simple Multicore Hardware Model To Complement Amdahl’s Simple Software Model (1) Chip Hardware Roughly Partitioned into – Multiple Cores (with L1 caches) – The Rest (L2/L3 cache banks, interconnect, pads, etc.) – Changing Core Size/Number does NOT change The Rest (2) Resources for Multiple Cores Bounded – Bound of N resources per chip for cores – Due to area, power, cost ($$$), or multiple factors – Bound = Power? (but our pictures use Area) 7/26/2016 20 Wisconsin Multifacet Project Want Simple Multicore Hardware Model, cont. (3) Micro-architects can improve single-core performance using more of the bounded resource • A Simple Base Core – Consumes 1 Base Core Equivalent (BCE) resources – Provides performance normalized to 1 • An Enhanced Core (in same process generation) – Consumes R BCEs – Performance as a function Perf(R) • What does function Perf(R) look like? 7/26/2016 21 Wisconsin Multifacet Project More on Enhanced Cores • (Performance Perf(R) consuming R BCEs resources) • If Perf(R) > R Always enhance core • Cost-effectively speedups both sequential & parallel • Therefore, Equations Assume Perf(R) < R • Graphs Assume Perf(R) = Square Root of R – 2x performance for 4 BCEs, 3x for 9 BCEs, etc. – Why? Models diminishing returns with “no coefficients” – Alpha EV4/5/6 [Kumar 11/2005] & Intel’s Pollack’s Law • How to speedup enhanced core? – <Insert favorite or TBD micro-architectural ideas here> 7/26/2016 22 Wisconsin Multifacet Project Outline • Multicore Motivation & Research Paper Trends • Recall Amdahl’s Law • A Model of Multicore Hardware • Symmetric Multicore Chips • Asymmetric Multicore Chips • Dynamic Multicore Chips • Caveats & Wrap Up 7/26/2016 23 Wisconsin Multifacet Project How Many (Symmetric) Cores per Chip? • • • • • Each Chip Bounded to N BCEs (for all cores) Each Core consumes R BCEs Assume Symmetric Multicore = All Cores Identical Therefore, N/R Cores per Chip — (N/R)*R = N For an N = 16 BCE Chip: Sixteen 1-BCE cores 7/26/2016 Four 4-BCE cores 24 One 16-BCE core Wisconsin Multifacet Project Performance of Symmetric Multicore Chips • Serial Fraction 1-F uses 1 core at rate Perf(R) • Serial time = (1 – F) / Perf(R) • Parallel Fraction uses N/R cores at rate Perf(R) each • Parallel time = F / (Perf(R) * (N/R)) = F*R / Perf(R)*N • Therefore, w.r.t. one base core: 1 Symmetric Speedup = 1-F Perf(R) • Implications? + F*R Perf(R)*N Enhanced Cores speed Serial & Parallel 7/26/2016 25 Wisconsin Multifacet Project Symmetric Multicore Chip, N = 16 BCEs 16 Symmetric Speedup 14 12 10 8 6 4 F=0.5 2 0 1 2 4 8 16 (16 cores) (8 cores) R BCEs (4 cores) (2 cores) (1 core) F=0.5 R=16, Cores=1, Speedup=4 F=0.5, Opt. Speedup S = 4 = 1/(0.5/4 + 0.5*16/(4*16)) Need to increase parallelism to make multicore optimal! 7/26/2016 26 Wisconsin Multifacet Project Symmetric Multicore Chip, N = 16 BCEs 16 Symmetric Speedup 14 12 10 8 F=0.9 6 F=0.9, R=2, Cores=8, Speedup=6.7 4 F=0.5 2 0 1 2 4 8 16 R BCEs F=0.5 R=16, Cores=1, Speedup=4 At F=0.9, Multicore optimal, but speedup limited Need to obtain even more parallelism! 7/26/2016 27 Wisconsin Multifacet Project Symmetric Multicore Chip, N = 16 BCEs 16 F=0.999 Symmetric Speedup 14 F1, R=1, Cores=16, Speedup16 F=0.99 12 F=0.975 10 8 F=0.9 6 4 F=0.5 2 0 1 2 4 8 16 R BCEs F matters: Amdahl’s Law applies to multicore chips MANY Researchers should target parallelism F first 7/26/2016 28 Wisconsin Multifacet Project Need a Third “Moore’s Law?” • Technologist’s Moore’s Law – Double Transistors per Chip every 2 years – Slows or stops: TBD • Microarchitect’s Moore’s Law – Double Performance per Core every 2 years – Slowed or stopped: Early 2000s • Multicore’s Moore’s Law – – – – – Double Cores per Chip every 2 years & Double Parallelism per Workload every 2 years & Aided by Architectural Support for Parallelism = Double Performance per Chip every 2 years Starting now • Software as Producer, not Consumer, of Performance Gains! 7/26/2016 29 Wisconsin Multifacet Project Symmetric Multicore Chip, N = 16 BCEs 16 F=0.999 Symmetric Speedup 14 F=0.99 12 F=0.975 10 8 F=0.9 6 Recall F=0.9, R=2, Cores=8, Speedup=6.7 4 F=0.5 2 0 1 2 4 8 16 R BCEs As Moore’s Law enables N to go from 16 to 256 BCEs, More cores? Enhance cores? Or both? 7/26/2016 30 Wisconsin Multifacet Project Symmetric Multicore Chip, N = 256 BCEs Symmetric Speedup 250 F1 R=1 (vs. 1) Cores=256 (vs. 16) Speedup=204 (vs. 16) 200 F=0.999 150 MORE CORES! 100 F=0.99 F=0.975 50 F=0.99 R=3 (vs. 1) Cores=85 0(vs. 16) Speedup=80 (vs.1 13.9) 2 MORE CORES & ENHANCE CORES! F=0.9 F=0.5 4 8 16 R BCEs 32 F=0.9 R=28 (vs. 64 128 2) 256 Cores=9 (vs. 8) Speedup=26.7 (vs. 6.7) ENHANCE CORES! As Moore’s Law increases N, often need enhanced core designs Some arch. researchers should target single-core performance 7/26/2016 31 Wisconsin Multifacet Project Aside: Cost-Effective Parallel Computing • Isn’t Speedup(C) < C Inefficient? (C = #cores) • Much of a Computer’s Cost OUTSIDE Processor [Wood & Hill, IEEE Computer 2/1995] Cores • Let Costup(C) = Cost(C)/Cost(1) • Parallel Computing Cost-Effective: Multicores have • Speedup(C) > Costup(C) even lower • 1995 SGI PowerChallenge w/ 500MB: Costups!!! • Costup(32) = 8.6 How Might Servers/Clients/Embedded Evolve? • Recall 1970s Watergate – – – – Secret Source Deep Throat (W. Mark Felt @ FBI) Helped Reporters Bob Woodward & Carl Bernstein Confirmed, but would not provide information Frequently recommended: Follow the Money • Today I recommend: Follow the Parallelism! – Where Parallelism Helps Performance – Where Parallelism Helps Cost-Performance • Servers can use vast parallelism • Can clients & embedded? • If not, computing’s center of gravity server cloud 7/26/2016 34 Wisconsin Multifacet Project Outline • Multicore Motivation & Research Paper Trends • Recall Amdahl’s Law • A Model of Multicore Hardware • Symmetric Multicore Chips • Asymmetric Multicore Chips • Dynamic Multicore Chips • Caveats & Wrap Up 7/26/2016 35 Wisconsin Multifacet Project Asymmetric (Heterogeneous) Multicore Chips • Symmetric Multicore Required All Cores Equal • Why Not Enhance Some (But Not All) Cores? • For Amdahl’s Simple Software Assumptions – One Enhanced Core – Others are Base Cores • How? – <fill in favorite micro-architecture techniques here> – Model ignores design cost of asymmetric design • How does this effect our hardware model? 7/26/2016 36 Wisconsin Multifacet Project How Many Cores per Asymmetric Chip? • • • • • Each Chip Bounded to N BCEs (for all cores) One R-BCE Core leaves N-R BCEs Use N-R BCEs for N-R Base Cores Therefore, 1 + N - R Cores per Chip For an N = 16 BCE Chip: Asymmetric: One 4-BCE core & Twelve 1-BCE base cores Symmetric: Four 4-BCE cores 7/26/2016 37 Wisconsin Multifacet Project Performance of Asymmetric Multicore Chips • Serial Fraction 1-F same, so time = (1 – F) / Perf(R) • Parallel Fraction F – One core at rate Perf(R) – N-R cores at rate 1 – Parallel time = F / (Perf(R) + N - R) • Therefore, w.r.t. one base core: 1 Asymmetric Speedup = 7/26/2016 1-F Perf(R) 38 + F Perf(R) + N - R Wisconsin Multifacet Project Asymmetric Multicore Chip, N = 256 BCEs 250 Asymmetric Speedup F=0.999 200 150 F=0.99 100 F=0.975 50 F=0.9 F=0.5 0 1 2 4 8 16 32 64 128 256 (256 cores) (1+252 cores) R BCEs (1+192 cores) (1 core) (1+240 cores) Number of Cores = 1 (Enhanced) + 256 – R (Base) How do Asymmetric & Symmetric speedups compare? 7/26/2016 39 Wisconsin Multifacet Project Recall Symmetric Multicore Chip, N = 256 BCEs Symmetric Speedup 250 200 F=0.999 150 100 F=0.99 F=0.975 50 F=0.9 F=0.5 Recall F=0.9, R=28, Cores=9, Speedup=26.7 0 1 2 4 8 16 32 64 128 256 R BCEs 7/26/2016 40 Wisconsin Multifacet Project Asymmetric Multicore Chip, N = 256 BCEs 250 Asymmetric Speedup F=0.999 F=0.99 R=41 (vs. 3) Cores=216 (vs. 85) Speedup=166 (vs. 80) 200 150 F=0.99 100 F=0.975 50 F=0.9 F=0.5 0 1 2 4 8 16 R BCEs 32 64 128 256 F=0.9 R=118 (vs. 28) Cores= 139 (vs. 9) Speedup=65.6 (vs. 26.7) Asymmetric offers greater speedups potential than Symmetric In Paper: As Moore’s Law increases N, Asymmetric gets better Some arch. researchers should target asymmetric multicores 7/26/2016 41 Wisconsin Multifacet Project Asymmetric Multicore: 3 Software Issues 1. Schedule computation (e.g., when to use bigger core) 2. Manage locality (e.g., sending code or data can sap gains) 3. Synchronize (e.g., asymmetric cores reaching a barrier) At What Level? – – – – – – – Application Programmer Library Author More Leverage (?) Compiler Runtime System More Info (?) Operating System Hypervisor (Virtual Machine Monitor) Hardware Outline • Multicore Motivation & Research Paper Trends • Recall Amdahl’s Law • A Model of Multicore Hardware • Symmetric Multicore Chips • Asymmetric Multicore Chips • Dynamic Multicore Chips • Caveats & Wrap Up 7/26/2016 43 Wisconsin Multifacet Project Dynamic Multicore Chips, Take 1 • Why NOT Have Your Cake and Eat It Too? • N Base Cores for Best Parallel Performance • Harness R Cores Together for Serial Performance • How? DYNAMICALLY Harness Cores Together – <insert favorite or TBD techniques here> parallel mode sequential mode 7/26/2016 44 Wisconsin Multifacet Project Dynamic Multicore Chips, Take 2 • Let POWER provide the limit of N BCEs • While Area is Unconstrained (to first order) parallel mode How to model these two chips? sequential mode • Result: N base cores for parallel; large core for serial – [Chakraborty, Wells, & Sohi, Wisconsin CS-TR-2007-1607] – When Simultaneous Active Fraction (SAF) < ½ 7/26/2016 45 Wisconsin Multifacet Project Performance of Dynamic Multicore Chips • N Base Cores with R BCEs used Serially • Serial Fraction 1-F uses R BCEs at rate Perf(R) • Serial time = (1 – F) / Perf(R) • Parallel Fraction F uses N base cores at rate 1 each • Parallel time = F / N • Therefore, w.r.t. one base core: 1 Dynamic Speedup = 7/26/2016 1-F Perf(R) 46 + F N Wisconsin Multifacet Project Recall Asymmetric Multicore Chip, N = 256 BCEs 250 Asymmetric Speedup F=0.999 200 Recall 150 F=0.99 100 F=0.99 R=41 Cores=216 Speedup=166 F=0.975 50 F=0.9 F=0.5 0 1 2 4 8 16 32 64 128 256 R BCEs What happens with a dynamic chip? 7/26/2016 47 Wisconsin Multifacet Project Dynamic Multicore Chip, N = 256 BCEs 250 Dynamic Speedup F=0.999 200 150 F=0.99 R=256 (vs. 41) Cores=256 (vs. 216) Speedup=223 (vs. 166) F=0.99 100 F=0.975 50 F=0.9 F=0.5 0 1 2 4 8 16 32 64 128 256 R BCEs Dynamic offers greater speedup potential than Asymmetric Arch. researchers should target dynamically harnessing cores 7/26/2016 48 Wisconsin Multifacet Project Dynamic Asymmetric Multicore: 3 Software Issues 1. Schedule computation (e.g., when to use bigger core) 2. Manage locality (e.g., sending code or data can sap gains) 3. Synchronize (e.g., asymmetric cores reaching a barrier) At What Level? – – – – – – – Application Programmer Library Author More Leverage (?) Compiler Runtime System More Info (?) Operating System Hypervisor (Virtual Machine Monitor) Hardware Dynamic Challenges > Asymmetric Ones Dynamic chips due to power likely Outline • Multicore Motivation & Research Paper Trends • Recall Amdahl’s Law • A Model of Multicore Hardware • Symmetric Multicore Chips • Asymmetric Multicore Chips • Dynamic Multicore Chips • Caveats & Wrap Up 7/26/2016 50 Wisconsin Multifacet Project Three Multicore Amdahl’s Law Parallel Section 1 Symmetric Speedup = Sequential Section 1-F Perf(R) + 1 Enhanced Core Asymmetric Speedup = F*R Perf(R)*N N/R Enhanced Cores 1 1-F Perf(R) + F Perf(R) + N - R 1 Enhanced & N-R Base Cores 1 Dynamic Speedup = 7/26/2016 1-F Perf(R) + 51 F N N Base Cores Wisconsin Multifacet Project Amdahl’s Software Model Charges • Serial (parallel) fraction not totally serial (parallel) • Can extend to model to tree algorithms (bounded parallelism) • Serial/Parallel fraction changes Weak Scaling [Gustafson] • Prudent architectures support Strong Scaling • Synchronization, communication, scheduling effects? • Can extend for overheads and imbalance • Software challenges for asymmetric/dynamic worse • Can extend to model overheads to facilitate • Future software will be totally parallel (see “my work”) • I’m skeptical; not even true for MapReduce 7/26/2016 52 Wisconsin Multifacet Project Our Hardware Model Charges • Naïve to bound Cores by one resource (esp. area) • Can extend for Pareto optimal mix of area, dynamic/static power, complexity, reliability, … • Naïve to ignore off-chip bandwidth & L2/L3 caching • Can extend for modeling memory system • Naïve to use performance = square root of resources • Can extend as equations can use any function • We architects can’t scale Perf(R) for very large R • True, so what should we do about it? 7/26/2016 53 Wisconsin Multifacet Project Three-Part Charge Architects: Build more-effective multicore hardware • Don’t lament that we can’t do, but do it! • Play with & trash our models [IEEE Computer, July 2008] – www.cs.wisc.edu/multifacet/amdahl Computer Scientists: Implement “3rd Moore’s Law” • Double Parallelism Every Two Years • Consider Symmetric, Asymmetric, & Dynamic Chips Finally, We must all work together • Keep (cost-) performance gains progressing • Parallel Programming & Parallel Computers 7/26/2016 55 Wisconsin Multifacet Project Dynamic Multicore Chip, N = 1024 BCEs Dynamic Speedup 1000 F=0.999 800 F1 R1024 Cores1024 Speedup 1024! 600 F=0.99 400 F=0.975 200 F=0.9 F=0.5 0 1 4 16 64 256 1024 R BCEs NOT Possible Today NOT Possible EVER Unless We Dream & Act 7/26/2016 56 Wisconsin Multifacet Project Executive Summary • Develop A Corollary to Amdahl’s Law – – – – Simple Model of Multicore Hardware Complements Amdahl’s software model Fixed chip resources for cores Core performance improves sub-linearly with resources • Research Implications (1) Need Dramatic Increases in Parallelism (No Surprise) • 99% parallel limits 256 (base) cores to speedup 72 • New Moore’s Law: Double Parallelism Every Two Years? (2) Many larger chips need increased core performance (3) HW/SW for asymmetric designs (one/few cores enhanced) (4) HW/SW for dynamic designs (serial parallel) 7/26/2016 57 Wisconsin Multifacet Project Backup Slides 7/26/2016 58 Wisconsin Multifacet Project Symmetric Multicore Chip, N = 16 BCEs 16 F=0.999 Symmetric Speedup 14 F=0.99 12 F=0.975 10 8 F=0.9 6 4 F=0.5 2 0 1 2 4 8 16 R BCEs 7/26/2016 60 Wisconsin Multifacet Project Symmetric Multicore Chip, N = 256 BCEs Symmetric Speedup 250 200 F=0.999 150 100 F=0.99 F=0.975 50 F=0.9 F=0.5 0 1 2 4 8 16 32 64 128 256 R BCEs 7/26/2016 61 Wisconsin Multifacet Project Symmetric Multicore Chip, N = 1024 BCEs Symmetric Speedup 1000 800 600 F=0.999 400 F=0.99 F=0.975 F=0.9 200 F=0.5 0 1 4 16 64 256 1024 R BCEs 7/26/2016 62 Wisconsin Multifacet Project Asymmetric Multicore Chip, N = 16 BCEs F=0.999 16 F=0.99 Asymmetric Speedup 14 F=0.975 12 10 F=0.9 8 6 4 F=0.5 2 0 1 2 4 8 16 R BCEs 7/26/2016 63 Wisconsin Multifacet Project Asymmetric Multicore Chip, N = 256 BCEs 250 Asymmetric Speedup F=0.999 200 150 F=0.99 100 F=0.975 50 F=0.9 F=0.5 0 1 2 4 8 16 32 64 128 256 R BCEs 7/26/2016 64 Wisconsin Multifacet Project Asymmetric Multicore Chip, N = 1024 BCEs Asymmetric Speedup 1000 F=0.999 800 600 F=0.99 400 F=0.975 200 F=0.9 F=0.5 0 1 4 16 64 256 1024 R BCEs 7/26/2016 65 Wisconsin Multifacet Project Dynamic Multicore Chip, N = 16 BCEs F=0.999 16 F=0.99 Dynamic Speedup 14 F=0.975 12 10 F=0.9 8 6 4 F=0.5 2 0 1 2 4 8 16 R BCEs 7/26/2016 66 Wisconsin Multifacet Project Dynamic Multicore Chip, N = 256 BCEs 250 Dynamic Speedup F=0.999 200 150 F=0.99 100 F=0.975 50 F=0.9 F=0.5 0 1 2 4 8 16 32 64 128 256 R BCEs 7/26/2016 67 Wisconsin Multifacet Project Dynamic Multicore Chip, N = 1024 BCEs Dynamic Speedup 1000 F=0.999 800 600 F=0.99 400 F=0.975 200 F=0.9 F=0.5 0 1 4 16 64 256 1024 R BCEs 7/26/2016 68 Wisconsin Multifacet Project Software Model Charges 1 of 2 • Serial fraction not totally serial • Can extend software model to tree algorithms, etc. • Parallel fraction not totally parallel • Can extend for varying or bounded parallelism • • • • Serial/Parallel fraction may change Can extend for Weak Scaling [Gustafson, CACM’88] Run larger, more parallel problem in constant time But prudent architectures support Strong Scaling 7/26/2016 69 Wisconsin Multifacet Project Software Model Charges 2 of 2 • Synchronization, communication, scheduling effects? • Can extend for overheads and imbalance • Software challenges for asymmetric multicore worse • Can extend for asymmetric scheduling, etc. • Software challenges for dynamic multicore greater • Can extend to model overheads to facilitate • Future software will be totally parallel (see “my work”) • I’m skeptical; not even true for MapReduce 7/26/2016 70 Wisconsin Multifacet Project Hardware Model Charges 1 of 2 • Naïve to consider total resources for cores fixed • Can extend hardware model to how core changes effect The Rest • Naïve to bound Cores by one resource (esp. area) • Can extend for Pareto optimal mix of area, dynamic/static power, complexity, reliability, … • Naïve to ignore challenges due to off-chip bandwidth limits & benefits of last-level caching • Can extend for modeling these 7/26/2016 71 Wisconsin Multifacet Project Hardware Model Charges 2 of 2 • Naïve to use performance = square root of resources • Can extend as equations can use any function • We architects can’t scale Perf(R) for very large R • True, not yet. • We architects can’t dynamically harness very large R • True, not yet • So what should computer scientists do about it? 7/26/2016 72 Wisconsin Multifacet Project