(How) Can Programmers Conquer the Multicore Menace?
Saman Amarasinghe
Computer Science and Artificial Intelligence Laboratory
Massachusetts Institute of Technology

Outline
• The Multicore Menace
• Deterministic Multithreading via Kendo
• Algorithmic Choices via PetaBricks
• Conquering the Multicore Menace

Today: The Happily Oblivious Average Joe Programmer
• Joe is oblivious about the processor
  – Moore's law brings Joe performance
  – Sufficient for Joe's requirements
• Joe has built a solid boundary between hardware and software
  – High-level languages abstract away the processors
  – Ex: Java bytecode is machine independent
• This abstraction has provided a lot of freedom for Joe
• Parallel programming is practiced by only a few experts

Moore's Law / Uniprocessor Performance (SPECint)
[Chart, repeated across three successive slides: uniprocessor performance (vs. VAX-11/780) and transistor counts (10,000 up to 1,000,000,000) from 1978 to 2016, covering the 8086, 286, 386, 486, Pentium, P2, P3, P4, Itanium, and Itanium 2. Performance grew 25%/year until 1986, 52%/year afterward, and ??%/year in the multicore era. From Hennessy and Patterson, Computer Architecture: A Quantitative Approach, 4th edition, 2006; slides from David Patterson.]

Squandering of the Moore's Dividend
• 10,000x performance gain in 30 years! (~46% per year)
• Where did this performance go?
• Last decade we concentrated on correctness and programmer productivity
• Little to no emphasis on performance
• This is reflected in:
  – Languages
  – Tools
  – Research
  – Education
• Software engineering: the only engineering discipline where performance or efficiency is not a central theme

Matrix Multiply: An Example of Unchecked Excesses
• Abstraction and software engineering
  – Immutable types
  – Dynamic dispatch
  – Object oriented
• High-level languages
• Memory management
  – Transpose for unit stride
  – Tile for cache locality
• Vectorization
• Prefetching
• Parallelization
[Bar chart annotating the successive versions with performance gaps of 296,260x, 87,042x, 33,453x, 12,316x, 2,271x, 7,514x, 1,117x, 522x, and 220x.]

Matrix Multiply: An Example of Unchecked Excesses
• Typical software engineering approach
  – In Java
  – Object oriented
  – Immutable
  – Abstract types
  – No memory optimizations
  – No parallelization
• Good performance engineering approach
  – In C/Assembly
  – Memory optimized (blocked)
  – BLAS libraries
  – Parallelized (to 4 cores)
• The gap between the two: 296,260x
• In comparison: lowest to highest MPG in transportation [chart values: 14,700x, 294,000x]

Joe the Parallel Programmer
• Moore's law is no longer bringing performance gains
• If Joe needs performance he has to deal with multicores
  – Joe has to deal with performance
  – Joe has to deal with parallelism

Why Parallelism is Hard
• A huge increase in complexity and work for the programmer
  – Programmer has to think about performance!
  – Parallelism has to be designed in at every level
• Programmers are trained to think sequentially
  – Deconstructing problems into parallel tasks is hard for many of us
• Parallelism is not easy to implement
  – Parallelism cannot be abstracted or layered away
  – Code and data have to be restructured in very different (non-intuitive) ways
• Parallel programs are very hard to debug
  – Combinatorial explosion of possible execution orderings
  – Race condition and deadlock bugs are non-deterministic and elusive
  – Non-deterministic bugs go away in the lab environment and under instrumentation

Outline
• The Multicore Menace
• Deterministic Multithreading via Kendo
  – Joint work with Marek Olszewski and Jason Ansel
• Algorithmic Choices via PetaBricks
• Conquering the Multicore Menace

Racing for Lock Acquisition
• Two threads
  – Start at the same time
  – 1st thread: 1000 instructions to the lock acquisition
  – 2nd thread: 1100 instructions to the lock acquisition
[Diagram: the two threads' instruction counts advancing in physical time toward the lock acquisition; a runnable sketch of this race follows below.]

Non-Determinism
• Inherent in parallel applications
  – Accesses to shared data can experience many possible interleavings
  – New! Was not the case for sequential applications!
  – Almost never part of program specifications
  – Even the simplest parallel programs, e.g. a work queue, are non-deterministic
• Non-determinism is undesirable
  – Hard to create programs with repeatable results
  – Difficult to perform cyclic debugging
  – Testing offers weaker guarantees

Deterministic Multithreading
• Observation:
  – Non-determinism need not be a required property of threads
  – We can interleave thread communication in a deterministic manner
  – Call this Deterministic Multithreading
• Deterministic multithreading:
  – Makes debugging easier
  – Tests offer guarantees again
  – Supports existing programming models/languages
  – Allows programmers to "determinize" computations that were previously difficult to determinize with today's programming idioms
  – e.g.: Radiosity (Singh et al. 1994), LocusRoute (Rose 1988), and Delaunay Triangulation (Kulkarni et al. 2008)
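Before turning to how Kendo enforces determinism, the lock-acquisition race above can be made concrete with a minimal C++ sketch (illustrative code, not from the talk; the busy-work loop merely stands in for the 1000 vs. 1100 instructions):

    // Two threads race for the same lock; the winner varies from run to run.
    #include <iostream>
    #include <mutex>
    #include <thread>
    #include <vector>

    int main() {
        std::mutex m;
        std::vector<int> acquisition_order;

        auto worker = [&](int id, int work) {
            volatile long sink = 0;                 // simulated pre-lock work
            for (int i = 0; i < work; ++i) sink += i;
            std::lock_guard<std::mutex> guard(m);   // the racy acquisition
            acquisition_order.push_back(id);
        };

        std::thread t1(worker, 1, 1000);
        std::thread t2(worker, 2, 1100);
        t1.join();
        t2.join();

        for (int id : acquisition_order) std::cout << id << " ";
        std::cout << "\n";                          // prints "1 2" or "2 1"
    }

Deterministic multithreading makes the acquisition order a function of the program and its input, so the printed order would be the same on every run.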
Deterministic Multithreading
• Strong determinism
  – Deterministic interleaving for all accesses to shared data for a given input
  – Attractive, but difficult to achieve efficiently without hardware support
• Weak determinism
  – Deterministic interleaving of all lock acquisitions for a given input
  – Cheaper to enforce
  – Offers the same guarantees as strong determinism for data-race-free program executions
  – Can be checked with a dynamic race detector!

Kendo
• A prototype deterministic locking framework
  – Provides weak determinism for C and C++ code
  – Runs on commodity hardware today!
  – Implements a subset of the pthreads API
  – Enforces determinism without sacrificing load balance
  – Tracks the progress of threads to dynamically construct the deterministic interleaving: deterministic logical time
  – Incurs low performance overhead (16% geomean on SPLASH-2)

Deterministic Logical Time
• Abstract counterpart to physical time
  – Used to deterministically order events on an SMP machine
  – Necessary to construct the deterministic interleaving
• Represented as P independently updated deterministic logical clocks
  – Not updated based on the progress of other threads (unlike Lamport clocks)
  – Event 1 (on Thread 1) occurs before Event 2 (on Thread 2) in deterministic logical time if Thread 1 has a lower deterministic logical clock than Thread 2 at the time of the events

Deterministic Logical Clocks
• Requirements
  – Must be based on events that are deterministically reproducible from run to run
  – Must track the progress of threads in physical time as closely as possible (for better load balancing of the deterministic interleaving)
  – Must be cheap to compute
  – Must be portable across micro-architectures
  – Must be stored in memory for other threads to observe

Deterministic Logical Clocks
• Some x86 performance counter events satisfy many of these requirements
  – Chose the "Retired Store Instructions" event
• Required changes to the Linux kernel
  – Performance counters are accessible only at kernel level
  – Added an interrupt service routine that increments each thread's deterministic logical clock (in memory) on every performance counter overflow
  – The frequency of overflows can be controlled

Locking Algorithm
• Construct a deterministic interleaving of lock acquires from the deterministic logical clocks
  – Simulate the interleaving that would occur if running in deterministic logical time
• Uses the concept of a turn. It is a thread's turn when (see the sketch after the example below):
  – All threads with smaller IDs have greater deterministic logical clocks
  – All threads with larger IDs have greater or equal deterministic logical clocks

Locking Algorithm

    function det_mutex_lock(l) {
      pause_logical_clock();
      wait_for_turn();
      lock(l);
      inc_logical_clock();
      enable_logical_clock();
    }

    function det_mutex_unlock(l) {
      unlock(l);
    }

Example
[Animated sequence, condensed: Threads 1 and 2 advance through physical time while their deterministic logical clocks tick independently (e.g., t=3 vs. t=5, then t=6 vs. t=11). Both threads call det_lock(a) at nearly the same physical time: it's a race! Each thread spins in wait_for_turn(). Because Thread 2 reaches the acquisition at a lower deterministic logical time (t=22) than Thread 1 (t=25), Thread 2 wins the turn and calls lock(a); Thread 2 will always acquire the lock first, on every run. Thread 1 acquires the lock only after Thread 2 releases it with det_unlock(a).]
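For reference, a minimal C++ reconstruction of the turn test (my sketch, not Kendo's source; det_clock, the thread IDs, and the clock-pausing calls are assumptions, with the clocks normally advanced by the performance-counter interrupt handler):

    #include <atomic>
    #include <mutex>

    constexpr int P = 4;                       // number of threads (assumed)
    std::atomic<unsigned long> det_clock[P];   // per-thread clocks, zero-initialized

    // It is thread id's turn when every smaller-ID thread has a strictly
    // greater clock and every larger-ID thread has a greater-or-equal clock.
    bool is_my_turn(int id) {
        unsigned long mine = det_clock[id].load();
        for (int t = 0; t < P; ++t) {
            if (t < id && det_clock[t].load() <= mine) return false;
            if (t > id && det_clock[t].load() <  mine) return false;
        }
        return true;
    }

    void det_mutex_lock(int id, std::mutex& l) {
        // pause_logical_clock() would stop the counter here (omitted).
        while (!is_my_turn(id)) { /* spin */ }  // wait_for_turn()
        l.lock();
        det_clock[id].fetch_add(1);             // inc_logical_clock()
        // enable_logical_clock() would restart the counter here (omitted).
    }

    void det_mutex_unlock(std::mutex& l) {
        l.unlock();
    }

As written, a thread that spins while holding another lock never advances its clock, which is exactly the nested-lock deadlock the improvements below address.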
Locking Algorithm Improvements
• Eliminate deadlocks with nested locks
  – Make the thread increment its deterministic logical clock while it spins on the lock
  – Must do so deterministically
• Queuing for fairness
• Lock priority boosting
• See the ASPLOS 2009 paper on Kendo for details

Evaluation
• Methodology
  – Converted the SPLASH-2 benchmark suite to use the Kendo framework
  – Eliminated data races
  – Checked determinism by examining the output and the final deterministic logical clocks of each thread
• Experimental framework
  – Processor: Intel Core 2 quad-core running at 2.66GHz
  – OS: Linux 2.6.23 (modified for performance counter support)

Results
[Bar chart: execution time relative to non-deterministic execution (0 to 1.6) for tsp, quicksort, ocean, barnes, radiosity, raytrace, fmm, volrend, water-nsquared, and the mean; each bar is broken down into application time, interrupt overhead, and deterministic wait overhead.]

Effect of Interrupt Frequency
[Chart: execution time relative to non-deterministic execution (0 to 5) as the interrupt period varies from 64 to 16K; same breakdown into application time, interrupt overhead, and deterministic wait overhead.]

Related Work
• DMP – Deterministic Multiprocessing
  – Hardware design that provides strong determinism
• StreamIt language
  – Streaming programming model only allows one interleaving of inter-thread communication
• Cilk language
  – Fork/join programming model that can produce programs whose semantics always match a deterministic "serialization" of the code
  – Cannot be used with locks
  – Must be data-race free (can be checked with a Cilk race detector)

Outline
• The Multicore Menace
• Deterministic Multithreading via Kendo
• Algorithmic Choices via PetaBricks
  – Joint work with Jason Ansel, Cy Chan, Yee Lok Wong, Qin Zhao, and Alan Edelman
• Conquering the Multicore Menace

Observation 1: Algorithmic Choice
• For many problems there are multiple algorithms
  – In most cases there is no single winner
  – An algorithm will be the best performing for a given:
    – Input size
    – Amount of parallelism
    – Communication bandwidth / synchronization cost
    – Data layout
    – Data itself (sparse data, convergence criteria, etc.)
• Multicores expose many of these to the programmer
  – Exponential growth of cores (impact of Moore's law)
  – Wide variation in memory systems, types of cores, etc.
• No single algorithm can be the best in all cases (a toy instance of such a choice is sketched below)
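The sketch: a C++ sort that chooses between two algorithms at a tunable cutoff (my illustration, not PetaBricks output; CUTOFF and the function names are invented). Below the cutoff, insertion sort wins on constant factors; above it, merge sort wins asymptotically, and the best cutoff depends on the machine:

    #include <algorithm>
    #include <vector>

    int CUTOFF = 64;   // tunable choice parameter (assumed default)

    void insertion_sort(std::vector<int>& a, int lo, int hi) {
        for (int i = lo + 1; i < hi; ++i)
            for (int j = i; j > lo && a[j - 1] > a[j]; --j)
                std::swap(a[j - 1], a[j]);
    }

    void hybrid_sort(std::vector<int>& a, int lo, int hi) {
        if (hi - lo <= CUTOFF) {            // the algorithmic choice point
            insertion_sort(a, lo, hi);
            return;
        }
        int mid = lo + (hi - lo) / 2;
        hybrid_sort(a, lo, mid);            // the two halves are independent
        hybrid_sort(a, mid, hi);            // and could be sorted in parallel
        std::inplace_merge(a.begin() + lo, a.begin() + mid, a.begin() + hi);
    }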
Observation 2: Natural Parallelism
• The world is a parallel place
  – It is natural to many, e.g. mathematicians
  – ∑, sets, simultaneous equations, etc.
• It seems that computer scientists have a hard time thinking in parallel
  – We have unnecessarily imposed sequential ordering on the world
    – Statements executed in sequence
    – for i = 1 to n
    – Recursive decomposition (given f(n), find f(n+1))
• This was once useful for limiting complexity, but it is a big problem in the era of multicores

Observation 3: Autotuning
• The good old days: model-based optimization
• Now
  – Machines are too complex to model accurately
  – Compiler passes have many subtle interactions
  – Thousands of knobs and billions of choices
• But…
  – Computers are cheap
  – We can do end-to-end execution of multiple runs
  – Then use machine learning to find the best choice

PetaBricks Language
• Implicitly parallel description
• Algorithmic choice: one transform, several rules: a base case plus recursive decompositions in c, w, and h (built up over four slides; shown here once in full)

    transform MatrixMultiply
    from A[c,h], B[w,c]
    to AB[w,h]
    {
      // Base case, compute a single element
      to(AB.cell(x,y) out)
      from(A.row(y) a, B.column(x) b) {
        out = dot(a, b);
      }

      // Recursively decompose in c
      to(AB ab)
      from(A.region(0,   0,   c/2, h  ) a1,
           A.region(c/2, 0,   c,   h  ) a2,
           B.region(0,   0,   w,   c/2) b1,
           B.region(0,   c/2, w,   c  ) b2) {
        ab = MatrixAdd(MatrixMultiply(a1, b1),
                       MatrixMultiply(a2, b2));
      }

      // Recursively decompose in w
      to(AB.region(0,   0, w/2, h) ab1,
         AB.region(w/2, 0, w,   h) ab2)
      from(A a,
           B.region(0,   0, w/2, c) b1,
           B.region(w/2, 0, w,   c) b2) {
        ab1 = MatrixMultiply(a, b1);
        ab2 = MatrixMultiply(a, b2);
      }

      // Recursively decompose in h
      to(AB.region(0, 0,   w, h/2) ab1,
         AB.region(0, h/2, w, h  ) ab2)
      from(A.region(0, 0,   c, h/2) a1,
           A.region(0, h/2, c, h  ) a2,
           B b) {
        ab1 = MatrixMultiply(a1, b);
        ab2 = MatrixMultiply(a2, b);
      }
    }

[Diagrams accompanying the rules: A is c wide by h tall, B is w wide by c tall, AB is w wide by h tall; the decompositions split A into a1/a2, B into b1/b2, and AB into ab1/ab2 along the corresponding dimension.]
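To relate the rules to ordinary code, here is a rough, runnable C++ analogue of two of the choices (my sketch, not compiler output; THRESHOLD stands in for the choice configuration the autotuner would supply):

    #include <cstddef>
    #include <vector>

    std::size_t THRESHOLD = 32;   // assumed tunable cutoff

    // Row-major matrices: A is h x c, B is c x w, AB is h x w.
    // Computes output columns [w0, w1) of AB = A * B.
    void matmul(const std::vector<double>& A, const std::vector<double>& B,
                std::vector<double>& AB,
                std::size_t h, std::size_t c, std::size_t w,
                std::size_t w0, std::size_t w1) {
        if (w1 - w0 <= THRESHOLD) {             // "base case" rule
            for (std::size_t y = 0; y < h; ++y)
                for (std::size_t x = w0; x < w1; ++x) {
                    double dot = 0;
                    for (std::size_t k = 0; k < c; ++k)
                        dot += A[y * c + k] * B[k * w + x];
                    AB[y * w + x] = dot;
                }
            return;
        }
        std::size_t mid = (w0 + w1) / 2;        // "decompose in w" rule:
        matmul(A, B, AB, h, c, w, w0, mid);     // the halves write disjoint
        matmul(A, B, AB, h, c, w, mid, w1);     // output regions, so they
    }                                           // could run in parallel

In PetaBricks, the compiler and autotuner make this choice (and the c and h decompositions) per call and per problem size, instead of the programmer hard-coding one.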
PetaBricks Compiler Internals
[Diagram: compiler passes split PetaBricks source code into rule/transform headers and rule bodies; rule bodies are lowered to an IR, and further passes build ChoiceGrids and a Choice Dependency Graph; code generation emits C++ with parallel dynamically scheduled code and sequential leaf code.]

Choice Grids

    transform RollingSum
    from A[n]
    to B[n]
    {
      Rule1: to(B.cell(i) b) from(B.cell(i-1) left, A.cell(i) a) { … }
      Rule2: to(B.cell(i) b) from(A.region(0, i) as) { … }
    }

[Choice grid: A is the input over [0, n); for B, cell 0 can only use Rule2, while cells [1, n) can use Rule1 or Rule2.]

Choice Dependency Graph
(Same RollingSum transform as above.)
[Diagram: the choice grid regions of B connected to the input and to each other by dependency edges labeled (r1, <), (r1, =), (r2, <=), (r2, =), and (r1, =, -1).]

PetaBricks Autotuning
[Diagram: the compilation flow above extended with an autotuner that drives the compiled user code through the parallel runtime engine and writes a choice configuration file.]

PetaBricks Execution
[Diagram: at execution time the choice configuration file prunes the choice dependency graph, and the compiled user code runs on the parallel runtime engine.]

Experimental Setup
• Test system
  – Dual quad-core (8 cores) Xeon X5460 @ 3.16GHz w/ 8GB RAM
  – CSAIL Debian 4.0 (etch), kernel 2.6.18
• Training
  – Using our hybrid genetic tuner
  – Trained using all 8 cores
  – Training times varied from ~1 min to ~1 hour

Sort
[Graph: time (0 to 0.010) vs. input size (0 to 2000) for insertion sort, quick sort, merge sort, and radix sort; a second slide adds the autotuned version, which tracks the fastest algorithm at every size.]

Eigenvector Solve
[Graph: time (0 to 0.05) vs. size (0 to 500) for bisection, DC, and QR; a second slide adds the autotuned version.]

Poisson
[Graph, log scale: time (1.526e-05 to 256) vs. matrix size (3 to 2049) for direct, Jacobi, SOR, and multigrid solvers; a second slide adds the autotuned version.]

Scalability
[Graph: speedup (1 to 8) vs. number of cores (1 to 8) for MM, Sort, Poisson, and Eigenvector Solve.]

Impact of Autotuning
• Custom hybrid genetic tuner
• Huge gains by training on the target architecture:

    Run On \ Trained On              SunFire T200 Niagara (8 cores)   Xeon E7340 (8 cores)
    SunFire T200 Niagara (8 cores)   1.00x                            0.72x
    Xeon E7340 (8 cores)             0.43x                            1.00x
    Xeon E7340 (1 core)                                               0.30x

Related Work
• SPARSITY, OSKI – sparse matrices
• ATLAS, FLAME – linear algebra
• FFTW
• STAPL – template framework library
• SPL – digital signal processing
• High-level optimization via automated statistical modeling (Eric Brewer)
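To make the autotuning loop of Observation 3 concrete: a naive exhaustive tuner for the hybrid_sort cutoff from the earlier sketch (illustrative only; PetaBricks' hybrid genetic tuner searches a far larger choice space):

    #include <chrono>
    #include <cstdlib>
    #include <iostream>
    #include <vector>

    extern int CUTOFF;                              // from the hybrid_sort sketch
    void hybrid_sort(std::vector<int>& a, int lo, int hi);

    int main() {
        std::vector<int> input(1 << 20);
        for (int& x : input) x = std::rand();

        int best_cutoff = 0;
        double best_time = 1e30;
        for (int cutoff = 4; cutoff <= 512; cutoff *= 2) {
            CUTOFF = cutoff;
            std::vector<int> a = input;             // fresh copy per trial
            auto t0 = std::chrono::steady_clock::now();
            hybrid_sort(a, 0, (int)a.size());
            auto t1 = std::chrono::steady_clock::now();
            double s = std::chrono::duration<double>(t1 - t0).count();
            if (s < best_time) { best_time = s; best_cutoff = cutoff; }
        }
        std::cout << "best cutoff: " << best_cutoff
                  << " (" << best_time << " s)\n";
    }

Because the measurements are end-to-end runs on the machine itself, the chosen cutoff reflects that machine's cache sizes and core count; that is why the cross-training table above shows slowdowns when a configuration trained on one architecture runs on another.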
Outline
• The Multicore Menace
• Deterministic Multithreading via Kendo
• Algorithmic Choices via PetaBricks
• Conquering the Multicore Menace

Conquering the Menace
• Parallelism extraction
  – The world is parallel, but most computer science is based on sequential thinking
  – Parallel languages – a natural way to describe the maximal concurrency in the problem
  – Parallel thinking – education in theory, algorithms, and data structures
• Parallelism management
  – Mapping algorithmic parallelism to a given architecture
  – New hardware support – easier to enforce correctness, reduced cost of bad decisions
  – A universal parallel compiler

Hardware Opportunities
• Don't have to contend with uniprocessors
• Not your same old multiprocessor problem
  – How does going from multiprocessors to multicores impact programs?
  – What changed? Where is the impact?
    – Communication bandwidth
    – Communication latency

Communication Bandwidth
• How much data can be communicated between two cores?
• What changed?
  – Number of wires
  – Clock rate
  – Multiplexing
• Impact on programming model?
  – Massive data exchange is possible
  – Data movement is not the bottleneck, so processor affinity is not that important
[Figure: from 32 gigabits/sec to ~300 terabits/sec, a 10,000x increase.]

Parallel Language Opportunities
• We need a lot more innovation! Languages that…
  – require no non-intuitive reorganization of data or code
  – make the programmer focus on concurrency, but not performance: off-load the parallelism and performance issues to the compiler (akin to ILP compilation for VLIW machines)
  – eliminate hard problems such as race conditions and deadlocks (akin to the elimination of memory bugs in Java)
  – inform the programmer if they have done something illegal (akin to a type system or runtime null-pointer checks)
  – take advantage of domains to reduce the parallelization burden (akin to the StreamIt language for the streaming domain)
  – use novel hardware to eliminate problems and help the programmer (akin to cache coherence hardware)

Compilation Opportunities
• A universal parallel compiler should do what GCC does for uniprocessors:
  – Easily portable to any uniprocessor
  – Able to obtain respectable performance
  – A single program (in C) runs on all uniprocessors
• MultiCompiler: a universal compiler for parallel systems
  – The language exposes maximal parallelism; the compiler manages it
  – Unlike on uniprocessors, many single decisions are performance critical
  – Candidates: don't bind a single decision, keep multiple tracks
  – Learning: learn and improve heuristics
  – Adaptation: dynamically choose candidates and adapt the program to resources and runtime conditions

Conclusions
• Kendo
  – The first system to efficiently provide weak determinism on commodity hardware
  – Provides a systematic method of reproducing many non-deterministic bugs
  – Incurs modest performance overhead when running on 4 processors
  – This low overhead makes it possible to leave it on while an application is deployed
• PetaBricks
  – The first language where micro-level algorithmic choice can be naturally expressed
  – Autotuning can find the best choice
  – Can switch between choices as the solution is constructed
• Switching to multicores without losing the gains in programmer productivity may be the grandest of the Grand Challenges
  – Half a century of work, still no winning solution
  – Will affect everyone!
  – A lot more work to do to solve this problem!!!

http://groups.csail.mit.edu/commit/