[Title graphic: Parallel Applications, Parallel Software, Parallel Hardware]

The Parallel Computing Laboratory: A Research Agenda based on the Berkeley View
Krste Asanovic, Ras Bodik, Jim Demmel, Tony Keaveny, Kurt Keutzer, John Kubiatowicz, Edward Lee, Nelson Morgan, George Necula, Dave Patterson, Koushik Sen, John Wawrzynek, David Wessel, and Kathy Yelick
April 28, 2008

Outline
- Overview of Par Lab
  - Motivation & scope
  - Driving applications
  - Need for parallel libraries and frameworks
- Parallel libraries
  - Success metric
  - High performance (speed and accuracy)
  - Autotuning
  - Required functionality
  - Ease of use
- Summary of meeting goals, other talks
- Identify opportunities for collaboration

A Parallel Revolution, Ready or Not
- Old Moore's Law is over: no more doubling the speed of sequential code every 18 months
- New Moore's Law is here: 2x processors ("cores") per chip every technology generation, but at the same clock rate
- A sea change for the HW & SW industries, since it changes the model of programming and debugging

"Motif" Popularity
How do compelling apps relate to the 13 motifs?
[Figure: heat map (red = hot, blue = cool) of the 13 motifs (Finite State Machine, Combinational, Graph Traversal, Structured Grid, Dense Matrix, Sparse Matrix, Spectral (FFT), Dynamic Programming, N-Body, MapReduce, Backtrack/Branch & Bound, Graphical Models, Unstructured Grid) against Embed, SPEC, DB, Games, ML, HPC and the Par Lab apps Health, Image, Speech, Music, Browser]

Choosing Driving Applications
- "Who needs 100 cores to run M/S Word?" We need compelling apps that use 100s of cores.
- How did we pick applications?
  1. An enthusiastic expert application partner, a leader in the field, who promises to help design, use, and evaluate our technology
  2. Compelling in terms of likely market or social impact, with short-term feasibility and longer-term potential
  3. Requires significant speedup, or a smaller, more efficient platform, to work as intended
  4. As a whole, the applications cover the most important platforms (handheld, laptop, games) and markets (consumer, business, health)

Compelling Client Applications
- Image retrieval: query by example against an image database of 1000s of images
- Music/hearing
- Robust speech input
- Parallel browser
- Personal health

Par Lab Research Overview
Easy to write correct programs that run efficiently on manycore.
[Stack diagram:]
- Applications: Personal Health, Image Retrieval, Hearing/Music, Speech, Parallel Browser
- Motifs
- Productivity layer: Composition & Coordination Language (C&CL), C&CL compiler/interpreter, parallel libraries, parallel frameworks
- Efficiency layer: efficiency languages, sketching, autotuners, legacy code, schedulers, communication & synchronization primitives, efficiency language compilers
- OS: OS libraries & services, legacy OS, hypervisor
- Hardware: multicore/GPGPU, RAMP Manycore
- Crosscutting: correctness (static verification, type systems, directed testing, dynamic checking, debugging with replay); diagnosing power/performance
Developing Parallel Software
- 2 types of programmers, hence 2 layers
- Efficiency layer (10% of today's programmers)
  - Expert programmers build frameworks & libraries, hypervisors, ...
  - "Bare metal" efficiency is possible at the efficiency layer
- Productivity layer (90% of today's programmers)
  - Domain experts / naive programmers productively build parallel apps using frameworks & libraries
  - Frameworks & libraries are composed to form app frameworks
- Effective composition techniques allow the efficiency programmers to be highly leveraged
  - Create a language for Composition and Coordination (C&C)
- Talk by Kathy Yelick

Par Lab Research Overview (recap)

Outline (recap)

Success Metric: Impact
- LAPACK and ScaLAPACK are widely used
  - Adopted by Cray, Fujitsu, HP, IBM, IMSL, MathWorks, NAG, NEC, SGI, ...
  - >86M web hits @ Netlib (incl. CLAPACK, LAPACK95), 35K hits/day
[Figure: Cosmic Microwave Background analysis, BOOMERanG collaboration, MADCAP code (Apr. 27, 2000), built on ScaLAPACK. Xiaoye Li: sparse LU]

High Performance (Speed and Accuracy)
- Matching algorithms to architectures (8 talks)
  - Autotuning: generate fast algorithms automatically, depending on architecture and problem
  - Communication-Avoiding Linear Algebra: avoiding latency and bandwidth costs
- Faster algorithms (2 talks)
  - Symmetric eigenproblem (O(n²) instead of O(n³))
  - Sparse LU factorization
- More accurate algorithms (2 talks)
  - Either at "usual" speed, or at any cost
- Structure-exploiting algorithms
  - roots(p) (O(n²) instead of O(n³))

Automatic Performance Tuning
- Writing high-performance software is hard
  - Ideal: get a high fraction of peak performance from one algorithm
  - Reality: the best algorithm (and its implementation) can depend strongly on the problem, the computer architecture, the compiler, ...
  - The best choice can require knowing a lot of applied mathematics and computer science, and it changes with each new hardware and compiler release
- Goal: automation
  - Generate and search a space of algorithms (a toy search loop is sketched below)
  - Past successes: PHiPAC, ATLAS, FFTW, Spiral
  - Many conferences, DOE projects, ...
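To make "generate and search" concrete, the toy sketch below times one candidate per block size for a cache-blocked matrix multiply and keeps the fastest. The kernel, problem size, and candidate list are ours for illustration only; real autotuners such as PHiPAC and ATLAS search far richer spaces (unrollings, data structures, prefetch distances) and generate the candidate code itself.

```c
#include <stdio.h>
#include <time.h>

#define N 512   /* divisible by every candidate block size below */

static double a[N][N], b[N][N], c[N][N];

/* One candidate point in the search space: cache-blocked matmul. */
static void matmul_blocked(int bs) {
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) c[i][j] = 0.0;
    for (int ii = 0; ii < N; ii += bs)
        for (int kk = 0; kk < N; kk += bs)
            for (int jj = 0; jj < N; jj += bs)
                for (int i = ii; i < ii + bs; i++)
                    for (int k = kk; k < kk + bs; k++)
                        for (int j = jj; j < jj + bs; j++)
                            c[i][j] += a[i][k] * b[k][j];
}

int main(void) {
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) { a[i][j] = 1.0; b[i][j] = 2.0; }

    int candidates[] = {8, 16, 32, 64, 128};
    int best = 0;
    double best_time = 1e30;
    for (int t = 0; t < 5; t++) {              /* the "planner" loop */
        clock_t start = clock();
        matmul_blocked(candidates[t]);
        double secs = (double)(clock() - start) / CLOCKS_PER_SEC;
        printf("block size %3d: %.3f s\n", candidates[t], secs);
        if (secs < best_time) { best_time = secs; best = candidates[t]; }
    }
    printf("best block size on this machine: %d\n", best);
    return 0;
}
```

The point is the planner loop: the best block size is a property of the machine, discovered by measurement rather than analysis.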
The Difficulty of Tuning SpMV
- Sparse matrix-vector multiply: y <- y + A*x; for all nonzero A(i,j): y(i) += A(i,j) * x(j)
- Reference implementation, compressed sparse row (CSR):

      for each row i:
          t = 0
          for k = row[i] to row[i+1]-1:
              t += A[k] * x[J[k]]
          y[i] = t

- Idea: exploit 8x8 dense blocks (a small register-blocked kernel is sketched at the end of this section)

Speedups on Itanium 2: The Need for Search
[Figure: SpMV performance on Itanium 2, matrix raefsky3, across register block sizes. The reference code reaches 7.6% of peak (Mflop/s); the best block size, 4x2, reaches 31.1% of peak.]

More Surprises Tuning SpMV: Extra Work Can Improve Efficiency
- Example: 3x3 blocking on a logical grid of 3x3 cells, padding partially full blocks with explicit zeros
- "Fill ratio" = 1.5, i.e., 1.5x as many flops
- Yet on a Pentium III this gives a 1.5x speedup (2/3 the time), so the flop rate rises by 1.5² = 2.25x

Autotuned Performance of SpMV
[Figure: GFlop/s on Intel Clovertown, AMD Opteron, and Sun Niagara2 (Huron) for the matrices Dense, Protein, FEM-Sphr, FEM-Cant, Tunnel, FEM-Harbor, QCD, FEM-Ship, Econ, Epidem, FEM-Accel, Circuit, Webbase, LP, and their median. Optimizations stacked over naive and naive-Pthreads code: +NUMA/affinity, +SW prefetching, +compression, +cache/TLB blocking, +more DIMMs (Opteron), +FW fix and array padding (Niagara2), etc.]

Autotuning SpMV
- Large search space of possible optimizations; large speedups possible
- Parallelism adds even more!
- Later talks:
  - Sam Williams on tuning SpMV for a variety of multicore and other platforms
  - Ankit Jain on an easy-to-use system for incorporating autotuning into applications
  - Kaushik Datta on tuning the special case of stencils
  - Rajesh Nishtala on tuning collective communications
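To show what one point in that search space looks like, here is a minimal 2x2 register-blocked (BCSR) SpMV kernel, the kind of variant an autotuner such as OSKI generates and benchmarks. The tiny matrix, and the explicit zero padding in block (1,0), are our own toy example of the fill-ratio tradeoff described above.

```c
#include <stdio.h>

int main(void) {
    /* A 4x4 sparse matrix stored as dense 2x2 blocks (BCSR):
       block row i covers rows 2i..2i+1; brow[] delimits its blocks. */
    int    brow[] = {0, 1, 3};      /* block-row pointers                 */
    int    bcol[] = {0, 0, 1};      /* block-column indices               */
    double val[]  = {4, 1, 1, 4,    /* block (0,0)                        */
                     2, 0, 0, 0,    /* block (1,0), padded with zeros     */
                     4, 1, 1, 4};   /* block (1,1)                        */
    double x[4] = {1, 1, 1, 1}, y[4] = {0, 0, 0, 0};

    for (int i = 0; i < 2; i++)                       /* each block row   */
        for (int k = brow[i]; k < brow[i + 1]; k++) {
            const double *blk = &val[4 * k];
            const double *xx  = &x[2 * bcol[k]];
            /* Fully unrolled 2x2 block multiply: values stay in
               registers, and the inner loop has no branches. */
            y[2*i]     += blk[0] * xx[0] + blk[1] * xx[1];
            y[2*i + 1] += blk[2] * xx[0] + blk[3] * xx[1];
        }

    printf("y = %g %g %g %g\n", y[0], y[1], y[2], y[3]);  /* 5 5 7 5 */
    return 0;
}
```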
But don't you still have to write difficult code to generate the search space?

Program Synthesis
- The best implementation and data structure are hard to write and hard to identify; don't do this by hand
- Sketching: code generation using 2QBF
  - Spec: a simple implementation (a 3-loop 3D stencil)
  - Sketch: an optimized skeleton (5 loops, missing some indices/bounds)
  - Result: optimized code (tiled, prefetched, time-skewed)
- Talk by Armando Solar-Lezama / Ras Bodik on program synthesis by sketching, applied to stencils

Communication-Avoiding Linear Algebra
- Exponentially growing gaps:
  - Floating point time << 1/network BW << network latency (improving 59%/year vs. 26%/year vs. 15%/year)
  - Floating point time << 1/memory BW << memory latency (improving 59%/year vs. 23%/year vs. 5.5%/year)
- Goal: reorganize linear algebra to avoid communication
  - Not just hiding communication (speedup of at most 2x)
  - Arbitrary speedups are possible
  - Possible for both dense and sparse linear algebra

CALU Summary (1/4)
- QR or LU decomposition of an m x n matrix, m >> n
- Parallel implementation
  - Conventional: O(n log p) messages; new: O(log p) messages, which is optimal
  - Performance: QR 5x faster on a cluster, LU 7x faster on a cluster
- Serial implementation, with fast memory of size W
  - Conventional: O(mn/W) moves of data from slow to fast memory, where mn/W is how many times larger the matrix is than fast memory; new: O(1) moves of data
  - Performance: out-of-core QR only 2x slower than having enough DRAM
- Expect gains with multicore as well
- Price:
  - Some redundant computation (but flops are cheap!)
  - A different, tree-structured representation of the answer for QR (a minimal sketch appears after these summaries)
  - LU stable in practice so far, but not GEPP

CALU Summary (2/4)
- QR or LU decomposition of an n x n matrix
- Communication lower by a factor of b = block size; lots of speedup possible (modeled and measured)
- Modeled speedups of new QR over ScaLAPACK:
  - IBM Power 5 (512 procs): up to 9.7x
  - Petascale (8K procs): up to 22.9x
  - Grid (128 procs): up to 11x
- Measured and modeled speedups of new LU over ScaLAPACK:
  - IBM Power 5 (Bassi): up to 2.3x (measured)
  - Cray XT4 (Franklin): up to 1.8x (measured)
  - Petascale (8K procs): up to 80x (modeled)
- Speed limit: Cholesky? Matmul?
- Extends to sparse LU; communication is more dominant there, so the payoff may be higher
  - Speed limit: sparse Cholesky?
  - Talk by Xiaoye Li on an alternative

CALU Summary (3/4)
- Take k steps of a Krylov subspace method: GMRES, CG, Lanczos, Arnoldi
- Assume the matrix is "well-partitioned," with a modest surface-to-volume ratio
- Serial implementation
  - Conventional: O(k) moves of data from slow to fast memory; new: O(1) moves, which is optimal
- Parallel implementation
  - Conventional: O(k log p) messages; new: O(log p) messages, which is optimal
- Lots of speedup possible (modeled and measured)
- Price: some redundant computation
- Need to be able to "compress" interactions between distant i, j: hierarchical, semiseparable matrices, ...
- Can incorporate some preconditioners
- Talks by Marghoob Mohiyuddin and Mark Hoemmen

CALU Summary (4/4)
- Lots of related work, some going back to the 1960s; the reports discuss it comprehensively, so we will not
- Our contributions:
  - Several new algorithms, and improvements on old ones
  - Unifying the parallel and sequential approaches to avoiding communication
  - A systematic examination of as much of linear algebra as we can
- The time for these algorithms has come, because of growing communication costs
- Why just linear algebra?
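As a concrete illustration of the tree-structured QR above, here is a minimal flat-tree TSQR sketch, assuming LAPACKE is available (link with -llapacke). The two halves of the matrix stand in for two processors: each is factored independently, and only the small N x N R factors are combined, which is where the drop from O(n log p) to O(log p) messages comes from. The test matrix and sizes are ours for illustration.

```c
#include <lapacke.h>
#include <stdio.h>
#include <string.h>

#define M 8   /* rows (tall)   */
#define N 3   /* cols (skinny) */

/* Copy the upper-triangular R factor out of a row-major dgeqrf result. */
static void extract_r(const double *a, double r[N][N]) {
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            r[i][j] = (j >= i) ? a[i * N + j] : 0.0;
}

int main(void) {
    double a[M][N], tau[N], r1[N][N], r2[N][N], stacked[2 * N][N];
    for (int i = 0; i < M; i++)
        for (int j = 0; j < N; j++)
            a[i][j] = 1.0 / (1.0 + i + j);   /* any tall-skinny test matrix */

    /* Step 1: independent local QRs of the top and bottom halves;
       these could run on different processors with no communication. */
    LAPACKE_dgeqrf(LAPACK_ROW_MAJOR, M / 2, N, &a[0][0], N, tau);
    extract_r(&a[0][0], r1);
    LAPACKE_dgeqrf(LAPACK_ROW_MAJOR, M / 2, N, &a[M / 2][0], N, tau);
    extract_r(&a[M / 2][0], r2);

    /* Step 2: one QR of the two stacked R factors yields R of all of A
       (up to row signs); Q is kept implicitly, in tree-structured form. */
    memcpy(&stacked[0][0], r1, sizeof r1);
    memcpy(&stacked[N][0], r2, sizeof r2);
    LAPACKE_dgeqrf(LAPACK_ROW_MAJOR, 2 * N, N, &stacked[0][0], N, tau);

    for (int i = 0; i < N; i++)
        printf("R[%d] = %9.5f %9.5f %9.5f\n",
               i, stacked[i][0], stacked[i][1], stacked[i][2]);
    return 0;
}
```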
Linear Algebra on GPUs
- An important part of the architectural space to explore; talk by Vasily Volkov
- NVIDIA has licensed our BLAS (SGEMM)
- Fastest implementations of dense LU, Cholesky, QR: 80-90% of "peak"
- They require various optimizations special to the GPU:
  - Use the CPU for BLAS1 and BLAS2, the GPU for BLAS3
  - In LU, replace TRSM by TRTRI + GEMM (about as stable as GEPP)

High Performance (Speed and Accuracy) (recap)

Faster Algorithms
- MRRR algorithm for the symmetric eigenproblem: talk by Osni Marques; B. Parlett / I. Dhillon / C. Voemel (2006 SIAM Linear Algebra Prize for Parlett and Dhillon)
- Up to 10x faster HQR: R. Byers / R. Mathias / K. Braman (2003 SIAM Linear Algebra Prize)
- Extensions to QZ: B. Kågström / D. Kressner / T. Adlerborn
- Parallel sparse LU (highlights): talk by Xiaoye Li
- Faster Hessenberg, tridiagonal, and bidiagonal reductions: R. van de Geijn / E. Quintana-Orti; C. Bischof / B. Lang; G. Howell / C. Fulton

Collaborators
- UC Berkeley: Jim Demmel, Ming Gu, W. Kahan, Beresford Parlett, Xiaoye Li, Osni Marques, Yozo Hida, Jason Riedy, Vasily Volkov, Christof Voemel, David Bindel, undergrads, ...
- U Tennessee, Knoxville: Jack Dongarra, Julien Langou, Julie Langou, Piotr Luszczek, Stan Tomov, Alfredo Buttari, Jakub Kurzak
- Other academic institutions: UT Austin, UC Davis, CU Denver, Florida IT, Georgia Tech, U Kansas, U Maryland, North Carolina SU, UC Santa Barbara; TU Berlin, ETH, U Electro-Communications (Japan), FU Hagen, U Carlos III Madrid, U Manchester, U Umeå, U Wuppertal, U Zagreb
- Research institutions: INRIA, LBL
- Industrial partners (predating Par Lab): Cray, HP, Intel, Interactive Supercomputing, MathWorks, NAG, NVIDIA

High Performance (Speed and Accuracy) (recap)

More Accurate Algorithms
- Motivation: user requests, debugging
- Iterative refinement for Ax=b and least squares: "promise" the right answer for O(n²) additional cost
  - Talk by Jason Riedy (a bare-bones refinement loop is sketched below)
- Arbitrary-precision versions of everything, using your favorite multiple-precision package
  - Talk by Yozo Hida
- Jacobi-based SVD: faster than QR, and can be arbitrarily more accurate (Drmac / Veselic)
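For the iterative-refinement bullet above, here is a bare-bones sketch using LAPACKE and CBLAS (link with -llapacke -lcblas). One caveat: the guaranteed accuracy in the Riedy approach comes from computing the residual in extra precision; this sketch computes it in plain double, so it only illustrates the control flow, not the accuracy claim.

```c
#include <lapacke.h>
#include <cblas.h>
#include <stdio.h>
#include <string.h>

#define N 3

int main(void) {
    double a[N * N]  = {4, 1, 0,
                        1, 4, 1,
                        0, 1, 4};
    double lu[N * N];                 /* factored copy; a stays intact  */
    double b[N] = {1, 2, 3}, x[N], r[N];
    lapack_int ipiv[N];

    memcpy(lu, a, sizeof a);
    LAPACKE_dgetrf(LAPACK_ROW_MAJOR, N, N, lu, N, ipiv);   /* A = P*L*U */
    memcpy(x, b, sizeof b);
    LAPACKE_dgetrs(LAPACK_ROW_MAJOR, 'N', N, 1, lu, N, ipiv, x, 1);

    for (int it = 0; it < 3; it++) {
        memcpy(r, b, sizeof b);                      /* r = b           */
        cblas_dgemv(CblasRowMajor, CblasNoTrans, N, N,
                    -1.0, a, N, x, 1, 1.0, r, 1);    /* r = b - A*x     */
        /* Ideally r would be computed in extra precision here. */
        LAPACKE_dgetrs(LAPACK_ROW_MAJOR, 'N', N, 1,
                       lu, N, ipiv, r, 1);           /* solve A*d = r   */
        for (int i = 0; i < N; i++) x[i] += r[i];    /* x = x + d       */
    }
    printf("x = %g %g %g\n", x[0], x[1], x[2]);
    return 0;
}
```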
What Could Go into Linear Algebra Libraries?
- For all linear algebra problems, for all matrix/problem structures, for all data types, for all architectures and networks, for all programming interfaces
- Produce the best algorithm(s) with respect to performance and accuracy (including condition estimates, etc.)
- Need to prioritize, automate, and enlist help!

What Do Users Want? (1/2)
- Performance, ease of use, functionality, portability
- Composability
  - On multicore, we expect to implement dense codes via DAG scheduling (Dongarra's PLASMA)
  - Talk by Krste Asanovic / Heidi Pan on threads
- Reproducibility: made challenging by the nonassociativity of floating point
- Ongoing collaborations on the driving apps, jointly analyzing needs
  - Talk by T. Keaveny on the medical application
  - Other apps so far: mostly dense and sparse linear algebra and FFTs, with some interesting structured needs emerging

What Do Users Want? (2/2)
- DOE/NSF user survey: a small but interesting sample, at www.netlib.org/lapack-dev
  - What matrix sizes do you care about? 1000s: 34%; 10,000s: 26%; 100,000s or 1Ms: 26%
  - How many processors, on distributed memory? >10: 34%; >100: 31%; >1000: 19%
  - Do you use more than double precision? Sometimes or frequently: 16%
- New graduate program in CSE with 106 faculty from 18 departments; new needs may emerge

Highlights of New Dense Functionality
- Updating/downdating of factorizations: Stewart, Langou
- More generalized SVDs: Bai, Wang
- More generalized Sylvester/Lyapunov equations: Kågström, Jonsson, Granat
- Structured eigenproblems: selected matrix polynomials: Mehrmann

Organizing Linear Algebra
- www.netlib.org/lapack
- www.netlib.org/scalapack
- gams.nist.gov
- www.netlib.org/templates
- www.cs.utk.edu/~dongarra/etemplates

Improved Ease of Use
Which do you prefer?

    A\B

    CALL PDGESV( N, NRHS, A, IA, JA, DESCA, IPIV, B, IB, JB, DESCB, INFO )

    CALL PDGESVX( FACT, TRANS, N, NRHS, A, IA, JA, DESCA, AF, IAF, JAF, DESCAF,
                  IPIV, EQUED, R, C, B, IB, JB, DESCB, X, IX, JX, DESCX,
                  RCOND, FERR, BERR, WORK, LWORK, IWORK, LIWORK, INFO )

Ease of Use: One Approach
- Easy interfaces vs. access to details
  - Some users want access to all the details: peak performance matters, as does control over memory allocation
  - Other users want a "simpler" interface, e.g., automatic allocation of workspace
- There is no universal agreement across systems on the "easiest interface"; leave that decision to higher-level packages
- Keep the expert driver / simple driver / computational routines (a toy simple-driver wrapper is sketched below)
- Add wrappers for other languages: Fortran 95, Java, Matlab, Python, even C
- Add wrappers to convert to the "best" parallel layout
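To illustrate the simple-driver idea, here is a toy wrapper in the spirit of A\B built on LAPACKE's dgesv driver. The function name solve, the copy-the-inputs policy, and the row-major convention are our choices for illustration, not part of any LAPACK interface.

```c
#include <lapacke.h>
#include <stdio.h>
#include <stdlib.h>

/* Solve A*x = b for square row-major A; returns 0 on success.
   Pivots and workspace are hidden, and A and b are left untouched. */
static int solve(int n, const double *A, const double *b, double *x) {
    double *lu = malloc(sizeof(double) * n * n);
    lapack_int *ipiv = malloc(sizeof(lapack_int) * n);
    if (!lu || !ipiv) { free(lu); free(ipiv); return -1; }

    for (int i = 0; i < n * n; i++) lu[i] = A[i];   /* keep caller's A  */
    for (int i = 0; i < n; i++)     x[i]  = b[i];   /* dgesv overwrites */

    lapack_int info = LAPACKE_dgesv(LAPACK_ROW_MAJOR, n, 1, lu, n, ipiv, x, 1);
    free(lu);
    free(ipiv);
    return (int)info;
}

int main(void) {
    double A[4] = {2, 1,
                   1, 3};
    double b[2] = {3, 5}, x[2];
    if (solve(2, A, b, x) == 0)
        printf("x = %g %g\n", x[0], x[1]);   /* expect x = 0.8 1.4 */
    return 0;
}
```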
Outline (recap)

Some Goals for the Meeting
- Introduce Par Lab
- Describe the numerical library efforts in detail
- Exchange information: user needs, tools, goals
- Identify opportunities for collaboration

Summary of Other Talks (1)
Monday, April 28 (531 Cory)
- 12:00-12:45 Jim Demmel: Overview of Par Lab / numerical libraries
- 12:45-1:00 Avneesh Sud (Microsoft): Introduction to the library effort at Microsoft
- 1:00-1:45 Sam Williams / Ankit Jain: Tuning sparse matrix-vector multiply / Parallel OSKI
- 1:45-1:50 Break
- 1:50-2:20 Marghoob Mohiyuddin: Avoiding communication in SpMV-like kernels
- 2:20-2:50 Mark Hoemmen: Avoiding communication in Krylov subspace methods
- 2:50-3:00 Break
- 3:00-3:30 Rajesh Nishtala: Tuning collective communication
- 3:30-4:00 Yozo Hida: High-accuracy linear algebra
- 4:00-4:25 Jason Riedy: Iterative refinement in linear algebra
- 4:25-4:30 Break
- 4:30-5:00 Tony Keaveny: Medical image analysis in Par Lab
- 5:00-5:30 Ras Bodik / Armando Solar-Lezama: Program synthesis by sketching
- 5:30-6:00 Vasily Volkov: Linear algebra on GPUs

Summary of Other Talks (2)
Tuesday, April 29 (Wozniak Lounge)
- 9:00-10:00 Kathy Yelick: Programming systems for Par Lab
- 10:00-10:30 Kaushik Datta: Tuning stencils
- 10:30-11:00 Xiaoye Li: Parallel sparse LU factorization
- 11:00-11:30 Osni Marques: Parallel symmetric eigensolvers
- 11:30-12:00 Krste Asanovic / Heidi Pan: Thread system

EXTRA SLIDES

P.S. The Parallel Revolution May Fail
- John Hennessy, President, Stanford University, 1/07: "...when we start talking about parallelism and ease of use of truly parallel computers, we're talking about a problem that's as hard as any that computer science has faced. ... I would be panicked if I were in industry." ("A Conversation with Hennessy & Patterson," ACM Queue Magazine, 4:10, 1/07.)
- 100% failure rate of parallel computer companies: Convex, Encore, Inmos (Transputer), MasPar, NCUBE, Kendall Square Research, Sequent, (Silicon Graphics), Thinking Machines, ...
- What if IT goes from a growth industry to a replacement industry?
  - If SW can't effectively use 32, 64, ... cores per chip, SW is no faster on a new computer, and people only buy one when the old one wears out
[Figure: millions of PCs sold per year, 1985-2015, flattening if the parallel transition fails]

5 Themes of Par Lab
1. Applications: compelling apps drive a top-down research agenda
2. Identify common computational patterns: breaking through disciplinary boundaries
3. Developing parallel software with productivity, efficiency, and correctness: 2 layers + Coordination & Composition Language + autotuning
4. OS and architecture: composable primitives, not packaged solutions; deconstruction, fast barrier synchronization, partitions
5. Diagnosing power/performance bottlenecks

Compelling Laptop/Handheld Apps (David Wessel)
- Musicians have an insatiable appetite for computation: more channels, instruments, more processing, more interaction!
  - Latency must be low (5 ms), and it must be reliable (no clicks)
- Music enhancer
  - Enhanced sound delivery systems for home sound systems using large microphone and speaker arrays
  - Laptop/handheld recreates 3D sound over earbuds
- Hearing augmenter: laptop/handheld as an accelerator for a hearing aid
- Novel instrument user interface
  - New composition and performance systems beyond keyboards
  - Input device for laptop/handheld
- The Berkeley Center for New Music and Audio Technology (CNMAT) created a compact loudspeaker array: a 10-inch-diameter icosahedron incorporating 120 tweeters

Content-Based Image Retrieval (Kurt Keutzer)
- Pipeline: query by example -> similarity metric -> image database (1000s of images) -> candidate results -> relevance feedback -> final result
- Built around key characteristics of personal databases:
  - Very large number of pictures (>5K)
  - Non-labeled images
  - Many pictures of few people
  - Complex pictures including people, events, objects, and places

Coronary Artery Disease (Tony Keaveny)
- 450K deaths/year, 16M with symptoms, 72M with high blood pressure
- Modeling (before/after) to help patient compliance?
- Massively parallel, real-time variations
  - CFD and finite elements: solid (non-linear), fluid (Newtonian), pulsatile
  - Blood pressure, activity, habitus, cholesterol
Compelling Laptop/Handheld Apps (Nelson Morgan)
- Meeting diarist: laptops/handhelds at a meeting coordinate to create a speaker-identified, partially transcribed text diary of the meeting
- Teleconference speaker identifier, speech helper: laptops/handhelds used for a teleconference identify who is speaking and give a "closed caption" hint of what is being said

Compelling Laptop/Handheld Apps
- Health coach
  - Since the laptop/handheld is always with you: record images of all meals, weigh the plate before and after, and analyze the calories consumed so far
  - Record the amount of exercise so far, and show how your body would look if you maintained this exercise and diet pattern for the next 3 months
  - "What if I order a pizza for my next meal? A salad?" "What would I look like if I regularly ran less? Further?"
- Face recognizer / name whisperer: the laptop/handheld scans faces, matches against an image database, and whispers the name in your ear (relies on content-based image retrieval)

Parallel Browser (Ras Bodik)
- Goal: desktop-quality browsing on handhelds, enabled by 4G networks and better output devices
- Bottlenecks to parallelize: parsing, rendering, scripting
- "SkipJax": a parallel replacement for JavaScript/AJAX, based on Brown's FlapJax

Theme 2: Use Design Patterns Instead of Benchmarks? (Kurt Keutzer)
- How do we invent the parallel systems of the future while tied to the old code, programming models, and CPUs of the past?
- Look for common design patterns (see A Pattern Language, Christopher Alexander, 1977) across:
  1. Embedded computing (42 EEMBC benchmarks)
  2. Desktop/server computing (28 SPEC2006)
  3. Database / text mining software
  4. Games/graphics/vision
  5. Machine learning
  6. High performance computing (the original "7 dwarfs")
- Result: 13 "motifs" (we say "motif" instead of "dwarf" now that we've gone from 7 to 13)

"Motif" Popularity (recap)

4 Valuable Roles of Motifs
1. "Anti-benchmarks": motifs are not tied to code or language artifacts, so they encourage innovation in algorithms, languages, data structures, and/or hardware
2. A universal, understandable vocabulary, at least at a high level, for talking across disciplinary boundaries
3. Bootstrapping, i.e., parallelizing parallel research: allow analysis of HW & SW designs without waiting years for full apps
4. Targets for libraries (see later)

Themes 1 and 2 Summary
- Application-driven research (top down) vs. CS solution-driven research (bottom up)
- Drill down on (initially) 5 app areas to guide the research agenda
- Motifs guide the design of apps and are implemented via a programming framework per motif
- Motifs help break through traditional interfaces: benchmarking, multidisciplinary conversations, parallelizing parallel research, and a target for libraries
Par Lab Research Overview (recap)

Theme 3: Developing Parallel Software (Kurt Keutzer and Kathy Yelick)
- Observation: use motifs as design patterns
- The design patterns are implemented as 2 kinds of frameworks (a minimal sketch of the second kind follows this slide):
  1. Traditional numerical parallel libraries: dense matrices, sparse matrices, spectral, dynamic programming, (un)structured grids
  2. Function application libraries, which apply a user-supplied function to data in parallel: combinational, finite state machines, MapReduce, backtracking/B&B, N-body, graph traversal, graphical models
- Computations may be viewed at multiple levels: e.g., an FFT library may be built by instantiating a MapReduce library, mapping 1D FFTs and then transposing (a generalized reduce)
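Here is the promised sketch of a function-application framework: a minimal parallel map in C with pthreads that applies a user-supplied function to every element of an array, one thread per chunk. A real productivity-layer framework would add the composition and independence guarantees discussed below; the names and chunking policy here are ours.

```c
#include <pthread.h>
#include <stdio.h>

#define NTHREADS 4
#define NELEMS   16

typedef struct {
    double *data;              /* shared array                       */
    int lo, hi;                /* this thread's half-open range      */
    double (*fn)(double);      /* user-supplied function to apply    */
} chunk_t;

static void *worker(void *arg) {
    chunk_t *c = arg;
    for (int i = c->lo; i < c->hi; i++)
        c->data[i] = c->fn(c->data[i]);
    return NULL;
}

/* Apply fn to every element of data[0..n), one thread per chunk.
   Elements are independent, so there are no races by construction. */
static void parallel_map(double *data, int n, double (*fn)(double)) {
    pthread_t tid[NTHREADS];
    chunk_t chunk[NTHREADS];
    int per = n / NTHREADS;
    for (int t = 0; t < NTHREADS; t++) {
        chunk[t] = (chunk_t){ data, t * per,
                              (t == NTHREADS - 1) ? n : (t + 1) * per, fn };
        pthread_create(&tid[t], NULL, worker, &chunk[t]);
    }
    for (int t = 0; t < NTHREADS; t++)
        pthread_join(tid[t], NULL);
}

static double square(double x) { return x * x; }

int main(void) {
    double a[NELEMS];
    for (int i = 0; i < NELEMS; i++) a[i] = i;
    parallel_map(a, NELEMS, square);
    printf("a[15] = %g\n", a[15]);   /* 225 */
    return 0;
}
```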
Developing Parallel Software (recap: 2 types of programmers, 2 layers)

C&C Language Requirements (Kathy Yelick)
- The applications specify the C&C language requirements
- Constructs for creating application frameworks
- Primitive parallelism constructs: data parallelism, divide-and-conquer parallelism, event-driven execution
- Constructs for composing programming frameworks
  - Frameworks require independence; independence is proven at instantiation with a variety of techniques
- Needs low runtime overhead, and the ability to measure and control performance

Ensuring Correctness (Koushik Sen)
- Productivity layer: enforce independence of tasks using decomposition (partitioning) and copying operators
  - Goal: remove the chance for concurrency errors (e.g., nondeterminism from execution order, not just low-level data races)
- Efficiency layer: check for subtle concurrency bugs (races, deadlocks, and so on)
  - A mixture of verification and automated directed testing
  - Error detection on frameworks, with sequential code as the specification
  - Automatic detection of races and deadlocks

21st Century Code Generation (Demmel, Yelick)
- Problem: generating optimal code is like searching for a needle in a haystack, and manycore makes the haystack even more diverse
- New approach: "autotuners"
  - First generate program variations: combinations of optimizations (blocking, prefetching, ...) and data structures
  - Then compile and run them, heuristically searching for the best code for that computer
[Figure: search space for block sizes (dense matrix); the axes are block dimensions, the temperature is speed]
- Examples: PHiPAC (BLAS), ATLAS (BLAS), Spiral (DSP), FFTW (FFT)
- Example: fastest sparse matrix-vector multiply (SpMV) for 4 multicores; optimizations include BCOO vs. BCSR data structures, NUMA, 16-bit vs. 32-bit indices, ...

Par Lab Research Overview (recap)

Theme 4: OS and Architecture (Krste Asanovic, John Kubiatowicz)
- Traditional OSes are brittle, insecure memory hogs
  - A traditional monolithic OS image uses lots of precious memory, 100s-1000s of times over (e.g., AIX uses GBs of DRAM per CPU)
- How can novel architectural support improve productivity, efficiency, and correctness for scalable hardware?
  - "Efficiency" instead of "performance," to capture energy as well as performance
  - Other challenges: power limits, design and verification costs, low yield, higher error rates
- How do we prototype ideas fast enough to run real SW?

Deconstructing Operating Systems
- Resurgence of interest in virtual machines
  - Hypervisor: a thin SW layer between the guest OS and the HW
  - An opportunity for OS innovation
- Future OS: libraries, where only the functions needed are linked into the app, on top of a thin hypervisor providing protection and sharing of resources
- Leverage HW partitioning support for very thin hypervisors, and to allow software full access to hardware within a partition

HW Solution: Small is Beautiful
- Want software-composable primitives, not hardware-packaged solutions
  - "You're not going fast if you're headed in the wrong direction"
  - Transactional memory is usually a packaged solution
- Expect modestly pipelined (5- to 9-stage) CPUs, FPUs, vector and SIMD PEs
  - Small cores are not much slower than large cores, and parallelism is the energy-efficient path to performance (CV²F); lower threshold and supply voltages lower the energy per op
- Configurable memory hierarchy (Cell vs. Clovertown)
  - Can configure on-chip memory as cache or local RAM
  - Programmable DMA moves data without occupying the CPU
  - Cache coherence: mostly HW, but SW handlers for complex cases
- Hardware logging of memory writes to allow rollback

Build an Academic Manycore from FPGAs
- Since 10 CPUs will fit in one Field Programmable Gate Array (FPGA), build a 1000-CPU system from 100 FPGAs?
- 8 simple 32-bit "soft core" RISCs ran at 100 MHz per FPGA in 2004 (Virtex-II); FPGA generations arrive every 1.5 years, with 2x the CPUs and 1.2x the clock rate
- The HW research community does the logic design ("gate shareware") to create an out-of-the-box manycore, e.g., a 1000-processor, standard-ISA, binary-compatible, 64-bit, cache-coherent supercomputer at 150 MHz/CPU in 2007
- Ideal for heterogeneous chip architectures
- RAMPants: 10 faculty at Berkeley, CMU, MIT, Stanford, Texas, and Washington
- The "Research Accelerator for Multiple Processors": a vehicle to lure more researchers to the parallel challenge and decrease the time to a parallel solution

1008-Core "RAMP Blue" (Wawrzynek, Krasnov, ... at Berkeley)
- 1008 = 12 32-bit RISC cores per FPGA x 4 FPGAs per board x 21 boards
  - MicroBlaze soft cores @ 90 MHz
  - Full star connection between modules
- Runs simple UPC versions (C plus a shared-memory abstraction) of the NASA Advanced Supercomputing (NAS) parallel benchmarks CG, EP, IS, MG (all class S)
- RAMPants are creating HW & SW for the manycore community using next-generation FPGAs
  - Chuck Thacker & Microsoft are designing the next boards
  - A 3rd party will manufacture and sell boards: 1H08
  - Gateware and software are BSD open source
  - RAMP Gold for Par Lab: a new CPU

Par Lab Research Overview (recap)

Theme 5: Diagnosing Power/Performance Bottlenecks (Jim Demmel)
- Collect data on power/performance bottlenecks
- Aid the autotuner, scheduler, and OS in adapting the system
- Turn the data into useful information that can help the efficiency-level programmer improve the system
  - E.g., % of peak power, % of peak memory BW, % CPU, % network
  - E.g., sampled traces of critical paths
- Turn the data into useful information that can help the productivity-level programmer improve the app
  - Where am I spending my time in my program?
  - If I change it like this, what is the impact on power/performance?
- Rely on machine learning to find correlations in the data and predict power/performance?

Physical Par Lab: 5th floor, Soda Hall

Impact of Automatic Performance Tuning
- Widely used in performance tuning of kernels:
  - ATLAS (PHiPAC), www.netlib.org/atlas: dense BLAS, now in Matlab and many other releases
  - FFTW, www.fftw.org: Fast Fourier Transform and similar transforms; Wilkinson Software Prize (usage sketched below)
  - Spiral, www.spiral.net: digital signal processing
  - Communication collectives (UCB, UTK)
- Rose (LLNL), Bernoulli (Cornell), Telescoping Languages (Rice), UHFFT (Houston), POET (UTSA), ...
- More projects (PERI, TOPS2, CScADS), conferences, government reports, ...
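FFTW is a good closing example of autotuning in shipping form: planning with FFTW_MEASURE times candidate FFT algorithms for this size on this machine and keeps the best, which every later execution then reuses. A minimal usage sketch (link with -lfftw3); the input signal is ours for illustration:

```c
#include <fftw3.h>
#include <stdio.h>

int main(void) {
    const int n = 1024;
    fftw_complex *in  = fftw_malloc(sizeof(fftw_complex) * n);
    fftw_complex *out = fftw_malloc(sizeof(fftw_complex) * n);

    /* Planning searches the algorithm space once, by measurement;
       execution reuses the winner. (FFTW_MEASURE clobbers in/out,
       so the input is initialized after planning.) */
    fftw_plan p = fftw_plan_dft_1d(n, in, out, FFTW_FORWARD, FFTW_MEASURE);

    for (int i = 0; i < n; i++) { in[i][0] = i % 2; in[i][1] = 0.0; }
    fftw_execute(p);
    printf("out[0] = %g + %gi\n", out[0][0], out[0][1]);

    fftw_destroy_plan(p);
    fftw_free(in);
    fftw_free(out);
    return 0;
}
```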