Algorithms and Specializers for Provably Optimal Implementations with Resiliency and Efficiency

Elad Alon, Krste Asanovic (Director), Jonathan Bachrach, Jim Demmel, Armando Fox, Kurt Keutzer, Borivoje Nikolic, David Patterson, Koushik Sen, John Wawrzynek
krste@eecs.berkeley.edu
http://aspire.eecs.berkeley.edu
UC Berkeley

Future Application Drivers
[Figure: example future application drivers]

Compute Energy "Iron Law"
Performance (Tasks/Second) = Power (Joules/Second) * Energy Efficiency (Tasks/Joule)
- When power is constrained, we need better energy efficiency for more performance
- When performance is constrained (real-time), we want better energy efficiency to lower power
- Improving energy efficiency is a critical goal for all future systems and workloads

Good News: Moore's Law Continues
- "Cramming more components onto integrated circuits", Gordon E. Moore, Electronics, 1965

Bad News: Dennard (Voltage) Scaling Over
[Figure: Dennard scaling vs. post-Dennard scaling; energy scaling ended in 2005. Sources: Gordon Moore, ISSCC Keynote, 2003; data courtesy S. Borkar/Intel, 2011; CICC, Sept 10, 2012]

1st Impact of End of Scaling: End of Sequential Processor Era
[Figure: single-thread processor performance trends]

Parallelism: A one-time gain
- Use more, slower cores for better energy efficiency: either simpler cores, or cores run at lower Vdd/frequency
- Even simpler general-purpose microarchitectures? Limited by the smallest sensible core
- Even lower Vdd/frequency? Limited by Vdd/Vt scaling and errors
- Now what?

2nd Impact of End of Scaling: "Dark Silicon"
- Cannot switch all transistors at full frequency! [Muller, ARM CTO, 2009]
- No savior device technology on the horizon
- Future energy-efficiency innovations must be above the transistor level

The End of General-Purpose Processors?
- Most computing happens in specialized, heterogeneous processors, which can be 100-1000x more efficient than a general-purpose processor
- Challenges: hardware design costs and software development costs
[Figure: NVIDIA Tegra 2 die photo]

The Real Scaling Challenge: Communication
- As transistors become smaller and cheaper, communication dominates performance and energy
- At all scales: across the chip, up and down the memory hierarchy, chip-to-chip, board-to-board, rack-to-rack

ASPIRE: From Better to Best
- Specialize and optimize communication and computation across the whole stack, from applications to hardware
- What is the best we can do for a fixed target technology (e.g., 7nm)?
- Can we prove a bound? Can we design an implementation approaching that bound?
- Goal: provably optimal implementations

Communication-Avoiding Algorithms: Algorithm Cost Measures
1. Arithmetic (FLOPs)
2. Communication: moving data between levels of a memory hierarchy (sequential case) or between processors over a network (parallel case)
[Figure: sequential model (CPU, cache, DRAM) and parallel model (CPUs with local DRAM connected over a network)]

Modeling Runtime & Energy
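The next three slides break energy into dynamic, memory-retention, and background terms, and the "Perfect Strong Scaling" slide further on defines the per-operation constants. As a reading aid, here is a minimal Scala sketch of that combined runtime/energy model. The model structure comes from the deck; the parameter names are transliterations of the Greek symbols, and any concrete values plugged in are the user's own assumptions.

```scala
// Minimal sketch of the deck's runtime/energy cost model (constants are
// defined on the "Perfect Strong Scaling" slide): gammaT/betaT/alphaT are
// seconds per flop, per word moved, and per message; gammaE/betaE/alphaE
// are the Joule costs of the same operations; deltaE is Joules per word of
// memory retained per second; epsilonE is Joules per second of background
// (leakage) power. All parameter values are assumptions supplied by the user.
object CAModel {
  case class CostModel(gammaT: Double, betaT: Double, alphaT: Double,
                       gammaE: Double, betaE: Double, alphaE: Double,
                       deltaE: Double, epsilonE: Double)

  // Per-processor operation counts: F flops, W words moved, S messages sent.
  def runtime(c: CostModel, F: Double, W: Double, S: Double): Double =
    c.gammaT * F + c.betaT * W + c.alphaT * S

  // P processors, each holding M words of memory for the whole run.
  def energy(c: CostModel, P: Double, M: Double,
             F: Double, W: Double, S: Double): Double = {
    val t = runtime(c, F, W, S)
    P * (c.gammaE * F + c.betaE * W + c.alphaE * S // dynamic energy
         + c.deltaE * M * t                        // memory retention
         + c.epsilonE * t)                         // background power
  }
}
```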
A few examples of speedups
- Matrix multiplication: up to 12x on IBM BG/P for n=8K on 64K cores, with 95% less communication
- QR decomposition (used in least squares, data mining, …):
  - Up to 8x on 8-core dual-socket Intel Clovertown, for 10M x 10
  - Up to 6.7x on a 16-processor Pentium III cluster, for 100K x 200
  - Up to 13x on Tesla C2050 (Fermi), for 110K x 100
  - Up to 4x on a grid of 4 cities (Dongarra, Langou et al.)
  - "Infinite speedup" for out-of-core on a PowerPC laptop: LAPACK thrashed virtual memory and didn't finish
- Eigenvalues of band symmetric matrices: up to 17x on 8-core Intel Gainestown vs. MKL 10.0 (up to 1.9x sequential)
- Iterative sparse linear equation solvers (GMRES): up to 4.3x on 8-core Intel Clovertown
- N-body (direct particle interactions with a cutoff distance): up to 10x on Cray XT-4 (Hopper), 24K particles on 6K processors

Modeling Energy: Dynamic
Modeling Energy: Memory Retention
Modeling Energy: Background Power
[Figures: the dynamic, memory-retention, and background-power terms of the energy model]

Energy Lower Bounds
[Figure: communication-based lower bounds on energy]

Early Result: Perfect Strong Scaling in Time and Energy
- Every time you add a processor, use its memory M too
- Start with the minimal number of processors: PM = 3n^2
- Increase P by a factor c, and total memory increases by a factor c
- Notation for the timing model: γ_t, β_t, α_t = seconds per flop, per word moved, per message of size m
  T(cP) = (n^3/(cP)) [γ_t + β_t/M^(1/2) + α_t/(m·M^(1/2))] = T(P)/c
- Notation for the energy model: γ_e, β_e, α_e = Joules for the same operations; δ_e = Joules per word of memory used per second; ε_e = Joules per second for leakage, etc.
  E(cP) = cP { (n^3/(cP)) [γ_e + β_e/M^(1/2) + α_e/(m·M^(1/2))] + δ_e·M·T(cP) + ε_e·T(cP) } = E(P)
- Perfect scaling extends to n-body, Strassen, … [IPDPS, 2013] (see the numerical check below)
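To make the scaling claim concrete, the small standalone Scala check below plugs illustrative, assumed machine constants into the slide's closed forms for T and E. Running it shows T(cP) = T(P)/c while E(cP) stays equal to E(P); none of the constants are measured data.

```scala
// Numerical check of the perfect-strong-scaling equations on the slide
// above. All machine constants are illustrative assumptions, not data.
object PerfectScaling extends App {
  val (gT, bT, aT) = (1e-9, 1e-8, 1e-6)  // gamma_t, beta_t, alpha_t (seconds)
  val (gE, bE, aE) = (1e-10, 1e-9, 1e-7) // gamma_e, beta_e, alpha_e (Joules)
  val dE = 1e-12                         // Joules per word of memory per second
  val eE = 0.1                           // Joules per second (leakage etc.)
  val m  = 1024.0                        // words per message
  val n  = 8192.0                        // problem size
  val p0 = 64.0                          // minimal processor count
  val M  = 3 * n * n / p0                // memory per processor, so P*M = 3n^2

  // Closed forms from the slide; M (memory per processor) is fixed as P grows.
  def T(p: Double): Double =
    n * n * n / p * (gT + bT / math.sqrt(M) + aT / (m * math.sqrt(M)))

  def E(p: Double): Double =
    p * (n * n * n / p * (gE + bE / math.sqrt(M) + aE / (m * math.sqrt(M)))
         + dE * M * T(p) + eE * T(p))

  // Runtime drops by the factor c; energy stays flat.
  for (c <- Seq(1, 2, 4, 8))
    println(f"c=$c%-2d  T(cP)=${T(c * p0)}%.6f s  E(cP)=${E(c * p0)}%.3f J")
}
```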
C-A Algorithms Not Just for HPC
- In ASPIRE, apply them to other key application areas: machine vision, databases, speech recognition, software-defined radio, …
- Initial results on lower bounds for database join algorithms

From C-A Algorithms to Provably Optimal Systems?
1) Prove lower bounds on communication for a computation
2) Develop an algorithm that achieves the lower bound on a system
3) Find that communication time/energy cost is >90% of the resulting implementation
4) Then we know we're within 10% of optimal!
Supporting technique: optimize the software stack and compute engines to reduce compute costs and expose the unavoidable communication costs

ESP: An Applications Processor Architecture for ASPIRE
[Figure: die photos of Intel Ivy Bridge (22nm) and Qualcomm Snapdragon MSM8960 (28nm)]
- Future server and mobile SoCs will have many fixed-function accelerators plus a general-purpose programmable multicore
- It is well known how to customize hardware engines for a specific task
- The ESP challenge is using specialized engines for general-purpose code

ESP: Ensembles of Specialized Processors
- General-purpose hardware: flexible but inefficient
- Fixed-function hardware: efficient but inflexible
- Par Lab insight: patterns capture common operations across many applications, each with a unique communication and computation structure
- Build an ensemble of specialized engines, each individually optimized for a particular pattern but collectively covering application needs
- The bet: this gives us efficiency plus flexibility; any given core can have a different mix of engines depending on workload

Par Lab: Motifs common across apps
[Figure: applications (audio recognition, object recognition, scene analysis) mapped onto the Berkeley View "dwarfs" or motifs (dense, sparse, graph, …)]

Motif (née "Dwarf") Popularity
[Figure: red-hot/blue-cool heat map of motif usage across computing domains and the Par Lab apps]

Architecting Parallel Software
- Identify the software structure: Pipe-and-Filter, Agent-and-Repository, Event-Based, Bulk Synchronous, Map-Reduce, Layered Systems, Model-View-Controller, Arbitrary Task Graphs, Puppeteer
- Identify the key computations: graph algorithms, dynamic programming, dense/sparse linear algebra, un/structured grids, graphical models, finite state machines, backtrack/branch-and-bound, N-body methods, circuits, spectral methods, Monte Carlo

Mapping Software to ESP: Specializers
[Figure: applications map onto the Berkeley View motifs; specializers with SEJITS implementations and autotuning emit glue, dense, sparse, graph, and ESP code targeting an ILP engine core with dense, sparse, and graph engines]
- Capture the desired functionality at a high level, using patterns in a productive high-level language
- Use pattern-specific compilers (specializers) with autotuners to produce efficient low-level code
- The ASP specializer infrastructure is available as an open-source download

Replacing Fixed Accelerators with Programmable Fabric
[Figure: die photos of Intel Ivy Bridge (22nm) and Qualcomm Snapdragon MSM8960 (28nm)]
- Future server and mobile SoCs will have many fixed-function accelerators plus a general-purpose programmable multicore
- The fabric challenge is achieving extreme energy efficiency while retaining programmability

Strawman Fabric Architecture
[Figure: strawman fabric as a tiled grid of R, M, and A (routing, memory, ALU) blocks]
- Will never have a C compiler; programmed only with pattern-based DSLs
- More dynamic, less static than earlier approaches: dynamic dataflow-driven execution, dynamic routing
- Large memory support

"Agile Hardware" Development
- Current hardware design is slow and arduous, but we now have a huge design space to explore
- How do we examine many design points efficiently? Build parameterized generators, not point designs!
- Adopt and adapt best practices from Agile Software:
  - A complete LVS/DRC-clean physical design of the current version every ~two weeks (a "tape-in")
  - Incremental feature addition
  - Test and verification as the first step

Chisel: Constructing Hardware In a Scala Embedded Language
- Embed a hardware-description language in Scala, using Scala's extension facilities
- A hardware module is just a data structure in Scala
- Different output routines can generate different types of output (C, FPGA Verilog, ASIC Verilog) from the same hardware representation
- Full power of Scala for writing hardware generators (see the small example below):
  - Object-oriented: factory objects, traits, overloading, etc.
  - Functional: higher-order functions, anonymous functions, currying
  - Compiles to the JVM: good performance, Java interoperability
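For flavor, here is a tiny width-parameterized multiplexer written as a Chisel module. A caveat: this sketch uses present-day Chisel 3 syntax, which postdates this 2012 deck; the point it illustrates is the deck's, namely that an ordinary Scala parameter (`w` here) turns a point design into a generator.

```scala
import chisel3._

// A width-parameterized 2-to-1 multiplexer: a tiny hardware *generator*,
// since the Scala constructor argument `w` is evaluated at elaboration
// time, so Mux2(8) and Mux2(32) produce different circuits.
// (Chisel 3 syntax, which postdates this 2012 deck.)
class Mux2(w: Int) extends Module {
  val io = IO(new Bundle {
    val sel = Input(Bool())
    val in0 = Input(UInt(w.W))
    val in1 = Input(UInt(w.W))
    val out = Output(UInt(w.W))
  })
  // Mux is a Chisel primitive; := connects hardware nodes, not Scala values.
  io.out := Mux(io.sel, io.in1, io.in0)
}
```

A module like this is what feeds the flow on the next slide: the same hardware representation drives software simulation, FPGA emulation, and ASIC layout.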
Chisel Design Flow
- A Chisel program runs on the Scala/JVM and can emit:
  - C++ code → C++ compiler → software simulator
  - FPGA Verilog → FPGA tools → FPGA emulation
  - ASIC Verilog → ASIC tools → GDS layout

Chisel is much more than an HDL
- The base Chisel system lets you use the full power of Scala to describe the RTL of a design, then generate Verilog or C++ output from that RTL
- Chisel can be extended above with domain-specific languages (e.g., signal processing) for the fabric
- Importantly, Chisel can also be extended below with new backends, tools, or features (e.g., quantum computing circuits)
- Only ~6,000 lines of code in the current version, including libraries!
- BSD-licensed open source at: chisel.eecs.berkeley.edu

Test Sites
[Figure: test-chip die photo annotated with a processor site (cores 0-3 on supplies VC0-VC3, 512KB L2 on VFIXED), a clock test site, a DC-DC test site, and an SRAM test site]
- Many processor tapeouts in a few years with a small group (45nm, 28nm)

Resilient Circuits & Modeling
- Future scaled technologies have high variability, but we want to run with the lowest possible margins to save energy
- Significant increase in soft errors: we need resilient systems
- Technology modeling to determine the tradeoff between MTBF and energy per task for logic, SRAM, and interconnect
- Techniques that reduce operating voltage can be worse for energy, due to a rapid rise in errors

Algorithms and Specializers for Provably Optimal Implementations with Resiliency and Efficiency
[Figure: full-stack overview. Software: applications (scene analysis, audio recognition, object recognition) map onto computational and structural patterns (dense C-A GEMM, sparse C-A SpMV, graph C-A BFS, …, Pipe-and-Filter, Map-Reduce) via communication-avoiding algorithms; specializers with SEJITS implementations and autotuning emit glue, dense, sparse, graph, and ESP code. Hardware: the ESP (Ensembles of Specialized Processors) architecture (ILP engine core with dense, sparse, and graph engines; cache coherence, local stores + DMA) is built from hardware generators in the Chisel HDL, with C++ simulation, FPGA emulation, and validation/verification, targeting FPGA, ASIC, and SoC implementation technologies. The whole stack supports deep HW/SW design-space exploration.]

ASPIRE Project
- Initial $15.6M/5.5-year funding from the DARPA PERFECT program
  - Started 9/28/2012
  - Located in the Par Lab space + BWRC
- Looking for industrial affiliates (see Krste!)
- Open House today, 5th floor Soda Hall

Research funded by DARPA Award Number HR0011-12-2-0016. Approved for public release; distribution is unlimited. The content of this presentation does not necessarily reflect the position or the policy of the US government and no official endorsement should be inferred.