Pushing the Limits of Accelerator Efficiency While Retaining Programmability
Tony Nowatzki*, Vinay Gangadhar*, Karu Sankaralingam*, Greg Wright+
*Vertical Research Group, University of Wisconsin - Madison
+Qualcomm

Executive Summary
• 5 common principles of architectural specialization
• A programmable architecture (LSSD) embodying the specialization principles
• LSSD compared to single domain-specific accelerators (DSAs):
  Performance: matches the DSA
  Area: overhead of at most 4x
  Power: overhead of at most 4x
• LSSD's power overhead is inconsequential once system-level energy efficiency tradeoffs are considered

Outline
• Introduction and motivation
• Principles of architectural specialization (concurrency, computation, communication, data reuse, coordination)
• Embodiment of the principles in DSAs
• Architecture for programmable specialization (LSSD)
• Evaluation of LSSD with 4 DSAs (performance, power & area)
• System-level energy efficiency tradeoffs with LSSD and DSAs

Era of Specialization
• Performance and/or energy gains from traditional multicore chips are increasingly hard to obtain
• Instead, application domains are specialized with custom hardware units: domain-specific acceleration
• Example accelerated domains: regular expressions, scan, AI, graph traversal, deep neural networks, neural approximation, linear algebra, stencil, sort
• Domain-Specific Accelerators (DSAs):
  + High efficiency: 10-100x performance/power or performance/area
  - No generality: not general-purpose programmable
  - Obsoletion prone

Our Goal: Programmable Specialization
• The specialization benefits of DSAs in a programmable architecture
• A programmable architecture matching the efficiency of DSAs

Key Insight: Commonality in DSAs' Specialization Principles
• DSAs for reg. expr., AI, scan, graph traversal, stencil, deep neural, neural approx., linear algebra, and sort all sit beside a host system of cores and caches
• Most DSAs employ 5 common specialization principles: concurrency, computation, communication, data reuse, and coordination

Solution: Architecture for Programmable Specialization
• Idea 1: the specialization principles can be exploited in a general way
• Idea 2: a composition of known microarchitectural mechanisms can embody the specialization principles
• The result is a programmable architecture (LSSD): low-power core, spatial fabric, scratchpad, DMA
• LSSD is a programmable hardware template that can be provisioned for one application domain (e.g., deep neural) or balanced across many (stencil, sort, scan, AI)
*Figures not to scale

Principles of Architectural Specialization
• Concurrency: match hardware concurrency to that of the algorithm
• Computation: problem-specific computation units
• Communication: explicit communication, as opposed to implicit communication (see the sketch after this list)
• Data reuse: customized structures for data reuse
• Coordination: hardware coordination using simple low-power control logic
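To make the communication and data-reuse principles concrete, here is a minimal sketch; it is not from the talk, and the dma_read intrinsic (stubbed with memcpy so the sketch runs on a host) and the tile size are hypothetical. It contrasts implicit, cache-based access with explicit movement of operands into a software-managed scratchpad where they are reused:

    #include <stddef.h>
    #include <stdio.h>
    #include <string.h>

    /* Hypothetical DMA intrinsic; a host-side stand-in so the sketch runs.
       Real accelerator hardware would bypass the cache hierarchy. */
    static void dma_read(void *scratch, const void *mem, size_t bytes) {
        memcpy(scratch, mem, bytes);
    }

    #define TILE 512  /* scratchpad capacity in elements (illustrative) */

    /* Implicit communication: every access goes through the cache hierarchy,
       and reuse of w[] and x[] is at the mercy of cache replacement. */
    static float dot_implicit(const float *w, const float *x, size_t n) {
        float acc = 0.0f;
        for (size_t i = 0; i < n; ++i)
            acc += w[i] * x[i];
        return acc;
    }

    /* Explicit communication + data reuse: operands are DMA'd into
       scratchpad buffers once per tile, then read from the near memory. */
    static float dot_explicit(const float *w, const float *x, size_t n) {
        static float w_buf[TILE], x_buf[TILE];  /* stands in for scratchpad SRAM */
        float acc = 0.0f;
        for (size_t i = 0; i < n; i += TILE) {
            size_t chunk = (n - i < TILE) ? (n - i) : TILE;
            dma_read(w_buf, w + i, chunk * sizeof(float));
            dma_read(x_buf, x + i, chunk * sizeof(float));
            for (size_t j = 0; j < chunk; ++j)
                acc += w_buf[j] * x_buf[j];
        }
        return acc;
    }

    int main(void) {
        float w[1000], x[1000];
        for (int i = 0; i < 1000; ++i) { w[i] = 1.0f; x[i] = 0.5f; }
        printf("%f %f\n", dot_implicit(w, x, 1000), dot_explicit(w, x, 1000));
        return 0;
    }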
5 Specialization Principles
• How do DSAs embody these principles in a domain-specific way?
• Four example DSAs are examined: NPU (neural approximation), Convolution Engine (stencil), DianNao (deep neural networks), and Q100 (database streaming)

Principles in DSAs: NPU, the Neural Processing Unit
• High-level organization: input/output FIFOs and a bus scheduler in front of a general-purpose processor, feeding 8 processing engines (PEs)
• Each PE: weight buffer, FIFO, controller, multiply-add unit, accumulator register, sigmoid unit, and output buffer
• The five principles as the NPU embodies them:
  Concurrency: hardware concurrency matched to the algorithm via multiple PEs
  Computation: problem-specific multiply-add and sigmoid units
  Communication: explicit communication over FIFOs and the scheduled bus
  Data reuse: customized weight and output buffers
  Coordination: simple low-power PE controllers

Principles in DSAs (continued)
• Repeating this exercise on the processing units and high-level organization of the other DSAs shows the same pattern: most DSAs employ the 5 common specialization principles

Implementation of the Principles in a General Way
• Composition of simple microarchitectural mechanisms (a tile is the hardware for a coarse-grained unit of work):
  Concurrency: multiple tiles
  Computation: special FUs in a spatial fabric
  Communication: dataflow + the spatial fabric
  Data reuse: scratchpad (SRAMs)
  Coordination: a low-power simple core

LSSD Programmable Architecture
• Each tile: a low-power core (LX3) with D$, a spatial fabric of FUs and switches (S) with input/output interfaces, a scratchpad, and a DMA engine to memory
• Tiles are replicated to scale concurrency
• LSSD = Low-power core | Spatial fabric | Scratchpad | DMA, one mechanism per principle (a parameter sketch follows below)
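As an illustrative way to see the template's knobs, the struct below groups them roughly one per specialization principle. The struct and its field names are hypothetical, not the paper's API; the values mirror the LSSDN design point listed later under "LSSD Design Point Selection":

    /* Hypothetical parameterization of the LSSD template:
       roughly one knob per specialization principle. */
    typedef struct {
        int num_units;                        /* concurrency: # of LSSD tiles */
        int cgra_tiles;                       /* computation: spatial fabric size */
        int fu_mul, fu_add, fu_sigmoid;       /* computation: FU mix */
        int cgra_width_bits;                  /* communication: datapath width */
        int sram_iface_bits;                  /* communication: SRAM interface width */
        int scratch_words, scratch_word_bits; /* data reuse: scratchpad geometry */
        const char *control_core;             /* coordination: low-power core */
    } lssd_config;

    /* Values mirroring the LSSDN design point (see "LSSD Design Point Selection"). */
    static const lssd_config lssd_n = {
        .num_units = 1,
        .cgra_tiles = 24, .fu_mul = 8, .fu_add = 8, .fu_sigmoid = 1,
        .cgra_width_bits = 32, .sram_iface_bits = 256,
        .scratch_words = 2048, .scratch_word_bits = 32,
        .control_core = "LX3",
    };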
Instantiating LSSD
• LSSD is a programmable hardware template for specialization
• Provisioned for a single application domain: LSSDN (neural approx.), LSSDC (stencil), LSSDD (deep neural), LSSDQ (database)
• Provisioned for multiple application domains at once: LSSDBalanced, or LSSDB (deep neural, stencil, neural approx., database)
• Design point selection, synthesis & programming: more details in the paper
*Figures not to scale

Methodology
• Modeling framework for LSSD:
  Performance: trace-driven simulator + application-specific modeling
  Power & area: synthesized modules, CACTI, and McPAT
• Compared against four DSAs, using their published performance, area & power
• Four domain-provisioned LSSDs and one combined balanced LSSD:
  LSSDN (1 tile) vs. NPU; LSSDC (1 tile) vs. Conv. Engine; LSSDD (8 tiles) vs. DianNao; LSSDQ (4 tiles) vs. Q100; LSSDB (8 tiles) vs. all four
• Each LSSD is provisioned to match the performance of its DSA; other tradeoffs are possible (power, area, energy, etc.)

Performance Analysis (1): LSSDN vs. NPU
[Chart: speedup over the baseline, a 4-wide OOO core (Intel 3770K), on fft (1-4-4-2), inversek2j (2-8-2), jmeint (18-32-8-2), jpeg (64-16-64), kmeans (6-8-4-1), sobel (9-8-1), and the geometric mean; bars cumulatively add LP core + sigmoid (+computation), SIMD (+concurrency), spatial (+communication), and LSSDN (+reuse), compared against the NPU (DSA)]

Performance Analysis (2): LSSDC vs. Conv. Engine (1 tile), LSSDD vs. DianNao (8 tiles), LSSDQ vs. Q100 (4 tiles)
[Charts: the same cumulative mechanism breakdown against each DSA; Conv. Engine workloads include DOG, EXTR., FME, IME; DianNao workloads include conv1-5, pool1-5, class1/3; Q100 workloads are TPC-H queries q1-q17 (subset); baseline: 4-wide OOO core (Intel 3770K)]
• Performance: LSSD is able to match the DSAs
• Main contributor to speedup: concurrency

Domain-Provisioned LSSDs
• How do LSSD area & power compare to a single DSA?

Area Analysis (domain-provisioned LSSDs)
• Overhead of the domain-provisioned LSSDs: roughly 1x-4x worse in area (normalized: LSSDN 1.2x, LSSDC 1.7x, LSSDD 3.8x, LSSDQ 0.5x)
*Detailed area breakdown in the paper

Power Analysis (domain-provisioned LSSDs)
• Overhead of the domain-provisioned LSSDs: 2x-4x worse in power (normalized: LSSDN 2x, LSSDC 3.6x, LSSDD 4.1x, LSSDQ 0.6x)
*Detailed power breakdown in the paper

Balanced LSSD Design
• What are the area and power of the LSSDBalanced design when multiple domains are mapped onto it?

LSSDBalanced Analysis
• Area: 0.6x, i.e., more area-efficient than the multiple DSAs it replaces
• Power: 2.5x worse than the multiple DSAs

Does LSSD's power overhead of 2x-4x matter in a system with an accelerator? In what scenarios would you want to build a DSA over LSSD?

Energy Efficiency Tradeoffs
• System with accelerator: an OOO core (Pcore: 5W), caches and the rest of the system (Psys: 5W), the accelerator itself (LSSD or DSA, Pacc: 0.1-5W), the system bus, and memory
• t: execution time; S: accelerator's speedup; U: accelerator utilization
• Overall energy of the computation executed on the system:
  E = Pacc * (U/S) * t          (accelerator energy)
    + Psys * (1 - U + U/S) * t  (system energy)
    + Pcore * (1 - U) * t       (core energy)
*Power numbers are representative examples

Energy Efficiency Gains of LSSD & DSA over the OOO Core
[Charts: energy efficiency of LSSD (Plssd = 0.5W, i.e., a 500mW power overhead) and of a DSA (Pdsa ≈ 0W) over the OOO core, versus accelerator speedup 0-50, for U = 1, 0.95, 0.9, and 0.75; Speeduplssd = Speedupdsa, both w.r.t. the 4-wide OOO baseline]
• At higher speedups (S → ∞), energy efficiency gains are 'capped' due to the large system power

Does LSSD's power overhead of 2x-4x matter in a system with an accelerator?
• When Psys >> Plssd, the 2x-4x power overheads of LSSD become inconsequential
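A minimal sketch of this energy model in code; this is a paraphrase, not the paper's modeling framework. It assumes the OOO baseline energy is the same formula evaluated at U = 0, and uses the example power numbers above with U = 0.95 to show how the gains saturate:

    #include <stdio.h>

    /* Energy model from the slide:
       E = Pacc*(U/S)*t + Psys*(1-U+U/S)*t + Pcore*(1-U)*t.
       t cancels in efficiency ratios, so it is fixed at 1.0. */
    static double energy(double p_acc, double u, double s,
                         double p_core, double p_sys) {
        return p_acc * (u / s)
             + p_sys * (1.0 - u + u / s)
             + p_core * (1.0 - u);
    }

    int main(void) {
        const double p_core = 5.0, p_sys = 5.0;  /* example numbers from the slide */
        const double p_lssd = 0.5, p_dsa = 0.0;  /* 500 mW LSSD, idealized 0 W DSA */
        const double u = 0.95;                   /* accelerator utilization */
        /* Baseline: everything runs on the OOO core (U = 0). */
        double e_ooo = energy(0.0, 0.0, 1.0, p_core, p_sys);

        for (double s = 5.0; s <= 50.0; s += 5.0)
            printf("S=%4.0f  eff_lssd=%5.2f  eff_dsa=%5.2f\n", s,
                   e_ooo / energy(p_lssd, u, s, p_core, p_sys),
                   e_ooo / energy(p_dsa,  u, s, p_core, p_sys));
        return 0;
    }

Both curves approach the same cap (20x here, at U = 0.95) as S grows, because the residual (Psys + Pcore) * (1 - U) energy dominates; this is the 'capping' behavior the charts show.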
Energy Efficiency Gains of DSA over LSSD
• Speeduplssd = Speedupdsa (both w.r.t. OOO); baseline: LSSD
• Effdsa / Efflssd = (1 / DSA energy) / (1 / LSSD energy) = LSSD energy / DSA energy
[Chart: energy efficiency of the DSA over LSSD versus accelerator speedup w.r.t. the OOO core, 0-50, for U = 1, 0.95, 0.9, 0.75; y-axis from 1.00 to 1.12]
• The DSA's gain is no more than 10%, even at 100% utilization
• At lower speedups, the DSA's energy efficiency gains are 6-10% over LSSD
• At higher speedups, the DSA's benefit is less than 5% in energy efficiency

In what scenarios would you want to build a DSA over LSSD?
• Only when application speedups are small and even small energy efficiency gains are important
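As a worked check of the 10% cap (a back-of-the-envelope use of the energy model above, with the example powers Plssd = 0.5W, Pdsa ≈ 0W, Psys = 5W): at U = 1 the core term vanishes and 1 - U + U/S = 1/S, so

  Elssd = (Plssd + Psys) * (t/S) = 5.5 * (t/S)
  Edsa  = (Pdsa  + Psys) * (t/S) = 5.0 * (t/S)
  Effdsa / Efflssd = Elssd / Edsa = 5.5 / 5.0 = 1.10

That is, at full utilization the DSA's advantage is exactly the 10% cap, independent of S.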
Conclusion
• 5 common principles for architectural specialization
• A programmable architecture (LSSD) composed of simple microarchitectural mechanisms embodying the principles
• LSSD is competitive with DSA performance, with overheads of only up to 4x in area and power
• The power overhead is inconsequential when system-level energy tradeoffs are considered
• LSSD can serve as a baseline for future accelerator research

Back Up Slides

Design-Time vs. Run-Time Decisions
• Concurrency: number of LSSD units (synthesis time); power-gating of unused LSSD units (run time)
• Computation: spatial fabric FU mix (synthesis time); scheduling of the spatial fabric and core (run time)
• Communication: spatial datapath elements & SRAM interface widths (synthesis time); enabling/configuring the spatial datapath, switches, and ports, and the memory access pattern (run time)
• Data reuse: scratchpad (SRAM) size (synthesis time); scratchpad used as a DMA or reuse buffer (run time)

LSSD Design Point Selection
• LSSDN: 24-tile CGRA (8 Mul, 8 Add, 1 Sigmoid) with a 2k x 32b sigmoid lookup table; 32b CGRA, 256b SRAM interface; 2k x 32b weight buffer; 1 unit
• LSSDC: 64-tile CGRA (32 Mul/Shift, 32 Add/Logic), standard 16b FUs; 16b CGRA, 512b SRAM interface; 512 x 16b SRAM for inputs; 1 unit
• LSSDD: 64-tile CGRA (32 Mul, 32 Add, 2 Sigmoid) with a piecewise-linear sigmoid unit; 32b CGRA, 512b SRAM interface; 2k x 16b SRAMs for inputs; 8 units
• LSSDQ: 32-tile CGRA (16 ALU, 4 Agg, 4 Join) with join + filter units; 64b CGRA, 256b SRAM interface; SRAMs for buffering; 4 units
• LSSDB: 32-tile CGRA combining the FU mixes above; 64b CGRA, 512b SRAM interface; 4KB SRAM; 8 units

Accelerator Workloads
• Domains: neural approx., DNN, convolution, database streaming
• Common traits: 1. ample parallelism; 2. regular memory; 3. large datapath; 4. computation heavy

LSSD in Practice
1. Design synthesis: the designer supplies performance requirements (per application: App. 1, App. 2, App. 3, ...) and hardware constraints (area goal, power goal); the design decisions are the FU types, number of FUs, spatial fabric size, and number of LSSD tiles, which are then synthesized
2. Programming: for each application, write the control program (a C program + annotations) and the datapath program (via a spatial scheduling compiler framework)

Programming LSSD
• Pragmas direct the mapping, e.g.:
  #pragma lssd cores 2
  #pragma reuse-scratchpad weights
• Example neural-network layer (control program; weights indexed as a flattened 2-D array):

  void nn_layer(int num_in, int num_out, const float* weights,
                const float* in, float* out) {
      for (int j = 0; j < num_out; ++j) {
          for (int i = 0; i < num_in; ++i) {
              out[j] += weights[j * num_in + i] * in[i];
          }
          out[j] = sigmoid(out[j]);
      }
  }

• Control-program transformations: loop parallelize, insert communication (data transfer), modulo schedule
• Datapath-program transformations: resize computation (unroll), extract the computation subgraph, spatial schedule
[Figure: a spatial fabric of multipliers and adders feeding a reduction (Ʃ), alongside memory, DMA, D$, scratchpad, input/output interfaces, and the low-power core]

Power & Area Analysis (1)
• LSSDN: 1.2x more area and 2x more power than the NPU
• LSSDC: 1.7x more area and 3.6x more power than the Conv. Engine

Power & Area Analysis (2)
• LSSDD: 3.8x more area and 4.1x more power than DianNao
• LSSDQ: 0.5x the area and 0.6x the power of Q100

LSSD Area & Power Numbers
• LSSDN (neural approx.): 0.37 mm2, 149 mW; NPU: 0.30 mm2, 74 mW
• LSSDC (stencil): 0.15 mm2, 108 mW; Conv. Engine: 0.08 mm2, 30 mW
• LSSDD (deep neural): 2.11 mm2, 867 mW; DianNao: 0.56 mm2, 213 mW
• LSSDQ (database streaming): 1.78 mm2, 519 mW; Q100: 3.69 mm2, 870 mW
• LSSDBalanced: 2.74 mm2, 352 mW
*Intel Ivybridge 3770K CPU, 1 core: area 12.9 mm2, power 4.95 W
*Intel Ivybridge 3770K iGPU, 1 execution lane: area 5.75 mm2
+AMD Kaveri APU (Tahiti-based GPU), 1 CU: area 5.02 mm2
*Source: http://www.anandtech.com/show/5771/the-intel-ivy-bridge-core-i7-3770k-review/3
+Estimate from die-photo analysis and block diagrams from wccftech.com

Power & Area Analysis (3): LSSDB, the balanced LSSD design
• Relative to the individual DSAs: 2.7x more area, 2.4x more power
• Relative to the multiple DSAs combined: 0.6x the area, 2.5x the power

Energy Efficiency Gains of DianNao over LSSD
[Chart: energy efficiency of DianNao over LSSD versus accelerator speedup w.r.t. the OOO core, 0-50, for U = 1, 0.95, 0.9, 0.75; y-axis from 1.00 to 1.14; SpeedupLSSD = SpeedupDianNao]

Does Accelerator Power Matter?
• At speedups > 10x, the DSA's efficiency advantage is around 5%, even when accelerator power equals core power
• At smaller speedups, accelerator power makes a bigger difference: up to 35%
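To probe this last question with the same energy model, a small sketch follows; it is again illustrative, with a hypothetical sweep over speedups and accelerator powers (U = 0.95 and the 5W core/system powers are the example figures from the energy-tradeoff slide). It compares an idealized 0W DSA against accelerators of increasing power, up to core power:

    #include <stdio.h>

    /* Same energy model as before, with t = 1:
       E = Pacc*(U/S) + Psys*(1-U+U/S) + Pcore*(1-U). */
    static double energy(double p_acc, double u, double s,
                         double p_core, double p_sys) {
        return p_acc * (u / s)
             + p_sys * (1.0 - u + u / s)
             + p_core * (1.0 - u);
    }

    int main(void) {
        const double p_core = 5.0, p_sys = 5.0, u = 0.95;
        for (double s = 10.0; s <= 50.0; s += 20.0) {       /* speedups 10x, 30x, 50x */
            for (double p_acc = 1.0; p_acc <= p_core; p_acc += 2.0) {
                /* Efficiency of the ideal DSA over an accelerator burning p_acc
                   watts: the accelerator's own power matters less as S grows,
                   since its energy contribution scales as Pacc*U/S. */
                double gain = energy(p_acc, u, s, p_core, p_sys)
                            / energy(0.0,  u, s, p_core, p_sys);
                printf("S=%4.0f  Pacc=%.1f W  ideal-DSA gain = %.3f\n",
                       s, p_acc, gain);
            }
        }
        return 0;
    }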