DNN Accelerator Architectures
ISCA Tutorial (2019)
Website: http://eyeriss.mit.edu/tutorial.html
Joel Emer, Vivienne Sze, Yu-Hsin Chen

Highly-Parallel Compute Paradigms
• Temporal Architecture (SIMD/SIMT): centralized control drives an array of ALUs through a shared memory hierarchy and register file.
• Spatial Architecture (Dataflow Processing): ALUs are interconnected, so they can pass data directly to one another as well as to the memory hierarchy.

Memory Access is the Bottleneck
• Every MAC (multiply-and-accumulate) requires three memory reads (filter weight, fmap activation, partial sum) and one memory write (updated partial sum).
• Worst case: all memory reads and writes are DRAM accesses.
• Example: AlexNet [NIPS 2012] has 724M MACs → 2896M DRAM accesses required.

Leverage Local Memory for Data Reuse
• Add extra levels of local memory hierarchy between DRAM and the ALU.
• The local memories are smaller, but faster and more energy-efficient.

Types of Data Reuse in DNN
• Convolutional Reuse (CONV layers only, sliding window): reuses activations and filter weights.
• Fmap Reuse (CONV and FC layers): each input fmap is read by multiple filters, so activations are reused.
• Filter Reuse (CONV and FC layers, batch size > 1): each filter is applied to multiple input fmaps, so filter weights are reused.
• If all data reuse is exploited, DRAM accesses in AlexNet can be reduced from 2896M to 61M (best case).
Leverage Parallelism for Higher Performance
• Replicate the ALU (each copy with its own local memory) so that many MACs execute in parallel.

Leverage Parallelism for Spatial Data Reuse
• Parallel ALUs can also pass data to one another, reusing data spatially instead of re-reading memory.

Spatial Architecture for DNN
• DRAM
• Global Buffer (100s – 1000s kB)
• On-Chip Network (NoC): Global Buffer to PE, and PE to PE
• Processing Element (PE): ALU plus register file (1 – 10 kB) and local control

Multi-Level Low-Cost Data Access
• Fetching data to run a MAC gets more expensive the farther the data travels. Normalized energy cost*:
  – ALU (reference): 1×
  – RF (1s – 10s kB) → ALU: 1×
  – PE → PE (NoC: 100s – 1000s PEs): 2×
  – Global Buffer (100s – 1000s kB) → ALU: 6×
  – DRAM → ALU: 200×
  * measured from a commercial 65nm process
• A dataflow is required to maximally exploit data reuse with the low-cost memory hierarchy and parallelism.

Dataflow Taxonomy [Chen et al., ISCA 2016]
• Output Stationary (OS)
• Weight Stationary (WS)
• Input Stationary (IS)

Output Stationary (OS)
• Each PE holds a partial sum (P0 … P7) in place; weights and activations stream past from the Global Buffer.
• Minimize partial sum R/W energy consumption: maximize local accumulation.
• Broadcast/multicast filter weights and reuse activations spatially across the PE array.

OS Example: ShiDianNao [Du et al., ISCA 2015]
• (Figure: top-level architecture and PE architecture; weights and activations stream through the array, psums stay in the PEs.)

OS Example: ENVISION [Moons et al., VLSI 2016, ISSCC 2017]
• (Figure: activations and weights delivered to an output-stationary PE array.)

Variants of Output Stationary
                     | OSA                   | OSB      | OSC
# Output Channels    | Single                | Multiple | Multiple
# Output Activations | Multiple              | Multiple | Single
Notes                | Targeting CONV layers |          | Targeting FC layers

OS Example
• (Figure: a set of 8 filters slides over a 3-channel input fmap, producing an 8-channel output fmap.)
OS Example (animation over several slides)
• Cycle through the input fmap and the weights while holding the psums of the output fmap in place; until the final step, the held values are incomplete partial sums.

1-D Convolution – Output Stationary

int I[H];  // Input activations
int W[R];  // Filter weights
int O[E];  // Output activations, E = H-R+1 (assuming 'valid' style convolution)

for (e = 0; e < E; e++)
  for (r = 0; r < R; r++)
    O[e] += I[e+r] * W[r];

• No constraints on loop permutations!

1-D Convolution (loops interchanged)

for (r = 0; r < R; r++)
  for (e = 0; e < E; e++)
    O[e] += I[e+r] * W[r];

Weight Stationary (WS)
• Each PE holds a weight (W0 … W7) in place; activations and psums move.
• Minimize weight read energy consumption: maximize convolutional and filter reuse of weights.
• Broadcast activations and accumulate psums spatially across the PE array.
WS Example: nn-X (NeuFlow) [Farabet et al., ICCV 2009]
• A 3×3 2-D convolution engine: weights stay in place while activations stream in and psums stream out.

WS Example: NVDLA (simplified)
• Released Sept 29, 2017. Image source: Nvidia, http://nvdla.org
• Global Buffer feeds a PE array of M × C MACs.
• (Figure: each column holds the weights for one output channel; input activations, one per input channel, are broadcast across the rows; psums P0 … P7 for the different output channels accumulate down the columns.)
• Animation (several slides): cycle through the input and output fmaps while holding the weights in place; then load new weights and repeat.

Taxonomy: More Examples
• Output Stationary (OS): [Peemen, ICCD 2013], [ShiDianNao, ISCA 2015], [Gupta, ICML 2015], [Moons, VLSI 2016], [Thinker, VLSI 2017]
• Weight Stationary (WS): [Chakradhar, ISCA 2010], [nn-X (NeuFlow), CVPRW 2014], [Park, ISSCC 2015], [ISAAC, ISCA 2016], [PRIME, ISCA 2016], [TPU, ISCA 2017]

Input Stationary (IS)
• Each PE holds an input activation (I0 … I7) in place; weights and psums move.
• Minimize activation read energy consumption: maximize convolutional and fmap reuse of activations.
• Unicast weights and accumulate psums spatially across the PE array.
IS Example: SCNN [Parashar et al., ISCA 2017]
• Used for sparse CNNs; a sparse CNN is one in which many weights are zero.
• Activations also have sparsity from ReLU.

1-D Convolution – Weight Stationary

for (r = 0; r < R; r++)
  for (e = 0; e < E; e++)
    O[e] += I[e+r] * W[r];

• How can we implement input stationary with no input index?

1-D Convolution – Input Stationary

for (h = 0; h < H; h++)
  for (r = 0; r < R; r++)
    O[h-r] += I[h] * W[r];

• Beware: h-r must be >= 0 and < E (a guard is needed at the fmap borders).

Reference Patterns of Different Dataflows

Single PE Setup
• One PE with a MAC unit, fed by a weight buffer and an input activation buffer, writing to an output activation buffer.

Output Stationary – Reference Pattern
• Layer shape: H = 12, R = 4, E = 9. (Figure: reference index vs. cycle for outputs, inputs, and weights.)
• Observations:
  – A single output is reused many times (R).
  – All weights are reused repeatedly.
  – Inputs are accessed in a sliding window (size = R).

Buffer Data Accesses – Output Stationary

for (e = 0; e < E; e++)
  for (r = 0; r < R; r++)
    O[e] += I[e+r] * W[r];
Buffer access counts for the output-stationary loop nest (with outputs stationary, the psum accumulates in a PE register, so the output buffer is never read and each output is written once):

              | OS
MACs          | E*R
Weight Reads  | E*R
Input Reads   | E*R
Output Reads  | 0
Output Writes | E

Weight Stationary – Reference Pattern

for (r = 0; r < R; r++)
  for (e = 0; e < E; e++)
    O[e] += I[e+r] * W[r];

• Observations:
  – A single weight is reused many times (E).
  – Large sliding window of inputs (size = E).
  – Fixed window of outputs (size = E).

Weight Stationary – Costs

              | OS  | WS
MACs          | E*R | E*R
Weight Reads  | E*R | R
Input Reads   | E*R | E*R
Output Reads  | 0   | E*R
Output Writes | E   | E*R

Input Stationary – Reference Pattern

for (h = 0; h < H; h++)
  for (r = 0; r < R; r++)
    O[h-r] += I[h] * W[r];   // beware: h-r must be >= 0 and < E

• Observations:
  – Inputs are used repeatedly (R times).
  – Weights are reused in a large window (size = R).
  – Sliding window of outputs (size = R).

Minimum Costs (assuming H ≈ E)

              | OS  | WS  | IS  | Min
MACs          | E*R | E*R | E*R | E*R
Weight Reads  | E*R | R   | E*R | R
Input Reads   | E*R | E*R | E   | E
Output Reads  | 0   | E*R | E*R | 0
Output Writes | E   | E*R | E*R | E

Intermediate Buffering
• Add an intermediate level of storage: L1 weight, input, and output buffers feed smaller L0 buffers next to the PE.

1-D Convolution – Buffered

int I[H];  // Input activations
int W[R];  // Filter weights
int O[E];  // Output activations, E = H-R+1 (assuming 'valid' style convolution)

// E and R are factored so that E1*E0 = E and R1*R0 = R
// Level 1
for (e1 = 0; e1 < E1; e1++)
  for (r1 = 0; r1 < R1; r1++)
    // Level 0
    for (e0 = 0; e0 < E0; e0++)
      for (r0 = 0; r0 < R0; r0++)
        O[e1*E0+e0] += I[e1*E0+e0 + r1*R0+r0] * W[r1*R0+r0];
Buffer Sizes

// Level 1
for (e1 = 0; e1 < E1; e1++)
  for (r1 = 0; r1 < R1; r1++)
    // Level 0
    for (e0 = 0; e0 < E0; e0++)
      for (r0 = 0; r0 < R0; r0++)
        O[e1*E0+e0] += I[e1*E0+e0 + r1*R0+r0] * W[r1*R0+r0];

• The Level 0 buffer size is the data volume needed in each Level 1 iteration.
• The Level 1 buffer size is the volume that must be preserved and redelivered in future (usually successive) Level 1 iterations.
• A legal assignment of loop limits is one whose working sets fit into the hardware's buffer sizes.

Spatial PEs
• Multiple PEs (PE0, PE1, …) share the weight, input, and output buffers.

1-D Convolution – Spatial

int I[H];  // Input activations
int W[R];  // Filter weights
int O[E];  // Output activations, E = H-R+1 (assuming 'valid' style convolution)

// E1*E0 = E, R1*R0 = R, and R0*E0 <= #PEs
// Level 1
for (r1 = 0; r1 < R1; r1++)
  for (e1 = 0; e1 < E1; e1++)
    // Level 0 - unrolled across the PE array
    spatial-for (r0 = 0; r0 < R0; r0++)
      spatial-for (e0 = 0; e0 < E0; e0++)
        O[e1*E0+e0] += I[e1*E0+e0 + r1*R0+r0] * W[r1*R0+r0];

Summary of DNN Dataflows
• Minimizing data movement is the key to achieving high energy efficiency in DNN accelerators.
• Dataflow taxonomy:
  – Output Stationary: minimize movement of psums
  – Weight Stationary: minimize movement of weights
  – Input Stationary: minimize movement of inputs
• A loop nest provides a compact way to describe the properties of a dataflow, e.g., data tiling in multi-level storage and spatial processing.