Eric S. Chung (1), John D. Davis (1), Srinidhi Kestur (2)
(1) Microsoft Research, Silicon Valley Lab; (2) The Pennsylvania State University
Presented at FCCM'12

Motivation
- Transistors are abundant, power is scarce
- Utilize abundant silicon for FPGA fabrics: energy efficient, with post-silicon flexibility

Challenges
- FPGAs incur large compile-time and reconfiguration overheads
- Must provide significant advantages over other architectures (many-core, GPGPU)
- What are the right applications?

[Figure: dense matrix-vector multiply y = A*x with a 5x4 matrix A (elements A00..A43), a 4-element vector x, and a 5-element result y; a companion figure shows the same product with only the nonzero elements of A retained.]

Matrix-Vector Multiply Is a Critical HPC Kernel
- Tens of papers published per year on this topic
- Existing work on GPU/CPU/FPGA:
  - Performance is sensitive to matrix sparsity and formats
  - Processor-centric data formats
  - High power consumption (GPU/CPU)
- FPGA opportunities:
  - Exploit custom variable-length formats
  - Low power, large memory configurations
  - Efficient, robust resource utilization

[Figure: baseline architecture. A Universal Format Decoder feeds rows to four Gaxpy pipelines (PEs) under Gaxpy Control; a Tiled DMA Engine connects Matrix Memory (A), Vector Memory (x), and Vector Memory (y).]

[Figure: streaming variant. A data streams through the decoder; each Gaxpy pipeline keeps a private cache of x, and results go to Vector Memory (y).]

Sparse Matrix Formats (running 4x4 example)

Dense matrix:
  1 0 4 0
  3 7 0 0
  0 0 2 9
  5 8 0 7

COO:
  Data array:    1 4 3 7 2 9 5 8 7
  Row index:     0 0 1 1 2 2 3 3 3
  Column index:  0 2 0 1 2 3 0 1 3

CSR:
  Data array:    1 4 3 7 2 9 5 8 7
  Row pointer:   0 2 4 6 9
  Column index:  0 2 0 1 2 3 0 1 3
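As a point of reference for the COO and CSR layouts above, here is a small Python sketch (my own illustration, not the authors' hardware) that builds both formats from the 4x4 example matrix and runs y = A*x directly on the CSR arrays:

```python
# Build COO and CSR for the slides' 4x4 example, then compute y = A*x.
A = [
    [1, 0, 4, 0],
    [3, 7, 0, 0],
    [0, 0, 2, 9],
    [5, 8, 0, 7],
]

# COO: one (row, col, value) triple per nonzero, in row-major order.
coo = [(i, j, v) for i, row in enumerate(A) for j, v in enumerate(row) if v != 0]

# CSR: values and column indices, plus a row pointer marking row starts.
data = [v for (_, _, v) in coo]
col_idx = [j for (_, j, _) in coo]
row_ptr = [0]
for i in range(len(A)):
    row_ptr.append(row_ptr[-1] + sum(1 for (r, _, _) in coo if r == i))

def csr_mv(row_ptr, col_idx, data, x):
    """y = A*x using CSR arrays: one sparse dot product per row."""
    y = []
    for i in range(len(row_ptr) - 1):
        y.append(sum(data[k] * x[col_idx[k]]
                     for k in range(row_ptr[i], row_ptr[i + 1])))
    return y

print(data)     # [1, 4, 3, 7, 2, 9, 5, 8, 7]  (matches the slides' data array)
print(col_idx)  # [0, 2, 0, 1, 2, 3, 0, 1, 3]
print(row_ptr)  # [0, 2, 4, 6, 9]
print(csr_mv(row_ptr, col_idx, data, [1, 2, 3, 4]))  # [13, 17, 42, 49]
```

Note that CSR replaces COO's per-nonzero row index with one pointer per row, which is the storage the *BV formats below compress further.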
Padded format (fixed k = 3 nonzeros per row, * = padding, stored column by column):
  Data w/ padding:  1 3 2 5 | 4 7 9 8 | * * * 7
  Column metadata:  0 0 2 0 | 2 1 3 1 | * * * 3

Bit Vector (BV):
  Data array:  1 4 3 7 2 9 5 8 7
  BV (one bit per matrix element, row-major):  1 0 1 0  1 1 0 0  0 0 1 1  1 1 0 1

Compressed BV (CBV):
  CBV:  1 (0,1) 1 (0,1) 1 1 (0,4) 1 1 1 1 (0,1) 1
  - Each zero run is encoded as a fixed 32-bit "zero" field

Compressed Variable BV (CVBV):
  - Same run encoding, but each zero run costs a 4-bit header plus a {4, 8, ..., 32}-bit zero field
  - CVBV overhead is input-dependent

[Figure: CVBV overhead across the sparsity index, from denser to sparser matrices.]

[Figure: final architecture. A Universal Format/CVBV Decoder feeds rows to four Gaxpy pipelines; A data streams in under Gaxpy Control, each pipeline keeps a private cache of x, and results go to Vector Memory (y).]
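The CBV/CVBV storage trade-off can be modeled in a few lines of Python. This is a simplified software sketch of the slides' idea, not the hardware encoder: `cost_bits` charges one bit per nonzero flag, a fixed 32-bit field per zero run for CBV, and a 4-bit header plus the smallest {4, 8, ..., 32}-bit field for CVBV.

```python
# Software model (my simplification) of the CBV/CVBV zero-run encoding.
def zero_runs(bv):
    """Split a bit vector into events: ('1', 1) per set bit,
    ('Z', n) per run of n zeros."""
    events, n = [], 0
    for b in bv:
        if b == 0:
            n += 1
        else:
            if n:
                events.append(('Z', n))
                n = 0
            events.append(('1', 1))
    if n:
        events.append(('Z', n))
    return events

def cost_bits(bv, variable=False):
    """Encoded size in bits: CBV (fixed fields) or CVBV (variable fields)."""
    bits = 0
    for kind, n in zero_runs(bv):
        if kind == '1':
            bits += 1                      # literal one-bit
        elif not variable:
            bits += 32                     # CBV: fixed 32-bit zero field
        else:
            field = next(w for w in (4, 8, 12, 16, 20, 24, 28, 32)
                         if n < 2 ** w)
            bits += 4 + field              # CVBV: 4-bit header + field
    return bits

# Bit vector for the 4x4 example (row-major nonzero pattern).
bv = [1, 0, 1, 0, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 0, 1]
print(len(bv), cost_bits(bv), cost_bits(bv, variable=True))  # 16 137 41
```

On this tiny 16-element example the run fields cost more than the raw BV; the compression pays off on large matrices whose zero runs span thousands of elements, which is where the slides' input-dependent overhead curve comes from.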
Universal Format Decoder
- Specify matrix format descriptors: fixed/variable length, padding, index/pointer, etc.
- Translates row/column indices into a sequence number
- Generates *BV (reducing storage and bandwidth)
- Generates a modified COO stream consumed by the PEs: row index, nonzero count per row, column indices

Resource Utilization and Performance

  Design            PEs  LUT (% area)  RAM (% area)  DSP (% area)  GFLOPs (peak)  GFLOPs (off-chip)  BW (% peak)
  Dense, V5-LX155T  16   72%           86%           88%           3.1            0.92               64.7
  Dense, V6-LX240T  32   71%           63%           56%           6.4            1.14               80
  Dense+Sparse, V5  16   74%           87%           91%           3.1            -                  -

Sparse-Input Comparison (GFLOPS / % of peak BW used)

  Input     V5-LX155T (ours)  HC-1 (32 PE) [1]  Tesla S1070 [2]
  dw8192    0.10 / 10.3%      1.7 / 13.2%       0.5 / 3.1%
  t2d_q9    0.15 / 14.4%      2.5 / 19.3%       0.9 / 5.7%
  epb1      0.17 / 17.1%      2.6 / 20.2%       0.8 / 4.9%
  raefsky1  0.20 / 18.5%      3.9 / 29.0%       2.6 / 15.3%
  psmigr_2  0.20 / 18.6%      3.9 / 29.6%       2.8 / 16.7%
  torso2    0.04 / 4.0%       1.2 / 9.1%        3.0 / 18.3%

[1] Nagar et al., "A Sparse Matrix Personality for the Convey HC-1," FCCM'11
[2] Bell et al., "Implementing Sparse Matrix-Vector Multiplication on Throughput-Oriented Processors," SC'09

Conclusions
- We propose the CVBV/CBV sparse formats: ~25% reduction in storage and bandwidth compared to the well-known CSR format
- Exploits the FPGA's bit-level manipulation capability
- A single bitfile handles dense AND sparse MVM:
  - Universal matrix format decoder
  - DMA engine and caches for memory management
  - Stall-free accumulator
- Scalable design, implemented on multiple platforms

Future Work
- Performance and energy efficiency comparisons against CPUs and GPGPUs
- ASIC core integration
- Sparse *BV load/store engine for future CPU/GPGPU
- Pushing bits over wires is the dominant energy cost: 256 bits across 10 mm costs ~310 pJ/bit today, ~200 pJ/bit in 2017 (vs. a 7x reduction in pJ per FMADD) [Dally, IEEE MICRO'11]
- Will "bit-level" memory systems make sense in the future?

Takeaways
- Bit-level encoding, compaction, and efficiency
- Sparse, bit-level load/store engine
- Applications?
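To make the decoder's output format concrete, here is a hypothetical software model (my own sketch, not the RTL) that turns a row-major bit vector for an m x n matrix into the modified COO records described earlier: one (row index, nonzero count, column indices) record per nonempty row, which is what the Gaxpy PEs consume.

```python
# Software model of the universal decoder's output stage: bit vector in,
# modified COO records out. A real decoder would first expand CBV/CVBV
# zero-run fields back into this bit vector.
def decode_to_modified_coo(bv, n_cols):
    """bv: row-major nonzero bit vector. Returns a list of
    (row index, nonzero count, column indices) per nonempty row."""
    records = []
    for row_start in range(0, len(bv), n_cols):
        row = row_start // n_cols
        cols = [j for j in range(n_cols) if bv[row_start + j]]
        if cols:
            records.append((row, len(cols), cols))
    return records

# Bit vector of the 4x4 example matrix.
bv = [1, 0, 1, 0, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 0, 1]
for rec in decode_to_modified_coo(bv, 4):
    print(rec)
# (0, 2, [0, 2])
# (1, 2, [0, 1])
# (2, 2, [2, 3])
# (3, 3, [0, 1, 3])
```

Pairing each record with the corresponding slice of the data array gives the PEs everything needed for a row's dot product, without any per-nonzero row indices on the wire.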