Towards a Universal FPGA Matrix-Vector Multiplication Architecture*

Eric S. Chung¹, John D. Davis¹, Srinidhi Kestur²
¹Microsoft Research, Silicon Valley Lab
²The Pennsylvania State University
*Presented at FCCM'12
Transistors are abundant, power is scarce
  Utilize abundant silicon for FPGA fabrics
  Energy efficiency and post-silicon flexibility
Challenges
  FPGAs incur large compile-time and reconfiguration overheads
  Must provide significant advantages over other architectures (many-core, GPGPU)
  What are the right applications?
[Figure: dense matrix-vector multiply y = A·x, animated element by element; a 5×4 matrix A (A00…A43) times input vector x (x0…x3) produces output vector y (y0…y4)]
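To make the kernel concrete, here is a minimal Python sketch of gaxpy (y = A·x); the function is illustrative only, and the 4×4 matrix is the running example from the sparse-format slides rather than the 5×4 matrix in the figure:

def gaxpy(A, x):
    """Dense matrix-vector multiply: y[i] = sum of A[i][j] * x[j] over j."""
    return [sum(a * xj for a, xj in zip(row, x)) for row in A]

A = [[1, 0, 4, 0],
     [3, 7, 0, 0],
     [0, 0, 2, 9],
     [5, 8, 0, 7]]
print(gaxpy(A, [1, 1, 1, 1]))  # [5, 10, 11, 20]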
Matrix-Vector Multiply is a Critical HPC Kernel
  10s of papers published per year on this topic
Existing work on GPU/CPU/FPGA
  Performance sensitive to matrix sparsity and formats
  Processor-centric data formats
  High power consumption (GPU/CPU)
FPGA opportunities
  Exploit custom variable-length formats
  Low power, large memory configurations
  Efficient, robust resource utilization
[Figure: dense MVM architecture. A Universal Format Decoder passes rows to Gaxpy Control, which feeds Gaxpy PIPEs 0-3 (one PE each); a Tiled DMA Engine serves Matrix Memory (A), Vector Memory (x), and Vector Memory (y)]
[Figure: sparse MVM architecture. The A data streams in rather than residing in on-chip Matrix Memory, and each Gaxpy PIPE gets a Private Cache (x); Universal Format Decoder, Gaxpy Control, Tiled DMA Engine, and Vector Memory (y) as before]
Running example: a 4×4 sparse matrix A

1 0 4 0
3 7 0 0
0 0 2 9
5 8 0 7
Coordinate (COO) format

Data Array:    1 4 3 7 2 9 5 8 7
Row Index:     0 0 1 1 2 2 3 3 3
Column Index:  0 2 0 1 2 3 0 1 3
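A minimal Python sketch of SpMV driven directly by these COO arrays (illustrative, not the PE datapath):

data = [1, 4, 3, 7, 2, 9, 5, 8, 7]
row  = [0, 0, 1, 1, 2, 2, 3, 3, 3]
col  = [0, 2, 0, 1, 2, 3, 0, 1, 3]

def spmv_coo(data, row, col, x, n_rows):
    # Accumulate each nonzero into its output row.
    y = [0] * n_rows
    for v, i, j in zip(data, row, col):
        y[i] += v * x[j]
    return y

print(spmv_coo(data, row, col, [1, 1, 1, 1], 4))  # [5, 10, 11, 20]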
Compressed Sparse Row (CSR) format

Data Array:    1 4 3 7 2 9 5 8 7
Row Pointer:   0 2 4 6 9
Column Index:  0 2 0 1 2 3 0 1 3
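The same computation from the CSR arrays, where the Row Pointer brackets each row's nonzeros (again an illustrative sketch, not the hardware):

data    = [1, 4, 3, 7, 2, 9, 5, 8, 7]
row_ptr = [0, 2, 4, 6, 9]
col_idx = [0, 2, 0, 1, 2, 3, 0, 1, 3]

def spmv_csr(data, row_ptr, col_idx, x):
    # Row i's nonzeros live at indices row_ptr[i] .. row_ptr[i+1]-1.
    return [sum(data[k] * x[col_idx[k]]
                for k in range(row_ptr[i], row_ptr[i + 1]))
            for i in range(len(row_ptr) - 1)]

print(spmv_csr(data, row_ptr, col_idx, [1, 1, 1, 1]))  # [5, 10, 11, 20]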
Padded format (k = 3 nonzeros per row)

Data w/ Padding:   1 4 *     Column Metadata:   0 2 *
                   3 7 *                        0 1 *
                   2 9 *                        2 3 *
                   5 8 7                        0 1 3
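A sketch of how the padded layout above is built: every row is padded out to k entries, where k is set by the longest row ('*' marks don't-care padding; this Python representation is ours):

rows = [[(0, 1), (2, 4)],           # row 0 as (column, value) pairs
        [(0, 3), (1, 7)],
        [(2, 2), (3, 9)],
        [(0, 5), (1, 8), (3, 7)]]
k = max(len(r) for r in rows)       # k = 3: the longest row sets the width

data_padded = [[v for _, v in r] + ['*'] * (k - len(r)) for r in rows]
col_meta    = [[c for c, _ in r] + ['*'] * (k - len(r)) for r in rows]
print(data_padded)  # [[1, 4, '*'], [3, 7, '*'], [2, 9, '*'], [5, 8, 7]]
print(col_meta)     # [[0, 2, '*'], [0, 1, '*'], [2, 3, '*'], [0, 1, 3]]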
Bit Vector (BV) format

Data Array:      1 4 3 7 2 9 5 8 7
Bit Vector (BV): 1 0 1 0 1 1 0 0 0 0 1 1 1 1 0 1
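A sketch of deriving the BV arrays from the dense matrix with a row-major scan (illustrative Python):

A = [[1, 0, 4, 0], [3, 7, 0, 0], [0, 0, 2, 9], [5, 8, 0, 7]]
flat = [v for row in A for v in row]             # row-major order
data_array = [v for v in flat if v != 0]         # [1, 4, 3, 7, 2, 9, 5, 8, 7]
bit_vector = [1 if v != 0 else 0 for v in flat]  # one bit per matrix entry
print(bit_vector)  # [1, 0, 1, 0, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 0, 1]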
Compressed BV (CBV) format

Data Array:      1 4 3 7 2 9 5 8 7
Bit Vector (BV): 1 0 1 0 1 1 0 0 0 0 1 1 1 1 0 1
CBV:             1 (0,1) 1 (0,1) 1 1 (0,4) 1 1 1 1 (0,1) 1
  with 32-bit "zero" fields
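A sketch of the CBV idea: each run of zeros collapses into a (0, run_length) token, whose run length would occupy a fixed 32-bit "zero" field in hardware (the encoder below is ours):

def cbv_encode(bv):
    out, i = [], 0
    while i < len(bv):
        if bv[i]:
            out.append(1)                # nonzero: emit a single 1 bit
            i += 1
        else:
            run = 0                      # zero run: count its length
            while i < len(bv) and not bv[i]:
                run += 1
                i += 1
            out.append((0, run))
    return out

bv = [1, 0, 1, 0, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 0, 1]
print(cbv_encode(bv))
# [1, (0, 1), 1, (0, 1), 1, 1, (0, 4), 1, 1, 1, 1, (0, 1), 1]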
Compressed Variable BV (CVBV) format

Data Array:      1 4 3 7 2 9 5 8 7
Bit Vector (BV): 1 0 1 0 1 1 0 0 0 0 1 1 1 1 0 1
CVBV:            1 (0,1) 1 (0,1) 1 1 (0,4) 1 1 1 1 (0,1) 1
  with 4-bit header + {4,8,…,32}-bit zero field
CVBV overhead is input-dependent
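CVBV replaces CBV's fixed 32-bit zero field with the narrowest field that holds the run length. The width set {4, 8, ..., 32} in steps of 4 and the cost model below are our reading of the "4-bit header + {4,8,…,32}-bit zero field" scheme, a sketch rather than a verified spec:

def cvbv_run_cost_bits(run_length):
    # 4-bit header selects the zero-field width; pick the smallest that fits.
    for width in (4, 8, 12, 16, 20, 24, 28, 32):
        if run_length < (1 << width):
            return 4 + width
    raise ValueError("run length exceeds 32-bit field")

print(cvbv_run_cost_bits(4))    # 8 bits, vs. CBV's fixed 32-bit zero field
print(cvbv_run_cost_bits(300))  # 16 bits (4-bit header + 12-bit field)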
[Figure: storage comparison of the formats across a sparsity index, ranging from denser to sparser matrices]
[Figure (two views): sparse MVM architecture with a combined Universal Format/CVBV Decoder; A data streams into Gaxpy PIPEs 0-3, each with a private cache (x), under Gaxpy Control, with a Tiled DMA Engine and Vector Memory (y)]
Universal format decoder
  Specify matrix format descriptors
    Fixed/variable length, padding, index/ptr, etc.
  Translates row/column into sequence #
  Generates *BV (reduces storage/BW)
  Generates modified COO consumed by the PEs (see the sketch below)
    Row index, nonzero count per row, column indices
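A sketch of the decoder's output for the running example, reconstructed here from the BV (the real decoder accepts any described format; this interface is illustrative, not the hardware's):

def decode_to_modified_coo(bv, n_cols):
    # Per row: (row index, nonzero count, column indices), as listed above.
    out = []
    for i in range(0, len(bv), n_cols):
        cols = [j for j, bit in enumerate(bv[i:i + n_cols]) if bit]
        out.append((i // n_cols, len(cols), cols))
    return out

bv = [1, 0, 1, 0, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 0, 1]
for entry in decode_to_modified_coo(bv, 4):
    print(entry)
# (0, 2, [0, 2])  (1, 2, [0, 1])  (2, 2, [2, 3])  (3, 3, [0, 1, 3])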
Resource utilization and performance:

Design        Device     PEs  LUT       RAM       DSP       GFLOPS  GFLOPS      BW
                               (% area)  (% area)  (% area)  (Peak)  (off-chip)  (% peak)
Dense         V5-LX155T  16   72%       86%       88%       3.1     0.92        64.7
Dense         V6-LX240T  32   71%       63%       56%       6.4     1.14        80
Dense+Sparse  V5         16   74%       87%       91%       3.1     -           -
Performance on sparse inputs (GFLOPS / BW Used):

Sparse Input  V5-LX155T (Ours)   HC-1 (32 PE) [1]   Tesla S1070 [2]
              GFLOPS / BW Used   GFLOPS / BW Used   GFLOPS / BW Used
dw8192        0.10 / 10.3%       1.7 / 13.2%        0.5 / 3.1%
t2d_q9        0.15 / 14.4%       2.5 / 19.3%        0.9 / 5.7%
epb1          0.17 / 17.1%       2.6 / 20.2%        0.8 / 4.9%
raefsky1      0.20 / 18.5%       3.9 / 29.0%        2.6 / 15.3%
psmigr_2      0.20 / 18.6%       3.9 / 29.6%        2.8 / 16.7%
torso2        0.04 / 4.0%        1.2 / 9.1%         3.0 / 18.3%

[1] Nagar et al., "A Sparse Matrix Personality for the Convey HC-1," FCCM'11.
[2] Bell et al., "Implementing Sparse Matrix-Vector Multiplication on Throughput-Oriented Processors," SC'09.
Conclusions
  We propose the CVBV/CBV sparse formats
    25% reduction in storage/bandwidth compared to the well-known CSR
    Exploit bit-level manipulation on the FPGA
  A single bit file serves dense AND sparse MVM
    Universal matrix format decoder
    DMA and caches for memory management
    Stall-free accumulator
  Scalable design, implemented on multiple platforms
Future work
  Performance and energy efficiency compared to CPUs and GPGPUs
  ASIC core integration
  Sparse *BV load/store engine for future CPU/GPGPU

Pushing bits over wires is the dominant energy cost
  256 bits across 10 mm: today @ 310 pJ/bit → 200 pJ/bit in 2017
  (vs. a 7x reduction in pJ/FMADD) [Dally, IEEE MICRO'11]
  Will "bit-level" memory systems make sense in the future?

Bit-level encoding/compaction/efficiency
Sparse, bit-level load-store engine
Applications?