Embedded Supercomputing in FPGAs
with the VectorBlox
MXP Matrix Processor
Aaron Severance, UBC
VectorBlox Computing
Prof. Guy Lemieux, UBC
CEO VectorBlox Computing
http://www.vectorblox.com
Typical Usage and Motivation
• Embedded processing
– FPGAs often control custom devices
• Imaging, audio, radio, screens
– Heavy data processing requirements
• FPGA tools for data processing
– VHDL too difficult to learn and use
– C-to-hardware tools too “VHDL-like”
– FPGA-based CPUs (Nios/MicroBlaze) too slow
• Complications
– Very slow recompiles of FPGA bitstream
– Device control circuits may have sensitive timing requirements
A New Tool
• MXP™ Matrix Processor
– Performance
• 100x – 1000x over Nios II/f, MicroBlaze
– Easy to use, pure software
• Just C, no VHDL/Verilog!
– No FPGA recompilation for each algorithm change
• No bitstream changes
• Save time (FPGA place+route can take hours, run out of space, etc.)
– Correctness
• Easy to debug, e.g., with printf() or gdb
• Simulator runs on a PC, e.g., for regression testing
• Runs on real FPGA hardware, e.g., for real-time testing
Background: Vector Processing
• Data-level parallelism
• Organize data as long vectors
C Code:
    for ( i=0; i<8; i++ )
        a[i] = b[i] * c[i];
Vector Assembly:
    set vl, 8
    vmult a, b, c
[Figure: 4 SIMD vector lanes reading the source vectors and writing the destination vector]
• Vector instruction execution
– Multiple vector lanes (SIMD)
– Hardware automatically repeats the SIMD operation over the entire length of the vector
Preview: MXP Internals
SYSTEM DESIGN WITH MXP™
MXP™ Processor: Configurable IP
Integrates into Existing Systems
Typical System
Programming MXP
• Libraries on top of vendor tools
– Eclipse-based IDEs, command-line tools
– GCC, GDB, etc.
• Functions and Macros extend C, C++
– Vector Instructions
• ALU, DMA, Custom Instructions
• Same software for different configurations
– Wider MXP configuration -> higher performance
Example: Adding 3 Vectors
#include "vbx.h"
int main()
{
    const int length = 8;
    int A[length] = {1,2,3,4,5,6,7,8};
    int B[length] = {10,20,30,40,50,60,70,80};
    int C[length] = {100,200,300,400,500,600,700,800};
    int D[length];

    // Flush the CPU data cache so the DMA engine sees current data
    vbx_dcache_flush_all();

    // Allocate three vectors in the MXP scratchpad
    const int data_len = length * sizeof(int);
    vbx_word_t *va = (vbx_word_t*)vbx_sp_malloc( data_len );
    vbx_word_t *vb = (vbx_word_t*)vbx_sp_malloc( data_len );
    vbx_word_t *vc = (vbx_word_t*)vbx_sp_malloc( data_len );

    // DMA the input arrays from main memory into the scratchpad
    vbx_dma_to_vector( va, A, data_len );
    vbx_dma_to_vector( vb, B, data_len );
    vbx_dma_to_vector( vc, C, data_len );

    // vb = va + vb, then vc = vb + vc (word-sized vector adds)
    vbx_set_vl( length );
    vbx( VVW, VADD, vb, va, vb );
    vbx( VVW, VADD, vc, vb, vc );

    // DMA the result back to main memory and wait for completion
    vbx_dma_to_host( D, vc, data_len );
    vbx_sync();

    // Release the scratchpad allocations
    vbx_sp_free();
    return 0;
}
Algorithm Design on FPGAs
• HW and SW development is decoupled
• Select HW parameters and go
– No VHDL required for computing
– Only resynthesize when requirements change
• Design SW with these main concepts
– Vectors of data
– Scratchpad with DMA
– Same software can run on any FPGA
MXP™ MATRIX PROCESSOR
MXP™ System Architecture
[Figure: MXP™ system block diagram. A Nios II/f scalar CPU (with I$ and D$) attaches to the VectorBlox MXP through its custom instruction port; an instruction & DMA queue feeds the MXP vector engine, and the MXP DMA engine masters the Altera Avalon fabric to reach main memory (e.g., DDR2). Three-way concurrency: 1. scalar CPU, 2. concurrent DMA, 3. vector SIMD.]
MXP Internal Architecture (1)
[Figure: MXP internal architecture. DMA and vector work queues, with instruction decode & control and address generation, sit between the Nios II/f (I$, D$) and the vector engine. The vector scratchpad is divided into banks 0-3, each paired with its own ALU; alignment networks (Align 1 SrcA, Align 2 SrcB, Align 3 DstC) shift operands, with an accumulator and hooks for custom instructions. The DMA engine masters the Avalon fabric to the DDR controller.]
Scratchpad Memory
• Multi-banked, parallel access
– Addresses striped across banks, like RAID disks
[Figure: words 0-F striped across four memory banks]
Scratchpad Memory
• Multi-banked, parallel access
– Vector can start at any location
[Figure: the same striped banks; a vector may start at any word]
Scratchpad Memory
• Multi-banked, parallel access
– Vector can start at any location
– Vector can have any length
[Figure: a vector of length 10 starting mid-stripe, spanning the four banks]
Scratchpad Memory
• Multi-banked, parallel access
– Vector can start at any location
– Vector can have any length
– One “wave” of elements can be read every cycle
[Figure: in one clock cycle, all four banks are accessed in parallel, reading one full “wave” of vector elements]
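To make the striping concrete, here is a minimal C sketch of the addressing model pictured above. It assumes four banks and word-granularity striping; the real bank count is an MXP configuration parameter, so the constant is illustrative only.

#include <stdio.h>

#define NUM_BANKS 4  /* illustrative; the MXP bank count is configurable */

/* Map a scratchpad word index to its (bank, wave) position.
   Consecutive words land in consecutive banks, like RAID striping,
   so one "wave" of NUM_BANKS elements can be read per clock cycle. */
static void locate(unsigned word, unsigned *bank, unsigned *wave)
{
    *bank = word % NUM_BANKS;
    *wave = word / NUM_BANKS;
}

int main(void)
{
    /* A vector of length 10 starting at word 5 touches waves 1..3,
       so it needs three cycles of parallel bank reads. */
    unsigned i, bank, wave;
    for (i = 5; i < 15; i++) {
        locate(i, &bank, &wave);
        printf("element %2u -> bank %u, wave %u\n", i, bank, wave);
    }
    return 0;
}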
Scratchpad-based Computing
vbx_word_t *vdst, *vsrc1, *vsrc2;      /* pointers into the scratchpad */
vbx( VVW, VADD, vdst, vsrc1, vsrc2 );  /* vector-vector word add */
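For reference, a scalar sketch of what that single instruction does. Here vbx_word_t is treated as a plain 32-bit int, and the loop bound plays the role of the vector length set by vbx_set_vl():

/* Scalar equivalent of vbx( VVW, VADD, vdst, vsrc1, vsrc2 ):
   one word-sized add per element, repeated over the vector length.
   MXP performs one full wave of these adds per clock cycle. */
void scalar_vadd(int *vdst, const int *vsrc1, const int *vsrc2, int vl)
{
    for (int i = 0; i < vl; i++)
        vdst[i] = vsrc1[i] + vsrc2[i];
}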
MXP Internal Architecture (2)
[Figure: the same internal architecture as (1), extended with optional custom ALUs (Custom ALU 0, Custom ALU 1, ...) attached alongside the per-bank ALUs.]
Custom Vector Instructions
[Figure: datapath and timing (clock/start/valid, opsize/opcode fields) for two custom instruction examples: a) a custom instruction operating within lanes (per-lane ADD, SUB, and x2 stages); b) a custom instruction computing a prefix sum across lanes.]
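Custom instructions are dispatched from C with the same vbx() call used for built-in operations. A minimal sketch, assuming the first custom ALU slot is exposed under the opcode name VCUSTOM0 (the exact name, and what the instruction computes, depend on the SDK version and the RTL you attach):

#include "vbx.h"

/* Issue a custom vector instruction over 64 word-sized elements.
   VCUSTOM0 is an assumed opcode for custom ALU slot 0; the operation
   it performs (e.g., the per-lane ADD/SUB or the cross-lane prefix
   sum from the figure) is defined by the attached custom RTL. */
void run_custom(vbx_word_t *vdst, vbx_word_t *vsrcA, vbx_word_t *vsrcB)
{
    vbx_set_vl( 64 );
    vbx( VVW, VCUSTOM0, vdst, vsrcA, vsrcB );
    vbx_sync();  /* wait for the queued instruction to complete */
}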
MXP Internal Architecture (3)
Rich Feature Set
Feature                         MXP
Register file                   4 kB to 2 MB
# Vectors (registers)           unlimited
Max vector length               unlimited
Max element width               32b
Sub-word SIMD                   2 x 16b, 4 x 8b
Automatic dispatch/increment    2D/3D
Parallelism                     1 to 128 lanes (x4 for 8b)
Clock speed                     up to 245 MHz
Latency-hiding                  concurrent 1D/2D DMA
Floating-point                  optional via custom instructions
User-configurable               DMA, ALUs, multipliers, S/G ports
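The 2D automatic dispatch feature lets one instruction sweep a whole sub-matrix. A minimal sketch of an element-wise matrix add, assuming a vbx_set_2D( rows, dst_stride, srcA_stride, srcB_stride ) call with byte strides and a vbx_2D() dispatch macro; check the SDK headers for the exact signatures:

#include "vbx.h"

/* Element-wise add of two rows x cols matrices already resident in
   the scratchpad. After vbx_set_2D(), a single vbx_2D() instruction
   repeats a cols-long vector add over every row, stepping each
   pointer by its stride between rows. Signatures are assumptions. */
void matrix_add_2d(vbx_word_t *vd, vbx_word_t *va, vbx_word_t *vb,
                   int rows, int cols)
{
    int stride = cols * sizeof(vbx_word_t);  /* contiguous rows */
    vbx_set_vl( cols );
    vbx_set_2D( rows, stride, stride, stride );
    vbx_2D( VVW, VADD, vd, va, vb );
    vbx_sync();
}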
Performance Examples
[Chart: speedup factor of application kernels vs. VectorBlox MXP™ processor size]
Chip Area Requirements
Stratix IV-530 (device capacity in last column):

        Nios II/f   V1 (4k)   V4 (16k)   V16 (64k)   V32 (128k)   V64 (256k)   Stratix IV-530
ALMs    1,223       3,433     7,811      21,211      46,411       80,720       212,480
DSPs    4           12        36         132         260          516          1,024
M9Ks    14          29        39         112         200          384          1,280

Cyclone IV-115 (device capacity in last column):

        Nios II/f   V1 (4k)   V4 (16k)   V16 (64k)   V32 (128k)   Cyclone IV-115
LEs     2,898       4,467     11,927     45,035      89,436       114,480
DSPs    4           12        48         192         388          532
M9Ks    21          32        36         97          165          432
Average Speedup vs. Area
(Relative to Nios II/f = 1.0)
Sobel Edge Detection
• MXP achieves high utilization (see the sketch below)
– Long vectors keep data streaming through the FUs
– In-pipeline alignment and accumulation
– Concurrent vector/DMA/scalar operation alleviates stalling
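As one concrete step of the long-vector style, a minimal sketch of a horizontal difference across an image row held in the scratchpad. Because a vector may start at any scratchpad address, shifted pointers stand in for neighbor access; this is an illustrative fragment in the vbx() style shown earlier, not VectorBlox's actual Sobel code:

#include "vbx.h"

/* Horizontal gradient term for one image row of 'width' words already
   in the scratchpad: dst[i] = row[i+2] - row[i], i.e., the central
   difference around element i+1. A single long-vector VSUB keeps the
   functional units streaming; width-2 interior results are produced. */
void hdiff_row(vbx_word_t *dst, vbx_word_t *row, int width)
{
    vbx_set_vl( width - 2 );
    vbx( VVW, VSUB, dst, row + 2, row );
}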
Current/Future Work
• Multiple-operand custom instructions
– Custom RTL performance, vector control
• Modular Instruction Set
– Application-Specific Vector ISA Processor
• C++ object programming model
Conclusions
• Vector processing with MXP on FPGAs
– Easy to use/deploy
– Scalable performance (area vs. speed)
• Speedups up to 1000x
– No hardware recompilation necessary
• Rapid algorithm development
• Hardware purely ‘sandboxed’ from algorithm
The VectorBlox MXP™
Matrix Processor
• Scalable performance
• Pure C programming
• Direct device access
• No hardware (RTL) design
• Easy to debug
Application Performance
Comparison to Intel i7-2600
(running on one 3.4GHz core, without SSE/AVX instructions)
CPU             Fir     2Dfir   Life    Imgblend   Median   Motion Est.   Matrix Mult.
Intel i7-2600   0.05s   0.36s   0.13s   0.09s      9.86s    0.25s         50.0s
MXP             0.05s   0.43s   0.19s   0.50s      2.50s    0.21s         15.8s
Speedup         1.0x    0.8x    0.7x    0.2x       3.9x     1.7x          3.2x
Benchmark Characteristics
Table III: Benchmark performance and properties

Benchmark   Origin   Data Set    Taps    In/Out     Intermed.   Performance (millions of elem./second)   Speedup vs. Nios II/f
                     Size                Data       Data        Nios II/f   V1       V2       V4         V1     V2     V4
autocor     EEMBC    1024        16      halfword   word        0.46        5.94     11.11    18.94      12.9   24.2   41.2
rgbcmyk     EEMBC    896×606             byte                   4.56        17.68    21.41    22.72      3.9    4.7    5.0
rgbyiq      EEMBC    896×606             byte       word        5.20        6.74     11.09    15.61      1.3    2.1    3.0
imgblend    VIRAM    320×240             halfword               4.83        77.63    145.57   251.18     16.1   30.1   52.0
filt3x3     VIRAM    320×240     3×3     byte                   2.11        16.82    26.95    36.42      8.0    12.7   17.2
median      custom   128×21      5×5     byte                   0.10        0.74     1.45     2.69       7.3    14.4   26.6
motest      custom   32×32       16×16   byte                   0.09        2.37     4.18     6.29       27.4   48.2   72.4
fir         custom   4096        16      halfword               3.32        20.11    34.95    41.67      6.1    10.5   12.5
matmul      custom   1024×1024           word                   11.7        148.20   322.22   593.75     12.6   27.4   50.6
Geomean                                                                                                  7.95   13.8   20.6

[Charts: Speedup vs. Nios II/f and Speedup per ALM for the V1/V2/V4 configurations, relative to a single Nios II/f CPU]