Embedded Supercomputing in FPGAs with the VectorBlox MXP™ Matrix Processor

Aaron Severance, UBC / VectorBlox Computing
Prof. Guy Lemieux, UBC / CEO, VectorBlox Computing
http://www.vectorblox.com

© 2012 VectorBlox Computing Inc.

Typical Usage and Motivation
• Embedded processing
  – FPGAs often control custom devices
    • Imaging, audio, radio, screens
  – Heavy data-processing requirements
• FPGA tools for data processing
  – VHDL is too difficult to learn and use
  – C-to-hardware tools are too "VHDL-like"
  – FPGA-based soft CPUs (Nios, MicroBlaze) are too slow
• Complications
  – Very slow recompiles of the FPGA bitstream
  – Device-control circuits may have sensitive timing requirements

A New Tool
• MXP™ Matrix Processor
  – Performance
    • 100x to 1000x over Nios II/f and MicroBlaze
  – Easy to use, pure software
    • Just C, no VHDL/Verilog!
    • No FPGA recompilation for each algorithm change: no bitstream changes, which saves time (FPGA place-and-route can take hours, run out of space, etc.)
  – Correctness
    • Easy to debug, e.g. with printf() or gdb
    • Simulator runs on a PC, e.g. for regression testing
    • Runs on real FPGA hardware, e.g. for real-time testing

Background: Vector Processing
• Data-level parallelism: organize data as long vectors

  C code:
      for ( i=0; i<8; i++ )
          a[i] = b[i] * c[i];

  Vector assembly:
      set   vl, 8
      vmult a, b, c

• Vector instruction execution
  – Multiple vector lanes (SIMD)
  – Hardware automatically repeats the SIMD operation over the entire length of the vector
(Diagram: the 8-element multiply above executing on 4 SIMD vector lanes, reading two source vectors and writing one destination vector.)

Preview: MXP Internals

SYSTEM DESIGN WITH MXP™

MXP™ Processor: Configurable IP

Integrates into Existing Systems

Typical System

Programming MXP
• Libraries on top of vendor tools
  – Eclipse-based IDEs, command-line tools
  – GCC, GDB, etc.
• Functions and macros extend C and C++
  – Vector instructions: ALU, DMA, custom instructions
• The same software runs on different configurations
  – A wider MXP gives higher performance

Example: Adding 3 Vectors

    #include "vbx.h"

    int main()
    {
        const int length = 8;
        int A[length] = {1,2,3,4,5,6,7,8};
        int B[length] = {10,20,30,40,50,60,70,80};
        int C[length] = {100,200,300,400,500,600,700,800};
        int D[length];

        /* Flush the scalar CPU's data cache before DMA. */
        vbx_dcache_flush_all();

        /* Allocate three vectors in the scratchpad. */
        const int data_len = length * sizeof(int);
        vbx_word_t *va = (vbx_word_t*)vbx_sp_malloc( data_len );
        vbx_word_t *vb = (vbx_word_t*)vbx_sp_malloc( data_len );
        vbx_word_t *vc = (vbx_word_t*)vbx_sp_malloc( data_len );

        /* DMA the inputs from main memory into the scratchpad. */
        vbx_dma_to_vector( va, A, data_len );
        vbx_dma_to_vector( vb, B, data_len );
        vbx_dma_to_vector( vc, C, data_len );

        vbx_set_vl( length );
        vbx( VVW, VADD, vb, va, vb );   /* vb = va + vb */
        vbx( VVW, VADD, vc, vb, vc );   /* vc = vb + vc, so vc = A+B+C */

        /* DMA the result back and wait for completion. */
        vbx_dma_to_host( D, vc, data_len );
        vbx_sync();
        vbx_sp_free();
        return 0;
    }

Algorithm Design on FPGAs
• HW and SW development are decoupled
• Select HW parameters and go
  – No VHDL required for computing
  – Only resynthesize when requirements change
• Design SW around these main concepts
  – Vectors of data
  – A scratchpad with DMA
  – The same software can run on any FPGA

MXP™ MATRIX PROCESSOR

MXP™ System Architecture
(Diagram: main memory, e.g. DDR2, attached over the Altera Avalon fabric; the Nios II/f CPU with I$ and D$ drives the MXP through a custom instruction port and an instruction & DMA queue; the MXP DMA engine has master and slave Avalon ports.)
• Three-way concurrency inside the VectorBlox MXP™ Matrix Processor:
  1. Scalar CPU (Nios II/f)
  2. Concurrent DMA (MXP DMA engine)
  3. Vector SIMD (MXP vector engine)
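This three-way concurrency is what MXP code is written against: while the vector engine processes one chunk, the DMA engine can already be fetching the next. Below is a minimal double-buffering sketch built only from the vbx_* calls shown in the "Adding 3 Vectors" example. The function name, chunk size, and the assumption that queued DMA and vector operations respect dependency order are illustrative, not taken from this deck.

    /* Hedged sketch: stream an array through the scratchpad with two
     * buffers, overlapping DMA-in of chunk i+1 with compute on chunk i.
     * ASSUMPTIONS: n is a multiple of CHUNK, and the MXP's instruction &
     * DMA queue executes queued operations in dependency order. */
    #include "vbx.h"

    #define CHUNK 1024   /* elements per scratchpad buffer (illustrative) */

    void double_each( int *dst, int *src, int n )
    {
        vbx_word_t *buf[2];
        buf[0] = (vbx_word_t*)vbx_sp_malloc( CHUNK * sizeof(int) );
        buf[1] = (vbx_word_t*)vbx_sp_malloc( CHUNK * sizeof(int) );

        vbx_dcache_flush_all();
        vbx_set_vl( CHUNK );

        /* Prefetch the first chunk. */
        vbx_dma_to_vector( buf[0], src, CHUNK * sizeof(int) );

        for ( int i = 0; i < n / CHUNK; i++ ) {
            int cur = i & 1, nxt = cur ^ 1;

            /* Queue the fetch of the next chunk; the DMA engine runs it
             * while the vector engine computes on the current chunk. */
            if ( i + 1 < n / CHUNK )
                vbx_dma_to_vector( buf[nxt], src + (i+1)*CHUNK,
                                   CHUNK * sizeof(int) );

            /* Compute in place: each element doubled via self-add. */
            vbx( VVW, VADD, buf[cur], buf[cur], buf[cur] );

            /* Queue the write-back of the finished chunk. */
            vbx_dma_to_host( dst + i*CHUNK, buf[cur], CHUNK * sizeof(int) );
        }

        vbx_sync();     /* wait for all queued DMA and vector work */
        vbx_sp_free();
    }

While these queues drain, the Nios II/f is free to run control code, which is the scalar third of the overlap.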
MXP Internal Architecture (1)
(Diagram: DMA and vector work queues with instruction decode & control; address generation; a custom-instruction link to the Nios II/f and its I$/D$; a DMA engine on the Avalon fabric next to the DDR controller; and a four-bank vector scratchpad (Banks 0-3, data striped across them) feeding ALU0-ALU3 and an accumulator through three alignment networks: Align 1 SrcA, Align 2 SrcB, Align 3 DstC.)

Scratchpad Memory
• Multi-banked, parallel access
  – Addresses are striped across banks, like RAID disks
  – A vector can start at any location
  – A vector can have any length
  – One "wave" of elements can be read every cycle: in one clock cycle, the banks give parallel access to one full wave of vector elements
(Diagram: data striped across four memory banks, with a 10-element vector starting partway into a stripe.)

Scratchpad-based Computing

    vbx_word_t *vdst, *vsrc1, *vsrc2;
    vbx( VVW, VADD, vdst, vsrc1, vsrc2 );   /* operands are scratchpad pointers */

MXP Internal Architecture (2)
(Diagram: the same datapath, extended with custom ALUs (Custom ALU 0, Custom ALU 1, ...) attached alongside ALU0-ALU3.)

Custom Vector Instructions
(Diagram: a) a custom instruction operating within lanes, each lane combining per-lane add/subtract results; b) a custom instruction computing a prefix sum across lanes. Each custom ALU receives per-lane operands A/B, produces per-lane results C with write enables, and is controlled by 2-bit opsize and opcode signals.)

MXP Internal Architecture (3)

Rich Feature Set

  Feature                        MXP
  Register file (scratchpad)     4 kB to 2 MB
  # of vectors (registers)       Unlimited
  Max vector length              Unlimited
  Max element width              32b
  Sub-word SIMD                  2 x 16b, 4 x 8b
  Automatic dispatch/increment   2D/3D
  Parallelism                    1 to 128 lanes (x4 for 8b)
  Clock speed                    Up to 245 MHz
  Latency-hiding                 Concurrent 1D/2D DMA
  Floating-point                 Optional via custom instructions
  User-configurable              DMA, ALUs, multipliers, S/G ports

(A code sketch of sub-word SIMD appears after the area figures below.)

Performance Examples
(Chart: application-kernel speedup factor versus VectorBlox MXP™ processor size.)

Chip Area Requirements

  Stratix IV:
          Nios II/f   V1 4kB   V4 16kB   V16 64kB   V32 128kB   V64 256kB   Stratix IV-530 capacity
  ALMs    1,223       3,433    7,811     21,211     46,411      80,720      212,480
  DSPs    4           12       36        132        260         516         1,024
  M9Ks    14          29       39        112        200         384         1,280

  Cyclone IV:
          Nios II/f   V1 4kB   V4 16kB   V16 64kB   V32 128kB   Cyclone IV-115 capacity
  LEs     2,898       4,467    11,927    45,035     89,436      114,480
  DSPs    4           12       48        192        388         532
  M9Ks    21          32       36        97         165         432

Average Speedup vs. Area
(Chart: average speedup, relative to Nios II/f = 1.0, plotted against area.)
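The sub-word SIMD row in the feature table above shows up directly in source code: narrowing the element type makes each 32b lane process four 8b elements per cycle, with no hardware change. The fragment below is a minimal sketch of that, assuming byte-typed counterparts (vbx_ubyte_t and a VVBU mode) to the vbx_word_t/VVW forms used elsewhere in this deck; those two names follow the library's apparent naming pattern but are assumptions here.

    /* Hedged sketch: element-wise add on unsigned bytes. With 8b data the
     * MXP packs 4 elements per 32b lane (the "4 x 8b" sub-word SIMD row),
     * so a V4 configuration behaves like 16 byte-wide lanes.
     * ASSUMPTION: vbx_ubyte_t and VVBU are the byte analogues of the
     * vbx_word_t/VVW forms shown in this deck. */
    #include "vbx.h"

    void add_bytes( vbx_ubyte_t *vdst, vbx_ubyte_t *vsrc1,
                    vbx_ubyte_t *vsrc2, int num_elems )
    {
        vbx_set_vl( num_elems );               /* length in elements, not bytes */
        vbx( VVBU, VADD, vdst, vsrc1, vsrc2 ); /* one instruction, whole vector */
    }

Because the vector length is set in elements, the same call pattern works for word, halfword, and byte data; only the pointer type and mode keyword change.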
Sobel Edge Detection
• MXP achieves high utilization
  – Long vectors keep data streaming through the functional units
  – Alignment and accumulation happen in the pipeline
  – Concurrent vector/DMA/scalar operation alleviates stalling
(A code sketch of a Sobel-style kernel appears at the end of this deck.)

Current / Future Work
• Multiple-operand custom instructions
  – Custom-RTL performance under vector control
• Modular instruction set
  – Application-specific vector ISA processor
• C++ object programming model

Conclusions
• Vector processing with MXP on FPGAs
  – Easy to use and deploy
  – Scalable performance (area vs. speed)
    • Speedups up to 1000x
  – No hardware recompiling necessary
    • Rapid algorithm development
    • Hardware purely 'sandboxed' from the algorithm

The VectorBlox MXP™ Matrix Processor
• Scalable performance
• Pure C programming
• Direct device access
• No hardware (RTL) design
• Easy to debug

Application Performance Comparison to Intel i7-2600
(The i7-2600 runs on one 3.4 GHz core, without SSE/AVX instructions.)

  Kernel               Intel i7-2600   MXP      Speedup
  fir                  0.05 s          0.05 s   1.0x
  2Dfir                0.36 s          0.43 s   0.8x
  life                 0.13 s          0.19 s   0.7x
  imgblend             0.09 s          0.50 s   0.2x
  median               9.86 s          2.50 s   3.9x
  motion estimation    0.25 s          0.21 s   1.7x
  matrix multiply      50.0 s          15.8 s   3.2x

Benchmark Characteristics

Table III: Benchmark performance and properties

                                                                    Speedup vs. Nios II/f    Performance (millions of elem. per second)
  Benchmark   Data Type In/Out (Intermed.)   Data Set    Taps       V1     V2     V4         Nios II/f   V1       V2       V4       Origin
  autocor     halfword (word)                1024        16         12.9   24.2   41.2       0.46        5.94     11.11    18.94    EEMBC
  rgbcmyk     byte                           896×606                3.9    4.7    5.0        4.56        17.68    21.41    22.72    EEMBC
  rgbyiq      byte (word)                    896×606                1.3    2.1    3.0        5.20        6.74     11.09    15.61    EEMBC
  imgblend    halfword                       320×240                16.1   30.1   52.0       4.83        77.63    145.57   251.18   VIRAM
  filt3x3     byte (halfword)                320×240     3×3        8.0    12.7   17.2       2.11        16.82    26.95    36.42    VIRAM
  median      byte                           128×21      5×5        7.3    14.4   26.6       0.10        0.74     1.45     2.69     custom
  motest      byte                           32×32       16×16      27.4   48.2   72.4       0.09        2.37     4.18     6.29     custom
  fir         halfword (word)                4096        16         6.1    10.5   12.5       3.32        20.11    34.95    41.67    custom
  matmul      word                           1024×1024              12.6   27.4   50.6       11.7        148.20   322.22   593.75   custom
  Geomean                                                           7.95   13.8   20.6

(Charts: speedup vs. Nios II/f and speedup per ALM, with a single Nios II/f CPU as the 1.0 baseline.)

© 2012 VectorBlox Computing Inc.
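To tie the benchmark tables back to the programming model, here is a hedged sketch of the inner step of a Sobel-style 3×3 kernel, the code sketch promised on the Sobel slide above. It assumes three image rows already sit in the scratchpad as word-sized pixels, and it assumes a VSUB opcode analogous to the VADD used throughout this deck; the function and variable names are illustrative.

    /* Hedged sketch: horizontal Sobel gradient Gx for one output row.
     * Gx kernel = [-1 0 1; -2 0 2; -1 0 1], i.e.
     * Gx[x] = (r0[x+2]-r0[x]) + 2*(r1[x+2]-r1[x]) + (r2[x+2]-r2[x]).
     * ASSUMPTION: VSUB is the subtract counterpart of VADD. */
    #include "vbx.h"

    void sobel_gx_row( vbx_word_t *gx,            /* output row   */
                       vbx_word_t *r0,            /* row above    */
                       vbx_word_t *r1,            /* centre row   */
                       vbx_word_t *r2,            /* row below    */
                       vbx_word_t *tmp,           /* scratch row  */
                       int width )
    {
        vbx_set_vl( width - 2 );   /* only width-2 outputs are valid */

        /* A vector may start at any scratchpad location, so r0+2 is
         * simply another (unaligned) vector operand; the alignment
         * networks handle the offset in the pipeline. */
        vbx( VVW, VSUB, gx,  r0 + 2, r0  );   /* top:    r0[x+2]-r0[x]    */
        vbx( VVW, VSUB, tmp, r1 + 2, r1  );   /* middle: r1[x+2]-r1[x]    */
        vbx( VVW, VADD, tmp, tmp,    tmp );   /* weight of 2 via self-add */
        vbx( VVW, VADD, gx,  gx,     tmp );
        vbx( VVW, VSUB, tmp, r2 + 2, r2  );   /* bottom: r2[x+2]-r2[x]    */
        vbx( VVW, VADD, gx,  gx,     tmp );
    }

With long rows, each of these six instructions streams a full wave of elements per cycle through the functional units, which is the high utilization claimed on the Sobel slide.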