Reconfigurable computing for future vision-capable devices

Miguel Bordallo López 1, Alejandro Nieto 2, Olli Silvén 1, Jani Boutellier 1, David López Vilariño 2
1 Center for Machine Vision Research, University of Oulu, Finland
2 CITIUS, University of Santiago de Compostela, Spain

Outline
• Future vision-capable devices
• Experimental setup
• Reconfigurable architectures
  – EnCore processor and Configurable Flow Accelerator
  – Hybrid SIMD/MIMD architecture
  – Transport-Triggered Architecture based processors
• Summary

Future vision-capable devices
• Computer vision algorithms are becoming more and more sophisticated
  – SISD, SIMD and MIMD paradigms; several common building blocks
• Efficient hardware architectures are required to address their high computational demands
  – Optimizing for latency and throughput
• Current architectures have drawbacks in future scalability or flexibility
• Processor design is a compromise between flexibility and performance
• Power consumption becomes critical

Reconfigurable computing devices
• Reconfigurable computing devices can rewire themselves dynamically
• They adapt effectively to different programming tasks
• Ideally, they transform themselves to run across different application types with the highest performance
• They enable reusable transistors, saving die area, reducing power consumption and increasing overall performance
• Computer vision application characteristics overlap significantly with the advantages of reconfigurable architectures

Contribution
Implementing a set of building blocks of computer vision algorithms on reconfigurable architectures and comparing them with their current "fixed" counterparts:
• Scenario 1: CPU + SIMD unit vs. low-power CPU + reconfigurable accelerator
• Scenario 2: Mobile GPU vs. hybrid SIMD/MIMD processor
• Scenario 3: DSP vs. TTA-based ASIPs

Experimental setup for comparison
• OMAP 3 family (OMAP3530)
  – ARM Cortex-A8 CPU
  – PowerVR SGX530 GPU
  – TI C6416 DSP
• Two set-ups:
  – BeagleBoard revision 3
  – Nokia N900
• Operating point:
  – ARM @ 600 MHz
  – GPU @ 110 MHz
  – DSP @ 430 MHz
• Measuring power consumption:
  – Multimeter (BeagleBoard)
  – Battery sensor readings (N900)
  – Vendor spreadsheets … about 10-20% discrepancies
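The results in the following sections are reported as cycles per pixel (CPP) and energy per pixel (nJPP). A minimal C sketch of how the two metrics relate under a fixed operating point is shown below; the conversion simply assumes energy = power x time at the reported clock frequency and chip power, and the function and variable names are illustrative only.

```c
#include <stdio.h>

/* Energy per pixel (nJ) from cycles per pixel, clock frequency and average
 * power, assuming energy = power * time:  nJPP = P[W] * (CPP / f[Hz]) * 1e9 */
static double njoules_per_pixel(double cpp, double freq_hz, double power_w)
{
    return power_w * (cpp / freq_hz) * 1e9;
}

int main(void)
{
    /* Example: EnCore Castle chip figures from the implementation slide
     * (70 mW at 600 MHz); 66 CPP is the CFA grayscale-conversion result. */
    double njpp = njoules_per_pixel(66.0, 600e6, 0.070);
    printf("Grayscale conversion on the CFA: %.1f nJ/pixel\n", njpp); /* ~7.7 */
    return 0;
}
```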
Extensible low-power processor and reconfigurable accelerator

EnCore processor
• The EnCore processor is a 32-bit RISC core
  – Implements the ARCompact instruction set
  – Includes an extended datapath to embed user-defined instruction set extensions (ISEs)
• Extensions are based on the analysis of standard compiled code
  – Expensive operations are moved to dedicated hardware
  – The dedicated hardware is implemented as a configurable flow accelerator, the CFA unit

Configurable Flow Accelerator (CFA)
• The CFA utilizes several single-function ALUs
  – Allows spatial and temporal parallelism (resource sharing and pipelining)
  – Fully programmable (up to 64 reconfigurable extension instructions)
  – Up to 12 inputs and 8 outputs per ISE, 32 extension registers
  – Each ISE: 3-stage pipeline, 4 independent arithmetic operations
• Identifying candidate specific instructions
  – Automatic (modified compiler) or manual
  – Candidates result in templates adapted to the existing CFA, or in the generation of a new template
  – Trade-off between reusability and performance (low latency)
[Figure: ISE design flow — identify processor extensions, select extensions, adapt them to an existing template or create a new one.]

EnCore processor implementation
• EnCore Castle chip
  – Developed at the University of Edinburgh (PASTA group)
  – Generic 90 nm CMOS process
  – Only 2.25 mm² including the CFA and 2x32 KB caches
  – Embedded in an SoC with a 32-bit memory interface
  – Fixed clock @ 600 MHz
  – Core voltage: 0.9 V to 1.1 V
  – Chip-level power consumption: 70 mW

Performance evaluation

Performance (CPP):
Operation              ARM    NEON   EnCore   CFA
Grayscale conversion   216    156    240       66
Image displacement      78     56     50       47
Alpha blending         141    100     86       20
Blur detection          73     52     84       19
Convolution 3x3        423    302    199       58
Histogram               21     21     29       20
Image rotation         546    390    608      234
Image scaling          384    250    390      143

Energy (nJPP):
Operation              ARM     NEON    EnCore   CFA
Grayscale conversion   198.1   174.2   28.0      7.7
Image displacement      71.5    62.5    5.8      5.5
Alpha blending         129.1   111.7   10.0      2.3
Blur detection          19.5    58.0    9.8      2.3
Convolution 3x3         66.0    37.2   23.2      6.8
Histogram               19.3    23.5    3.4      2.3
Image rotation         500.0   435.5   70.9     27.3
Image scaling          352.3   279.2   45.5     16.7

[Bar charts: cycles per pixel and nJoules per pixel for ARM, NEON, EnCore and CFA over the eight operations; the original charts annotate cycle reductions of roughly 1x to 5x and energy reductions of roughly 10x to 50x for the CFA.]
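For reference, the benchmarked operations are standard pixel-level building blocks. A minimal scalar C sketch of two of them, grayscale conversion and alpha blending, is given below; the exact coefficients and data layouts used in the measured implementations are not specified in the slides, so the integer BT.601-style luma weights and the 8-bit alpha mask here are assumptions.

```c
#include <stdint.h>
#include <stddef.h>

/* Grayscale conversion: integer approximation of the BT.601 luma weights
 * (0.299 R + 0.587 G + 0.114 B), scaled by 256. Input is packed RGB. */
void grayscale(const uint8_t *rgb, uint8_t *gray, size_t pixels)
{
    for (size_t i = 0; i < pixels; i++) {
        const uint8_t *p = rgb + 3 * i;
        gray[i] = (uint8_t)((77 * p[0] + 150 * p[1] + 29 * p[2]) >> 8);
    }
}

/* Alpha blending with a per-pixel 8-bit alpha mask:
 * out = (a*src + (255-a)*dst) / 255, approximated by >> 8 with a bias. */
void alpha_blend(const uint8_t *src, const uint8_t *dst,
                 const uint8_t *alpha, uint8_t *out, size_t pixels)
{
    for (size_t i = 0; i < pixels; i++) {
        unsigned a = alpha[i];
        unsigned v = a * src[i] + (255u - a) * dst[i];
        out[i] = (uint8_t)((v + 128) >> 8);
    }
}
```

Tight per-pixel loops of this form are what the NEON SIMD unit, the CFA extension instructions and the architectures in the following sections parallelize.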
Reconfigurable SIMD/MIMD hybrid embedded processor

Hybrid SIMD/MIMD processor
• Reconfigurable on-the-fly between SIMD and MIMD operation modes

Nieto, A., López Vilariño, D., Brea, V.: SIMD/MIMD dynamically-reconfigurable architecture for high-performance embedded vision systems. IEEE ASAP 2012.

Hybrid architecture elements
• Programmable input/output processor
  – Transfers data between the computing cores and external RAM
  – An address generation unit allows flexible data access pattern generation
  – Simple program flow control
  – A single operation for data transfers: move <source_addr>, <dest_addr>, <size>
• Reduced-instruction-set (RISC) processing elements (PEs)
  – Storage: register bank (temporary data) and memory element (large data/program)
  – Automatic transfers between PEs (stream queues)
  – ALUs with 1-3 operands (arithmetic/logical/DSP)

SIMD vs. MIMD modes
• SIMD layout: 1-dimensional processor array
  – A SIMD control unit manages the PEs; all execute the same instruction
  – Processing and data transfers overlap
  – Data exchange: synchronous side-to-side network
• MIMD layout: 2-dimensional processor torus
  – Every PE runs its own program
  – Units work in parallel
  – Data exchange: synchronized local network

Implementation
• Board: Xilinx ML605
  – 512 MB DDR3, 64-bit SO-DIMM @ 400 MHz
  – 1400 MB/s (read), 1140 MB/s (write)
  – Power estimated with the Xilinx Power Estimator (XPE)
• FPGA: Xilinx Virtex-6 XC6VLX240T-1
  – 416 Block RAMs (36 kbit each)
  – 468 DSP48E1 slices
• SoC configuration
  – Main CPU: MicroBlaze soft-processor
  – 128 32-bit processing elements: fixed-point arithmetic, 8-word register bank, 1024-word memory element, 4x4-word stream queue
  – Coding: pseudo-assembly
  – 19.6 GOP/s @ 150 MHz
  – Maximum power consumption: 7.2 W (1.9 W static, 5.3 W dynamic)

Performance evaluation

Performance (CPP):
Operation              ARM+NEON   Mobile GPU   Hybrid SIMD/MIMD
Grayscale conversion   156        13.4         2.1
Image displacement     56         13.6         1.3
Alpha blending         100        13.6         1.0
Feature extraction     549        75.5         0.7
Blur detection         52         100.7        1.0
LBP extraction         37         18.1         0.2
2D convolution         302        160.0        1.0
Histogram              302        ----         2.4
Image rotation         1          13.6         12.0
Image scaling          250        20           136.7

Energy (nJPP):
Operation              ARM+NEON   Mobile GPU   Hybrid SIMD/MIMD
Grayscale conversion   174.2      11.3         72.8
Image displacement     62.5       11.5         45.0
Alpha blending         111.7      11.5         34.7
Feature extraction     613.0      63.8         24.3
Blur detection         58.0       85.1         34.7
LBP extraction         41.3       15.2         6.9
2D convolution         337.2      135.3        34.7
Histogram              1.22       ----         83.2
Image rotation         435.5      11.5         416
Image scaling          279.1      16.9         4714

[Bar charts: cycles per pixel and nJoules per pixel for ARM+NEON, the mobile GPU and the hybrid SIMD/MIMD processor; the original charts carry annotations of about 150x and 0.9x on the cycle chart and about 4x and 0.02x on the energy chart.]
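In SIMD mode the architecture overlaps processing with the data transfers issued by the I/O processor. Below is a rough, host-side C model of that double-buffering scheme; the hybrid is actually programmed in pseudo-assembly and the real transfers are driven by its move primitive, so the names, the memcpy stand-in and the stripe size are purely illustrative. In this sequential model the "overlap" only shows the buffer management; on the real hardware the transfer of stripe k+1 and the processing of stripe k run concurrently.

```c
#include <string.h>
#include <stdint.h>
#include <stddef.h>

#define STRIPE_PIXELS 1024              /* illustrative stripe size           */

/* Stand-in for the I/O processor's "move <src>, <dst>, <size>" transfer.     */
static void io_move(const uint8_t *src, uint8_t *dst, size_t size)
{
    memcpy(dst, src, size);
}

/* Stand-in for the operation the PE array applies to one stripe.             */
static void process_stripe(const uint8_t *in, uint8_t *out, size_t n)
{
    for (size_t i = 0; i < n; i++)
        out[i] = (uint8_t)(255 - in[i]);        /* e.g. image inversion        */
}

/* Double buffering: stripe k is processed from one buffer while the other
 * buffer is being filled with stripe k+1 (concurrently on real hardware).    */
void process_image(const uint8_t *image, uint8_t *result, size_t stripes)
{
    static uint8_t buf[2][STRIPE_PIXELS];
    int cur = 0;

    io_move(image, buf[cur], STRIPE_PIXELS);                 /* prefetch first */
    for (size_t k = 0; k < stripes; k++) {
        if (k + 1 < stripes)                                  /* fetch next     */
            io_move(image + (k + 1) * STRIPE_PIXELS, buf[1 - cur], STRIPE_PIXELS);
        process_stripe(buf[cur], result + k * STRIPE_PIXELS, STRIPE_PIXELS);
        cur = 1 - cur;                                        /* swap buffers   */
    }
}
```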
Transport-Triggered Architecture-based application-specific processors

ASIPs vs. other processor classes
• Application-specific instruction-set processors (ASIPs) for computing-intensive computer vision operations
• Benefit of an ASIP vs. an ASIC: programmability
• Benefit of an ASIP vs. a GPP: lower power consumption
[Figure: ASIC, ASIP, DSP and GPP placed on a plane spanned by "lower power consumption" and "better programmability".]

Transport-Triggered Architecture (TTA)
• A complete, mature, open-source co-design toolset currently exists
• VLIW-style instruction-parallel architecture
• Widely scalable
  – From tiny 1-instruction-per-cycle, 8-bit processors to N-instructions-per-cycle, 32-bit processors
  – Any number of register files and function units
  – Custom function units and instructions supported
• C compiler based on LLVM
• Processor generator for (FPGA) synthesis

TTA-based Co-design Environment (TCE), Tampere University of Technology: http://tce.cs.tut.fi/

TTA-based processor
• Exploits instruction-level parallelism
• No theoretical limit to the number of operations per cycle
  – Maximum operating frequency goes down with the number of buses

Example TTA-based ASIP: LBP extraction

Implementation
• LBP processor *
  – Implemented on an FPGA (Altera Cyclone IV EP4CE115F29C7)
  – HDTV video processing, 304 MHz maximum frequency
  – Reconfigurable for the LBP and iLBP flavours
  – 4.84 mW @ 50 MHz (PowerPlay power estimator)
• FFT processor **
  – Implemented in a 130 nm technology
  – 1024-point FFTs @ 400 MHz
  – Designed for minimum latency; fully scalable
  – 74 mW @ 250 MHz, 1.5 V (Synopsys Power Compiler)

* Boutellier, J., Lundbom, I., Janhunen, J., Ylimäinen, J., Hannuksela, J.: Application-specific instruction processor for extracting local binary patterns. DASIP 2012.
** Pitkänen, T., Mäkinen, R., Heikkinen, J., Partanen, T., Takala, J.: Low-power, high-performance TTA processor for 1024-point FFT. Embedded Computer Systems: Architectures, Modeling, and Simulation, 2006.

Performance evaluation

Performance (CPP):
Operation          ARM     DSP    TTA
LBP                37.1    6.7    11.0
Interpolated LBP   76.8    11.8   20.0
1024-point FFT     160.1   6.2    5.0

Energy (nJPP):
Operation          ARM     DSP    TTA
LBP                41.1    3.9    1.1*
Interpolated LBP   86.2    6.9    2.0*
1024-point FFT     146     3.5    1.1**

* FPGA implementation, ** 130 nm technology

[Bar charts: cycles per pixel and nJoules per pixel for ARM, DSP and TTA; annotations mark TTA speedups of about 0.6x to 1.1x relative to the DSP and roughly 3.5x lower energy.]
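The local binary pattern (LBP) operator accelerated by the ASIP above compares each pixel with its eight neighbours and packs the comparison results into an 8-bit code. A minimal scalar C sketch of basic (non-interpolated) 3x3 LBP extraction follows; it only illustrates the operation itself, not the processor's function units or the exact variant used in the cited paper.

```c
#include <stdint.h>

/* Basic 3x3 LBP: each of the 8 neighbours that is >= the centre pixel sets
 * one bit of the output code. Border pixels are simply skipped here.       */
void lbp_extract(const uint8_t *img, uint8_t *codes, int w, int h)
{
    /* neighbour offsets, clockwise from the top-left */
    const int dx[8] = { -1,  0,  1, 1, 1, 0, -1, -1 };
    const int dy[8] = { -1, -1, -1, 0, 1, 1,  1,  0 };

    for (int y = 1; y < h - 1; y++) {
        for (int x = 1; x < w - 1; x++) {
            uint8_t centre = img[y * w + x];
            uint8_t code = 0;
            for (int b = 0; b < 8; b++)
                if (img[(y + dy[b]) * w + (x + dx[b])] >= centre)
                    code |= (uint8_t)(1u << b);
            codes[y * w + x] = code;
        }
    }
}
```

The interpolated variant (iLBP) samples the neighbours on a circle using interpolation instead of reading them directly from the 3x3 grid, which is consistent with the roughly doubled cycle counts in the table above.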
Summary and conclusions
• Reconfigurable processors are a major opportunity for computer vision devices
  – Reconfigurability suits the different paradigms of computer vision applications
  – They offer a good trade-off between performance, energy efficiency and silicon area
• Challenges still to be addressed
  – Integration into existing hardware and software solutions
  – Development of efficient tools
  – Identification of novel applications and their bottlenecks

Thank you! Any questions?

Processor power consumption
• Different processors have different power consumptions
• Small cuts in power consumption have a huge impact on battery time

General-purpose processors (GPPs)
• Essentially SISD machines
• Optimized for low latency
• Single or multiple cores
• Include SIMD units

Graphics processing units (GPUs)
• Independent units (work concurrently with the CPU)
• Essentially SIMD machines
• Many simpler cores (hundreds), operating at lower clock rates
• Operate on floating-point data
• Built-in graphics primitives
  – Ideal for interpolation and filtering
• Flow control, looping and branching are restricted
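The built-in graphics primitives mentioned above refer to fixed-function texture hardware that performs operations such as bilinear interpolation essentially for free during a fetch. As a functional reference for what that primitive computes, here is a minimal scalar C sketch of a bilinear sample; a GPU implements this in dedicated hardware, so the code only describes the result, not the mechanism.

```c
#include <stdint.h>
#include <stddef.h>

/* Bilinearly sample a grayscale image at a non-integer position (x, y):
 * a weighted average of the four surrounding pixels. The caller must keep
 * (x, y) inside [0, w-2] x [0, h-2].                                       */
float bilinear_sample(const uint8_t *img, size_t w, float x, float y)
{
    size_t x0 = (size_t)x, y0 = (size_t)y;
    float fx = x - (float)x0;             /* horizontal fraction */
    float fy = y - (float)y0;             /* vertical fraction   */

    float p00 = img[ y0      * w + x0    ];
    float p10 = img[ y0      * w + x0 + 1];
    float p01 = img[(y0 + 1) * w + x0    ];
    float p11 = img[(y0 + 1) * w + x0 + 1];

    float top    = p00 + fx * (p10 - p00);
    float bottom = p01 + fx * (p11 - p01);
    return top + fy * (bottom - top);
}
```

Kernels such as the image rotation and scaling operations benchmarked earlier reduce largely to interpolated fetches of this kind, which is one reason they map naturally onto GPU texture units.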