Evaluation of LBP computational performance in multiple architectures (and other local descriptors)

Miguel Bordallo López, Alejandro Nieto, Jani Boutellier, Jari Hannuksela, Olli Silvén, Sami Varjo, Henri Nykänen, Abdenour Hadid
Center for Machine Vision Research, University of Oulu, Finland
MACHINE VISION GROUP

A sentence I read somewhere...
LBP features are desirable because of their extremely high computational performance...
...per pixel, on a high-end CPU connected to the power grid, if we use the basic LBP and we don't need interpolation.

Contents
1. Introduction
2. Computational complexity of local descriptors
3. LBP in desktop computers
4. LBP in mobile devices
5. LBP in dedicated computing devices

Why should we care?
• Evaluation of descriptors/features is done in terms of accuracy
• Computational performance is (sometimes) disregarded
  – In Matlab, not processor specific, based on libraries, not measured, ...
... But ...
• Faster methods are able to process larger amounts of input
• Applications at a lower framerate might perform worse than at higher rates
• Computational performance is a KEY measurement for application performance

... an example... (face recognition)
• Method A: lower accuracy, 100 ms/frame
• Method B: higher accuracy, 300 ms/frame

What descriptor to choose?
[Figure: accuracy (%) versus computation time]

2. Computational complexity of local descriptors

LBPs are essentially local descriptors
HD1080@60fps: 1920x1080x60 = 125 Mpix/s
UHD: up to 2 Gpix/s !!!
That's a lot of throughput !!

Linear complexity of local descriptors
[Figure: computation time (ms) versus number of pixels for LBP(8,1), BSIF and LPQ, with 1280x720 and 1920x1080 marked]
Time grows linearly with the resolution

LBP variants
[Figure: computation time versus number of sampling points (8 to 64) for LBP(8,1), LOCP(8,1), CLBP(8,1), LBP(16,2), Census(4x4), LBP(24,3), VLBP(8,1), LBP-TOP(8,1) and Census(8x8)]
Time grows linearly with the number of points

Implications
Time = K * n_pixels * n_points
K is implementation dependent
K is platform dependent
Allows for platform comparison:
– CPP metric (cycles per pixel)
– Time normalized by resolution and clock frequency

Local descriptor computational breakdown
1. Filtering
   – LBP: [Figure: eight 3x3 filter masks f1 ... f8 with entries 1, -1 and 0]
   – BSIF: [Figure: eight filter masks f1 ... f8 with real-valued coefficients]
2. Quantization (LBP, LPQ, BSIF)
   q1 = f1 > 0, q2 = f2 > 0, ..., q8 = f8 > 0
3. Feature composition (LBP, LPQ, BSIF)
   LBP = q1*1 + q2*2 + q3*4 + ... + q8*128
   LBP = q1 + (q2<<1) + (q3<<2) + ... + (q8<<7)
4. Histogramming (LBP, LPQ, BSIF)
   if LBP = 1 then bin1++
   if LBP = 2 then bin2++
   ...
   if LBP = 255 then bin255++

LBP computational breakdown
[Figure: time consumed (%) by each LBP stage (filtering, quantization, feature composition, histogramming); value labels: 34.5, 7.8, 56.6]

Local descriptor computational breakdown
[Figure: time consumed by each stage (filtering, quantization, feature composition, histogramming) for LBP, LPQ and BSIF; total-time labels: 1x, 2.80x, 3.95x; share labels: 34%, 76%, 83% and 56%, 20%, 14%]

Local descriptor computational breakdown
0. Interpolation
1. Filtering
2. Quantization
3. Feature composition
4. Histogramming
[Figure: time consumed (ms) by each stage, including interpolation, for LBP, LPQ and BSIF; total-time labels: 1x, 1.25x, 1.40x; share labels: 86%, 70%, 62%; 4.6%, 23%, 31%; 7.5%, 6.1%, 5.4%]

3. LBP in desktop computers

Personal (desktop) computers
High performance applications
Not constrained (almost) by power
Numerous available technologies: libraries, programming languages, support software
Short development times !!!

Personal (desktop) computer applications
• Main goal: maximize performance
  - High speed
  - High framerate
  - Low latency
  - High resolutions
  - Best quality

Personal (desktop) computers
Computing devices:
CPUs (single core or multicore)
GPUs (single GPU or multiple GPUs)

General Purpose Processors (GPPs)
• Essentially SISD machines
• Optimized for low latency
• Single or multiple cores
• Include SIMD units

CPU implementation strategies for LBP
• Avoiding conditional branching
• Using SIMD units
• Using all cores

Avoiding conditional branching
• Reduces the number of conditional branches
  – Their result cannot be predicted
• Substitutes comparisons with subtractions
• Uses the "two's complement" numeric representation to read the sign of the subtraction
  – In practice equivalent to a comparison
• Needs a sufficient number of bits to avoid overflows
Up to 3 times faster !!!
Mäenpää, T., Turtinen, M., Pietikäinen, M.: Real-time surface inspection by texture. Real Time Imaging.
9(

Use of SIMD units
• Included in every modern CPU core
• Exploited using inline assembly, specific functions, array annotations, pragmas or enabled compilers
• Computes several pixels at the same time
• Not independent units (control code shared with the CPU)
• Requires preprocessing for maximum efficiency
  – About 7% overhead
Up to 7x speedup
Juránek, R., Herout, A., Zemčík, P.: Implementing local binary patterns with SIMD instructions of CPU.

Exploiting multiple cores
• POSIX threads, Intel TBB, OpenMP, OpenCL
• Divide the image into multiple overlapping stripes
• Assign one stripe per core
• Overlaps cause contention on data reads and overhead
For N cores, roughly 0.9*N times faster
2 cores = 1.8x
4 cores = 3.7x
8 cores = 6.8x
Humenberger, M., Zinner, C., Kubinger, W.: Performance evaluation of a census-based stereo matching algorithm on embedded and multi-core hardware.

Comparative performance
Processor              Time (ms)     Speedup         CPP         CPP per core
                       LBP/iLBP      LBP/iLBP        LBP/iLBP    LBP/iLBP
Single-core 2.5 GHz
  Scalar               49 / 350      1× / 1×         133/950     133/950
  Branchless           16 / 129      3× / 2.7×       45/350      45/350
  SIMD                 6.6 / 48      7.4× / 7.2×     18/130      18/130
i5 Quadcore 2.5 GHz
  2 Cores              3.4 / 24.5    14.5× / 14×     9.2/67      18.4/134
  3 Cores              2.3 / 16.7    21.5× / 21×     6.2/45      18.6/135
  4 Cores              1.8 / 13.2    27× / 26.5×     5/36        20/144

Graphics processing units
• Independent units (work concurrently with CPUs)
• Essentially SIMD machines
• Many simpler cores (hundreds)
  – Operating at lower clock rates
• Operating on floating-point data
• Built-in graphics primitives
  – Ideal for interpolation and filtering
• Flow control, looping and branching restricted

GPU implementations
• Stream processing
• Exploiting shared and texture memory
• Multi-platform code
• Data transfer consideration

Stream processing
[Figure: an input stream passes through a processor array to produce an output stream]

Exploiting shared and texture memory
[Figure: stream processing model versus shared memory model]

Exploiting shared and texture memory
• Shared memory acts as a practical L2 cache
• Texture memory acts as read-only shared memory
• Textures have "free" bilinear interpolation
Up to 5x speedup

Multi-platform code
• GPU can be used concurrently with CPU
• OpenCL allows the use of the same code
• Concurrent implementations surpass GPU-only
[Figure: input data is processed by CPU and GPU concurrently, with the same code for both devices, to produce the output data]

Data transfer
• Data needs to be transferred to GPU memory
  - It can be a bottleneck
• Data transfers can overlap computations
  - Latency can be hidden
• Long imaging pipelines preferred
  - More computations per transfer

Data transfer (LBP case)
• LBP is memory bound
• Most time is consumed in memory accesses
• Graphics memory bandwidth vs graphics bus bandwidth
• Transfer time is about 4 times smaller than computation time
• Data transfer can be hidden (affects latency but not throughput)

Comparative performance
Processor        I/O (ms)   Time (ms)    Speedup        CPP         CPP per core
                            LBP/iLBP     LBP/iLBP       LBP/iLBP    LBP/iLBP
Single core
  Scalar         0          49/350       1×/1×          133/950     133/950
Quad core
  OpenCL         0.36       10.1/130     4.8×/2.7×      27.5/350    110/1400
FX5600
  OpenGL         0.31       6.3/6.3      7.8×/54×       17/17       528/541
  CUDA           0.31       1.4/1.5      35×/233×       3.8/4.0     120/125
  OpenCL         0.31       1.6/1.7      30.5×/205×     4.3/4.6     145/151

CPU vs GPU
[Figure: time consumed (ms) by each stage (interpolation, filtering, quantization, feature composition, histogramming) for LBP, LPQ and BSIF on CPU and on GPU; total-time labels: 1x, 1.10x, 1.15x]

4. LBP in mobile devices

Mobile devices are not "smaller" computers
The small physical size of a mobile device implies constraints not present in desktops !!!

Mobile devices
"Ready made" package: processors + sensors
Growing number of technologies: libraries, programming languages, support software
Massive application deployment
BATTERY POWERED !!!

Mobile device applications
• Main goal: find the "sweet spot" between performance and power consumption
  - Energy efficient implementations
  - Best performance per Joule
• Metrics: CPP and JPP (Joules per pixel)

Mobile SoCs
• Computing devices
  – CPU
  – GPU
  – DSP (+ ISP)
• Bottlenecks
  – Memory
  – Communication

Experimental setup
• OMAP 3 family (OMAP3530)
  – ARM Cortex-A8 CPU
  – PowerVR SGX535 GPU
• 3 set-ups:
  – Beagleboard revision 3
  – Zoom AM3517EVM (TI Sitara)
  – Nokia N900

Processor power consumption
Different processors have different consumptions
Small cuts in power consumption have a huge impact on battery time !!!!
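
The CPP and JPP metrics can be made concrete with a short calculation. The sketch below is only illustrative and is not the authors' measurement code: the function names are hypothetical, and the 1280x720 frame size is an assumption, chosen because it is what the desktop figures in this talk (49 ms at 2.5 GHz giving about 133 CPP) imply. The mobile values are the ARM/OpenCV figures quoted in the talk (115 ms at 600 MHz, 550 mW).

```python
def cycles_per_pixel(time_s, clock_hz, n_pixels):
    """CPP: time normalized by resolution and clock frequency."""
    return time_s * clock_hz / n_pixels


def joules_per_pixel(time_s, power_w, n_pixels):
    """JPP: energy (power x time) spent on each pixel."""
    return time_s * power_w / n_pixels


# Assumption: each timing covers one 1280x720 frame.
N_PIXELS = 1280 * 720

# Desktop scalar LBP: 49 ms at 2.5 GHz.
desktop_cpp = cycles_per_pixel(49e-3, 2.5e9, N_PIXELS)

# Mobile ARM (OpenCV) LBP: 115 ms at 600 MHz while drawing 550 mW.
arm_cpp = cycles_per_pixel(115e-3, 600e6, N_PIXELS)
arm_jpp = joules_per_pixel(115e-3, 550e-3, N_PIXELS)

print(f"desktop CPP: {desktop_cpp:.1f}")
print(f"ARM CPP:     {arm_cpp:.1f}")
print(f"ARM energy:  {arm_jpp * 1e9:.1f} nJ/pixel")
```

Because CPP divides out the clock frequency, it compares architectures rather than clock speeds; JPP additionally folds in the power draw, which is why a slower but frugal processor can still win on a battery-powered device.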
Mobile processors (CPUs)
• RISC designs
• ARM architecture: less complex instructions
• Single or multiple cores
• Include VFP coprocessors and SIMD units

ARM optimization strategies for LBP
• Using built-in ARM registers
• Using the NEON coprocessor
• Offloading tasks to other energy-efficient cores

ARM optimization
• Access to memory is crucial
• Built-in ARM registers are very fast
• No increase in power consumption
• Use general ARM optimization strategies
  – Do/while loops, decrement pointers
Up to 25% more efficient !!!

Use of NEON coprocessor
• Included in most ARM cores
• Similar to desktop SIMD units
• Not independent units (control code shared with the CPU)
• Incurs a power overhead !!!!
  – About 20% overhead
Up to 40% performance gain
ARM@600MHz – 550mW
ARM+NEON – 670mW

Mobile GPUs
• Independent units
• Unified Memory Architecture
• Memory access is shared
• Bandwidth bottleneck
• Smaller Energy per Instruction (EPI) !!!!
  – Only 93mW@110MHz

Stream processing (OpenGL ES)
• Stuck with stream processing (for the moment)
• No shared memory, limited texture memory

Mobile GPU implementation
• Four RGBA channels can be used at the same time (requires preprocessing)
• Interpolation "for free", multiscale very fast
Up to 3 times less energy consumption
Bordallo López, M., Nykänen, H., Hannuksela, J., Silvén, O., Vehviläinen, M.: Accelerating image recognition on mobile devices using GPGPU. In: SPIE Electronic Imaging, 2011

Mobile GPU implementation
• Accuracy and precision might not be optimal
Drawback: shader implementations might not be consistent

Mobile DSP implementation
• VLIW architectures
• Very energy efficient: long instructions over multiple data, optimized for signal processing
  – Only 0.39mW/MHz at 430 MHz (fixed point)
• Fixed-point vs floating-point: the gap is closing
Up to 10 times less energy consumption !!!

Comparative performance (Nokia N900)
Processor        Time (ms)    Speedup        CPP           Power    nJ/pixel
                 LBP/iLBP     LBP/iLBP       LBP/iLBP      (mW)     LBP/iLBP
ARM 600 MHz
  OpenCV         115/229      1×/1×          75.9/149.9    550      70/137
  Branchless     88/175       1.3×/1.3×      57.3/113.9    550      52/104
  NEON           57/118       2×/1.9×        37.1/76.8     670      41/86
DSP 430 MHz
  Branchless     28/50        4.1×/4.5×      13.2/23.5     248      7.6/13.6
  Intrinsics     14/26        8.2×/8.8×      6.7/11.8      248      3.9/6.9
GPU 110 MHz
  OpenGL ES      158/190      0.7×/1.2×      18.9/22.7     93       15.9/19.1

5. LBP in dedicated computing devices

Dedicated hardware
• Dedicated (programmable) architectures offer:
  – Incredibly high performance (hybrid SIMD/MIMD) or...
  – Extremely good energy efficiency (TTA)
Longer development times !!!

Hybrid SIMD-MIMD architecture
Essentially a processor array (GPU) that allows branching by reconfiguring to MIMD mode
Extremely fast !!!
Nieto, A., López Vilariño, D., Brea, V.: SIMD/MIMD dynamically-reconfigurable architecture for high-performance embedded vision systems. IE

Transport Triggered Architecture
Essentially a reconfigurable HW codec that moves data across arithmetic units
Extremely energy efficient and still programmable
Boutellier, J., Lundbom, I., Janhunen, J., Ylimäinen, J., Hannuksela, J.: Application-specific instruction processor for extracting local binary patterns. DASIP 2012

Comparative performance
Processor        FPGA model           I/O time   Time (ms)       CPP          Power    nJ/pixel
                                      (ms)       LBP/iLBP        LBP/iLBP     (mW)     LBP/iLBP
SIMD/MIMD        Xilinx Virtex 6      6.9        1.15/1.92       0.19/0.31    5227     6.8/11.4
TTA              Altera Cyclone IV    0          67.5/122.88     11.0/20.0    14.5     1.1/2.0
Census Stereo    Altera Cyclone II    0          7.78/N/A        0.42/N/A     687.5    5.8/N/A

Can we go the extra mile?

LBP with specific hardware
• LBP ASIC
  – Very small area
  – Not programmable
  – About 6 CPP per core
• Massively parallel processor array
  – 1 processor per pixel
  – 5 µs to calculate 4096 pixels = about 1.2 ns per pixel
  – Has to be built into the sensor (experimental)

Have you run the LBP everywhere?

LBP in a supercomputer
• Minotauro cluster at Barcelona Supercomputing Center
  - 128 blades (2 processors at 2.53 GHz + 2 Tesla M2090 cards)
  - CPUs + GPUs used concurrently

Summary
• The computational needs of LBP and other local descriptors grow with resolution
• Performance is a crucial measurement of descriptor quality
• Efficient implementation of LBP is highly dependent on the architecture
• Tradeoff between performance, energy efficiency and development time
• Descriptor selection is architecture dependent
  – Desktop GPUs are suitable for complex descriptors and interpolations
  – Mobile DSPs and CPUs offer the best tradeoff with power consumption
  – Dedicated hardware is the best solution if development time is not an issue

Thanks !! Any questions??

More details in:
- Bordallo López M., Nieto A., Boutellier J., Hannuksela J., and Silvén O. "Evaluation of real-time LBP computing in multiple architectures," Journal of Real-Time Image Processing, 2014.
- Hadid A., Ylioinas J., Bordallo López M. "Face and Texture Analysis Using Local Descriptors: A Comparative Analysis" (soon to appear in IPTA 2014)

Some source code available at our webpages:
http://www.cse.oulu.fi/CMV/Research/LBP
http://www.ee.oulu.fi/~miguelbl/LBP-Software/
(other LBP implementations available upon request)

Concurrent use of several processors
[Figure: scene rendering]

LBP on CPU/GPU clusters
Host memory at a different distance for different d
Device memory at different distance for different c
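
Using several processors concurrently builds on the striping scheme from the multicore slide: divide the image into overlapping stripes and assign one stripe to each processor. The following is a minimal sketch of that partitioning, not the implementation used in the talk. The function name and the tuple layout are hypothetical, and it assumes a descriptor such as LBP(8,1), where a one-row overlap on each side is enough for every output pixel to see its neighbours.

```python
def split_into_stripes(height, n_workers, radius=1):
    """Split image rows into overlapping stripes, one per worker.

    Returns a list of (read_start, read_end, out_start, out_end) tuples.
    A worker writes codes for rows [out_start, out_end) but reads `radius`
    extra rows on each side, which is the overlap between stripes.
    """
    # Only rows with a complete neighbourhood produce an LBP code.
    first, last = radius, height - radius
    base, extra = divmod(last - first, n_workers)
    stripes = []
    out_start = first
    for worker in range(n_workers):
        # Spread the remainder rows over the first `extra` workers.
        rows = base + (1 if worker < extra else 0)
        out_end = out_start + rows
        stripes.append((out_start - radius, out_end + radius,
                        out_start, out_end))
        out_start = out_end
    return stripes


# Hypothetical usage: a 720-row image shared by 4 workers.
stripes = split_into_stripes(720, 4)
for read_start, read_end, out_start, out_end in stripes:
    print(f"read rows {read_start}-{read_end - 1}, "
          f"write rows {out_start}-{out_end - 1}")
```

Neighbouring stripes read the same border rows, which is the source of the read contention and overhead mentioned for the multicore case. With devices of unequal speed, such as a CPU and a GPU, the row counts could be weighted by measured throughput instead of split evenly.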