Evaluation of LBP (and other local descriptors) computational performance in multiple (computing) architectures
Center for Machine Vision Research
Miguel Bordallo López,
Alejandro Nieto, Jani Boutellier, Jari Hannuksela, Olli Silvén
Sami Varjo, Henri Nykänen, Abdenour Hadid
Center for Machine Vision Research, University of Oulu, Finland
MACHINE VISION GROUP
A sentence I read somewhere...
LBP features are desirable because of their extremely high computational performance...
...per pixel, on a high-end CPU, connected to the power grid, if we use the basic LBP, and we don’t need interpolation.
Contents
1. Introduction
2. Computational complexity of local descriptors
3. LBP in desktop computers
4. LBP in mobile devices
5. LBP in dedicated computing devices
Why should we care?
• Evaluation of descriptors/features is done in terms of accuracy
• Computational performance is (sometimes) disregarded
– In Matlab, not processor specific, based on libraries, not measured, ...
... But ...
• Faster methods are able to process larger amounts of input
• Applications at a lower framerate might perform worse than at higher rates
• Computational performance is a KEY measurement for application performance
... an example... (face recognition)
• Method A: lower accuracy, 100ms/frame
• Method B: higher accuracy, 300ms/frame
What descriptor to choose?
[Figure: accuracy (%) versus computation time]
2. Computational complexity of local descriptors
LBPs are essentially local descriptors
HD1080@60fps: 1920x1080x60 = 125Mpix/s
UHD: up to 2 Gpix/s !!!
That’s a lot of throughput !!
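The throughput figures on this slide follow from multiplying frame size by frame rate. A minimal sketch of that arithmetic is below; the function name is hypothetical, and the 7680x4320 frame size used for the UHD case is an assumption, since the slide only says "UHD".

```python
def pixel_rate(width, height, fps):
    """Pixels per second a descriptor must process to keep up with a video stream."""
    return width * height * fps


# HD1080 at 60 frames per second, as on the slide.
hd_rate = pixel_rate(1920, 1080, 60)
print(f"HD1080@60fps: {hd_rate / 1e6:.1f} Mpix/s")

# ASSUMPTION: 7680x4320 at 60 fps as the UHD case behind "up to 2 Gpix/s".
uhd_rate = pixel_rate(7680, 4320, 60)
print(f"UHD@60fps: {uhd_rate / 1e9:.2f} Gpix/s")
```

The HD case comes out slightly under the rounded 125Mpix/s quoted on the slide.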
Linear complexity of local descriptors
[Figure: computation time (ms) versus number of pixels for LBP(8,1), BSIF and LPQ, with the 1280x720 and 1920x1080 resolutions marked]
Time grows linearly with the resolution
LBP variants
[Figure: computation time versus number of sampling points (8 to 64) for LBP(8,1), LOCP(8,1), CLBP(8,1), LBP(16,2), Census(4x4), LBP(24,3), VLBP(8,1), LBP-TOP(8,1) and Census(8x8)]
Time grows linearly with the number of points
Implications
Time = K * n_pixels * n_points
K is implementation dependent
K is platform dependent
Allows for platform comparison:
– CPP metric (cycles per pixel)
– Time normalized by resolution and clock frequency
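The CPP metric above can be written out as a short calculation. This is a minimal sketch with hypothetical function names. The 49 ms and 2.5 GHz inputs are the scalar single-core figures quoted later in this talk; the 1280x720 image size is an assumption, since the slides do not state the test resolution.

```python
def cycles_per_pixel(time_s, clock_hz, n_pixels):
    """CPP: execution time normalized by clock frequency and resolution."""
    return time_s * clock_hz / n_pixels


def model_time(k, n_pixels, n_points):
    """Time = K * n_pixels * n_points, with K depending on implementation and platform."""
    return k * n_pixels * n_points


# ASSUMPTION: a 1280x720 input image.
cpp = cycles_per_pixel(0.049, 2.5e9, 1280 * 720)
print(f"{cpp:.1f} cycles per pixel")
```

Because CPP removes both the clock rate and the image size, it allows a 2.5 GHz desktop core and a 600 MHz mobile core to be compared on the same scale.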
Local descriptor computational breakdown
1. Filtering
2. Quantization
3. Feature composition
4. Histogramming
Local descriptor computational breakdown
1. Filtering (LBP)
[Figure: the eight 3x3 filters f1 ... f8, with coefficients 1, -1 and 0]
2. Quantization
3. Feature composition
4. Histogramming
Local descriptor computational breakdown
1. Filtering (BSIF)
[Figure: the eight filters f1 ... f8, with real-valued coefficients such as -0.18, 0.19 and 2.50]
2. Quantization
3. Feature composition
4. Histogramming
Local descriptor computational breakdown
1. Filtering
2. Quantization (LBP, LPQ, BSIF)
q1 = f1 > 0, q2 = f2 > 0, ..., q8 = f8 > 0
3. Feature composition
4. Histogramming
Local descriptor computational breakdown
1. Filtering
2. Quantization
3. Feature composition (LBP, LPQ, BSIF)
LBP = q1*1 + q2*2 + q3*4 + ... + q8*128
LBP = q1 + (q2<<1) + (q3<<2) + ... + (q8<<7)
4. Histogramming
Local descriptor computational breakdown
1. Filtering
2. Quantization
3. Feature composition
4. Histogramming (LBP, LPQ, BSIF)
if LBP = 1 then bin1++
if LBP = 2 then bin2++
...
if LBP = 255 then bin255++
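The four stages listed above can be put together for the basic LBP(8,1) in a few lines. This is a minimal, unoptimized sketch, not the implementation measured in the talk. The function names, the neighbor ordering and the choice to skip border pixels are assumptions made for illustration.

```python
def lbp_8_1(image):
    """Basic LBP(8,1) codes for the interior pixels of a 2-D list of integers."""
    # ASSUMPTION: neighbors visited clockwise from the top-left corner.
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, 1),
               (1, 1), (1, 0), (1, -1), (0, -1)]
    height, width = len(image), len(image[0])
    codes = []
    for y in range(1, height - 1):
        row = []
        for x in range(1, width - 1):
            center = image[y][x]
            code = 0
            for bit, (dy, dx) in enumerate(offsets):
                f = image[y + dy][x + dx] - center  # 1. filtering
                q = 1 if f > 0 else 0               # 2. quantization
                code += q << bit                    # 3. feature composition
            row.append(code)
        codes.append(row)
    return codes


def histogram(codes, n_bins=256):
    """4. Histogramming: count how often each LBP code occurs."""
    bins = [0] * n_bins
    for row in codes:
        for code in row:
            bins[code] += 1
    return bins


image = [[5, 9, 1],
         [4, 6, 7],
         [2, 3, 8]]
codes = lbp_8_1(image)
print(codes)  # [[26]]: neighbors 9, 7 and 8 exceed the center 6 (bits 1, 3, 4)
```

Indexing the histogram directly by the code value replaces the chain of `if LBP = n` tests on the slide with a single memory update per pixel.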
LBP computational breakdown
1. Filtering
2. Quantization
3. Feature composition
4. Histogramming
[Figure: stacked bar of time consumed (%) per stage for LBP; the surviving segment labels are 34.5, 7.8 and 56.6]
Local descriptor computational breakdown
1. Filtering
2. Quantization
3. Feature composition
4. Histogramming
[Figure: stacked bars of time consumed (%) per stage for LBP, LPQ and BSIF. Surviving labels: relative totals 1x, 2.80x and 3.95x, and segment shares 34%, 76%, 83% and 56%, 20%, 14%]
Local descriptor computational breakdown
0. Interpolation
1. Filtering
2. Quantization
3. Feature composition
4. Histogramming
[Figure: stacked bars of time consumed (ms) per stage, including interpolation, for LBP, LPQ and BSIF. Surviving labels: relative totals 1x, 1.25x and 1.40x, and segment shares 86%, 70%, 62%, 56%, 4.6%, 23%, 31%, 7.5%, 6.1%, 5.4%]
3. LBP in desktop computers
Personal (desktop) computers
High performance applications
Not constrained (almost) by power
Numerous available technologies:
Libraries, programming languages, support software
Short development times !!!
Personal (desktop) computer applications
• Main goal: Maximize performance
- High speed
- High framerate
- Low latency
- High resolutions
- Best quality
Personal (desktop) computers
Computing devices:
CPUs (single core or multicore)
GPUs (single GPU or multiple GPUs)
General Purpose Processors (GPPs)
• Essentially SISD machines
• Optimized for low latency
• Single or multiple cores
• Include SIMD units
CPU implementation strategies for LBP
• Avoiding conditional branching
• Using SIMD units
• Using all cores
Avoiding conditional branching
• Reduces the number of conditional branches
– Their outcome cannot be predicted
• Replaces comparisons with subtractions
• Uses the ”two’s complement” numeric representation to read the sign of the subtraction
– In practice equivalent to a comparison
• Needs a sufficient number of bits to avoid overflows
Up to 3 times faster !!!
Mäenpää, T., Turtinen, M., Pietikäinen, M.: Real-time surface inspection by texture. Real Time Imaging. 9(
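The sign-bit trick described on this slide can be illustrated in a few lines. This is a sketch of the arithmetic only, not the cited implementation: the function name and the 32-bit width are assumptions, and Python gains no speed from it, since the benefit comes from removing branches in compiled code.

```python
def greater_branchless(neighbor, center, bits=32):
    """Return 1 if neighbor > center, else 0, without a comparison or branch.

    The subtraction center - neighbor is negative exactly when
    neighbor > center. An arithmetic right shift by bits - 1 turns a
    negative value into -1 (all ones in two's complement) and a
    non-negative value into 0, so masking with 1 yields the comparison bit.
    ASSUMPTION: the difference fits in `bits` bits, so it cannot overflow.
    """
    diff = center - neighbor
    return (diff >> (bits - 1)) & 1


print(greater_branchless(9, 6))  # 1
print(greater_branchless(6, 6))  # 0
print(greater_branchless(3, 6))  # 0
```

For 8-bit pixels the difference lies between -255 and 255, which is why the slide asks for more bits than the pixel depth.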
Use of SIMD units
• Included in every modern CPU core
• Exploited using inline assembly, specific functions, array annotations, pragmas or enabled compilers
• Computes several pixels at the same time
• Not independent units (shared control code with CPU)
• Requires preprocessing for maximum efficiency
– About 7% overhead
Up to 7x speedup
Juránek, R., Herout, A., Zemčík, P.: Implementing local binary patterns with SIMD instructions of CPU.
Exploiting multiple cores
• POSIX threads, Intel TBB, OpenMP, OpenCL
• Divide the image into multiple overlapping stripes
• Assign one stripe per core
• Overlaps cause contention on data reads and overhead
For N cores, up to 0.9*N times faster
2 cores = 1.8x
4 cores = 3.7x
8 cores = 6.8x
Humenberger, M., Zinner, C., Kubinger, W.: Performance evaluation of a census-based stereo matching
embedded and multi-core hardware.
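The stripe division described above can be sketched as a small helper that computes which rows each core reads. This is an illustration under stated assumptions, not the code used in the cited work. The function name is hypothetical, and a halo of one row on each side is assumed, which suits a radius-1 operator such as LBP(8,1).

```python
def split_into_stripes(height, n_cores, halo=1):
    """Return one (start_row, stop_row) range per core, halo rows included.

    Each core writes codes only for its own rows, but it also has to read
    `halo` extra rows above and below them, so neighboring stripes overlap.
    """
    interior = height - 2 * halo            # rows that receive a code
    base, extra = divmod(interior, n_cores)
    stripes = []
    first = halo
    for core in range(n_cores):
        rows = base + (1 if core < extra else 0)
        stripes.append((first - halo, first + rows + halo))
        first += rows
    return stripes


print(split_into_stripes(10, 2))  # [(0, 6), (4, 10)]
```

In the example, rows 4 and 5 are read by both cores. That shared region is the overlap the slide refers to as a source of contention and overhead.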
Comparative performance (all values LBP / iLBP)
Processor             Implementation   Time (ms)     Speedup          CPP          CPP per core
Single-core 2.5 GHz   Scalar           49 / 350      1× / 1×          133 / 950    133 / 950
Single-core 2.5 GHz   Branchless       16 / 129      3× / 2.7×        45 / 350     45 / 350
Single-core 2.5 GHz   SIMD             6.6 / 48      7.4× / 7.2×      18 / 130     18 / 130
i5 Quadcore 2.5 GHz   2 Cores          3.4 / 24.5    14.5× / 14×      9.2 / 67     18.4 / 134
i5 Quadcore 2.5 GHz   3 Cores          2.3 / 16.7    21.5× / 21×      6.2 / 45     18.6 / 135
i5 Quadcore 2.5 GHz   4 Cores          1.8 / 13.2    27× / 26.5×      5 / 36       20 / 144
Graphics processing units
• Independent units (work concurrently with CPUs)
• Essentially SIMD machines
• Many simpler cores (hundreds)
– Operating at lower clockrates
• Operate on floating-point data
• Built-in graphics primitives
– Ideal for interpolation and filtering
• Flow control, looping and branching restricted
GPU implementations
• Stream processing
• Exploiting shared and texture memory
• Multi-platform code
• Data transfer consideration
Stream processing
[Figure: an input stream passes through a processor array and leaves as an output stream]
Exploiting shared and texture memory
[Figure: stream processing model compared with the shared memory model]
• Shared memory acts as a practical L2 cache
• Texture memory as read-only shared memory
• Textures have ”free” bilinear interpolation
Up to 5x speedup
Multi-platform code
• GPU can be used concurrently with CPU
• OpenCL allows the use of the same code
• Concurrent implementations surpass GPU-only
[Figure: the input data is processed by the CPU and the GPU concurrently, with the same code running on both devices, to produce the output data]
Data transfer
• Data needs to be transferred to GPU memory
- It can be a bottleneck
• Data transfers can overlap computations
- Latency can be hidden
• Long imaging pipelines preferred
- More computations per transfer
Data transfer (LBP case)
• LBP is memory bound
• Most time is consumed in memory accesses
• Graphics memory bandwidth vs graphics bus bandwidth
• Transfer time is about 4 times smaller than computation time
• Data transfer can be hidden (affects latency but not throughput)
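The last point, that a hidden transfer affects latency but not throughput, can be shown with a simplified timing model. This is a sketch with a hypothetical function name. It assumes a single transfer per frame and perfect overlap between the transfer of one frame and the computation of another. The 0.31 ms and 1.4 ms inputs are the I/O and CUDA LBP times reported in this talk.

```python
def pipeline_times(transfer_ms, compute_ms, overlapped):
    """Return (latency_ms, time_per_frame_ms) for a stream of frames.

    ASSUMPTION: when transfers overlap computation perfectly, a new frame
    completes every max(transfer, compute) milliseconds, while each single
    frame still needs transfer + compute to get through.
    """
    latency = transfer_ms + compute_ms
    per_frame = max(transfer_ms, compute_ms) if overlapped else latency
    return latency, per_frame


latency, per_frame = pipeline_times(0.31, 1.4, overlapped=True)
print(f"latency {latency:.2f} ms, one frame every {per_frame:.2f} ms")
```

With overlap, the frame rate is set by the computation alone; without it, every frame pays for the transfer as well.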
Comparative performance (all values except I/O are LBP / iLBP)
Processor     Implementation   I/O (ms)   Time (ms)    Speedup         CPP          CPP per core
Single core   Scalar           0          49 / 350     1× / 1×         133 / 950    133 / 950
Quad core     OpenCL           0.36       10.1 / 130   4.8× / 2.7×     27.5 / 350   110 / 1400
FX5600        OpenGL           0.31       6.3 / 6.3    7.8× / 54×      17 / 17      528 / 541
FX5600        CUDA             0.31       1.4 / 1.5    35× / 233×      3.8 / 4.0    120 / 125
FX5600        OpenCL           0.31       1.6 / 1.7    30.5× / 205×    4.3 / 4.6    145 / 151
CPU vs GPU
[Figure: stacked bars of time consumed (ms) per stage (interpolation, filtering, quantization, feature composition, histogramming) for LBP, LPQ and BSIF, on CPU and on GPU. Surviving labels include relative totals 1x, 1.10x and 1.15x]
4. LBP in mobile devices
Mobile devices are not ”smaller” computers
The small physical size of a mobile device implies constraints not present in desktops !!!
Mobile devices
”Ready made” package: processors + sensors
Growing number of technologies:
Libraries, programming languages, support software
Massive application deployment
BATTERY POWERED !!!
Mobile device applications
• Main goal: Find ”sweet spot” between performance and power consumption
- Energy efficient implementations
- Best performance per Joule
• Metrics: CPP and JPP (Joules per pixel)
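The JPP metric can be computed as power multiplied by execution time, divided by the number of pixels. This is a minimal sketch with a hypothetical function name. The 550 mW and 115 ms inputs are the ARM OpenCV figures reported for the Nokia N900 in this talk; the 1280x720 image size is an assumption, since the slides do not state the test resolution.

```python
def joules_per_pixel(power_w, time_s, n_pixels):
    """JPP: energy spent on one frame divided by the number of pixels in it."""
    return power_w * time_s / n_pixels


# ASSUMPTION: a 1280x720 input image.
jpp = joules_per_pixel(0.550, 0.115, 1280 * 720)
print(f"{jpp * 1e9:.1f} nJ per pixel")
```

A faster implementation lowers JPP only if it does not raise power draw by more than it cuts time, which is the tradeoff the "sweet spot" refers to.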
Mobile SoCs
• Computing Devices
– CPU
– GPU
– DSP (+ ISP)
• Bottlenecks
– Memory
– Communication
Experiments setup
• OMAP 3 family (OMAP3530)
– ARM Cortex A8 CPU
– PowerVR SGX535 GPU
• 3 set-ups:
– Beagleboard revision 3
– Zoom AM3517EVM (TI Sitara)
– Nokia N900
Processor power consumption
Different processors have different power consumption
Small cuts in power consumption have a huge impact on battery life !!!!
Mobile processors (CPUs)
• RISC designs
• ARM architecture
less complex instructions
• Single or multiple cores
• Include VFP coprocessors and SIMD units
ARM optimization strategies for LBP
• Use built-in ARM registers
• Using NEON coprocessor
• Offloading tasks to other energy-efficient cores
ARM optimization
• Access to memory is crucial
• Built-in ARM registers very fast
• No increase in power consumption
• Use general ARM optimization strategies
– Do/while loops, decrement pointers
Up to 25% more efficient !!!
Use of NEON coprocessor
• Included in most ARM cores
• Similar to desktop SIMD units
• Not independent units (shared control code with CPU)
• Incurs a power overhead !!!!
– About 20% overhead (ARM@600MHz: 550mW, ARM+NEON: 670mW)
Up to 40% performance gain
Mobile GPUs
• Independent units
• Unified Memory Architecture
• Memory access is shared
• Bandwidth bottleneck
• Smaller Energy per Instruction (EPI) !!!!
– Only 93mW@110MHz
Stream processing (OpenGL ES)
• Stuck with stream processing (for the moment)
• No shared memory, limited texture memory
Mobile GPU implementation
• Four RGBA channels can be used at the same time (requires preprocessing)
• Interpolation ”for free”, multiscale very fast
Up to 3 times less energy consumption
Bordallo López, M., Nykänen, H., Hannuksela, J., Silvén, O., Vehviläinen, M.:
Accelerating image recognition on mobile devices using gpgpu. In: SPIE Electronic Imaging, 2011
Mobile GPU implementation
• Accuracy and precision might not be optimal
Drawback: Shader implementation might not be consistent
Mobile DSP implementation
• VLIW architectures
– Long instructions over multiple data
– Optimized for signal processing
• Very energy efficient
– Only 0.39mW/MHz at 430 MHz (Fixed point)
• Fixed-point vs floating-point
– The gap is closing
Up to 10 times less energy consumption !!!
Comparative performance (Nokia N900; values except power are LBP / iLBP)
Processor     Implementation   Time (ms)    Speedup        CPP            Power (mW)   nJ/pixel
ARM 600 MHz   OpenCV           115 / 229    1× / 1×        75.9 / 149.9   550          70 / 137
ARM 600 MHz   Branchless       88 / 175     1.3× / 1.3×    57.3 / 113.9   550          52 / 104
ARM 600 MHz   NEON             57 / 118     2× / 1.9×      37.1 / 76.8    670          41 / 86
DSP 430 MHz   Branchless       28 / 50      4.1× / 4.5×    13.2 / 23.5    248          7.6 / 13.6
DSP 430 MHz   Intrinsics       14 / 26      8.2× / 8.8×    6.7 / 11.8     248          3.9 / 6.9
GPU 110 MHz   OpenGL ES        158 / 190    0.7× / 1.2×    18.9 / 22.7    93           15.9 / 19.1
5. LBP in dedicated computing devices
Dedicated hardware
• Dedicated (programmable) architectures offer:
– Incredibly high performance (Hybrid SIMD/MIMD)
or...
– Extremely good energy efficiency (TTA)
Longer development times !!!
Hybrid SIMD-MIMD architecture
Essentially a processor array (like a GPU) that allows branching by reconfiguring to MIMD mode
Extremely fast !!!
Nieto, A., López Vilariño, D., Brea, V.:
SIMD/MIMD dynamically-reconfigurable architecture for high-performance embedded vision systems. IE
Transport Triggered Architecture
Essentially a reconfigurable HW codec that moves data across arithmetical units
Extremely energy efficient and still programmable
Boutellier, J., Lundbom, I., Janhunen, J., Ylimäinen, J., Hannuksela, J.:
Application-specific instruction processor for extracting local binary patterns. DASIP 2012
Comparative performance (Time, CPP and energy are LBP / iLBP)
Processor       FPGA model          I/O time (ms)   Time (ms)       CPP           Power (mW)   nJ/pixel
SIMD/MIMD       Xilinx Virtex 6     6.9             1.15 / 1.92     0.19 / 0.31   5227         6.8 / 11.4
TTA             Altera Cyclone IV   0               67.5 / 122.88   11.0 / 20.0   14.5         1.1 / 2.0
Census Stereo   Altera Cyclone II   0               7.78 / N/A      0.42 / N/A    687.5        5.8 / N/A
Can we go the extra mile?
LBP with specific hardware
• LBP ASIC
– Very small area
– Not programmable
– About 6 cpp per core
• Massively parallel processor array
– 1 processor per pixel
– 5 µs to calculate 4096 pixels ≈ 1.2 ns per pixel
– Has to be built in the sensor (experimental)
Have you run the LBP everywhere?
LBP in a supercomputer
• Minotauro cluster at Barcelona Supercomputing Center
- 128 blades (2 processors at 2.53 GHz + 2 Tesla M2090 cards)
- CPUs + GPUs used concurrently
Summary
• The computational needs of LBP and other local descriptors grow with resolution
• Performance is a crucial measurement of descriptor quality
• Efficient implementation of LBP is highly dependent on the architecture
• Tradeoff between performance, energy efficiency and development time
• Descriptor selection is architecture dependent
– Desktop GPUs are suitable for complex descriptors and interpolations
– Mobile DSPs and CPUs offer the best tradeoff with power consumption
– Dedicated hardware is the best solution if development time is not an issue
Thanks !!
Any questions??
More details in:
- Bordallo López M., Nieto A., Boutellier J., Hannuksela J., and Silvén O.
"Evaluation of real-time LBP computing in multiple architectures,"
Journal of Real-Time Image Processing, 2014.
- Hadid A., Ylioinas J., Bordallo López M., (soon to appear in IPTA2014)
"Face and Texture Analysis Using Local Descriptors: A Comparative Analysis"
Some source code available at our webpages:
http://www.cse.oulu.fi/CMV/Research/LBP
http://www.ee.oulu.fi/~miguelbl/LBP-Software/
(other LBP implementations available upon request)
Concurrent use of several processors
[Figure: concurrent use of several processors; the only surviving label reads "Scene rendering"]
LBP on CPU/GPU clusters
Host memory at a different distance for different d
Device memory at different distance for different c