Reconfigurable computing for
future vision-capable devices
Miguel Bordallo López¹, Alejandro Nieto², Olli Silvén¹, Jani Boutellier¹, David López Vilariño²
¹ Center for Machine Vision Research, University of Oulu, Finland
² CITIUS, University of Santiago de Compostela, Spain
Outline
• Future vision-capable devices
• Experimental setup
• Reconfigurable architectures
  – EnCore processor and Configurable Flow Accelerator
  – Hybrid SIMD/MIMD architecture
  – Transport-triggered architecture based processors
• Summary
Future vision-capable devices
• Computer vision algorithms are becoming more and more sophisticated
  – They mix the SISD, SIMD and MIMD paradigms and are built from several common building blocks
• Efficient hardware architectures are required to address their high computational demands
  – Optimizing for latency and throughput
• Current architectures have drawbacks in future scalability or flexibility
• Processor design is a compromise between flexibility and performance
• Power consumption becomes critical
Reconfigurable computing devices
• Reconfigurable computing devices can rewire themselves dynamically
• They adapt effectively to different programming tasks
• Ideally, they transform themselves to run different application types with the highest performance
• Reusing transistors saves die area and power while increasing overall performance
• Computer vision application characteristics overlap significantly with the advantages of reconfigurable architectures
Contribution:
Implementing a set of building blocks of computer vision algorithms on reconfigurable architectures and comparing them with their current “fixed” counterparts
• Scenario 1: CPU + SIMD unit vs. low-power CPU + reconfigurable accelerator
• Scenario 2: mobile GPU vs. hybrid SIMD/MIMD processor
• Scenario 3: DSP vs. TTA-based ASIPs
Experimental setup for comparison
• OMAP 3 family (OMAP3530)
  – ARM Cortex-A8 CPU
  – PowerVR SGX530 GPU
  – TI C64x+ DSP
• Two set-ups:
  – BeagleBoard revision 3
  – Nokia N900
• Operating point:
  – ARM @ 600 MHz
  – GPU @ 110 MHz
  – DSP @ 430 MHz
• Measuring power consumption:
  – Multimeter (BeagleBoard)
  – Battery sensor readings (N900)
  – Vendor spreadsheets
  – … about 10-20% discrepancies between the methods
The figures of merit derived from these measurements are sketched below.
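On the following slides, performance is reported as cycles per pixel (CPP) and energy as nanojoules per pixel (nJPP). A minimal sketch of how these figures of merit follow from the raw cycle counts and power measurements (the helper and its names are illustrative, not the actual measurement code):

```c
#include <stdint.h>

/* Illustrative helper relating raw measurements to the figures of merit
 * used on the result slides:
 *   CPP  = clock cycles spent in the kernel / pixels processed
 *   nJPP = average power draw x execution time / pixels, in nanojoules */
typedef struct {
    double cpp;    /* cycles per pixel     */
    double njpp;   /* nanojoules per pixel */
} figures_of_merit;

static figures_of_merit compute_figures(uint64_t cycles, double clock_hz,
                                        double avg_power_w, uint64_t pixels)
{
    figures_of_merit f;
    double seconds = (double)cycles / clock_hz;              /* execution time */
    f.cpp  = (double)cycles / (double)pixels;
    f.njpp = avg_power_w * seconds * 1e9 / (double)pixels;   /* J -> nJ        */
    return f;
}
```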
Extensible low-power processor and
reconfigurable accelerator
EnCore Processor
• The EnCore processor is a 32-bit RISC core
  – Implements the ARCompact instruction set
  – Includes an expanded datapath to embed user-defined instruction set extensions
• Extensions are selected based on the analysis of standard compiled code
  – Expensive operations are moved to dedicated hardware
  – The dedicated hardware is implemented as a Configurable Flow Accelerator, the CFA unit
Configurable Flow Accelerator
• The CFA utilizes several single-function ALUs
  – Allows spatial & temporal parallelism (resource sharing & pipelining)
  – Fully programmable (up to 64 reconfigurable extension instructions)
  – Up to 12 inputs and 8 outputs per ISE, 32 extension registers
  – Each ISE: 3-stage pipeline, 4 independent arithmetic operations
• Identifying candidate specific instructions (an example kernel is shown below)
  – Automatic (modified compiler) or manual
  – Results in templates that adapt to the existing CFA or in the generation of a new one
  – Trade-off between reusability and performance (low latency)
[Flow diagram: identify processor extensions → select extensions → adapt to an existing template or create a new one]
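As an illustration of the kind of code the ISE identification targets, here is the per-pixel alpha blending kernel, one of the benchmarked building blocks, in plain C; the multiply/add/shift chain in the loop body is the sort of dataflow subgraph that could be fused into one CFA extension instruction (the mapping suggested here is hypothetical and only for illustration):

```c
#include <stdint.h>

/* Per-pixel alpha blending of a foreground over a background image. */
static void alpha_blend(const uint8_t *fg, const uint8_t *bg,
                        uint8_t *out, int n, uint8_t alpha)
{
    for (int i = 0; i < n; i++) {
        /* candidate extension instruction: blend(fg[i], bg[i], alpha) */
        out[i] = (uint8_t)((alpha * fg[i] + (255 - alpha) * bg[i]) / 255);
    }
}
```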
EnCore Processor implementation
• EnCore “Castle” chip
  – Developed at the University of Edinburgh (PASTA group)
  – Generic 90 nm CMOS process
  – Only 2.25 mm² including the CFA + 2x32 KB caches
  – Embedded in a SoC with a 32-bit memory interface
  – Fixed clock @ 600 MHz
  – Core voltage: 0.9 V to 1.1 V
  – Chip-level power consumption: 70 mW
Performance evaluation

Operation            |      Performance (CPP)       |         Energy (nJPP)
                     |  ARM | NEON | EnCore |  CFA  |   ARM |  NEON | EnCore |  CFA
Grayscale conversion |  216 |  156 |    240 |   66  | 198.1 | 174.2 |   28.0 |  7.7
Image displacement   |   78 |   56 |     50 |   47  |  71.5 |  62.5 |    5.8 |  5.5
Alpha blending       |  141 |  100 |     86 |   20  | 129.1 | 111.7 |   10.0 |  2.3
Blur detection       |   73 |   52 |     84 |   19  |  19.5 |  58.0 |    9.8 |  2.3
Convolution 3x3      |  423 |  302 |    199 |   58  |  66.0 |  37.2 |   23.2 |  6.8
Histogram            |   21 |   21 |     29 |   20  |  19.3 |  23.5 |    3.4 |  2.3
Image rotation       |  546 |  390 |    608 |  234  | 500.0 | 435.5 |   70.9 | 27.3
Image scaling        |  384 |  250 |    390 |  143  | 352.3 | 279.2 |   45.5 | 16.7
Performance evaluation
[Chart: cycles per pixel for ARM, NEON, EnCore and CFA on the benchmark kernels (grayscale conversion, image displacement, alpha blending, blur detection, 3x3 convolution, histogram, image rotation, image scaling); annotations mark CFA speed-ups ranging from about 1x to about 5x over the NEON baseline.]
Performance evaluation
[Charts: cycles per pixel and nanojoules per pixel for ARM, NEON, EnCore and CFA on the same kernels; annotations mark roughly 1x-5x speed-ups and roughly 10x (EnCore) to 50x (CFA) lower energy per pixel than the ARM/NEON baselines.]
Reconfigurable SIMD-MIMD
hybrid embedded processor
Hybrid SIMD/MIMD processor
Reconfigurable on-the-fly between SIMD and MIMD operation modes

Nieto, A., López Vilariño, D., Brea, V.: SIMD/MIMD dynamically-reconfigurable architecture for high-performance embedded vision systems. IEEE ASAP 2012
Hybrid architecture elements
• Programmable Input/Output Processor
  – Transfers data between the computing cores and external RAM
  – Address generation unit allows flexible data access pattern generation
  – Simple program flow control; a single operation for data transfers (see the sketch below):
    move <source_addr>, <dest_addr>, <size>
• Reduced instruction set (RISC) processing elements
  – Storage: register bank (temporary data) & memory element (large data/program)
  – Automatic transfers between PEs (stream queues)
  – ALUs with 1-3 operands (arithmetic/logical/DSP)
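A rough C model of the I/O processor's transfer operation, move <source_addr>, <dest_addr>, <size>; the stride argument is an assumption added here to suggest the kind of access pattern the address generation unit can produce (for example, gathering one image column into a PE's local memory):

```c
#include <stddef.h>
#include <stdint.h>

/* Illustrative-only model of the single data-transfer operation:
 * copies 'size' words, reading with a configurable stride and writing
 * contiguously into the destination buffer. */
static void iop_move(const uint32_t *src, uint32_t *dst,
                     size_t size, size_t src_stride)
{
    for (size_t i = 0; i < size; i++)
        dst[i] = src[i * src_stride];   /* strided read, contiguous write */
}
```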
SIMD vs MIMD modes
• SIMD layout: 1-dimensional processor array
  – A SIMD control unit manages the PEs; all execute the same instruction
  – Processing overlaps with data transfer
  – Data exchange: synchronous side-to-side network
• MIMD layout: 2-dimensional processor torus
  – Every PE runs its own program
  – Units work in parallel
  – Data exchange: synchronized local network
A rough C contrast of the two modes is sketched below.
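A purely illustrative contrast of the two operating modes in plain C (the real device is programmed in its own pseudo-assembly; the PE count, data layout and per-PE programs are assumptions made for this sketch):

```c
#include <stdint.h>

#define NUM_PE 8   /* small array purely for illustration */

/* SIMD mode: the control unit issues one instruction stream and every PE
 * applies it to its own data element, conceptually in lock step. */
static void simd_add_const(uint8_t data[NUM_PE], uint8_t c)
{
    for (int pe = 0; pe < NUM_PE; pe++)
        data[pe] = (uint8_t)(data[pe] + c);
}

/* MIMD mode: every PE runs its own program; here each PE applies a
 * different function taken from a per-PE program table. */
typedef uint8_t (*pe_program)(uint8_t);

static uint8_t negate(uint8_t x)    { return (uint8_t)(255u - x); }
static uint8_t threshold(uint8_t x) { return x > 128u ? 255u : 0u; }

static void mimd_run(uint8_t data[NUM_PE], const pe_program prog[NUM_PE])
{
    for (int pe = 0; pe < NUM_PE; pe++)
        data[pe] = prog[pe](data[pe]);
}

/* Example use: one SIMD pass over the whole array, then a MIMD pass in
 * which each PE follows its own program. */
static void run_both_modes(uint8_t data[NUM_PE])
{
    const pe_program programs[NUM_PE] =
        { negate, threshold, negate, threshold,
          negate, threshold, negate, threshold };
    simd_add_const(data, 10);
    mimd_run(data, programs);
}
```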
Implementation
• Board: Xilinx ML605
  – 512 MB DDR3 64-bit SO-DIMM @ 400 MHz
  – 1400 MB/s (read), 1140 MB/s (write)
  – Power estimated with the Xilinx Power Estimator (XPE)
• FPGA: Xilinx Virtex-6 XC6VLX240T-1
  – 416 Block RAMs (36 kbit each)
  – 468 DSP48E1 slices
• SoC configuration
  – Main CPU: MicroBlaze soft-processor
  – 128 x 32-bit processing elements
    • Fixed-point arithmetic
    • 8-word register bank
    • 1024-word memory element, 4x4-word stream queue
  – Coding: pseudo-assembly
  – 19.6 GOP/s @ 150 MHz
  – Max power consumption: 7.2 W (1.9 W static, 5.3 W dynamic)
Performance evaluation

Operation            |        Performance (CPP)                 |           Energy (nJPP)
                     | ARM+NEON | Mobile GPU | Hybrid SIMD/MIMD | ARM+NEON | Mobile GPU | Hybrid SIMD/MIMD
Grayscale conversion |      156 |       13.4 |              2.1 |    174.2 |       11.3 |             72.8
Image displacement   |       56 |       13.6 |              1.3 |     62.5 |       11.5 |             45.0
Alpha blending       |      100 |       13.6 |              1.0 |    111.7 |       11.5 |             34.7
Feature extraction   |      549 |       75.5 |              0.7 |    613.0 |       63.8 |             24.3
Blur detection       |       52 |      100.7 |              1.0 |     58.0 |       85.1 |             34.7
LBP extraction       |       37 |       18.1 |              0.2 |     41.3 |       15.2 |              6.9
2D convolution       |      302 |      160.0 |              1.0 |    337.2 |      135.3 |             34.7
Histogram            |      302 |       ---- |              2.4 |     1.22 |       ---- |             83.2
Image rotation       |      390 |       13.6 |             12.0 |    435.5 |       11.5 |              416
Image scaling        |      250 |         20 |            136.7 |    279.1 |       16.9 |             4714
Performance evaluation
[Chart: cycles per pixel for ARM + NEON, mobile GPU and hybrid SIMD/MIMD across the benchmark kernels.]
Performance evaluation
[Chart: cycles per pixel for the mobile GPU and the hybrid SIMD/MIMD processor only, on a zoomed scale.]
Performance evaluation
[Chart: cycles per pixel, mobile GPU vs. hybrid SIMD/MIMD; annotated with approximate speed ratios of 150x and 0.9x (image rotation).]
Performance evaluation
[Charts: cycles per pixel (annotated with approximate ratios of 150x and 0.9x) and nanojoules per pixel (annotated with approximate ratios of 4x and 0.02x) for the mobile GPU vs. the hybrid SIMD/MIMD processor across the benchmark kernels.]
Transport-triggered architecture based
application-specific processors
ASIPs vs. others
• Application-specific instruction-set processor (ASIP) for computing-intensive computer vision operations
• Benefit of an ASIP vs. an ASIC: programmability
• Benefit of an ASIP vs. a GPP: lower power consumption
[Diagram: ASIC, ASIP, DSP and GPP ordered along two axes: lower power consumption vs. better programmability]
Transport-Triggered Architecture
• A complete, mature, open-source co-design toolset currently exists
• VLIW-style instruction-parallel architecture
• Widely scalable:
  – from tiny 1-instruction-per-cycle, 8-bit processors
  – to N instructions per cycle, 32-bit processors
  – any number of register files and function units
  – custom function units and instructions supported
• C compiler based on LLVM
• Processor generator for (FPGA) synthesis
TTA-based Co-design Environment (TCE), Tampere University of Technology: http://tce.cs.tut.fi/
TTA-based processor
• Exploits instruction-level parallelism
• No theoretical limit to the number of parallel operations
  – Maximum operating frequency goes down as the number of buses increases
A toy model of transport-triggered execution is sketched below.
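In a transport-triggered architecture the program specifies the data transports between function-unit ports rather than the operations themselves; writing an operand to a trigger port is what starts an operation. A toy software model of this idea (the struct and names are illustrative only, not TCE's generated hardware or its API):

```c
#include <stdint.h>

/* A function unit modeled as a set of ports: a plain operand port, a
 * trigger port, and a result port. */
typedef struct {
    int32_t in1;      /* plain operand port                       */
    int32_t trigger;  /* writing this port triggers the operation */
    int32_t result;   /* result port                              */
} add_fu;

/* "r3 = r1 + r2" expressed as three transports on the buses:
 *     r1 -> ALU.in1 ; r2 -> ALU.add.trigger ; ALU.result -> r3 */
static int32_t add_via_transports(int32_t r1, int32_t r2)
{
    add_fu alu;
    alu.in1 = r1;                        /* transport 1             */
    alu.trigger = r2;                    /* transport 2 ...         */
    alu.result = alu.in1 + alu.trigger;  /* ... which fires the ADD */
    return alu.result;                   /* transport 3             */
}
```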
Example TTA-based ASIP: LBP extraction
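For reference, a minimal scalar sketch in plain C of the basic 8-neighbour LBP operator that this ASIP accelerates (illustrative only; the interpolated iLBP flavour additionally samples neighbours on a circle using bilinear interpolation):

```c
#include <stdint.h>

/* Basic 8-neighbour, radius-1 LBP code of the pixel at (x, y).
 * img is a row-major 8-bit grayscale image of the given width, and
 * (x, y) is assumed not to lie on the image border. */
static uint8_t lbp8(const uint8_t *img, int width, int x, int y)
{
    static const int dx[8] = { -1,  0,  1, 1, 1, 0, -1, -1 };
    static const int dy[8] = { -1, -1, -1, 0, 1, 1,  1,  0 };
    const uint8_t center = img[y * width + x];
    uint8_t code = 0;

    for (int i = 0; i < 8; i++)
        if (img[(y + dy[i]) * width + (x + dx[i])] >= center)
            code |= (uint8_t)(1u << i);   /* one bit per neighbour test */

    return code;
}
```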
Implementation
• LBP processor *
  – Implemented on an FPGA (Altera Cyclone IV EP4CE115F29C7)
  – HDTV video processing, 304 MHz maximum frequency
  – Reconfigurable for the LBP and iLBP flavours
  – 4.84 mW @ 50 MHz (PowerPlay power estimator)
• FFT processor **
  – Implemented in a 130 nm technology
  – 1024-point FFTs @ 400 MHz
  – Designed for minimum latency; fully scalable
  – 74 mW @ 250 MHz, 1.5 V (Synopsys Power Compiler)

* Boutellier, J., Lundbom, I., Janhunen, J., Ylimäinen, J., Hannuksela, J.: Application-specific instruction processor for extracting local binary patterns. DASIP 2012
** Pitkänen, T., Mäkinen, R., Heikkinen, J., Partanen, T., Takala, J.: Low-power, high-performance TTA processor for 1024-point FFT. Embedded Computer Systems: Architectures, Modeling, and Simulation, 2006
Performance evaluation

Operation        |  Performance (CPP)  |  Energy (nJPP)
                 |   ARM |  DSP |  TTA |  ARM | DSP |   TTA
LBP              |  37.1 |  6.7 | 11.0 | 41.1 | 3.9 |  1.1*
Interpolated LBP |  76.8 | 11.8 | 20.0 | 86.2 | 6.9 |  2.0*
1024-point FFT   | 160.1 |  6.2 |  5.0 |  146 | 3.5 | 1.1**

* FPGA implementation, ** 130 nm technology
Performance evaluation
[Chart: cycles per pixel for ARM, DSP and TTA on LBP, interpolated LBP and the 1024-point FFT.]
Performance evaluation
[Chart: cycles per pixel, DSP vs. TTA; annotated with approximate speed ratios of 0.6x (LBP kernels) and 1.1x (1024-point FFT).]
Performance evaluation
[Charts: cycles per pixel (TTA at roughly 0.6x-1.1x the DSP's speed) and nanojoules per pixel (TTA roughly 3.5x more energy-efficient than the DSP on all three kernels).]
Summary and conclusion
• Reconfigurable processors are a major opportunity for computer vision devices
  – Reconfigurability suits the different paradigms of computer vision applications
  – They offer a good trade-off between performance, energy efficiency and silicon area
• Challenges still to be addressed:
  – Integration into existing HW and SW solutions
  – Development of efficient tools
  – Identification of novel applications and their bottlenecks
Thank you!
Any questions?
Processor power consumption
• Different processors have very different power consumptions
• Even small cuts in power consumption have a huge impact on battery life!
General Purpose Processors (GPPs)
• Essentially SISD machines
• Optimized for low latency
• Single or multiple cores
• Include SIMD units
Graphics processing units
• Independent units (work concurrently with CPUs)
• Essentially SIMD machines
• Many simple cores (hundreds)
  – Operating at lower clock rates
• Operate on floating-point data
• Built-in graphics primitives
  – Ideal for interpolation and filtering
• Flow control, looping and branching are restricted