An Integrated Memory Array Processor Architecture for Embedded Image Recognition Systems Shorin KYO

NEC Corporation
6th/June, ISCA2005, 1/30
An Integrated Memory Array Processor Architecture for Embedded Image Recognition Systems
Shorin KYO *1, Shin'ichiro OKAZAKI *1, Tamio ARAI *2
*1 Media and Information Research Laboratories, NEC Corporation
*2 School of Engineering, University of Tokyo
Outline
1. Challenges of Embedded Image Recognition Systems
2. Integrated Memory Array Processor (IMAP) Architecture
3. Programming Language and Compiler Design
4. Evaluations
5. Summary
Three Basic Requirements
Ex. Embedded Driver Assistant Systems
1) High Performance
   Realtime Response
   Robustness
   [Figure: required performance in GOPS (log axis, 1 to 1000), with a "Lane Marks" example]
2) Cost/Power Efficiency
   Low cost
   Easy cooling (< 2 Watt)
   High Quality / Reliability
   Low EMI
3) High Flexibility (Scalability and Versatility)
   Able to handle the combination of [ applications × situations × targets ]
Applications × Situations × Targets
Lane Change Assist
Park Slot Measurement
Side Pre-Crash
Blind Spot Detection
Cut-In
Traffic Sign Recognition
Dynamic Back Up Aid
Drowsiness warning
Backup Parking Assist
Following Distance Warning
Front Pre-Crash
Stop&Go
Pedestrian Protection
Cross Traffic Warning
COR: Control versus Operational circuit Ratio
Trading-off items
1) Performance (higher)
2) Cost (lower)
3) Flexibility (higher)
Cost (Die size / power consumption)
   Control circuit: Flexibility
   Operation circuit: (peak) performance
[Figure: processor classes plotted by % of Control Circuitry (Flexibility, up to 100) against % of Operational Circuitry (Performance, up to 100)]
a) Desktop/Server CPU (GPPs)
b) MIMDs (Multi-Cores)
c) DSPs
d) Highly parallel SIMDs
e) Special purpose LSI
Example processors labeled in the figure: Itanium, Sparc64, SPE(CELL), FR1000, FR500, IMAP-CE, IMAPCAR, CODEC LSI
Overcoming the Flexibility Gap
[Figure: Flexibility versus Performance for classes (a) to (e), each drawn with its share of control (Ctrl.) and operational (Op.) circuitry. A Fixed Cost & Technology Constraint (a Technology Barrier) bounds the curve, and a "Flexibility gap" is marked on it.]
(a) GPPs
(b) DSPs and MIMDs
(c) Highly parallel SIMDs
(d) Custom logics + DSP core
(e) Custom logics only
Challenge of embedded image processors
⇒ Minimizing COR while overcoming the "Flexibility Gap"
Outline
1. Challenge of Embedded Image Recognition Systems
2. Integrated Memory Array Processor (IMAP) Architecture
3. Programming Language and Compiler Design
4. Evaluation
5. Summary
IMAP Series Processors
[Figure: Peak Performance (GOPS, log axis from 0.1 to 1000) versus Year (1990 to 2010)]
IMAP-1: 15MHz, 8PE/Chip
IMAP-2 (ISSCC'95): 40MHz, 64PE/Chip
IMAP-VISION (CAMP'97): 40MHz, 32PE/Chip
IMAP-CE (ISSCC'03): 100MHz, 128PE/Chip, 4-Way VLIW, 50GOPS, 0.18um, 2~4Watt
IMAPCAR: 100MHz, 128PE/Chip, 4-Way VLIW+MAC, 100GOPS, (-40℃~85℃), 0.13um, <2Watt
[Die photo: IMAP-CE (32.7M Tr, 0.18um), 11.0mm x 11.0mm, showing CP, DPLL, EXTIF and sixteen PE8 blocks (PE8: eight PEs integration block)]
Block Diagram and Features
100MHz 128 4Way VLIW linear array PEs
Two level memory architecture + user DMA
Automated mapping of image data to each PE
128 individual RAM blocks configuration
1DC (One Dimensional C) + "Line methods"
Enhanced PE instruction set design for 1DC
[Figure: block diagram. A Host Processor connects to the Control Processor (CP: P$, D$, STK RAM), which performs instruction broadcast (SIMD) to a linear array of 128 4 Way VLIW PEs, each with its own IMEM. Shift registers SR0 to SR3 connect Video IN and Video OUT. An External Mem. I/F connects the array to EMEM (SDRAM/SSRAM, 64MB~). Bandwidth labels: 12.8 GByte/s and 0.8 GByte/s. Size labels: 2KB, 128 PE.]
[Figure: PE detail. ALUx1, MULx1, LOGx1, LSUx1; units ADD, LOG, MUL, RDU, LSU, COMM; 24 x 8b General Purpose Registers; paths to/from CP, IMEM and other PEs.]
[Figure: image mapping. One pixel data source: (image) data column(s) of image are held in the IMEM of one PE.]
Memory Access Pattern Categories
[Figure: recognition flow. Sensors (pixels) → Low-level Image Processing (Pre-processing, Low-level Feature Extraction) → Intermediate-level Image Processing (Measurements, Local Feature based Discrimination, Higher level Feature extraction) → High-level Decision (symbols)]
Each category maps an Input Image X to an Output Image Y or to an Output vector / scalar V.
Point Op. (PO)
Local Neigh. Op. (LNO): ex. 2d-filters, NN
Statistical Op. (SO): ex. histogram
Global Op. (GlO): ex. FFT
Geometric Op. (GeO): ex. affine
Recursive Neigh. Op. (RNO): ex. distance trans.
Object Op. (OO): ex. labelling/propagation
E.R.Komen: Low-level Image Processing Architectures, Ph.D. Thesis, TUD, Netherlands, 1990.
P.P.Jonker: Architectures for Multidimensional Low- and Intermediate Level Image Processing, Proc. of IAPR Workshop on Machine Vision Applications (MVA'90), pp.307--316, 1990.
Memory Access Pattern Parallelization Issue
Conventional continuous (or strided) address data supply (ex. streaming data supply) is not sufficient for parallelizing most of the required memory access patterns
[Figure: SIMD + VLIW PEs sharing a Unified RAM]
Category   Parallelizable   Access pattern
PO         ○                Completely local
LNO        ○                Local Neighborhood
SO         ×                Global
GlO        ×                Global
GeO        ×                Global
RNO        ×                Recursive
OO         ×                Data dependent
Memory Access Pattern Parallelization Design
                                                     Locality: No    Locality: Yes
Unconstrained pixel update                           SO, GlO, GeO    PO, LNO
Constrained pixel update
  Statically constrained                             -               RNO
  (update location is statically predictable)
  Dynamically constrained                            -               OO
  (update location must be dynamically determined)
Line Methods (PUL: Pixel Updating Line): row-wise, row-systolic, slant-systolic, autonomous
[Figure: position of the PUL over the image and the PE array for each line method]
requires one RAM block / PE configuration
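The note "requires one RAM block / PE configuration" can be illustrated with a statistical operation such as a histogram, where every PE must update a different address in the same step. The following is a minimal sequential C sketch of that idea, not the IMAP implementation: `NPE`, `BINS`, `pe_ram` and `histogram` are hypothetical names, the binning by `% BINS` is an assumption, and the inner loop over `pe` stands in for what the PE array would do in parallel.

```c
#include <assert.h>
#include <string.h>

#define NPE  4   /* hypothetical number of PEs, one image column each */
#define BINS 8   /* hypothetical number of histogram bins */

/* Hypothetical per-PE RAM block: each PE owns private bins. */
typedef struct { unsigned bins[BINS]; } pe_ram;

/* Histogram of an image with NPE columns and `lines` rows. */
void histogram(unsigned char img[][NPE], int lines, unsigned total[BINS])
{
    pe_ram ram[NPE];
    assert(lines >= 0);
    memset(ram, 0, sizeof ram);

    /* Line-wise pass: in each step every PE indexes its own RAM block
       with its own pixel value, so addresses differ from PE to PE. */
    for (int y = 0; y < lines; y++)
        for (int pe = 0; pe < NPE; pe++)       /* emulation of the parallel PEs */
            ram[pe].bins[img[y][pe] % BINS]++;

    /* Merge the private histograms into the final result. */
    memset(total, 0, BINS * sizeof total[0]);
    for (int pe = 0; pe < NPE; pe++)
        for (int b = 0; b < BINS; b++)
            total[b] += ram[pe].bins[b];
}
```

With a single unified RAM the per-PE updates in the line-wise pass would contend for one address port; private blocks let each emulated PE use an independent index.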
Line Methods (1) - Combination of PULs -
[Figure: PUL combinations over the PE array for four examples: 90 degree rotation, Thinning, Connected component labeling, Propagation. One panel carries the label "2 times".]
Line Methods (2) - Expected Speedup -
[Table not recoverable: expected speedup of each line method when using N PEs, under conditions *1 and *2]
*1: Under a unified RAM approach
*2: When using the memory array architecture
N/2~N times speedup by N PEs
Outline
1. Challenge of Embedded Image Recognition Systems
2. Integrated Memory Array Processor (IMAP) Architecture
3. Programming Language and Compiler Design
4. Evaluation
5. Summary
1DC: An Extended C Language
One (vector like) data structure and six operators
int d, e;
sep char a,b;
sep char c,ary[256];
Correspondence between parallelizing
techniques and the 1DC syntax.
1DC: Line-wise Parallel Operation
• Sequential Languages (Ex. C)
for (y=0; y < {number of lines}; y++)
for (x=0; x < {number of columns}; x++) .........
• When using 1DC, skip the {number of columns} loop
for (y=0; y < {number of lines};y++) ...........
Ex. An Edge Detection Filter
[Figure: filter output as the line loop advances, shown at y=0, y=120, y=200 and y={number of lines}]
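The line-wise style can be imitated in plain C by wrapping the column loop in a helper that operates on a whole line, so that the filter itself contains only the loop over y. This is a sketch under stated assumptions: `NPE`, `sep_int`, `sep_absdiff` and `edge_lines` are hypothetical names, and the vertical-gradient filter is just one simple edge detector chosen for illustration, not the one shown on the slide.

```c
#include <assert.h>
#include <stdlib.h>

#define NPE 8   /* hypothetical number of PEs, one image column each */

/* Emulates one 1DC "sep" value: one element per PE. */
typedef struct { int v[NPE]; } sep_int;

/* |a - b| on every PE; on a PE array this would be a single SIMD step. */
static sep_int sep_absdiff(sep_int a, sep_int b)
{
    sep_int r;
    for (int pe = 0; pe < NPE; pe++)   /* emulation only */
        r.v[pe] = abs(a.v[pe] - b.v[pe]);
    return r;
}

/* Vertical-gradient edge filter: only the loop over lines is visible. */
void edge_lines(const sep_int *src, sep_int *dst, int lines)
{
    assert(lines >= 3);
    for (int y = 1; y < lines - 1; y++)
        dst[y] = sep_absdiff(src[y + 1], src[y - 1]);
}
```

The first and last lines of `dst` are left untouched, which mirrors the `i=1 .. LINES-2` range used by the 1DC examples in this talk.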
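Average Filter in 1DC (1)
Statement /*1*/: summing three lines at the same time
sep uchar src[256], dst[256];   /* one image line per array element */
void ave33( ){
    int i;
    sep int csum;               /* per-PE sum of three vertically adjacent pixels */
    for(i=1;i<LINES-1;i++){
        csum = src[i-1] + src[i] + src[i+1];     /*1*/
        /* divide before storing, so the 9-pixel sum is not truncated to uchar */
        dst[i] = (:>csum + csum + :<csum) / 9;   /*2*/
    }
}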
[Figure for statement /*1*/: each PE sums its own column of three lines]
  src[i-1]   ...  a6         a7         a8         ...
+ src[i]     ...  b6         b7         b8         ...
+ src[i+1]   ...  c6         c7         c8         ...
= csum       ...  a6+b6+c6   a7+b7+c7   a8+b8+c8   ...
Average Filter in 1DC (2)
Statement /*2*/ of ave33: Neigh. ref. (:>, :<) and "+"
[Figure for statement /*2*/: each PE adds the csum values of its left and right neighbors to its own]
  :>csum     ...  a5+b5+c5   a6+b6+c6   a7+b7+c7   ...
+ csum       ...  a6+b6+c6   a7+b7+c7   a8+b8+c8   ...
+ :<csum     ...  a7+b7+c7   a8+b8+c8   a9+b9+c9   ...
= dst[i]     ...  (sum of the three entries in each column)  ...
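The two statements of ave33 can be emulated in sequential C to check the arithmetic. This is a sketch, not NEC's compiler output: `NPE`, `sep_int`, `from_left` and `from_right` are hypothetical names. Following the figure, `:>csum` is modeled as each PE reading its left neighbor's value and `:<csum` as reading its right neighbor's value. Returning 0 at the array ends is an assumption, since the slides do not show the boundary behavior. The sum is divided by 9 before it is narrowed to `unsigned char`.

```c
#include <assert.h>

#define NPE 8   /* hypothetical number of PEs, one image column each */

/* Emulates a 1DC "sep int": one value per PE. */
typedef struct { int v[NPE]; } sep_int;

/* Emulates ":>x": every PE receives x from its left neighbor.
   The leftmost PE receives 0 (assumption). */
static sep_int from_left(sep_int x)
{
    sep_int r;
    for (int pe = 0; pe < NPE; pe++)
        r.v[pe] = (pe > 0) ? x.v[pe - 1] : 0;
    return r;
}

/* Emulates ":<x": every PE receives x from its right neighbor.
   The rightmost PE receives 0 (assumption). */
static sep_int from_right(sep_int x)
{
    sep_int r;
    for (int pe = 0; pe < NPE; pe++)
        r.v[pe] = (pe < NPE - 1) ? x.v[pe + 1] : 0;
    return r;
}

/* 3x3 average filter over `lines` image lines of NPE pixels each. */
void ave33(unsigned char src[][NPE], unsigned char dst[][NPE], int lines)
{
    assert(lines >= 3);
    for (int i = 1; i < lines - 1; i++) {
        sep_int csum, l, r;
        for (int pe = 0; pe < NPE; pe++)            /* statement 1 */
            csum.v[pe] = src[i - 1][pe] + src[i][pe] + src[i + 1][pe];
        l = from_left(csum);
        r = from_right(csum);
        for (int pe = 0; pe < NPE; pe++)            /* statement 2 */
            dst[i][pe] = (unsigned char)((l.v[pe] + csum.v[pe] + r.v[pe]) / 9);
    }
}
```

Because the column sums are reused by both neighbors, each output line costs two additions for the vertical sum and two for the horizontal sum per PE, instead of eight additions per pixel in a direct 3x3 loop.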
Toward Efficient Execution of 1DC Codes
[Figure: a 1DC program using the Row, Systolic, Slant and Autonomous line methods on the PE array passes through the 1DC compiler / linker and runs on the IMAP hardware: Host Processor; Control Processor (CP) with P$, D$, STK RAM; 128 4 Way VLIW PEs, each with IMEM; SR0 to SR3 for Video IN and Video OUT; External Mem. I/F to SDRAM/SSRAM]
Hardware features highlighted in the figure:
Fast PE grouping
Fast index addressing
Fast left/right referencing
Pipelined data exchange
Programming Environment
Tool chain components: 1DC Source Code, 1DC Optimizing Compiler, Library, IMAP Assembler, Linker, 1DC Symbolic Debugger, IMAP-CE PCI board
[Figure: debugger screenshot with a 1DC source code window, a source image window, an image recognition result window, the timing measurement result for each source code line, and variables assigned to sliders for real-time value tuning debugging]
Outline
1. Challenge of Embedded Image Recognition Systems
2. Integrated Memory Array Processor (IMAP) Architecture
3. Programming Language and Compiler Design
4. Evaluation
5. Summary
Operation Group Kernels
Flexibility against various memory access patterns
IMAP-CE@100MHz, 1DC compiler codes
GPP@2.4GHz, Intel C compiler codes

Op. Grp.   Kernel Name                     IPC
PO         Color format trans.             1.40
LNO        3x3 ave. filter                 1.33
SO         Histogram                       1.66
GlO        FFT                             1.55
GeO        90 degree rotation              1.23
RNO        Distance transform              1.52
OO         Connected component labeling    1.40

[Figure: for each operation group kernel (PO, LNO, SO, GlO, GeO, RNO, OO) and the average, speedup of IMAP-CE over GPP (left axis, 0 to 8) and parallelism (right axis, 0 to 140, max. 128)]
Highly Parallel vs. Sub-Word SIMD
Flexibility against algorithmic complexity
IMAP-CE@100MHz, 1DC compiler codes
GPP@2.4GHz, MMX codes

Processor   Op. Freq.   PE #       Peak Perf. (GOPS: in byte operation)
P4 (SIMD)   2.4GHz      1PEx8x2    38.4GOPS
IMAP-CE     100MHz      128PEx4    51.2GOPS
Ratio       x 1/24      x 32       x 1.33

Benchmark kernels
Name        Purpose
Add2        dyadic arithmetic
GreyOpen3   3x3 grey morphology
Gauss5      5x5 filter
Mexican13   13x13 conv.
Var5Oct     5x5 texture analysis
Canny       edge detection (3x3)
Smoothing   edge preserving smoothing (7x7)

[Figure: for each benchmark kernel and for PO, LNO and the average, speed-up of IMAP-CE over GPP(MMX) (axis 0 to 20) and complexity as # of if-clause per pixel op. (axis 0 to 10)]
Only PO, LNO kernels are used due to the nature of MMX inst.
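The peak performance figures and the three ratios on this slide follow from the slide's own factors. The worked arithmetic below assumes that peak byte operations per second equal the operating frequency times the number of operations issued per cycle, reading "1PEx8x2" as 16 and "128PEx4" as 512 operations per cycle. The symbols f and n are introduced here for illustration only.

```latex
\begin{align*}
\text{Peak} &= f \times n
  && \text{($f$: operating frequency, $n$: byte operations per cycle)}\\
\text{P4 (SIMD):}\quad & 2.4\,\text{GHz} \times (1 \times 8 \times 2) = 38.4\ \text{GOPS}\\
\text{IMAP-CE:}\quad & 0.1\,\text{GHz} \times (128 \times 4) = 51.2\ \text{GOPS}\\
\text{Frequency ratio:}\quad & \frac{0.1}{2.4} = \frac{1}{24}\\
\text{Operations-per-cycle ratio:}\quad & \frac{128 \times 4}{1 \times 8 \times 2} = \frac{512}{16} = 32\\
\text{Peak ratio:}\quad & \frac{51.2}{38.4} = \frac{32}{24} \approx 1.33
\end{align*}
```

The peak ratio is the product of the other two, which is why a 24 times slower clock still yields a slightly higher peak.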
Compared with Some Recent Media Processors
(scratch pad memories)
128 bank memory: IMAP
One to several banks: SRF of Imagine (Stanford), Frame Buffer of Morphosys (UC), Local Store of SPE (CELL: Sony)
On chip vector partitioning & chaining: VIRAM (UCB), CODE (Stanford)
static vector partitioning
[Figure: an image distributed over the PEs, with a 2KB label]

1024 point 1D-FFT performance compared with other media processors
Processor Name      Cycle count   Word Size   Die-size (mm)   Pwr(W)   Tech(um)
Imagine(Float)      2176          16          12*12           4        0.15
Morphosys2          2636          16          16*16           4        0.13
IMAP-CE(IMAPCAR)    5000(3700)    8           11*11           4(2)     0.18(0.13)
VIRAM               5280          16          15*18           2        0.18
A Real Application - Vehicle Detection -
Flexibility at the application level
IMAP-CE@100MHz: use 1DC
GPP@2.4GHz: use C
[Figure: processing time (0 to 80ms) on IMAP-CE and on GPP, split into Lane Mark Detection and Vehicle Detection]
Input: forward looking camera
Lane Mark Detection: four local windows
Vehicle Detection (max. six vehicles): Search, Validate, Tracking vehicles
The Uneven Workload Issue
[Figure: processing time distribution (0% to 100%) between Search and Validate on GPP and on IMAP-CE]
Search: PE array fully utilized
Validation: partial activation of PE array during sequential validation of each candidate area
Outline
1. Challenge of Embedded Image Recognition Systems
2. Integrated Memory Array Processor (IMAP) Architecture
3. Programming Language and Compiler Design
4. Evaluation
5. Summary
Summary
Embedded Image Recognition Processor
1) High Performance
2) Low Cost / High Reliability
3) High Flexibility
[Figure: Flexibility versus Performance with the Technology Barrier and the Flexibility Gap. Labels: (a) GPPs, (b) Media Extended DSPs, Assembly programmed DSPs, (c) Highly parallel SIMD, (d), (e) Wired logics (+DSP core)]
The IMAP approach
Parallel and systolic algorithm design methodology
+ Hardware support of parallelizing methods
+ Extended C Compiler & GUI Debugger
The END
(Thank you for your attention)