NEC Corporation, 6th/June, ISCA2005

An Integrated Memory Array Processor Architecture for Embedded Image Recognition Systems

Shorin KYO *1, Shin'ichiro OKAZAKI *2, Tamio ARAI *1
*1 Media and Information Research Laboratories, NEC Corporation
*2 School of Engineering, University of Tokyo

Outline
1. Challenges of Embedded Image Recognition Systems
2. Integrated Memory Array Processor (IMAP) Architecture
3. Programming Language and Compiler Design
4. Evaluations
5. Summary

Three Basic Requirements
Ex. Embedded Driver Assistant Systems
1) High Performance: real-time response, robustness
2) Cost/Power Efficiency: low cost, easy cooling (< 2 Watt), high quality / reliability, low EMI
3) High Flexibility (Scalability and Versatility): able to handle the combination of [applications × situations × targets]
[Figure: required performance in GOPS on a log scale from 1 to 1000; the only recoverable data label is "Lane Marks"]

Applications × Situations × Targets
• Lane Change Assist
• Park Slot Measurement
• Side Pre-Crash
• Blind Spot Detection
• Cut-In
• Traffic Sign Recognition
• Dynamic Back Up Aid
• Drowsiness Warning
• Backup Parking Assist
• Following Distance Warning
• Front Pre-Crash
• Stop&Go
• Pedestrian Protection
• Cross Traffic Warning

COR: Control versus Operational circuit Ratio
Trade-off items: 1) Performance (higher) 2) Cost (lower) 3) Flexibility (higher)
Cost (die size / power consumption) is shared between the control circuit, which gives flexibility, and the operation circuit, which gives (peak) performance.
[Figure: % of control circuitry (flexibility) plotted against % of operational circuitry (performance), each axis running to 100. Processor classes: a) Desktop/Server CPU (GPPs); b) MIMDs (Multi-Cores); c) DSPs; d) Highly parallel SIMDs; e) Special purpose LSI. Example labels in the plot: Itanium, Sparc64, SPE (CELL), FR1000, FR500, IMAP-CE, IMAPCAR, CODEC LSI.]

Overcoming the Flexibility Gap
[Figure: flexibility versus performance under a fixed cost and technology constraint (a technology barrier). Each design point a) to e) is drawn as a control circuit (Ctrl.) part and an operational circuit (Op.) part, with a "flexibility gap" marked in the plot. Legend: (a) GPPs; (b) DSPs and MIMDs; (c) Highly parallel SIMDs; (d) Custom logics + DSP core; (e) Custom logics only.]
Challenge of embedded image processors ⇒ minimizing COR while overcoming the "Flexibility Gap"

IMAP Series Processors
Processor      Published   Clock     PEs/chip   Other
IMAP-1                     15MHz     8
IMAP-2         ISSCC'95    40MHz     64
IMAP-VISION    CAMP'97     40MHz     32
IMAP-CE        ISSCC'03    100MHz    128        4-Way VLIW, 50GOPS, 0.18um, 2~4 Watt
IMAPCAR                    100MHz    128        4-Way VLIW+MAC, 100GOPS, -40℃~85℃, 0.13um, < 2 Watt
[Figure: peak performance (GOPS, log scale from 0.1 to 1000) versus year (1990 to 2010) for the processors above]
[Figure: IMAP-CE die (32.7M Tr, 0.18um), 11.0mm × 11.0mm, with sixteen PE8 blocks (PE8: eight PEs integration block), CP, DPLL and EXTIF]

Block Diagram and Features
[Figure: block diagram. A linear array of 128 PEs (PE 0 to PE 127), each a 4 Way VLIW PE with its own IMEM, linked to/from the other PEs and to/from the CP, with instructions broadcast (SIMD). Video IN and Video OUT pass through shift registers SR0 to SR3. The Control Processor (CP) has P$, D$ and STK RAM and connects to the Host Processor. EMEM (SDRAM/SSRAM, 64MB~) is attached through the External Mem. I/F. Bandwidth labels: 12.8 GByte/s and 0.8 GByte/s. Column(s) of the source image data are held in the IMEM of one PE.]
PE: ALUx1, MULx1, LOGx1, LSUx1; functional units ADD, LOG, MUL, RDU, LSU, COMM; 24 x 8b General Purpose Registers; 2KB IMEM
1) 100MHz, 128 4-Way VLIW linear array PEs
2) Two level memory architecture + user DMA
3) Automated mapping of image data to each PE
4) 128 individual RAM blocks configuration
5) 1DC (One Dimensional C) + "Line methods"
6) Enhanced PE instruction set design for 1DC

Memory Access Pattern Categories
Category                      Abbrev.   Example
Point Op.                     PO
Local Neigh. Op.              LNO       2d-filters, NN
Statistical Op.               SO        histogram
Global Op.                    GlO       FFT
Geometric Op.                 GeO       affine
Recursive Neigh. Op.          RNO       distance trans.
Object Op.                    OO        labelling/propagation
Each operation maps an input image X to an output image Y or to an output vector / scalar V.
[Figure: processing flow from sensors through low-level image processing (pre-processing, low-level feature extraction) and intermediate-level image processing (measurements, higher level feature extraction, local feature based discrimination) to high-level decision; data is passed as pixels at the lower levels and as symbols at the higher levels]
E. R. Komen: Low-level Image Processing Architectures, Ph.D. Thesis, TUD, Netherlands, 1990.
P. P. Jonker: Architectures for Multidimensional Low- and Intermediate Level Image Processing, Proc. of IAPR Workshop on Machine Vision Applications (MVA'90), pp.307--316, 1990.

Memory Access Pattern Parallelization Issue
Conventional continuous (or strided) address data supply (e.g. streaming data supply) is not sufficient for parallelizing most of the required memory access patterns.
[Figure: a unified RAM feeding an array of SIMD + VLIW PEs]
Op. group   Sufficient?   Memory access pattern
PO          ○             Completely local
LNO         ○             Local neighborhood
SO          ×             Global
GlO         ×             Global
GeO         ×             Global
RNO         ×             Recursive
OO          ×             Data dependent

Memory Access Pattern Parallelization Design
                                              Constrained pixel update
Locality   Unconstrained pixel update   Statically constrained   Dynamically constrained
No         SO, GlO, GeO                 -                        -
Yes        PO, LNO                      RNO                      OO
Statically constrained: the update location is statically predictable.
Dynamically constrained: the update location must be dynamically determined.
Line Methods (PUL: Pixel Updating Line): row-wise, row-systolic, slant-systolic, autonomous
[Figure: the image laid over the PE array for each PUL type, annotated "requires one RAM block / PE configuration"]

Line Methods (1): Combination of PULs
[Figure: PE array diagrams showing PULs combined for 90 degree rotation (2 times), thinning, connected component labeling, and propagation]

Line Methods (2): Expected Speedup
[Table not recoverable: expected cost of each line method *1 under a unified RAM approach and *2 using the memory array architecture (when using N PEs)]
Result: N/2~N times speedup with N PEs

1DC: An Extended C Language
One (vector like) data structure and six operators

    int d, e;              /* ordinary C scalars */
    sep char a, b;         /* sep: the vector like data structure, one element per PE */
    sep char c, ary[256];

[Table not recoverable: correspondence between parallelizing techniques and the 1DC syntax]

1DC: Line-wise Parallel Operation
• Sequential languages (e.g. C)

    for (y=0; y < {number of lines}; y++)
      for (x=0; x < {number of columns}; x++)
        .........

• When using 1DC, skip the {number of columns} loop

    for (y=0; y < {number of lines}; y++)
      ...........

Ex. An Edge Detection Filter
[Figure: the filter result as the line loop advances: y=0, y=120, y=200, y={number of lines}]

Average Filter in 1DC (1)
Summing three lines at the same time (statement /*1*/):

    sep uchar src[256], dst[256];
    ave33( ){
      int i;
      sep int csum;
      for(i=1;i<LINES-1;i++){
        csum = src[i-1] + src[i] + src[i+1];     /*1*/ column sums of three lines
        /* divide before the 8-bit store so the nine-pixel sum is not truncated */
        dst[i] = (:>csum + csum + :<csum) / 9;   /*2*/ add left and right neighbors
      }
    }

    src[i-1]    ...  a6        a7        a8        ...
    src[i]      ...  b6        b7        b8        ...
    src[i+1]    ...  c6        c7        c8        ...
    csum        ...  a6+b6+c6  a7+b7+c7  a8+b8+c8  ...

Average Filter in 1DC (2)
Neigh. ref. (:>, :<) and "+" (statement /*2*/ of ave33 above). On each PE, :>csum reads the csum held by the left neighbor PE and :<csum reads the csum held by the right neighbor PE:

    :>csum      ...  a5+b5+c5  a6+b6+c6  a7+b7+c7  ...
    csum        ...  a6+b6+c6  a7+b7+c7  a8+b8+c8  ...
    :<csum      ...  a7+b7+c7  a8+b8+c8  a9+b9+c9  ...

Each PE adds its column of the three rows above, which gives the 3x3 sum that is divided by 9 and stored in dst[i].

Toward Efficient Execution of 1DC Codes
[Figure: a 1DC program passes through the 1DC compiler / linker and runs on the PE array (PE 0 to PE 127, each a 4 Way VLIW PE with IMEM; Video IN and Video OUT through SR0 to SR3; Control Processor (CP) with P$, D$, STK RAM; Host Processor; SDRAM/SSRAM through the External Mem. I/F). PE array usage is shown for the row, systolic, slant and autonomous PULs.]
Hardware support:
• Fast PE grouping
• Fast index addressing
• Fast left/right referencing
• Pipelined data exchange

Programming Environment
Tool chain: 1DC Source Code → 1DC Optimizing Compiler, Library, IMAP Assembler, Linker → 1DC Symbolic Debugger, running on an IMAP-CE PCI board
Debugger windows and functions:
• 1DC source code window
• Source image window
• Image recognition result window
• Assign variables to sliders (real-time value tuning debugging)
• Timing measurement result for each source code line

Operation Group Kernels
Flexibility against various memory access patterns
IMAP-CE@100MHz, 1DC compiler codes; GPP@2.4GHz, Intel C compiler codes
Op. Grp.   Kernel Name                     IPC
PO         Color format trans.             1.40
LNO        3x3 ave. filter                 1.33
SO         Histogram                       1.66
GlO        FFT                             1.55
GeO        90 degree rotation              1.23
RNO        Distance transform              1.52
OO         Connected component labeling    1.40
[Figure: speedup of IMAP-CE over the GPP (axis 0 to 8) and parallelism (axis 0 to 140, max. 128) for PO, LNO, SO, GlO, GeO, RNO, OO and their average]

Highly Parallel vs. Sub-Word SIMD
Flexibility against algorithmic complexity
IMAP-CE@100MHz, 1DC compiler codes; GPP@2.4GHz, MMX codes
Processor    Op. Freq.   PE #       Peak Perf. (GOPS, in byte operation)
P4 (SIMD)    2.4GHz      1PEx8x2    38.4GOPS
IMAP-CE      100MHz      128PEx4    51.2GOPS
Ratio        x 1/24      x 32       x 1.33

Benchmark kernels:
Name         Purpose
Add2         dyadic arithmetic
GreyOpen3    3x3 grey morphology
Gauss5       5x5 filter
Mexican13    13x13 conv.
Var5Oct      5x5 texture analysis
Canny        edge detection (3x3)
Smoothing    edge preserving smoothing (7x7)
[Figure: speed-up of IMAP-CE over GPP (MMX) (axis 0 to 10) and complexity, measured as # of if-clause per pixel op. (axis 0 to 20), for each kernel and the average]
Only PO and LNO kernels are used, due to the nature of the MMX instructions.
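The peak-performance figures and the three ratios on the slide above can be checked with a few lines of arithmetic. The following is a minimal sketch in Python, assuming that peak GOPS is simply clock frequency × number of PEs × byte operations per PE per cycle, that "1PEx8x2" means 1 PE × 8 byte lanes × 2, and that "128PEx4" means 128 PEs × 4 VLIW slots. The function name `peak_gops` and these readings of the table are illustrative assumptions, not taken from the slides.

```python
def peak_gops(freq_mhz, num_pes, byte_ops_per_pe_per_cycle):
    """Peak throughput in GOPS (10^9 byte operations per second).

    Assumed model: every PE issues the given number of byte
    operations on every clock cycle.
    """
    return freq_mhz * num_pes * byte_ops_per_pe_per_cycle / 1000.0


# P4 (SIMD): 2.4 GHz, "1PEx8x2" read as 1 PE x 8 byte lanes x 2 (assumption).
P4_OPS_PER_CYCLE = 8 * 2
p4_gops = peak_gops(2400, 1, P4_OPS_PER_CYCLE)

# IMAP-CE: 100 MHz, "128PEx4" read as 128 PEs x 4 VLIW slots (assumption).
IMAP_OPS_PER_PE = 4
imap_gops = peak_gops(100, 128, IMAP_OPS_PER_PE)

# Ratios of IMAP-CE to P4, matching the "x 1/24", "x 32", "x 1.33" row.
freq_ratio = 100 / 2400
width_ratio = (128 * IMAP_OPS_PER_PE) / (1 * P4_OPS_PER_CYCLE)
peak_ratio = imap_gops / p4_gops

print(f"P4 peak:      {p4_gops:.1f} GOPS")
print(f"IMAP-CE peak: {imap_gops:.1f} GOPS")
print(f"frequency ratio 1/{1 / freq_ratio:.0f}, "
      f"width ratio {width_ratio:.0f}, peak ratio {peak_ratio:.2f}")
```

Under these assumptions the frequency ratio and the width ratio multiply to the peak ratio, so the 24 times lower clock is more than offset by the 32 times wider datapath.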
Compared with Some Recent Media Processors (scratch pad memories)
• 128 bank memory: IMAP
• One to several banks: SRF of Imagine (Stanford), Frame Buffer of Morphosys (UC), Local Store of SPE (CELL: Sony)
• On chip vector partitioning & chaining: VIRAM (UCB), CODE (Stanford)
[Figure: an image distributed over the 2KB memories of the PE array, labeled "static vector partitioning"]

1024 point 1D-FFT performance compared with other media processors
Processor Name       Cycle count   Word Size   Die-size   Pwr(W)   Tech(um)
Imagine (Float)      2176          16          12*12      4        0.15
Morphosys2           2636          16          16*16      4        0.13
IMAP-CE (IMAPCAR)    5000 (3700)   8           11*11      4 (2)    0.18 (0.13)
VIRAM                5280          16          15*18      2        0.18

A Real Application: Vehicle Detection
Flexibility at the application level
IMAP-CE@100MHz: use 1DC; GPP@2.4GHz: use C
Processing steps, using a forward looking camera:
• Lane Mark Detection (four local windows)
• Vehicle Detection: search (max. six vehicles), validate, tracking vehicles
[Figure: processing time of IMAP-CE and GPP on a 0 to 80 ms axis, split into Lane Mark Detection and Vehicle Detection]

The Uneven Workload Issue
[Figure: processing time distribution (0% to 100%) between Search and Validate for the GPP and for IMAP-CE]
• Search: PE array fully utilized
• Validation: partial activation of the PE array during sequential validation of each candidate area

Summary
Requirements of an Embedded Image Recognition Processor: 1) High Performance 2) Low Cost / High Reliability 3) High Flexibility
[Figure: flexibility versus performance with the technology barrier and the flexibility gap. Legend: (a) GPPs; (b) Media Extended DSPs; (c) Highly parallel SIMD; (d), (e) Wired logics (+DSP core). Assembly programmed DSPs are marked as a separate point.]
The IMAP approach:
Parallel and systolic algorithm design methodology
+ Hardware support of parallelizing methods
+ Extended C Compiler & GUI Debugger

The END (Thank you for your attention)