Real-time Signal Processing on Embedded Systems
Advanced Cutting-edge Research Seminar I&III

Practical Applications
• Pedestrian Detection: FPGA-based system
• Pedestrian Tracking: GPU-based system

Hardware Architecture for High-Accuracy Real-Time Pedestrian Detection with CoHOG Features

Outline
• Introduction
• Pedestrian detection using CoHOG features
• Proposed hardware architecture
  • Parallel execution
  • Merging histogram calculation and SVM prediction
• FPGA implementation
• Conclusion

Pedestrian detection on automotive systems
Challenges:
• Various appearances of pedestrians: clothes' shape and color, pose, etc.
  → Template-based or simple gradient-based methods do not achieve high-accuracy recognition
• Viewpoint movement: all objects in an image are moving
  → Background subtraction or frame subtraction cannot be used
A robust recognition method suitable for pedestrians is required.

Pedestrian detection algorithms
Recent trend: combination of gradients and histograms
• Gradient: robust to illumination and color changes
• Histogram: robust to deformation
• Example: histograms of oriented gradients (HOG)
Co-occurrence histograms of oriented gradients (CoHOG)*
• HOG-based method
• Uses pairs of oriented gradients
• One of today's best algorithms for pedestrian detection
However, real-time execution is difficult to achieve with a software implementation (e.g. a few seconds are required to process a 320x240 image).
→ Specialized hardware is needed for real-time processing
* T. Watanabe, S. Ito, and K. Yokoi, "Co-occurrence histograms of oriented gradients for pedestrian detection," PSIVT 2009

Pedestrian detection using CoHOG
• Calculate gradient orientations
• Divide the gradient orientation image into small regions (BLOCKS)
• Pick up pairwise pixels
• Calculate co-occurrence histograms of oriented gradients
• Repeat for various positions of pixel pairs (called OFFSETS); there are 31 offset variations
• The histograms (offset 1, offset 2, ...) form the CoHOG feature vector, which is classified by SVM

Detection procedure
• Sliding window approach
• Feature vectors are extracted in scan-line order.
• The image size or window size is scaled to detect pedestrians at other scales.

Parallel execution of CoHOG feature calculation
A large number of co-occurrence histograms must be calculated
→ All histograms can be calculated in parallel
Large parallelism:
• Offset variations: 31 (31 parallel threads)
• Block number: 6x12=72 (horizontal: 6 parallel threads, vertical: 12 parallel threads)
We execute 31 parallel offsets and 6 horizontal block-threads = 186 parallel threads.
Processing performance is drastically improved!
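The CoHOG steps listed above (gradient orientations, blocks, pixel pairs, co-occurrence histograms, offsets) can be sketched in a few lines of Python. This is an illustrative sketch only, not the implementation from the slides. It assumes 8 orientation bins (consistent with the 8x8=64 matrix size quoted later), uses central differences instead of a Sobel filter, and takes hypothetical block and offset lists as parameters. Every histogram is independent of the others, which is the property the parallel hardware exploits.

```python
import math

N_BINS = 8  # assumption: 8 orientation bins, i.e. an 8x8 = 64-bin co-occurrence matrix


def orientation_image(img):
    """Quantize the gradient orientation of each interior pixel into N_BINS labels."""
    h, w = len(img), len(img[0])
    ori = [[0] * w for _ in range(h)]  # border pixels keep label 0 in this sketch
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            gx = img[y][x + 1] - img[y][x - 1]  # central difference (illustrative, not Sobel)
            gy = img[y + 1][x] - img[y - 1][x]
            angle = math.atan2(gy, gx) % (2 * math.pi)
            ori[y][x] = int(angle / (2 * math.pi) * N_BINS) % N_BINS
    return ori


def cooccurrence_histogram(ori, block, offset):
    """Count orientation pairs (i, j) for one block and one offset."""
    x0, y0, bw, bh = block
    dx, dy = offset
    h, w = len(ori), len(ori[0])
    hist = [[0] * N_BINS for _ in range(N_BINS)]
    for y in range(y0, y0 + bh):
        for x in range(x0, x0 + bw):
            px, py = x + dx, y + dy  # partner pixel of the pair
            if 0 <= px < w and 0 <= py < h:
                hist[ori[y][x]][ori[py][px]] += 1
    return hist


def cohog_feature(ori, blocks, offsets):
    """Concatenate the histograms of all (block, offset) combinations."""
    feature = []
    for block in blocks:
        for offset in offsets:
            hist = cooccurrence_histogram(ori, block, offset)
            feature.extend(v for row in hist for v in row)
    return feature


# Hypothetical toy input: a 6x6 horizontal ramp, two 3x3 blocks, three offsets.
img = [[x for x in range(6)] for _ in range(6)]
ori = orientation_image(img)
feature = cohog_feature(ori, [(0, 0, 3, 3), (3, 0, 3, 3)], [(0, 0), (1, 0), (0, 1)])
print(len(feature))  # 2 blocks * 3 offsets * 64 bins
```

With the 72 blocks and 31 offsets from the slides, the same loop structure would produce 72 × 31 × 64 values per window.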
Merging histogram calculation and SVM prediction
• Matrix size: 8x8=64
• Offset variations: 31
• Block number: 6x12=72
The dimension of the CoHOG feature vector is very high: 64 × 31 offsets × 72 blocks = about 140k dimensions
• Large memory is required to store the feature vector
• Many multiplications must be executed during SVM prediction:
$$f(x) = \mathrm{sign}(w \cdot x + b)$$
Our proposal: execute histogram calculation and SVM prediction simultaneously

Merging histogram calculation and SVM prediction: straightforward approach
• Histogram calculation: scan the image and add +1 to the corresponding bin; a histogram is generated
$$x_{i,j} = \sum_{\text{image}} \begin{cases} 1, & \text{if orientations are } (i,j) \\ 0, & \text{otherwise} \end{cases}$$
• SVM prediction: the inner product with the weighting vector values is calculated
$$w \cdot x = \sum_{i,j} w_{i,j}\, x_{i,j}$$

Merging histogram calculation and SVM prediction: proposed method
• Scan the image and directly accumulate the weighting vector values ($+w_{i,j}$ instead of $+1$)
$$w \cdot x = \sum_{i,j} w_{i,j} \sum_{\text{image}} \begin{cases} 1, & \text{if orientations are } (i,j) \\ 0, & \text{otherwise} \end{cases} = \sum_{\text{image}} \begin{cases} w_{i,j}, & \text{if orientations are } (i,j) \\ 0, & \text{otherwise} \end{cases}$$
• Large memory to store histograms and many multipliers for SVM prediction are unnecessary
Circuit size can be drastically reduced!
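The equivalence between the straightforward and the merged computation can be checked with a small Python sketch. It is not the circuit from the slides: the function names, the 8-bin assumption, and the random integer weights (standing in for fixed-point values held in the weighting vector ROMs) are all illustrative.

```python
import random

N_BINS = 8  # assumption: 8 orientation bins, i.e. 8x8 = 64 histogram bins


def svm_score_two_step(pairs, w, b):
    """Straightforward approach: build the histogram, then take the inner product."""
    x = [[0] * N_BINS for _ in range(N_BINS)]
    for i, j in pairs:
        x[i][j] += 1  # +1 to the corresponding bin
    return sum(w[i][j] * x[i][j] for i in range(N_BINS) for j in range(N_BINS)) + b


def svm_score_merged(pairs, w, b):
    """Merged approach: accumulate the weight of each observed pair directly."""
    acc = b
    for i, j in pairs:
        acc += w[i][j]  # +w[i][j]: no histogram memory, no multiplier
    return acc


def sign(score):
    return 1 if score >= 0 else -1


rng = random.Random(0)
# Hypothetical integer weights and orientation pairs for one block and one offset.
w = [[rng.randint(-100, 100) for _ in range(N_BINS)] for _ in range(N_BINS)]
pairs = [(rng.randrange(N_BINS), rng.randrange(N_BINS)) for _ in range(500)]
b = -7

score_a = svm_score_two_step(pairs, w, b)
score_b = svm_score_merged(pairs, w, b)
print(score_a == score_b, sign(score_b))
```

The merged version needs only a table lookup and an adder per pixel pair, which mirrors the "weighting vector ROMs and an accumulator" structure described for the hardware.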
Proposed architecture
Components, from input image to results:
• Gradient orientation image generator: line buffers, Sobel filter (horizontal), Sobel filter (vertical), orientation classifier
• Frame buffer (WxH) and controller
• Combined module for histogram calculation and SVM prediction: shift registers (subwindow data), weighting vector ROMs (31 offsets × 6 blocks), accumulator
Key points:
• Parallel execution: 31 offsets × 6 blocks = 186 parallel threads
• Merging histogram calculation and SVM prediction: no histogram memory and no multipliers, only weighting vector ROMs and an accumulator
An efficient hardware architecture is successfully designed by using the proposed methods.

FPGA implementation
Implementation result
Target FPGA: Xilinx Virtex-5 XC5VLX330T-2

Resource                      Used      Available   Utilization
Number of Slice Registers     5,980     207,360     2%
Number of Slice LUTs          28,495    207,360     13%
Number of occupied Slices     8,580     51,840      16%
Number of BlockRAM            61        324         18%
Total Memory used (KB)        2,196     11,664      18%
Number of DSP48Es             2         192         1%

Max delay: 5.997 ns (max frequency: 167 MHz)
• Our system can process 139,166 sub-windows / second
• Intel Core i7 3.2GHz: about 1,100 sub-windows / second
  → More than 100 times faster!
• Capable of real-time processing of a 38 fps 320x240 video sequence
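The headline figures above can be cross-checked with simple arithmetic. The sketch below uses only numbers quoted on the slide. The sub-windows-per-frame value is a derived quantity (throughput divided by frame rate), not a figure stated in the source, and the truncating percentage rule is an assumption that happens to reproduce the reported utilization column.

```python
# Throughput figures quoted on the slide.
fpga_subwindows_per_sec = 139_166
cpu_subwindows_per_sec = 1_100
frames_per_sec = 38
max_delay_ns = 5.997

speedup = fpga_subwindows_per_sec / cpu_subwindows_per_sec
subwindows_per_frame = fpga_subwindows_per_sec / frames_per_sec  # derived, not stated in the source
max_frequency_mhz = 1_000 / max_delay_ns  # 1 / (delay in ns) expressed in MHz


def utilization_percent(used, available):
    """Truncated percentage (assumed rounding rule)."""
    return 100 * used // available


# (used, available, reported utilization in %)
resources = {
    "Slice Registers": (5_980, 207_360, 2),
    "Slice LUTs": (28_495, 207_360, 13),
    "Occupied Slices": (8_580, 51_840, 16),
    "BlockRAM": (61, 324, 18),
    "Total Memory (KB)": (2_196, 11_664, 18),
    "DSP48Es": (2, 192, 1),
}

print(f"speedup: {speedup:.1f}x")
print(f"sub-windows per frame at {frames_per_sec} fps: {subwindows_per_frame:.0f}")
print(f"max frequency: {max_frequency_mhz:.1f} MHz")
for name, (used, available, reported) in resources.items():
    print(name, utilization_percent(used, available), reported)
```

The speedup comes out above the "more than 100 times" claim, and the computed utilization values agree with the reported column.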
Pedestrian detection system
• FPGA board: receives input images from the host PC and returns the results of pedestrian detection
  • Xilinx Virtex-5 FPGA LX330T
  • PCI Express endpoint
  • DDR2 memory
• Host PC: transfers images captured by a camera and displays detection results
  • CPU: Intel Core i7 3.2GHz
  • Camera: USB webcam (640x480 resolution)
• Interface between FPGA board and host PC: PCI Express
[Figure: detection result]

Conclusion
A high-performance and efficient hardware architecture for CoHOG-based pedestrian detection is proposed. It:
• Effectively exploits parallelism in the CoHOG algorithm
  → 186-way parallel processing is realized
• Drastically reduces circuit area (memory and multipliers) by executing histogram calculation and SVM prediction simultaneously
• Achieves more than 100 times faster processing on the FPGA than on a CPU
  → Capable of real-time processing of a 38 fps 320x240 video sequence

Parallel Implementation of Pedestrian Tracking Using Multiple Cues on GPGPU

Outline
• Introduction
• Pedestrian Tracking using Multiple Cues
• Parallel Implementation on NVIDIA GPU
• Conclusion

Introduction
Pedestrian recognition is a combination of 2 steps:
• Detection: scan the entire input image
• Tracking: track the pedestrians over the frames
Pedestrian tracking with a particle filter and an HSV color histogram (K. Okuma et al., ECCV 2004), using the HSV histogram within the rectangle:
• Simple background: succeeds in tracking
• Complex background: fails to track
[Figure: color information (a red shirt on gray ground and a red car on gray ground, each with its HSV histogram) and shape information]
→ Combining both color and shape information
The contributions of this paper:
• A new pedestrian tracking algorithm using both color and shape information, based on particle filters
• A parallel implementation on GPGPU for real-time processing

Particle Filter (pedestrian tracking)
Steps repeated for each frame (time t-1 → time t):
• Re-sampling: eliminate low-likelihood particles and replicate high-likelihood particles
• Prediction: scatter particles
• Measurement: measure the pedestrian likelihood
To define the pedestrian likelihood, we use:
• Shape information: HOG feature
• Color information: HSV histogram

Histograms of Oriented Gradients
Represent object shape information
• Calculate gradient orientation
• Aggregate gradient orientation of each block
• Map the vector on the feature space
• Learn beforehand by SVM: a discriminant border separates pedestrian HOG from non-pedestrian HOG in the feature space

HSV Histogram
Represents object color information (HSV color space: hue, saturation, value)
• Convert an input image into an HSV image
• Calculate an HSV hist.
• Calculate a Bhattacharyya dist. to the reference HSV hist. in the HSV feature space

Pedestrian tracking using multiple cues
In the measurement step, the two cues are combined:
$$\text{Pedestrian likelihood} = c\, f(\mathrm{HOG}) + (1 - c)\, g(\mathrm{HSV})$$
• f(HOG): shape cue from the HOG feature space (pedestrian vs. non-pedestrian)
• g(HSV): color cue from the HSV feature space, relative to the reference HSV hist. (existing algorithm)
• c: weighted coefficient in [0,1]

Tracking results
[Figure: tracking results for HOG+HSV (our proposed algorithm), HSV only (K. Okuma et al., ECCV 2004), and HOG only]

NVIDIA GPU architecture
• Streaming multiprocessors (SM)
  • 32-bit scalar processors (SP)
  • Shared memory
  • Read-only cache
• Device memory
In case of Tesla C1060:
• 4GB device memory
• 30 streaming multiprocessors (total 240 SPs)
• 1.3 GHz processor clock

Implementation strategy
• Run the measurement process on the GPU: it accounts for almost 99% of the computation time
• Allocate each particle to an SM: the process of each particle is independent
• Exploit pixel-level parallelism on SPs: sync. among SPs is fast

HSV likelihood calculation
• Allocate each particle calculation to the SM
• Calculate the HSV histogram per line on SPs
• Sum all the histograms
• Calculate the Bhattacharyya dist. to the reference HSV hist.
• Transfer the results to the CPU memory

HOG likelihood calculation
• Allocate each particle calculation to the SM
• Calculate grad. and angle on SPs
• Calculate the HOG histogram on SPs per some pixels
• Sum histograms
• Calculate the distance to the discriminant border
• Transfer the results to the CPU memory

Processing time
• GPU: NVIDIA Tesla C1060 (number of multiprocessors: 30, total number of scalar processors: 240)
• Compared with Intel Core i7 965 @ 3.2 GHz
[Figure: processing time per frame [ms], Core i7 vs. Tesla C1060]
• Tesla C1060: 113.6 fps, 13.9 times faster

Conclusion
• A pedestrian tracking algorithm using HSV and HOG features is proposed
• Real-time processing can be achieved by the parallel implementation using an NVIDIA GPU

Report subject (not mandatory)
What do you think about the advance of signal processing on embedded systems in the future? Write freely; applications or things you would like to do are all fine.
• Please submit the report by email to miya@is.naist.jp.
• Please write your student ID and name in the body of the email.
• Deadline: Feb 3rd 17:00
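As a supplement to the tracking part of the seminar, the measurement and re-sampling steps of the particle filter can be sketched as follows. This is a minimal sketch under stated assumptions, not the authors' implementation. The slides only give the combination rule c f(HOG) + (1 - c) g(HSV); the logistic mapping used here for f, the Gaussian mapping for g, the `sigma` parameter, and the sqrt(1 - BC) form of the Bhattacharyya distance are assumptions chosen for illustration.

```python
import math
import random


def bhattacharyya_distance(p, q):
    """Distance between two normalized histograms, here sqrt(1 - BC) (0 = identical)."""
    bc = sum(math.sqrt(a * b) for a, b in zip(p, q))
    return math.sqrt(max(0.0, 1.0 - bc))


def hsv_likelihood(hist, ref_hist, sigma=0.2):
    """g(HSV): assumed Gaussian mapping from histogram distance to likelihood."""
    d = bhattacharyya_distance(hist, ref_hist)
    return math.exp(-d * d / (2 * sigma * sigma))


def hog_likelihood(svm_margin):
    """f(HOG): assumed logistic mapping of the signed distance to the discriminant border."""
    return 1.0 / (1.0 + math.exp(-svm_margin))


def pedestrian_likelihood(svm_margin, hist, ref_hist, c=0.5):
    """Combined likelihood c*f(HOG) + (1-c)*g(HSV) with c in [0, 1]."""
    return c * hog_likelihood(svm_margin) + (1 - c) * hsv_likelihood(hist, ref_hist)


def resample(particles, weights, rng):
    """Replicate high-likelihood particles and drop low-likelihood ones."""
    return rng.choices(particles, weights=weights, k=len(particles))


# Hypothetical particles: (x, y) positions with made-up SVM margins and histograms.
rng = random.Random(0)
ref_hist = [0.5, 0.3, 0.2]
particles = [(10, 20), (12, 21), (40, 5)]
margins = [1.5, 0.8, -2.0]
hists = [[0.5, 0.3, 0.2], [0.4, 0.4, 0.2], [0.1, 0.1, 0.8]]

weights = [pedestrian_likelihood(m, h, ref_hist, c=0.5) for m, h in zip(margins, hists)]
new_particles = resample(particles, weights, rng)
print(weights, new_particles)
```

Each particle's likelihood depends only on its own image region, which is why the slides assign one particle per streaming multiprocessor.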