Proceedings of the 7th Annual ISC Graduate Research Symposium, ISC-GRS 2013
April 24, 2013, Rolla, Missouri

Yungxiang Mao
Department of Computer Science
Missouri University of Science and Technology, Rolla, MO 65409

PEDESTRIAN DETECTION USING MULTIPLE FEATURES

ABSTRACT
Pedestrian detection has been a hot topic in the field of computer vision. Due to variations in pedestrian pose, illumination conditions, viewpoints, and occlusion, robust pedestrian detection remains a challenge. To better distinguish pedestrians from the background, we need effective features to encode the Region of Interest (ROI) that contains a pedestrian. To this end, we propose to combine three features to build our classifier: the Histogram of Oriented Gradient (HOG) feature describes the shape information of a single image; the cell-structured Local Binary Pattern (LBP) provides texture information; and the Histogram of Oriented Motion (HOM) feature makes use of motion information to decide whether a given ROI contains a pedestrian. Experiments on our own airborne live video show that the proposed approach has the potential to distinguish pedestrians from the background.

1. INTRODUCTION
Several methods have been proposed to detect pedestrians in live video robustly and efficiently. Among all the approaches, there are mainly two philosophies: global descriptors [1-6] and part-based descriptors [7, 8]. Our method belongs to the global descriptor class.

Available data sets such as ETH [9], TUD-Brussels [10], Daimler [11], INRIA [1], and Caltech [13] appear sufficient for both training and testing; the Caltech data set alone contains 350,000 pedestrian bounding boxes labeled in 250,000 frames, as shown in Fig. 1. However, we cannot adopt these data sets for our training and testing for two reasons: (1) pedestrians in these data sets are all captured from a side view, while pedestrians in our airborne detection task show large shape variation; (2) we need motion information for one of our three features, but the images in these data sets are all still images. Therefore, we build our own eagle-view data set.

As for the learning process, linear Support Vector Machines [14] and boosted classifiers [15] are popular in most current methods due to their good performance, and the remaining detection details do not differ much across methods [16]. Therefore, the most important factor affecting pedestrian detection performance is the choice of suitable features. A significant number of features have been explored in the past decade. The HOG feature of Dalal and Triggs [1] brought a large gain in pedestrian detection performance. Inspired by HOG, Wang et al. [3] proposed a feature that combines HOG with cell-structured LBP, together with a partial occlusion handling method, to improve overall detection performance. Motion is also a key cue for human perception. Viola et al. [17] successfully incorporated motion features into detectors, resulting in a large performance gain, and Dalal et al. [18] built a motion model based on optical flow differences. Most current methods use a single-scale model to scan differently scaled input images; Benenson et al. [19] instead proposed to use multiple model scales to scan a single-scale input image, both to speed up detection and to eliminate the blurring effect of resizing.

Fig. 1. Example images cropped from four pedestrian detection data sets: (a) Caltech, (b) Caltech-Japan, (c) ETH, (d) our data set.
Pedestrians in (a), (b), and (c) are all side-view, while our airborne videos are eagle-view.

1.1. Contribution

Data Set. Existing popular data sets are all collected from a side view. This difference in viewpoint between the existing data sets and our videos would obviously harm performance. To obtain optimal performance, we build our own data set with a self-built quadcopter carrying a GoPro camera. We also collect a motion data set, which is not included in the existing popular data sets, for training our motion feature.

Multiple Features. A single feature can detect pedestrians with reasonable performance, but it cannot exceed the performance of multiple features. Inspired by previous work [3, 18], we propose a three-feature classifier. Our experiments show an obvious performance gain.

This paper is organized as follows: we introduce our multiple features in Section 2; our own data set and the details involved in detection are introduced in Section 3; in Section 4, we compare the performance of the multiple features against a single feature alone and against the HOG-LBP feature.

2. MULTIPLE FEATURES

We are dealing with pedestrian detection in airborne videos, which is more challenging. To achieve satisfactory performance, a new feature needs to be created. In our proposed pedestrian detection procedure, we integrate HOG, cell-structured LBP, and the Histogram of Oriented Motion (HOM) into a large-scale feature space. The framework of our multiple-feature classifier is shown in Fig. 2. Details of how each feature is built are given in the next three subsections.

Fig. 2. Procedure of the proposed method. From the current frame, HOG is extracted (compute gradients, weighted vote into bins in each cell, normalize over overlapping spatial blocks) and cell-structured LBP is extracted (compute LBP at each pixel, count pattern transitions, vote into bins in each cell, normalize over overlapping spatial blocks); from the current and consecutive frames, HOM is extracted after image warping (compute the frame difference, weighted vote into bins in each cell, normalize over overlapping spatial blocks). The three features are collected separately to obtain our final large-scale feature vector.

2.1. Histogram of Oriented Gradient

The Histogram of Oriented Gradient (HOG), an efficient descriptor of object gradients in images, has been highly popular over the last decade for classification together with a Support Vector Machine. Briefly, the HOG method computes the orientation and magnitude of the gradient at each pixel and votes the orientations into bins in each cell, with each pixel's vote weighted by its gradient magnitude. Each cell histogram is then normalized over the (up to four) overlapping blocks the cell belongs to. HOG thereby not only addresses the problem of normalizing each cell to remove the effect of illumination, but also spreads the influence of each cell to its neighboring cells.

Fig. 3. Visualization of the HOG feature for a pedestrian and for the background.

There are two ways to apply the sliding-window technique. The most common one is to resize the input frame and scan the differently scaled images with a single-scale model. In contrast, [19] suggests that using multi-scale classifiers for pedestrian detection not only avoids the blurring effect of resizing the image but also speeds up the whole procedure. Therefore, in our proposed pedestrian detection system we follow this idea and build classifiers at 9 scales to scan the input frame.
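As a concrete illustration of the per-cell voting and block normalization described in Section 2.1, the following Python sketch computes a simplified HOG descriptor with NumPy. The cell size, block size, and number of orientation bins follow the defaults of the original HOG paper; the function name `hog_descriptor` is only a placeholder for illustration, not code from our system.

```python
import numpy as np

def hog_descriptor(gray, cell=8, block=2, n_bins=9):
    """Simplified HOG: per-cell orientation voting + L2 block normalization."""
    gray = gray.astype(np.float32)
    # Centered gradients in x and y.
    gx = np.zeros_like(gray); gy = np.zeros_like(gray)
    gx[:, 1:-1] = gray[:, 2:] - gray[:, :-2]
    gy[1:-1, :] = gray[2:, :] - gray[:-2, :]
    mag = np.hypot(gx, gy)
    ang = np.rad2deg(np.arctan2(gy, gx)) % 180.0          # unsigned orientation

    n_cy, n_cx = gray.shape[0] // cell, gray.shape[1] // cell
    hist = np.zeros((n_cy, n_cx, n_bins), dtype=np.float32)
    bin_idx = (ang / (180.0 / n_bins)).astype(int) % n_bins
    for cy in range(n_cy):
        for cx in range(n_cx):
            sl = (slice(cy * cell, (cy + 1) * cell),
                  slice(cx * cell, (cx + 1) * cell))
            # Magnitude-weighted vote of every pixel in the cell into its orientation bin.
            hist[cy, cx] = np.bincount(bin_idx[sl].ravel(),
                                       weights=mag[sl].ravel(),
                                       minlength=n_bins)

    # Normalize over overlapping blocks of 2x2 cells and concatenate.
    feats = []
    for by in range(n_cy - block + 1):
        for bx in range(n_cx - block + 1):
            v = hist[by:by + block, bx:bx + block].ravel()
            feats.append(v / (np.linalg.norm(v) + 1e-6))
    return np.concatenate(feats)
```

For a 128x64 detection window with these settings, this yields the familiar 3,780-dimensional descriptor.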
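The multi-scale scanning idea of [19] can be sketched as follows: rather than repeatedly resizing the image for a single model, windows of several model sizes are slid over the single-scale frame. The window heights, the 2:1 aspect ratio, and the `score` method of the model objects are placeholders for illustration only, not the exact values or interfaces used in our system.

```python
def detect_multiscale(frame_gray, models, stride=8, thresh=0.0):
    """Scan one single-scale frame with several model scales (after [19])."""
    H, W = frame_gray.shape
    detections = []
    for model in models:                      # one trained classifier per window size
        win_h, win_w = model.window_size      # e.g. heights 28..168 px, 2:1 aspect ratio
        for y in range(0, H - win_h + 1, stride):
            for x in range(0, W - win_w + 1, stride):
                patch = frame_gray[y:y + win_h, x:x + win_w]
                score = model.score(patch)    # placeholder: feature extraction + linear SVM
                if score > thresh:
                    detections.append((x, y, win_w, win_h, score))
    return detections
```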
2.2. Cell-structured Local Binary Pattern

While no single feature performs better than HOG, an additional feature can provide complementary information to improve performance. LBP has been widely used in applications such as face recognition and has achieved good results. Its key advantages are invariance to monotonic gray-level changes and computational efficiency, which make it suitable for applications such as pedestrian detection. Inspired by HOG, Wang et al. [3] proposed to use the LBP operator as a descriptor for pedestrians, adding a cell-structured LBP feature as an augmented feature vector. It is known that the HOG feature performs poorly when there are noisy edges in the background; LBP can filter out such noise through the concept of uniform patterns [20]. By combining the characteristics of HOG and cell-structured LBP, a descriptor that captures both properties of pedestrians performs better than a descriptor using HOG alone.

Following the procedure used to extract the HOG feature, we first extract the LBP at each pixel. We use LBP8,1 [3]: for each pixel, its 8 neighbors within radius 1 are compared with the central pixel, and a neighbor is coded as 1 if its value is greater than or equal to the central value and 0 otherwise. The second step is to count the 0-1 and 1-0 transitions of the LBP. There are 8 bits in one LBP, so the number of transitions varies from 0 to 7, i.e., 8 bins in total. The third step is to vote the LBP transition counts of the pixels within each cell into the 8 bins. Finally, each cell is normalized over the four blocks it belongs to.

Fig. 4. Example of LBP8,1 feature extraction.

2.3. Histogram of Oriented Motion

With the combination of HOG and cell-structured LBP, pedestrian detection achieves better performance. However, another notable feature that is highly useful for distinguishing pedestrians from the background is motion. One might argue that HOG already captures the boundary information of a pedestrian, so there is no need to build HOM. We still insist on doing so: motion does appear along the boundary of a pedestrian, but the boundaries of the background disappear in HOM while they still exist in HOG. Fig. 5 shows the difference between the HOG feature and the motion input.

Since the videos are collected by moving cameras mounted on quadcopters, we must perform video stabilization first to obtain the real motion of pedestrians; in this way, the background in consecutive frames appears as stable as it would with a stationary camera. The interval between the two frames used for the homography should not be large, since otherwise the background noise grows and the background and pedestrians become harder to separate.

Also inspired by the HOG feature, the Histogram of Oriented Motion (HOM) differs slightly from the motion descriptors introduced in [18]. Instead of using optical flow as the input of HOM, we directly use the frame difference. Denote the frame difference by I_c and its x- and y-derivatives by I_cx and I_cy; we then follow the HOG procedure to build HOM. One thing to notice is that the motion of the background is much weaker than that of pedestrians. If we used the same normalization step as HOG, we would not make good use of this information to separate pedestrians from the background. Therefore, to preserve the difference between pedestrian motion and background motion, we apply a threshold to filter out motion with low values before performing the same normalization as in HOG.

Fig. 5. Comparison of the HOG feature and the motion input. The first column shows cropped background and pedestrian training samples. The second column visualizes the HOG feature of the background and the pedestrian. The third column shows the motion of the background and the pedestrian without thresholding. The last column shows the motion after thresholding.

To human perception it is obvious that the HOG of a pedestrian and the HOG of the background do not differ much, but the motion feature does.
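A minimal Python sketch of the LBP8,1 step from Section 2.2: each pixel's 8 neighbors at radius 1 are thresholded against the center, the 0-1/1-0 transitions of the resulting 8-bit pattern are counted (0 to 7, i.e., 8 bins), and the transition counts within each cell are voted into an 8-bin histogram. The function name and the cell size are placeholders for illustration.

```python
import numpy as np

# Offsets of the 8 neighbors at radius 1, listed in order around the center pixel.
_NEIGHBORS = [(-1, -1), (-1, 0), (-1, 1), (0, 1), (1, 1), (1, 0), (1, -1), (0, -1)]

def lbp_transition_histograms(gray, cell=16):
    """LBP8,1 per pixel, then an 8-bin histogram of 0-1/1-0 transition counts per cell."""
    g = gray.astype(np.int32)
    h, w = g.shape
    center = g[1:h - 1, 1:w - 1]
    # 1 if the neighbor is >= the center pixel, 0 otherwise (one plane per neighbor).
    bits = np.stack([(g[1 + dy:h - 1 + dy, 1 + dx:w - 1 + dx] >= center)
                     for dy, dx in _NEIGHBORS]).astype(np.int32)
    # Transitions between consecutive bits of the 8-bit pattern: values 0..7, i.e. 8 bins.
    transitions = np.abs(np.diff(bits, axis=0)).sum(axis=0)

    n_cy, n_cx = transitions.shape[0] // cell, transitions.shape[1] // cell
    hist = np.zeros((n_cy, n_cx, 8), dtype=np.float32)
    for cy in range(n_cy):
        for cx in range(n_cx):
            block = transitions[cy * cell:(cy + 1) * cell, cx * cell:(cx + 1) * cell]
            hist[cy, cx] = np.bincount(block.ravel(), minlength=8)[:8]
    return hist  # per-cell histograms; block normalization follows as in HOG
```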
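And a sketch of one plausible reading of the HOM input described in Section 2.3: the absolute frame difference between two already-stabilized frames is thresholded to suppress weak background motion, and its derivatives then feed the same per-cell voting and block normalization sketched above for HOG. The threshold value here is arbitrary and only illustrative.

```python
import numpy as np

def hom_input(prev_gray, cur_gray, motion_thresh=15.0):
    """Thresholded frame difference I_c and its derivatives I_cx, I_cy (HOM input).

    Both frames are assumed to be already stabilized (background aligned by a
    homography), so the remaining difference is dominated by pedestrian motion.
    """
    i_c = np.abs(cur_gray.astype(np.float32) - prev_gray.astype(np.float32))
    i_c[i_c < motion_thresh] = 0.0            # filter out weak background motion

    # x- and y-derivatives of the frame difference, as in the HOG gradient step.
    i_cx = np.zeros_like(i_c); i_cy = np.zeros_like(i_c)
    i_cx[:, 1:-1] = i_c[:, 2:] - i_c[:, :-2]
    i_cy[1:-1, :] = i_c[2:, :] - i_c[:-2, :]

    mag = np.hypot(i_cx, i_cy)
    ang = np.rad2deg(np.arctan2(i_cy, i_cx)) % 180.0
    return mag, ang                            # fed to the same cell voting as HOG
```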
3. TRAINING DATA SET AND SOME DETAILS

We use our self-built quadcopter to collect videos for the training data set. The data set consists of 5 videos lasting more than 10 minutes in total. Due to the high-frequency vibration of the quadcopter, videos taken by common cameras tend to exhibit strong blur, which decreases detection performance. To reduce this blur, we use a GoPro camera, which provides good video stabilization quality.

When cropping pedestrians from the training data set, we fix the height-to-width ratio of the windows to 2:1. We divide the positive training data into 9 scales, with heights ranging from 28 to 168 pixels.

Fig. 6. Distribution of the heights of the positive training samples.

For the cell size and the number of orientations of HOG, and for the cost C in the SVM, we simply follow the original paper's suggestions; HOM uses the same cell size and number of orientations as HOG. The linear SVM is quite popular for pedestrian detection due to its low computational cost, and we use SVM-Light to train our classifiers. We start building classifiers at each scale with more than 2,386 positive training samples cropped from 40 pedestrians, together with their flipped images, and 150 randomly selected negative training samples; their motion samples are collected at the same time. We then run the initial 9 classifiers on our training videos and add the resulting false positives to the negative training set.

A crucial point in training, which is often underestimated, is that when the training set is not large, one specific pedestrian with the same posture should not be added to the positive training set repeatedly. If training samples of one pedestrian with similar postures are added too many times, the weight of that posture increases and the overall detection performance decreases. The same applies to the negative training set.

Motion information not only works as an effective feature but also provides a strong cue for where to place the scanning window. However, due to the vibration and mobility of the quadcopter, image stabilization must be performed first to obtain motion information. We then enlarge the detection area around each motion point and perform uniform sampling to cut down the detection cost.

4. EVALUATION

Currently popular pedestrian detection data sets are not suitable for our test, both because of the different viewpoint, which would harm performance, and because they lack the motion information needed for training and testing. Our training and testing are therefore performed on our own data set. In the evaluation, we compare three kinds of features: HOG alone, the combination of HOG and LBP, and our HOG-LBP-HOM feature.

For the evaluation, we adopt two measures, recall and precision. Let BB_dt denote a detected bounding box and BB_gt a ground-truth bounding box. Then

    recall = (successfully detected BB) / (all BB_gt)                        (1)

    precision = (successfully detected BB) / (all detected BB)               (2)

A pedestrian is considered detected when the following overlap condition is satisfied:

    a_0 = area(BB_dt ∩ BB_gt) / area(BB_dt ∪ BB_gt) > 0.5                    (3)
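A small Python sketch of this evaluation protocol, matching detections to ground truth with the overlap criterion of Eq. (3) and then computing recall (1) and precision (2). Boxes are assumed to be (x, y, w, h) tuples, and the greedy one-to-one matching is our own simplification for illustration.

```python
def iou(a, b):
    """Intersection-over-union of two (x, y, w, h) boxes, as in Eq. (3)."""
    ax1, ay1, ax2, ay2 = a[0], a[1], a[0] + a[2], a[1] + a[3]
    bx1, by1, bx2, by2 = b[0], b[1], b[0] + b[2], b[1] + b[3]
    iw = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    ih = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = iw * ih
    union = a[2] * a[3] + b[2] * b[3] - inter
    return inter / union if union > 0 else 0.0

def recall_precision(detections, ground_truth, overlap=0.5):
    """Greedy matching: a detection counts when it overlaps an unmatched GT box > 0.5."""
    matched_gt, true_pos = set(), 0
    for det in detections:
        for i, gt in enumerate(ground_truth):
            if i not in matched_gt and iou(det, gt) > overlap:
                matched_gt.add(i)
                true_pos += 1
                break
    recall = true_pos / len(ground_truth) if ground_truth else 0.0      # Eq. (1)
    precision = true_pos / len(detections) if detections else 0.0       # Eq. (2)
    return recall, precision
```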
Taking the long detection time into consideration, we evaluate the three features on 102 pedestrians. These 102 pedestrians appear at different scales and from different viewpoints. The experimental results are shown in Table 1.

Table 1. Comparison of the three features in terms of recall and precision.

    Feature        Recall    Precision
    HOG            60.1%     58.5%
    HOG-LBP        72.3%     65.5%
    HOG-LBP-HOM    76.5%     72.3%

As shown in Table 1, our multiple-feature classifier outperforms HOG and HOG-LBP by 27% and 5% (relative) in recall, and by 23.5% and 10.4% (relative) in precision. Some of our test images are shown in Fig. 7; more test results of the multiple-feature classifier are shown in Fig. 8.

Fig. 7. Tests of the HOG, HOG-LBP, and HOG-LBP-HOM classifiers. (a), (b) are results of the HOG classifier; (c), (d) are results of the HOG-LBP classifier; (e), (f) are results of our HOG-LBP-HOM classifier.

5. CONCLUSIONS

In this paper, we propose a three-feature classifier for pedestrian detection. By adding a carefully designed Histogram of Oriented Motion, our multi-feature detector performs better than HOG alone and than the HOG-LBP detector. For airborne pedestrian detection, we also build our own data set for both training and testing, with motion information included so that the motion feature can be used.

One drawback of our work is that extracting three features while using an SVM for classification can be time-consuming. "Feature mining" has been proposed by Dollár et al. [21] to efficiently utilize very large feature spaces using various strategies. In the future, we will try to transfer their work, together with parallel computing technology, to our system to reach a higher computing speed.

6. ACKNOWLEDGMENTS

The authors would like to acknowledge the great support of the Intelligent Systems Center.

7. REFERENCES

[1] N. Dalal and B. Triggs, "Histograms of Oriented Gradients for Human Detection," Proc. IEEE Conf. Computer Vision and Pattern Recognition, 2005.
[2] C. Wojek and B. Schiele, "A Performance Evaluation of Single and Multi-Feature People Detection," Proc. DAGM Symp. Pattern Recognition, 2008.
[3] X. Wang, T.X. Han, and S. Yan, "An HOG-LBP Human Detector with Partial Occlusion Handling," Proc. IEEE Int'l Conf. Computer Vision, 2009.
[4] S. Walk, N. Majer, K. Schindler, and B. Schiele, "New Features and Insights for Pedestrian Detection," Proc. IEEE Conf. Computer Vision and Pattern Recognition, 2010.
[5] C. Wojek and B. Schiele, "A Performance Evaluation of Single and Multi-Feature People Detection," Proc. DAGM Symp. Pattern Recognition, 2008.
[6] P. Dollár, Z. Tu, P. Perona, and S. Belongie, "Integral Channel Features," Proc. British Machine Vision Conf., 2009.
[7] P. Felzenszwalb, D. McAllester, and D. Ramanan, "A Discriminatively Trained, Multiscale, Deformable Part Model," Proc. IEEE Conf. Computer Vision and Pattern Recognition, 2008.
[8] C. Wojek and B. Schiele, "A Performance Evaluation of Single and Multi-Feature People Detection," Proc. DAGM Symp. Pattern Recognition, 2008.
[9] A. Ess, B. Leibe, and L. Van Gool, "Depth and Appearance for Mobile Scene Analysis," Proc. IEEE Int'l Conf. Computer Vision, 2007.
[10] C. Wojek, S. Walk, and B. Schiele, "Multi-Cue Onboard Pedestrian Detection," Proc. IEEE Conf. Computer Vision and Pattern Recognition, 2009.
[11] M. Enzweiler and D.M. Gavrila, "Monocular Pedestrian Detection: Survey and Experiments," IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 31, no. 12, pp. 2179-2195, Dec. 2009.
[12] C. Wojek and B. Schiele, "A Performance Evaluation of Single and Multi-Feature People Detection," Proc. DAGM Symp. Pattern Recognition, 2008.
[13] P. Dollár, C. Wojek, B. Schiele, and P. Perona, "Pedestrian Detection: A Benchmark," Proc. IEEE Conf. Computer Vision and Pattern Recognition, 2009.
[14] C. Papageorgiou and T. Poggio, "A Trainable System for Object Detection," Int'l J. Computer Vision, vol. 38, no. 1, pp. 15-33, 2000.
[15] S. Walk, K. Schindler, and B. Schiele, "Disparity Statistics for Pedestrian Detection: Combining Appearance, Motion and Stereo," Proc. European Conf. Computer Vision, 2010.
[16] P. Dollár, C. Wojek, B. Schiele, and P. Perona, "Pedestrian Detection: An Evaluation of the State of the Art," IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 34, no. 4, pp. 743-761, 2012.
[17] P.A. Viola, M.J. Jones, and D. Snow, "Detecting Pedestrians Using Patterns of Motion and Appearance," Int'l J. Computer Vision, vol. 63, no. 2, pp. 153-161, 2005.
[18] N. Dalal, B. Triggs, and C. Schmid, "Human Detection Using Oriented Histograms of Flow and Appearance," Proc. European Conf. Computer Vision, 2006.
[19] R. Benenson, M. Mathias, R. Timofte, and L. Van Gool, "Pedestrian Detection at 100 Frames per Second," Proc. IEEE Conf. Computer Vision and Pattern Recognition, 2012.
[20] T. Ojala, M. Pietikäinen, and D. Harwood, "A Comparative Study of Texture Measures with Classification Based on Feature Distributions," Pattern Recognition, vol. 29, no. 1, pp. 51-59, 1996.
[21] P. Dollár, Z. Tu, H. Tao, and S. Belongie, "Feature Mining for Image Classification," Proc. IEEE Conf. Computer Vision and Pattern Recognition, 2007.

Fig. 8. Test results of the HOG-LBP-HOM classifier.