Proceedings of the 7th Annual ISC Graduate Research Symposium
ISC-GRS 2013
April 24, 2013, Rolla, Missouri
Yungxiang Mao
Department of Computer Science
Missouri University of Science and Technology, Rolla, MO 65409
PEDESTRIAN DETECTION USING MULTIPLE FEATURES
ABSTRACT
Pedestrian detection has been a very active topic in computer vision. Due to variations in pedestrian pose, illumination conditions, viewpoints, and occlusion, robust pedestrian detection remains a challenge. To better distinguish pedestrians from the background, we need effective features to encode the Region of Interest (ROI) that may contain a pedestrian. To this end, we propose to combine three features to build our classifier: the Histogram of Oriented Gradients (HOG) feature describes shape information from a single image; the cell-structured Local Binary Pattern (LBP) provides texture information; and the Histogram of Oriented Motion (HOM) feature uses motion information to decide whether a given ROI contains a pedestrian. Experiments on our own airborne live video show that the proposed approach has the potential to distinguish between pedestrians and background.
1. INTRODUCTION
Several methods have been proposed to address the problem of detecting pedestrians in live video robustly and efficiently. Among these approaches there are mainly two philosophies: global descriptors [1-6] and part-based descriptors [7, 8]. Our method belongs to the global descriptor class.
The available ETH [9], TUD-Brussels [10], Daimler [11], INRIA [1], and Caltech [13] data sets appear sufficient for both training and testing; the Caltech data set alone contains 350,000 pedestrian bounding boxes labeled in 250,000 frames, as shown in Figure 1. However, we cannot adopt these data sets for either training or testing, for the following reasons: (1) pedestrians in these data sets are all side-view, while pedestrians in our airborne detection task display greatly varying shapes; (2) we need motion information for one of our three features, but the images in these data sets are all still images. Therefore, we build our own eagle-view data set.
As for the learning process, linear Support Vector Machines [14] and boosted classifiers [15] are very popular in current methods due to their good performance. As for other detection details, there is not much difference between methods [16]. Therefore, the most important factor affecting pedestrian detection performance is the choice of suitable features. A significant number of features have been explored in the past decade. Dalal and Triggs' HOG [1] brought a large gain in pedestrian detection performance. Inspired by HOG, Wang et al. [3] proposed a feature combining HOG and cell-structured LBP, together with a partial occlusion handling method, to improve overall detection performance. Motion is also a key cue for human perception. Viola et al. [17] successfully incorporated motion features into detectors, resulting in a large performance gain. Dalal et al. [18] built a motion model based on optical flow differences. Most current methods use a single-scale model to scan input images resized to different scales. Benenson et al. [19] proposed instead to use multiple scale-specific models to scan a single-scale input image, both to speed up detection and to eliminate the blurring effect of resizing.

Fig. 1. Example images cropped from four pedestrian detection data sets: (a) Caltech, (b) Caltech-Japan, (c) ETH, (d) our data set. Pedestrians in the existing data sets (a)-(c) are all side-view, while our airborne videos are eagle-view.
1.1. Contribution
Data Set Existing popular data sets are all collected from a side view. Obviously, this difference in viewpoint between existing data sets and our videos would harm our performance. To obtain optimal performance, we build our own data set using a self-built quadcopter with a GoPro camera mounted on it. We also collect motion data, which is not included in existing popular data sets, for training our motion feature.
Multiple Features A single feature can detect pedestrians with reasonable performance, but it cannot match the performance of multiple features combined. Inspired by previous work [3], [18], we propose a three-feature classifier. The performance gain is evident in our experiments.
This paper is organized as follows: we introduce our multiple features in Section 2; our own data set and the details involved in detection are introduced in Section 3; in Section 4, we discuss the performance of the multiple features compared with a single feature alone and with the HOG-LBP feature.
2. MULTIPLE FEATURES
We are dealing with pedestrian detection in airborne videos, which is more challenging. To achieve satisfactory performance, new features need to be created. In our proposed pedestrian detection procedure, we integrate HOG, cell-structured LBP, and the Histogram of Oriented Motion (HOM) to build a large feature space. The framework of our multiple-feature classifier is shown in Figure 2. Details on how to build each feature are given in the next three subsections.
Fig. 2. Procedure of the proposed method. HOG, cell-structured LBP, and HOM features are collected separately (gradient, LBP, and frame-difference computation; voting into bins in each cell; normalization over overlapping spatial blocks) and concatenated to obtain the final large-scale feature vector.

2.1. Histogram of Oriented Gradient
The Histogram of Oriented Gradients (HOG), an efficient descriptor of object gradients in images, has, together with the Support Vector Machine, been highly popular for classification over the last decade. Briefly, the HOG method computes the orientation and gradient magnitude of each pixel and votes the orientations into bins within each cell, with each vote weighted by the pixel's gradient magnitude. Each cell histogram is then normalized over the four neighboring blocks the cell belongs to. HOG effectively not only addresses the problem of normalizing each cell to remove the effect of illumination, but also spreads the influence of each cell to its neighboring cells.
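As a concrete illustration, the sketch below implements this per-cell orientation voting and block normalization in a minimal form. The 8x8-pixel cells, 2x2-cell blocks, and 9 orientation bins are the standard parameters of [1] and are assumptions here, since the paper does not list them explicitly; in practice a library routine such as skimage.feature.hog computes an equivalent descriptor.

```python
import numpy as np

def hog_feature(img, cell=8, bins=9):
    """Simplified HOG sketch: per-cell orientation histograms, weighted by
    gradient magnitude, then L2-normalized over 2x2-cell blocks."""
    img = img.astype(np.float32)
    gx = np.zeros_like(img); gy = np.zeros_like(img)
    gx[:, 1:-1] = img[:, 2:] - img[:, :-2]        # central differences
    gy[1:-1, :] = img[2:, :] - img[:-2, :]
    mag = np.hypot(gx, gy)
    ang = np.rad2deg(np.arctan2(gy, gx)) % 180    # unsigned orientation

    n_cy, n_cx = img.shape[0] // cell, img.shape[1] // cell
    hist = np.zeros((n_cy, n_cx, bins))
    for cy in range(n_cy):
        for cx in range(n_cx):
            m = mag[cy*cell:(cy+1)*cell, cx*cell:(cx+1)*cell]
            a = ang[cy*cell:(cy+1)*cell, cx*cell:(cx+1)*cell]
            b = (a / (180 / bins)).astype(int) % bins   # assign each pixel to a bin
            for i in range(bins):
                hist[cy, cx, i] = m[b == i].sum()       # magnitude-weighted vote

    # Normalize each 2x2 block of cells and concatenate into the final descriptor.
    blocks = []
    for cy in range(n_cy - 1):
        for cx in range(n_cx - 1):
            v = hist[cy:cy+2, cx:cx+2, :].ravel()
            blocks.append(v / (np.linalg.norm(v) + 1e-6))
    return np.concatenate(blocks)
```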
There are two ways to apply the sliding-window technique. The most common one is to resize the input frame and scan the resulting image pyramid with a single-scale model. In contrast, [19] suggests that using multi-scale classifiers for pedestrian detection not only avoids the blurring effect of resizing the image, but also speeds up the whole procedure. Therefore, in our proposed pedestrian detection system, we mainly follow their idea and build nine scale-specific classifiers to scan the input frame, as sketched below.
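The sketch below illustrates this single-image, multi-model scanning. The nine window heights, the 2:1 aspect ratio, and the extract_window_feature helper are assumptions made for illustration; the paper does not provide its scanning code.

```python
import numpy as np

# Assumed: nine window heights matching the training scales (28..168 px), one linear
# model (weights, bias) per height, and a feature extractor for a fixed-size window.
HEIGHTS = np.linspace(28, 168, 9).astype(int)

def scan_frame(frame, models, extract_window_feature, stride=8, thresh=0.0):
    """Slide nine fixed-size windows over the original image instead of resizing it."""
    H, W = frame.shape[:2]
    detections = []
    for h in HEIGHTS:
        w = int(h) // 2                             # fixed 2:1 height-to-width ratio
        weights, bias = models[int(h)]              # one linear classifier per scale
        for y in range(0, H - h + 1, stride):
            for x in range(0, W - w + 1, stride):
                feat = extract_window_feature(frame[y:y+h, x:x+w])
                score = float(np.dot(weights, feat) + bias)
                if score > thresh:
                    detections.append((x, y, w, h, score))
    return detections
```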
Fig. 3. Visualization of the HOG feature for a pedestrian and for the background.
2.2. Cell-structured Local Binary Pattern
While no single feature performs better than HOG, additional features provide complementary information that improves performance. LBP has been widely used in applications such as face recognition and has achieved good results. Its key advantages are invariance to monotonic gray-level changes and computational efficiency, which makes it suitable for applications such as pedestrian detection.
Inspired by HOG, Wang et al. [3] proposed to use the LBP operator as a descriptor for pedestrians, adding a cell-structured LBP feature as an augmented feature vector. It is known that the HOG feature performs poorly when there are noisy edges in the background; LBP can filter out such noise using the concept of uniform patterns [20]. By combining the characteristics of HOG and cell-structured LBP, a descriptor that captures both aspects of pedestrians performs better than a descriptor based on HOG alone.
Following the procedure used for the HOG feature, we first extract the LBP pixel-wise. We use LBP8,1 [3] to compute the LBP of each pixel: for a given pixel, we consider its eight neighbors within radius 1, writing a 1 for each neighbor whose value is greater than or equal to the central pixel and a 0 otherwise. The second step is to count the 0-1 and 1-0 transitions in the LBP. Since one LBP has 8 bits, the number of transitions varies from 0 to 7, i.e., 8 bins in total. The third step is to vote the LBP transition counts of the pixels within each cell into these 8 bins. Finally, each cell is normalized over the four blocks it belongs to.
Fig. 4. Example of LBP8,1 feature extraction.
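The sketch below follows the steps just described: LBP8,1 per pixel, transition counting into 8 bins per cell, and block normalization. Reusing the HOG cell and block layout is an assumption; the paper does not state the exact cell size used for LBP.

```python
import numpy as np

def lbp_transition_feature(img, cell=8):
    """Cell-structured LBP sketch: LBP_{8,1} per pixel, count 0-1/1-0 transitions,
    vote transition counts into 8 bins per cell, normalize over 2x2-cell blocks."""
    img = img.astype(np.float32)
    # 8 neighbours within radius 1, in a fixed circular order.
    offsets = [(-1,-1), (-1,0), (-1,1), (0,1), (1,1), (1,0), (1,-1), (0,-1)]
    H, W = img.shape
    trans = np.zeros((H, W), dtype=int)
    for y in range(1, H-1):
        for x in range(1, W-1):
            bits = [1 if img[y+dy, x+dx] >= img[y, x] else 0 for dy, dx in offsets]
            # number of 0-1 and 1-0 transitions along the 8-bit string (0..7)
            trans[y, x] = sum(bits[i] != bits[i+1] for i in range(7))

    n_cy, n_cx = H // cell, W // cell
    hist = np.zeros((n_cy, n_cx, 8))
    for cy in range(n_cy):
        for cx in range(n_cx):
            patch = trans[cy*cell:(cy+1)*cell, cx*cell:(cx+1)*cell]
            for t in range(8):
                hist[cy, cx, t] = np.count_nonzero(patch == t)

    # Normalize each 2x2 block of cells, as in the HOG pipeline.
    blocks = []
    for cy in range(n_cy - 1):
        for cx in range(n_cx - 1):
            v = hist[cy:cy+2, cx:cx+2, :].ravel()
            blocks.append(v / (np.linalg.norm(v) + 1e-6))
    return np.concatenate(blocks)
```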
2.3. Histogram of Oriented Motion
With the combination of HOG and cell-structured LBP, pedestrian detection achieves better performance. However, another notable feature that is highly useful for distinguishing pedestrians from the background is the motion feature.
One might argue that HOG already captures the boundary information of pedestrians, so there is no need to build HOM. However, we still choose to do so: admittedly, motion appears along the boundary of a pedestrian, but the boundaries of the background disappear in HOM while they still exist in HOG. Figure 5 shows this difference between HOG and the motion input.
Since the videos are collected by moving cameras mounted on quadcopters, we must first perform video stabilization to obtain the real motion of pedestrians. In this way, the background in consecutive frames appears as stable as it would with a stationary camera. The interval between the two frames used to estimate the homography should not be large, since a large interval increases background noise and makes it harder to distinguish pedestrians from the background.
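A minimal stabilization sketch is shown below. It uses OpenCV ORB features, a RANSAC homography, and warping of the previous frame onto the current one; the paper does not state which registration method it uses, so this is only one plausible choice.

```python
import cv2
import numpy as np

def stabilize_to_current(prev_gray, cur_gray):
    """Warp the previous frame onto the current one with a RANSAC homography,
    so that (ideally) only pedestrian motion survives the frame difference."""
    orb = cv2.ORB_create(1000)
    kp1, des1 = orb.detectAndCompute(prev_gray, None)
    kp2, des2 = orb.detectAndCompute(cur_gray, None)
    matcher = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)
    matches = sorted(matcher.match(des1, des2), key=lambda m: m.distance)[:300]

    src = np.float32([kp1[m.queryIdx].pt for m in matches]).reshape(-1, 1, 2)
    dst = np.float32([kp2[m.trainIdx].pt for m in matches]).reshape(-1, 1, 2)
    H, _ = cv2.findHomography(src, dst, cv2.RANSAC, 3.0)

    warped_prev = cv2.warpPerspective(prev_gray, H, cur_gray.shape[::-1])
    return cv2.absdiff(cur_gray, warped_prev)     # stabilized frame difference
```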
Also inspired by the HOG feature, the Histogram of Oriented Motion (HOM) differs slightly from the motion descriptors introduced in [18]. Instead of using optical flow, we directly use the frame difference as the input of HOM. Denote the frame difference by $I_c$ and its x- and y-derivatives by $I_{cx}$ and $I_{cy}$; we then follow the HOG procedure to build HOM. One thing to notice is that the motion of the background is much smaller than that of pedestrians. If we used the same normalization step as HOG, we would not make good use of this information to separate pedestrians from the background. Therefore, to preserve the difference between pedestrian motion and background motion, we apply a threshold to filter out low-magnitude motion before performing HOG-style normalization.
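Given the definitions above (frame difference $I_c$ and its derivatives $I_{cx}$, $I_{cy}$), a sketch of the HOM construction follows. The threshold value and the reuse of the HOG cell and block layout are illustrative assumptions.

```python
import numpy as np

def hom_feature(stab_diff, cell=8, bins=9, motion_thresh=10.0):
    """HOM sketch: gradient orientation histograms of the (stabilized) frame
    difference I_c, with low-magnitude motion suppressed before normalization."""
    I_c = stab_diff.astype(np.float32)
    I_cx = np.zeros_like(I_c); I_cy = np.zeros_like(I_c)
    I_cx[:, 1:-1] = I_c[:, 2:] - I_c[:, :-2]       # x-derivative of the difference
    I_cy[1:-1, :] = I_c[2:, :] - I_c[:-2, :]       # y-derivative of the difference
    mag = np.hypot(I_cx, I_cy)
    mag[mag < motion_thresh] = 0.0                 # filter weak background motion
    ang = np.rad2deg(np.arctan2(I_cy, I_cx)) % 180

    n_cy, n_cx = I_c.shape[0] // cell, I_c.shape[1] // cell
    hist = np.zeros((n_cy, n_cx, bins))
    for cy in range(n_cy):
        for cx in range(n_cx):
            m = mag[cy*cell:(cy+1)*cell, cx*cell:(cx+1)*cell]
            a = ang[cy*cell:(cy+1)*cell, cx*cell:(cx+1)*cell]
            b = (a / (180 / bins)).astype(int) % bins
            for i in range(bins):
                hist[cy, cx, i] = m[b == i].sum()

    blocks = [hist[cy:cy+2, cx:cx+2, :].ravel()
              for cy in range(n_cy - 1) for cx in range(n_cx - 1)]
    return np.concatenate([v / (np.linalg.norm(v) + 1e-6) for v in blocks])
```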
Fig. 5. Comparison of the HOG feature and the motion input. The first column shows cropped background and pedestrian training samples. The second column is the visualization of the HOG feature for background and pedestrian. The third column is the motion of background and pedestrian without threshold filtering. The last column is the motion of background and pedestrian after threshold filtering. To human perception, the HOGs of pedestrian and background do not differ much, but the motion feature does.
3. TRAINING DATA SET AND SOME DETAILS
We use our own quadcopter to collect videos for building our training data set. The data set consists of five videos totaling more than 10 minutes. Due to the high-frequency vibration of the quadcopter, videos taken by common cameras tend to be blurred, which decreases detection performance. To reduce this blurring, a GoPro camera is used, since it provides good video stabilization. When cropping pedestrians from the training data set, we fix the height-to-width ratio of the windows to 2:1. We divide the positive training data into nine scales, with sample heights ranging from 28 to 168 pixels.
Fig. 6. The distribution of the heights of the positive training samples.
For the cell size and the number of orientation bins in HOG, as well as the cost C in the SVM, we simply follow the original paper's suggestions. For the cell size and number of orientation bins in HOM, we use the same parameters as for HOG. Linear SVMs are popular for pedestrian detection due to their low computational cost; we use SVMlight to train our classifiers.
We start by building a classifier at each scale with more than 2386 positive training samples from 40 pedestrians, together with their flipped images, and 150 randomly selected negative training samples; their motion samples are collected at the same time. We then run the initial nine classifiers on our training videos and add the resulting false positives to the negative training data set. A crucial and often underestimated point in training is that, when the training data set is not large, one specific pedestrian with the same posture should not be added to the positive training set many times. If training samples of one pedestrian with similar postures are added too many times, the weight of that posture increases and the overall detection performance decreases. The same applies to the negative training data set. A sketch of this training loop is given below.
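The following is a schematic version of this per-scale training loop with one round of hard-negative bootstrapping. It substitutes scikit-learn's LinearSVC for SVMlight and assumes hypothetical helpers (extract_feature, scan_frame, overlaps_gt), so it is an outline of the procedure rather than the exact pipeline.

```python
import numpy as np
from sklearn.svm import LinearSVC

def train_scale_classifier(pos_feats, neg_feats, training_frames,
                           extract_feature, scan_frame, overlaps_gt, C=0.01):
    """Train a linear SVM for one window scale, then add false positives found
    on the training videos as extra negatives and retrain once (bootstrapping)."""
    X = np.vstack([pos_feats, neg_feats])
    y = np.hstack([np.ones(len(pos_feats)), np.zeros(len(neg_feats))])
    clf = LinearSVC(C=C).fit(X, y)

    hard_negs = []
    for frame, gt_boxes in training_frames:
        for box in scan_frame(frame, clf):          # detections of the initial model
            if not overlaps_gt(box, gt_boxes):      # false positive -> hard negative
                hard_negs.append(extract_feature(frame, box))

    if hard_negs:
        X = np.vstack([X, np.array(hard_negs)])
        y = np.hstack([y, np.zeros(len(hard_negs))])
        clf = LinearSVC(C=C).fit(X, y)
    return clf
```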
Motion information not only works as an effective feature, but also provides a strong cue for where to place the scanning window. However, due to the vibration and mobility of the quadcopter, image stabilization must be performed first to obtain motion information. We then enlarge the detection area around each motion point and perform uniform sampling to cut down the detection cost, as sketched below.
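One simple way to realize this motion-guided window placement is sketched below; the motion threshold, the margin around motion points, and the sampling stride are assumed values, not parameters reported in the paper.

```python
import numpy as np

def candidate_windows(stab_diff, win_h, win_w, motion_thresh=10.0,
                      margin=16, stride=8):
    """Propose windows only around pixels with significant stabilized motion,
    sampled on a uniform grid, instead of scanning the whole frame."""
    ys, xs = np.nonzero(stab_diff > motion_thresh)     # motion points
    H, W = stab_diff.shape
    keep = set()
    for y, x in zip(ys, xs):
        # Enlarge the area around each motion point, then sample on a coarse grid.
        y0, y1 = max(0, y - margin), min(H - win_h, y + margin)
        x0, x1 = max(0, x - margin), min(W - win_w, x + margin)
        for wy in range(y0 - y0 % stride, y1 + 1, stride):
            for wx in range(x0 - x0 % stride, x1 + 1, stride):
                keep.add((wy, wx))
    return sorted(keep)                                 # window top-left corners to score
```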
4. EVALUATION
Currently popular pedestrian detection data sets are not suitable for our test, both because of the different viewpoint, which would harm performance, and because they lack the motion information needed for training and testing. Our training and testing are therefore performed on our own data set.
In the evaluation, we test three kinds of features: HOG alone, the combination of HOG and LBP, and our HOG-LBP-HOM feature. For the evaluation we adopt two measures to compare these three features, recall and precision. Denote by $BB_{dt}$ a detected bounding box and by $BB_{gt}$ a ground-truth bounding box.
$$\text{recall} = \frac{\#\,\text{successfully detected } BB}{\#\,BB_{gt}} \quad (1)$$

$$\text{precision} = \frac{\#\,\text{successfully detected } BB}{\#\,\text{all detected } BB} \quad (2)$$

A pedestrian is considered detected when the following condition is satisfied:

$$a_0 = \frac{\mathrm{area}(BB_{dt} \cap BB_{gt})}{\mathrm{area}(BB_{dt} \cup BB_{gt})} \geq 0.5 \quad (3)$$
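For reference, the sketch below computes the overlap criterion of Eq. (3) and the recall and precision of Eqs. (1)-(2) from lists of detected and ground-truth boxes. The greedy one-to-one matching is an assumption, since the paper does not describe how multiple detections of one pedestrian are counted.

```python
def iou(a, b):
    """Intersection-over-union of boxes given as (x, y, w, h)."""
    ax, ay, aw, ah = a; bx, by, bw, bh = b
    ix = max(0, min(ax + aw, bx + bw) - max(ax, bx))
    iy = max(0, min(ay + ah, by + bh) - max(ay, by))
    inter = ix * iy
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0

def recall_precision(detected, ground_truth, thresh=0.5):
    """Greedy matching: each ground-truth box is matched to at most one detection
    with IoU >= 0.5 (Eq. 3); recall = matched / |GT|, precision = matched / |det|."""
    matched_gt, matched_det = set(), set()
    for i, d in enumerate(detected):
        for j, g in enumerate(ground_truth):
            if j not in matched_gt and iou(d, g) >= thresh:
                matched_gt.add(j); matched_det.add(i)
                break
    recall = len(matched_gt) / max(1, len(ground_truth))
    precision = len(matched_det) / max(1, len(detected))
    return recall, precision
```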
Because detection is time-consuming, we evaluate the three features on 102 pedestrians. These 102 pedestrians appear at different scales, and their viewpoints are not all the same. The results are shown in Table 1.

            HOG      HOG-LBP   HOG-LBP-HOM
recall      60.1%    72.3%     76.5%
precision   58.5%    65.5%     72.3%

Table 1. Comparison of the three features in terms of recall and precision.

As shown in Table 1, our multiple-feature classifier outperforms both HOG and HOG-LBP, with relative improvements of about 27% and 5% in recall and 23.5% and 10.4% in precision. Some of our test images are shown in Fig. 7; more test results of the multiple-feature classifier are shown in Fig. 8.

Fig. 7. Test results of the HOG, HOG-LBP, and HOG-LBP-HOM classifiers. (a), (b) are results of the HOG classifier; (c), (d) are results of the HOG-LBP classifier; (e), (f) are results of our HOG-LBP-HOM classifier.

Fig. 8. HOG-LBP-HOM classifier test results.
5. CONCLUSIONS
In this paper, we propose a three-feature classifier for pedestrian detection. By adding a carefully designed Histogram of Oriented Motion, our multi-feature detector performs better than both the HOG-alone and the HOG-LBP detectors. For airborne pedestrian detection, we also build our own data set for both training and testing; motion information is included in the data set so that the motion feature can be used.
One drawback of our work is that extracting three features and using an SVM for classification can be time-consuming. "Feature mining" has been proposed by Dollár et al. [21] to efficiently utilize very large feature spaces using various strategies. In the future, we will try to apply their work, together with parallel computing techniques, to reach higher computing speed.
6. ACKNOWLEDGMENTS
The authors would like to acknowledge the great support of the
Intelligent Systems Center.
7. REFERENCES
[1] N. Dalal and B. Triggs, “Histograms of Oriented
Gradients for Human Detection,” Proc. IEEE Conf.
Computer Vision and Pattern Recognition, 2005.
[2] C. Wojek and B. Schiele, "A Performance
Evaluation of Single and Multi-Feature People
Detection," Proc. DAGM Symp. Pattern Recognition,
2008.
[3] X. Wang, T.X. Han, and S. Yan, “An HOG-LBP
Human Detector with Partial Occlusion Handling,”
Proc. IEEE Int’l Conf. Computer Vision, 2009.
[4] S. Walk, N. Majer, K. Schindler, and B. Schiele, “New
Features and Insights for Pedestrian Detection,” Proc.
IEEE Conf. Computer Vision and Pattern Recognition,
2010.
[5] C. Wojek and B. Schiele, “A Performance Evaluation
of Single and Multi-Feature People Detection,” Proc.
DAGM Symp. Pattern Recognition, 2008.
[6] P. Dollár, Z. Tu, P. Perona, and S. Belongie, "Integral
Channel Features,” Proc. British Machine Vision
Conf., 2009.
[7] P. Felzenszwalb, D. McAllester, and D. Ramanan, “A
Discriminatively Trained, Multiscale, Deformable Part
Model,” Proc. IEEE Conf. Computer Vision and
Pattern Recognition, 2008.
[8] C. Wojek and B. Schiele, "A Performance Evaluation of Single and Multi-Feature People Detection," Proc. DAGM Symp. Pattern Recognition, 2008.
[9] A. Ess, B. Leibe, and L. Van Gool, "Depth and Appearance for Mobile Scene Analysis," Proc. IEEE Int'l Conf. Computer Vision, 2007.
[10] C. Wojek, S. Walk, and B. Schiele, "Multi-Cue Onboard Pedestrian Detection," Proc. IEEE Conf. Computer Vision and Pattern Recognition, 2009.
[11] M. Enzweiler and D.M. Gavrila, "Monocular Pedestrian Detection: Survey and Experiments," IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 31, no. 12, pp. 2179-2195, Dec. 2009.
[12] C. Wojek and B. Schiele, "A Performance Evaluation of Single and Multi-Feature People Detection," Proc. DAGM Symp. Pattern Recognition, 2008.
[13] P. Dollár, C. Wojek, B. Schiele, and P. Perona, "Pedestrian Detection: A Benchmark," Proc. IEEE Conf. Computer Vision and Pattern Recognition, 2009.
[14] C. Papageorgiou and T. Poggio, "A Trainable System for Object Detection," Int'l J. Computer Vision, vol. 38, no. 1, pp. 15-33, 2000.
[15] S. Walk, K. Schindler, and B. Schiele, "Disparity Statistics for Pedestrian Detection: Combining Appearance, Motion and Stereo," Proc. European Conf. Computer Vision, 2010.
[16] P. Dollár, C. Wojek, B. Schiele, and P. Perona, "Pedestrian Detection: An Evaluation of the State of the Art," IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 34, no. 4, pp. 743-761, 2012.
[17] P.A. Viola, M.J. Jones, and D. Snow, "Detecting Pedestrians Using Patterns of Motion and Appearance," Int'l J. Computer Vision, vol. 63, no. 2, pp. 153-161, 2005.
[18] N. Dalal, B. Triggs, and C. Schmid, "Human Detection Using Oriented Histograms of Flow and Appearance," Proc. European Conf. Computer Vision, 2006.
[19] R. Benenson, M. Mathias, R. Timofte, and L. Van Gool, "Pedestrian Detection at 100 Frames per Second," Proc. IEEE Conf. Computer Vision and Pattern Recognition, 2012.
[20] T. Ojala, M. Pietikäinen, and D. Harwood, "A Comparative Study of Texture Measures with Classification Based on Feature Distributions," Pattern Recognition, vol. 29, no. 1, pp. 51-59, 1996.
[21] P. Dollár, Z. Tu, H. Tao, and S. Belongie, "Feature Mining for Image Classification," Proc. IEEE Conf. Computer Vision and Pattern Recognition, 2007.