A Machine Learning Approach to Object Recognition in the Context of
Visual Road Scene Analysis from a Moving Vehicle
Jivko Sinapov
Dept. of Computer Science
Iowa State University
I. Introduction
This project addresses an object recognition task related to visual road scene
analysis in the context of a moving vehicle. The goal is to implement and evaluate a
robust object recognition technique for detection of various objects of interest. Object
recognition tasks such as detecting other cars and traffic signs are very important when
designing driving assistive systems or autonomous driving agents. In this project we
implement and evaluate an object detection scheme utilizing a cascade of Haar feature
classifiers, as well as a boosting technique utilizing SVM.
II. Background and Motivation
The success of several of the challengers in last year’s DARPA Grand Challenge
shows that computer vision can be used effectively in solving many of the problems
associated with autonomous driving. Many problems remain unsolved, however. For
example, the computers that processed the data in the autonomous vehicles participating
in the challenge are far superior to the average PC and it is unlikely that people at large
would be able to outfit their cars with such systems. In the coming years we are likely to
see various driving assistive technologies appear on the market and there is currently a
large overlap between the set of problems associated with autonomous driving and that of
problems associated with driving assistive technology. A large fraction of
accidents occurs because the driver is not paying attention to the road and cars in front of
their vehicle. For example, a driver not paying attention can easily veer off course and
enter an undesired lane, or fail to stop at a traffic light. As such, real-time traffic light and
vehicle detection are very appropriate problems to tackle since any driving assistive or
autonomous driving system would have to be able to perform those tasks.
The primary goal for this project is to provide appropriate solutions for these
problems which can work in real time on a regular PC. In particular, we’ll take a look at
the problems of detecting traffic lights and other cars in the field of view. It is
conceivable that in the near future cars would come equipped with systems which
monitor the road, as well as the driver in order to determine if he or she is not paying
attention to the road. In such cases, the system must be able to detect situations which
demand the driver’s immediate attention – for example, if the car is approaching a red
light at high speed or if the car in front is suddenly slowing down. In order for such a
system to work, it will need to be able to accurately detect the objects of interest and a
machine learning approach is likely to provide such a solution.
III. Object Recognition Using a Haar Cascade Classifier
In the task of object recognition, we implement an approach which classifies
objects based on an extended set of Haar features. This approach was originally
proposed by Viola and Jones [1] and extended by Lienhart [2].
The detection scheme uses the values of Haar-like features in an image in order to
classify an object as a positive or negative instance. A subset of simple features used in
this model is shown in Figure 1.
Figure 1: Some simple examples of Haar-based features
Each of these features consists of a geometric representation of two regions –
black and white. The value of each feature at a given position in the image is the
difference between the sums of the pixels within the two regions. Haar features can take
arbitrarily complex shapes and the size of the full set available in this model is in the
order of tens of thousands. In order to compute the value of each feature at a given
location of the image, the image is represented in an integral form: the value at position
(x, y) in the integral image will contain the sum of pixels that are above y and to the left
of x. The general formula for the integral image representation is the following:
$$ii(x, y) = \sum_{x' \le x,\; y' \le y} I(x', y')$$
The integral image representation is chosen for several reasons. First, it allows for
efficient computation of a given Haar feature at a given position of the test image. In
addition, it allows for robust object detection regardless of global lighting conditions,
since the Haar features take into account only the differences of sums of pixels, which are
invariant in terms of the global intensity of the image. Last but not least, the integral
image representation allows for the object detection algorithm to scan for objects at
different scales very efficiently since scaling the integral image can be done much faster
than scaling the RGB image [1]. This is a very desirable property since real-time usability
is a major goal for this object recognition system.
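The integral image and a feature lookup can be sketched in a few lines of plain Python. This is a minimal illustration rather than the OpenCV implementation: the helper names (`integral_image`, `rect_sum`, `two_rect_feature`) and the choice of a left-minus-right two-rectangle feature are assumptions made for the example. It shows that any rectangle sum needs only four lookups, and that a feature built from two equal-area regions is unchanged when a constant is added to every pixel.

```python
def integral_image(img):
    """Return ii with ii[y][x] = sum of img[y'][x'] over all y' <= y, x' <= x."""
    h, w = len(img), len(img[0])
    ii = [[0] * w for _ in range(h)]
    for y in range(h):
        row_sum = 0
        for x in range(w):
            row_sum += img[y][x]
            ii[y][x] = row_sum + (ii[y - 1][x] if y > 0 else 0)
    return ii


def rect_sum(ii, x, y, w, h):
    """Sum of the pixels in the rectangle with top-left corner (x, y)."""
    def at(xx, yy):
        # Positions outside the image (index -1) contribute zero.
        return ii[yy][xx] if xx >= 0 and yy >= 0 else 0

    x1, y1 = x + w - 1, y + h - 1
    return at(x1, y1) - at(x - 1, y1) - at(x1, y - 1) + at(x - 1, y - 1)


def two_rect_feature(ii, x, y, w, h):
    """Illustrative two-rectangle feature: left half minus right half (w even)."""
    half = w // 2
    return rect_sum(ii, x, y, half, h) - rect_sum(ii, x + half, y, half, h)


img = [
    [1, 2, 3, 4],
    [5, 6, 7, 8],
    [9, 10, 11, 12],
]
ii = integral_image(img)
feature = two_rect_feature(ii, 0, 0, 4, 3)

# Adding a constant brightness offset leaves the feature value unchanged.
brighter = [[v + 10 for v in row] for row in img]
feature_brighter = two_rect_feature(integral_image(brighter), 0, 0, 4, 3)
```

Because the cost of `rect_sum` does not depend on the rectangle size, evaluating the same feature at a larger scale costs the same four lookups.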
The classifier is built in stages – at each stage, an AdaBoost-like approach is
applied to selecting one or more Haar-features, as well as determining appropriate
thresholds which can be applied to reject a large number of negative training instances.
Important input parameters for the training procedure are the minimum hit ratio and
maximum false alarm rate – the search for optimal feature and threshold selection will
continue until those two requirements are met, at which point the remaining training
examples will be passed on to the next stage. For example, if those parameters are set to
0.995 and 0.5 respectively, at each stage, feature selection and threshold optimization will
be applied until the resulting stage is
capable of classifying 99.5% of the
positive instances as positive and does not
classify more than 50% of the negative
images as positive. For more precise
details regarding feature selection and
training, consult Viola and Jones [1].
In the extended model of the classifier implemented in the OpenCV C++ library, each
stage of the classifier can make use of more than one feature in order to meet the
requirements set by the input parameters, in which case each stage can be viewed as a
decision tree, rather than a decision stump [3]. It is also important to note that at each
stage, the classifier uses a different set of negative training images which are sampled
from a given database of images that do not contain the specified object. After training
the desired number of stages, the result is a cascade of tree-like classifiers, as shown in
Figure 2.
Figure 2: Schematic description of the classifier. At each stage, the classifier either
rejects the instance (represented as a sub-window from a given test image) based on a
given feature value or sends the instance further down the tree for more processing. At
the initial stages a large number of negative examples are eliminated [1].
The structure of the resulting classifier is essentially that of a degenerate decision
tree or a decision list. Each added stage to the classifier tends to reduce the false positive
rate, but also reduces the detection rate [3]. As such, it is essential to train the classifier
with the appropriate number of stages for the given task.
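The effect of the number of stages can be made concrete with a little arithmetic. The sketch below assumes, as in the analysis of Viola and Jones [1], that the per-stage rates multiply across stages, so the per-stage training parameters give a lower bound on the overall hit rate and an upper bound on the overall false alarm rate. The function name `cascade_rates` is hypothetical.

```python
def cascade_rates(min_hit_rate, max_false_alarm, n_stages):
    """Bounds on overall rates, assuming per-stage rates multiply across stages."""
    overall_hit = min_hit_rate ** n_stages        # lower bound on detection rate
    overall_fa = max_false_alarm ** n_stages      # upper bound on false alarm rate
    return overall_hit, overall_fa


for n in (5, 7, 10):
    hit, fa = cascade_rates(0.995, 0.5, n)
    print(f"{n:2d} stages: hit rate >= {hit:.4f}, false alarm rate <= {fa:.6f}")
```

Under this assumption the false alarm bound shrinks geometrically with each stage while the hit-rate bound erodes slowly, which is the tradeoff described above.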
Once a classifier is trained, detection is done by sliding a window across an input
image and passing the cropped sub-image through the classifier. In order for
classification to be size-invariant, the same procedure is also performed on the input
integral image at different scales. Given this scheme, the output of classification is a
series of sub-windows of the test image which contain the desired object. In the following
two sections we outline how this model was applied to the problems of traffic light and
car detection.
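The detection loop can be sketched as follows. This is a simplified illustration, not the OpenCV routine: the stages are arbitrary callables standing in for trained Haar-feature stages, and the window scale step (1.25) and stride (10% of the window size) are assumed values. The sketch grows the window rather than rescaling the image, and it shows how early rejection keeps later stages from running on most windows.

```python
def run_cascade(stages, window):
    """Accept a window only if every stage accepts it.

    all() short-circuits, so later stages never run for windows that an
    earlier stage has already rejected.
    """
    return all(stage(window) for stage in stages)


def sliding_windows(img_w, img_h, base_size, scale_step=1.25, stride_frac=0.1):
    """Yield square windows (x, y, size) at increasing scales."""
    size = base_size
    while size <= min(img_w, img_h):
        stride = max(1, int(size * stride_frac))
        for y in range(0, img_h - size + 1, stride):
            for x in range(0, img_w - size + 1, stride):
                yield (x, y, size)
        size = max(size + 1, int(size * scale_step))


def detect(stages, img_w, img_h, base_size=20):
    """Return every window that survives all stages of the cascade."""
    return [w for w in sliding_windows(img_w, img_h, base_size)
            if run_cascade(stages, w)]


# Toy stages standing in for trained Haar-feature stages (purely illustrative).
calls = {"stage2": 0}


def stage1(window):
    return window[2] == 20            # cheap test that rejects most windows


def stage2(window):
    calls["stage2"] += 1              # count how often the later stage runs
    return window[0] == 0 and window[1] == 0


hits = detect([stage1, stage2], img_w=40, img_h=40)
```

In a real detector each stage would evaluate its selected Haar features on the integral image at the window's position and scale.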
IV. Traffic Light Detection
The problem of traffic light detection is important in the area of driving assistive
technology. A system which is tasked with preventing accidents when a driver is not
paying attention must always know whether there is a traffic light in the scene and what
its state is.
To solve this task, a Haar cascade
classifier for traffic lights was trained. The
dataset used in these experiments consists of
real time video taken from a camcorder
positioned inside a passenger car. The
camera resolution is low (320 by 240) and so
is the image quality, thus adding another
challenge to this problem.
Figure 3: Positive examples of traffic lights used for training. Negative samples are
randomly selected from an image collection that does not include traffic lights.
The classifier was trained with 5 stages on 120 positive examples, and 120 negative
examples. The minimum hit ratio at each stage is set to 0.95 and the maximum false
alarm rate is set to 30%.
To improve results and decrease computation time, the area of the image being
searched through is restricted to the portion where a traffic light could actually occur –
there is no point in looking for that object on the actual road, for example. Once a traffic
light is detected in the input stream, the image is analyzed to determine its state. Ideally,
we would want to identify the area of the traffic light which contains the actual color
signal. In our case, however, the resolution was so low that the light itself usually
covers only about 5 or 6 pixels, which makes it quite difficult to analyze. Nevertheless,
a simple scheme for determining the color of the light is implemented, which works as
follows:
Input: cropped image of detected traffic light
Steps: 1. G = sum of the green components of the cropped image
       2. R = sum of the red components of the cropped image
       3. If (G/R) < t1, then output RED
       4. If (G/R) > t2, then output GREEN
       5. Else, output YELLOW
The thresholds t1 and t2 were automatically optimized based on a training set of
traffic light images. In practice this scheme worked well in determining the color of a
given light, although if we had better resolution, it is conceivable that a much better and
more robust algorithm could be devised.
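The color scheme above translates directly into code. In the sketch below the threshold values for `t1` and `t2` are placeholders chosen for illustration, since the paper reports only that they were optimized on a training set; the function name `classify_light` is also hypothetical. The comparison is written as `G < t1 * R` instead of `G / R < t1`, which is equivalent for R > 0 and avoids a division by zero on a fully dark crop.

```python
def classify_light(pixels, t1=0.8, t2=1.2):
    """Classify a cropped traffic light from its (r, g, b) pixels.

    t1 and t2 are placeholder thresholds (assumed values, with t1 < t2).
    """
    pixels = list(pixels)
    g = sum(p[1] for p in pixels)     # G: sum of green components
    r = sum(p[0] for p in pixels)     # R: sum of red components
    if g < t1 * r:                    # same as G/R < t1 when R > 0
        return "RED"
    if g > t2 * r:                    # same as G/R > t2 when R > 0
        return "GREEN"
    return "YELLOW"


red_crop = [(200, 50, 30)] * 4
green_crop = [(40, 220, 60)] * 4
yellow_crop = [(200, 200, 50)] * 4
```

Summing over the whole crop rather than locating the lit lamp matches the low-resolution setting described above, where the lamp covers only a handful of pixels.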
The classifier was evaluated on about 20 minutes of continuous input stream
recorded while driving in Ames, IA. The detection scheme works quite comfortably in
real time due to the small size of the trained classifier and small area of the image that is
being processed. In all the occasions on which a traffic light was passed, the classifier is
able to detect it and almost always outputs the correct color, so long as it is green or red.
The low-quality of the video input makes it difficult to recognize yellow since the pixels
of the light signal actually assume white color in such cases. The good detection
performance is likely due to the fact that the traffic light shape is very distinct and there are
almost no other objects present in the portion of the image that is being searched. One
obvious drawback is that only traffic lights of this particular shape can be detected –
while most traffic lights in Ames follow this standard, the same might not be true for
other cities. Figure 4 shows some example results. At the end of this paper there is a
discussion about some available online demos of this system that demonstrate how it
works in practice.
Figure 4: Example results from running the traffic light detection procedure
V. Car Detection
In this particular problem, we are interested in detecting vehicles in front of the
observer. A series of Haar cascade classifiers are trained and evaluated on two different
datasets.
The first dataset, as in the previous problem, consists of low-quality video taken
while driving in Ames and the surrounding areas. The low quality, however, makes it
difficult to detect objects further in the distance and as such a second dataset of good
quality images was used in order to evaluate the detection scheme in more detail.
5.1. Car Detection in low-quality and low-resolution video stream
As in the previous section, the experiments are performed on a dataset consisting
of video recorded from a camera installed in a passenger car while driving in Ames. A
classifier with 10 stages is trained on 200 sample images of cars taken from half of the
available video. The training parameters minimum hit ratio and maximum false
alarm rate are set to 0.995 and 0.3 respectively. The resulting classifier is tested on about
20 minutes of video recorded while driving on the freeway.
Once again, since the position of the
road relative to the observer is known in this
context, we are able to restrict the image
area in which a car is hypothesized to be.
Restricting the region of interest allows for
greater speed of computation and for
elimination of false positives which could
not possibly be actual cars due to their
location.
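Restricting the search to a region of interest amounts to cropping before detection and shifting the results back afterwards. The sketch below is a minimal illustration with a frame stored as a list of pixel rows; the region coordinates are assumed values for a 320 by 240 frame, not the ones used in the experiments, and the helper names are hypothetical.

```python
def crop(frame, roi):
    """Return the part of the frame inside roi = (x, y, width, height)."""
    x, y, w, h = roi
    return [row[x:x + w] for row in frame[y:y + h]]


def to_frame_coords(box, roi):
    """Map a box (x, y, width, height) found inside the crop back to the frame."""
    bx, by, bw, bh = box
    return (bx + roi[0], by + roi[1], bw, bh)


frame = [[0] * 320 for _ in range(240)]   # blank 320 by 240 frame
roi = (40, 100, 240, 100)                 # assumed road region, for illustration
search_area = crop(frame, roi)
detection = to_frame_coords((10, 5, 30, 30), roi)
```

Only `search_area` is scanned by the classifier, so the number of windows evaluated drops in proportion to the area removed.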
Once the region of interest is identified, it is scanned by a window at different scales,
and any sub-windows which are marked as positive by the Haar cascade classifier are
deemed to be detected cars.
Figure 7: Identifying region of interest, and performing detection with trained Haar
cascade classifier
Restricting the search area helps eliminate almost all false positives. A passing car
was always detected as such, although once it gets far enough ahead, the detection
scheme fails due to the small size of the object and low image quality. Even though large
semi-trucks were not part of the training set, they generally tended to be recognized as
cars by the classifier, if close enough. The demos available online can give an accurate
illustration of how well this detection and classification scheme works. Some sample
screenshots are included in Appendix I. Overall, with a large data set and good quality
video stream, such a system could be fairly robust although it will never be absolutely
perfect and hence an autonomous driving agent would need a much smarter framework in
order to detect vehicles on the road. In the next section, we evaluate this object
recognition and detection scheme much more precisely with mid- to good- quality input
data.
5.2. Car Detection in mid- to good-quality images
The dataset used in the following experiments consists of 526 images taken from the
driver's seat of a vehicle, each of which contains at least one car in front of the observer.
The images are not sequential frames from a video
feed. Sample images from this dataset are shown
in Figure 5. The dataset was split into 2/3 training
and 1/3 test sets. Overall, 300 sample images of
cars were extracted which were used for training
each classifier.
Knowing that the detection rate can decrease as the number of stages in a classifier
increases, our task is to determine the optimal number of stages for this problem. The
training parameters minimum hit ratio and maximum false alarm rate are set to 0.995
and 0.3 respectively for all trained classifiers. Classifiers with the number of stages
ranging from 5 to 10 are then trained and evaluated on the test set.
Figure 5: Samples from a car image
database.
Evaluation is performed by running the detection scheme on the test set and
taking note of the type of results that are outputted at each frame. Each output result falls
within one of three categories: positive, negative, or partial. Positive results are those that
contain a car in a well-defined box. Negative results are such outputs that do not contain
any major distinguishable portion of a car. Partial results contain everything in between:
if the result contains only a major portion of the car, or if it contains a car but also a large
amount of background, then it is labeled as partial. Figure 6 shows examples of each type
of output.
Figure 6: Examples of a positive (a), negative (b) and a partial (c) detected object.
Each trained classifier was tasked with detecting cars in the test set and the
resulting outputs were saved and manually labeled as positive, negative or partial. Figure
7 shows the results of each run. As we can see from the chart, the 7-stage classifier
detects the highest number of cars in the test set, while the 10-stage classifier has the
lowest false alarm rate, as expected.
[Bar chart: Performance of Haar-cascade classifiers with varied number of stages.
Horizontal axis: number of stages of classifier (n = 5 to n = 10). Vertical axis: # detected
(0 to 350). Series: positive, partial, negative.]
Figure 7: Summary of classifiers’ performance.
The results of these experiments illustrate the tradeoff between the hit rate and the
false alarm rate of each classifier. Ideally, we want to detect as many actual appearances
of the target object in the input stream without reporting too many false positives. In
both driving assistive technology and autonomous driving applications, a false positive
error is not nearly as bad as a complete miss of an actual object of interest.
Next, we explore an approach to boost the classifier in order to minimize the
false alarm rate while maintaining a good hit ratio. One such approach would be to
reinsert samples of false positive outputs into the training set and further train the Haar
cascade classifier. Retraining the classifier, however, is a highly time-consuming process
when compared to other machine learning techniques. A 10-stage Haar cascade classifier,
for example, can take up to one hour to train on an average PC, even when faced with only a
small dataset of 300 positive and 300 negative samples. If a real-time system is being told
by the user that some of its findings are false positives, it would not have the luxury of
time to adapt to those results.
Our approach to improving performance in real time utilizes an SVM which is
trained on labeled detected outputs resulting from running the Haar cascade classifier
detection scheme. This technique proves time-efficient and it improves performance. A
good question at this point is why not use SVM from the very beginning? An SVM
approach would likely yield better results than Haar cascade classification. However, we
note that it is difficult to efficiently search through an RGB image at different scales for
potential candidates. If the SVM makes use of global and local features instead of the raw
pixel values, then there would be even extra computational overhead (in addition to
scaling the image) when sliding a window and looking for a match. Real-time usability is
a requirement for any driving assistive technology or autonomous driving system. We
also have to note that object detection and recognition is only a small portion of such a
system and as such, we need an efficient algorithm which saves computational resources
for other tasks such as object tracking and decision making.
We perform experiments to initially validate whether an SVM can be used to
distinguish between positive and negative results of the Haar cascade classifier and
determine what image representation is best to use. The set of 232 positive and 194
negative output samples from the 5-stage classifier is used as a dataset in this experiment.
Each sample is scaled to size 15 by 15, converted to a gray-scale image, and undergoes
histogram equalization. The equalized gray-scale image is used as the raw input, with
each attribute corresponding to a particular pixel whose value of 0 to 255 is scaled to a
real value between 0 and 1. We also perform an experiment to see whether it is better to
use the edges in the image as a representation of the instances, rather than the gray-scale
image itself. Overall, there are 225 attributes per instance (1 for each pixel), regardless of
which representation we use. Figure 8 illustrates the way instances are preprocessed for
input into the SVM algorithm.
Figure 8: Samples’ preprocessing: (a) gray scale, (b) histogram equalization, (c) Canny
edge detection.
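The gray-scale preprocessing path can be sketched in plain Python as below. This is an illustrative sketch under stated assumptions, not the code used in the experiments: nearest-neighbor resizing, the BT.601 luma weights for gray-scale conversion, and the standard cumulative-histogram form of equalization are all choices made for the example, and the function names are hypothetical. The Canny edge representation is not covered.

```python
def resize_nearest(img, size=15):
    """Nearest-neighbor resize of a list-of-rows image to size by size."""
    h, w = len(img), len(img[0])
    return [[img[y * h // size][x * w // size] for x in range(size)]
            for y in range(size)]


def to_gray(rgb):
    """Gray-scale conversion using BT.601 luma weights (an assumed choice)."""
    return [[int(round(0.299 * r + 0.587 * g + 0.114 * b)) for (r, g, b) in row]
            for row in rgb]


def equalize(gray):
    """Histogram equalization for 8-bit gray values."""
    flat = [v for row in gray for v in row]
    hist = [0] * 256
    for v in flat:
        hist[v] += 1
    cdf, total = [], 0
    for count in hist:
        total += count
        cdf.append(total)
    cdf_min = next(c for c in cdf if c > 0)
    n = len(flat)
    if n == cdf_min:                  # constant image: nothing to equalize
        return [row[:] for row in gray]
    return [[round((cdf[v] - cdf_min) * 255 / (n - cdf_min)) for v in row]
            for row in gray]


def to_attributes(rgb, size=15):
    """Scale, convert to gray, equalize, and flatten to values in [0, 1]."""
    eq = equalize(to_gray(resize_nearest(rgb, size)))
    return [v / 255.0 for row in eq for v in row]


# A synthetic 30 by 30 horizontal gradient stands in for a detected sample.
sample = [[(x * 8, x * 8, x * 8) for x in range(30)] for _ in range(30)]
attributes = to_attributes(sample)
```

The resulting list has one attribute per pixel and can be passed to any SVM implementation as a feature vector.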
The experiment suggests that using the equalized gray-scale representation yields
better classification results. Using 5-fold cross-validation with a polynomial kernel SVM,
we can achieve 93.8% accuracy which is illustrated in the following confusion matrix:
                        predicted
                    positive  negative
actual   positive      237        5
         negative       16      178
Using the detected edges representation of the samples, on the other hand, yielded
accuracy of only 83%. Experiments were also performed to determine the optimal scale
of the samples and the results show that increasing the image dimensions beyond 15 by
15 does not produce a significant increase in accuracy, but as expected, slows down
training and testing due to the quadratic increase of the number of attributes.
Following this result, we attempt to boost the 7-stage classifier by training and
evaluating an SVM on the dataset comprised of the Haar cascade classifier’s output
results. The dataset contains 512 instances, of which 250 are positive, 221 are partial, and
41 are negative. We conduct a 5-fold cross-validation experiment with a multi-class SVM
with polynomial kernel of 4th degree and the result is the following confusion matrix:
                             predicted
                    positive  partial  negative
actual   positive      240       10        0
         partial        26      192        3
         negative        3       14       24
All positive instances get classified as either positive or partial, while only a small
fraction of negative and partial instances gets classified as positive. No positive instance
is classified as negative, which is a highly desirable property in the applications discussed
previously. The results are promising and show that boosting a Haar cascade classifier
with an SVM can increase performance. The boosted 7-stage Haar cascade classifier is
clearly superior to the 10-stage classifier in terms of quality of results.
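Summary figures for the three-class experiment can be derived from the confusion matrix. The sketch below reads the matrix with rows as actual classes and columns as predicted classes, an arrangement under which each row sums to the class totals stated above (250 positive, 221 partial, 41 negative). The metric helpers are generic definitions rather than anything taken from the original experiments.

```python
# CONFUSION[actual][predicted], rows summing to 250, 221 and 41.
CONFUSION = {
    "positive": {"positive": 240, "partial": 10, "negative": 0},
    "partial": {"positive": 26, "partial": 192, "negative": 3},
    "negative": {"positive": 3, "partial": 14, "negative": 24},
}


def accuracy(cm):
    """Fraction of instances on the diagonal."""
    correct = sum(cm[label][label] for label in cm)
    total = sum(sum(row.values()) for row in cm.values())
    return correct / total


def recall(cm, label):
    """Fraction of actual `label` instances predicted as `label`."""
    return cm[label][label] / sum(cm[label].values())


def precision(cm, label):
    """Fraction of instances predicted as `label` that are actually `label`."""
    return cm[label][label] / sum(row[label] for row in cm.values())


print(f"accuracy: {accuracy(CONFUSION):.3f}")
for label in CONFUSION:
    print(f"{label}: recall {recall(CONFUSION, label):.3f}, "
          f"precision {precision(CONFUSION, label):.3f}")
```

The zero in the positive row and negative column corresponds to the observation that no positive instance is classified as negative.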
VI. Discussion
We have shown that a machine learning approach utilizing a Haar cascade
classifier and an SVM can be an efficient and accurate method for performing object
detection in a real-time input video stream. While the detection rate achieved is not high
enough for an autonomous driving agent, the proposed scheme could be utilized within a
driving assistive technology system. For demos of the currently developed framework,
visit:
http://www.cs.iastate.edu/~jsinapov/Vision/
The question of whether boosting a Haar cascade classifier with an SVM is more
efficient than using SVM for detection itself still remains to be answered. Intuitively,
searching for an object within the image would be faster if using a Haar cascade classifier
for recognition, but this hypothesis is yet to be validated. Object recognition with SVM
and local features (such as SIFT features, for example) has been shown to have very high
performance, but it is still a question of whether localizing the target object in an input
image can be done efficiently if the features used for recognition are not easy to compute
[4].
Ultimately, the goal is to design a system which can efficiently detect an object in
the input video stream, as well as efficiently update its model of the target to be detected.
SVM is the most likely candidate to achieve this task, as long as we can implement an
efficient search routine through the input image. From this standpoint, we can view the
Haar cascade object detection scheme as a search technique which identifies the areas
most likely to contain the target we are looking for. Once those candidates are localized,
they can be passed on to a stronger classifier which would not only produce better results,
but also be able to adapt its model based on user feedback. An alternative approach
would be to still use AdaBoost for feature selection during training but utilize SVM
directly instead of constructing a cascade of decision trees. This has the potential to
combine the efficiency of Haar cascade classifier detection scheme with the classification
power and robustness of the SVM.
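The two-stage view described above, with the cascade acting as a fast candidate search and a stronger classifier making the final decision, can be sketched as a small pipeline. Both `propose` and `verify` are hypothetical callables standing in for the Haar cascade detector and the trained SVM, and keeping candidates labeled partial as well as positive is an assumption made for the example.

```python
def detect_and_verify(frame, propose, verify, keep=("positive", "partial")):
    """Run a cheap proposer, then keep candidates the verifier accepts.

    propose(frame) returns candidate boxes; verify(frame, box) returns a label.
    """
    results = []
    for box in propose(frame):
        label = verify(frame, box)
        if label in keep:
            results.append((box, label))
    return results


# Stub components, purely for illustration.
def fake_cascade(frame):
    return [(10, 20, 30, 30), (100, 40, 24, 24), (200, 60, 28, 28)]


def fake_svm(frame, box):
    labels = {10: "positive", 100: "negative", 200: "partial"}
    return labels[box[0]]


kept = detect_and_verify(None, fake_cascade, fake_svm)
```

Because the verifier sees only the few windows the cascade lets through, it can afford a more expensive representation than a classifier that must scan every window.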
References
[1] Viola, P., and Jones, M., “Rapid Object Detection using a Boosted Cascade of
Simple Features,” IEEE CVPR, 2001.
[2] Lienhart, R., and Maydt, J., “An Extended Set of Haar-like Features for Rapid
Object Detection,” Submitted to ICIP2002.
[3] Bradski, G., Kaehler, A., and Pisarevsky, V., “Learning-Based Computer Vision with
Intel's Open Source Computer Vision Library,” Intel Technology Journal, 2005.
[4] Serre, T., Wolf, L., and Poggio, T., “Object Recognition with Features Inspired by
Visual Cortex,” Proceedings of the IEEE Computer Society Conference on Computer
Vision and Pattern Recognition, 2005.
Appendix I:
Sample results from performing car detection on real-time video input feed.