Car Detection Using a Bayesian Network
Konstantin Speransky
McGill University
Detection of cars in still images can be treated as a
local classification problem. However, placing local
detections in the context of the overall 3D scene may
improve the results, because not all positions and scales
of objects remain equally probable.
We elaborate on two types of local detectors: one based
on PCA reconstruction and one based on the aggregation of
weak classifiers using boosting. We then present the
integration of their outputs with contextual information
using a Bayesian framework.
1. Introduction
There are two main approaches for object detection in
images: patch-based and part-based. The former method
works better for rigid objects with well-defined shapes like
cars. It is also particularly convenient because we can
reduce the detection problem to the binary classification
problem: there is an object in a patch or there is not.
Boosting techniques are traditionally used in
high-performance object detectors [2]. They perform very
well, but at a high computational cost. Recently, a
promising approach based on PCA reconstruction of images
was proposed [3], so it is interesting to compare the two
techniques.
On the other hand, context plays a crucial role in scene
understanding. A car is recognized as a car not only
because of its visual appearance, but also because it is on
the road and has the proper size in comparison to other
objects. One of the first attempts to consider detection
results simultaneously with the context was implemented
in [2]. Heitz and Koller [5] linked detection of objects to
the unsupervised classification of nearby areas using a
Bayesian network.
In this work, we follow Hoiem et al. [1], who also use a
Bayesian network to link multiple sources of information
together. A Bayesian network is essentially a directed
graph that conveniently represents conditional
dependencies between variables. However, rather than
relating things in the image plane as in [5], the authors
of [1] try to reconstruct the 3D scene by integrating three
crucial elements: a high-performance local detector, an
estimate of the camera position/orientation, and the scene
geometry.
As a dataset, I used street pictures with parked cars
collected in Montreal. Some of them were taken from a
car and show negative effects such as blurring. Others were
taken by a stationary pedestrian, with a camera height and
rotation similar to those of the pictures taken from the car.
The rest of the paper is organized as follows. Section 2
presents two local detectors, based on PCA reconstruction
(Section 2.1) and on the GentleBoost algorithm (Section 2.2).
Some practical aspects of local detection can be found in
Section 2.3. In Section 3 we consider the useful contextual
information (Section 3.1) and how it can be integrated with
local detections using a Bayesian network (Section 3.2).
In Section 4 we conclude the paper and propose some
possible improvements.
2. Local detection
A local detector is an algorithm that decides whether an
object is present or absent in a small patch of the image.
We implemented two recent methods of local detection
and compared their performance. The first is an adaptation
of the work by Malagon-Borja and Fuentes [3] to the
problem of car detection. It is based on the projection and
subsequent reconstruction of a detection window using
different sets of eigenimages. The second approach,
elaborated by Murphy et al. [2], constructs a strong
classifier from a large set of weak classifiers using a
modification of a boosting algorithm.
2.1. Detector based on the PCA reconstruction
The detection problem can be considered as a
classification problem: we try to label a patch either as
object or as non-object. In [3], PCA is used to perform the
classification. The underlying idea is that PCA can
efficiently compress only the type of images it was
performed on. If we perform PCA separately on the datasets
with objects and with non-objects, we may expect that a new
patch containing an object will be reconstructed better
with the first set of eigenimages than with the second one.
Because cars appear in different colors, the grayscale
version of a patch is used for classification along with its
gradient representation, which is color invariant. The
gradient representation is computed with Sobel filters
along two directions and serves as additional information
for the classifier. As a result, there are four different
sets of eigenimages, obtained from grayscale car pictures,
gradient car pictures, grayscale background pictures and
gradient background pictures.
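As an illustration only, the gradient representation can be
sketched as below. The kernels are the standard 3×3 Sobel
masks. Combining the two directional responses into a single
magnitude image, leaving the border pixels at zero, and the
function name sobel_gradient are assumptions of this sketch,
not details of the implementation used for the experiments.

```python
def sobel_gradient(img):
    """Gradient magnitude of a grayscale image given as a list of rows."""
    kx = [[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]]    # horizontal derivative
    ky = [[-1, -2, -1], [0, 0, 0], [1, 2, 1]]    # vertical derivative
    h, w = len(img), len(img[0])
    out = [[0.0] * w for _ in range(h)]          # border pixels stay at 0 (assumption)
    for r in range(1, h - 1):
        for c in range(1, w - 1):
            gx = sum(kx[i][j] * img[r + i - 1][c + j - 1]
                     for i in range(3) for j in range(3))
            gy = sum(ky[i][j] * img[r + i - 1][c + j - 1]
                     for i in range(3) for j in range(3))
            out[r][c] = (gx * gx + gy * gy) ** 0.5
    return out
```

Adding a constant to every pixel leaves the output unchanged,
which is the kind of insensitivity to overall intensity that
motivates using the gradient image alongside the grayscale one.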
To reconstruct an image $I$ using a set of eigenimages $A$
and a mean image $X$, we use the projection equation
$P = A^T (I - X)$   (1)
and the reconstruction equation
$I_r = A \cdot P + X$   (2)
If $I$ and $I_r$ do not differ significantly, then $I$
corresponds to the data set that $A$ and $X$ were acquired from.
Figure 1: First row – original patch for detection and
its gradient representation, second row – projection
and reconstruction with car eigenimages, third row –
projection and reconstruction with background
eigenimages. 20 eigenimages were used.
Figure 1 shows an example of an original patch to be
classified (top-left) and its gradient representation
(top-right). It is clearly visible that the projections and
reconstructions using the car eigenimages (second row)
are much closer to the originals than the projections and
reconstructions with background eigenimages.
For every patch there are four meaningful distances:
between a grayscale image and its reconstructions using
car eigenimages ($d_1$) and background eigenimages ($d_2$),
and between the corresponding gradient image and its
reconstructions using car gradient eigenimages ($d_3$) and
background gradient eigenimages ($d_4$). We can therefore
define a classifier in the form (3), where $d$ is some
distance measure between images:
$D = \frac{d_2}{d_1} + \frac{d_4}{d_3}$   (3)
If $D > 2$ then we assume that there is a car in the
window, otherwise there is background. It turns out that
both grayscale and gradient representations are necessary
for detection. During the simulation, it was often the case
that one of the summands in (3) provided the wrong
value, but the final classification was, nevertheless, right.
The question arises of what similarity metric $d$ between
original and reconstructed images to use. Originally [3], it
was a simple difference between images:
$d = |I - I_r|$   (4)
As an additional metric we tried a mutual information
criterion. In this case, the classifier (3) remains the
same, but the similarity measure (4) is replaced by the MI
equation:
$d(I, I_r) = \sum_{x \in I} \sum_{y \in I_r} p(x, y) \log \frac{p(x, y)}{p_1(x) p_2(y)}$   (5)
where $p(x, y)$ is the joint intensity histogram and
$p_1(x)$, $p_2(y)$ are the marginal intensity histograms.
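A minimal sketch of the reconstruction-based decision rule
described in this section follows. It assumes that images are
flattened to vectors, that the eigenimages are orthonormal and
stored as the rows of A, and that the simple difference is
summed over pixels. The helper names and the small eps guard
against division by zero are hypothetical.

```python
def project(I, A, X):
    """Coefficients of (I - X) on each eigenimage (rows of A)."""
    centered = [i - x for i, x in zip(I, X)]
    return [sum(a_j * c_j for a_j, c_j in zip(a, centered)) for a in A]

def reconstruct(P, A, X):
    """Mean image plus the weighted sum of eigenimages."""
    rec = list(X)
    for p, a in zip(P, A):
        for j, a_j in enumerate(a):
            rec[j] += p * a_j
    return rec

def reconstruction_error(I, A, X):
    """Simple difference |I - I_r|, summed over pixels (assumption)."""
    I_r = reconstruct(project(I, A, X), A, X)
    return sum(abs(i - r) for i, r in zip(I, I_r))

def is_car(d1, d2, d3, d4, threshold=2.0, eps=1e-12):
    """Sum of background-to-car error ratios compared with the threshold."""
    D = d2 / (d1 + eps) + d4 / (d3 + eps)
    return D > threshold
```

Here d1 and d3 would come from reconstruction_error with the
car eigenimages (grayscale and gradient), and d2 and d4 from
the background eigenimages.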
Figure 2: ROC curves for the detector based on the
PCA reconstruction with different criteria
To train the detector we used 50 positive and 250 negative
examples, and 40 eigenimages for all datasets. To evaluate
the detector, 3 times more examples were used. As we can
see from Figure 2, the simple difference performs better
than mutual information at small false alarm rates.
Moreover, mutual information works much worse for a
small number of eigenimages, and the simple difference can
be computed more than 2 times faster.
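For reference, the mutual information between a patch and its
reconstruction can be estimated from intensity histograms
roughly as follows. The number of bins, the intensity range
and the function name are assumptions of this sketch.

```python
import math
from collections import Counter

def mutual_information(img_a, img_b, bins=8, max_val=256):
    """MI of two equally sized images given as flat lists of intensities."""
    assert len(img_a) == len(img_b)
    width = max_val / bins

    def quantize(v):
        # Clamp so that out-of-range reconstructed values still get a bin.
        return max(0, min(int(v / width), bins - 1))

    qa = [quantize(v) for v in img_a]
    qb = [quantize(v) for v in img_b]
    n = len(qa)
    joint = Counter(zip(qa, qb))      # joint histogram
    pa, pb = Counter(qa), Counter(qb)  # marginal histograms
    mi = 0.0
    for (x, y), count in joint.items():
        pxy = count / n
        mi += pxy * math.log(pxy / ((pa[x] / n) * (pb[y] / n)))
    return mi
```

Building the joint histogram requires an extra pass over the
pixel pairs, which is consistent with the simple difference
being cheaper to compute.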
2.2. Detector using boosting
A completely different approach to object detection can
be found in [2], where Murphy et al. used boosting to
combine a number of weak unified classifiers, which are
only slightly correlated with the true classification, into a
strong classifier.
Each weak classifier computes its feature value for a
detection window. First, a patch is convolved with one
of 13 filters [2], including a delta function, Gaussian
derivatives, a Laplacian, corner detectors and edge detectors
(examples are in the first row of Figure 3). Then, one of 30
spatial templates is applied to the filtered patch (examples
are in the second row of Figure 3). Finally, we
calculate a histogram of the cropped, filtered patch and
find its variance and kurtosis. Overall, there are
$k = 13 \times 30 \times 2 = 780$ weak classifiers of the form:
$f_k(x) = \sum w_k(x) \left( |I(x) * g_k(x)|^{\gamma_k} \right)$   (6)
where $g_k(x)$ is a filter, $w_k(x)$ is a spatial template,
$| \cdot |$ denotes the histogram, and $\gamma_k = 2, 4$
selects variance or kurtosis.
Figure 3 (from [2]): First row – some examples from
filters dictionary, second row – some examples from
spatial templates dictionary.
The general idea of boosting is to create a strong
classifier from weak ones using an additive model (7) by
minimizing the exponential loss (8). In each round of
boosting, examples from the training set that were
correctly classified by many of the previous weak
classifiers get lower weights, and often misclassified
examples get higher weights.
$F(x) = f_1(x) + f_2(x) + f_3(x) + \cdots$   (7)
$J(F) = \sum_d e^{-y_d F(x_d)}$   (8)
where $f_k(x)$ are weak classifiers, $F(x)$ is the strong
classifier, $x_d$ are training examples, $y_d \in \{-1, 1\}$ is
the ground truth for the training examples and $N$ is the
number of training examples.
From (8) it is clear that if the predicted classification
$F(x_d)$ has the same sign as the corresponding ground truth
$y_d$, then the loss function $J(F)$ becomes smaller.
There are numerous variations of boosting techniques;
like Murphy et al. [2], we employ the GentleBoost version,
which can be described as follows [4]:
1. Initialization: weights $w_d = 1/N$, $F(x) = 0$.
2. For each round of boosting $m = 1 \ldots M$:
   2.1. Select the optimal weak classifier $f_k(x)$ and
   regression stump parameters $a$, $b$, $\theta$ that solve the
   weighted least-squares problem (9):
   $\min \sum_d w_d \cdot [y_d - (a \cdot [f_k(x_d) > \theta] + b)]^2$   (9)
   2.2. Evaluate the optimal weak classifier in the form of
   the regression stump $f_m^{rs}(x)$ from the previous step:
   $f_m^{rs}(x) = a \cdot [f_k(x) > \theta] + b$   (10)
   2.3. Update the strong classifier:
   $F(x) \leftarrow F(x) + f_m^{rs}(x)$   (11)
   2.4. Update the weights:
   $w_d \leftarrow w_d \cdot \exp[-y_d f_m^{rs}(x_d)]$   (12)
3. Output the strong classifier:
   $\mathrm{sign}[F(x)] = \mathrm{sign}\left[\sum_{m=1}^{M} f_m^{rs}(x)\right]$   (13)
The number of weak classifiers $M$ that are incorporated
into the final strong classifier is an important parameter.
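The GentleBoost procedure with regression stumps can be
sketched as follows. This is a toy illustration under stated
assumptions, not the implementation used for the experiments:
the weak feature values are assumed to be precomputed into a
table, candidate thresholds are taken from the observed feature
values, and the weights are renormalized after each round as
in [4]. All function names are hypothetical.

```python
import math

def fit_stump(features, y, w):
    """Weighted least-squares regression stump over all features and thresholds.

    features[d][k] is the value of weak feature k on training example d.
    Returns (error, k, theta, a, b), or None if no feature can split the data.
    """
    best = None
    for k in range(len(features[0])):
        for theta in sorted({row[k] for row in features}):
            hi = [(wd, yd) for row, yd, wd in zip(features, y, w) if row[k] > theta]
            lo = [(wd, yd) for row, yd, wd in zip(features, y, w) if row[k] <= theta]
            if not hi or not lo:
                continue
            # Closed-form minimizer: b and a + b are weighted means of y on each side.
            b = sum(wd * yd for wd, yd in lo) / sum(wd for wd, _ in lo)
            a = sum(wd * yd for wd, yd in hi) / sum(wd for wd, _ in hi) - b
            err = sum(wd * (yd - (a * (row[k] > theta) + b)) ** 2
                      for row, yd, wd in zip(features, y, w))
            if best is None or err < best[0]:
                best = (err, k, theta, a, b)
    return best

def gentleboost(features, y, rounds):
    """Return the list of selected stumps (k, theta, a, b), one per round."""
    n = len(y)
    w = [1.0 / n] * n                      # uniform initial weights
    stumps = []
    for _ in range(rounds):
        _, k, theta, a, b = fit_stump(features, y, w)
        stumps.append((k, theta, a, b))
        out = [a * (row[k] > theta) + b for row in features]
        w = [wd * math.exp(-yd * fd) for wd, yd, fd in zip(w, y, out)]
        total = sum(w)
        w = [wd / total for wd in w]       # renormalization, as in [4]
    return stumps

def strong_classifier(stumps, row):
    """Sign of the sum of the selected regression stumps."""
    F = sum(a * (row[k] > theta) + b for k, theta, a, b in stumps)
    return 1 if F > 0 else -1
```

The exhaustive search over all features and thresholds in
every round is what makes training and evaluation expensive
when the number of rounds is large.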
Figure 4: ROC curves for GentleBoost detector.
In Figure 4 we can see how the ROC curve improves as we
increase the number of rounds $M$. The training and test
datasets are the same as in Section 2.1. One of the major
advantages of boosting algorithms is that they are fairly
resistant to overfitting. This phenomenon is not fully
explained in the literature so far, but the ROC curves did
not show any signs of overfitting during our simulations,
even for more than 600 rounds.
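For completeness, ROC points such as those plotted for the
detectors can be obtained from classifier scores as sketched
below. The function name is hypothetical, labels are assumed to
be +1 and -1, and tied scores are not treated specially.

```python
def roc_points(scores, labels):
    """Return (false alarm rate, detection rate) pairs, one per threshold."""
    pos = sum(1 for l in labels if l == 1)
    neg = len(labels) - pos
    points = [(0.0, 0.0)]
    tp = fp = 0
    # Lower the threshold one example at a time, highest score first.
    for score, label in sorted(zip(scores, labels), key=lambda p: -p[0]):
        if label == 1:
            tp += 1
        else:
            fp += 1
        points.append((fp / neg, tp / pos))
    return points
```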
Comparing Figure 2 and Figure 4, we can see that the
detector based on PCA reconstruction (40 eigenimages) works
slightly better at small false alarm rates (less than 1%)
than the GentleBoost detector (200 rounds). However, for
higher false detection rates, the GentleBoost detector (200
rounds) performs better. It is an open question whether the
GentleBoost detector with a very large number of weak
classifiers can outperform the PCA detector even at small
false alarm rates.
Figure 5: Time to classify 750 windows using different
detectors (plot labels: "PCA (Simple difference)"; axis:
"Time, s").
In Section 3 I used the PCA-based detector for two
reasons:
a) it is apparent from Figure 5 that GentleBoost in
its current implementation is too slow;
b) the PCA detector shows better performance at low
false detection rates.
2.3. Practical implementation
After applying any detector to an image, we have 20-200
bounding boxes with different sizes and confidences,
mostly around cars. Pruning is therefore necessary to
eliminate multiple detections around the same object, and
the following technique was used. Suppose we have two
intersecting bounding boxes, and box2 has a higher
probability of containing a car than box1. We keep box1 if
the ratio of the intersection area of the two boxes to the
area of their union is less than the maximum overlap (we
used 0.4 as a tradeoff between separate detection of
neighboring cars and cluttering of the image with detection
boxes) and discard it otherwise.
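The pruning rule can be sketched as follows. The text states
the rule for a pair of boxes; applying it greedily, starting
from the most confident box, is an assumption of this sketch,
as are the box format (x1, y1, x2, y2) and the function names.

```python
def overlap_ratio(a, b):
    """Intersection area over union area for boxes (x1, y1, x2, y2)."""
    iw = min(a[2], b[2]) - max(a[0], b[0])
    ih = min(a[3], b[3]) - max(a[1], b[1])
    if iw <= 0 or ih <= 0:
        return 0.0
    inter = iw * ih
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def prune(detections, max_overlap=0.4):
    """detections: list of (confidence, box). Returns the boxes that survive."""
    kept = []
    for conf, box in sorted(detections, key=lambda d: -d[0]):
        # Keep a box only if it overlaps little with every better box kept so far.
        if all(overlap_ratio(box, k_box) < max_overlap for _, k_box in kept):
            kept.append((conf, box))
    return kept
```

A lower max_overlap merges more detections and risks fusing
neighboring cars; a higher value leaves more duplicate boxes.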
The selected size of the detection window is [150 × 80]
pixels and we worked with images of size [640 × 480].
The following set of scale parameters for the original
image was selected in order to detect cars of various sizes:
$Scale = [0.4, 0.5, 0.7, 0.8, 0.9, 1, 1.1, 1.15, 1.25]$   (14)
By increasing the number of possible scales in (14) we
can improve the detection; by decreasing it we can reduce
the computational complexity.
Figure 6: On the left there are 68 bounding boxes
before we apply pruning. On the right there are 6
boxes that remain after pruning.
We can dramatically reduce the computation time by
shifting the detection window not by one, but by several
pixels at a time. Empirically, we selected a shift of 5 pixels
for the first three scales and 10 pixels for the remaining ones.
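To give a feeling for the workload, the number of window
positions per scale can be estimated as below. The sketch
assumes that the image is rescaled while the window stays
fixed, and that [150 × 80] and [640 × 480] are given as width
× height; both readings, and the function names, are
assumptions.

```python
SCALES = [0.4, 0.5, 0.7, 0.8, 0.9, 1, 1.1, 1.15, 1.25]

def count_windows(img_w, img_h, win_w=150, win_h=80, step=5):
    """Number of window positions that fit in an image of the given size."""
    if img_w < win_w or img_h < win_h:
        return 0
    return ((img_w - win_w) // step + 1) * ((img_h - win_h) // step + 1)

def windows_per_scale(img_w=640, img_h=480):
    """Map each scale to its number of window positions."""
    counts = {}
    for idx, s in enumerate(SCALES):
        step = 5 if idx < 3 else 10      # shifts quoted in the text
        counts[s] = count_windows(round(img_w * s), round(img_h * s), step=step)
    return counts
```

Since the count grows roughly with the inverse square of the
shift, doubling the shift cuts the number of windows at that
scale by about a factor of four.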
3. Bayesian framework
We can further improve the detection if we integrate the
output of a local detector with contextual information.
Some of the possible contextual cues are presented in
Section 3.1. In Section 3.2 we present the Bayesian
network that integrates all the information. Section 3.3 shows
how the framework actually works for one particular example.
3.1. Contextual cues
There are two valuable geometric cues that give us a
tight prior likelihood for the location and scale of a car.
The first one has to do with the perspective projection. As
shown in [1] (formulas (1)-(7)), we can approximate
the relationship between the height of the object in the real
world $y_i$, its height $h_i$ and bottom position $v_i$ in the image,
the horizon line in the image $v_0 \in [0 \ldots 1]$ and the real-world
height of the camera $y_c$ in the following way:
$y_i \approx \frac{h_i y_c}{v_0 - v_i}$   (15)
In the derivation of (15) it is assumed that the
object is on the ground and the camera tilt is small. From (15)
we can see that the probability of a car $o_i$ given the camera
parameters ($\theta = [v_0, y_c]$) is proportional to the probability
of the car's height in the image given $\theta$ and $v_i$:
$P(o_i | \theta) \propto P(h_i | v_i, \theta)$   (16)
The prior over the camera parameters factorizes as
$P(\theta) = P(v_0) P(y_c)$
The second cue is that a detection window with a car
may be considered as containing mostly vertical surfaces
with a ground below. Possible surface geometries 𝑔𝑖 for
the detection window can be of three types: ground,
vertical and vertical with ground below.
Originally, Hoiem et al. [6] used a rather complicated
general approach to differentiate ground and vertical
surfaces. We propose a much simpler way to find the ground
surface, analogous to the magic wand tool in Photoshop.
It turned out to work sufficiently well for this
particular constrained problem.
The algorithm for ground extraction is the following.
We automatically select points that are on the road most of
the time (in the bottom-left corner). Then we compute the
Euclidean distances between the RGB values of these
reference pixels and all other image pixels. We select only
the pixels that are close (in terms of RGB distance) and find a
connected region that incorporates these pixels as well as
the reference pixels. An example is shown in Figure 8.b.
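The ground extraction can be sketched as a flood fill from the
reference pixels. Comparing each pixel with the mean colour of
the reference pixels, the threshold value max_dist and the use
of 4-connectivity are assumptions of this sketch; the function
name is hypothetical.

```python
from collections import deque

def extract_ground(img, seeds, max_dist=30.0):
    """img: rows of (R, G, B) tuples; seeds: list of (row, col) reference pixels.

    Returns the set of pixels connected to the seeds whose colour lies within
    max_dist (Euclidean distance in RGB) of the mean seed colour.
    """
    h, w = len(img), len(img[0])
    ref = [sum(img[r][c][ch] for r, c in seeds) / len(seeds) for ch in range(3)]

    def close(r, c):
        dist = sum((img[r][c][ch] - ref[ch]) ** 2 for ch in range(3)) ** 0.5
        return dist <= max_dist

    region, queue = set(seeds), deque(seeds)
    while queue:
        r, c = queue.popleft()
        for nr, nc in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            if 0 <= nr < h and 0 <= nc < w and (nr, nc) not in region and close(nr, nc):
                region.add((nr, nc))
                queue.append((nr, nc))
    return region
```

Pixels with a road-like colour that are not connected to the
reference pixels, for example a grey wall, are left out of the
region.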
3.2. Bayesian network
It is possible to estimate the viewpoint, the object locations
and scales, and the surface geometry of the scene separately.
However, the estimation can be more accurate and robust
if we take advantage of the interactions between all these
elements. We assume [1] that objects and geometric
surfaces produce image evidence $e_o$ and $e_g$ respectively.
The viewpoint $\theta = [v_0, y_c]$ directly influences the position
and size of the objects $o_i$, and the objects directly influence
the underlying geometric surfaces $g_i$. It is possible to
represent these dependencies as a Bayesian network
(Figure 7). In this model, objects are independent given the
viewpoint, and geometric surfaces are independent given
their corresponding objects.
The joint distribution decomposes as follows:
$P(\theta, o, g, e_g, e_o) = P(\theta) \prod_i P(o_i | \theta) P(e_{o_i} | o_i) P(g_i | o_i) P(e_{g_i} | g_i)$   (17)
Using Bayes' rule, we can express the probability
of the scene given the evidence from the image, up to a
normalizing constant, in the following way:
$P(\theta, o, g | e_g, e_o) \propto P(\theta) \prod_i P(o_i | \theta) \frac{P(o_i | e_{o_i})}{P(o_i)} P(g_i | o_i) \frac{P(g_i | e_g)}{P(g_i)}$   (18)
Equation (18) summarizes all the dependencies between
detections and contextual cues. A priori, the camera height $y_c$
and the horizon position $v_0$ are considered independent:
$P(v_0)$ and $P(y_c)$ are modeled as Gaussian with
mean values 0.5 and 1.2 respectively, so $P(\theta)$ is also
Gaussian (see Figure 8.a).
Figure 7: Bayesian network
𝑃(π‘œπ‘– |π‘’π‘œπ‘– ) is an output from the local detector for each
position and scale in the image. If the confidence in the
particular local detection after pruning is higher than a
certain level, this detection becomes an object π‘œπ‘– in the
Bayesian framework. Essentially, the number of objects π‘œπ‘–
should be larger than the number of cars in the image and
after we make an inference on the Bayesian net, the
confidence in true detections should decrease, while
confidence in wrong detections should decrease. In order
to calculate the probability 𝑃(π‘œπ‘– |π‘’π‘œπ‘– ) we fitted a logistic
regression to the binary classifiers in Sections 2.1 and 2.2.
However, since all detections after pruning have very high
confidence (> 90%), we can in practice assign the same
high probability to all possible detection windows
(objects π‘œπ‘– ). 𝑃(π‘œπ‘– |πœƒ) is proportional to the probability of a
car’s image height (16) and has the Gaussian distribution
(15) with parameters:
πœ‡π‘– (π‘£π‘œ − 𝑣𝑖 )
πœŽπ‘– (π‘£π‘œ − 𝑣𝑖 )
where πœ‡π‘– = 1.4 and πœŽπ‘– = 0.3 are typical for cars.
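The height term can be evaluated as in the sketch below, which
scales the real-world car height distribution into image units
using relation (15). Normalized image coordinates measured
upward from the bottom of the image and the function name are
assumptions.

```python
import math

def height_likelihood(h_i, v_i, v0, y_c, mu=1.4, sigma=0.3):
    """Gaussian density of image height h_i given bottom v_i and viewpoint (v0, y_c)."""
    scale = (v0 - v_i) / y_c          # world units -> image units, from (15)
    if scale <= 0:
        return 0.0                    # bottom edge at or above the horizon
    mean, std = mu * scale, sigma * scale
    z = (h_i - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2 * math.pi))
```

A box whose image height matches the expected car height for
its position gets a high value, while a box of the same
position that is twice as tall is strongly penalized.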
$P(g_i | e_g)$ is calculated using the method described in
Section 3.1, while $P(g_i | o_i)$ and $P(g_i)$ for the three possible
values of $g_i$ are estimated either from the training dataset
or empirically.
Once we know the conditional probabilities and priors in
the model (Figure 7), we can pose a question: what are the
posterior distributions of all variables? There are a
number of exact and approximate algorithms for
inference in Bayesian networks. We used Pearl's belief
propagation, which is implemented in the Bayesian Networks
Toolbox [7] and is one of the most efficient algorithms for
inference in tree-structured graphs.
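To make the inference step concrete, the sketch below computes
the same kind of posteriors by brute-force enumeration over a
discretized viewpoint. It is not Pearl's belief propagation and
does not use the toolbox [7]; it also omits the geometry nodes.
Each candidate is treated as a binary variable (car or
background). The prior standard deviations, the uniform
background density bg_density and all names are hypothetical,
and local probabilities are assumed to lie strictly between 0
and 1.

```python
import math

def gauss(x, mean, std):
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2 * math.pi))

def infer(candidates, v0_grid, yc_grid, mu=1.4, sigma=0.3,
          v0_prior=(0.5, 0.15), yc_prior=(1.2, 0.3), bg_density=1.0):
    """candidates: list of (p_local, h_i, v_i) with 0 < p_local < 1.

    Returns (posterior over viewpoints, posterior car probability per candidate).
    """
    thetas = [(v0, yc) for v0 in v0_grid for yc in yc_grid]

    def car_term(p, h, v, v0, yc):
        # Local detector probability times the height likelihood for this viewpoint.
        s = (v0 - v) / yc
        return p * gauss(h, mu * s, sigma * s) if s > 0 else 0.0

    post = []
    for v0, yc in thetas:
        weight = gauss(v0, *v0_prior) * gauss(yc, *yc_prior)   # prior on viewpoint
        for p, h, v in candidates:
            # Sum over the two states of the candidate (car, background).
            weight *= car_term(p, h, v, v0, yc) + (1 - p) * bg_density
        post.append(weight)
    total = sum(post)
    post = [w / total for w in post]

    confidences = []
    for p, h, v in candidates:
        conf = 0.0
        for (v0, yc), w in zip(thetas, post):
            car = car_term(p, h, v, v0, yc)
            conf += w * car / (car + (1 - p) * bg_density)
        confidences.append(conf)
    return dict(zip(thetas, post)), confidences
```

Because candidates are independent given the viewpoint, the
sum over each candidate's states can be taken separately,
which is the same structure that makes belief propagation
exact on this tree-shaped model.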
Figure 8: a) prior distribution for camera parameters $\theta$; b) bounding boxes from local detector and ground
extraction; c), d) distribution for $\theta$ and confidence in bounding boxes after inference without geometric cues;
e), f) distribution for $\theta$ and confidence in bounding boxes after inference with geometric cues
3.3. Example of the improved detection
We can gain insight into how the algorithm works by
observing the results in Figure 8. In Figure 8.b we
can see the output of the local detector after pruning, as
well as the extraction of the ground plane. The corresponding
Gaussian prior distribution for the camera parameters is in
Figure 8.a. If we combine the camera parameters $\theta$ with the
local detections $o_i$ (assuming that the $g_i$ are omitted from the
model in Figure 7), we see that the confidences for the
bounding boxes have changed (Figure 8.d), but not in the
desired way. This happens because there are two
competing hypotheses for the camera parameters
(Figure 8.c): one of them, with a horizon estimate of 0.7, is
right; the other, with a lower estimate of the horizon, is
wrong but supported by two small detection boxes on the
right. However, once we apply our entire model (Figure
7), we obtain a good estimate of the confidences for the
bounding boxes (Figure 8.f) as well as a suppressed wrong
peak in the distribution of camera parameters (Figure 8.e).
4. Conclusion
In this course project, two very efficient types of
algorithms for local detection in still images were implemented
and compared. The first one, based on PCA reconstruction,
demonstrated better performance at low false alarm rates
than the second detector, which uses GentleBoost. Also, the
former is much more computationally efficient than the
latter, although a better implementation may show
different results.
The results from the local detector were then integrated
with contextual cues, namely the underlying geometric
surfaces and the camera parameters, using a Bayesian
network. As seen in Section 3.3, this technique can
further improve the results by eliminating detections that
are unlikely from the point of view of the overall scene.
There are a number of ways to improve the algorithm,
though. The major problem is computational efficiency.
Currently an entire picture can be evaluated in around 4
minutes using the PCA local detector. This is still too
demanding for real applications, so further improvements,
such as restricting detection to specific regions and using
larger shifts between detection windows, should be employed.
Also, it is possible to achieve some improvement by training
on larger datasets and by considering detections as still
dependent on each other given the viewpoint.
[1] D. Hoiem, A. Efros, M. Hebert, Putting objects in
perspective, IJCV, 80:3-15, 2008
[2] K. Murphy, A. Torralba, W. Freeman, Using the Forest to
See the Trees: A Graphical Model Relating Features,
Objects, and Scenes, NIPS, 16, 2003
[3] L. Malagon-Borja, O. Fuentes, Object detection using image
reconstruction with PCA, IVC, 27:2-9, 2009 (Online 2007)
[4] J. Friedman, T. Hastie, R. Tibshirani, Additive logistic
regression: a statistical view of boosting, The Annals of
Statistics, 28(2):337-407, 2000
[5] G. Heitz, D. Koller, Learning spatial context: using stuff to
find things, Proceedings of the 10th ECCV, 30-43, 2008
[6] D. Hoiem, A. Efros, M. Hebert, Geometric Context from a
Single Image, ICCV, 2005
[7] Bayesian Networks Toolbox for Matlab by K. Murphy:
[8] GentleBoost library for Matlab by A. Torralba