Car Detection Using a Bayesian Network

Konstantin Speransky
McGill University
konstantin.speransky@mail.mcgill.ca

Abstract

Detection of cars in still images can be considered as a local classification problem. However, placing local detections in the context of the overall 3D scene may improve the results, since not all locations and scales of objects are equally probable. We elaborate on two different types of local detectors, one based on PCA reconstruction and one on the aggregation of weak classifiers using boosting. We then present the integration of their outputs with contextual information using a Bayesian framework.

1. Introduction

There are two main approaches to object detection in images: patch-based and part-based. The former works better for rigid objects with well-defined shapes, like cars. It is also particularly convenient because it reduces the detection problem to a binary classification problem: either there is an object in a patch or there is not. Boosting techniques are traditionally used in high-performance object detectors [2]. They have excellent characteristics except for their computational complexity. Recently, a promising approach based on PCA reconstruction of images was proposed [3], so it is interesting to compare these techniques.

On the other hand, context plays a crucial role in scene understanding. A car is recognized as a car not only because of its visual appearance, but also because it is on the road and has the proper size in comparison to other objects. One of the first attempts to consider detection results together with context was implemented in [2]. Heitz and Koller [5] linked the detection of objects to the unsupervised classification of nearby areas using a Bayesian network. In this work, we follow Hoiem et al. [1], who also used a Bayesian network (essentially a directed graph that conveniently represents conditional dependencies between variables) as a tool to link multiple sources of information together. However, rather than relating things in the image plane [5], the authors of [1] try to reconstruct the 3D scene by integrating three crucial elements: a high-performance local detector, an estimate of the camera position and orientation, and the scene geometry.

As a dataset, we used street pictures with parked cars collected in Montreal. Some of them were taken from a car and display negative effects, such as blurring. Others were taken statically by a pedestrian, from a height and orientation similar to those of the pictures taken from the car.

The rest of the paper is organized as follows. Section 2 presents two local detectors, based on PCA reconstruction (Section 2.1) and on the GentleBoost algorithm (Section 2.2). Some practical aspects related to local detection can be found in Section 2.3. In Section 3 we consider useful contextual information (Section 3.1) and how it can be integrated with local detections using a Bayesian network (Section 3.2). In Section 4 we conclude the paper and propose some possible improvements.

2. Local detection

A local detector is an algorithm that determines the presence or absence of an object in a small patch of the image. We implemented two state-of-the-art methods of local detection and compared their performance. The first is an adaptation of the work by Malagon-Borja and Fuentes [3] to the problem of car detection. It is based on the projection and subsequent reconstruction of a detection window using different sets of eigenimages. The second approach, elaborated by Murphy et al. [2], constructs a strong classifier from a large set of weak classifiers using a modification of a boosting algorithm.
2.1. Detector based on the PCA reconstruction

The detection problem can be considered as a classification problem: we try to label a patch either as an object or as a non-object. In [3], PCA is used to perform this classification. The underlying idea is that PCA can efficiently compress only the type of images it was performed on. If we run PCA separately on datasets of objects and non-objects, we may expect that a newly arrived patch containing an object will be reconstructed better with the first set of eigenimages than with the second one.

Because cars appear in different colors, the grayscale version of a patch is used for classification together with its gradient representation, which is color invariant. The gradient representation is computed with Sobel filters along two directions and serves as additional information for the classifier. As a result, there are four different sets of eigenimages, obtained from grayscale car pictures, gradient car pictures, grayscale background pictures and gradient background pictures.

To reconstruct an image $I$ using a set of eigenimages $A$ and a mean image $m$, we may follow the projection equation

$$p = A^T (I - m) \qquad (1)$$

and the reconstruction equation

$$I_r = A p + m \qquad (2)$$

If $I$ and $I_r$ do not differ significantly, then $I$ corresponds to the dataset that $A$ and $m$ were acquired from.

Since for every patch there are four meaningful distances (between the grayscale image and its reconstructions using the car eigenimages, $d_1$, and the background eigenimages, $d_2$; and between the corresponding gradient image and its reconstructions using the car gradient eigenimages, $d_3$, and the background gradient eigenimages, $d_4$), we can define a classifier of the form

$$D = \frac{d_2}{d_1} + \frac{d_4}{d_3} \qquad (3)$$

where $d$ is some distance measure between images. If $D > 2$, we assume that there is a car in the window; otherwise there is background. It turns out that both the grayscale and the gradient representations are necessary for detection: during the simulations it often happened that one of the summands in (3) gave the wrong value, but the final classification was nevertheless right.

The question arises of which similarity metric $d$ between the original and reconstructed images to use. Originally [3], it was a simple difference between the images:

$$d = |I - I_r| \qquad (4)$$

As an additional metric we tried a mutual information criterion. In this case the classifier (3) remains the same, but the similarity measure (4) is replaced by the MI equation

$$MI(I, I_r) = \sum_{x \in I} \sum_{y \in I_r} P(x, y) \log \frac{P(x, y)}{P_1(x) P_2(y)} \qquad (5)$$

where $P(x, y)$, $P_1(x)$ and $P_2(y)$ are the joint and marginal intensity histograms, respectively.

Figure 1: First row – original patch for detection and its gradient representation; second row – projection and reconstruction with car eigenimages; third row – projection and reconstruction with background eigenimages. 20 eigenimages were used.

Figure 1 shows an example of an original patch to be classified (top left) and its gradient representation (top right). It is clearly visible that the projections and reconstructions using the car eigenimages (second row) are much closer to the originals than those using the background eigenimages (third row).

Figure 2: ROC curves for the detector based on the PCA reconstruction with different similarity criteria.

To train the detector we used 50 positive examples, 250 negative examples and 40 eigenimages for all datasets. To evaluate the detector, three times as many examples were used. As we can see from Figure 2, the simple difference performs better than mutual information at small false alarm rates. Moreover, mutual information works much worse for a small number of eigenimages, and the simple difference can be computed more than two times faster.
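For concreteness, the following is a minimal numpy sketch of the classifier described above. The function names, the SVD-based fitting of the eigenimages and the sum of absolute pixel differences as the metric (4) are our own illustrative choices, not the exact implementation used in the experiments.

```python
import numpy as np

def fit_pca(patches, n_eig=40):
    """Fit a mean image and the top eigenimages on an
    (n_samples, n_pixels) matrix of flattened training patches."""
    m = patches.mean(axis=0)
    # Rows of Vt are the principal directions of the centered data.
    _, _, Vt = np.linalg.svd(patches - m, full_matrices=False)
    return Vt[:n_eig].T, m                 # A: (n_pixels, n_eig), mean m

def reconstruct(I, A, m):
    p = A.T @ (I - m)                      # projection, Eq. (1)
    return A @ p + m                       # reconstruction, Eq. (2)

def distance(I, A, m):
    return np.abs(I - reconstruct(I, A, m)).sum()   # simple metric, Eq. (4)

def is_car(gray, grad, car_g, bg_g, car_grad, bg_grad):
    """Each of the last four arguments is an (A, m) pair for one of the
    four eigenimage sets (grayscale/gradient, car/background)."""
    d1, d2 = distance(gray, *car_g), distance(gray, *bg_g)
    d3, d4 = distance(grad, *car_grad), distance(grad, *bg_grad)
    return d2 / d1 + d4 / d3 > 2.0         # classifier, Eq. (3)
```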
2.2. Detector using boosting

A completely different approach to object detection can be found in [2], where Murphy et al. used boosting to combine a number of weak classifiers, each only slightly correlated with the true classification, into a strong classifier.

Each weak classifier computes a feature value for a detection window. First, the patch is convolved with one of 13 filters [2], including a delta function, Gaussian derivatives, a Laplacian, corner detectors and edge detectors (examples are in the first row of Figure 3). Then one of 30 spatial templates is applied to the filtered patch (examples are in the second row of Figure 3). Finally, we compute a histogram of the cropped, filtered patch and take its variance or kurtosis. Overall, there are $13 \times 30 \times 2 = 780$ weak classifiers of the form

$$f_i = \sum_{x} w_i(x) \big( |I(x) * g_i(x)|^{\gamma_i} \big) \qquad (6)$$

where $g_i(x)$ is a filter, $w_i(x)$ is a spatial template, $|\cdot|$ denotes the histogram, and $\gamma_i = 2, 4$ corresponds to the variance or the kurtosis.

Figure 3 (from [2]): First row – some examples from the filters dictionary; second row – some examples from the spatial templates dictionary.

The general idea of boosting is to create a strong classifier from weak ones using the additive model (7) by minimizing the exponential loss (8). At each round of boosting, examples from the training set that were correctly classified by many of the previous weak classifiers get lower weights, and frequently misclassified examples get higher weights:

$$F(x) = f_1(x) + f_2(x) + f_3(x) + \cdots \qquad (7)$$

$$J(F) = \sum_{t=1}^{N} e^{-y_t F(x_t)} \qquad (8)$$

where $f_i(x)$ are the weak classifiers, $F(x)$ is the strong classifier, $x_t$ are the training examples, $y_t \in \{-1, 1\}$ is the ground truth for the training examples, and $N$ is the number of training examples. From (8) it is clear that if the predicted classification $F(x_t)$ has the same sign as the corresponding ground truth $y_t$, the loss function $J(F)$ becomes smaller.

There are numerous variations of boosting techniques; like Murphy et al. [2], we employ the GentleBoost version, which can be described as follows [4]:

1. Initialization: weights $w_t = 1/N$, $F(x) = 0$.
2. For each round of boosting $i = 1 \ldots M$:
   2.1. Select the optimal weak classifier $f_i(x)$ and the regression stump parameters $a$, $b$, $\theta$ that solve the weighted least-squares problem
   $$\min_{f, a, b, \theta} \sum_{t=1}^{N} w_t \big[ y_t - \big( a \cdot [f_i(x_t) > \theta] + b \big) \big]^2 \qquad (9)$$
   2.2. Evaluate the optimal weak classifier from the previous step in the form of a regression stump:
   $$f_i^{best}(x) = a \cdot [f_i(x) > \theta] + b \qquad (10)$$
   2.3. Update the strong classifier:
   $$F(x) \leftarrow F(x) + f_i^{best}(x) \qquad (11)$$
   2.4. Update the weights:
   $$w_t \leftarrow w_t \cdot \exp[-y_t f_i^{best}(x_t)] \qquad (12)$$
3. Output the strong classifier:
   $$\mathrm{sign}[F(x)] = \mathrm{sign}\Big[\sum_{i=1}^{M} f_i^{best}(x)\Big] \qquad (13)$$

The number of weak classifiers $M$ incorporated into the final strong classifier is an important parameter.

Figure 4: ROC curves for the GentleBoost detector.

In Figure 4 we can see how the ROC curve improves as the number of rounds $M$ increases. The training and test datasets are the same as in Section 2.1. One of the major advantages of boosting algorithms is that they are almost immune to overfitting. This phenomenon is not yet fully explained in the literature, but the ROC curves did not show any signs of overfitting in our simulations even for more than 600 rounds.
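As an illustration, here is a compact numpy sketch of the GentleBoost loop over precomputed feature values. The exhaustive stump search and the weight normalization are standard choices we assume here; V stands for an (n_samples, n_features) matrix holding the 780 feature values (6) for each training example.

```python
import numpy as np

def fit_stump(V, y, w):
    """Exhaustively fit a regression stump a*[v > theta] + b, Eq. (9).
    y is in {-1, +1}; w are the current example weights."""
    best = None
    for j in range(V.shape[1]):
        for theta in np.unique(V[:, j])[:-1]:
            idx = V[:, j] > theta
            # Weighted least-squares optimum: b fits the examples left of
            # the threshold, a + b fits those right of it.
            b = (w[~idx] * y[~idx]).sum() / max(w[~idx].sum(), 1e-12)
            a = (w[idx] * y[idx]).sum() / max(w[idx].sum(), 1e-12) - b
            err = (w * (y - (a * idx + b)) ** 2).sum()
            if best is None or err < best[0]:
                best = (err, j, theta, a, b)
    return best[1:]

def gentleboost(V, y, rounds=200):
    n = len(y)
    w = np.full(n, 1.0 / n)                       # step 1: uniform weights
    stumps = []
    for _ in range(rounds):
        j, theta, a, b = fit_stump(V, y, w)       # step 2.1
        f = a * (V[:, j] > theta) + b             # step 2.2, Eq. (10)
        w *= np.exp(-y * f)                       # step 2.4, Eq. (12)
        w /= w.sum()                              # normalize (standard practice)
        stumps.append((j, theta, a, b))           # step 2.3 accumulates F(x)
    return stumps               # sign of the summed stump outputs is Eq. (13)
```

The double loop over features and candidate thresholds favors clarity over speed; sorting each feature column once would make the stump search considerably faster.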
From a comparison of Figure 2 and Figure 4 we can see that the detector based on the PCA reconstruction (40 eigenimages) works slightly better at small false alarm rates (less than 1%) than the GentleBoost detector (200 rounds). However, at higher false alarm rates, the GentleBoost detector (200 rounds) performs better. It remains an open question whether a GentleBoost detector with a very large number of weak classifiers can outperform the PCA detector even at small false alarm rates.

Figure 5: Time (s) to classify 750 windows using the different algorithms: GentleBoost, PCA with mutual information, and PCA with the simple difference.

In Section 3 we used the PCA-based detector, for two reasons: a) it is apparent from Figure 5 that GentleBoost in its current implementation is too slow; b) the PCA detector shows better performance at low false alarm rates.

2.3. Practical implementation

After applying either detector to an image, we obtain 20–200 bounding boxes of different sizes and confidences, mostly around cars. Pruning is therefore necessary to eliminate multiple detections around the same objects, and the following technique was adopted. Consider two intersecting bounding boxes, where box2 has a higher probability of containing a car than box1. We keep box1 if the ratio of the intersection area of the two boxes to the area of their union is less than the maximum overlap (we used 0.4 as a trade-off between separate detection of neighboring cars and cluttering the image with detection boxes) and discard it otherwise.

Figure 6: On the left, the 68 bounding boxes before pruning; on the right, the 6 boxes that remain after pruning.

The selected size of the detection window is 150 × 80 pixels, and we worked with images of size 640 × 480. The following set of scale factors for the original image was selected in order to detect cars of various sizes:

$$scales = [0.4, 0.5, 0.7, 0.8, 0.9, 1, 1.1, 1.15, 1.25] \qquad (14)$$

By increasing the number of scales in (14) we can improve detection; by decreasing it we can reduce computational complexity. We can also dramatically reduce computation time by shifting the detection window not by one but by several pixels at a time. Empirically, we selected a shift of 5 pixels for the first three scales and 10 pixels for the remaining ones.
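The pruning rule above is a form of greedy non-maximum suppression. One common way to implement it is sketched below, assuming boxes are given as (x1, y1, x2, y2) corner coordinates and scores are the detector confidences:

```python
import numpy as np

def iou(b1, b2):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(b1[0], b2[0]), max(b1[1], b2[1])
    ix2, iy2 = min(b1[2], b2[2]), min(b1[3], b2[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    a1 = (b1[2] - b1[0]) * (b1[3] - b1[1])
    a2 = (b2[2] - b2[0]) * (b2[3] - b2[1])
    return inter / (a1 + a2 - inter)

def prune(boxes, scores, max_overlap=0.4):
    """Greedy pruning (Section 2.3): keep a box unless it overlaps a
    higher-scoring kept box by more than max_overlap."""
    order = np.argsort(scores)[::-1]          # strongest detections first
    kept = []
    for i in order:
        if all(iou(boxes[i], boxes[k]) <= max_overlap for k in kept):
            kept.append(i)
    return kept
```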
3. Bayesian framework

We can further improve detection by integrating the output of a local detector with contextual information. Some possible contextual cues are presented in Section 3.1. In Section 3.2 we present the Bayesian network that integrates all of this information, and Section 3.3 shows how the framework works on one particular example.

3.1. Contextual cues

There are two valuable geometric cues that give us a tight prior on the location and scale of a car. The first has to do with perspective projection. As shown in [1] (formulas (1)–(7) there), we can approximate the relationship between the real-world height of an object $y_o$, its height $h_i$ and bottom position $v_i$ in the image, the position of the horizon line in the image $v_0 \in [0, 1]$, and the real-world height of the camera $y_c$ in the following way:

$$y_o \approx \frac{y_c h_i}{v_0 - v_i} \qquad (15)$$

The derivation of (15) assumes that the object rests on the ground and that the camera tilt is small. From (15) we can see that the probability of a car $o_i$ given the camera parameters $\theta = [v_0, y_c]$ is proportional to the probability of the car's height in the image given $\theta$ and $v_i$:

$$p(o_i|\theta) \sim p(h_i|v_i, \theta) \qquad (16)$$

The second cue is that a detection window containing a car can be considered to consist mostly of vertical surfaces with the ground below. The possible surface geometries $g_i$ for a detection window are of three types: ground, vertical, and vertical with ground below. Originally, Hoiem et al. [6] used a rather complicated general approach to differentiate ground and vertical surfaces. We propose a much simpler way to find the ground surface, analogous to the magic wand in Photoshop, and it turned out to work sufficiently well for this particular constrained problem. The ground-extraction algorithm is as follows: we automatically select points that are on the road most of the time (in the bottom-left corner of the image). We then compute the Euclidean distances between the RGB values of these reference pixels and those of all other image pixels. We select only the pixels that are close (in terms of RGB distance) and find a connected region that incorporates these pixels as well as the reference pixels. An example is shown in Figure 8.b.
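A minimal sketch of this magic-wand-style extraction is given below, assuming a 20 × 20 reference patch in the bottom-left corner and an RGB distance threshold of 30; both values are illustrative, as the paper does not specify them.

```python
import numpy as np
from scipy import ndimage

def extract_ground(img, thresh=30.0):
    """Magic-wand-style ground extraction from Section 3.1.
    img: (H, W, 3) RGB array. Returns a boolean ground mask."""
    h, w, _ = img.shape
    # Reference pixels: a patch in the bottom-left corner, assumed to
    # lie on the road in our dataset.
    ref = img[h - 20:, :20].reshape(-1, 3).astype(float).mean(axis=0)
    # Euclidean RGB distance from every pixel to the reference color.
    dist = np.sqrt(((img.astype(float) - ref) ** 2).sum(axis=2))
    # Keep only the connected region of "close" pixels that contains
    # the reference patch.
    labels, _ = ndimage.label(dist < thresh)
    ref_labels = labels[h - 20:, :20]
    ground_label = np.bincount(ref_labels[ref_labels > 0]).argmax()
    return labels == ground_label
```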
3.2. Bayesian network

It is possible to estimate the viewpoint, the object locations and scales, and the surface geometry of the scene separately. However, the estimation can be more accurate and robust if we take advantage of the interactions between all these elements. Following [1], we assume that objects and geometric surfaces produce image evidence $e_o$ and $e_g$, respectively. The viewpoint $\theta = [v_0, y_c]$ directly influences the positions and sizes of the objects $o_i$, and the objects directly influence the underlying geometric surfaces $g_i$. These dependencies can be represented as a Bayesian network (Figure 7). In this model, objects are independent given the viewpoint, and geometric surfaces are independent given their corresponding objects. The decomposition of the joint distribution has the following form:

$$p(\theta, o, g, e_o, e_g) = p(\theta) \prod_i p(o_i|\theta)\, p(e_{o_i}|o_i)\, p(g_i|o_i)\, p(e_{g_i}|g_i) \qquad (17)$$

Figure 7: Bayesian network.

Using Bayes' rule, we can express the likelihood of the scene given the evidence from the image in the following way:

$$p(\theta, o, g \,|\, e_o, e_g) \propto p(\theta) \prod_i p(o_i|\theta)\, \frac{p(o_i|e_{o_i})}{p(o_i)}\, p(g_i|o_i)\, \frac{p(g_i|e_{g_i})}{p(g_i)} \qquad (18)$$

Equation (18) summarizes all the dependencies between the detections and the contextual cues. A priori, the camera height $y_c$ and the horizon position $v_0$ are considered independent:

$$p(\theta) = p(v_0)\, p(y_c) \qquad (19)$$

$p(v_0)$ and $p(y_c)$ are modeled as Gaussian variables with mean values 0.5 and 1.2, respectively, so $p(\theta)$ is also Gaussian (see Figure 8.a).

$p(o_i|e_{o_i})$ is the output of the local detector for each position and scale in the image. If the confidence in a particular local detection after pruning is higher than a certain level, this detection becomes an object $o_i$ in the Bayesian framework. Essentially, the number of objects $o_i$ should be larger than the number of cars in the image, and after we run inference on the Bayesian network the confidence in true detections should increase, while the confidence in wrong detections should decrease. In order to calculate the probability $p(o_i|e_{o_i})$, we fitted a logistic regression to the outputs of the binary classifiers of Sections 2.1 and 2.2. However, since all detections after pruning have very high confidence (> 90%), in practice we can assign the same high probability to all detection windows (objects $o_i$).

$p(o_i|\theta)$ is proportional to the probability of the car's image height (16), which by (15) has a Gaussian distribution with parameters

$$\mu = \frac{\mu_o (v_0 - v_i)}{y_c}, \qquad \sigma = \frac{\sigma_o (v_0 - v_i)}{y_c} \qquad (20)$$

where $\mu_o = 1.4$ and $\sigma_o = 0.3$ are typical values for cars.

$p(g_i|e_{g_i})$ is calculated using the method described in Section 3.1, while $p(g_i|o_i)$ and $p(g_i)$ for the three possible values of $g_i$ are estimated either from the training dataset or empirically.

Once we know the conditional probabilities and the priors in the model (Figure 7), we can ask: what are the posterior distributions of all the variables? There are a number of exact and approximate inference algorithms for Bayesian networks. We used Pearl's belief propagation, which is implemented in the Bayes Net Toolbox [7] and is one of the most efficient inference algorithms for tree-structured graphs.

Figure 8: a) prior distribution for the camera parameters $\theta$; b) bounding boxes from the local detector and the extracted ground; c), d) distribution for $\theta$ and confidence in the bounding boxes after inference without the geometric cues; e), f) distribution for $\theta$ and confidence in the bounding boxes after inference with the geometric cues.

3.3. Example of the improved detection

We can gain insight into how the algorithm works by examining the results in Figure 8. Figure 8.b shows the output of the local detector after pruning, as well as the extracted ground plane. The corresponding Gaussian prior distribution for the camera parameters is shown in Figure 8.a. If we combine the camera parameters $\theta$ with the local detections $o_i$ alone (i.e., with the $g_i$ omitted from the model in Figure 7), the confidences of the bounding boxes change (Figure 8.d), but not in the desired way. This happens because there are two competing hypotheses for the camera parameters (Figure 8.c): one, with a horizon estimate of 0.7, is right; the other, with a lower horizon estimate, is wrong but is supported by the two small detection boxes on the right. However, once we apply the entire model (Figure 7), we obtain good confidence estimates for the bounding boxes (Figure 8.f), as well as a suppressed wrong peak in the distribution of the camera parameters (Figure 8.e).
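To make the inference step concrete, the sketch below performs exact inference in a simplified version of the model by discretizing the viewpoint $\theta$ on a grid and summing out the objects analytically, which is equivalent to belief propagation on this tree. The prior standard deviations, the flat background height likelihood, and the omission of the geometry factors $p(g_i|o_i)\,p(g_i|e_{g_i})/p(g_i)$ (they could be folded into each detection's confidence) are our own simplifications; the actual experiments used the Bayes Net Toolbox [7].

```python
import numpy as np
from scipy.stats import norm

def infer_scene(dets, v0_grid, yc_grid, mu_o=1.4, sigma_o=0.3, bg_lik=1.0):
    """Exact inference for a simplified version of the Figure 7 model.
    dets: list of (v_i, h_i, conf_i) with the image-relative bottom
    position and height of each candidate box and the local detector
    confidence p(o_i | e_oi). bg_lik is an assumed flat likelihood of
    the observed box height when there is no car."""
    V0, YC = np.meshgrid(v0_grid, yc_grid, indexing="ij")
    # Gaussian priors on horizon and camera height, Eq. (19); the
    # standard deviations are illustrative.
    log_p = norm.logpdf(V0, 0.5, 0.15) + norm.logpdf(YC, 1.2, 0.3)
    ratios = []
    for v_i, h_i, conf_i in dets:
        mu = mu_o * (V0 - v_i) / YC                  # Eq. (20)
        sig = np.abs(sigma_o * (V0 - v_i) / YC) + 1e-9
        car = conf_i * norm.pdf(h_i, mu, sig)        # car hypothesis
        bg = (1.0 - conf_i) * bg_lik                 # background hypothesis
        ratios.append(car / (car + bg))              # p(o_i = 1 | theta, e)
        log_p += np.log(car + bg)                    # sum out o_i
    post_theta = np.exp(log_p - log_p.max())
    post_theta /= post_theta.sum()                   # p(theta | e) on the grid
    post_o = [(post_theta * r).sum() for r in ratios]
    return post_theta, post_o
```

Calling, for example, infer_scene(dets, np.linspace(0.2, 0.9, 50), np.linspace(0.5, 2.5, 40)) returns the posterior over the viewpoint grid together with an updated confidence for each box, the kind of update illustrated in Figure 8.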
4. Conclusion

In this course project, two efficient types of algorithms for local detection in still images were implemented and compared. The first, based on PCA reconstruction, demonstrated better performance at low false alarm rates than the second, which uses GentleBoost. The former is also much more computationally efficient than the latter, although a better implementation might show different results. The results from the local detector were then integrated with contextual cues, namely the underlying geometric surfaces and the camera parameters, using a Bayesian network. As was seen in Section 3.3, this technique can further improve the results by eliminating detections that are unlikely from the point of view of the overall scene.

There are, however, a number of ways to improve the algorithm. The major problem is computational efficiency: currently an entire picture is processed in around 4 minutes with the PCA local detector. This is still too demanding for real applications, so further refinements, such as restricting the detection regions and using larger shifts between detection windows, should be employed. It is also possible to achieve some improvement by training on larger datasets and by modeling detections as dependent on each other even given the viewpoint.

References

[1] D. Hoiem, A. Efros, M. Hebert. Putting objects in perspective. IJCV, 80:3–15, 2008.
[2] K. Murphy, A. Torralba, W. Freeman. Using the forest to see the trees: a graphical model relating features, objects, and scenes. NIPS 16, 2003.
[3] L. Malagon-Borja, O. Fuentes. Object detection using image reconstruction with PCA. Image and Vision Computing, 27:2–9, 2009 (online 2007).
[4] J. Friedman, T. Hastie, R. Tibshirani. Additive logistic regression: a statistical view of boosting. The Annals of Statistics, 28(2):337–407, 2000.
[5] G. Heitz, D. Koller. Learning spatial context: using stuff to find things. Proceedings of the 10th ECCV, pages 30–43, 2008.
[6] D. Hoiem, A. Efros, M. Hebert. Geometric context from a single image. ICCV, 2005.
[7] Bayes Net Toolbox for Matlab by K. Murphy: http://www.cs.ubc.ca/~murphyk/Software/BNT/bnt.html
[8] GentleBoost library for Matlab by A. Torralba: http://people.csail.mit.edu/torralba/shortCourseRLOC/boosting/boosting.html