
UNIVERSITY OF CALIFORNIA, SAN DIEGO
Video-based Car Surveillance: License Plate, Make, and Model Recognition
A thesis submitted in partial satisfaction of the
requirements for the degree Master of Science
in Computer Science
by
Louka Dlagnekov
Committee in charge:
Professor Serge J. Belongie, Chairperson
Professor David A. Meyer
Professor David J. Kriegman
2005
Copyright
Louka Dlagnekov, 2005
All rights reserved.
The thesis of Louka Dlagnekov is approved:
Chair
University of California, San Diego
2005
TABLE OF CONTENTS
Signature Page . . . iii
Table of Contents . . . iv
List of Figures . . . vi
List of Tables . . . ix
Acknowledgments . . . x
Abstract . . . xi

I   Introduction . . . 1
    1.1. Problem Statement . . . 2
    1.2. Social Impact . . . 3
    1.3. Datasets . . . 4
    1.4. Thesis Structure . . . 6

II  License Plate Detection . . . 7
    2.1. Introduction . . . 7
    2.2. Previous Work . . . 8
    2.3. Feature Selection . . . 9
    2.4. The AdaBoost Algorithm . . . 13
    2.5. Optimizations . . . 14
         2.5.1. Integral Images . . . 14
         2.5.2. Cascaded Classifiers . . . 16
    2.6. Results . . . 18
         2.6.1. Datasets . . . 18
         2.6.2. Results . . . 21
    2.7. Future Work . . . 24

III License Plate Recognition . . . 25
    3.1. Tracking . . . 25
    3.2. Super-Resolution . . . 26
         3.2.1. Registration . . . 29
         3.2.2. Point Spread Function . . . 30
         3.2.3. Algorithm . . . 30
         3.2.4. Maximum Likelihood Estimate . . . 31
         3.2.5. Maximum a Posteriori Estimate . . . 32
         3.2.6. Discussion . . . 37
    3.3. Optical Character Recognition . . . 38
         3.3.1. Previous Work . . . 38
         3.3.2. Datasets . . . 39
         3.3.3. Template Matching . . . 40
         3.3.4. Other Methods . . . 42
    3.4. Results . . . 42

IV  Make and Model Recognition . . . 46
    4.1. Previous Work . . . 46
    4.2. Datasets . . . 47
    4.3. Appearance-based Methods . . . 48
         4.3.1. Eigencars . . . 50
    4.4. Feature-based Methods . . . 58
         4.4.1. Feature Extraction . . . 58
         4.4.2. Shape Contexts . . . 61
         4.4.3. Shape Context Matching . . . 63
         4.4.4. SIFT Matching . . . 64
         4.4.5. Optimizations . . . 71
    4.5. Summary of Results . . . 71

V   Conclusions and Future Work . . . 74
    5.1. Conclusions . . . 74
         5.1.1. Difficulties . . . 74
    5.2. Future Work . . . 76
         5.2.1. Color Inference . . . 76
         5.2.2. Database Query Algorithm Development . . . 77
         5.2.3. Make and Model 3-D Structure . . . 78

Bibliography . . . 79
LIST OF FIGURES
1.1  (a) A Dutch license plate and (b) a California license plate. Most cars in our datasets have plates of the form shown in (b), but at a much lower resolution. . . . 3
1.2  A frame from the video stream of (a) the ‘Regents’ dataset and (b) the ‘Gilman’ dataset. . . . 5
1.3  (a) 1,200 of 1,520 training examples for the ‘Regents’ dataset. (b) Same images variance normalized. . . . 6
2.1  PCA on 1,520 license plate images. Note that about 70 components are required to capture 90% of the energy. . . . 10
2.2  The means of the absolute value of the (a) x-derivative, and (b) y-derivative, and the variance of the (c) x-derivative, and (d) y-derivative. . . . 10
2.3  Types of features selected by AdaBoost. The sum of values computed over colored regions are subtracted from the sum of values over non-colored regions. . . . 11
2.4  Typical class conditional densities for weak classifier features. For some features, there is clearly a large amount of error that cannot be avoided when making classifications, however this error is much smaller than the 50% AdaBoost requires to be effective. . . . 12
2.5  (a) The integral image acceleration structure. (b) The sum of the values in each rectangular region can be computed using just four array accesses. . . . 15
2.6  A cascaded classifier. The early stages are very efficient and good at rejecting the majority of false windows. . . . 16
2.7  The three sets of positive examples used in training the license plate detector – sets 1, 2, and 3, with a resolution of 71 × 16, 80 × 19, and 104 × 31, respectively. . . . 19
2.8  ROC curves for a 5-stage cascade trained using 359 positive examples and three different choices of negative training examples. . . . 20
2.9  ROC curves for (a) a single-stage, 123-feature detector, and (b) a 6-stage cascaded detector, with 2, 3, 6, 12, 40, and 60 features per stage respectively. The sizes of the images trained on in sets 1, 2, and 3 are 71 × 16, 80 × 19, and 104 × 31 respectively. The x-axis scales in (a) and (b) were chosen to highlight the performance of the detector on each set. . . . 22
2.10 Examples of regions incorrectly labeled as license plates in the set 3 test set. . . . 23
2.11 Detection on an image from the Caltech Computer Vision group’s car database. . . . 24
3.1  A car tracked over 10 frames (1.7 seconds) with a blue line indicating the positions of the license plate in the tracker. . . . 27
3.2  Our image formation model. The (a) full-resolution image H undergoes (b) a geometric transformation T_k followed by (c) a blur with a PSF h(u, v); is (d) sub-sampled by S, and finally (e) additive Gaussian noise η is inserted. The actual observed image L̂_k from our camera is shown in (f). The geometric transformation is exaggerated here for illustrative purposes only. . . . 28
3.3  (a) The Huber penalty function used in the smoothness prior with α = 0.6 and red and blue corresponding to the regions |x| ≤ α and |x| > α respectively; (b) an un-scaled version of the bi-modal prior with µ0 = 0.1 and µ1 = 0.9. . . . 34
3.4  Super-resolution results: (a) sequence of images processed, (b) an up-sampled version of one low-resolution image, (c) the average image, (d) the final high-resolution estimate. . . . 37
3.5  The alphabet created from the training set. There are 10 examples for each character for the low-resolution, average image, and super-resolution classes, shown in that respective order. . . . 40
3.6  Character frequencies across our training and test datasets. . . . 41
3.7  Template matching OCR results on the low-resolution test set for ‘standard’ and ‘loose’ comparisons between recognized characters and actual characters. . . . 43
3.8  Recognition results for the images in our test set. Each horizontal section lists plates whose read text contained 0, 1, 2, 3, 4, 5, 6, and 7 mistakes. . . . 44
4.1  Our automatically generated car database. Each image is aligned such that the license plate is centered a third of the distance from bottom to top. Of these images, 1,102 were used as examples, and 38 were used as queries to test the recognition rates of various methods. We used the AndreaMosaic photo-mosaic software to construct this composite image. . . . 49
4.2  (a) The average image, and (b) the first 10 eigencars. . . . 52
4.3  The first 19 query images and the top 10 matches in the database for each using all N eigencars. . . . 54
4.4  The second 19 query images and the top 10 matches in the database for each using all N eigencars. . . . 55
4.5  The first 19 query images and the top 10 matches in the database for each using N − 3 eigencars. . . . 56
4.6  The second 19 query images and the top 10 matches in the database for each using N − 3 eigencars. . . . 57
4.7  Harris corner detections on a car image. Yellow markers indicate occlusion junctions, formed by the intersection of edges on surfaces of different depths. . . . 59
4.8  Kadir and Brady salient feature extraction results on (a) a car image from our database, and (b) an image of a leopard. . . . 60
4.9  SIFT keypoints and their orientations for a car image. . . . 61
4.10 (a) Query car image with two interest points shown, (b) database car image with one corresponding interest point shown, (c) diagram of log-polar bins used for computing shape context histograms, (d,e,f) shape context histograms for points marked ‘B’, ‘C’, and ‘A’ respectively. The x-axis represents θ and the y-axis represents log r increasing from top to bottom. . . . 62
4.11 (a) Image edges and (b) a random sampling of 400 points from the edges in (a). . . . 64
4.12 Query images 1–10 and the top 10 matches in the database using SIFT matching. Yellow lines indicate correspondences between matched keypoints of the query (top) and database (bottom) images. . . . 67
4.13 Query images 11–20 and the top 10 matches in the database using SIFT matching. Yellow lines indicate correspondences between matched keypoints of the query (top) and database (bottom) images. . . . 68
4.14 Query images 21–29 and the top 10 matches in the database using SIFT matching. Yellow lines indicate correspondences between matched keypoints of the query (top) and database (bottom) images. . . . 69
4.15 Query images 30–38 and the top 10 matches in the database using SIFT matching. Yellow lines indicate correspondences between matched keypoints of the query (top) and database (bottom) images. . . . 70
LIST OF TABLES
2.1  Negative examples remaining during training at each stage of the cascade. The three training operations shown are (1) initial training with 10,052 randomly chosen negative examples, (2) first bootstrap training with an additional 4,974 negative examples taken from false positives, (3) second bootstrap operation with another 4,974 negative examples taken from false positives from the previous training stage. . . . 23
4.1  Summary of overall recognition rates for each method. . . . 71
4.2  Test set of queries used with ‘Size’ indicating the number of cars similar to the query in the database and which method classified each query correctly. . . . 73
ACKNOWLEDGEMENTS
I would like to thank the following people for helping make this thesis
possible:
Serge Belongie for being there every step of the way and always being
available for consultation, even at three in the morning. David Meyer for arranging
funding and for ongoing consultation. David Kriegman for very helpful initial
guidance.
My family for being understanding and supportive throughout my education. My best friend Brian for many enlightening discussions and for proofreading drafts.
David Rose of the UCSD Police Department and Robert Meza of the
Campus Loss Prevention Center for providing access to car video data.
This work has been partially supported by DARPA under contract
F49620-02-C-0010.
ABSTRACT OF THE THESIS
Video-based Car Surveillance: License Plate, Make, and Model Recognition
by
Louka Dlagnekov
Master of Science in Computer Science
University of California, San Diego, 2005
Professor Serge J. Belongie, Chair
License Plate Recognition (LPR) is a fairly well explored problem and is already
a component of several commercially operational systems. Many of these systems,
however, require sophisticated video capture hardware possibly combined with infrared strobe lights, or exploit the large size of license plates in certain geographical
regions and the (artificially) high discriminability of characters. One of the goals
of this project is to develop an LPR system that achieves a high recognition rate
without the need for a high quality video signal from expensive hardware. We also
explore the problem of car make and model recognition for purposes of searching
surveillance video archives for a partial license plate number combined with some
visual description of a car. Our proposed methods will provide valuable situational
information for law enforcement units in a variety of civil infrastructures.
Chapter I
Introduction
License plate recognition (LPR) is widely regarded to be a solved problem,
the technology behind the London Congestion Charge program being a well-known
example. In an effort to reduce traffic congestion in Central London, the city
imposes a daily fee on motorists entering a specified zone [21]. In order to automate
the enforcement of the fee, over two hundred closed-circuit television (CCTV)
cameras are in operation whose video streams are processed by an LPR system. If
a plate is found whose registered owner has not paid the fee, the owner is fined.
Other LPR systems are used by the U.S. Customs for more efficient cross-checks in the National Crime Information Center (NCIC) and Treasury Enforcement Communications System (TECS) for possible matches with criminal suspects [22]. The 407 ETR toll road in Ontario, Canada, also uses LPR to fine
motorists who do not carry a radio transponder and have not paid a toll fee. In
the Netherlands, LPR systems are in place that are fully automated, from detecting
speeding violations, to reading the license plate and billing the registered owner.
All of these systems treat license plates as cars’ fingerprints. In other
words, they determine a vehicle’s identity based solely on the plate attached to
it. One can imagine, however, a circumstance where two plates from completely
different make and model cars are swapped with malicious intent, in which case
these systems would not find a problem. We as humans are also not very good
at reading cars’ license plates unless they are quite near us, nor are we very good
at remembering all the characters. However, we are good at identifying and remembering the appearance of cars, and therefore their makes and models, even
when they are speeding away from us. In fact, the first bit of information Amber
Alert signs show is the car’s make and model and then its license plate number,
sometimes not even a complete number. Therefore, given the description of a car
and a partial license plate number, the authorities should be able to query their
surveillance systems for similar vehicles and retrieve a timestamp of when that
vehicle was last seen along with archived video data for that time.
Despite the complementary nature of license plate and make and model
information, to the best of our knowledge, make and model recognition is an unexplored problem. Considerable research has been done on detecting cars in satellite
imagery and detecting and tracking cars in video streams, but we are unaware
of any work on the make and model recognition (MMR) aspect. Because of the
benefits that could arise from the unification of LPR and MMR, we explore both
problems in this thesis.
1.1 Problem Statement
Although few details are released to the public about the accuracy of commercially deployed LPR systems, it is known that they work well under controlled conditions and require high-resolution imaging hardware. Most of the academic research in this area also requires high-resolution images or relies on geographically specific license plates, taking advantage of the large spacing between characters in those regions and even of the special character features of commonly misread characters, as shown in Figure 1.1 (a). Although the majority of license plates in our
datasets were Californian and in the form of Figure 1.1 (b), the difficulty of the
recognition task is comparable to that of other United States plates. The image shown in
Figure 1.1 (b) is of much higher resolution than the images in our datasets and is
Figure 1.1: (a) A Dutch license plate and (b) a California license plate. Most
cars in our datasets have plates of the form shown in (b), but at a much lower
resolution.
shown for illustrative purposes only.
Our goal in this thesis is to design a car recognition system for surveillance purposes that, given low-resolution video data as input, maintains a database of the license plate and make and model information of all cars observed, so that queries on license plates and on makes and models can be performed. We do not explore algorithms for such queries in this thesis, but our results provide a valuable foundation for that task.
1.2 Social Impact
The use of any system that stores personally identifiable information
should be strictly monitored for adherence to all applicable privacy laws. Our
system is no exception. Since license plates can be used to personally identify individuals, queries to the collected surveillance database should only be performed by authorized users and only when necessary, such as in car theft or child abduction circumstances. Because our system is query-driven rather than alarm-driven, where by alarm-driven we mean the system issues an alert when a particular behavior is observed (such as running a red light), slippery-slope arguments toward
a machine-operated automatic justice system do not apply here. The query-driven
aspect also alleviates fears that such technology could be used to maximize state
revenue rather than to promote safety.
Although there exists the possibility of abuse of our system, this possibility exists in other systems as well, such as the financial databases employed by banks and
other institutions that hold records of persons’ social security numbers. Even cell
phone providers can determine a subscriber’s location by measuring the distance
between the phone and cell towers in the area. In the end we feel the benefits of
using our system far outweigh the potential negatives, and it should therefore be
considered for deployment.
1.3 Datasets
We made use of two video data sources in developing and testing our
LPR and MMR algorithms. We shall refer to them as the ‘Regents’ dataset and
the ‘Gilman’ dataset. The video data in both datasets is captured from digital
video cameras mounted on top of street lamp poles overlooking stop signs. Figure
1.2 shows a typical frame captured from both cameras. These cameras, along
with nearly 20 others, were set up in the Regents parking lots of UCSD as part
of the RESCUE-ITR (Information Technology Research) program by the UCSD
Police Department. The ‘Regents’ video stream has a resolution of 640 × 480 and
sampling is done at 10 frames per second, while the ‘Gilman’ video stream has a
resolution of 720 × 480 and is sampled at 6 frames per second.
Due to the different hardware and different spatial positions of both cameras, the datasets have different characteristics. The camera in the ‘Regents’
dataset is mounted at a much greater distance from the stop sign and is set to
its full optical zoom, while the ‘Gilman’ camera is much closer. The plates
in the ‘Regents’ dataset are therefore much smaller, but exhibit less projective
distortion as cars move through the intersection. On the other hand, the ‘Gilman’
camera is of higher quality, which, combined with the larger plate sizes, made for
an easier character recognition task.
Since only about a thousand cars pass through both intersections in an
8-hour recording period, some sort of automation was necessary to at least scan
through the video stream to find frames containing cars. An application was
Figure 1.2: A frame from the video stream of (a) the ‘Regents’ dataset and (b) the
‘Gilman’ dataset.
written for this purpose; it searches frames for cars (using a crude but effective method of red color component thresholding, which catches cars’ taillights) and facilitates the process of extracting training data by providing an interface for hand-clicking on points.
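As a rough sketch of this frame-scanning step (not the exact implementation used in the thesis; the threshold values and function name are illustrative assumptions), the red-component test can be written as:

    import numpy as np

    def frame_may_contain_car(frame_rgb, red_thresh=180, dominance=40, min_pixels=50):
        """Crude taillight test: count pixels whose red channel is both bright
        and clearly stronger than the green and blue channels.
        Threshold values are illustrative assumptions."""
        r = frame_rgb[:, :, 0].astype(int)
        g = frame_rgb[:, :, 1].astype(int)
        b = frame_rgb[:, :, 2].astype(int)
        mask = (r > red_thresh) & (r - g > dominance) & (r - b > dominance)
        return int(mask.sum()) >= min_pixels

Frames that fail this test can be skipped entirely, so only a small fraction of the video needs to be inspected when collecting training data.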
Using this process, over 1,500 training examples were extracted for the
‘Regents’ dataset as shown on Figure 1.3(a). In the figure, time flows in raster
scan order, such that the top left license plate image was captured at 8am and
the bottom right at 4pm. Note the dark areas in the image – this is most likely
a result of cloud cover, and this illumination change can be accounted for by
variance normalizing the images as shown in Figure 1.3(b). Although this variance
normalization technique does improve the consistency of license plate examples, it
had little effect on the overall results and was therefore not used, to avoid unnecessary
computation. However, we point it out as a reasonable solution to concerns that
illumination differences may adversely affect recognition.
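Variance normalization of this kind is a one-line operation per image; a minimal sketch (assuming grayscale plate images stored as NumPy arrays) is:

    import numpy as np

    def variance_normalize(image):
        """Zero-mean, unit-variance normalization used to compensate for the
        illumination changes visible in Figure 1.3."""
        img = image.astype(float)
        return (img - img.mean()) / (img.std() + 1e-9)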
Unless otherwise indicated, all references to datasets shall refer to the
‘Gilman’ dataset.
Figure 1.3: (a) 1,200 of 1,520 training examples for the ‘Regents’ dataset. (b)
Same images variance normalized.
1.4 Thesis Structure
Chapter 2 discusses the design and performance of a license plate detector trained in a boosting framework. In Chapter 3 we present several important pre-processing steps on detected license plate regions and describe a simple algorithm to perform optical character recognition (OCR). The problem of make and model recognition is explored in Chapter 4, where we evaluate several well-known and some state-of-the-art object recognition algorithms in this novel setting.
We conclude the thesis in Chapter 5 and discuss ideas for future research on car
recognition.
Chapter II
License Plate Detection
2.1 Introduction
In any object recognition system, there are two major problems that need
to be solved: detecting an object in a scene and recognizing it, with detection being a necessary prerequisite. In our system, the quality of the license
plate detector is doubly important since the make and model recognition subsystem
uses the location of the license plate as a reference point when querying the car
database. In this chapter we shall discuss our chosen detection mechanism.
The method we employ for detecting license plates can be described as
follows. A window of interest, of roughly the dimensions of a license plate image,
is placed over each frame of the video stream and its image contents are passed as
input to a classifier whose output is 1 if the window appears to contain a license
plate and 0 otherwise. The window is then placed over all possible locations in the
frame and candidate license plate locations are recorded for which the classifier
outputs a 1.
In reality, this classifier, which we shall call a strong classifier, weighs the
decisions of many weak classifiers, each specialized for a different feature of license
plates, thereby making a much more accurate decision. This strong classifier is
trained using the AdaBoost algorithm. Over several rounds, AdaBoost selects the
best performing weak classifier from a set of weak classifiers, each acting on a single
feature. The AdaBoost algorithm is discussed in detail in Section 2.4.
Scanning every possible location of every frame would be very slow were
it not for two key optimization techniques introduced by Viola and Jones – integral
images and cascaded classifiers [49]. The integral image technique allows for an
efficient implementation and the cascaded classifiers greatly speed up the detection
process, as not all classifiers need be evaluated to rule out most non-license plate
sub-regions. With these optimizations in place, the system was able to process
10 frames per second at a resolution of 640 × 480 pixels. The optimizations are
discussed in Section 2.5.
Since the size of a license plate image can vary significantly with the
distance from the car to the camera, using a fixed-size window of interest is impractical. Window-based detection mechanisms often scan a fixed-size window
over a pyramid of image scales. Instead, we used three different sizes of windows,
each having a custom-trained strong classifier for that scale.
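The scanning procedure itself can be summarized by the following sketch (a minimal illustration only, without the integral-image and cascade optimizations of Section 2.5; the classifier interface, window sizes, and step size are assumptions):

    import numpy as np

    def detect_plates(frame, classifiers, window_sizes, step=4):
        """Slide a window of each size over the frame and record every location
        where the corresponding strong classifier outputs 1."""
        H, W = frame.shape
        candidates = []
        for (wh, ww) in window_sizes:        # one custom-trained classifier per scale
            classify = classifiers[(wh, ww)]
            for y in range(0, H - wh, step):
                for x in range(0, W - ww, step):
                    if classify(frame[y:y + wh, x:x + ww]) == 1:
                        candidates.append((x, y, ww, wh))
        return candidates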
2.2 Previous Work
Most LPR systems employ detection methods such as corner template
matching [20] and Hough transforms [26, 51] combined with various histogram-based methods. Kim et al. [28] take advantage of the color and texture of Korean
license plates (white characters on green background, for instance) and train a
Support Vector Machine (SVM) to perform detection. Their license plate images
range in size from 79 × 38 to 390 × 185 pixels, and they report processing low-resolution input images (320 × 240) in over 12 seconds on a Pentium 3 800 MHz,
with a 97.4% detection rate and a 9.4% false positive rate. Simpler methods, such
as adaptive binarization of an entire input image followed by character localization,
also appear to work as shown by Naito et al. [36] and [5], but are used in settings
with little background clutter and are most likely not very robust.
Since license plates contain a form of text, we decided to approach the detection
task as a text extraction problem. Of particular interest to us was the work done by
Chen and Yuille on extracting text from street scenes to aid reading for the blind [10].
Their work, based on the efficient object detection work by Viola and Jones [49],
uses boosting to train a strong classifier with a good detection rate and a very low
false positive rate. We found that this text detection framework also works well
for license plate detection.
2.3 Feature Selection
The goal of this section is to find good features in the image contents
of the window of interest, one for each weak classifier. The features to which the
weak classifiers respond are important in terms of overall accuracy and should be
chosen to discriminate well between license plates and non-license plates.
Viola and Jones use Haar-like features, where sums of pixel intensities
are computed over rectangular sub-windows [49]. Chen and Yuille argue that,
while this technique may be useful for face detection, text has little in common
with faces [10]. To support their assumption, they perform principal component
analysis (PCA) on their training examples and find that about 150 components are
necessary to capture 90 percent of the variance, whereas in typical face datasets,
only a handful would be necessary. To investigate whether this was the case with
license plates, a similar plot was constructed, shown in Figure 2.1. Unlike the text
of various fonts and orientations with which Chen and Yuille were working, license
plates require far fewer components to capture most of the variance. However, an eigenface-based approach [48] for classification yielded very unsatisfactory results and is extremely expensive to compute over many search windows. Fisherface-based classification [3], which is designed to maximize the ratio of between-class scatter to within-class scatter, also yielded unsatisfactory results.
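The energy criterion behind Figure 2.1 can be computed directly; the sketch below (assuming a stack of vectorized plate images in a NumPy array) returns the number of principal components needed to capture a given fraction of the variance:

    import numpy as np

    def components_for_energy(images, energy=0.90):
        """images: (num_examples, num_pixels) array of vectorized plate images.
        Returns how many principal components capture `energy` of the variance."""
        X = images - images.mean(axis=0)                  # center the data
        s = np.linalg.svd(X, compute_uv=False)            # singular values
        eigvals = s ** 2                                   # eigenvalues of the scatter matrix
        cumulative = np.cumsum(eigvals) / eigvals.sum()    # fraction of energy captured
        return int(np.searchsorted(cumulative, energy) + 1)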
It is desirable to select features that produce similar results on all license
Figure 2.1: PCA on 1,520 license plate images (energy captured versus number of eigenvalues). Note that about 70 components are required to capture 90% of the energy.
Figure 2.2: The means of the absolute value of the (a) x-derivative, and (b) y-derivative, and the variance of the (c) x-derivative, and (d) y-derivative.
Figure 2.3: Types of features selected by AdaBoost, built from the x-derivative, y-derivative, x-derivative variance, and y-derivative variance. The sum of values computed over colored regions is subtracted from the sum of values over non-colored regions.
plate images and are good at discriminating between license plates and non-license
plates. After pre-scaling all training examples in the ‘Regents’ dataset to the same
45 × 15 size and aligning them, the sums of the absolute values of their x- and y-derivatives exhibit the patterns shown in Figure 2.2. The locations of the 7 characters of a California license plate are clearly visible in the y-derivative and y-derivative variance. Although the x-derivative and x-derivative variance show the form that Chen and Yuille report for text images, the y-derivative and y-derivative variance
are quite different and yield a wealth of information.
A total of 2,400 features were generated as input to the AdaBoost algorithm. These were a variation of the Haar-like features used by Viola and Jones
[49], but more generalized, yet still computationally simple. A scanning window
was evenly divided into between 2 and 7 regions of equal size, either horizontal or
vertical. Each feature was then a variation on the sum of values computed in a set
of the regions subtracted from the sum of values in the remaining set of regions.
Therefore, each feature applied a thresholding function on a scalar value. Some of
these features are shown in Figure 2.3.
The values of the regions of each window were the means of pixel intensities, derivatives, or variance of derivatives. None of the features actually selected
by AdaBoost used raw pixel intensities, however, probably because of their poor
discriminating ability with respect to wide illumination differences. Each weak
classifier was a Bayes classifier, trained on a single feature by forming class conditional densities (CCD) from the training examples.

Figure 2.4: Typical class conditional densities for weak classifier features (likelihood versus feature value, for the license plate and non-license plate classes). For some features, there is clearly a large amount of error that cannot be avoided when making classifications, however this error is much smaller than the 50% AdaBoost requires to be effective.
The CCD for a typical weak
classifier is shown in Figure 2.4. When making a decision, regions where the license
plate CCD is larger than the non-license plate CCD are classified as license plate
and vice-versa, instead of using a simple one-dimensional threshold.
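As a concrete illustration of this decision rule, the following sketch (an assumed histogram-based implementation; the binning scheme is not taken from the thesis) builds weighted class conditional densities for one scalar feature and classifies by comparing the two densities rather than thresholding:

    import numpy as np

    class CCDWeakClassifier:
        """Bayes weak classifier acting on a single scalar feature value."""
        def __init__(self, bin_edges):
            self.bin_edges = np.asarray(bin_edges)
            n_bins = len(self.bin_edges) - 1
            self.plate_ccd = np.zeros(n_bins)
            self.nonplate_ccd = np.zeros(n_bins)

        def _bin(self, values):
            idx = np.digitize(values, self.bin_edges) - 1
            return np.clip(idx, 0, len(self.plate_ccd) - 1)

        def train(self, values, labels, weights):
            # Build weighted class conditional densities from the training examples.
            self.plate_ccd[:] = 0.0
            self.nonplate_ccd[:] = 0.0
            for b, y, w in zip(self._bin(values), labels, weights):
                (self.plate_ccd if y == 1 else self.nonplate_ccd)[b] += w
            self.plate_ccd /= self.plate_ccd.sum()
            self.nonplate_ccd /= self.nonplate_ccd.sum()

        def classify(self, value):
            # License plate wherever the plate CCD dominates the non-plate CCD.
            b = self._bin(np.array([value]))[0]
            return 1 if self.plate_ccd[b] >= self.nonplate_ccd[b] else 0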
Although the features we have described are rather primitive and not
flexible in the sense that they are not able to respond to discontinuities other than
vertical and horizontal, they lend themselves nicely to the optimization techniques
discussed in Section 2.5. Steerable filters, Gabor filters, or other wavelet-based
approaches are more general, but would be slower to compute.
2.4 The AdaBoost Algorithm
AdaBoost is a widely used instance of boosting algorithms. The term
boosting refers to the process of strengthening a collection of weak learning algorithms to create a strong learning algorithm. It was developed by Schapire
[40] in 1990, who showed that any weak learning algorithm could be transformed
or “boosted” into a strong learning algorithm. A more efficient version of the
algorithm outlined by Schapire was later presented by Freund [16], called “boost-by-majority”, and in 1995 Schapire and Freund developed AdaBoost [17], “Ada”
standing for “adaptive” since it adjusts adaptively to the errors observed in the
weak learners.
The idea of boosting can be explained with the following example. Consider the problem of classifying email messages into junk-email and regular email
by examining messages’ keywords. An example of a keyword we may tend to see
often in junk email is “click here”, and we can classify messages as junk if they contain
the keyword. Although this may work for many junk emails, it will almost certainly also lead to many legitimate messages being misclassified. Classifying solely
based on the “click here” keyword is a good rule of thumb, but it is rather coarse.
A better approach would be to find several of these rough rules of thumb and take
advantage of boosting to combine them.
In its original form, AdaBoost is used to boost the classification accuracy
of a single classifier, such as a perceptron, by combining a set of classification
functions to form a strong classifier. As applied to this project, AdaBoost is used
to select a combination of weak classifiers to form a strong classifier. The weak
classifiers are called weak because they only need to be correct just over 50% of
the time.
At the start of training, each training example $(x_1, y_1), \ldots, (x_n, y_n)$ is assigned a weight $w_i = \frac{1}{2m}$ for negatives and $w_i = \frac{1}{2l}$ for positives, where $x_i$ are positive and negative inputs, $y_i \in \{0, 1\}$, $m$ is the number of negatives, and $l$ is the number of positives. The uneven initial distribution of weights leads to the name
“Asymmetric AdaBoost” for this boosting technique.
Then, for $t = 1, \ldots, T$ rounds, each weak classifier $h_j$ is trained and its error is computed as $\epsilon_t = \sum_i w_i\,|h_j(x_i) - y_i|$. The $h_j$ with the lowest error is selected, and the weights are updated according to

$$w_{t+1,i} = w_{t,i}\,\frac{\epsilon_t}{1 - \epsilon_t}$$

if $x_i$ is classified correctly, and are not modified if it is classified incorrectly. This essentially
forces the weak classifiers to concentrate on “harder” examples that are most often
misclassified. We implemented the weighting process in our Bayes classifiers by
scaling the values used to build the CCDs.
After $T$ rounds, $T$ weak classifiers have been selected, and the strong classifier makes classifications according to

$$h(x) = \begin{cases} 1 & \text{if } \sum_{t=1}^{T} \alpha_t h_t(x) \geq \tau \sum_{t=1}^{T} \alpha_t \\ 0 & \text{otherwise} \end{cases} \qquad (2.1)$$

where $\alpha_t = \ln\frac{1 - \epsilon_t}{\epsilon_t}$ and $\tau$ is set to $\frac{1}{2}$ to minimize the error.
Schapire and Freund showed that the overall error of the boosted classifier
is bounded by a quantity that decreases exponentially with T.
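The boosting loop described above fits in a few lines; the sketch below (illustrative only, using weak classifiers with the train()/classify() interface of the CCD sketch in Section 2.3) selects T weak classifiers and their voting weights:

    import copy
    import numpy as np

    def adaboost_train(candidates, X, y, T):
        """X: (n_examples, n_features) scalar feature values, one column per
        candidate weak classifier; y: 0/1 labels. Returns the selected weak
        classifiers and their weights alpha_t."""
        m, l = np.sum(y == 0), np.sum(y == 1)
        w = np.where(y == 1, 1.0 / (2 * l), 1.0 / (2 * m))   # asymmetric initial weights
        selected, alphas = [], []
        for _ in range(T):
            w = w / w.sum()
            best, best_err, best_pred = None, np.inf, None
            for j, clf in enumerate(candidates):
                clf.train(X[:, j], y, w)
                pred = np.array([clf.classify(v) for v in X[:, j]])
                err = np.sum(w * np.abs(pred - y))
                if err < best_err:
                    best, best_err, best_pred = copy.deepcopy(clf), err, pred
            beta = max(best_err, 1e-10) / (1.0 - best_err)
            w = np.where(best_pred == y, w * beta, w)         # down-weight easy examples
            selected.append(best)
            alphas.append(np.log(1.0 / beta))
        return selected, alphas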
2.5 Optimizations
In this section we discuss two key optimization techniques introduced by
Viola and Jones [49], which allowed us to achieve very fast detection rates – 10
frames per second on 640 × 480 image sizes.
2.5.1 Integral Images
The features described in Section 2.3 add values over one group of sections
and subtract them from another group of sections. If these sections are m × n
Figure 2.5: (a) The integral image acceleration structure. (b) The sum of the
values in each rectangular region can be computed using just four array accesses.
pixels in size, we would normally require mn array accesses. However, if we take
advantage of their rectangular nature, we can reduce the accesses to four, regardless
of the size of the section, using an integral image data structure.
An integral image $I'$ of an image $I$ is of the same dimensions as $I$ and at each location $(x, y)$ contains the sum of all the pixels in $I$ above and to the left of the pixel $(x, y)$:

$$I'(x, y) = \sum_{x' \leq x,\; y' \leq y} I(x', y').$$
With this structure in place, the sum of the pixel values in region $D$ in Figure 2.5 (b) can be computed as

$$D = I'(w) + I'(z) - \bigl(I'(x) + I'(y)\bigr).$$
The integral image itself can be efficiently computed in a single pass over the image using the following recurrences:

$$r(x, y) = r(x - 1, y) + I(x, y)$$
$$I'(x, y) = I'(x, y - 1) + r(x, y),$$

where $r(-1, y)$ and $I'(x, -1)$ are defined to be 0.
Figure 2.6: A cascaded classifier. The early stages are very efficient and good at
rejecting the majority of false windows.
For the images on which we trained and classified, we created integral
images for raw pixel values, x-derivatives, y-derivatives, as well as integral images
for the squares of these three types of values. The integral image of squares of
values is useful for quickly computing the variance of the values in the sections of
our features, since the variance can be computed as
$$\sigma^2 = \frac{1}{N}\sum x^2 - m^2,$$

where $m$ is the mean and $x$ is the feature value.
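Both the construction of the integral image and the four-access rectangle sum are short; a plain NumPy sketch (function names are illustrative) is:

    import numpy as np

    def integral_image(I):
        """Each entry holds the sum of all pixels above and to the left
        (inclusive); equivalent to the single-pass recurrences above."""
        return I.astype(np.int64).cumsum(axis=0).cumsum(axis=1)

    def rect_sum(ii, top, left, height, width):
        """Sum of a rectangle using four array accesses, with guards for
        rectangles that touch the image border."""
        b, r = top + height - 1, left + width - 1
        total = ii[b, r]
        if top > 0:
            total -= ii[top - 1, r]
        if left > 0:
            total -= ii[b, left - 1]
        if top > 0 and left > 0:
            total += ii[top - 1, left - 1]
        return total

Applying the same two routines to the squared pixel or derivative values gives the sums needed for the variance computation above.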
2.5.2 Cascaded Classifiers
At any given time, there are at most a handful of license plates visible in
a frame of video, yet there are on the order of (640 − 100) × (480 − 30) ≈ 200, 000
window positions that require scanning, assuming a license plate image is 100 × 30
pixels. The number of regions to be classified as not containing a license plate
clearly far exceeds the number that do. Luckily, it is not necessary to employ all classifiers
selected by AdaBoost at each window position.
The idea behind a cascaded classifier is to group the classifiers into several
stages in order of increasing complexity with the hopes that the majority of regions
can be rejected quickly by very few classifiers. Such a cascaded structure is depicted
in Figure 2.6. Although a positive instance will pass through all stages of the
cascade, this will be a very rare event, and the cost would be amortized.
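Evaluation of a cascade is an early-exit loop over the stages; the following sketch (an illustrative interface in which each weak classifier maps a window directly to 0 or 1, with per-stage thresholds as in Equation (2.1)) makes the amortization explicit:

    def cascade_classify(window, stages, taus):
        """stages: list of (weak_classifiers, alphas) pairs, one per stage;
        taus: per-stage thresholds (1/2 in Equation (2.1), lowered in practice
        to keep the per-stage detection rate high)."""
        for (weak_classifiers, alphas), tau in zip(stages, taus):
            score = sum(a * clf.classify_window(window)
                        for clf, a in zip(weak_classifiers, alphas))
            if score < tau * sum(alphas):
                return 0              # rejected early; most windows stop here
        return 1                      # survived every stage

Because the vast majority of windows are rejected by the first one or two stages, the average cost per window is close to the cost of those cheap early stages.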
Training the cascade is done stage by stage, where the first stage is trained
on all positive and negative examples, the second stage is trained on all positive
examples and only the false positives of the first stage used as negative examples,
and so on for the remaining stages. The justification for this selection of negative
examples is that when the cascade is in operation, there are many window instances
which the latter stages will never be asked to classify since the early stages will
have rejected them, and, therefore, training of the latter stages should reflect the
type of data those stages would see in practice. Usually, the largest percentage of
negative examples will be rejected in the first two stages, and the rest of the stages
in the cascade will train on “harder” examples; they thus have much higher false positive rates than the early stages and, as a result, require more classifiers.
By increasing the τ threshold in Equation (2.1), which is designed to yield
a low error on the training data, we can decrease the false positive rate, at the
expense of a decrease in the detection rate. This adjustment allows us to generate
the receiver operating characteristic (ROC) curves shown in the next section, and
it also allows us to design the cascade with a desirable detection rate at each stage. Since the overall detection rate is given by

$$D = \prod_{i=1}^{K} d_i,$$

where $d_i$ is the detection rate of each stage in the cascade and $K$ is the number of stages, if we desire a 90% overall detection rate and $K = 10$, we would require each $d_i$ to be 99%, since $0.99^{10} \approx 0.90$. A 99% per-stage detection rate can easily
be achieved by decreasing the τ threshold in Equation (2.1), even at the expense
of a high false positive rate at each stage. The overall false positive rate is given by

$$P = \prod_{i=1}^{K} p_i,$$

where $p_i$ is the false positive rate of each stage. Even a high false positive rate of
40% at each stage would equate to an overall false positive rate of only .01%, since
$0.40^{10} \approx 0.0001$.
The design of a good cascade is not trivial. Viola and Jones present a
simple algorithm that determines the number of features to be used at each stage
by selecting a desired false negative and false positive rate [49]; however, it assumes
that each feature is of equal computational complexity. In our case, and in Chen
and Yuille’s [10] cascaded classifier, this assumption does not hold. In principle
one could design an algorithm to evaluate the time complexity of each feature type
and choose how many and of what type features should be placed in K stages in
order to minimize the overall running time of the classifier. Unfortunately, this is
a very difficult problem. In practice, however, one can design a reasonably good
cascade using the guiding principle that efficient features should be evaluated near
the front of the cascade and more computationally expensive features should be
evaluated near the end of the cascade.
In our chosen cascaded classifier, we did not allow AdaBoost to select
variance-based features for the first stage since we wanted it to be very efficient at
eliminating a large portion of window locations early on. We should also mention
that not only is detection fast in a cascaded classifier, but so is its training. Since
each stage eliminates a large number of negative examples, the latter stages train
on a much smaller set of examples. For a 123-feature single-stage classifier, full
training with two bootstrap operations takes 18 hours, whereas a 6-stage
classifier with the same number of features in total takes 5 hours.
2.6 Results
In this section we present our results on the ‘Gilman’ dataset.
2.6.1 Datasets
Unlike in our ‘Regents’ dataset, the camera on the ‘Gilman’ dataset was
mounted much closer to the intersection, which resulted in greater projective distortion of the license plate as each car progresses through the intersection. We
Figure 2.7: The three sets of positive examples used in training the license plate
detector – sets 1, 2, and 3, with a resolution of 71 × 16, 80 × 19, and 104 × 31,
respectively.
investigated training our license plate detector on plate images of a single scale
and performing detection on a pyramid of scales for each frame, but found that
the detection rate was not as good as that of a dedicated detector trained on
several scales. Therefore, the final training and test datasets were created by sampling three images of each car when it is approaching, entering, and exiting the
intersection for 419 cars over several hours of video. The plates were then manually
extracted from these images and split into three sets of small, medium, and large
area. This provided 359 training images and 60 test images for each of the three
sets. The average size of a plate in each set was 71 × 16, 80 × 19, and 104 × 31
respectively. The images in each set are shown in Figure 2.7. To allow for an easier
method of extracting negative examples for training and to test our detector, we
Figure 2.8: ROC curves (detection rate versus false positive rate) for a 5-stage cascade trained using 359 positive examples and three different choices of negative training examples (legend: 20000 random; 10000 random, 10000 FP; 10000 random, 5000 FP, 5000 FP).
ensured that each of the 419 frames sampled for each set contained at most one
visible license plate.
We generated additional positive training examples for each set by extracting images from 10 random offsets (up to 1/8 of the width and 1/4 of the
height of license plates) of each license plate location (for a total of 3,590), all of
the same size as the average license plate size for that set. We found that this
yielded better results than just using the license plate location for a single positive
example per hand-labeled region. Of course, when the detector was in operation, it
fired at many regions around a license plate, which we in fact used as an indication
of the quality of a detection.
To generate negative examples, we picked 28 license plate-sized images
from random regions known not to contain license plates in each positive frame,
which resulted in 10,052 per set. We then applied a sequence of two bootstrap operations where false positives obtained from testing on the training data were used
as additional negative examples for re-training the cascade. We found that two sequential bootstrap operations of 4,974 negative examples each were more effective
than a single bootstrap operation with 9,948 negative examples. A comparison of
these two methods is given in Figure 2.8.
2.6.2 Results
Figure 2.9 shows a receiver operating characteristic (ROC) curve for our
cascaded detector and for a single-stage detector with the same number of
features. There appears to be a trend indicating that a larger set (in terms of image
size) is learned better than a smaller set. This is most likely due to the detector
having access to more information content per image and, as a result, being able to
better discriminate between license plates and non-license plates. In fact, when
our detector was trained on the ‘Regents’ dataset where plate sizes were on average
only 45 × 15 pixels, the detection rates were much lower even though more training
examples were used. The ROC improvement for the resolution increase between
sets 1 and 2 does not appear in the single-stage cascade, most likely because it is
not a large increase.
Table 2.1 shows the number of negative examples remaining at each stage
of the cascade during the three training operations. Stages using the same number
of negative examples as the previous stage indicate that the desired detection rate of 99.5% could not be maintained at the previous stage, so the τ threshold of Equation (2.1) was left unchanged. Note that with each bootstrap operation, the number of negative examples that enter the last stage of the cascade grows much more quickly than the linear growth in the total number of negative examples, because the false positives
represent ‘harder’ examples.
As was to be expected, the cascaded classifier was much faster in operation, with each frame requiring about 100 ms to process, whereas the single-stage
classifier required over 3 seconds, but exhibited a superior ROC curve.
Figure 2.10 shows a few examples of regions that our detector incorrectly
labeled as license plates in our test dataset. Perhaps not surprisingly, a large
number of them are text from advertising on city buses, or the UCSD shuttle.
Figure 2.9: ROC curves for (a) a single-stage, 123-feature detector, and (b) a 6-stage cascaded detector, with 2, 3, 6, 12, 40, and 60 features per stage respectively.
The sizes of the images trained on in sets 1, 2, and 3 are 71 × 16, 80 × 19, and
104 × 31 respectively. The x-axis scales in (a) and (b) were chosen to highlight the
performance of the detector on each set.
Table 2.1: Negative examples remaining during training at each stage of the cascade. The three training operations shown are (1) initial training with 10,052 randomly chosen negative examples, (2) first bootstrap training with an additional
4,974 negative examples taken from false positives, (3) second bootstrap operation
with another 4,974 negative examples taken from false positives from the previous
training stage.
Stage              1        2        3        4        5        6   Remaining
# of Features      2        3        6       12       40       60
(1)           10,052    1,295    1,295      537      207        0           0
(2)           15,026    4,532    4,532    2,217      861      152           0
(3)           20,000   20,000    8,499    5,582    2,320      552          14
Figure 2.10: Examples of regions incorrectly labeled as license plates in the set 3
test set.
Those that contain taillights can easily be pruned by applying a color threshold.
We also applied our license plate detector to a few car images from the
Caltech Computer Vision group’s car database, whose image quality is far better
than the video cameras used to create our datasets, and we found that many license
plates were detected correctly, at the expense of a high number of false positives
due to vegetation, for which our detector was not given negative examples. These
could easily be pruned as well simply by applying yet another color threshold.
Figure 2.11 shows the output of our detector on one of these images.
We did not achieve as low a false positive rate per detection rate on our
datasets as either Chen and Yuille, or Viola and Jones, but the false positive rate
of 0.002% for a detection rate of 96.67% in set 3 is quite tolerable. In practice, the
number of false positives per region of each frame is small compared to the number of
detections around a license plate in the frame. Therefore, in our final detector we
Figure 2.11: Detection on an image from the Caltech Computer Vision group’s car
database.
do not consider a region to contain a license plate unless the number of detections
in the region is above a threshold.
2.7 Future Work
It would be advantageous to investigate other types of features to place
in the latter stages of the cascade in order to reduce the false positive rate. Color-based discrimination would be especially useful, since most plates contain a bimodal color distribution of a white background and black or dark blue text. Other features mentioned by Chen and Yuille [10], such as histogram tests and edge linking, were not tried but should be evaluated in a license plate detection setting.
Chapter III
License Plate Recognition
In this chapter, we present a process to recognize the characters on detected license plates. We begin by describing a method for tracking license plates
over time and how this can provide multiple samplings of each license plate for
the purposes of enhancing it for higher quality character recognition. We then
describe our optical character recognition (OCR) algorithm and present our recognition rates.
3.1 Tracking
More often than not, the false positive detections from our license plate
detector were erratic, and if on the car body, their position was not temporally
consistent. We use this fact to our advantage by tracking candidate license plate
regions over as many frames as possible. Then, only those regions with a smooth
trajectory are deemed valid. The tracking of license plates also yields a sequence
of samplings of the license plate, which are used as input to a super-resolution
pre-processing step before OCR is performed on them.
Numerous tracking algorithms exist that could be applied to our problem.
Perhaps the most well-known and popular is the Kanade-Lucas-Tomasi (KLT)
tracker [45]. The KLT tracker makes use of a Harris corner detector to detect
good features to track in a region of interest (our license plate) and measures the
similarity of every frame to the first allowing for an affine transformation. Sullivan
et al. [47] make use of a still camera for the purposes of tracking vehicles by defining
regions of interest (ROI) chosen to span individual lanes. They initiate tracking
when a certain edge characteristic is observed in the ROI and make predictions on
future positions of vehicles. Those tracks with a majority of accurate predictions
are deemed valid. Okuma et al. [38] use the Viola and Jones [49] framework to
detect hockey players and then apply a mixture particle filter using the detections
as hypotheses to keep track of the players.
Although each of these tracking methods would probably have worked
well in our application, we chose a far simpler approach which worked well in
practice. Because detecting license plates is efficient, we simply run our detector on each frame, and for each detected plate we determine whether that detection is
a new plate or an instance of a plate already being tracked. To determine whether
a detected plate is new or not, the following conditions are checked:
• the plate is within T pixels of an existing tracker
• the plate is within T′ pixels of an existing tracker and the plate is within θ degrees of the general direction of motion of the plates in the tracker’s history
If either condition holds, the plate is added to the corresponding tracker; otherwise, a new tracker is created for that plate. In our application, T′ was an order of magnitude larger than T. Figure 3.1 shows the tracking algorithm in action.
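The assignment rule above amounts to a few distance and angle checks per detection; a minimal sketch (NumPy, with illustrative threshold values rather than the ones used in the thesis) is:

    import numpy as np

    def assign_detection(det_xy, trackers, T=10.0, T_prime=100.0, theta_deg=20.0):
        """trackers: list of position histories (lists of (x, y) plate centers).
        Adds the detection to an existing tracker or starts a new one."""
        det = np.asarray(det_xy, dtype=float)
        for track in trackers:
            last = np.asarray(track[-1], dtype=float)
            dist = np.linalg.norm(det - last)
            if dist < T:                                  # first condition
                track.append(tuple(det))
                return track
            if dist < T_prime and len(track) >= 2:        # second condition
                motion = last - np.asarray(track[-2], dtype=float)
                step = det - last
                cos = np.dot(motion, step) / (np.linalg.norm(motion) * np.linalg.norm(step) + 1e-9)
                angle = np.degrees(np.arccos(np.clip(cos, -1.0, 1.0)))
                if angle < theta_deg:
                    track.append(tuple(det))
                    return track
        trackers.append([tuple(det)])                     # otherwise start a new tracker
        return trackers[-1]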
Our tracking algorithm was also useful for discarding false positives from
the license plate detector. The erratic motion of erroneous detections usually resulted in the initiation of several trackers, each of which stored only a short image sequence.
Image sequences of 5 frames or fewer were discarded.
3.2 Super-Resolution
Video sequences such as the ones obtained from our cameras provide
multiple samplings of the same surface in the physical world. These multiple
Figure 3.1: A car tracked over 10 frames (1.7 seconds) with a blue line indicating
the positions of the license plate in the tracker.
samples can sometimes be used to extract higher-resolution images than any of
the individual samples. The process of extracting a single high-resolution image
from a set of lower-resolution images is called super-resolution. Super-resolution
is different from what is known as image restoration where a higher-resolution
image is obtained from a single image, a process also sometimes referred to as
enhancement.
The investigation into super-resolution was inspired by the low-resolution
license plate images in our ‘Regents’ dataset. In that dataset, the noisy and blurry
45 × 15 pixel license plate images made it very difficult to read the text on the
plates.
Before we describe the super-resolution algorithm we shall use, we shall
describe our assumed image formation model. A plane in the scene undergoes a
Figure 3.2: Our image formation model. The (a) full-resolution image H undergoes (b) a geometric transformation T_k followed by (c) a blur with a PSF h(u, v); is (d) sub-sampled by S, and finally (e) additive Gaussian noise η is inserted. The actual observed image L̂_k from our camera is shown in (f). The geometric transformation is exaggerated here for illustrative purposes only.
geometric transformation that maps its world coordinates to those of the camera.
The optics of the camera blur the resulting projection at which point the camera samples it at the low-resolution we observe. Because of imperfections in the
sampling device, noise is introduced, which we shall assume to be spatially uncorrelated, additive, and Gaussian-distributed with zero-mean and constant variance.
Expressed in more formal terms, the imaging process is:
$$\hat{L}_k(x, y) = S\!\downarrow\bigl(h(x, y) * H(T_k(x, y))\bigr) + \eta(x, y), \qquad (3.1)$$
with the following notation:
$\hat{L}_k$ – $k$th estimated low-resolution image
$S\!\downarrow$ – down-sampling operator by a factor of $S$
$h$ – point spread function (PSF)
$*$ – convolution operator
$H$ – high-resolution image
$T_k$ – geometric transformation
$\eta$ – additive noise
This image formation process is illustrated in Figure 3.2. Note that the actual
observed image in Figure 3.2 (f) appears to have a further blurring effect after the
additive noise step when compared to Figure 3.2 (e). This could be due to a slight
motion-blur, which is not taken into account by our model.
The goal of a super-resolution algorithm is to find H given each observed
Lk . The sub-sampling factor S is usually chosen to be 2 or 4, and the estimation
of Tk and h(x, y) is discussed in Sections 3.2.1 and 3.2.2, respectively.
We shall use the b symbol to differentiate between estimated and actual
b
images. In other words, H, represents the actual high-resolution image, and H
denotes its estimate.
3.2.1 Registration
The process of determining the transformation Tk for each image is known
as registration. In the general case, Tk is a projective transformation (planar homography) and its reference coordinates are usually those of one of the images in
the sequence. If all the images are roughly aligned by the detector, as was the case
with our detector, the choice of a reference image is arbitrary, and we chose the
first of each sequence.
As a simplification, we are assuming that Tk is simply translational since,
as the reader may recall from Chapter 2, our license plate detector is custom-designed for three different scales, and the variation in size of detections within a
scale is minimal. To calculate the translation of each image Lk in the sequence
relative to the reference image L1 , we divided each image into several patches and
used the normalized cross-correlation measure of similarity
NCC(I_1, I_2) = \frac{\sum_x (I_1(x) - \bar{I}_1)(I_2(x) - \bar{I}_2)}{\sqrt{\sum_x (I_1(x) - \bar{I}_1)^2 (I_2(x) - \bar{I}_2)^2}}        (3.2)

to find the best place in L1 of each patch. In Equation (3.2),

\bar{I}_1 = \frac{1}{N} \sum_x I_1(x) \quad \text{and} \quad \bar{I}_2 = \frac{1}{N} \sum_x I_2(x)

are the means of I1 and I2. NCC(I1, I2) takes on values in [−1, 1], with 1 representing most similar and −1 representing least similar. Each patch I1 is compared against
all possible same-sized windows I2 of the reference image L1, and the average
offset of each correspondence is computed and treated as the translation from Lk
to L1 . This simple process leads to sub-pixel accuracies for each translation.
Since registration is a crucial pre-processing step for the extraction of an
accurate high-resolution estimate, we applied an all-pairs cross-correlation procedure on the plates in each tracked sequence to ensure all images in the sequence
are somewhat similar and no erroneous detections are included. Those images with
poor correlation to the rest are discarded.
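The patch-based translation estimate can be sketched in Python as follows. This is a minimal illustration only: the patch size, step, and search radius are arbitrary placeholders, and the NCC helper uses the conventional normalization.

```python
import numpy as np

def ncc(a, b):
    """Normalized cross-correlation of two equally sized patches."""
    a = a - a.mean()
    b = b - b.mean()
    denom = np.sqrt(np.sum(a * a) * np.sum(b * b)) + 1e-12
    return float(np.sum(a * b) / denom)

def estimate_translation(ref, img, patch_size=6, step=4, search=3):
    """Estimate the translation from `img` to the reference image `ref` by
    matching several patches with NCC and averaging their best offsets."""
    offsets = []
    h, w = img.shape
    for y in range(search, h - patch_size - search, step):
        for x in range(search, w - patch_size - search, step):
            patch = img[y:y + patch_size, x:x + patch_size]
            best, best_off = -2.0, (0.0, 0.0)
            for dy in range(-search, search + 1):
                for dx in range(-search, search + 1):
                    cand = ref[y + dy:y + dy + patch_size,
                               x + dx:x + dx + patch_size]
                    score = ncc(patch, cand)
                    if score > best:
                        best, best_off = score, (dx, dy)
            offsets.append(best_off)
    # Averaging the per-patch integer offsets yields a sub-pixel estimate.
    return np.mean(offsets, axis=0)
```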
3.2.2 Point Spread Function
The blur operation in Equation (3.1) is modeled by a convolution with a
point spread function (PSF). The PSF should approximate the blur of both the
optics of the camera as well as its sensor. Zomet and Peleg [24] suggest three
methods of estimating it:
• Use camera specifications obtained from manufacturer (if available)
• Analyze a picture of a known object
• Use the images in the sequence
Capel and Zisserman [7] instead suggest simply using an isotropic Gaussian, which
Capel found to work well in practice [6]. For our experiments we chose a Gaussian
of size 15 × 15 and standard deviation of 7, which was used to create the blur
operation in Figure 3.2.
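For reference, such a kernel can be constructed in a few lines of Python; applying it with a 2-D convolution (for example scipy.signal.convolve2d) is one possible way to implement the blur, not necessarily the one used here.

```python
import numpy as np

def gaussian_psf(size=15, sigma=7.0):
    """Isotropic Gaussian PSF of the kind used as the blur kernel h(u, v)."""
    r = np.arange(size) - (size - 1) / 2.0
    g = np.exp(-r ** 2 / (2.0 * sigma ** 2))
    kernel = np.outer(g, g)
    return kernel / kernel.sum()   # normalize so the blur preserves overall brightness

# Example (one possible application of the kernel):
# from scipy.signal import convolve2d
# blurred = convolve2d(image, gaussian_psf(), mode='same', boundary='symm')
```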
3.2.3 Algorithm
Our super-resolution algorithm is based on a probabilistic framework.
The algorithm estimates the super-resolution image H by maximizing the conditional probability Pr(Ĥ|L) of the super-resolution estimate Ĥ given the set of
observed low-resolution images L = {Lk}. We do not know Pr(Ĥ|L) directly,
but using the imaging model of Equation (3.1) we can determine Pr(L|Ĥ). Using
Bayes’ Rule,

Pr(\hat{H} | L) = \frac{Pr(L | \hat{H})\, Pr(\hat{H})}{Pr(L)}.

To find the most probable high-resolution image H, we need to maximize

Pr(L | \hat{H})\, Pr(\hat{H}).        (3.3)

We can drop the Pr(L) term since it does not depend on Ĥ. A further simplification
is sometimes made by assuming that all high-resolution images are equally likely,
in which case just Pr(L|Ĥ) is maximized. The high-resolution estimate obtained
from this process is the maximum likelihood (ML) estimate. In our case, however,
we do have some prior knowledge of the high-resolution images of license plates,
which we can use to our advantage. We shall first describe a method of finding
the ML estimate and then describe the priors we use in Section 3.2.5.
3.2.4 Maximum Likelihood Estimate
Using our assumption that image noise is Gaussian with zero-mean and
variance σ², Capel and Zisserman [7] suggest the total probability of an observed
image Lk given an estimate of the super-resolution image Ĥ is

Pr(L_k | \hat{H}) = \prod_{x,y} \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{(\hat{L}_k(x,y) - L_k(x,y))^2}{2\sigma^2}}.        (3.4)

The log-likelihood function of Equation (3.4) is:

L(L_k) = -\sum_{x,y} (\hat{L}_k(x, y) - L_k(x, y))^2.        (3.5)

If we assume independent observations,

Pr(L | \hat{H}) = \prod_k Pr(L_k | \hat{H}),        (3.6)

and the corresponding log-likelihood function for all images in the set L becomes

L(L) = \sum_k L(L_k) = -\sum_k \| \hat{L}_k - L_k \|^2.        (3.7)

The ML estimate then is obtained by finding the H that maximizes Equation (3.7):

H_{ML} = \arg\max_H \sum_k L(L_k) = \arg\min_H \sum_k \| \hat{L}_k - L_k \|^2.        (3.8)
If the formation process in Equation (3.1) that maps the high-resolution
estimate Ĥ to L̂k is expressed in matrix form as

\hat{L}_k = M_k \hat{H},        (3.9)

we have a system of N linear equations for all N images in the sequence. Stacking
these vertically, we have:

\begin{bmatrix} L_1 \\ L_2 \\ \vdots \\ L_N \end{bmatrix} = \begin{bmatrix} M_1 \\ M_2 \\ \vdots \\ M_N \end{bmatrix} \hat{H}, \qquad \text{i.e.} \qquad L = M\hat{H}.        (3.10)

Using this notation, the solution of Equation (3.7) can be obtained by

\hat{H} = (M^\top M)^{-1} M^\top L.        (3.11)
In practice, M is very large and its pseudo-inverse is prohibitive to compute, and
therefore iterative minimization techniques are used. The iterative methods also
facilitate the computation of Ĥ when the high-resolution images are not all equally
likely and several priors are included in Equation (3.3). We use simple gradient
descent as our minimization method.
3.2.5 Maximum a Posteriori Estimate
In this section we shall describe the priors Pr(Ĥ) we use for obtaining
a maximum a posteriori (MAP) estimate. The MAP estimate is obtained by
maximizing the full expression in Equation (3.3). The most common prior used
in the super-resolution literature is the smoothness prior introduced by Schultz
and Stevenson [42]. Capel and Zisserman also use a learnt face-space prior [8]. For
super-resolution of text specifically, Donaldson and Myers [12] use a bi-modal prior
taking into account the bi-modal appearance of dark text on light background. The
two priors we experimented with were the smoothness and bi-modal prior.
Smoothness Prior
The smoothness prior we used was introduced by Schultz and Stevenson
[42] and has the probability density:
Pr_s(\hat{H}(x, y)) = c_s\, e^{-\rho(\hat{H}(x,y) - \bar{\hat{H}}(x,y))},        (3.12)

where cs is a normalizing constant, H̄̂(x, y) is the average of the pixel intensities
of the four nearest neighbors of Ĥ:

\bar{\hat{H}}(x, y) = \frac{\hat{H}(x-1, y) + \hat{H}(x+1, y) + \hat{H}(x, y-1) + \hat{H}(x, y+1)}{4},        (3.13)

and ρ(x) is the Huber cost function:

\rho(x) = \begin{cases} x^2, & |x| \le \alpha \\ 2\alpha|x| - \alpha^2, & |x| > \alpha. \end{cases}        (3.14)
The difference between Ĥ(x, y) and its neighborhood average in Equation (3.13) is a measure of the local smoothness around
a pixel (x, y), where large values indicate discontinuities and small values indicate a smooth
region. A plot of the Huber function is shown in Figure 3.3 (a). Its use is justified
by Donaldson and Myers [12], who suggest the linear region of ρ(x) for |x| > α
preserves steep edges because of the constant derivative.
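As an illustration, the Huber cost of Equation (3.14) and its derivative ρ′(x), which appears later in Equation (3.22), translate directly into a short numpy sketch:

```python
import numpy as np

def huber(x, alpha=0.6):
    """Huber cost rho(x): quadratic for |x| <= alpha, linear beyond (Equation 3.14)."""
    x = np.asarray(x, dtype=float)
    return np.where(np.abs(x) <= alpha, x ** 2, 2.0 * alpha * np.abs(x) - alpha ** 2)

def huber_grad(x, alpha=0.6):
    """Derivative rho'(x): slope 2x inside [-alpha, alpha], constant magnitude outside."""
    x = np.asarray(x, dtype=float)
    return np.where(np.abs(x) <= alpha, 2.0 * x, 2.0 * alpha * np.sign(x))
```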
Bi-Modal Prior
The bi-modal prior used by Donaldson and Myers [12] is an exponential
fourth-order polynomial probability density with maxima at the corresponding
Figure 3.3: (a) The Huber penalty function (gradient penalty versus gradient) used in the smoothness prior with α = 0.6, with red and blue corresponding to the regions |x| ≤ α and |x| > α respectively; (b) an un-scaled version of the bi-modal prior (Pr_b/c_b versus image intensity) with µ0 = 0.1 and µ1 = 0.9.
black and white peaks of the pixel intensity distributions of the high-resolution
image:
Pr_b(\hat{H}(x, y)) = c_b\, e^{-(\hat{H}(x,y) - \mu_0)^2 (\hat{H}(x,y) - \mu_1)^2},        (3.15)
where cb is a normalizing constant and µ0 and µ1 are the centers of the peaks. The
function is shown in Figure 3.3 (b) for a choice of µ0 = 0.1 and µ1 = 0.9.
Donaldson and Myers estimate µ0 and µ1 for each high-resolution estimate, but instead we used the constants in Figure 3.3 (b).
Computing the Estimate
Combining the likelihood and two prior probability distributions and substituting into Equation (3.3), we have

H = \arg\max_H \prod_k Pr(L_k | \hat{H}) \cdot \prod_{x,y} Pr_s(\hat{H}(x, y)) \cdot Pr_b(\hat{H}(x, y)).        (3.16)

Taking the negative log-likelihood of the right-hand side,

H = \arg\min_H \sum_k \| M_k \hat{H} - L_k \|^2 + \sum_{x,y} \rho(\hat{H}(x,y) - \bar{\hat{H}}(x,y)) + \sum_{x,y} (\hat{H}(x,y) - \mu_0)^2 (\hat{H}(x,y) - \mu_1)^2.        (3.17)

For convenience, we shall refer to each of the three terms as EM(Ĥ), ES(Ĥ), and
EB(Ĥ). To control the contributions of each term we weigh EM(Ĥ), ES(Ĥ), and
EB(Ĥ) by the constants cM, cS, and cB, respectively:

H = \arg\min_H\; c_M E_M(\hat{H}) + c_S E_S(\hat{H}) + c_B E_B(\hat{H}).        (3.18)
We chose to use gradient descent to minimize Equation (3.17); therefore,
we need to find the derivative of the entire expression with respect to Ĥ. The
derivative of the ML term is straightforward:

\frac{\partial}{\partial \hat{H}} E_M(\hat{H}) = -2 M_k^\top (M_k \hat{H} - L_k),        (3.19)

and the derivative of the bi-modal term is:

\frac{\partial}{\partial \hat{H}} E_B(\hat{H}) = 2 (\hat{H}(x,y) - \mu_0)(\hat{H}(x,y) - \mu_1)(2\hat{H}(x,y) - \mu_0 - \mu_1).        (3.20)
The derivative of the smoothness term is more tricky to compute since each neighbor of Ĥ(x, y) involves Ĥ(x, y) in its own neighborhood-average calculation. Therefore, we need to unroll
ES(Ĥ) around Ĥ(x, y) and then find the derivative:

E_S(\hat{H}) = \ldots + \rho\left(x_1 - \frac{\ldots + x}{4}\right) + \rho\left(x_2 - \frac{\ldots + x}{4}\right) + \ldots + \rho\left(x - \frac{x_1 + x_2 + x_3 + x_4}{4}\right) + \rho\left(x_3 - \frac{\ldots + x}{4}\right) + \ldots + \rho\left(x_4 - \frac{\ldots + x}{4}\right) + \ldots,        (3.21)
where

x  = Ĥ(x, y)
x1 = Ĥ(x, y − 1)
x2 = Ĥ(x − 1, y)
x3 = Ĥ(x + 1, y)
x4 = Ĥ(x, y + 1).
The derivative then is

\frac{\partial}{\partial \hat{H}} E_S(\hat{H}(x,y)) = \frac{1}{4}\rho'\left(x_1 - \frac{\ldots + x}{4}\right) + \frac{1}{4}\rho'\left(x_2 - \frac{\ldots + x}{4}\right) + \rho'\left(x - \frac{x_1 + x_2 + x_3 + x_4}{4}\right) + \frac{1}{4}\rho'\left(x_3 - \frac{\ldots + x}{4}\right) + \frac{1}{4}\rho'\left(x_4 - \frac{\ldots + x}{4}\right).        (3.22)
Having obtained the derivatives of each term, we iteratively step in the direction
opposite the gradient until we reach a local minimum. At each step we add some
portion of the gradient of the EM(Ĥ), ES(Ĥ), and EB(Ĥ) terms, controlled by the
factors cM, cS, and cB, respectively, and the step size. Instead of constructing each
Mk matrix in Equation (3.19) explicitly, we only apply image operations such as
warp, blur, and sample for multiplications with Mk and Mk⊤Mk, similar to Zomet
and Peleg’s work [52]. Since Mk is the product of each of the image operations, it
can be decomposed into

M_k = S B W_k,        (3.23)

where S is the down-sampling matrix, B is the matrix expressing the blurring with
the PSF, and Wk is the transformation matrix representing Tk. Therefore,

M_k^\top = W_k^\top B^\top S^\top,        (3.24)
Figure 3.4: Super-resolution results: (a) sequence of images processed, (b) an up-sampled version of one low-resolution image, (c) the average image, (d) the final
high-resolution estimate.
where Wk⊤ is implemented by applying the reverse transformation Wk applies,
B⊤ is implemented by applying the same blur operation as B since we are using
an isotropic Gaussian PSF, and S⊤ is implemented by up-sampling without any
interpolation.
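To make the pieces above concrete, the following Python sketch strings the image operators together into a gradient-descent MAP estimate. It is a minimal illustration, not the implementation used in this work: the per-image translations are assumed to come from the registration step, the warp is implemented with scipy.ndimage.shift, the weights and step size are arbitrary placeholders, the smoothness gradient ignores the neighbor cross-terms of Equation (3.22), and the data-term gradient follows the usual least-squares sign convention.

```python
import numpy as np
from scipy.ndimage import gaussian_filter, shift

S = 4              # assumed down-sampling factor
PSF_SIGMA = 7.0    # isotropic Gaussian PSF, as in Section 3.2.2

def huber_grad(x, alpha=0.6):
    # Derivative of the Huber cost: 2x inside [-alpha, alpha], constant outside.
    return np.where(np.abs(x) <= alpha, 2.0 * x, 2.0 * alpha * np.sign(x))

def apply_Mk(H, t_k):
    """M_k H: translate by t_k, blur with the PSF, then down-sample (Eq. 3.23)."""
    warped = shift(H, t_k, order=1, mode='nearest')
    return gaussian_filter(warped, PSF_SIGMA)[::S, ::S]

def apply_Mk_T(L, hr_shape, t_k):
    """M_k^T L: zero up-sample, blur again (B is symmetric), reverse the warp (Eq. 3.24)."""
    up = np.zeros(hr_shape)
    up[::S, ::S] = L
    return shift(gaussian_filter(up, PSF_SIGMA),
                 (-t_k[0], -t_k[1]), order=1, mode='nearest')

def map_super_resolution(low_res, translations, H0, n_iter=50, step=0.1,
                         c_m=1.0, c_s=0.05, c_b=0.01, mu0=0.1, mu1=0.9):
    """Gradient descent on the weighted data, smoothness, and bi-modal terms."""
    H = H0.copy()
    for _ in range(n_iter):
        grad = np.zeros_like(H)
        # Data term: gradient of sum_k ||M_k H - L_k||^2, applied operator-wise.
        for L_k, t_k in zip(low_res, translations):
            grad += c_m * 2.0 * apply_Mk_T(apply_Mk(H, t_k) - L_k, H.shape, t_k)
        # Smoothness term: Huber penalty on the difference to the 4-neighbor mean
        # (a simplification of Equation (3.22) that drops the cross-terms).
        nbr = 0.25 * (np.roll(H, 1, 0) + np.roll(H, -1, 0) +
                      np.roll(H, 1, 1) + np.roll(H, -1, 1))
        grad += c_s * huber_grad(H - nbr)
        # Bi-modal term: derivative of (H - mu0)^2 (H - mu1)^2, as in Equation (3.20).
        grad += c_b * 2.0 * (H - mu0) * (H - mu1) * (2.0 * H - mu0 - mu1)
        H -= step * grad
    return H
```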
We use the average image in the sequence resized to four times the resolution using bi-linear interpolation as the initial high-resolution estimate. The
choice of the average image as an initial estimate is justified since it contains little
of the noise found in the individual images as is seen in Figure 3.4 (c).
Since we are performing a cross-check of each image with each other image
in the sequence during registration, the first few images (which have the most detail)
are pruned. Had we implemented a more general transformation estimation for
registration, we would have been able to take advantage of these images, but
simple translation estimation with them included negatively affected the average
image and thus the initial super-resolution estimate.
3.2.6 Discussion
There are numerous parameters in our image formation model and our
super-resolution algorithm that require either estimation or initial adjustment. The
values of these parameters have a profound effect on the final super-resolution images. Some of these results may look more appealing to us as humans, but the only
way to determine whether super-resolution in general is worthwhile is to actually
determine whether it improves the OCR rate. This was the approach taken by
Donaldson and Myers [12]; however, they used Scansoft’s DevKit 2000, a commercial OCR engine, on printed text, for which most commercial OCR packages are
designed. Although we were unable to obtain a copy of their choice of OCR package, the commercial OCR software we experimented with performed very poorly
on our super-resolution images, most likely because the OCR engines were not
specifically trained on the forms of text found on license plates, or because our images were not of sufficiently
high resolution.
Donaldson and Myers found that the biggest factor super-resolution had
on improving OCR performance was the clearer separation of characters rather
than the reduction of noise. The separation of characters, which is a result of the
bi-modal prior, can also be observed in our data as shown in the super-resolution
estimate in Figure 3.4 (d). The image also exhibits a clear bi-modal pixel intensity
distribution, and in fact, the contrast is good enough to not require binarization
algorithms to be applied, a pre-processing step often necessary for OCR packages
to work correctly.
3.3 Optical Character Recognition
In this section we describe a very simple algorithm to recognize the characters
on detected plates and propose additional methods that may be used in
further research.
3.3.1 Previous Work
It was our initial intent to apply a binarization algorithm, such as a
modified version of Niblack’s algorithm as used by Chen and Yuille [10], on the
extracted license plate images from our detector, and then use the binarized image as input to a commercial OCR package. We found, however, that even at a
resolution of 104 × 31 the OCR packages we experimented with yielded very poor
results. Perhaps this should not come as a surprise considering the many custom
OCR solutions used in existing LPR systems.
The most common custom OCR approach used by existing LPR systems
is correlation-based template matching [35], sometimes done on a group of characters [11]. Sometimes, the correlation is done with principal component analysis
(PCA) [27]. Others [44] apply connected component analysis on binarized images
to segment the characters and minimize a custom distance measure between character candidates and templates. Classification of segmented characters can also be
done using neural networks [37] with good results.
Instead of explicitly segmenting characters in detected plates, Amit et al.
[2] use a coarse-to-fine approach for both detection and recognition of characters
on license plates. Although they present high recognition rates, the license plate
images they worked with were of high resolution, and it is not clear whether their
method would be as effective on the low-resolution images in our datasets.
Because of the simplicity of the template matching method, we chose to
experiment with it first, and it proved to work reasonably well.
3.3.2 Datasets
We generated training and test data by running our license plate detector
on several hours of video and extracting sequences of images for each tracked license
plate. This process resulted in a total of 879 plate sequences each of which was
labeled by hand. Of these, 121 were chosen at random to form an alphabet of
characters for training. These 121 sequences contained the necessary distribution
of characters to form 10 examples per character, for a total of 360 examples (26
letters and 10 digits). This alphabet of training images is shown in Figure 3.5.
The remaining 758 plates were used for testing the OCR rate.
Figure 3.5: The alphabet created from the training set. There are 10 examples for
each character for the low-resolution, average image, and super-resolution classes,
shown in that respective order.
Figure 3.6 shows a histogram of the frequency of all characters in our
training and test datasets. Note that the majority of characters are numbers with
‘4’ being most common since most of today’s California plates start with that
number. The frequencies of ‘I’, ‘O’, and ‘Q’ were relatively small most likely due
to their potential confusion with other similarly shaped characters.
3.3.3 Template Matching
Unless text to be read is in hand-written form, it is common for OCR
software to segment the characters and then perform recognition on the segmented
image. The simplest methods for segmentation usually involve the projection of
row and column pixels and placing divisions at local minima of the projection
functions. In our data, the resolution is too low to segment characters reliably in
this fashion, and we therefore decided to apply simple template matching instead,
which can simultaneously find both the location of characters and their identity.
Figure 3.6: Character frequencies across our training and test datasets.
The algorithm can be described as follows. For each example of each
character, we search all possible offsets of the template image in the license plate
image and record the top N best matches. The searching is done using the NCC
metric shown in Equation (3.2), and a threshold on the NCC score is applied before
considering a location a possible match. If more than one character matches a
region the size of the average character, the character with the higher correlation is
chosen and the character with the lower correlation is discarded. Once all templates
have been searched, the characters for each region found are read left to right
forming a string. N depends on the resolution of the license plate image and
should be chosen such that not all N matches cluster around a single character when
the same character occurs more than once on a plate, yet not so large that every
possible region ends up being processed.
This method may seem inefficient; however, the recognition process takes
on the order of half a second for a resolution of 104 × 31, which we found to be
acceptable. This recognition time is much smaller than the several seconds required
to estimate a super-resolution image. Our results for this method are shown in
Section 3.4.
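The following Python sketch illustrates the procedure. It is a simplified placeholder rather than the exact implementation: the NCC threshold, the value of N, and the one-character-per-region suppression rule are illustrative, and the alphabet is assumed to map each character to a list of equally sized template images.

```python
import numpy as np

def ncc(a, b):
    a = a - a.mean()
    b = b - b.mean()
    return float((a * b).sum() / (np.sqrt((a * a).sum() * (b * b).sum()) + 1e-12))

def read_plate(plate, alphabet, top_n=20, thresh=0.6):
    """Template-matching OCR sketch over a grayscale plate image."""
    th, tw = next(iter(alphabet.values()))[0].shape   # assumes equal template sizes
    H, W = plate.shape
    candidates = []                                   # (score, x-position, character)
    for char, templates in alphabet.items():
        scores = []
        for tmpl in templates:
            for y in range(H - th + 1):
                for x in range(W - tw + 1):
                    s = ncc(tmpl, plate[y:y + th, x:x + tw])
                    if s >= thresh:
                        scores.append((s, x, char))
        scores.sort(reverse=True)
        candidates.extend(scores[:top_n])             # keep the N best matches per character
    # Keep only the highest-correlation character within each character-sized region.
    candidates.sort(reverse=True)
    kept = []
    for s, x, char in candidates:
        if all(abs(x - kx) >= tw for _, kx, _ in kept):
            kept.append((s, x, char))
    kept.sort(key=lambda c: c[1])                     # read the characters left to right
    return ''.join(char for _, _, char in kept)
```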
3.3.4 Other Methods
We would like to propose several ideas for future work on license plate
OCR. The first method is to apply shape context matching [4] on characters
segmented after applying connected components and a thinning pre-processing
step [44] on the high-resolution estimates. Shape contexts have been shown to be very
effective at recognizing hand-written digits, and it is reasonable to presume that
the method might work well on license plate characters.
The second method that might benefit further research in this area is to
apply the AdaBoost framework to recognizing segmented characters. At the time
of this writing we are not aware of any OCR algorithms that use boosted classifiers, but the filters we presented in Chapter 2 may also be adapted to individual
characters, with the caveat that many more training examples would be required
and the AdaBoost classifier we presented would need to be modified for multiclass
classification.
Mori and Malik [32] use a Hidden Markov Model (HMM) to choose the
most likely word when performing text recognition in images with adversarial clutter. A similar method may apply to license plate recognition to learn and recognize
common character sequence types, such as a digit, followed by three letters, followed by three digits.
3.4 Results
Our template matching method was not well-suited for recognition on
the super-resolution images using the super-resolution templates in our training
alphabet. Our low-resolution templates yielded far better results on the test set,
which is most likely due to a better correlation resulting from the natural blur that
occurs in the low-resolution images, allowing more intra-class variance. Therefore,
in this section we present our results on just the low-resolution image sequences.
Figure 3.7 shows our recognition results on the low-resolution images
Figure 3.7: Template matching OCR results on the low-resolution test set for ‘standard’ and ‘loose’ comparisons between recognized characters and actual characters.
from the test set, taken from the second frame in the image sequence of the plate
trackers. We used the edit distance, sometimes also referred to as the Levenshtein
distance, to measure how similar our recognized text was to the labeled plates in
the test set. Because certain characters are easily confused with others, even by
humans, we also applied a ‘loose’ character equality test whenever the edit distance
algorithm compared two characters. The groups of characters {‘O’, ‘0’, ‘D’, ‘Q’},
{‘E’, ‘F’}, {‘I’, ‘T’, ‘1’}, {‘B’, ‘8’}, and {‘Z’, ‘2’} were each considered of the same
type and no penalty was applied for incorrect readings within the group. Figure
3.7 shows the number of license plates read with various numbers of mistakes with
and without using the ‘loose’ comparison measure.
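The ‘loose’ comparison can be folded directly into the standard dynamic-programming edit distance, as in the following sketch; the character groups are the ones listed above, and substitutions within a group are simply given zero cost.

```python
# Groups of characters treated as interchangeable under the 'loose' comparison.
LOOSE_GROUPS = [set('O0DQ'), set('EF'), set('IT1'), set('B8'), set('Z2')]

def chars_equal(a, b, loose=False):
    if a == b:
        return True
    return loose and any(a in g and b in g for g in LOOSE_GROUPS)

def edit_distance(read, truth, loose=False):
    """Levenshtein distance; within-group substitutions cost nothing when loose=True."""
    m, n = len(read), len(truth)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            sub = 0 if chars_equal(read[i - 1], truth[j - 1], loose) else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + sub)  # substitution
    return d[m][n]
```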
Figure 3.8 shows the template matching method applied to the actual low-resolution images in the test set. Note that over half of the test set was recognized
with two or fewer mistakes. One can observe a large degradation in image quality
with each progressive horizontal section. The template matching is most often
thwarted by plate boundaries, which are more and more visible as the size of the
plate decreases.
Our goal for this thesis was to have an unconstrained LPR system, and
Figure 3.8: Recognition results for the images in our test set. Each horizontal
section lists plates whose read text contained 0, 1, 2, 3, 4, 5, 6, and 7 mistakes.
these OCR rates are quite satisfactory for our purposes. An alternative to super-resolution would be to perform OCR on each image in the sequence and obtain
the most likely text in that fashion; however, this experiment was not performed.
Chapter IV
Make and Model Recognition
As with our license plate recognition problem, detecting the car is the
first step to performing make and model recognition (MMR). To this end, one can
apply a motion segmentation method such as [50] to estimate a region of interest
(ROI) containing the car. Instead, we decided to use the location of detected
license plates as an indication of the presence and location of a car in the video
stream and to crop an ROI of the car for recognition. This method would also be
useful for make and model recognition in static images, where the segmentation
problem is more difficult.
In this chapter, we describe several feature-based and appearance-based
methods commonly used in object recognition and evaluate their recognition rates
on car images extracted from our video stream.
4.1 Previous Work
To the best of our knowledge, MMR is a fairly unexplored recognition
problem. Various work has been done on car detection in street scene images
[29] [43] [39] and aerial photographs [41]. Dorko and Schmid [13] use scale invariant
features to detect cars in images with 50% background on average. Agarwal et
al. [1] automatically create a vocabulary of car parts, such as tires and windshields,
from training images and detect cars by finding individual parts and comparing
their spatial relations. Interestingly, most of the car detection literature only deals
with side-views of cars, perhaps because from a large distance the side profile
provides richer and thus more discriminating features.
The work of Ferencz et al. [14] is most closely related to our problem
statement. Their work is helping develop a wide-area car tracking system and is
not formulated as a recognition problem, but as what they call an object identification
problem. In our system we are interested in determining to which make and model
class a new vehicle belongs, and although all classes consist of cars, there is a fair
amount of variation within each of the make and model classes. In contrast, Ferencz
et al. are interested in determining whether two images taken at different times
and camera orientations are of the exact same car, where there is really only a
single example that serves as a model. They solve this problem by automatically
finding good features on side views of cars from several hundred pairs of training
examples, where good features refer to features that are good at discriminating
between cars from many small classes.
4.2 Datasets
We automatically generated a database of car images by running our
license plate detector and tracker on several hours of video data and cropping a
fixed window of size 400 × 220 pixels around the license plate of the middle frame
of each tracked sequence. This method yielded 1,140 images in which cars of each
make and model were of roughly the same size since the license plate detector was
specialized to respond to a narrow range of license plate sizes. The majority of
these images are shown in Figure 4.1. The crop window was positioned such that
the license plate was centered in the bottom third of the image. We chose this
position as a reference point to ensure matching was done with only car features
and not background features. Had we centered the license plate both vertically
and horizontally, cars that have their plates mounted on their bumper would have
exposed the road in the image. Although this method worked well in most cases,
for some cars, the position of the license plate was off-center horizontally, which
allowed for non-car regions to be included in the ROI.
After collecting these images, we manually assigned make, model, and
year labels to 790 of the 1,140 images. We were unable to label the remaining
350 images due to our limited familiarity with those cars. We often made use of
the California Department of Motor Vehicles’ web site [23] to determine the makes
and models of cars with which we were not familiar. The web site allows users to
enter a license plate or vehicle identification number for the purposes of checking
whether or not a car has passed recent smog checks. For each query, the web site
returns smog history as well as the car’s make and model description if available.
The State of California requires all vehicles older than three years to pass a smog
check every two years. Therefore, we were unable to query cars that were three
years old or newer and relied on our personal experience to label them.
We split the 1,140 labeled images into a query set and a database set.
The query set contains 38 images chosen to represent a variety of make and model
classes, in some cases with multiple queries of the same make and model but
different year in order to capture the variation of model designs over time. We
evaluated the performance of each of the recognition methods by finding the best
match in the database for each of the query images.
4.3 Appearance-based Methods
Appearance-based object recognition methods work by treating entire
images as feature vectors and comparing these vectors with a training vector space.
An M × N image would be transformed into a single MN-dimensional feature
vector consisting of just pixel intensities from the image. In practice, M and
N are too large to search the training vector space efficiently for a best match
and some sort of dimensionality reduction is done first. Common dimensionality
Figure 4.1: Our automatically generated car database. Each image is aligned such
that the license plate is centered a third of the distance from bottom to top. Of
these images, 1,102 were used as examples, and 38 were used as queries to test the
recognition rates of various methods. We used the AndreaMosaic photo-mosaic
software to construct this composite image.
50
reduction techniques are principal component analysis (PCA) [33] [34] and the
Fisher transform [3].
Because appearance-based methods work directly with feature vectors
consisting entirely of pixel brightness values (which directly correspond to the radiance of light emitted from the object), they are not good at handling illumination
variability in the form of intensity, direction, and number of light sources, nor variations in scale. The Fisherface method [3] and Illumination Cones [18] address the
illumination variability problem but are not invariant to scale.
In this section, we describe the Eigenface recognition method, which has
frequently been used in face recognition, and evaluate its performance on MMR.
4.3.1 Eigencars
In principal component analysis, a set of feature vectors from a high-
dimensional space is projected onto a lower dimensional space, chosen to capture the variation of the feature vectors. More formally, given a set of N images
{x1 , x2 , ..., xN }, each expressed as an n-dimensional feature vector, we seek a linear transformation, W ∈ Rn×m , that maps each xk into an m-dimensional space,
where m < n, such that
W^\top \Sigma W        (4.1)

is maximized. Here,

\Sigma = \frac{1}{N-1} \sum_{i=1}^{N} (x_i - \mu)(x_i - \mu)^\top,

and

\mu = \frac{1}{N} \sum_{i=1}^{N} x_i
is the average image. The covariance matrix Σ is also referred to as the total scatter
matrix [3] since it measures the variability of all classes in the n-dimensional feature
vectors.
Finding the W that maximizes Equation (4.1) is an eigenvalue problem. Since n is usually very large (in our case 88,000) and much larger than N
(1,102), computing the eigenvectors of Σ directly is computationally and storage-prohibitive. Instead, consider the matrix
A = [x1 − µ, x2 − µ, ..., xN − µ] .
(4.2)
Then, Σ = AA⊤. Using singular value decomposition (SVD), A can be decomposed
as A = UDV⊤, where U and V are orthonormal and of size n × N and N × N
respectively, and D is an N × N diagonal matrix. Using this decomposition, Σ
becomes

\Sigma = A A^\top = U D V^\top (U D V^\top)^\top = U D V^\top V D^\top U^\top = U D^2 U^\top,        (4.3)
where D2 consists of {λ1 , λ2 , ..., λN }, where λi are the first N eigenvalues of Σ, with
the corresponding eigenvectors, and, therefore columns of W , in the columns of
U . Because these eigenvectors are of the same dimensions as the set of xi images,
they can be visualized and in the face recognition literature are referred to as
‘eigenfaces’ [48]. We chose to more aptly call them ‘eigencars’ since our domain of
input images consists of cars. The first ten eigenvectors corresponding to the ten
largest eigenvalues are shown in Figure 4.2 (b), and µ is shown in Figure 4.2 (a).
The eigencars recognition algorithm can then be described as follows:
Off-line
1. Construct the A matrix from a set of N images {x1 , x2 , ..., xN }
2. Compute the SVD of A to obtain the eigenspace U and the diagonal matrix
D containing the eigenvalues in decreasing order
3. Project each of the N column vectors of A onto the eigenspace U to obtain
a low-dimensional N × N feature matrix F = A> U , and scale each row of F
by the diagonal of D
Figure 4.2: (a) The average image, and (b) the first 10 eigencars.
On-line
1. Subtract the average image µ from the query image q: q′ = q − µ
2. Project q′ onto the eigenspace U to obtain an N-dimensional feature vector
f and scale f by the diagonal of D
3. Find the row k of F that has the smallest L2 distance to f and consider xk
to be the best match to q
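A minimal numpy sketch of the off-line and on-line stages follows. The training images are assumed to be flattened into the rows of a single array, and the drop_top argument is a hypothetical addition (not part of the algorithm above) for discarding the leading eigenvectors, as discussed in the results below.

```python
import numpy as np

def train_eigencars(images, drop_top=0):
    """Off-line stage. `images` is an (N, n) array of flattened training images."""
    mu = images.mean(axis=0)
    A = (images - mu).T                        # n x N matrix of centered images
    U, d, _ = np.linalg.svd(A, full_matrices=False)
    U, d = U[:, drop_top:], d[drop_top:]       # optionally drop the largest eigenvectors
    F = (A.T @ U) * d                          # project and scale each row by the spectrum
    return mu, U, d, F

def query_eigencars(q, mu, U, d, F):
    """On-line stage: return the index of the closest training image."""
    f = ((q - mu) @ U) * d
    return int(np.argmin(np.linalg.norm(F - f, axis=1)))
```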
Results
We applied the algorithm to our database and query sets and obtained a
recognition rate of only 23.7%. This is a very low recognition rate; however, the
recognition rate using random guessing is 2.5%.
Figures 4.3 and 4.4 show the query images and the top ten matches
in the database for each query using the on-line recognition method. Note the
stark similarity in overall illumination of all matches for each query, even though
the matches contain a large variation of makes and models. This suggests the
algorithm is not recognizing car features, but rather illumination similarity.
Belhumeur et al. suggest that the three eigencars corresponding to the
three largest eigenvalues capture most of the variation due to lighting and that it is
best to ignore them. Indeed, discarding these eigenvectors increased the recognition
rate to 44.7%. The results of this modified approach are shown in Figures 4.5 and
4.6. Note that the matches no longer exhibit the strong similarity in illumination
as before. We also tried removing the 7 eigenvectors with the largest eigenvalues, which led to
a recognition rate of 47.4%. Removing any more eigenvectors, however, had a
negative effect.
Discussion
The most computationally intensive part of the eigencars algorithm is the
computation of F = A> U . With A consisting of the full resolution images, the
process takes about four hours, and requires roughly 1,500MB of RAM. We also
performed the recognition experiment on sub-scaled versions of the images with
200 × 110 resolution and found that this greatly reduced the off-line training time
and significantly reduced the memory requirements without adversely affecting the
recognition rate.
The on-line part of the algorithm is reasonably fast. It only takes one or
two seconds to project q′ onto the eigenspace U. We shall see that this is a strong
advantage of the appearance-based method when we evaluate the performance of
feature-based methods in Section 4.4.
The Fisherface [3] method is a more recent appearance-based recognition
method that has similar computational requirements as the Eigenface method and
has been shown to yield superior recognition rates in the face recognition domain
because it selects a linear transformation that maximizes the ratio of the between-class scatter to the within-class scatter. It therefore requires us to place our set
of xk training images into separate classes. Due to time constraints, we did not
Figure 4.3: The first 19 query images and the top 10 matches in the database for
each using all N eigencars.
Figure 4.4: The second 19 query images and the top 10 matches in the database
for each using all N eigencars.
Figure 4.5: The first 19 query images and the top 10 matches in the database for
each using N − 3 eigencars.
Figure 4.6: The second 19 query images and the top 10 matches in the database
for each using N − 3 eigencars.
evaluate this method.
4.4 Feature-based Methods
In contrast to appearance-based recognition methods, feature-based
recognition methods first find a number of interesting features in an image and
then use a descriptor representative of the image area around the feature location
to compare with features extracted from training images of objects. The features
should belong to the objects to be recognized, and should be sparse, informative,
and reproducible, the latter two properties being most important for object
recognition. If the features themselves are not sufficiently informative, descriptors
are used for matching, where the descriptors are usually constructed
from the image structure around the features.
4.4.1 Feature Extraction
Here, we discuss several feature extraction methods commonly used in
object recognition.
Corner Detectors
In the computer vision community, interest point detection is often called
corner detection even though not all features need be corners. Corner detection
is often used for solving correspondence problems, such as in stereopsis. Corner
features occur in an image where there is a sharp change in the angle of the
gradient. In practice, these points of sharp change in the angle of the gradient
do not always correspond to real corners in the scene, for example in the case of
occlusion junctions.
Two popular corner detectors are the Harris [19] and Förstner [15] detectors. The output of a Harris corner detector on a car image from our dataset is
shown in Figure 4.7.
Figure 4.7: Harris corner detections on a car image. Yellow markers indicate
occlusion junctions, formed by the intersection of edges on surfaces of different
depths.
Corner features by themselves are not sufficiently informative for object
recognition, but Agarwal et al. [1] combine them with patches of the image used
as a descriptor.
Salient Features
Kadir and Brady [25] have developed a low-level feature extraction
method inspired by studies of the human visual system. Their feature detector
extracts features at various scales that contain high entropy.
For each pixel
location x, the scale s is chosen in which the entropy is maximum, where by scale
we mean the patch size around x used to obtain a probability distribution P on
the pixel intensities used in the entropy H calculation:
H(s, x) = -\sum_{i=0}^{255} P_{s,x}(i) \log P_{s,x}(i).        (4.4)
Equation (4.4) assumes pixel intensities take on values between 0 and 255. Unlike
the corner detector, Kadir and Brady features carry a scale descriptor in addition
to their position in the image.
We created an efficient implementation of the detector using our integral
image optimization technique from Section 2.5.1 for the calculation of P around
x for the various scales. Our results on our car image are shown in Figure 4.8.
Figure 4.8: Kadir and Brady salient feature extraction results on (a) a car image
from our database, and (b) an image of a leopard.
We found that Kadir and Brady features had low repeatability when applied to
our car images and were, therefore, not further explored. They seem to be more
suitable in some images over others as can be seen in Figure 4.8 (b).
SIFT Features
The corner detector we described earlier is sensitive to changes in image
size and, therefore, does not provide useful features for matching images of different sizes. Scale invariant feature transform (SIFT) features recently developed
by Lowe [30] overcome this problem and are also invariant to rotation and even
partially invariant to illumination differences.
The process of extracting SIFT features consists of four steps: scale-space
extremum detection, keypoint localization, orientation assignment, and descriptor
assignment. The scale space L(x, y, σ) of an image I(x, y) is defined as a convolution of the image with a variable-scale Gaussian kernel:

L(x, y, \sigma) = G(x, y, \sigma) * I(x, y), \qquad \text{where} \qquad G(x, y, \sigma) = \frac{1}{2\pi\sigma^2} e^{-(x^2 + y^2)/2\sigma^2}.
Figure 4.9: SIFT keypoints and their orientations for a car image.
The scale parameter σ is quantized and keypoints are then localized by finding
extrema in
D(x, y, σ) = L(x, y, kσ) − L(x, y, σ),
where kσ is the next highest scale. The locations of the extrema are called keypoints.
Orientation assignment of each keypoint is then done by computing the gradient
magnitude m(x, y) and orientation θ(x, y) of the scale space for the scale of that
keypoint:
m(x, y) = \sqrt{(L(x+1, y) - L(x-1, y))^2 + (L(x, y+1) - L(x, y-1))^2}

\theta(x, y) = \tan^{-1} \frac{L(x, y+1) - L(x, y-1)}{L(x+1, y) - L(x-1, y)}
Figure 4.9 shows 352 keypoints and their orientations extracted from our example
car image from our database.
Finally, the descriptor is assigned by dividing the region around the keypoint into 16 symmetric sub-regions and assigning 8 orientation bins to each subregion. The final result is a 16 × 8 = 128-dimensional feature vector. When
comparing two SIFT descriptors, the L2 distance measure is used.
4.4.2 Shape Contexts
A shape context is an image descriptor introduced by Belongie et al.
[4] and has been shown to be very good for matching shapes. Some successful
Figure 4.10: (a) Query car image with two interest points shown, (b) database car
image with one corresponding interest point shown, (c) diagram of log-polar bins
used for computing shape context histograms, (d,e,f) shape context histograms for
points marked ‘B’, ‘C’, and ‘A’ respectively. The x-axis represents θ and the y-axis
represents log r increasing from top to bottom.
applications include hand-written digit recognition [4] and breaking “Completely Automated Public Turing Tests to Tell Computers and Humans Apart”
(CAPTCHA) [32] protection mechanisms used by internet companies such as Yahoo to deter automated signups for thousands of email accounts. Although the
shape context descriptor is best suited for binary images, we felt it would be interesting to test it in the context of grayscale car images.
The shape context descriptor is computed as follows. Given an interest
point x, we consider a circle of radius r centered on x and divide it into sections
according to a log-polar grid as shown in Figure 4.10 (c). We then count the number
of edge pixels within a radius r that fall in each bin. The resulting histogram is
known as the shape context of x. Figure 4.10 shows the shape context for a pair
of matching points and a point on the shape far away from each matching point.
Note the similarity in the descriptor for the corresponding points and how vastly
different it is for point A.
The shape context descriptors are usually compared using the χ² distance

d(h_i, h_j) = \sum_{\text{bins } k} \frac{\| h_i(k) - h_j(k) \|^2}{\| h_i(k) + h_j(k) \|},        (4.5)
where hi and hj are the two descriptors. Sometimes, the L2 distance is used
instead, though we found that using it had little effect on the overall recognition
results.
The original shape context work [4] used a histogram with 5 logarithmic
divisions of the radius and 12 linear divisions of the angle. In our recognition
experiments we also tried a histogram of size 9 × 4 in addition to the original
5 × 12. In [31], Mori et al. augment the shape context histogram to include edge
orientations, which we have not experimented with.
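The descriptor and the distance of Equation (4.5) can be sketched in a few lines of Python. The log-polar binning below is one plausible choice and is not necessarily identical to the binning used in [4] or in our experiments; the 5 radial and 12 angular bins and the 35-pixel radius follow the values quoted in the text.

```python
import numpy as np

def shape_context(points, center, n_r=5, n_theta=12, r_max=35.0):
    """Log-polar shape context histogram around one interest point.
    `points` is an (M, 2) array of sampled edge-point coordinates."""
    d = points - np.asarray(center, dtype=float)
    r = np.hypot(d[:, 0], d[:, 1])
    keep = (r > 0) & (r <= r_max)
    r, d = r[keep], d[keep]
    theta = np.arctan2(d[:, 1], d[:, 0]) % (2 * np.pi)
    # Logarithmic radial bin edges and linear angular bins.
    r_edges = np.logspace(np.log10(r_max / 2 ** (n_r - 1)), np.log10(r_max), n_r)
    r_bin = np.clip(np.searchsorted(r_edges, r), 0, n_r - 1)
    t_bin = (theta / (2 * np.pi) * n_theta).astype(int) % n_theta
    hist = np.zeros((n_r, n_theta))
    np.add.at(hist, (r_bin, t_bin), 1)   # count edge points falling in each bin
    return hist

def chi2_distance(h1, h2, eps=1e-12):
    """Chi-squared style distance between two shape context histograms (Eq. 4.5)."""
    return float(np.sum((h1 - h2) ** 2 / (h1 + h2 + eps)))
```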
4.4.3 Shape Context Matching
Shape context matching was the first feature-based method we tried. The
algorithm we implemented works as follows:
1. For each image d in the database and a query image q, take a random
sampling of N points from the edge images (as shown in Figure 4.11) of q
and d and compute the shape context around each point.
2. For each database image d:
(a) For each sampled edge point pq in q find the best matching sampled
point pd in d within some radius threshold that has a shape context
with the smallest χ2 distance according to Equation (4.5).
(b) Sum all χ2 distances for every correspondence and treat the sum as
some cost c.
3. Choose the d that has the lowest cost c and consider that the best match.
Figure 4.11: (a) Image edges and (b) a random sampling of 400 points from the
edges in (a).
In the original work on shape contexts, step 2 was performed for several
iterations using the correspondences to compute a thin plate spline transformation
that transforms q. Since we are matching 3-D rigid bodies under an approximately
affine camera, we instead computed an estimation for the affine transformation
using RANSAC that would best align q with d but found that the affine estimate
was not sufficiently stable and the natural alignment obtained by using the license
plate as a reference point was sufficient.
Our recognition rates on our query set were 65.8% using a 5 × 12-size
shape context and 63.2% using a 9 × 4-size shape context. The radius of the
descriptor we used was 35 pixels and the sampling size N of points was 400.
4.4.4 SIFT Matching
We also explored matching query images using the SIFT feature extractor
and descriptor discussed earlier. The algorithm we used was the following:
1. For each image d in the database and a query image q, perform keypoint
localization and descriptor assignment as described in Section 4.4.1.
2. For each database image d:
(a) For each keypoint kq in q find the keypoint kd in d that has the smallest
L2 distance to kq and is at least a factor of α smaller than the distance
to the next closest descriptor. If no such kd exists, examine the next
kq .
(b) Count the number of descriptors n that successfully matched in d.
3. Choose the d that has the largest n and consider that the best match.
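A minimal sketch of steps 2 and 3 on pre-computed descriptors is given below. It applies only the ratio test and omits the keypoint-pruning procedures described next; the descriptor arrays are assumed to come from any SIFT implementation, and α is an illustrative value.

```python
import numpy as np

def count_sift_matches(query_desc, db_desc, alpha=0.6):
    """Count query keypoints whose nearest database descriptor is closer than
    alpha times the distance to the second-nearest one."""
    n_matches = 0
    for q in query_desc:
        d = np.linalg.norm(db_desc - q, axis=1)
        if d.shape[0] < 2:
            continue
        i1, i2 = np.argsort(d)[:2]
        if d[i1] < alpha * d[i2]:
            n_matches += 1
    return n_matches

def best_database_match(query_desc, database):
    """Return the index of the database image with the most matched keypoints.
    `database` is a list of (K_i, 128) descriptor arrays, one per database image."""
    counts = [count_sift_matches(query_desc, db) for db in database]
    return int(np.argmax(counts))
```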
Discussion
We found that a few types of keypoint matches resulting from the above
algorithm did not contribute to the selection of a best car match. For example,
some matching keypoints corresponded to entire groups of digits and letters on
the license plates of a query image and a database image even though the
cars to which they belonged looked quite different. Since the best car match in
the database is determined based on the number of matched keypoints, spurious
matches should be ignored. We, therefore, applied the following keypoint pruning
procedures:
• Limit horizontal distance between matching keypoints. This helps remove
outliers when estimating an affine transformation between the query and
database images.
• Ignore keypoints that occur in the license plate region.
• Do not allow multiple query keypoints to match to the same database keypoint.
• Compute an affine transformation from the query to the database image
when there are more than three matching keypoints. If the scale, shear, or
translation parameters of the transformation are outside a threshold, set the
number of matching keypoints n to 0.
We used Lowe’s implementation [30] of the keypoint localization part of the algorithm. Unlike in Lowe’s implementation, the query’s keypoint descriptors were
compared with the keypoint descriptors of each image in the database. This means
that the second best descriptor was not chosen for an object other than the current database image. Also, modifying the threshold from the 0.36 appearing in
the published code to 0.60 (which is closer to the value suggested in Lowe’s paper) increased the number of matches, but had little effect on the overall recognition rate:
misclassified cars using one threshold were correctly classified with the other at
the expense of different misclassifications.
When the number of matching descriptors between the query image and
a database image is equal to that of another database image, we break the tie by
selecting the database image with the smaller overall L2 distance between all the
descriptors. This only occurred when the best matches in the database had one
or two matching descriptors, and applying the tie-break procedure had little effect
on the overall recognition rate.
Results
The SIFT matching algorithm described above yielded a recognition rate
of 89.5% on the query set. The recognition results are shown in Figures 4.12 – 4.15
for each query image. Note that the top 10 matches were all of the same make and
model for some of the queries with over 20 similar cars in the database.
Figure 4.12: Query images 1-10 and the top 10 matches in the database using
SIFT matching. Yellow lines indicate correspondences between matched keypoints
of the query (top) and database (bottom) images.
Figure 4.13: Query images 11-20 and the top 10 matches in the database using
SIFT matching. Yellow lines indicate correspondences between matched keypoints
of the query (top) and database (bottom) images.
Figure 4.14: Query images 21-29 and the top 10 matches in the database using
SIFT matching. Yellow lines indicate correspondences between matched keypoints
of the query (top) and database (bottom) images.
Figure 4.15: Query images 30-38 and the top 10 matches in the database using
SIFT matching. Yellow lines indicate correspondences between matched keypoints
of the query (top) and database (bottom) images.
Table 4.1: Summary of overall recognition rates for each method.

Method                                      Recognition rate
Eigencars using all eigenvectors            23.7%
Eigencars without the 3 highest             44.7%
Shape context matching with 9 × 4 bins      63.2%
Shape context matching with 5 × 12 bins     65.8%
SIFT matching                               89.5%
4.4.5 Optimizations
Finding the best match for a query image in our database of 1,102 images
for both shape context and SIFT matching takes about 30 seconds, compared to
0.5 seconds with the Eigencars method. The high recognition rate achieved with
SIFT matching is certainly appealing, but for our system to be real-time, MMR
must be as fast as the LPR algorithms.
Several possibilities exist that may help in that regard. Instead of comparing features in the query image with every single database image, it would be
useful to cluster the database images into groups of similar type, such as sedan,
SUV, etc. and perform a hierarchical search to reduce the number of comparisons.
A promising method that is applicable to our situation is the recent work
by Sivic and Zisserman [46]. They formulate the object recognition problem as
a text retrieval problem, which itself has been shown to be remarkably efficient
based on our daily experiences with internet search engines. Future work on MMR
should investigate the possibility of incorporating a similar approach.
4.5 Summary of Results
Table 4.1 summarizes the overall recognition rates of the appearance-
based and feature-based methods we evaluated.
Table 4.2 lists the queries used in our test set and shows which methods were able to classify each query correctly. Note that most of the queries that SIFT
matching was not able to classify correctly had 5 or fewer similar entries in
the database. It is reasonable to assume that having more examples per make and model
class will increase the recognition rate.
Table 4.2: Test set of queries used, with ‘Size’ indicating the number of cars similar
to the query in the database and which method classified each query correctly. The
38 queries are: VW Beetle; Honda Accord-1, -2, -3; Honda Civic-1, -2, -3, -4; Toyota
Camry-1, -2, -3, -4; Toyota Corolla-1 (dent), Corolla-1, -2, -4; VW Jetta-1, -2; Ford
Explorer-1, -2; Van, Van (occluded); Nissan Altima-1, -2, -3, -4; Nissan Sentra-5;
Toyota 4Runner; Ford Focus-1, -2; Ford Mustang; Honda CR-V; BMW 323; VW
Passat; Toyota Tundra; Toyota RAV4; Toyota Sienna-1, -2. The methods compared
are Eigencars (full), Eigencars (minus 3), shape contexts (5 × 12 and 9 × 4 bins),
and SIFT matching.
Chapter V
Conclusions and Future Work
5.1 Conclusions
We have presented a useful framework for car recognition that combines
LPR and MMR. Our recognition rates for both sub-problems are very promising
and can serve as an important foundation to a query-based car surveillance system.
Our LPR solution is real-time, works well with inexpensive camera hardware,
and does not require the infrared lighting or sensors that are normally used in commercial
LPR systems. Our MMR solution is also very accurate; however, further research
is required to make it real-time. We have suggested ideas on how this may be
achieved in Section 4.4.5.
5.1.1 Difficulties
At the start of our project, we anticipated several difficulties that we
would possibly encounter. Some of these include:
1. The weather can sometimes make the background undesirably dynamic, such
as swaying branches and even wind-induced camera shake.
2. Variability exists in the license plate designs of different states, and even in
the character spacing such as found in vanity plates.
3. Depending on the Sun’s position, a vehicle’s shadow may be mistaken as
being part of the vehicle.
4. Various types of vehicle body damage or even dirt might impact LPR and
MMR.
5. Recognition algorithms might only work during broad daylight or only with
very good lighting.
6. The surface of most cars is specular, a material property known to cause
problems for appearance-based recognition algorithms.
In this section, we discuss the observed performance of our system in each of the
above situations.
1. The effects of wind were heavily pronounced in the ‘Regents’ camera since the
camera’s optics were extended to their full zoom range and even light winds
caused camera shake. Even though image stabilization techniques could alleviate this effect, camera shake does not influence our license plate detection
algorithms because the entire frame is searched for license plates, and the
license plate tracker is sufficiently robust to handle the camera movement we
observed.
2. Our datasets did not include an adequate sampling of out-of-state plates
and vanity plates to determine how well our system would handle these instances. However, the few such plates we observed seemed to be detected
and recognized no differently.
3. Vehicle shadows did not affect our car recognition algorithms. Because of our
choice of license plate location as a reference point when segmenting the car
image, the segmented image contained only pixels belonging to a car and no
background except in very rare cases with SUVs whose plates are mounted
off-center. Even in those cases, our MMR algorithm performed well as seen
in Figure 4.14.
4. Figure 4.13 shows an example of a query of a car with a dent and a van
with partial occlusion. For both cases SIFT matching matched the correct
vehicle in the database, while the appearance-based methods failed. License
plate detection did in fact perform quite poorly on very old or dirty plates;
however, those instances were rare, and even we as humans were unable to
read those plates.
5. It might be worthwhile to investigate possible night-time make and model
recognition methods where some crude intuition might be formed about the
vehicle examined based on taillight designs. We have not experimented with
night-time video data, but external lighting would certainly be required in
those cases for our system to operate.
6. The specular material of car bodies had little observed effect on our MMR
rates. In most cases, reflections of tree branches on cars’ windows resulted
in features that simply did not match features in the database. In a few
instances, as seen in Figure 4.13, several features caused by a tree branch
reflection resulted in a match, but were simply not enough to impact the
overall recognition rate, and in general, with more examples per make and
model this would hardly be a problem.
5.2 Future Work
Although our work is a good start to a query-based car surveillance sys-
tem, further research is necessary to make such a system possible. In this section,
we discuss several features that still need to be researched and developed.
5.2.1 Color Inference
In addition to searching the surveillance database for cars using some
make and model description and a partial license plate, it would also be useful to
be able to search for a particular color car as the make and model information may
be incomplete. Various color- and texture-based image segmentation techniques
used in content-based image retrieval such as [9] may be suitable for our purpose.
Since we already segment cars statically using the license plate tracker,
we could simply compute a color histogram for the entire region and store this as a
color feature vector in addition to the existing SIFT feature vectors for each image
in the database. To assign a meaningful label to the color histogram, such as ‘red’,
‘white’, ‘blue’, etc., we can find the minimum distance, as described in [9], to a
list of pre-computed and hand-labeled color histograms for each interesting color
type.
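A sketch of this color-labeling step is shown below. The reference histograms are hypothetical placeholders that would be pre-computed from hand-labeled example crops, and a plain L2 distance stands in for the distance described in [9].

```python
import numpy as np

# Hypothetical hand-labeled reference histograms, one per color name,
# e.g. {'red': np.array([...]), 'white': np.array([...]), ...}.
REFERENCE_HISTOGRAMS = {}

def color_histogram(rgb_image, bins=8):
    """Joint RGB histogram of a car crop, normalized to sum to one."""
    pixels = rgb_image.reshape(-1, 3)
    hist, _ = np.histogramdd(pixels, bins=(bins, bins, bins), range=[(0, 256)] * 3)
    return hist.ravel() / max(hist.sum(), 1.0)

def label_color(rgb_image):
    """Assign the label of the closest reference histogram."""
    h = color_histogram(rgb_image)
    return min(REFERENCE_HISTOGRAMS,
               key=lambda name: np.linalg.norm(REFERENCE_HISTOGRAMS[name] - h))
```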
5.2.2 Database Query Algorithm Development
Due to the heavy computation necessary to perform both LPR and MMR,
a production surveillance system would require constant updates to a surveillance database as cars are detected in the live video stream and not simply as
an overnight batch process. An algorithm for querying such a database might
work as follows. Given any combination of partial license plate, make and model,
or color description of a car:
1. If partial license plate description is provided, perform a search for the license
plate substring and sort results using the edit distance measure described in
Chapter 3.
2. If make and model description is provided, search the top results from Step 1
for desired make and model. Otherwise search entire database for the given
make and model description.
3. If color is provided, return results from Step 2 with a similar color, as described in Section 5.2.1.
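A sketch of such a query routine is given below. The record fields and the exact filtering rules are illustrative assumptions only; in particular, a full system would rank plate matches by the (loose) edit distance of Chapter 3 rather than filter by an exact substring.

```python
def query_surveillance_db(records, plate_fragment=None, make_model=None, color=None):
    """Filter stored detections by any combination of partial plate, make and
    model, and color. Each record is assumed to be a dict with 'plate',
    'make_model', 'color', and 'timestamps' fields."""
    results = list(records)
    if plate_fragment:
        results = [r for r in results if plate_fragment in r['plate']]
    if make_model:
        results = [r for r in results if r['make_model'] == make_model]
    if color:
        results = [r for r in results if r['color'] == color]
    # Return the observation times so the video stream can be replayed.
    return [(r['plate'], r['timestamps']) for r in results]
```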
Queries in the database should return the times each matching vehicle
was observed and allow the user to replay the video stream for those times.
5.2.3 Make and Model 3-D Structure
In our MMR work, we have not explored car pose variation beyond what
normally occurs at the stop signs in our scenes. A robust MMR system should also
work well in scenes where there is a large variation of poses. This could require
the estimation of a car’s 3-D structure to be used as additional input to the MMR
algorithms.
Bibliography
[1] S. Agarwal, A. Awan, and D. Roth. Learning to detect objects in images via
a sparse, part-based representation. PAMI, 26(11):1475–1490, 2004.
[2] Y. Amit, D. Geman, X. Fan. A coarse-to-fine strategy for multiclass shape
detection. IEEE Trans. Pattern Analysis and Machine Intelligence, 26:1606–1621, 2004.
[3] P. Belhumeur, J. Hespanha, D. Kriegman, Eigenfaces vs. Fisherfaces: recognition using class specific linear projection. PAMI, pp 711–720. 1997.
[4] S. Belongie, J. Malik, J. Puzicha. Matching shapes. Proc. ICCV. pp. 454-461,
2001.
[5] G. Cao, J. Chen, J. Jiang, An adaptive approach to vehicle license plate
localization. Industrial Electronics Society, 2003. IECON ’03. Volume 2, pp
1786- 1791
[6] D. Capel. Image Mosaicing and Super-resolution. PhD thesis, University of
Oxford, 2001.
[7] D. Capel, A. Zisserman. Super-resolution enhancement of text image sequences. International Conference on Pattern Recognition, pages 600–605,
Barcelona, 2000.
[8] D. Capel, A. Zisserman. Super-resolution from multiple views using learnt
image models. In Proc. CVPR, 2001.
[9] C. Carson, S. Belongie, H. Greenspan, J. Malik. Blobworld: color- and texturebased image segmentation using EM and its Application to image querying
and classification. PAMI, 24(8):1026–1038, 2002.
[10] X. Chen, A. Yuille. Detecting and reading text in natural scenes. CVPR.
Volume: 2, pp. 366–373, 2004.
[11] P. Comelli, P. Ferragina, M. Granieri, F. Stabile. Optical recognition of motor
vehicle license plates. IEEE Trans. On Vehicular Technology, Vol. 44, No. 4,
pp. 790–799, 1995.
79
80
[12] K. Donaldson, G. Myers. Bayesian Super-Resolution of Text in Video with a
Text-Specific Bi-Modal Prior. SRI
http://www.sri.com/esd/projects/vace/docs/
IJDAR2003-Myers-Donaldson.pdf
[13] G. Dorko and C. Schmid. Selection of scale-invariant parts for object class
recognition. Proc. ICCV, 2003.
[14] A. Ferencz, E. Miller, J. Malik. Learning hyper-features for visual identification. NIPS, 2004.
[15] W. Förstner. E. Gülch. A fast operator for detection and precise location of
distinct points, corners and circular features. Proc. Intercommission Conference on Fast Processing of Photogrammetric Data, Interlaken. 281–305, 1987.
[16] Y. Freund. Boosting a weak learning algorithm by majority. Information and
Computation, Volume 121: 256–285, 1995
[17] Y. Freund, R. E. Schapire. A decision-theoretic generalization of on-line learning and an application to boosting. EuroCOLT 95, pages 23-37, SpringerVerlag, 1995.
[18] A. Georghiades, P. Belhumeur, D. Kriegman. From few to many: Illumination
Cone models for face recognition under variable lighting and pose. IEEE trans.
PAMI, pp.643–660, 2001.
[19] C. Harris, M. Stephens. A combined corner and edge detector. Alvey Vision
Conference, pp 147–151, 1988.
[20] H. Hegt, R. de la Haye, N. Khan. A high performance license plate recognition
system. SMC’98 Conference Proceedings. 1998 IEEE International Conference
on Systems, Man, and Cybernetics (Cat. No.98CH36218). IEEE. Part vol.5,
1998, pp.4357–62 vol.5. New York, NY, USA.
[21] http://en.wikipedia.org/wiki/London Congestion Charge
[22] http://www.cbp.gov/xp/CustomsToday/2001/December/custoday lpr.xml
[23] California DMV Smog Check Web Site
http://www.smogcheck.ca.gov/vehtests/pubtstqry.aspx
[24] M. Irani, S. Peleg. Super resolution from image sequences. In International
Conference on Pattern Recognition, pages 115120, 1990.
[25] T. Kadir and M. Brady. Saliency, scale and image description. Proc. IJCV,
45(2): 83–105, 2001.
81
[26] V. Kamat, S. Ganesan. An efficient implementation of the Hough transform for
detecting vehicle license plates using DSP’S. Real-Time Technology and Applications Symposium (Cat. No.95TH8055). IEEE Comput. Soc. Press. 1995,
pp.58–9. Los Alamitos, CA, USA.
[27] N. Khan, R. de la Haye, A. Hegt. A license plate recognition system. SPIE
Conf. on Applications of Digital Image Processing. 1998.
[28] K. Kim, K. Jung, and J. Kim, Color texture-based object detection: an application to license plate localization. Lecture Notes in Computer Science: International Workshop on Pattern Recognition with Support Vector Machines,
pp. 293–309, 2002.
[29] B. Leung. Component-based Car Detection in Street Scene Images. Master’s
Thesis, Massachusetts Institute of Technology, 2004.
[30] D. Lowe. Distinctive image features from scale-invariant keypoints. IJCV,
2(60):91–110, 2004.
[31] G. Mori, S. Belongie, J. Malik. Efficient Shape Matching Using Shape Contexts, PAMI (to appear), 2005
[32] G. Mori and J. Malik. Recognizing objects in adversarial clutter: breaking a
visual CAPTCHA. Proc. CVPR, 2003.
[33] H. Murase, S.K. Nayar, Visual learning and recognition of 3-D objects from
appearance. ICJV, 1995.
[34] H. Murase, M. Lindenbaum, Spatial temporal adaptive method for partial
eigenstructure decomposition of large images. Tech. Report 6527, Nippon Telegraph and Telephone Corporation, 1992.
[35] T. Naito, T. Tsukuda, K. Yamada, K. Kozuka. Robust recognition methods for inclined license plates under various illumination conditions outdoors.
Proc. of IEEE/IEEJ/JSAI International Conference on Intelligent Transportation Systems, pp. 697702,1999.
[36] T. Naito, T. Tsukada, K. Yamada, K. Kozuka, S. Yamamoto, Robust licenseplate recognition method for passing vehicles underoutside environment. IEEE
T VEH TECHNOL 49 (6): 2309–2319 NOV 2000.
[37] J. Nijhuis, M. Brugge, K. Helmholt, J. Pluim, L. Spaanenburg, R. Venema,
M. Westenberg. Car license plate recognition with neural networks and fuzzy
logic. Proceedings of IEEE International Conference on Neural Networks,
Perth, Western Australia, pp 21852903. 1995.
[38] K. Okuma, A. Teleghani, N. de Freitas, J. Little and D. Lowe. A boosted
particle filter: Multitarget detection and tracking, ECCV, 2004.
82
[39] C. Papageorgiou, T. Poggio. A trainable object detection system: car detection in static images. MIT AI Memo, 1673 (CBCL Memo 180), 1999.
[40] R. Schapire. The strength of weak learnability. Machine Learning, 5(2):197–
227, 1990.
[41] C. Schlosser, J. Reitberger, S. Hinz, Automatic car detection in high-resolution
urban scenes based on an adaptive 3D-model. Proc. IEEE/ISPRS Workshop
on ”Remote Sensing and Data Fusion over Urban Areas”. 2003.
[42] R. Schultz, R. Stevenson. Extraction of high-resolution frames from video
sequences. IEEE Transactions on Image Processing, 5(6):996–1011, 1996.
[43] H. Schneiderman, T. Kanade. A statistical method for 3D object detection
applied to faces and cars. IEEE CVPR, 2000.
[44] V. Shapiro, G. Gluhchev. Multinational license plate recognition system: segmentation and classification. Proc. ICPR 1051–4651. 2004.
[45] J. Shi, C. Tomasi, Good Features to track. Proc. IEEE Conf. on Computer
Vision and Pattern Recognition (CVPR94), Seattle, June 1994.
[46] J. Sivic, A. Zisserman. Video google: a text retrieval approach to object
matching in videos. Proc. ICCV, 2003.
[47] G. Sullivan., K. Baker, A. Worrall, C. Attwood, P. Remagnino, Model-based
vehicle detection and classification using orthographic approximations. Image
and Vision Computing. 15(8), 649–654.
[48] M. Turk, A. Pentland. Face recognition using eigenfaces. Proc. CVPR, 1991.
[49] P. Viola, M. Jones. Rapid object detection using a boosted cascade of simple features. Computer Vision and Pattern Recognition, 2001. CVPR 2001.
Proceedings of the 2001 IEEE Computer Society Conference on , Volume: 1,
8–14 Dec. 2001 Pages:I-511 - I-518 vol.1
[50] J. Wills, S. Agarwal, S. Belongie. What went where. Proc. CVPR pp. 98104,
2003.
[51] Y. Yanamura, M. Goto, D. Nishiyama, M. Soga, H. Nakatani, H. Saji. Extraction and tracking of the license plate using Hough transform and voted
block matching. IEEE IV2003 Intelligent Vehicles Symposium. Proceedings
(Cat. No.03TH8683). IEEE. 2003, pp.243–6. Piscataway, NJ, USA.
[52] A. Zomet, S. Peleg. Super-resolution from multiple images having arbitrary
mutual motion. Super-Resolution Imaging, Kluwer Academic, 2001.