UNIVERSITY OF CALIFORNIA, SAN DIEGO

Video-based Car Surveillance: License Plate, Make, and Model Recognition

A thesis submitted in partial satisfaction of the requirements for the degree Masters of Science in Computer Science

by

Louka Dlagnekov

Committee in charge:
Professor Serge J. Belongie, Chairperson
Professor David A. Meyer
Professor David J. Kriegman

2005

Copyright Louka Dlagnekov, 2005. All rights reserved.

The thesis of Louka Dlagnekov is approved:

Chair

University of California, San Diego, 2005

TABLE OF CONTENTS

Signature Page
Table of Contents
List of Figures
List of Tables
Acknowledgments
Abstract

I Introduction
  1.1 Problem Statement
  1.2 Social Impact
  1.3 Datasets
  1.4 Thesis Structure

II License Plate Detection
  2.1 Introduction
  2.2 Previous Work
  2.3 Feature Selection
  2.4 The AdaBoost Algorithm
  2.5 Optimizations
    2.5.1 Integral Images
    2.5.2 Cascaded Classifiers
  2.6 Results
    2.6.1 Datasets
    2.6.2 Results
  2.7 Future Work

III License Plate Recognition
  3.1 Tracking
  3.2 Super-Resolution
    3.2.1 Registration
    3.2.2 Point Spread Function
    3.2.3 Algorithm
    3.2.4 Maximum Likelihood Estimate
    3.2.5 Maximum a Posteriori Estimate
    3.2.6 Discussion
  3.3 Optical Character Recognition
    3.3.1 Previous Work
    3.3.2 Datasets
    3.3.3 Template Matching
    3.3.4 Other Methods
  3.4 Results

IV Make and Model Recognition
  4.1 Previous Work
  4.2 Datasets
  4.3 Appearance-based Methods
    4.3.1 Eigencars
  4.4 Feature-based Methods
    4.4.1 Feature Extraction
    4.4.2 Shape Contexts
    4.4.3 Shape Context Matching
    4.4.4 SIFT Matching
    4.4.5 Optimizations
  4.5 Summary of Results

V Conclusions and Future Work
  5.1 Conclusions
    5.1.1 Difficulties
  5.2 Future Work
    5.2.1 Color Inference
    5.2.2 Database Query Algorithm Development
    5.2.3 Make and Model 3-D Structure

Bibliography

LIST OF FIGURES

1.1 (a) A Dutch license plate and (b) a California license plate. Most cars in our datasets have plates of the form shown in (b), but at a much lower resolution.
1.2 A frame from the video stream of (a) the 'Regents' dataset and (b) the 'Gilman' dataset.
1.3 (a) 1,200 of 1,520 training examples for the 'Regents' dataset. (b) Same images variance normalized.
2.1 PCA on 1,520 license plate images. Note that about 70 components are required to capture 90% of the energy.
2.2 The means of the absolute value of the (a) x-derivative, and (b) y-derivative, and the variance of the (c) x-derivative, and (d) y-derivative.
2.3 Types of features selected by AdaBoost. The sum of values computed over colored regions are subtracted from the sum of values over non-colored regions.
2.4 Typical class conditional densities for weak classifier features. For some features, there is clearly a large amount of error that cannot be avoided when making classifications, however this error is much smaller than the 50% AdaBoost requires to be effective.
2.5 (a) The integral image acceleration structure. (b) The sum of the values in each rectangular region can be computed using just four array accesses.
2.6 A cascaded classifier. The early stages are very efficient and good at rejecting the majority of false windows.
2.7 The three sets of positive examples used in training the license plate detector – sets 1, 2, and 3, with a resolution of 71 × 16, 80 × 19, and 104 × 31, respectively.
2.8 ROC curves for a 5-stage cascade trained using 359 positive examples and three different choices of negative training examples.
2.9 ROC curves for (a) a single-stage, 123-feature detector, and (b) a 6-stage cascaded detector, with 2, 3, 6, 12, 40, and 60 features per stage respectively. The sizes of the images trained on in sets 1, 2, and 3 are 71 × 16, 80 × 19, and 104 × 31 respectively. The x-axis scales in (a) and (b) were chosen to highlight the performance of the detector on each set.
2.10 Examples of regions incorrectly labeled as license plates in the set 3 test set.
2.11 Detection on an image from the Caltech Computer Vision group's car database.
3.1 A car tracked over 10 frames (1.7 seconds) with a blue line indicating the positions of the license plate in the tracker.
3.2 Our image formation model. The (a) full-resolution image H undergoes (b) a geometric transformation T_k followed by (c) a blur with a PSF h(u, v); is (d) sub-sampled by S; and finally (e) additive Gaussian noise η is inserted. The actual observed image \hat{L}_k from our camera is shown in (f). The geometric transformation is exaggerated here for illustrative purposes only.
3.3 (a) The Huber penalty function used in the smoothness prior with α = 0.6 and red and blue corresponding the regions |x| ≤ α and |x| > α respectively; (b) an un-scaled version of the bi-modal prior with µ0 = 0.1 and µ1 = 0.9.
3.4 Super-resolution results: (a) sequence of images processed, (b) an up-sampled version of one low-resolution image, (c) the average image, (d) the final high-resolution estimate.
3.5 The alphabet created from the training set. There are 10 examples for each character for the low-resolution, average image, and super-resolution classes, shown in that respective order.
3.6 Character frequencies across our training and test datasets.
3.7 Template matching OCR results on the low-resolution test set for 'standard' and 'loose' comparisons between recognized characters and actual characters.
3.8 Recognition results for the images in our test set. Each horizontal section lists plates whose read text contained 0, 1, 2, 3, 4, 5, 6, and 7 mistakes.
4.1 Our automatically generated car database. Each image is aligned such that the license plate is centered a third of the distance from bottom to top. Of these images, 1,102 were used as examples, and 38 were used as queries to test the recognition rates of various methods. We used the AndreaMosaic photo-mosaic software to construct this composite image.
4.2 (a) The average image, and (b) the first 10 eigencars.
4.3 The first 19 query images and the top 10 matches in the database for each using all N eigencars.
4.4 The second 19 query images and the top 10 matches in the database for each using all N eigencars.
4.5 The first 19 query images and the top 10 matches in the database for each using N − 3 eigencars.
4.6 The second 19 query images and the top 10 matches in the database for each using N − 3 eigencars.
4.7 Harris corner detections on a car image. Yellow markers indicate occlusion junctions, formed by the intersection of edges on surfaces of different depths.
4.8 Kadir and Brady salient feature extraction results on (a) a car image from our database, and (b) an image of a leopard.
4.9 SIFT keypoints and their orientations for a car image.
4.10 (a) Query car image with two interest points shown, (b) database car image with one corresponding interest point shown, (c) diagram of log-polar bins used for computing shape context histograms, (d,e,f) shape context histograms for points marked 'B', 'C', and 'A' respectively. The x-axis represents θ and the y-axis represents log r increasing from top to bottom.
4.11 (a) Image edges and (b) a random sampling of 400 points from the edges in (a).
4.12 Query images 1-10 and the top 10 matches in the database using SIFT matching. Yellow lines indicate correspondences between matched keypoints of the query (top) and database (bottom) images.
4.13 Query images 11-20 and the top 10 matches in the database using SIFT matching. Yellow lines indicate correspondences between matched keypoints of the query (top) and database (bottom) images.
4.14 Query images 21-29 and the top 10 matches in the database using SIFT matching. Yellow lines indicate correspondences between matched keypoints of the query (top) and database (bottom) images.
4.15 Query images 30-38 and the top 10 matches in the database using SIFT matching. Yellow lines indicate correspondences between matched keypoints of the query (top) and database (bottom) images.

LIST OF TABLES

2.1 Negative examples remaining during training at each stage of the cascade. The three training operations shown are (1) initial training with 10,052 randomly chosen negative examples, (2) first bootstrap training with an additional 4,974 negative examples taken from false positives, (3) second bootstrap operation with another 4,974 negative examples taken from false positives from the previous training stage.
4.1 Summary of overall recognition rates for each method.
4.2 Test set of queries used with 'Size' indicating the number of cars similar to the query in the database and which method classified each query correctly.

ACKNOWLEDGEMENTS

I would like to thank the following people for helping make this thesis possible: Serge Belongie for being there every step of the way and always being available for consultation, even at three in the morning. David Meyer for arranging funding and for ongoing consultation. David Kriegman for very helpful initial guidance. My family for being understanding and supportive throughout my education. My best friend Brian for many enlightening discussions and proof reading drafts. David Rose of the UCSD Police Department and Robert Meza of the Campus Loss Prevention Center for providing access to car video data. This work has been partially supported by DARPA under contract F49620-02-C-0010.
x ABSTRACT OF THE THESIS Video-based Car Surveillance: License Plate, Make, and Model Recognition by Louka Dlagnekov Masters of Science in Computer Science University of California, San Diego, 2005 Professor Serge J. Belongie, Chair License Plate Recognition (LPR) is a fairly well explored problem and is already a component of several commercially operational systems. Many of these systems, however, require sophisticated video capture hardware possibly combined with infrared strobe lights, or exploit the large size of license plates in certain geographical regions and the (artificially) high discriminability of characters. One of the goals of this project is to develop an LPR system that achieves a high recognition rate without the need for a high quality video signal from expensive hardware. We also explore the problem of car make and model recognition for purposes of searching surveillance video archives for a partial license plate number combined with some visual description of a car. Our proposed methods will provide valuable situational information for law enforcement units in a variety of civil infrastructures. xi Chapter I Introduction License plate recognition (LPR) is widely regarded to be a solved problem, the technology behind the London Congestion Charge program being a well-known example. In an effort to reduce traffic congestion in Central London, the city imposes a daily fee on motorists entering a specified zone [21]. In order to automate the enforcement of the fee, over two hundred closed-circuit television (CCTV) cameras are in operation whose video streams are processed by an LPR system. If a plate is found whose registered owner has not paid the fee, the owner is fined. Other LPR systems are used by the U.S. Customs for more efficient crosschecks in the National Crime Information Center (NCIC) and Treasure Enforcement Communications System (TECS) for possible matches with criminal suspects [22]. The 407 ETR toll road in Ontario, Canada also uses LPR to fine motorists who do not carry a radio transponder and have not paid a toll fee. In the Netherlands LPR systems are in place that are fully automated from detecting speeding violations, to reading the license plate and billing the registered owner. All of these systems treat license plates as cars’ fingerprints. In other words, they determine a vehicle’s identity based solely on the plate attached to it. One can imagine, however, a circumstance where two plates from completely different make and model cars are swapped with malicious intent, in which case these systems would not find a problem. We as humans are also not very good 1 2 at reading cars’ license plates unless they are quite near us, nor are we very good at remembering all the characters. However, we are good at identifying and remembering the appearance of cars, and therefore their makes and models, even when they are speeding away from us. In fact, the first bit of information Amber Alert signs show is the car’s make and model and then its license plate number, sometimes not even a complete number. Therefore, given the description of a car and a partial license plate number, the authorities should be able to query their surveillance systems for similar vehicles and retrieve a timestamp of when that vehicle was last seen along with archived video data for that time. Despite the complementary nature of license plate and make and model information, to the best of our knowledge, make and model recognition is an unexplored problem. 
Various research has been done on detecting cars in satellite imagery and detecting and tracking cars in video streams, but we are unaware of any work on the make and model recognition (MMR) aspect. Because of the benefits that could arise from the unification of LPR and MMR, we explore both problems in this thesis. 1.1 Problem Statement Although few details are released to the public about the accuracy of com- mercially deployed LPR systems, it is known that they work well under controlled conditions and require high-resolution imaging hardware. Most of the academic research in this area also requires high-resolution images or relies on geographicallyspecific license plates and takes advantage of the large spacing between characters in those regions and even the special character features of commonly misread characters as shown in Figure 1.1 (a). Although the majority of license plates in our datasets were Californian and in the form of Figure 1.1 (b), the difficulty of the recognition task is comparable to other United States plates. The image shown in Figure 1.1 (b) is of much higher resolution than the images in our datasets and is 3 (a) (b) Figure 1.1: (a) A Dutch license plate and (b) a California license plate. Most cars in our datasets have plates of the form shown in (b), but at a much lower resolution. shown for illustrative purposes only. Our goal in this thesis is to design a car recognition system for surveillance purposes, which, given low-resolution video data as input is able to maintain a database of the license plate and make and model information of all cars observed for the purposes of performing queries on license plates and makes and models. In this thesis, we do not explore algorithms for such queries, but our results in this project are an invaluable foundation for that task. 1.2 Social Impact The use of any system that stores personally identifiable information should be strictly monitored for adherence to all applicable privacy laws. Our system is no exception. Since license plates can be used to personally identify individuals, queries to the surveillance database collected should only be performed by authorized users and only when necessary, such as in car theft or child abduction circumstances. Because our system is query-driven rather than alarm-driven, where by alarm-driven we mean the system issues an alert when a particular behavior is observed (such as running a red-light), slippery slope arguments toward a machine-operated automatic justice system do not apply here. The query-driven aspect also alleviates fears that such technology could be used to maximize state revenue rather than to promote safety. Although there exists the possibility of abuse of our system, this possibility exists in other systems too such as financial databases employed by banks and 4 other institutions that hold records of persons’ social security numbers. Even cell phone providers can determine a subscriber’s location by measuring the distance between the phone and cell towers in the area. In the end we feel the benefits of using our system far outweigh the potential negatives, and it should therefore be considered for deployment. 1.3 Datasets We made use of two video data sources in developing and testing our LPR and MMR algorithms. We shall refer to them as the ‘Regents’ dataset and the ‘Gilman’ dataset. The video data in both datasets is captured from digital video cameras mounted on top of street lamp poles overlooking stop signs. 
Figure 1.2 shows a typical frame captured from both cameras. These cameras, along with nearly 20 others, were set up in the Regents parking lots of UCSD as part of the RESCUE-ITR (Information Technology Research) program by the UCSD Police Department. The ‘Regents’ video stream has a resolution of 640 × 480 and sampling is done at 10 frames per second, while the ‘Gilman’ video stream has a resolution of 720 × 480 and is sampled at 6 frames per second. Due to the different hardware and different spatial positions of both cameras, the datasets have different characteristics. The camera in the ‘Regents’ dataset is mounted at a much greater distance from the stop sign and is set to its full optical zoom, while the ‘Gilman’ camera is much closer. The size of plates in the ‘Regents’ dataset are therefore much smaller, but exhibit less projective distortion as cars move through the intersection. On the other hand, the ‘Gilman’ camera is of higher quality, which combined with the larger plate sizes made for an easier character recognition task. Since only about a thousand cars pass through both intersections in an 8 hour recorded period, some sort of automation was necessary to at least scan through the video stream to find frames containing cars. An application was 5 (a) (b) Figure 1.2: A frame from the video stream of (a) the ‘Regents’ dataset and (b) the ‘Gilman’ dataset. written for this purpose, which searches frames (in a crude, but effective method of red color component thresholding, which catches cars’ taillights) for cars and facilitates the process of extracting training data by providing an interface for hand-clicking on points. Using this process, over 1,500 training examples were extracted for the ‘Regents’ dataset as shown on Figure 1.3(a). In the figure, time flows in raster scan order, such that the top left license plate image was captured at 8am and the bottom right at 4pm. Note the dark areas in the image – this is most likely a result of cloud cover, and this illumination change can be accounted for by variance normalizing the images as shown in Figure 1.3(b). Although this variance normalization technique does improve the consistency of license plate examples, it had little effect on the overall results and was not used so as to reduce unnecessary computation. However, we point it out as a reasonable solution to concerns that illumination differences may adversely affect recognition. Unless otherwise indicated, all references to datasets shall refer to the ‘Gilman’ dataset. 6 (a) (b) Figure 1.3: (a) 1,200 of 1,520 training examples for the ‘Regents’ dataset. (b) Same images variance normalized. 1.4 Thesis Structure Chapter 2 discusses the design and performance of a license plate detec- tor trained in a boosting framework. In Chapter 3 we present several important pre-processing steps on detected license plate regions and describe a simple algorithm to perform optical character recognition (OCR). The problem of make and model recognition is explored in Chapter 4, where we evaluate several wellknown and some state of the art object recognition algorithms in this novel setting. We conclude the thesis in Chapter 5 and discuss ideas for future research on car recognition. Chapter II License Plate Detection 2.1 Introduction In any object recognition system, there are two major problems that need to be solved – that of detecting an object in a scene and that of recognizing it; detection being an important requisite. 
In our system, the quality of the license plate detector is doubly important since the make and model recognition subsystem uses the location of the license plate as a reference point when querying the car database. In this chapter we shall discuss our chosen detection mechanism. The method we employ for detecting license plates can be described as follows. A window of interest, of roughly the dimensions of a license plate image, is placed over each frame of the video stream and its image contents are passed as input to a classifier whose output is 1 if the window appears to contain a license plate and 0 otherwise. The window is then placed over all possible locations in the frame and candidate license plate locations are recorded for which the classifier outputs a 1. In reality, this classifier, which we shall call a strong classifier, weighs the decisions of many weak classifiers, each specialized for a different feature of license plates, thereby making a much more accurate decision. This strong classifier is trained using the AdaBoost algorithm. Over several rounds, AdaBoost selects the 7 8 best performing weak classifier from a set of weak classifiers, each acting on a single feature. The AdaBoost algorithm is discussed in detail in Section 2.4. Scanning every possible location of every frame would be very slow were it not for two key optimization techniques introduced by Viola and Jones – integral images and cascaded classifiers [49]. The integral image technique allows for an efficient implementation and the cascaded classifiers greatly speed up the detection process, as not all classifiers need be evaluated to rule out most non-license plate sub-regions. With these optimizations in place, the system was able to process 10 frames per second at a resolution of 640 × 480 pixels. The optimizations are discussed in Section 2.5. Since the size of a license plate image can vary significantly with the distance from the car to the camera, using a fixed-size window of interest is impractical. Window-based detection mechanisms often scan a fixed-size window over a pyramid of image scales. Instead, we used three different sizes of windows, each having a custom-trained strong classifier for that scale. 2.2 Previous Work Most LPR systems employ detection methods such as corner template matching [20] and Hough transforms [26] [51] combined with various histogrambased methods. Kim et al. [28] take advantage of the color and texture of Korean license plates (white characters on green background, for instance) and train a Support Vector Machine (SVM) to perform detection. Their license plate images range in size from 79 × 38 to 390 × 185 pixels, and they report processing lowresolution input images (320 × 240) in over 12 seconds on a Pentium3 800MHz, with a 97.4% detection rate and a 9.4% false positive rate. Simpler methods, such as adaptive binarization of an entire input image followed by character localization, also appear to work as shown by Naito et al. [36] and [5], but are used in settings with little background clutter and are most likely not very robust. 9 Since license plates contain a form of text, we decided to face the detection task as a text extraction problem. Of particular interest to us was the work done by Chen and Yuille on extracting text from street scenes for reading for the blind [10]. Their work, based on the efficient object detection work by Viola and Jones [49], uses boosting to train a strong classifier with a good detection rate and a very low false positive rate. 
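Before turning to feature selection, the window-scanning procedure described in Section 2.1 can be made concrete with a short sketch. It is a simplified illustration only: the classifier is left abstract, and the window size and stride are placeholders rather than values from our detector.

```python
import numpy as np

def scan_frame(frame, classify, win_h=30, win_w=100, stride=4):
    """Slide a plate-sized window over the frame and record positive windows.

    'classify' stands for any strong classifier that returns 1 for a window
    that appears to contain a license plate and 0 otherwise; the window size
    and stride here are placeholders.
    """
    detections = []
    rows, cols = frame.shape
    for top in range(0, rows - win_h + 1, stride):
        for left in range(0, cols - win_w + 1, stride):
            if classify(frame[top:top + win_h, left:left + win_w]) == 1:
                detections.append((top, left))
    return detections

# Example with a dummy classifier that fires only on unusually high-contrast windows.
frame = np.random.rand(480, 640)
found = scan_frame(frame, lambda w: int(w.std() > 0.31))
print(len(found), "candidate windows")
```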
We found that this text detection framework also works well for license plate detection.

2.3 Feature Selection

The goal of this section is to find good features in the image contents of the window of interest, one for each weak classifier. The features to which the weak classifiers respond are important in terms of overall accuracy and should be chosen to discriminate well between license plates and non-license plates. Viola and Jones use Haar-like features, where sums of pixel intensities are computed over rectangular sub-windows [49]. Chen and Yuille argue that, while this technique may be useful for face detection, text has little in common with faces [10]. To support their assumption, they perform principal component analysis (PCA) on their training examples and find that about 150 components are necessary to capture 90 percent of the variance, whereas in typical face datasets only a handful would be necessary. To investigate whether this was the case with license plates, a similar plot was constructed, shown in Figure 2.1. Unlike the text of various fonts and orientations with which Chen and Yuille were working, license plates require far fewer components to capture most of the variance. However, an eigenface-based approach [48] to classification yielded very unsatisfactory results and is extremely expensive to compute over many search windows. Fisherface-based classification [3], which is designed to maximize the ratio of between-class scatter to within-class scatter, also yielded unsatisfactory results.

Figure 2.1: PCA on 1,520 license plate images. Note that about 70 components are required to capture 90% of the energy.

It is desirable to select features that produce similar results on all license plate images and are good at discriminating between license plates and non-license plates. After pre-scaling all training examples in the 'Regents' dataset to the same 45 × 15 size and aligning them, the sums of the absolute values of their x- and y-derivatives exhibit the pattern shown in Figure 2.2. The locations of the 7 digits of a California license plate are clearly visible in the y-derivative and y-derivative variance. Although the x-derivative and x-derivative variance show the form which Chen and Yuille report for text images, the y-derivative and y-derivative variance are quite different and yield a wealth of information.

Figure 2.2: The means of the absolute value of the (a) x-derivative and (b) y-derivative, and the variance of the (c) x-derivative and (d) y-derivative.

A total of 2,400 features were generated as input to the AdaBoost algorithm. These were a variation of the Haar-like features used by Viola and Jones [49], more generalized yet still computationally simple. A scanning window was evenly divided into between 2 and 7 regions of equal size, either horizontal or vertical. Each feature was then a variation on the sum of values computed in a set of the regions subtracted from the sum of values in the remaining set of regions. Each feature therefore reduces a window to a single scalar value on which a thresholding function can be applied. Some of these features are shown in Figure 2.3.

Figure 2.3: Types of features selected by AdaBoost. The sum of values computed over colored regions is subtracted from the sum of values over non-colored regions.
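The following sketch shows how such band features might be enumerated and evaluated; it is a simplified illustration (horizontal splits only, raw pixel sums), whereas the actual 2,400-feature set also includes vertical splits and the derivative and derivative-variance values described next.

```python
import numpy as np
from itertools import combinations

def band_features(win_h, win_w, max_bands=7):
    # Enumerate horizontal features: split the window into k (approximately)
    # equal strips for k = 2..max_bands and choose a proper subset of strips
    # to carry positive sign; the remaining strips carry negative sign.
    feats = []
    for k in range(2, max_bands + 1):
        edges = np.linspace(0, win_h, k + 1).astype(int)
        strips = list(zip(edges[:-1], edges[1:]))
        for r in range(1, k):
            for subset in combinations(range(k), r):
                feats.append((strips, set(subset)))
    return feats

def evaluate(feature, window):
    # Scalar feature value: sum over the selected strips minus sum over the rest.
    strips, positive = feature
    value = 0.0
    for idx, (top, bot) in enumerate(strips):
        s = window[top:bot, :].sum()
        value += s if idx in positive else -s
    return value

# Example on a random license-plate-sized window; vertical splits are analogous.
window = np.random.rand(16, 71)
feats = band_features(*window.shape)
print(len(feats), evaluate(feats[0], window))
```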
The values of the regions of each window were the means of pixel intensities, derivatives, or variances of derivatives. None of the features actually selected by AdaBoost used raw pixel intensities, however, probably because of their poor discriminating ability with respect to wide illumination differences. Each weak classifier was a Bayes classifier, trained on a single feature by forming class conditional densities (CCDs) from the training examples. The CCD for a typical weak classifier is shown in Figure 2.4. When making a decision, regions of feature values where the license plate CCD is larger than the non-license plate CCD are classified as license plate and vice versa, instead of using a simple one-dimensional threshold.

Figure 2.4: Typical class conditional densities for weak classifier features. For some features there is clearly a large amount of error that cannot be avoided when making classifications; however, this error is much smaller than the 50% AdaBoost requires to be effective.

Although the features we have described are rather primitive and inflexible, in the sense that they cannot respond to discontinuities other than vertical and horizontal ones, they lend themselves nicely to the optimization techniques discussed in Section 2.5. Steerable filters, Gabor filters, or other wavelet-based approaches are more general, but would be slower to compute.

2.4 The AdaBoost Algorithm

AdaBoost is a widely used instance of boosting algorithms. The term boosting refers to the process of strengthening a collection of weak learning algorithms to create a strong learning algorithm. Boosting was introduced by Schapire [40] in 1990, who showed that any weak learning algorithm could be transformed, or "boosted", into a strong learning algorithm. A more efficient version of the algorithm outlined by Schapire, called "boost-by-majority", was later presented by Freund [16], and in 1995 Schapire and Freund developed AdaBoost [17], "Ada" standing for "adaptive" since it adjusts adaptively to the errors observed in the weak learners.

The idea of boosting can be explained with the following example. Consider the problem of classifying email messages into junk email and regular email by examining the messages' keywords. A keyword we tend to see often in junk email is "click here", and we can classify messages as junk if they contain it. Although this may work for many junk emails, it will almost certainly also lead to many legitimate messages being misclassified. Classifying solely on the basis of the "click here" keyword is a good rule of thumb, but it is rather coarse. A better approach would be to find several of these rough rules of thumb and take advantage of boosting to combine them.

In its original form, AdaBoost is used to boost the classification accuracy of a single classifier, such as a perceptron, by combining a set of classification functions to form a strong classifier. As applied to this project, AdaBoost is used to select a combination of weak classifiers to form a strong classifier. The weak classifiers are called weak because they only need to be correct just over 50% of the time.
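Before stating the algorithm formally, the sketch below shows what one of these weak learners might look like in code: a histogram-based (CCD) classifier over a single scalar feature, trained from weighted examples, together with the weighted error that boosting uses to pick the best feature each round. The bin count and the use of NumPy are illustrative choices, not details of the thesis implementation.

```python
import numpy as np

class CCDWeakClassifier:
    """Bayes-style weak classifier over one scalar feature.

    Class-conditional densities are approximated by weighted histograms; a bin
    votes 'license plate' wherever the positive density exceeds the negative
    density, so no single one-dimensional threshold is assumed.
    """

    def __init__(self, n_bins=32):          # bin count is an illustrative choice
        self.n_bins = n_bins

    def train(self, values, labels, weights):
        lo, hi = values.min(), values.max()
        self.edges = np.linspace(lo, hi, self.n_bins + 1)
        pos = labels == 1
        # Weighted histograms play the role of the class-conditional densities.
        h_pos, _ = np.histogram(values[pos], self.edges, weights=weights[pos])
        h_neg, _ = np.histogram(values[~pos], self.edges, weights=weights[~pos])
        self.vote = (h_pos >= h_neg).astype(int)   # per-bin decision
        return self

    def classify(self, values):
        bins = np.clip(np.digitize(values, self.edges) - 1, 0, self.n_bins - 1)
        return self.vote[bins]

def weighted_error(clf, values, labels, weights):
    # The quantity AdaBoost minimizes when selecting the best feature in a round.
    return np.sum(weights * np.abs(clf.classify(values) - labels))
```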
At the start of training, each training example (x_1, y_1), ..., (x_n, y_n) is assigned a weight w_i = 1/(2m) for negatives and w_i = 1/(2l) for positives, where the x_i are positive and negative inputs, y_i ∈ {0, 1}, m is the number of negatives, and l is the number of positives. The uneven initial distribution of weights leads to the name "Asymmetric AdaBoost" for this boosting technique.

Then, for t = 1, ..., T rounds, each weak classifier h_j is trained and its error is computed as

\epsilon_t = \sum_i w_i \, |h_j(x_i) - y_i|.

The h_j with the lowest error is selected, and the weights are updated according to

w_{t+1,i} = w_{t,i} \, \frac{\epsilon_t}{1 - \epsilon_t}

if x_i is classified correctly, and are not modified if it is classified incorrectly. This essentially forces the weak classifiers to concentrate on "harder" examples that are most often misclassified. We implemented the weighting process in our Bayes classifiers by scaling the values used to build the CCDs.

After T rounds, T weak classifiers have been selected and the strong classifier makes classifications according to

h(x) = \begin{cases} 1 & \text{if } \sum_{t=1}^{T} \alpha_t h_t(x) \ge \tau \sum_{t=1}^{T} \alpha_t \\ 0 & \text{otherwise,} \end{cases}  \qquad (2.1)

where \alpha_t = \ln\frac{1 - \epsilon_t}{\epsilon_t} and τ is set to 1/2 to minimize the error. Schapire and Freund showed that the overall training error of the boosted classifier is bounded by a quantity that decreases exponentially with T.

2.5 Optimizations

In this section we discuss two key optimization techniques introduced by Viola and Jones [49], which allowed us to achieve very fast detection rates – 10 frames per second on 640 × 480 images.

2.5.1 Integral Images

The features described in Section 2.3 add values over one group of sections and subtract them from another group of sections. If these sections are m × n pixels in size, we would normally require mn array accesses. However, if we take advantage of their rectangular nature, we can reduce the accesses to four, regardless of the size of the section, using an integral image data structure.

Figure 2.5: (a) The integral image acceleration structure. (b) The sum of the values in each rectangular region can be computed using just four array accesses.

An integral image I' of an image I has the same dimensions as I and at each location (x, y) contains the sum of all the pixels in I above and to the left of (x, y):

I'(x, y) = \sum_{x' \le x,\; y' \le y} I(x', y').

With this structure in place, the sum of the pixel values in region D in Figure 2.5 (b) can be computed as

D = I'(w) + I'(z) - (I'(x) + I'(y)).

The integral image itself can be efficiently computed in a single pass over the image using the following recurrences:

r(x, y) = r(x - 1, y) + I(x, y)
I'(x, y) = I'(x, y - 1) + r(x, y),

where r(-1, y) and I'(x, -1) are defined to be 0.

Figure 2.6: A cascaded classifier. The early stages are very efficient and good at rejecting the majority of false windows.

For the images on which we trained and classified, we created integral images for raw pixel values, x-derivatives, and y-derivatives, as well as integral images for the squares of these three types of values. The integral image of squared values is useful for quickly computing the variance of the values in the sections of our features, since the variance can be computed as

\sigma^2 = \frac{1}{N} \sum x^2 \; - \; m^2,

where m is the mean and the sum is taken over the N values x in the section.
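The recurrences above translate directly into code. The following sketch builds an integral image with NumPy, computes a rectangular sum with four array accesses, and computes the variance of a region from a second integral image of squared values, as described above; it is an illustration rather than the detector's actual implementation.

```python
import numpy as np

def integral_image(img):
    # I'(x, y): cumulative sum above and to the left, computed in one pass per
    # axis; a leading row and column of zeros make the box-sum lookups clean.
    ii = np.cumsum(np.cumsum(img.astype(np.float64), axis=0), axis=1)
    return np.pad(ii, ((1, 0), (1, 0)))

def box_sum(ii, top, left, height, width):
    # Four array accesses, independent of the region size.
    return (ii[top + height, left + width] + ii[top, left]
            - ii[top, left + width] - ii[top + height, left])

def box_variance(ii, ii_sq, top, left, height, width):
    # sigma^2 = (1/N) * sum(x^2) - m^2, using the integral image of squares.
    n = height * width
    m = box_sum(ii, top, left, height, width) / n
    return box_sum(ii_sq, top, left, height, width) / n - m ** 2

# Example usage on a random "frame":
frame = np.random.rand(480, 640)
ii, ii_sq = integral_image(frame), integral_image(frame ** 2)
print(box_sum(ii, 100, 200, 30, 100), box_variance(ii, ii_sq, 100, 200, 30, 100))
```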
2.5.2 Cascaded Classifiers

At any given time there are at most a handful of license plates visible in a frame of video, yet there are on the order of (640 − 100) × (480 − 30) ≈ 200,000 window positions that require scanning, assuming a license plate image is 100 × 30 pixels. The number of regions to be classified as not containing a license plate clearly far exceeds the number that do. Luckily, it is not necessary to employ all classifiers selected by AdaBoost at each window position. The idea behind a cascaded classifier is to group the classifiers into several stages in order of increasing complexity, with the hope that the majority of regions can be rejected quickly by very few classifiers. Such a cascaded structure is depicted in Figure 2.6. Although a positive instance will pass through all stages of the cascade, this is a very rare event, and the cost is amortized.

Training the cascade is done stage by stage: the first stage is trained on all positive and negative examples, the second stage is trained on all positive examples and only the false positives of the first stage used as negative examples, and so on for the remaining stages. The justification for this selection of negative examples is that when the cascade is in operation there are many window instances which the latter stages will never be asked to classify, since the early stages will have rejected them; the training of the latter stages should therefore reflect the type of data those stages will see in practice. Usually, the largest percentage of negative examples is rejected in the first two stages, so the remaining stages of the cascade train on "harder" examples, have much higher false positive rates than the early stages, and as a result require more classifiers.

By increasing the threshold τ in Equation (2.1), which by default is set to yield a low error on the training data, we can decrease the false positive rate at the expense of a decrease in the detection rate. This adjustment allows us to generate the receiver operating characteristic (ROC) curves shown in the next section, and it also allows us to design the cascade with a desirable detection rate at each stage. Since the overall detection rate is given by

N = \prod_{i=1}^{K} n_i,

where n_i is the detection rate of each stage in the cascade and K is the number of stages, if we desire a 90% overall detection rate and K = 10, we would require each n_i to be 99%, since 0.99^{10} ≈ 0.90. The 99% per-stage detection rate can easily be achieved by decreasing the τ threshold in Equation (2.1), even at the expense of a high false positive rate at each stage. The overall false positive rate is given by

P = \prod_{i=1}^{K} p_i,

where p_i is the false positive rate of each stage. Even a high false positive rate of 40% at each stage would equate to an overall false positive rate of only 0.01%, since 0.40^{10} ≈ 0.0001.

The design of a good cascade is not trivial. Viola and Jones present a simple algorithm that determines the number of features to be used at each stage from a desired false negative and false positive rate [49]; however, it assumes that each feature is of equal computational complexity. In our case, and in Chen and Yuille's [10] cascaded classifier, this assumption does not hold. In principle one could design an algorithm to evaluate the time complexity of each feature type and choose how many features, and of what type, should be placed in K stages in order to minimize the overall running time of the classifier.
Unfortunately, this is a very difficult problem. In practice, however, one can design a reasonably good cascade using the guiding principle that efficient features should be evaluated near the front of the cascade and more computationally expensive features should be evaluated near the end of the cascade. In our chosen cascaded classifier, we did not allow AdaBoost to select variance-based features for the first stage since we wanted it to be very efficient at eliminating a large portion of window locations early on. We should also mention that not only is detection fast in a cascaded classifier, but so is its training. Since each stage eliminates a large number of negative examples, the latter stages train on a much smaller set of examples. For a 123-feature single-stage classifier, full training with two bootstrap operations takes 18 hours to train, whereas a 6-stage classifier with the same number of features in total takes 5 hours. 2.6 Results In this section we present our results on the ‘Gilman’ dataset. 2.6.1 Datasets Unlike in our ‘Regents’ dataset, the camera on the ‘Gilman’ dataset was mounted much closer to the intersection, which resulted in greater projective distortion of the license plate as each car progresses through the intersection. We 19 Figure 2.7: The three sets of positive examples used in training the license plate detector – sets 1, 2, and 3, with a resolution of 71 × 16, 80 × 19, and 104 × 31, respectively. investigated training our license plate detector on plate images of a single scale and performing detection on a pyramid of scales for each frame, but found that that detection rate was not as good as having a dedicated detector trained on several scales. Therefore, the final training and test datasets were created by sampling three images of each car when it is approaching, entering, and exiting the intersection for 419 cars over several hours of video. The plates were then manually extracted from these images and split into three sets of small, medium, and large area. This provided 359 training images and 60 test images for each of the three sets. The average size of a plate in each set was 71 × 16, 80 × 19, and 104 × 31 respectively. The images in each set are shown in Figure 2.7. To allow for an easier method of extracting negative examples for training and to test our detector, we 20 100 Detection Rate (%) 92 84 76 68 60 0 0.05 0.1 0.15 0.2 0.25 False Positive Rate (%) 20000 random 10000 random, 10000 FP 10000 random, 5000 FP, 5000 FP Figure 2.8: ROC curves for a 5-stage cascade trained using 359 positive examples and three different choices of negative training examples. ensured that each of the 419 frames sampled for each set contained at most one visible license plate. We generated additional positive training examples for each set by extracting images from 10 random offsets (up to 1/8 of the width and 1/4 of the height of license plates) of each license plate location (for a total of 3,590), all of the same size as the average license plate size for that set. We found that this yielded better results than just using the license plate location for a single positive example per hand-labeled region. Of course, when the detector was in operation, it fired at many regions around a license plate, which we in fact used as an indication of the quality of a detection. To generate negative examples, we picked 28 license plate-sized images from random regions known not to contain license plates in each positive frame, which resulted in 10,052 per set. 
We then applied a sequence of two bootstrap operations where false positives obtained from testing on the training data were used as additional negative examples for re-training the cascade. We found that two sequential bootstrap operations of 4,974 negative examples each were more effective 21 than a single bootstrap operation with 9,948 negative examples. A comparison of these two methods is given in Figure 2.8. 2.6.2 Results Figure 2.9 shows a receiver operating characteristic (ROC) curve for our cascaded detector, and a single-stage cascade detector with the same number of features. There appears to be a trend indicating that a larger set (in terms of image size) is learned better than a smaller set. This is most likely due to the detector having access to more information content per image and as a result is able to better discriminate between license plates and non-license plates. In fact, when our detector was trained on the ‘Regents’ dataset where plate sizes were on average only 45 × 15 pixels, the detection rates were much lower even though more training examples were used. The ROC improvement for the resolution increase between sets 1 and 2 does not appear in the single-stage cascade, most likely because it is not a large increase. Table 2.1 shows the number of negative examples remaining at each stage of the cascade during the three training operations. Stages using the same number of negative examples as the previous indicate that the desired detection rate of 99.5% could not be maintained at the previous stage, and the τ threshold of Equation (2.1) was unchanged. Note that with each bootstrap operation the number of negative examples that enter the last stage of the cascade increases a lot more quickly than the linear increase of negative examples because the false positives represent ‘harder’ examples. As was to be expected, the cascaded classifier was much faster in operation with each frame requiring about 100 ms to process, whereas the single-stage classifier required over 3 seconds, but exhibited a superior ROC curve. Figure 2.10 shows a few examples of regions that our detector incorrectly labeled as license plates in our test dataset. Perhaps not surprisingly, a large number of them are text from advertising on city buses, or the UCSD shuttle. 22 100 Detection Rate (%) 92 84 76 68 60 0 0.004 Set 1 0.008 0.012 False Positive Rate (%) Set 2 0.016 0.02 Set 3 (a) 100 Detection Rate (%) 92 84 76 68 60 0 0.05 Set 1 0.1 0.15 False Positive Rate (%) Set 2 0.2 0.25 Set 3 (b) Figure 2.9: ROC curves for (a) a single-stage, 123-feature detector, and (b) a 6stage cascaded detector, with 2, 3, 6, 12, 40, and 60 features per stage respectively. The sizes of the images trained on in sets 1, 2, and 3 are 71 × 16, 80 × 19, and 104 × 31 respectively. The x-axis scales in (a) and (b) were chosen to highlight the performance of the detector on each set. 23 Table 2.1: Negative examples remaining during training at each stage of the cascade. The three training operations shown are (1) initial training with 10,052 randomly chosen negative examples, (2) first bootstrap training with an additional 4,974 negative examples taken from false positives, (3) second bootstrap operation with another 4,974 negative examples taken from false positives from the previous training stage. 
Stage       # of Features     (1)       (2)       (3)
1           2                 10,052    15,026    20,000
2           3                  1,295     4,532    20,000
3           6                  1,295     4,532     8,499
4           12                   537     2,217     5,582
5           40                   207       861     2,320
6           60                     0       152       552
Remaining                          0         0        14

Figure 2.10: Examples of regions incorrectly labeled as license plates in the set 3 test set.

Those that contain taillights can easily be pruned by applying a color threshold. We also applied our license plate detector to a few car images from the Caltech Computer Vision group's car database, whose image quality is far better than that of the video cameras used to create our datasets, and we found that many license plates were detected correctly, at the expense of a high number of false positives due to vegetation, for which our detector was not given negative examples. These could easily be pruned as well, simply by applying yet another color threshold. Figure 2.11 shows the output of our detector on one of these images.

Figure 2.11: Detection on an image from the Caltech Computer Vision group's car database.

We did not achieve as low a false positive rate per detection rate on our datasets as either Chen and Yuille or Viola and Jones, but the false positive rate of 0.002% for a detection rate of 96.67% in set 3 is quite tolerable. In practice, the number of false positives per region of each frame is small compared to the number of detections around a license plate in the frame. Therefore, in our final detector we do not consider a region to contain a license plate unless the number of detections in the region is above a threshold.

2.7 Future Work

It would be advantageous to investigate other types of features to place in the latter stages of the cascade in order to reduce the false positive rate. Color-based discrimination would be especially useful, since most plates contain a bi-modal color distribution of a white background and black or dark blue text. Other features mentioned by Chen and Yuille [10], such as histogram tests and edge-linking, were not tried but should be, to test their performance in a license plate detection setting.

Chapter III

License Plate Recognition

In this chapter, we present a process to recognize the characters on detected license plates. We begin by describing a method for tracking license plates over time and how this can provide multiple samplings of each license plate for the purpose of enhancing it for higher quality character recognition. We then describe our optical character recognition (OCR) algorithm and present our recognition rates.

3.1 Tracking

More often than not, the false positive detections from our license plate detector were erratic and, even when they fell on the car body, their positions were not temporally consistent. We use this fact to our advantage by tracking candidate license plate regions over as many frames as possible. Then, only those regions with a smooth trajectory are deemed valid. The tracking of license plates also yields a sequence of samplings of each license plate, which are used as input to a super-resolution pre-processing step before OCR is performed on them.

Numerous tracking algorithms exist that could be applied to our problem. Perhaps the most well-known and popular is the Kanade-Lucas-Tomasi (KLT) tracker [45]. The KLT tracker makes use of a Harris corner detector to detect good features to track in a region of interest (our license plate) and measures the similarity of every frame to the first, allowing for an affine transformation. Sullivan et al.
[47] make use of a still camera for the purposes of tracking vehicles by defining regions of interest (ROI) chosen to span individual lanes. They initiate tracking when a certain edge characteristic is observed in the ROI and make predictions on future positions of vehicles. Those tracks with a majority of accurate predictions are deemed valid. Okuma et al. [38] use the Viola and Jones [49] framework to detect hockey players and then apply a mixture particle filter using the detections as hypotheses to keep track of the players.

Although each of these tracking methods would probably have worked well in our application, we chose a far simpler approach which worked well in practice. Because detecting license plates is efficient, we simply run our detector on each frame, and for each detected plate we determine whether that detection is a new plate or an instance of a plate already being tracked. To determine whether a detected plate is new or not, the following conditions are checked:

• the plate is within T pixels of an existing tracker
• the plate is within T′ pixels of an existing tracker and the plate is within θ degrees of the general direction of motion of the plates in the tracker's history

If either of these is true, the plate is added to the corresponding tracker; otherwise a new tracker is created for that plate. In our application T′ was an order of magnitude larger than T. Figure 3.1 shows the tracking algorithm in action.

Our tracking algorithm was also useful for discarding false positives from the license plate detector. The erratic motion of erroneous detections usually resulted in the initiation of several trackers, each of which stored few image sequences. Image sequences of 5 frames or fewer were discarded.

Figure 3.1: A car tracked over 10 frames (1.7 seconds) with a blue line indicating the positions of the license plate in the tracker.

3.2 Super-Resolution

Video sequences such as the ones obtained from our cameras provide multiple samplings of the same surface in the physical world. These multiple samples can sometimes be used to extract higher-resolution images than any of the individual samples. The process of extracting a single high-resolution image from a set of lower-resolution images is called super-resolution. Super-resolution is different from what is known as image restoration, where a higher-resolution image is obtained from a single image, a process also sometimes referred to as enhancement.

The investigation into super-resolution was inspired by the low-resolution license plate images in our 'Regents' dataset. In that dataset, the noisy and blurry 45 × 15 pixel license plate images made it very difficult to read the text on the plates.

Before we describe the super-resolution algorithm we shall use, we shall describe our assumed image formation model. A plane in the scene undergoes a geometric transformation that maps its world coordinates to those of the camera. The optics of the camera blur the resulting projection, at which point the camera samples it at the low resolution we observe.
Because of imperfections in the sampling device, noise is introduced, which we shall assume to be spatially uncorrelated, additive, and Gaussian-distributed with zero mean and constant variance. Expressed in more formal terms, the imaging process is

\hat{L}_k(x, y) = S\!\downarrow \big( h(x, y) * H(T_k(x, y)) \big) + \eta(x, y),  \qquad (3.1)

with the following notation:

\hat{L}_k – k-th estimated low-resolution image
S\!\downarrow – down-sampling operator by a factor of S
h – point spread function (PSF)
* – convolution operator
H – high-resolution image
T_k – geometric transformation
η – additive noise

This image formation process is illustrated in Figure 3.2.

Figure 3.2: Our image formation model. The (a) full-resolution image H undergoes (b) a geometric transformation T_k followed by (c) a blur with a PSF h(u, v); is (d) sub-sampled by S; and finally (e) additive Gaussian noise η is inserted. The actual observed image \hat{L}_k from our camera is shown in (f). The geometric transformation is exaggerated here for illustrative purposes only.

Note that the actual observed image in Figure 3.2 (f) appears to have a further blurring effect after the additive noise step when compared to Figure 3.2 (e). This could be due to a slight motion blur, which is not taken into account by our model.

The goal of a super-resolution algorithm is to find H given each observed L_k. The sub-sampling factor S is usually chosen to be 2 or 4, and the estimation of T_k and h(x, y) is discussed in Sections 3.2.1 and 3.2.2 respectively. We shall use the hat symbol to differentiate between estimated and actual images: H represents the actual high-resolution image, and \hat{H} denotes its estimate.

3.2.1 Registration

The process of determining the transformation T_k for each image is known as registration. In the general case, T_k is a projective transformation (planar homography) and its reference coordinates are usually those of one of the images in the sequence. Since all the images are roughly aligned by the detector, the choice of a reference image is arbitrary, and we chose the first of each sequence. As a simplification, we assume that T_k is purely translational since, as the reader may recall from Chapter 2, our license plate detector is custom-designed for three different scales, and the variation in size of detections within a scale is minimal.

To calculate the translation of each image L_k in the sequence relative to the reference image L_1, we divided each image into several patches and used the normalized cross-correlation measure of similarity

NCC(I_1, I_2) = \frac{\sum_x (I_1(x) - \bar{I}_1)(I_2(x) - \bar{I}_2)}{\sqrt{\sum_x (I_1(x) - \bar{I}_1)^2 \sum_x (I_2(x) - \bar{I}_2)^2}}  \qquad (3.2)

to find the best placement of each patch in L_1. In Equation (3.2),

\bar{I}_1 = \frac{1}{N}\sum_x I_1(x) \quad \text{and} \quad \bar{I}_2 = \frac{1}{N}\sum_x I_2(x)

are the means of I_1 and I_2. NCC(I_1, I_2) takes on values in [−1, 1], with 1 representing most similar and −1 representing least similar. Each patch I_1 is compared against windows I_2 of the same size at all possible offsets in the reference image L_1, and the average of the resulting per-patch offsets is treated as the translation from L_k to L_1. This simple process leads to sub-pixel accuracy for each translation.

Since registration is a crucial pre-processing step for the extraction of an accurate high-resolution estimate, we applied an all-pairs cross-correlation procedure on the plates in each tracked sequence to ensure that all images in the sequence are somewhat similar and no erroneous detections are included. Those images with poor correlation to the rest are discarded.
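As an illustration of this registration step, the sketch below estimates a purely translational offset by exhaustive normalized cross-correlation of a grid of patches against the reference image and averages the per-patch offsets; the patch size and search radius are illustrative choices, not values from the thesis.

```python
import numpy as np

def ncc(a, b):
    # Normalized cross-correlation of two equal-sized patches, Equation (3.2).
    a = a - a.mean()
    b = b - b.mean()
    denom = np.sqrt((a ** 2).sum() * (b ** 2).sum())
    return (a * b).sum() / denom if denom > 0 else 0.0

def estimate_translation(ref, img, patch=8, radius=3):
    """Average, over a grid of patches, the integer offset that maximizes NCC.

    Averaging the per-patch offsets gives a sub-pixel translation estimate from
    img to ref, assuming the two plate images differ only by a small shift.
    """
    h, w = img.shape
    offsets = []
    for top in range(radius, h - patch - radius, patch):
        for left in range(radius, w - patch - radius, patch):
            tile = img[top:top + patch, left:left + patch]
            best, best_off = -1.0, (0, 0)
            for dy in range(-radius, radius + 1):
                for dx in range(-radius, radius + 1):
                    cand = ref[top + dy:top + dy + patch, left + dx:left + dx + patch]
                    score = ncc(tile, cand)
                    if score > best:
                        best, best_off = score, (dy, dx)
            offsets.append(best_off)
    return np.mean(offsets, axis=0)  # (dy, dx), typically fractional
```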
3.2.2 Point Spread Function

The blur operation in Equation (3.1) is modeled by a convolution with a point spread function (PSF). The PSF should approximate the blur of both the optics of the camera and its sensor. Zomet and Peleg [24] suggest three methods of estimating it:

• use camera specifications obtained from the manufacturer (if available)

• analyze a picture of a known object

• use the images in the sequence

Capel and Zisserman [7] instead suggest simply using an isotropic Gaussian, which Capel found to work well in practice [6]. For our experiments we chose a Gaussian of size 15 × 15 and standard deviation of 7, which was used to create the blur operation in Figure 3.2.

3.2.3 Algorithm

Our super-resolution algorithm is based on a probabilistic framework. The algorithm estimates the super-resolution image H by maximizing the conditional probability Pr(Ĥ|L) of the super-resolution estimate Ĥ given the set of observed low-resolution images L = {Lk}. We do not know Pr(Ĥ|L) directly, but using the imaging model of Equation (3.1) we can determine Pr(L|Ĥ). Using Bayes' rule,

Pr(\hat{H} \,|\, L) = \frac{Pr(L \,|\, \hat{H}) \, Pr(\hat{H})}{Pr(L)}.

To find the most probable high-resolution image H, we need to maximize

Pr(L \,|\, \hat{H}) \, Pr(\hat{H}).    (3.3)

We can drop the Pr(L) term since it does not depend on Ĥ. A further simplification is sometimes made by assuming that all high-resolution images are equally likely, in which case just Pr(L|Ĥ) is maximized. The high-resolution estimate obtained from this process is the maximum likelihood (ML) estimate. In our case, however, we do have some prior knowledge of the high-resolution images of license plates, which we can use to our advantage. We shall first describe a method of finding the ML estimate and then describe the priors we use in Section 3.2.5.

3.2.4 Maximum Likelihood Estimate

Using our assumption that image noise is Gaussian with zero mean and variance σ², Capel and Zisserman [7] suggest that the total probability of an observed image Lk given an estimate of the super-resolution image Ĥ is

Pr(L_k \,|\, \hat{H}) = \prod_{x,y} \frac{1}{\sigma\sqrt{2\pi}} \, e^{-\frac{(\hat{L}_k(x,y) - L_k(x,y))^2}{2\sigma^2}}.    (3.4)

The log-likelihood function of Equation (3.4) is

\mathcal{L}(L_k) = -\sum_{x,y} \big( \hat{L}_k(x,y) - L_k(x,y) \big)^2.    (3.5)

If we assume independent observations,

Pr(L \,|\, \hat{H}) = \prod_k Pr(L_k \,|\, \hat{H}),    (3.6)

and the corresponding log-likelihood function for all images in the set L becomes

\mathcal{L}(L) = \sum_k \mathcal{L}(L_k) = -\sum_k \| \hat{L}_k - L_k \|^2.    (3.7)

The ML estimate is then obtained by finding the H that maximizes Equation (3.7):

H_{ML} = \arg\max_H \sum_k \mathcal{L}(L_k) = \arg\min_H \sum_k \| \hat{L}_k - L_k \|^2.    (3.8)

If the formation process in Equation (3.1) that maps the high-resolution estimate Ĥ to L̂k is expressed in matrix form as

\hat{L}_k = M_k \hat{H},    (3.9)

we have a system of N linear equations for all N images in the sequence. Stacking these vertically, we have

\begin{bmatrix} L_1 \\ L_2 \\ \vdots \\ L_N \end{bmatrix} = \begin{bmatrix} M_1 \\ M_2 \\ \vdots \\ M_N \end{bmatrix} \hat{H}, \qquad L = M \hat{H}.    (3.10)

Using this notation, the solution of Equation (3.7) can be obtained by

\hat{H} = (M^\top M)^{-1} M^\top L.    (3.11)

In practice, M is very large and its pseudo-inverse is prohibitive to compute, and therefore iterative minimization techniques are used. The iterative methods also facilitate the computation of Ĥ when the high-resolution images are not all equally likely and several priors are included in Equation (3.3). We use simple gradient descent as our minimization method.

3.2.5 Maximum a Posteriori Estimate

In this section we shall describe the priors Pr(Ĥ) we use for obtaining a maximum a posteriori (MAP) estimate. The MAP estimate is obtained by maximizing the full expression in Equation (3.3). The most common prior used in the super-resolution literature is the smoothness prior introduced by Schultz and Stevenson [42].
Capel and Zisserman also use a learnt face-space prior [8]. For super-resolution of text specifically, Donaldson and Myers [12] use a bi-modal prior taking into account the bi-modal appearance of dark text on a light background. The two priors we experimented with were the smoothness and bi-modal priors.

Smoothness Prior

The smoothness prior we used was introduced by Schultz and Stevenson [42] and has the probability density

Pr_s(\hat{H}(x,y)) = c_s \, e^{-\rho(\hat{H}(x,y) - \bar{\hat{H}}(x,y))},    (3.12)

where c_s is a normalizing constant, \bar{\hat{H}}(x,y) is the average of the pixel intensities of the four nearest neighbors of Ĥ(x,y):

\bar{\hat{H}}(x,y) = \frac{\hat{H}(x-1,y) + \hat{H}(x+1,y) + \hat{H}(x,y-1) + \hat{H}(x,y+1)}{4},    (3.13)

and ρ(x) is the Huber cost function:

\rho(x) = \begin{cases} x^2, & |x| \le \alpha \\ 2\alpha|x| - \alpha^2, & |x| > \alpha. \end{cases}    (3.14)

The expression Ĥ(x,y) − \bar{\hat{H}}(x,y) is a measure of the local smoothness around a pixel (x, y); large values indicate discontinuities and small values indicate a smooth region. A plot of the Huber function is shown in Figure 3.3 (a). Its use is justified by Donaldson and Myers [12], who suggest that the linear region of ρ(x) for |x| > α preserves steep edges because of its constant derivative.

Figure 3.3: (a) The Huber penalty function used in the smoothness prior with α = 0.6, with red and blue corresponding to the regions |x| ≤ α and |x| > α, respectively; (b) an un-scaled version of the bi-modal prior with µ0 = 0.1 and µ1 = 0.9.

Bi-Modal Prior

The bi-modal prior used by Donaldson and Myers [12] is an exponential fourth-order polynomial probability density with maxima at the corresponding black and white peaks of the pixel intensity distribution of the high-resolution image:

Pr_b(\hat{H}(x,y)) = c_b \, e^{-(\hat{H}(x,y) - \mu_0)^2 (\hat{H}(x,y) - \mu_1)^2},    (3.15)

where c_b is a normalizing constant and µ0 and µ1 are the centers of the peaks. The function is shown in Figure 3.3 (b) for a choice of µ0 = 0.1 and µ1 = 0.9. Donaldson and Myers estimate µ0 and µ1 for each high-resolution estimate, but instead we used the constants in Figure 3.3 (b). Both prior terms are illustrated in the sketch that follows.
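The two priors can be written compactly as energies over the estimate Ĥ. The sketch below assumes α = 0.6, µ0 = 0.1, and µ1 = 0.9 (the values used in Figure 3.3) and pixel intensities in [0, 1].

```python
import numpy as np

def huber(x, alpha=0.6):
    """Huber cost rho(x): quadratic near zero, linear beyond |x| > alpha (Eq. 3.14)."""
    return np.where(np.abs(x) <= alpha, x ** 2, 2 * alpha * np.abs(x) - alpha ** 2)

def smoothness_energy(H, alpha=0.6):
    """Sum of Huber penalties between each pixel and the mean of its four neighbors."""
    Hp = np.pad(H, 1, mode='edge')
    neighbor_mean = (Hp[:-2, 1:-1] + Hp[2:, 1:-1] + Hp[1:-1, :-2] + Hp[1:-1, 2:]) / 4.0
    return float(huber(H - neighbor_mean, alpha).sum())

def bimodal_energy(H, mu0=0.1, mu1=0.9):
    """Energy favoring intensities near the dark (mu0) and light (mu1) peaks (Eq. 3.15)."""
    return float(((H - mu0) ** 2 * (H - mu1) ** 2).sum())

# Example: evaluate both prior energies on a random high-resolution estimate.
H_est = np.random.rand(60, 180)
print(smoothness_energy(H_est), bimodal_energy(H_est))
```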
Computing the Estimate

Combining the likelihood and the two prior probability distributions and substituting into Equation (3.3), we have

H = \arg\max_H \prod_k Pr(L_k \,|\, \hat{H}) \cdot \prod_{x,y} Pr_s(\hat{H}(x,y)) \cdot Pr_b(\hat{H}(x,y)).    (3.16)

Taking the negative log-likelihood of the right-hand side,

H = \arg\min_H \sum_k \| M_k \hat{H} - L_k \|^2 + \sum_{x,y} \rho\big(\hat{H}(x,y) - \bar{\hat{H}}(x,y)\big) + \sum_{x,y} (\hat{H}(x,y) - \mu_0)^2 (\hat{H}(x,y) - \mu_1)^2.    (3.17)

For convenience, we shall refer to the three terms as E_M(Ĥ), E_S(Ĥ), and E_B(Ĥ). To control the contributions of each term we weigh E_M(Ĥ), E_S(Ĥ), and E_B(Ĥ) by the constants c_M, c_S, and c_B, respectively:

H = \arg\min_H \; c_M E_M(\hat{H}) + c_S E_S(\hat{H}) + c_B E_B(\hat{H}).    (3.18)

We chose to use gradient descent to minimize Equation (3.17); therefore, we need to find the derivative of the entire expression with respect to Ĥ. The derivative of the ML term is straightforward:

\frac{\partial}{\partial \hat{H}} E_M(\hat{H}) = -2 M_k^\top (M_k \hat{H} - L_k),    (3.19)

and the derivative of the bi-modal term is

\frac{\partial}{\partial \hat{H}} E_B(\hat{H}) = 2 (\hat{H}(x,y) - \mu_0)(\hat{H}(x,y) - \mu_1)(2\hat{H}(x,y) - \mu_0 - \mu_1).    (3.20)

The derivative of the smoothness term is trickier to compute since each neighbor of Ĥ(x,y) involves Ĥ(x,y) in its own \bar{\hat{H}} calculation. Therefore, we need to unroll E_S(Ĥ) around Ĥ(x,y) and then find the derivative:

E_S(\hat{H}) = \ldots + \rho\!\left(x_1 - \tfrac{\ldots + x}{4}\right) + \rho\!\left(x_2 - \tfrac{\ldots + x}{4}\right) + \rho\!\left(x - \tfrac{x_1 + x_2 + x_3 + x_4}{4}\right) + \rho\!\left(x_3 - \tfrac{\ldots + x}{4}\right) + \rho\!\left(x_4 - \tfrac{\ldots + x}{4}\right) + \ldots,    (3.21)

where

x = \hat{H}(x,y), \quad x_1 = \hat{H}(x, y-1), \quad x_2 = \hat{H}(x-1, y), \quad x_3 = \hat{H}(x+1, y), \quad x_4 = \hat{H}(x, y+1).

The derivative is then

\frac{\partial}{\partial \hat{H}(x,y)} E_S(\hat{H}) = \tfrac{1}{4}\rho'\!\left(x_1 - \tfrac{\ldots + x}{4}\right) + \tfrac{1}{4}\rho'\!\left(x_2 - \tfrac{\ldots + x}{4}\right) + \rho'\!\left(x - \tfrac{x_1 + x_2 + x_3 + x_4}{4}\right) + \tfrac{1}{4}\rho'\!\left(x_3 - \tfrac{\ldots + x}{4}\right) + \tfrac{1}{4}\rho'\!\left(x_4 - \tfrac{\ldots + x}{4}\right).    (3.22)

Having obtained the derivatives of each term, we iteratively step in the direction opposite the gradient until we reach a local minimum. At each step we add some portion of the gradient of the E_M(Ĥ), E_S(Ĥ), and E_B(Ĥ) terms, controlled by the factors c_M, c_S, and c_B, respectively, and by the step size. Instead of constructing each Mk matrix in Equation (3.19) explicitly, we only apply image operations such as warp, blur, and sample for multiplications with Mk and Mk⊤Mk, similar to Zomet and Peleg's work [52]. Since Mk is the product of these image operations, it can be decomposed into

M_k = S B W_k,    (3.23)

where S is the down-sampling matrix, B is the matrix expressing the blurring with the PSF, and Wk is the transformation matrix representing Tk. Therefore,

M_k^\top = W_k^\top B^\top S^\top,    (3.24)

where Wk⊤ is implemented by applying the reverse of the transformation that Wk applies, B⊤ is implemented by applying the same blur operation as B since we are using an isotropic Gaussian PSF, and S⊤ is implemented by up-sampling without any interpolation.

Figure 3.4: Super-resolution results: (a) sequence of images processed, (b) an up-sampled version of one low-resolution image, (c) the average image, (d) the final high-resolution estimate.

We use the average image of the sequence, resized to four times the resolution using bi-linear interpolation, as the initial high-resolution estimate. The choice of the average image as an initial estimate is justified since it contains little of the noise found in the individual images, as can be seen in Figure 3.4 (c). Since we are performing a cross-check of each image with each other image in the sequence during registration, the first few images (which have the most detail) are pruned. Had we implemented a more general transformation estimation for registration, we would have been able to take advantage of these images, but simple translation estimation with them included negatively affected the average image and thus the initial super-resolution estimate. A sketch of one gradient-descent iteration using these image operators is given below.
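The sketch below illustrates one such gradient-descent step with warp, blur, and down-sample operators standing in for an explicit Mk. The translation-only warp, the scale factor of 4, the use of SciPy's image operations, and the step size are assumptions of the sketch, and only the data-term gradient is shown; the smoothness and bi-modal prior gradients would be added in the same loop.

```python
import numpy as np
from scipy.ndimage import gaussian_filter, shift

def M(H, t, scale=4, sigma=7.0):
    """Apply M_k = S B W_k: translate, blur with an isotropic Gaussian, down-sample."""
    warped = shift(H, t, order=1)                  # W_k: sub-pixel translation
    blurred = gaussian_filter(warped, sigma)       # B: Gaussian PSF (std from Sec. 3.2.2)
    return blurred[::scale, ::scale]               # S: decimation

def M_T(L, t, hr_shape, scale=4, sigma=7.0):
    """Apply M_k^T = W_k^T B^T S^T: zero-fill up-sample, blur, reverse translation."""
    up = np.zeros(hr_shape)
    up[::scale, ::scale] = L                       # S^T: up-sample without interpolation
    blurred = gaussian_filter(up, sigma)           # B^T = B for an isotropic Gaussian
    return shift(blurred, (-t[0], -t[1]), order=1) # W_k^T: reverse translation

def gradient_step(H, lows, shifts, step=0.1):
    """One descent update on the data term sum_k ||M_k H - L_k||^2."""
    grad = np.zeros_like(H)
    for L, t in zip(lows, shifts):
        grad += 2.0 * M_T(M(H, t) - L, t, H.shape)
    return H - step * grad

# Example shapes: four 15 x 45 low-resolution plates, one 60 x 180 estimate.
lows = [np.random.rand(15, 45) for _ in range(4)]
shifts = [(0.0, 0.0), (0.5, -0.3), (1.0, 0.2), (-0.4, 0.7)]
H = gradient_step(np.random.rand(60, 180), lows, shifts)
```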
3.2.6 Discussion

There are numerous parameters in our image formation model and our super-resolution algorithm that require either estimation or initial adjustment. The values of these parameters have a profound effect on the final super-resolution images. Some of these results may look more appealing to us as humans, but the only way to determine whether super-resolution in general is worthwhile is to determine whether it improves the OCR rate. This was the approach taken by Donaldson and Myers [12]; however, they used Scansoft's DevKit 2000, a commercial OCR engine, on printed text, for which most commercial OCR packages are designed. Although we were unable to obtain a copy of their choice of OCR package, the commercial OCR software we experimented with performed very poorly on our super-resolution images, most likely because the OCR engines were not specifically trained on our forms of license plate text, or because our images were not of sufficiently high resolution. Donaldson and Myers found that the biggest factor super-resolution had on improving OCR performance was the clearer separation of characters rather than the reduction of noise. The separation of characters, which is a result of the bi-modal prior, can also be observed in our data, as shown in the super-resolution estimate in Figure 3.4 (d). The image also exhibits a clear bi-modal pixel intensity distribution, and in fact the contrast is good enough not to require binarization algorithms to be applied, a pre-processing step often necessary for OCR packages to work correctly.

3.3 Optical Character Recognition

In this section we describe a very simple algorithm to recognize the characters on detected plates and propose additional methods that may be used in further research.

3.3.1 Previous Work

It was our initial intent to apply a binarization algorithm, such as the modified version of Niblack's algorithm used by Chen and Yuille [10], to the license plate images extracted by our detector, and then use the binarized image as input to a commercial OCR package. We found, however, that even at a resolution of 104 × 31 the OCR packages we experimented with yielded very poor results. Perhaps this should not come as a surprise considering the many custom OCR solutions used in existing LPR systems.

The most common custom OCR approach used by existing LPR systems is correlation-based template matching [35], sometimes done on a group of characters [11]. Sometimes the correlation is done with principal component analysis (PCA) [27]. Others [44] apply connected component analysis on binarized images to segment the characters and minimize a custom distance measure between character candidates and templates. Classification of segmented characters can also be done using neural networks [37] with good results. Instead of explicitly segmenting characters in detected plates, Amit et al. [2] use a coarse-to-fine approach for both detection and recognition of characters on license plates. Although they present high recognition rates, the license plate images they worked with were of high resolution, and it is not clear whether their method would be as effective on the low-resolution images in our datasets. Because of the simplicity of the template matching method, we chose to experiment with it first, and it proved to work reasonably well.

3.3.2 Datasets

We generated training and test data by running our license plate detector on several hours of video and extracting sequences of images for each tracked license plate. This process resulted in a total of 879 plate sequences, each of which was labeled by hand. Of these, 121 were chosen at random to form an alphabet of characters for training. These 121 sequences contained the necessary distribution of characters to form 10 examples per character, for a total of 360 examples (26 letters and 10 digits). This alphabet of training images is shown in Figure 3.5. The remaining 758 plates were used for testing the OCR rate.

Figure 3.5: The alphabet created from the training set. There are 10 examples for each character for the low-resolution, average image, and super-resolution classes, shown in that respective order.

Figure 3.6 shows a histogram of the frequency of all characters in our training and test datasets. Note that the majority of characters are numbers, with ‘4’ being most common since most of today's California plates start with that number. The frequencies of ‘I’, ‘O’, and ‘Q’ were relatively small, most likely due to their potential confusion with other similarly shaped characters.
3.3.3 Template Matching

Unless the text to be read is hand-written, it is common for OCR software to segment the characters and then perform recognition on each segmented image. The simplest methods for segmentation usually involve projecting row and column pixels and placing divisions at local minima of the projection functions. In our data, the resolution is too low to segment characters reliably in this fashion, and we therefore decided to apply simple template matching instead, which can simultaneously find both the location of characters and their identity.

Figure 3.6: Character frequencies across our training and test datasets.

The algorithm can be described as follows. For each example of each character, we search all possible offsets of the template image in the license plate image and record the top N best matches. The searching is done using the NCC metric shown in Equation (3.2), and a threshold on the NCC score is applied before considering a location a possible match. If more than one character matches a region the size of the average character, the character with the higher correlation is chosen and the character with the lower correlation is discarded. Once all templates have been searched, the characters for each region found are read left to right, forming a string. N depends on the resolution of the license plate image and should be chosen such that not all N matches cluster around a single character when the same character occurs more than once on a plate, yet not so large that every possible region is processed.

This method may seem inefficient; however, the recognition process takes on the order of half a second for a resolution of 104 × 31, which we found to be acceptable. This recognition time is much smaller than the several seconds required to estimate a super-resolution image. Our results for this method are shown in Section 3.4, and a sketch of the matching procedure is given below.
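The sketch below is one way to realize this procedure. The NCC threshold, the value of N, and the average character width are illustrative assumptions, and `alphabet` stands in for the training alphabet of Figure 3.5 (a mapping from each character to its template images).

```python
import numpy as np

def ncc(a, b):
    """Normalized cross-correlation of two equally sized patches (Eq. 3.2)."""
    a = a - a.mean()
    b = b - b.mean()
    d = np.sqrt((a ** 2).sum() * (b ** 2).sum())
    return float((a * b).sum() / d) if d > 0 else 0.0

def read_plate(plate, alphabet, top_n=8, thresh=0.55, char_width=9):
    """Template-match every character example against the plate image and read
    the surviving matches left to right."""
    candidates = []                                       # (score, x, character)
    for ch, templates in alphabet.items():
        scores = []
        for tmpl in templates:
            th, tw = tmpl.shape
            for y in range(plate.shape[0] - th + 1):
                for x in range(plate.shape[1] - tw + 1):
                    s = ncc(tmpl, plate[y:y + th, x:x + tw])
                    if s >= thresh:
                        scores.append((s, x, ch))
        candidates.extend(sorted(scores, reverse=True)[:top_n])  # keep top N per template set
    # Within each character-sized region, keep only the best-scoring character.
    candidates.sort(reverse=True)
    kept = []
    for s, x, ch in candidates:
        if all(abs(x - kx) >= char_width for _, kx, _ in kept):
            kept.append((s, x, ch))
    return ''.join(ch for _, x, ch in sorted(kept, key=lambda m: m[1]))
```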
3.3.4 Other Methods

We would like to propose several ideas for future work on license plate OCR. The first is to apply shape context matching [4] to characters segmented after applying connected components and a thinning pre-processing step [44] to the high-resolution estimates. Shape contexts have been shown to be very effective at recognizing hand-written digits, and it is reasonable to presume that the method might work well on license plate characters. The second method that might benefit further research in this area is to apply the AdaBoost framework to recognizing segmented characters. At the time of this writing we are not aware of any OCR algorithms that use boosted classifiers, but the filters we presented in Chapter 2 may also be adapted to individual characters, with the caveat that many more training examples would be required and the AdaBoost classifier we presented would need to be modified for multi-class classification. Mori and Malik [32] use a Hidden Markov Model (HMM) to choose the most likely word when performing text recognition in images with adversarial clutter. A similar method may apply to license plate recognition to learn and recognize common character sequence types, such as a digit, followed by three letters, followed by three digits.

3.4 Results

Our template matching method was not well-suited for recognition on the super-resolution images using the super-resolution templates in our training alphabet. Our low-resolution templates yielded far better results on the test set, most likely due to better correlation resulting from the natural blur that occurs in the low-resolution images, which allows more intra-class variance. Therefore, in this section we present our results on just the low-resolution image sequences.

Figure 3.7 shows our recognition results on the low-resolution images from the test set, taken from the second frame in the image sequence of each plate tracker. We used the edit distance, sometimes also referred to as the Levenshtein distance, to measure how similar our recognized text was to the labeled plates in the test set. Because certain characters are easily confused with others, even by humans, we also applied a ‘loose’ character equality test whenever the edit distance algorithm compared two characters. The groups of characters {‘O’, ‘0’, ‘D’, ‘Q’}, {‘E’, ‘F’}, {‘I’, ‘T’, ‘1’}, {‘B’, ‘8’}, and {‘Z’, ‘2’} were each considered of the same type, and no penalty was applied for incorrect readings within a group. Figure 3.7 shows the number of license plates read with various numbers of mistakes with and without using the ‘loose’ comparison measure.

Figure 3.7: Template matching OCR results on the low-resolution test set for ‘standard’ and ‘loose’ comparisons between recognized characters and actual characters.

A sketch of this evaluation measure follows.
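The confusion groups below are the ones listed above; the function names are ours and the example strings are invented for illustration.

```python
# Loose equivalence classes used when comparing recognized and labeled characters.
LOOSE_GROUPS = [set('O0DQ'), set('EF'), set('IT1'), set('B8'), set('Z2')]

def loose_equal(a, b):
    """True if the characters match exactly or belong to the same confusion group."""
    return a == b or any(a in g and b in g for g in LOOSE_GROUPS)

def edit_distance(read, truth, equal=loose_equal):
    """Levenshtein distance with a pluggable character-equality test."""
    m, n = len(read), len(truth)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            sub = 0 if equal(read[i - 1], truth[j - 1]) else 1
            d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1, d[i - 1][j - 1] + sub)
    return d[m][n]

# Example: one substitution under the standard test, none under the loose test.
print(edit_distance('4ABC123', '4A8C123', equal=lambda a, b: a == b))  # 1
print(edit_distance('4ABC123', '4A8C123'))                             # 0
```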
Figure 3.8 shows the template matching method applied to the actual low-resolution images in the test set. Note that over half of the test set was recognized with two or fewer mistakes. One can observe a large degradation in image quality with each progressive horizontal section. The template matching is most often thwarted by plate boundaries, which are more and more visible as the size of the plate decreases.

Figure 3.8: Recognition results for the images in our test set. Each horizontal section lists plates whose read text contained 0, 1, 2, 3, 4, 5, 6, and 7 mistakes.

Our goal for this thesis was to have an unconstrained LPR system, and these OCR rates are quite satisfactory for our purposes. An alternative to super-resolution would be to perform OCR on each image in the sequence and obtain the most likely text in that fashion; however, this experiment was not performed.

Chapter IV

Make and Model Recognition

As with our license plate recognition problem, detecting the car is the first step to performing make and model recognition (MMR). To this end, one can apply a motion segmentation method such as [50] to estimate a region of interest (ROI) containing the car. Instead, we decided to use the location of detected license plates as an indication of the presence and location of a car in the video stream and to crop an ROI of the car for recognition. This method would also be useful for make and model recognition in static images, where the segmentation problem is more difficult. In this chapter, we describe several feature-based and appearance-based methods commonly used in object recognition and evaluate their recognition rates on car images extracted from our video stream.

4.1 Previous Work

To the best of our knowledge, MMR is a fairly unexplored recognition problem. Various work has been done on car detection in street scene images [29] [43] [39] and aerial photographs [41]. Dorko and Schmid [13] use scale invariant features to detect cars in images with 50% background on average. Agarwal et al. [1] automatically create a vocabulary of car parts, such as tires and windshields, from training images and detect cars by finding individual parts and comparing their spatial relations. Interestingly, most of the car detection literature deals only with side views of cars, perhaps because from a large distance the side profile provides richer and thus more discriminating features.

The work of Ferencz et al. [14] is most closely related to our problem statement. Their work is helping develop a wide-area car tracking system and is not formulated as a recognition problem, but rather as what they call an object identification problem. In our system we are interested in determining to which make and model class a new vehicle belongs, and although all classes consist of cars, there is a fair amount of variation within each of the make and model classes. In contrast, Ferencz et al. are interested in determining whether two images taken at different times and camera orientations are of the exact same car, where there is really only a single example that serves as a model. They solve this problem by automatically finding good features on side views of cars from several hundred pairs of training examples, where good features refer to features that are good at discriminating between cars from many small classes.

4.2 Datasets

We automatically generated a database of car images by running our license plate detector and tracker on several hours of video data and cropping a fixed window of size 400 × 220 pixels around the license plate of the middle frame of each tracked sequence. This method yielded 1,140 images in which cars of each make and model were of roughly the same size since the license plate detector was specialized to respond to a narrow range of license plate sizes. The majority of these images are shown in Figure 4.1. The crop window was positioned such that the license plate was centered in the bottom third of the image. We chose this position as a reference point to ensure matching was done with only car features and not background features. Had we centered the license plate both vertically and horizontally, cars that have their plates mounted on their bumper would have exposed the road in the image. Although this method worked well in most cases, for some cars the position of the license plate was off-center horizontally, which allowed non-car regions to be included in the ROI.

After collecting these images, we manually assigned make, model, and year labels to 790 of the 1,140 images. We were unable to label the remaining 350 images due to our limited familiarity with those cars. We often made use of the California Department of Motor Vehicles' web site [23] to determine the makes and models of cars with which we were not familiar. The web site allows users to enter a license plate or vehicle identification number for the purposes of checking whether or not a car has passed recent smog checks. For each query, the web site returns smog history as well as the car's make and model description if available. The State of California requires all vehicles older than three years to pass a smog check every two years. Therefore, we were unable to query cars that were three years old or newer and relied on our personal experience to label them. We split the 1,140 images into a query set and a database set.
The query set contains 38 images chosen to represent a variety of make and model classes, in some cases with multiple queries of the same make and model but of a different year in order to capture the variation of model designs over time. We evaluated the performance of each of the recognition methods by finding the best match in the database for each of the query images.

Figure 4.1: Our automatically generated car database. Each image is aligned such that the license plate is centered a third of the distance from bottom to top. Of these images, 1,102 were used as examples, and 38 were used as queries to test the recognition rates of various methods. We used the AndreaMosaic photo-mosaic software to construct this composite image.

4.3 Appearance-based Methods

Appearance-based object recognition methods work by treating entire images as feature vectors and comparing these vectors with a training vector space. An M × N image would be transformed into a single MN-dimensional feature vector consisting of just the pixel intensities of the image. In practice, M and N are too large to search the training vector space efficiently for a best match, and some sort of dimensionality reduction is done first. Common dimensionality reduction techniques are principal component analysis (PCA) [33] [34] and the Fisher transform [3]. Because appearance-based methods work directly with feature vectors consisting entirely of pixel brightness values (which directly correspond to the radiance of light emitted from the object), they are not good at handling illumination variability in the form of intensity, direction, and number of light sources, nor variations in scale. The Fisherface method [3] and Illumination Cones [18] address the illumination variability problem but are not invariant to scale. In this section, we describe the Eigenface recognition method, which has frequently been used in face recognition, and evaluate its performance on MMR.

4.3.1 Eigencars

In principal component analysis, a set of feature vectors from a high-dimensional space is projected onto a lower-dimensional space chosen to capture the variation of the feature vectors. More formally, given a set of N images {x_1, x_2, \ldots, x_N}, each expressed as an n-dimensional feature vector, we seek a linear transformation W \in \mathbb{R}^{n \times m} that maps each x_k into an m-dimensional space, where m < n, such that

W^\top \Sigma W    (4.1)

is maximized. Here,

\Sigma = \frac{1}{N-1} \sum_{i=1}^{N} (x_i - \mu)(x_i - \mu)^\top,

and

\mu = \frac{1}{N} \sum_{i=1}^{N} x_i

is the average image. The covariance matrix Σ is also referred to as the total scatter matrix [3] since it measures the variability of all classes in the n-dimensional feature vectors.

Finding the W that maximizes Equation (4.1) is an eigenvalue problem. Since n is usually very large (in our case 88,000) and much larger than N (1,102), computing the eigenvectors of Σ directly is computationally and storage-prohibitive. Instead, consider the matrix

A = [x_1 - \mu, \; x_2 - \mu, \; \ldots, \; x_N - \mu].    (4.2)

Then Σ = AA⊤. Using singular value decomposition (SVD), A can be decomposed as A = UDV⊤, where U and V are orthonormal and of size n × N and N × N respectively, and D is an N × N diagonal matrix. Using this decomposition, Σ becomes

\Sigma = AA^\top = UDV^\top (UDV^\top)^\top = UDV^\top V D^\top U^\top = UD^2U^\top,    (4.3)

where D² consists of {λ_1, λ_2, \ldots, λ_N}, the first N eigenvalues of Σ, with the corresponding eigenvectors, and therefore the columns of W, in the columns of U. A sketch of this computation and of the associated query projection is given below.
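A minimal sketch of the eigencar computation and of the nearest-neighbor query built on it follows. The number of discarded leading eigenvectors, the sub-scaled image size, and the random stand-in data are assumptions of the sketch.

```python
import numpy as np

def train_eigencars(images, drop=3):
    """Off-line stage: PCA via SVD on mean-subtracted images.
    `images` is an (N, n) array of flattened car images; `drop` discards the
    leading eigenvectors, which mostly capture illumination (Section 4.3.1)."""
    mu = images.mean(axis=0)
    A = (images - mu).T                            # the n x N matrix of Eq. (4.2)
    U, d, Vt = np.linalg.svd(A, full_matrices=False)
    U, d = U[:, drop:], d[drop:]
    F = (images - mu) @ U * d                      # database features, scaled by D
    return mu, U, d, F

def match_query(q, mu, U, d, F):
    """On-line stage: project a flattened query image into the eigenspace and
    return the index of the database image with the smallest L2 distance."""
    f = (q - mu) @ U * d
    return int(np.argmin(np.linalg.norm(F - f, axis=1)))

# Example with random stand-ins for the 200 x 110 sub-scaled car images.
db = np.random.rand(50, 200 * 110)
mu, U, d, F = train_eigencars(db)
print(match_query(db[7], mu, U, d, F))             # reports index 7
```

Working with the thin SVD of A keeps the decomposition proportional to N rather than to the image dimension n, which is what makes the off-line stage tractable.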
Because these eigenvectors are of the same dimensions as the set of xi images, they can be visualized and in the face recognition literature are referred to as ‘eigenfaces’ [48]. We chose to more aptly call them ‘eigencars’ since our domain of input images consists of cars. The first ten eigenvectors corresponding to the ten largest eigenvalues are shown in Figure 4.2 (b), and µ is shown in Figure 4.2 (a). The eigencars recognition algorithm can then be described as follows: Off-line 1. Construct the A matrix from a set of N images {x1 , x1 , ..., xN } 2. Compute the SVD of A to obtain the eigenspace U and the diagonal matrix D containing the eigenvalues in decreasing order 3. Project each of the N column vectors of A onto the eigenspace U to obtain a low-dimensional N × N feature matrix F = A> U , and scale each row of F by the diagonal of D 52 (a) (b) Figure 4.2: (a) The average image, and (b) the first 10 eigencars. On-line 1. Subtract the average image µ from the query image q, q0 = q − µ 2. Project q0 onto the eigenspace U to obtain an N -dimensional feature vector f and scale f by the diagonal of D 3. Find the row k of F that has the smallest L2 distance to f and consider xk to be the best match to q Results We applied the algorithm to our database and query sets and obtained a recognition rate of only 23.7%. This is a very low recognition rate, however, the recognition rate using random guessing is 2.5%. Figures 4.3 and 4.4 show the query images and the top ten matches in the database for each query using the on-line recognition method. Note the stark similarity in overall illumination of all matches for each query, even though 53 the matches contain a large variation of makes and models. This suggests the algorithm is not recognizing car features, but rather illumination similarity. Belhumeur et al. suggest that the three eigencars corresponding to the three largest eigenvalues capture most of the variation due to lighting and that it is best to ignore them. Indeed, discarding these eigenvectors increased the recognition rate to 44.7%. The results of this modified approach are shown in Figures 4.5 and 4.6. Note that the matches no longer exhibit the strong similarity in illumination as before. We also tried removing the top 7 largest eigenvectors, which led to a recognition rate of 47.4%. Removing any more eigenvectors, however, had a negative effect. Discussion The most computationally intensive part of the eigencars algorithm is the computation of F = A> U . With A consisting of the full resolution images, the process takes about four hours, and requires roughly 1,500MB of RAM. We also performed the recognition experiment on sub-scaled versions of the images with 200 × 110 resolution and found that this greatly reduced the off-line training time and significantly reduced the memory requirements without adversely affecting the recognition rate. The on-line part of the algorithm is reasonably fast. It only takes one or two seconds to project q0 onto the eigenspace U . We shall see that this is a strong advantage of the appearance-based method when we evaluate the performance of feature-based methods in Section 4.4. The Fisherface [3] method is a more recent appearance-based recognition method that has similar computational requirements as the Eigenface method and has been shown to yield superior recognition rates in the face recognition domain because it selects a linear transformation that maximizes the ratio of the betweenclass scatter to the within-class scatter. 
It therefore requires us to place our set of x_k training images into separate classes. Due to time constraints, we did not evaluate this method.

4.4 Feature-based Methods

In contrast to appearance-based recognition methods, feature-based recognition methods first find a number of interesting features in an image and then use a descriptor representative of the image area around each feature location to compare with features extracted from training images of objects. The features should belong to the objects to be recognized and should be sparse, informative, and reproducible, the latter two properties being most important for object recognition. If the features themselves are not sufficiently informative, descriptors are used for matching, where the descriptors are usually constructed from the image structure around the features.

4.4.1 Feature Extraction

Here, we discuss several feature extraction methods commonly used in object recognition.

Corner Detectors

In the computer vision community, interest point detection is often called corner detection even though not all features need be corners. Corner detection is often used for solving correspondence problems, such as in stereopsis. Corner features occur in an image where there is a sharp change in the angle of the gradient. In practice, these points of sharp change in the angle of the gradient do not always correspond to real corners in the scene, for example in the case of occlusion junctions. Two popular corner detectors are the Harris [19] and Förstner [15] detectors. The output of a Harris corner detector on a car image from our dataset is shown in Figure 4.7.

Figure 4.7: Harris corner detections on a car image. Yellow markers indicate occlusion junctions, formed by the intersection of edges on surfaces of different depths.

Corner features by themselves are not sufficiently informative for object recognition, but Agarwal et al. [1] combine them with patches of the image used as a descriptor.

Salient Features

Kadir and Brady [25] have developed a low-level feature extraction method inspired by studies of the human visual system. Their feature detector extracts features at various scales that contain high entropy. For each pixel location x, the scale s is chosen at which the entropy is maximal, where by scale we mean the patch size around x used to obtain a probability distribution P over the pixel intensities used in the entropy calculation:

H(s, x) = -\sum_{i=0}^{255} P_{s,x}(i) \log P_{s,x}(i).    (4.4)

Equation (4.4) assumes pixel intensities take on values between 0 and 255. Unlike the corner detector, Kadir and Brady features carry a scale descriptor in addition to their position in the image. We created an efficient implementation of the detector using our integral image optimization technique from Section 2.5.1 for the calculation of P around x for the various scales. Our results on our car image are shown in Figure 4.8, and a sketch of the entropy computation is given below.

Figure 4.8: Kadir and Brady salient feature extraction results on (a) a car image from our database, and (b) an image of a leopard.
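The entropy-over-scales part of the detector can be sketched as follows. The candidate scales and the 8-bit intensity assumption are illustrative, and the sketch omits the inter-scale weighting used by the full Kadir and Brady detector as well as our integral-image acceleration.

```python
import numpy as np

def patch_entropy(img, x, y, s):
    """Entropy H(s, x) of the intensity histogram in an s x s patch centered at (x, y)."""
    half = s // 2
    patch = img[max(0, y - half):y + half + 1, max(0, x - half):x + half + 1]
    counts = np.bincount(patch.ravel(), minlength=256)
    p = counts / counts.sum()
    p = p[p > 0]
    return float(-(p * np.log(p)).sum())

def best_scale(img, x, y, scales=(7, 11, 15, 19, 23)):
    """Pick the patch size around (x, y) at which the entropy is maximal (Eq. 4.4)."""
    entropies = [patch_entropy(img, x, y, s) for s in scales]
    i = int(np.argmax(entropies))
    return scales[i], entropies[i]

# Example on a random 8-bit image.
img = np.random.randint(0, 256, (120, 360), dtype=np.uint8)
print(best_scale(img, 180, 60))
```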
We found that Kadir and Brady features had low repeatability when applied to our car images and were therefore not explored further. They seem to be more suitable for some images than others, as can be seen in Figure 4.8 (b).

SIFT Features

The corner detector we described earlier is sensitive to changes in image size and, therefore, does not provide useful features for matching images of different sizes. Scale invariant feature transform (SIFT) features, recently developed by Lowe [30], overcome this problem and are also invariant to rotation and even partially invariant to illumination differences. The process of extracting SIFT features consists of four steps: scale-space extremum detection, keypoint localization, orientation assignment, and descriptor assignment. The scale space L(x, y, σ) of an image I(x, y) is defined as a convolution with a variable-scale Gaussian kernel:

L(x, y, \sigma) = G(x, y, \sigma) * I(x, y), \qquad \text{where} \qquad G(x, y, \sigma) = \frac{1}{2\pi\sigma^2} e^{-(x^2 + y^2)/2\sigma^2}.

The scale parameter σ is quantized, and keypoints are then localized by finding extrema in

D(x, y, \sigma) = L(x, y, k\sigma) - L(x, y, \sigma),

where kσ is the next highest scale. The locations of the extrema are called keypoints. Orientation assignment of each keypoint is then done by computing the gradient magnitude m(x, y) and orientation θ(x, y) of the scale space at the scale of that keypoint:

m(x, y) = \sqrt{(L(x+1, y) - L(x-1, y))^2 + (L(x, y+1) - L(x, y-1))^2}

\theta(x, y) = \tan^{-1} \frac{L(x, y+1) - L(x, y-1)}{L(x+1, y) - L(x-1, y)}

Figure 4.9 shows 352 keypoints and their orientations extracted from our example car image from our database. Finally, the descriptor is assigned by dividing the region around the keypoint into 16 symmetric sub-regions and assigning 8 orientation bins to each sub-region. The final result is a 16 × 8 = 128-dimensional feature vector. When comparing two SIFT descriptors, the L2 distance measure is used.

Figure 4.9: SIFT keypoints and their orientations for a car image.

4.4.2 Shape Contexts

A shape context is an image descriptor introduced by Belongie et al. [4] and has been shown to be very good for matching shapes. Some successful applications include hand-written digit recognition [4] and breaking “Completely Automated Public Turing Tests to Tell Computers and Humans Apart” (CAPTCHA) [32] protection mechanisms used by internet companies such as Yahoo to deter automated signups for thousands of email accounts. Although the shape context descriptor is best suited for binary images, we felt it would be interesting to test it in the context of grayscale car images.

The shape context descriptor is computed as follows; a sketch is given at the end of this subsection. Given an interest point x, we consider a circle of radius r centered on x and divide it into sections according to a log-polar grid as shown in Figure 4.10 (c). We then count the number of edge pixels within radius r that fall in each bin. The resulting histogram is known as the shape context of x. Figure 4.10 shows the shape contexts for a pair of matching points and for a point on the shape far away from each matching point. Note the similarity in the descriptor for the corresponding points and how vastly different it is for point A.

Figure 4.10: (a) Query car image with two interest points shown, (b) database car image with one corresponding interest point shown, (c) diagram of log-polar bins used for computing shape context histograms, (d,e,f) shape context histograms for the points marked ‘B’, ‘C’, and ‘A’, respectively. The x-axis represents θ and the y-axis represents log r, increasing from top to bottom.
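A sketch of the descriptor computation and of the χ² comparison used in the next section is given below. The 5 radial and 12 angular bins follow the original work and the 35-pixel radius matches the value reported in Section 4.4.3; the inner-radius choice is an assumption.

```python
import numpy as np

def shape_context(edge_points, center, r_max=35.0, n_r=5, n_theta=12):
    """Log-polar histogram of edge points around `center` (the shape context).
    `edge_points` is an (M, 2) array of (x, y) edge coordinates."""
    d = edge_points - np.asarray(center, dtype=float)
    dist = np.hypot(d[:, 0], d[:, 1])
    keep = (dist > 0) & (dist <= r_max)
    d, dist = d[keep], dist[keep]
    theta = np.arctan2(d[:, 1], d[:, 0]) % (2 * np.pi)
    # Logarithmic radial bin edges from a small inner radius out to r_max.
    r_edges = np.logspace(np.log10(r_max / 16.0), np.log10(r_max), n_r + 1)
    r_bin = np.clip(np.searchsorted(r_edges, dist, side='right') - 1, 0, n_r - 1)
    t_bin = (theta / (2 * np.pi) * n_theta).astype(int) % n_theta
    hist = np.zeros((n_r, n_theta))
    np.add.at(hist, (r_bin, t_bin), 1)
    return hist

def chi2_distance(h1, h2, eps=1e-9):
    """Chi-squared distance between two shape context histograms (Eq. 4.5)."""
    return float((((h1 - h2) ** 2) / (h1 + h2 + eps)).sum())

# Example: compare the contexts of two nearby points on a random edge set.
edges = np.random.randint(0, 200, (400, 2))
print(chi2_distance(shape_context(edges, (100, 100)), shape_context(edges, (102, 101))))
```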
The shape context descriptors are usually compared using the χ² distance

d(h_i, h_j) = \sum_{\text{bins } k} \frac{(h_i(k) - h_j(k))^2}{h_i(k) + h_j(k)},    (4.5)

where h_i and h_j are the two descriptors. Sometimes the L2 distance is used instead, though we found that using it had little effect on the overall recognition results. The original shape context work [4] used a histogram with 5 logarithmic divisions of the radius and 12 linear divisions of the angle. In our recognition experiments we also tried a histogram of size 9 × 4 in addition to the original 5 × 12. In [31], Mori et al. augment the shape context histogram to include edge orientations, which we have not experimented with.

4.4.3 Shape Context Matching

Shape context matching was the first feature-based method we tried. The algorithm we implemented works as follows:

1. For each image d in the database and a query image q, take a random sampling of N points from the edge images (as shown in Figure 4.11) of q and d and compute the shape context around each point.

2. For each database image d:

(a) For each sampled edge point p_q in q, find the best matching sampled point p_d in d within some radius threshold that has the shape context with the smallest χ² distance according to Equation (4.5).

(b) Sum the χ² distances over every correspondence and treat the sum as a cost c.

3. Choose the d that has the lowest cost c and consider that the best match.

Figure 4.11: (a) Image edges and (b) a random sampling of 400 points from the edges in (a).

In the original work on shape contexts, step 2 was performed for several iterations, using the correspondences to compute a thin plate spline transformation that transforms q. Since we are matching 3-D rigid bodies under an approximately affine camera, we instead computed an estimate of the affine transformation using RANSAC that would best align q with d, but found that the affine estimate was not sufficiently stable and that the natural alignment obtained by using the license plate as a reference point was sufficient. Our recognition rates on our query set were 65.8% using a 5 × 12-size shape context and 63.2% using a 9 × 4-size shape context. The radius of the descriptor we used was 35 pixels and the sampling size N of points was 400.

4.4.4 SIFT Matching

We also explored matching query images using the SIFT feature extractor and descriptor discussed earlier. The algorithm we used was the following (a sketch of the descriptor matching step follows the list):

1. For each image d in the database and a query image q, perform keypoint localization and descriptor assignment as described in Section 4.4.1.

2. For each database image d:

(a) For each keypoint k_q in q, find the keypoint k_d in d that has the smallest L2 distance to k_q and is at least a factor of α smaller than the distance to the next closest descriptor. If no such k_d exists, examine the next k_q.

(b) Count the number of descriptors n that successfully matched in d.

3. Choose the d that has the largest n and consider that the best match.
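Step 2 of this algorithm amounts to a ratio test over descriptor distances. A sketch follows; the value of α and the random stand-in descriptors are assumptions.

```python
import numpy as np

def count_matches(query_desc, db_desc, alpha=0.6):
    """Count query descriptors whose nearest database descriptor is closer than
    alpha times the distance to the second-nearest one (the ratio test of step 2).
    Both arguments are (num_keypoints, 128) arrays of SIFT descriptors."""
    matches = 0
    for q in query_desc:
        dists = np.linalg.norm(db_desc - q, axis=1)
        if len(dists) < 2:
            continue
        nearest, second = np.partition(dists, 1)[:2]
        if nearest < alpha * second:
            matches += 1
    return matches

def best_database_image(query_desc, database):
    """Return the index of the database image with the most ratio-test matches (step 3)."""
    counts = [count_matches(query_desc, db_desc) for db_desc in database]
    return int(np.argmax(counts)), counts

# Example with random stand-ins for SIFT descriptors.
rng = np.random.default_rng(0)
query = rng.random((50, 128))
database = [rng.random((300, 128)) for _ in range(5)]
database.append(np.vstack([query, rng.random((250, 128))]))   # contains the query's features
idx, counts = best_database_image(query, database)
print(idx, counts)   # the last entry should win
```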
Discussion

We found that a few types of keypoint matches resulting from the above algorithm did not contribute to the selection of the best car match. For example, some matching keypoints corresponded to entire groups of digits and letters on the license plates of a query image and a database image even though the cars to which they belonged looked quite different. Since the best car match in the database is determined by the number of matched keypoints, spurious matches should be ignored. We therefore applied the following keypoint pruning procedures:

• Limit the horizontal distance between matching keypoints. This helps remove outliers when estimating an affine transformation between the query and database images.

• Ignore keypoints that occur in the license plate region.

• Do not allow multiple query keypoints to match to the same database keypoint.

• Compute an affine transformation from the query to the database image when there are more than three matching keypoints. If the scale, shear, or translation parameters of the transformation are outside a threshold, set the number of matching keypoints n to 0.

We used Lowe's implementation [30] of the keypoint localization part of the algorithm. Unlike in Lowe's implementation, the query's keypoint descriptors were compared with the keypoint descriptors of each image in the database. This means that the second best descriptor was never chosen from an object other than the current database image. Also, modifying the threshold from the 0.36 appearing in the published code to 0.60 (which is closer to the value suggested in Lowe's paper) increased the number of matches but had little effect on the overall recognition rate – cars misclassified with one threshold were correctly classified with the other, at the expense of different misclassifications.

When the number of matching descriptors between the query image and a database image is equal to that of another database image, we break the tie by selecting the database image with the smaller overall L2 distance between all the descriptors. This only occurred when the best matches in the database had one or two matching descriptors, and applying the tie-break procedure had little effect on the overall recognition rate.

Results

The SIFT matching algorithm described above yielded a recognition rate of 89.5% on the query set. The recognition results are shown in Figures 4.12 – 4.15 for each query image. Note that the top 10 matches were all of the same make and model for some of the queries with over 20 similar cars in the database.

Figure 4.12: Query images 1-10 and the top 10 matches in the database using SIFT matching. Yellow lines indicate correspondences between matched keypoints of the query (top) and database (bottom) images.

Figure 4.13: Query images 11-20 and the top 10 matches in the database using SIFT matching. Yellow lines indicate correspondences between matched keypoints of the query (top) and database (bottom) images.

Figure 4.14: Query images 21-29 and the top 10 matches in the database using SIFT matching. Yellow lines indicate correspondences between matched keypoints of the query (top) and database (bottom) images.

Figure 4.15: Query images 30-38 and the top 10 matches in the database using SIFT matching. Yellow lines indicate correspondences between matched keypoints of the query (top) and database (bottom) images.

Table 4.1: Summary of overall recognition rates for each method.

Method                                      Recognition rate
Eigencars using all eigenvectors            23.7%
Eigencars without 3 highest                 44.7%
Shape context matching with 9 × 4 bins      63.2%
Shape context matching with 5 × 12 bins     65.8%
SIFT matching                               89.5%

4.4.5 Optimizations

Finding the best match for a query image in our database of 1,102 images for both shape context and SIFT matching takes about 30 seconds, compared to 0.5 seconds with the Eigencars method.
The high recognition rate achieved with SIFT matching is certainly appealing, but for our system to be real-time, MMR must be as fast as the LPR algorithms. Several possibilities exist that may help in that regard. Instead of comparing features in the query image with every single database image, it would be useful to cluster the database images into groups of similar type, such as sedan, SUV, etc. and perform a hierarchical search to reduce the number of comparisons. A promising method that is applicable to our situation is the recent work by Sivic and Zisserman [46]. They formulate the object recognition problem as a text retrieval problem, which itself has been shown to be remarkably efficient based on our daily experiences with internet search engines. Future work on MMR should investigate the possibility of incorporating a similar approach. 4.5 Summary of Results Table 4.1 summarizes the overall recognition rates of the appearance- based and feature-based methods we evaluated. Table 4.2 lists the the queries used in our test set and shows which methods were able to classify each query correctly. Note that most of the queries SIFT 72 matching was not able to classify correctly had 5 or fewer entries similar to it in the database. It is safe to assume that having more examples per make and model class will increase the recognition rate. 73 Table 4.2: Test set of queries used with ‘Size’ indicating the number of cars similar to the query in the database and which method classified each query correctly. Make and Model VW Beetle Honda Accord-1 Honda Accord-2 Honda Accord-3 Honda Civic-1 Honda Civic-2 Honda Civic-3 Honda Civic-4 Toyota Camry-1 Toyota Camry-2 Toyota Camry-3 Toyota Camry-4 Toyota Corolla-1 (dent) Toyota Corolla-1 Toyota Corolla-2 Toyota Corolla-4 VW Jetta-1 VW Jetta-2 Ford Explorer-1 Ford Explorer-2 Van Van Van Van (occluded) Nissan Altima-1 Nissan Altima-2 Nissan Altima-3 Nissan Altima-4 Nissan Sentra-5 Toyota 4Runner Ford Focus-1 Ford Focus-2 Ford Mustang Honda CR-V BMW 323 VW Passat Toyota Tundra Toyota RAV4 Toyota Sienna-1 Toyota Sienna-2 Size 5 18 20 17 19 16 12 11 20 16 11 4 14 26 8 15 6 4 6 18 5 4 3 3 4 9 6 6 9 7 10 9 7 3 3 6 Eigencars Eigencars Full Minus 3 √ √ √ √ √ √ √ √ √ √ √ √ √ √ √ √ √ √ SC SC SIFT 5 × 12 9 × 4 √ √ √ √ √ √ √ √ √ √ √ √ √ √ √ √ √ √ √ √ √ √ √ √ √ √ √ √ √ √ √ √ √ √ √ √ √ √ √ √ √ √ √ √ √ √ √ √ √ √ √ √ √ √ √ √ √ √ √ √ √ √ √ √ √ √ √ √ √ √ √ √ √ √ √ √ √ √ √ √ √ √ √ √ √ √ √ √ √ √ √ √ Chapter V Conclusions and Future Work 5.1 Conclusions We have presented a useful framework for car recognition that combines LPR and MMR. Our recognition rates for both sub-problems are very promising and can serve as an important foundation to a query-based car surveillance system. Our LPR solution is real-time and works well with inexpensive camera hardware and does not require infrared lighting or sensors as are normally used in commercial LPR systems. Our MMR solution is also very accurate, however, further research is required to make it real-time. We have suggested ideas on how this may be achieved in Section 4.4.5. 5.1.1 Difficulties At the start of our project, we anticipated several difficulties that we would possibly encounter. Some of these include: 1. The weather can sometimes make the background undesirably dynamic, such as swaying branches and even wind-induced camera shake. 2. Variability exists in the license plate designs of different states, and even in the character spacing such as found in vanity plates. 74 75 3. 
Depending on the Sun’s position, a vehicle’s shadow may be mistaken as being part of the vehicle. 4. Various types of vehicle body damage or even dirt might impact LPR and MMR. 5. Recognition algorithms might only work during broad daylight or only with very good lighting. 6. The surface of most cars is specular, a material property known to cause problems for appearance-based recognition algorithms. In this section, we discuss the observed performance of our system in each of the above situations. 1. The effects of wind were heavily pronounced in the ‘Regents’ camera since the camera’s optics were extended to their full zoom range and even light winds caused camera shake. Even though image stabilization techniques could alleviate this effect, camera shake does not influence our license plate detection algorithms because the entire frame is searched for license plates, and the license plate tracker is sufficiently robust to handle the camera movement we observed. 2. Our datasets did not include an adequate sampling of out-of-state plates and vanity plates to determine how well our system would handle these instances. However, the few such plates we observed seemed to be detected and recognized no differently. 3. Vehicle shadows did not affect our car recognition algorithms. Because of our choice of license plate location as a reference point when segmenting the car image, the segmented image contained only pixels belonging to a car and no background except in very rare cases with SUVs whose plates are mounted off-center. Even in those cases, our MMR algorithm performed well as seen in Figure 4.14. 76 4. Figure 4.13 shows an example of a query of a car with a dent and a van with partial occlusion. For both cases SIFT matching matched the correct vehicle in the database, while the appearance-based methods failed. License plate detection did in fact perform quite poorly on very old or dirty plates, however, those instances were rare, and even we as humans were unable to read those plates. 5. It might be worthwhile to investigate possible night-time make and model recognition methods where some crude intuition might be formed about the vehicle examined based on taillight designs. We have not experimented with night-time video data, but external lighting would certainly be required in those cases for our system to operate. 6. The specular material of car bodies had little observed effect on our MMR rates. In most cases, reflections of tree branches on cars’ windows resulted in features that simply did not match features in the database. In a few instances, as seen in Figure 4.13, several features caused by a tree branch reflection resulted in a match, but were simply not enough to impact the overall recognition rate, and in general, with more examples per make and model this would hardly be a problem. 5.2 Future Work Although our work is a good start to a query-based car surveillance sys- tem, further research is necessary to make such a system possible. In this section, we discuss several necessary features that need to be researched and developed. 5.2.1 Color Inference In addition to searching the surveillance database for cars using some make and model description and a partial license plate, it would also be useful to be able to search for a particular color car as the make and model information may 77 be incomplete. Various color- and texture-based image segmentation techniques used in content-based image retrieval such as [9] may be suitable for our purpose. 
Since we already segment cars statically using the license plate tracker, we could simply compute a color histogram for the entire region and store this as a color feature vector in addition to the existing SIFT feature vectors for each image in the database. To assign a meaningful label to the color histogram, such as ‘red’, ‘white’, ‘blue’, etc., we can find the minimum distance, as described in [9], to a list of pre-computed and hand-labeled color histograms for each interesting color type. 5.2.2 Database Query Algorithm Development Due to the heavy computation necessary to perform both LPR and MMR, a production surveillance system would require constant updates to a surveillance database as cars are detected in the live video stream and not simply as an overnight batch process. An algorithm for querying such a database might work as follows. Given any combination of partial license plate, make and model, or color description of a car: 1. If partial license plate description is provided, perform a search for the license plate substring and sort results using the edit distance measure described in Chapter 3. 2. If make and model description is provided, search the top results from Step 1 for desired make and model. Otherwise search entire database for the given make and model description. 3. If color is provided, return results from Step 2 with a similar color, as described in Section 5.2.1. Queries in the database should return the times each matching vehicle was observed and allow the user to replay the video stream for those times. 78 5.2.3 Make and Model 3-D Structure In our MMR work, we have not explored car pose variation beyond what normally occurs at the stop signs in our scenes. A robust MMR system should also work well in scenes where there is a large variation of poses. This could require the estimation of a car’s 3-D structure to be used as additional input to the MMR algorithms. Bibliography [1] S. Agarwal, A. Awan, and D. Roth. Learning to detect objects in images via a sparse, part-based representation. PAMI, 26(11):1475–1490, 2004. [2] Y. Amit, D. Geman, X. Fan. A coarse-to-fine strategy for multiclass shape detection. IEEE Trans. Pattern Analysis and Machine Intelligence, 26, 1606– 1621. 2004. [3] P. Belhumeur, J. Hespanha, D. Kriegman, Eigenfaces vs. Fisherfaces: recognition using class specific linear projection. PAMI, pp 711–720. 1997. [4] S. Belongie, J. Malik, J. Puzicha. Matching shapes. Proc. ICCV. pp. 454-461, 2001. [5] G. Cao, J. Chen, J. Jiang, An adaptive approach to vehicle license plate localization. Industrial Electronics Society, 2003. IECON ’03. Volume 2, pp 1786- 1791 [6] D. Capel. Image Mosaicing and Super-resolution. PhD thesis, University of Oxford, 2001. [7] D. Capel, A. Zisserman. Super-resolution enhancement of text image sequences. International Conference on Pattern Recognition, pages 600–605, Barcelona, 2000. [8] D. Capel, A. Zisserman. Super-resolution from multiple views using learnt image models. In Proc. CVPR, 2001. [9] C. Carson, S. Belongie, H. Greenspan, J. Malik. Blobworld: color- and texturebased image segmentation using EM and its Application to image querying and classification. PAMI, 24(8):1026–1038, 2002. [10] X. Chen, A. Yuille. Detecting and reading text in natural scenes. CVPR. Volume: 2, pp. 366–373, 2004. [11] P. Comelli, P. Ferragina, M. Granieri, F. Stabile. Optical recognition of motor vehicle license plates. IEEE Trans. On Vehicular Technology, Vol. 44, No. 4, pp. 790–799, 1995. 79 80 [12] K. Donaldson, G. Myers. 
Bayesian Super-Resolution of Text in Video with a Text-Specific Bi-Modal Prior. SRI http://www.sri.com/esd/projects/vace/docs/ IJDAR2003-Myers-Donaldson.pdf [13] G. Dorko and C. Schmid. Selection of scale-invariant parts for object class recognition. Proc. ICCV, 2003. [14] A. Ferencz, E. Miller, J. Malik. Learning hyper-features for visual identification. NIPS, 2004. [15] W. Förstner. E. Gülch. A fast operator for detection and precise location of distinct points, corners and circular features. Proc. Intercommission Conference on Fast Processing of Photogrammetric Data, Interlaken. 281–305, 1987. [16] Y. Freund. Boosting a weak learning algorithm by majority. Information and Computation, Volume 121: 256–285, 1995 [17] Y. Freund, R. E. Schapire. A decision-theoretic generalization of on-line learning and an application to boosting. EuroCOLT 95, pages 23-37, SpringerVerlag, 1995. [18] A. Georghiades, P. Belhumeur, D. Kriegman. From few to many: Illumination Cone models for face recognition under variable lighting and pose. IEEE trans. PAMI, pp.643–660, 2001. [19] C. Harris, M. Stephens. A combined corner and edge detector. Alvey Vision Conference, pp 147–151, 1988. [20] H. Hegt, R. de la Haye, N. Khan. A high performance license plate recognition system. SMC’98 Conference Proceedings. 1998 IEEE International Conference on Systems, Man, and Cybernetics (Cat. No.98CH36218). IEEE. Part vol.5, 1998, pp.4357–62 vol.5. New York, NY, USA. [21] http://en.wikipedia.org/wiki/London Congestion Charge [22] http://www.cbp.gov/xp/CustomsToday/2001/December/custoday lpr.xml [23] California DMV Smog Check Web Site http://www.smogcheck.ca.gov/vehtests/pubtstqry.aspx [24] M. Irani, S. Peleg. Super resolution from image sequences. In International Conference on Pattern Recognition, pages 115120, 1990. [25] T. Kadir and M. Brady. Saliency, scale and image description. Proc. IJCV, 45(2): 83–105, 2001. 81 [26] V. Kamat, S. Ganesan. An efficient implementation of the Hough transform for detecting vehicle license plates using DSP’S. Real-Time Technology and Applications Symposium (Cat. No.95TH8055). IEEE Comput. Soc. Press. 1995, pp.58–9. Los Alamitos, CA, USA. [27] N. Khan, R. de la Haye, A. Hegt. A license plate recognition system. SPIE Conf. on Applications of Digital Image Processing. 1998. [28] K. Kim, K. Jung, and J. Kim, Color texture-based object detection: an application to license plate localization. Lecture Notes in Computer Science: International Workshop on Pattern Recognition with Support Vector Machines, pp. 293–309, 2002. [29] B. Leung. Component-based Car Detection in Street Scene Images. Master’s Thesis, Massachusetts Institute of Technology, 2004. [30] D. Lowe. Distinctive image features from scale-invariant keypoints. IJCV, 2(60):91–110, 2004. [31] G. Mori, S. Belongie, J. Malik. Efficient Shape Matching Using Shape Contexts, PAMI (to appear), 2005 [32] G. Mori and J. Malik. Recognizing objects in adversarial clutter: breaking a visual CAPTCHA. Proc. CVPR, 2003. [33] H. Murase, S.K. Nayar, Visual learning and recognition of 3-D objects from appearance. ICJV, 1995. [34] H. Murase, M. Lindenbaum, Spatial temporal adaptive method for partial eigenstructure decomposition of large images. Tech. Report 6527, Nippon Telegraph and Telephone Corporation, 1992. [35] T. Naito, T. Tsukuda, K. Yamada, K. Kozuka. Robust recognition methods for inclined license plates under various illumination conditions outdoors. Proc. of IEEE/IEEJ/JSAI International Conference on Intelligent Transportation Systems, pp. 
697702,1999. [36] T. Naito, T. Tsukada, K. Yamada, K. Kozuka, S. Yamamoto, Robust licenseplate recognition method for passing vehicles underoutside environment. IEEE T VEH TECHNOL 49 (6): 2309–2319 NOV 2000. [37] J. Nijhuis, M. Brugge, K. Helmholt, J. Pluim, L. Spaanenburg, R. Venema, M. Westenberg. Car license plate recognition with neural networks and fuzzy logic. Proceedings of IEEE International Conference on Neural Networks, Perth, Western Australia, pp 21852903. 1995. [38] K. Okuma, A. Teleghani, N. de Freitas, J. Little and D. Lowe. A boosted particle filter: Multitarget detection and tracking, ECCV, 2004. 82 [39] C. Papageorgiou, T. Poggio. A trainable object detection system: car detection in static images. MIT AI Memo, 1673 (CBCL Memo 180), 1999. [40] R. Schapire. The strength of weak learnability. Machine Learning, 5(2):197– 227, 1990. [41] C. Schlosser, J. Reitberger, S. Hinz, Automatic car detection in high-resolution urban scenes based on an adaptive 3D-model. Proc. IEEE/ISPRS Workshop on ”Remote Sensing and Data Fusion over Urban Areas”. 2003. [42] R. Schultz, R. Stevenson. Extraction of high-resolution frames from video sequences. IEEE Transactions on Image Processing, 5(6):996–1011, 1996. [43] H. Schneiderman, T. Kanade. A statistical method for 3D object detection applied to faces and cars. IEEE CVPR, 2000. [44] V. Shapiro, G. Gluhchev. Multinational license plate recognition system: segmentation and classification. Proc. ICPR 1051–4651. 2004. [45] J. Shi, C. Tomasi, Good Features to track. Proc. IEEE Conf. on Computer Vision and Pattern Recognition (CVPR94), Seattle, June 1994. [46] J. Sivic, A. Zisserman. Video google: a text retrieval approach to object matching in videos. Proc. ICCV, 2003. [47] G. Sullivan., K. Baker, A. Worrall, C. Attwood, P. Remagnino, Model-based vehicle detection and classification using orthographic approximations. Image and Vision Computing. 15(8), 649–654. [48] M. Turk, A. Pentland. Face recognition using eigenfaces. Proc. CVPR, 1991. [49] P. Viola, M. Jones. Rapid object detection using a boosted cascade of simple features. Computer Vision and Pattern Recognition, 2001. CVPR 2001. Proceedings of the 2001 IEEE Computer Society Conference on , Volume: 1, 8–14 Dec. 2001 Pages:I-511 - I-518 vol.1 [50] J. Wills, S. Agarwal, S. Belongie. What went where. Proc. CVPR pp. 98104, 2003. [51] Y. Yanamura, M. Goto, D. Nishiyama, M. Soga, H. Nakatani, H. Saji. Extraction and tracking of the license plate using Hough transform and voted block matching. IEEE IV2003 Intelligent Vehicles Symposium. Proceedings (Cat. No.03TH8683). IEEE. 2003, pp.243–6. Piscataway, NJ, USA. [52] A. Zomet, S. Peleg. Super-resolution from multiple images having arbitrary mutual motion. Super-Resolution Imaging, Kluwer Academic, 2001.