Boosting Sparse Representations for Image Retrieval

by Kinh H. Tieu

Submitted to the Department of Electrical Engineering and Computer Science in partial fulfillment of the requirements for the degree of Master of Science in Computer Science and Engineering at the Massachusetts Institute of Technology, February 2000.

© Massachusetts Institute of Technology 2000. All rights reserved.

Author: Department of Electrical Engineering and Computer Science, January 31, 2000

Certified by: Paul Viola, Associate Professor, Department of Electrical Engineering and Computer Science, Thesis Supervisor

Accepted by: Arthur C. Smith, Chairman, Department Committee on Graduate Students

Abstract

In this thesis, we developed and implemented a method for creating sparse representations of real images for image retrieval. Feature selection occurs both offline, by choosing highly selective features, and online via "boosting". A tree of repeated filtering with simple kernels is used to compute the initial set of features. A lower-dimensional representation is then found by selecting the most selective of these features. At query time, boosting selects a few of the features useful for the particular query and ranks the images in the database by taking the weighted vote of an ensemble of classifiers, each using a single feature. This method allows for a large number of potential queries and facilitates fast retrieval on very large image databases. The method is tested on various image sets using standard measures of retrieval performance. An online demo of the system is available via the World Wide Web at http://www.ai.mit.edu/projects/lv/.

Thesis Supervisor: Paul Viola
Title: Associate Professor, Department of Electrical Engineering and Computer Science

Acknowledgments

This research was supported in part by Nippon Telegraph and Telephone. First I thank this great nation for giving me and my family hope. My life would be drastically different without the incredible opportunities I have received here. Thanks to the Massachusetts Institute of Technology and the Artificial Intelligence Laboratory for supporting my studies and providing superb resources for my research. I thank all the members of the AI Lab who have helped me along the way, especially the Learning and Vision Group. Special thanks to John Winn, Dan Snow, Mike Ross, Nick Matsakis, Jeremy De Bonet, Christian Shelton, John Fisher, and Chris Stauffer.
Extra special thanks to Erik Miller for reading the thesis and offering helpful comments. Thanks to Professor Eric Grimson for advice, support, and helping me to fulfill my academic requirements in a timely and worthwhile manner. Finally, I thank my advisor, Professor Paul Viola, for his care, encouragement, support, and insight. He has greatly enriched my understanding of this work and of how to do research. It is a rewarding and pleasurable experience working with Paul, and I appreciate the time, energy, and thought that he provides. For my family: I would not be here without your unconditional encouragement, support, and love.

Contents

1 Introduction
 1.1 Thesis Summary
 1.2 Image Retrieval
  1.2.1 The Problem Model
  1.2.2 Specifications
 1.3 Scene Analysis
 1.4 Image Indexing
  1.4.1 Measurement Space
  1.4.2 Similarity Function
 1.5 Selective Measurements
  1.5.1 Image Generation
  1.5.2 Measurement Design
 1.6 Learning a Query
 1.7 Performance Evaluation
 1.8 Impact
 1.9 Why Try to Solve Image Retrieval Today?
 1.10 Thesis Organization

2 Previous Approaches
 2.1 Color Indexing
  2.1.1 Histograms
  2.1.2 Correlograms
 2.2 Color, Shape, Texture
  2.2.1 QBIC
  2.2.2 CANDID
  2.2.3 BlobWorld
  2.2.4 Photobook
  2.2.5 JACOB
  2.2.6 CONIVAS
 2.3 Wavelets
 2.4 Templates
  2.4.1 VisualSEEk
  2.4.2 Flexible Templates
  2.4.3 Multiple Instance Learning and Diverse Density
 2.5 Database Management Systems
  2.5.1 Chabot
  2.5.2 SCORE
 2.6 Summary of Previous Approaches
 2.7 Related Work
  2.7.1 Information Retrieval
  2.7.2 Object Recognition
  2.7.3 Segmentation
  2.7.4 Line Drawing Interpretation

3 Selective Measurements
 3.1 Motivation
  3.1.1 Preprocessing vs. Postprocessing
  3.1.2 Image Representations
  3.1.3 Sparse Representations
 3.2 Design
  3.2.1 Filters
  3.2.2 Filtering Tree
  3.2.3 Color Space
  3.2.4 Normalization
 3.3 Measurement Selection
  3.3.1 Selectivity

4 Learning Queries Online
 4.1 Image Retrieval as a Classification Task
 4.2 Aggregating Weak Learners
 4.3 Boosting Weak Learners
 4.4 Learning a Query
 4.5 Observations
 4.6 Relevance Feedback
  4.6.1 Eliminating Initial False Negatives
  4.6.2 Margin Boosting by Querying the User

5 Experimental Results
 5.1 Performance Measures
 5.2 Natural Scene Classification
 5.3 Principal Components Analysis
 5.4 Retrieval
 5.5 Face Detection
 5.6 Digit Classification

6 Discussion
 6.1 Summary
 6.2 Applications
 6.3 Research Areas
 6.4 Connections to Biology
 6.5 Future Research

A Implementation of the Image Retrieval System
 A.1 Architecture
 A.2 User Interface
 A.3 Online Demo

List of Figures

1-1 An airplane and a race car may be classified as belonging to the class of vehicles using the functional property "transportation".
1-2 Example images from an airplane class and a race car class defined by visual properties.
1-3 Schematic of an image retrieval system.
1-4 A set of random images (that can be found on the World Wide Web) that illustrates the diversity of visual content in images.
1-5 An example of a scene with many objects and complex relationships between the objects. For example, is the Golden Gate Bridge the only important object in the image? Are the mountains in the background important? Should it be classified as a scene of a coastline? Are the people in the foreground important?
There are many ways to describe this image; the difficulty lies in being able to represent all these relationships in a tractable manner.
1-6 A query on the AltaVista system (which uses color histogram measurements) for images similar to the sunset image in the upper left corner. Here color is effective because orange colors dominate in images of sunsets but not in other images.
1-7 A query on the AltaVista system for images similar to the image of the Eiffel Tower in the upper left corner. Color is ineffective for queries such as this one where background colors (i.e., the blue sky here) dominate.
1-8 Possible renderings of two images generated by picking a few specific items from a large dictionary of visual events.
1-9 The diagonal pattern of vertical edges (marked in red) arising from the stairs in this image represents a "staircase" pattern.
1-10 The boosting process iteratively constructs a set of classifiers and combines them into a single boosted classifier.
1-11 A query on the system described in this thesis for images similar to the three example images of airplanes at the top.
2-1 The color histograms for both images are very similar because they both contain similar amounts of "yellow", "green", and "brown". However most people would agree that these images represent very different visual concepts.
2-2 A hand-crafted template for images of waterfall scenes. Here the waterfall concept is defined to be a white region in between two green regions with a blue region on top.
2-3 An E-R diagram for an image of a sailboat.
2-4 A query on the AltaVista system with the keyword "Eiffel" that illustrates the rich semantic content of particular words. Since an image would probably be labeled "Eiffel" only if it contained an image of the Eiffel Tower, text is effective for this query.
2-5 A query on the AltaVista system with the keyword "jets". Although we wanted images of jet propulsion vehicles, the system retrieved images of the Jets football team.
3-1 The image on the right is a permutation of the pixels of the image of the car (left). Using too general an image representation allows for these kinds of unrealistic images.
3-2 On the left are the principal components of image patches. On the right are the principal components of the responses to the fourth principal component on the left (a bar-like filter).
3-3 The 9 primitive filters used in computing the measurement maps. In practice, these can be efficiently computed with separable horizontal and vertical convolutions.
3-4 A schematic of the filtering tree where a set of filters is repeatedly applied to an image to capture more global structure.
3-5 Response of an image of a tiger to a particular filtering sequence. Note that this feature has detected a strong peak corresponding to the arrangement of the stripes on the body of the tiger.
3-6 Histograms for a highly selective (left) and an unselective (right) measurement. The highly selective distribution had a selectivity of 0.9521 and a kurtosis of 132.0166. The unselective measurement had a selectivity of 0.4504 and a kurtosis of 2.3703.
4-1 Some typical images exhibiting the "waterfalls" concept. Note the diversity in the images: some contain sky, some contain flora, while others are mostly rock, etc.
4-2 The left plot shows how boosting finds measurements which separate images of mountains from other images better than choosing random measurements (right).
4-3 The boosted measurements which separate images of mountains well do not discriminate images of lakes as well.
4-4 At left is the initial performance. On the right is the improved performance after the four most egregious false positives were added as negative examples. This example is on a data set containing five classes with 100 images in each class (see Chapter 5).
5-1 An example image from each class of sunsets, mountains, lakes, waterfalls, and fields.
5-2 Measurements with four layers of filtering (right) performed better than using only one layer of filtering (left).
5-3 A comparison between using color histograms with the chi-square similarity function (left) and using selective measurements and boosting (right).
5-4 Here we show that boosting can be used with color histograms to give comparable performance. Boosting achieves this using only half of the histogram measurements. This results in a substantial computational savings on a large database.
5-5 A query for waterfalls using color histograms and the chi-square function. Note global color histograms cannot capture the flowing vertical shape of waterfalls.
5-6 A query for sports cars using color histograms and the chi-square function. Note, unsurprisingly, that the few cars found are all red; cars of other colors in the database are not ranked highly.
5-7 A query for waterfalls using color histograms and boosting. Note that boosting finds images with similar color distributions, but most of those images are not of waterfalls.
5-8 A query for sports cars using color histograms and boosting. Note that boosting finds images with similar color distributions, but most of those images are not of cars.
5-9 The principal components correlations (left) are much smaller than the original measurement correlations (right).
5-10 Performance is reduced using the principal components of the measurements.
5-11 A query for sunsets. The positive examples are shown on the left.
5-12 A query for waterfalls. The positive examples are shown on the left.
5-13 A query for sports cars. The positive examples are shown on the left.
5-14 A query for cacti. The positive examples are shown on the left.
5-15 The average receiver operating curve (left) and precision-recall curve (right) for detecting faces.
5-16 A query for faces. The positive examples are shown on the left.
5-17 Results for images of the ten digits.
5-18 Results for images of the ten digits.
6-1 An image with various possible classifications based on visual content, context, prior knowledge, etc.
A-1 Image retrieval system architecture.
A-2 Image retrieval interface.

List of Tables

2.1 A comparison of different image retrieval systems.
4.1 The boosting algorithm for learning a query online. T hypotheses are constructed, each using a single feature. The final hypothesis is a linear combination of the T hypotheses where the weights are inversely proportional to the training errors.

Chapter 1
Introduction

1.1 Thesis Summary

Today, with digital cameras and inexpensive mass storage devices, digitizing and storing images is easier than ever. It is not uncommon to find databases with over 500,000 images [33]. Moreover, multimedia content on the Internet continues to grow dramatically: in 1995, there were an estimated 30 million images on the World Wide Web [47], and today that number is certainly much higher. This explosion of images creates a need to be able to index these databases so that users can easily and quickly find particular images. The goal of an image retrieval system is to provide querying capabilities comparable to those which exist for text document databases such as [2]. The problem involves designing an image representation suitable for retrieval and an algorithm for learning a query. The representation must somehow capture the visual content in images. Since the images desired are not known ahead of time, it is difficult to provide the learning algorithm with many training examples. So training often occurs with only the few training examples provided by the user. A further requirement of the entire system is that it must operate in real time so that it can be used interactively.

This thesis presents an approach to the image retrieval problem that is motivated by the statistics of real images (i.e., images of real scenes, not computer generated ones). The task is: "Given some description of visual content, find images which fit the description." In other words, given a concept or class of images, find other images exhibiting the same concept or belonging to the same class. This has important implications for computer vision because being able to find images which match an arbitrary description implies being able to concretely describe an image. In other words, there must be an explicit description of every image so that the description can be compared (by a computer) to the target concept.

Humans can retrieve images easily and almost unconsciously, especially because we can compare the functional properties of objects in an image. For example, both the airplane and the race car in Figure 1-1 could be assigned to a "vehicle" class since both provide transportation. This type of categorization may be regarded as "higher" level cognition and resembles the superordinate categories of [39]. Functional categorization requires more than just the visual features of an image; some characterization of the function "transportation" is needed.
This type of characterization may also involve some assessment of the utility of objects. Strategies for using functional properties are not well developed, even in more constrained domains such as text document retrieval. It is exactly the ease and mystery with which our brains perform this task that makes it extremely difficult to program a computer (which currently requires explicit instructions) to do it.¹ We will sidestep this problem by only considering information which can be extracted from the "visual content" of images. For example, although our system should be able to identify the Eiffel Tower as a tower, it is not expected to infer meta properties such as "(location Paris)". Thus, it is reasonable to expect our system to be able to categorize airplanes as one class, and race cars as another, since each of those classes shares many visual characteristics, as shown in Figure 1-2. This formulation of the problem is commonly known as "query by image content". It corresponds more closely to the "basic level" categorization of [39]. We can thus formulate image retrieval as the following task: given very few example images, rapidly learn to retrieve other examples.

¹This is often a good definition of current artificial intelligence problems.

Figure 1-1: An airplane and a race car may be classified as belonging to the class of vehicles using the functional property "transportation".

Figure 1-2: Example images from an airplane class and a race car class defined by visual properties.

For over 20 years, computer vision has tried to develop systems which take an arbitrary input image and generate a scene description. This general problem has been decomposed into more specific subproblems such as image segmentation, object detection, and object recognition. The idea is that if we could solve the subproblems and combine the results, we could compute a complete description of an image and hence compare images by comparing descriptions. Although progress has been made in each of the subproblems, each remains unsolved. Part of the reason for this is that the subproblems are intimately related in subtle ways: having a perfect segmentation would help isolate areas in an image for detection or recognition, and having a perfect detection or recognition system would help us find the segmented regions corresponding to different objects. Research in fusing the results of these subsystems is less developed.

We will attack the image retrieval problem without trying to explicitly solve the computer vision subproblems. This reasoning is motivated by the observation that one should not try to solve a more difficult intermediate problem if one can more easily solve the problem directly [49]. We will exploit the statistics of real images by presuming that although many possible events (i.e., visual events such as an image of the Eiffel Tower) may occur in an image, in any particular image only a small set of events will actually happen. This observation suggests a sparse distributed representation for images. We will design a set of features such that for any particular image, only a few key ones will be active. However, over all possible images, each feature should have about equal probability of being active. These features will necessarily be selective, meaning the average response over all images is much lower than the strong responses for a few particular images. To measure the similarity between images, a system can learn which are the important features and compare just those features.
To select the most relevant subset of features with only a few example images, "boosting" [20] is used to create a series of learners. Each successive learner tries to do better on examples that the previous one did poorly on. This approach is well suited to image retrieval since the selective features can be computed offline once and for all. The learning can be performed quickly with just a few training examples because the representation is sparse. Since only a few relevant features will be selected, the entire database of images can be searched with fewer computations per image. Besides offering a possible solution for image retrieval, this thesis suggests a theory of useful generic intermediate-level image representations suitable for simple, fast learning with a small number of training examples.

1.2 Image Retrieval

1.2.1 The Problem Model

Image retrieval can be decomposed into two parts as shown in Figure 1-3:

image indexing: storing images in a representation that facilitates retrieval.
querying: learning the desired query concept, and searching the database for images matching the concept.

Image indices are typically precomputed offline so that they do not have to be recomputed for each query. Depending on the type of index used, this may limit the kinds of queries which are possible, but improves retrieval speed. For example, if only color information is stored in an index, then queries based on shape will not be possible. However, for queries involving color, only the index is required, and the image never needs to be rescanned. Querying is typically performed online as this is part of the interactive user interface of the retrieval system.

In machine learning, a concept is induced or "learned" given a set of training examples. The concept could be a rule to discriminate between images of airplanes and race cars. Machine learning is often applied to classification problems where the problem is to assign a label given an input. The training examples are usually given as a large set of labeled data $(x^n, t^n)$ where $x^n$ is the $n$th input image and $t^n$ is the corresponding target label. The goal is to learn the concept so that unseen test examples can be labeled correctly. Traditional machine learning methods for classification are difficult to apply to retrieval because they often require a small number of known classes and many labeled training examples. Retrieval differs from classification because the target class is unknown in advance, and only defined at query time. In addition, it is possible for a single image to be relevant for two different queries (e.g., an image picturing a sailboat off the coast of France may be relevant both for a query for boats and for a query for coasts). Making matters worse, image databases are often huge (thousands or millions of images) and usually contain a diverse set of images, as shown in Figure 1-4.

Formally, the task is to rank all the images in a database according to how similar they are to a query concept using a ranking function

$$r(x, Q) : x \rightarrow [0, 1] \qquad (1.1)$$

where $x$ is an image to be ranked, $Q = q^1, \ldots, q^N$ are example images of the query concept, 0 stands for most dissimilar, and 1 stands for most similar. Note that $r(x, Q)$ depends on the particular query $Q$. Typically $N$ ranges from 1 to 5. This situation calls for a way to index the images in the database to allow for fast (because online users will not tolerate a long response delay) and flexible (because different users will want to retrieve different types of images) machine learning of $r(x, Q)$ using very few training examples.

Figure 1-3: Schematic of an image retrieval system.

Figure 1-4: A set of random images (that can be found on the World Wide Web) that illustrates the diversity of visual content in images.
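To make this decomposition concrete, here is a minimal Python sketch of the two-part pipeline in Figure 1-3. It illustrates only the interfaces: `extract_measurements` and `learn_ranker` are hypothetical placeholders for the components developed later in the thesis, not the system's actual code.

```python
import numpy as np

def index_database(images, extract_measurements):
    """Offline step: convert every image into a measurement vector."""
    return np.stack([extract_measurements(im) for im in images])

def retrieve(index, example_ids, learn_ranker):
    """Online step: learn r(x, Q) from a few examples, then rank the database."""
    Q = index[example_ids]                 # measurement vectors of the examples
    r = learn_ranker(Q)                    # maps a measurement vector x to [0, 1]
    scores = np.array([r(x) for x in index])
    return np.argsort(-scores)             # image ids, most similar first
```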
1.2.2 Specifications

Image retrieval is a difficult problem because it is not clear how best to represent images. For text, because words have such rich semantic content, significant results have been achieved by merely considering documents as lists of word counts without any regard for word order. An image retrieval system must also be efficient in order for it to be practical on large databases. Below are the two primary requirements of an image retrieval system:

fast: search through a large database (thousands or millions of images) quickly (milliseconds or seconds).
flexible: handle a large number (hundreds or thousands) of different image queries (concepts) such as sunsets, cars, crowds of people, etc.

As noted in [46], an image retrieval system must operate in real time. This allows the system to make use of relevance feedback from the online user. Feedback generally consists of the user choosing more examples and eliminating unwanted ones. The method of specifying queries determines the type of user interface an image retrieval system should have. We have chosen an interface where a query is specified by choosing a set of example images representative of the class of images implied by the query. This is commonly known as "query by example". It frees the user from having to manually draw an image or from trying to describe the visual content with words.² In particular, a system should not demand too much of the user except for a few simple tasks such as:

browsing: cycling through random images.
positives: choosing some positive examples of the desired concept.
negatives: possibly choosing some negative examples.
feedback: possibly doing a few rounds of relevance feedback, corresponding to picking more positives and negatives.

²"A picture is worth more than a thousand words." (anonymous)

Figure 1-5: An example of a scene with many objects and complex relationships between the objects. For example, is the Golden Gate Bridge the only important object in the image? Are the mountains in the background important? Should it be classified as a scene of a coastline? Are the people in the foreground important? There are many ways to describe this image; the difficulty lies in being able to represent all these relationships in a tractable manner.

1.3 Scene Analysis

Based on the formulation of the image indexing problem, it seems appropriate to represent each image as a description of the scene that was imaged. This description would include the various objects in the image and how they are related spatially and by other relations. To determine the similarity of two images, we simply determine the similarity of the scene descriptions. There are several reasons that this type of system does not exist today. First, there does not exist a robust method for extracting scene descriptions given an image. To do that we must first be able to recognize arbitrary objects in an image. Object recognition is an ongoing goal of computer vision and continues to defy a general solution. Second, it is not clear what should be considered an object (e.g., should the mountains in Figure 1-5 be recognized as one object or multiple mountains?).
Finally, given two scene descriptions, it is not clear how similarity should be measured. Do we find how many objects images x and y have in common, or are the spatial and other relationships between the objects more important?

1.4 Image Indexing

The first step to creating an image retrieval system is to develop a representation for images that facilitates retrieval. Image indexing takes an input image (initially represented as an array of pixel intensities) and produces a new representation that makes it easy to find a particular set of images in a database. One way to think of this problem is to consider the analogous task of making indices for books in a library. For example, a book might be indexed by author, title, and publication date, as well as by an abstract. The goal of the index or representation is to permit quick access to a particular book or class of books.

1.4.1 Measurement Space

To index images, we need to take some measurements of the input image. Let us define a measurement as some function of the original pixel values of the image. Assume that we have some way of extracting a set of measurements $M$ from an input image. For example, one measurement could be the count of the number of pixels of a certain color. Assuming the elements $x_i$ of $M$ are scalars (i.e., $x_i \in \mathbb{R}$), we can arrange them into a measurement vector $x = [x_1, \ldots, x_d]^T$ where $d$ is the total number of measurements. We can thus consider an image as an element of the abstract space $\mathbb{R}^d$. This is the multidimensional input (usually called a "feature" vector, although that connotes binary measurements) commonly assumed by many classical pattern recognition techniques [18].

1.4.2 Similarity Function

We can define the similarity $s(x, y)$ between two images (vectors) $x$ and $y$ simply as a function of the $L_2$ distance between the vectors:

$$s(x, y) = e^{-d(x, y)^2} \qquad (1.2)$$

where

$$d(x, y) = \sqrt{\sum_{i=1}^{d} (x_i - y_i)^2} \qquad (1.3)$$

is the $L_2$ distance. Points $x$ and $y$ are maximally similar if they are represented as the same point in $\mathbb{R}^d$, since the $L_2$ distance between them will be 0 and their similarity will be the maximum value of 1 (note that $d(x, y) \ge 0$). As the distance between two images in measurement space linearly increases, the similarity will decrease exponentially. This formulation is equivalent to saying that similarity is proportional to the density of $x$ under a spherical gaussian centered on $y$ (or vice versa), since $s(\cdot, \cdot)$ is in the form of a gaussian density:

$$p(x) = \frac{1}{(2\pi)^{d/2} |\Sigma|^{1/2}} \, e^{-\frac{1}{2}(x - \mu)^T \Sigma^{-1} (x - \mu)}. \qquad (1.4)$$

Note that the similarity function is probably not the correct or "natural" metric for images in measurement space. For example, the triangle inequality property of the distance function may not be valid perceptually (e.g., red is similar to orange and orange is similar to yellow, but the sum of these two distances can be smaller than the distance between red and yellow, since these colors may appear very dissimilar). In addition, it is unclear whether or not similarity should decrease exponentially with distance. Based on sample query results, many commercial image retrieval systems today appear to use color histograms (pixel color counts as measurements) and a static ranking function such as

$$r(x, Q) = s(x, \mu) \qquad (1.5)$$

where

$$\mu = \frac{1}{N} \sum_{n=1}^{N} q^n \qquad (1.6)$$

is the sample mean of the measurements of the example images. Once again, this is equivalent to similarity being proportional to the density of the gaussian centered at $\mu$. In these systems, first a color histogram representation would be generated for every image in the database. A user specifies a query by choosing an example image. The system then retrieves the most similar images by finding the images which are closest to the example image in the $L_2$ sense. Color histograms work well when the amount of color happens to be a discriminating measurement for a particular class of images, such as sunsets, as shown in Figure 1-6.

Figure 1-6: A query on the AltaVista system (which uses color histogram measurements) for images similar to the sunset image in the upper left corner. Here color is effective because orange colors dominate in images of sunsets but not in other images.
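The following numpy sketch renders this static ranking scheme concretely: it scores every histogram in the database by its gaussian similarity to the mean of the example histograms. It is an illustration of equations 1.2, 1.5, and 1.6 under assumed array shapes, not the code of any particular commercial system.

```python
import numpy as np

def static_rank(database, examples):
    """database: (M, d) measurement vectors; examples: (N, d), typically N = 1-5."""
    mu = examples.mean(axis=0)                 # sample mean of the examples (eq. 1.6)
    d2 = ((database - mu) ** 2).sum(axis=1)    # squared L2 distance to the mean
    scores = np.exp(-d2)                       # s(x, mu) = exp(-d(x, mu)^2) (eq. 1.2)
    return np.argsort(-scores)                 # most similar images first
```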
However, as shown in Figure 1-7, color is inadequate for a query of images of the Eiffel Tower. In fact, the system has done a better job of finding blue sky than tall, steel-framed structures. Even with useful measurements, using a single static similarity function may not be appropriate for every type of query. For example, although blue sky was not an important feature for the Eiffel Tower query, it may be useful in queries for airplanes.

Figure 1-7: A query on the AltaVista system for images similar to the image of the Eiffel Tower in the upper left corner. Color is ineffective for queries such as this one where background colors (i.e., the blue sky here) dominate.

1.5 Selective Measurements

1.5.1 Image Generation

Consider the following observation of real images: many possible visual events can occur in images, but in any particular image, only a few will actually happen. Suppose we want to generate some image. We can pick from a very large "dictionary" of visual events (e.g., clear or cloudy sky; grassy, sandy, or rocky ground; a dalmatian; a sports car; the Taj Mahal; etc.). For example, we can generate an image of a sports car on the beach. In the background would be the ocean and perhaps some waves and a few mossy rocks. We could have generated an entirely different picture, such as one of grazing buffalos. Possible renderings of these images are shown in Figure 1-8. Note however that we cannot have too many things in any single image: it is very unlikely to have both the buffalos and the sports car in the same scene.

Figure 1-8: Possible renderings of two images generated by picking a few specific items from a large dictionary of visual events.

The statistics of real images allow for many types of images but not too many types of visual events in one particular image (e.g., we are very unlikely to find an image of an aircraft carrier and a herd of camels on the slopes of Mt. Everest).

Now suppose we want to know how similar images x and y are. Since we hypothesize that images are generated by selecting visual events, one reasonable approach is to determine the similarity between the events in x and y. So if both x and y contain a sports car on the beach, we might say x is very similar to y. Of course, remember that we are given images in the form of arrays of pixel intensities. Since there is no explicit representation of visual events, we will try to make measurements which respond when these events are present.
It is as if we must somehow explain how an image is produced from the abstract generative model of images previously described. The key question is how to design our measurements.

1.5.2 Measurement Design

Our approach is based on extracting "selective measurements" of images which correspond to the visual events that we are interested in. Intuitively these are relatively global, high level structural organizations of more primitive measurements. A primitive measurement such as an edge can occur in many places in a particular image.³ We are more concerned with higher order events such as a specific pattern of edges, as in tiger or zebra stripes or a "staircase" pattern. Figure 1-9 shows how a diagonal arrangement of vertical edges corresponds to a staircase pattern. We believe that a key characteristic of these events is that they are sparse. Measurements for these events will be selective since they will only respond when a group of pixels are arranged in a specific pattern. A measurement which detects whether a pixel is green or not will not be selective, since many pixels of an image may be green and furthermore many images will contain green pixels.

³For example, in any region where there is a strong intensity gradient.

Figure 1-9: The diagonal pattern of vertical edges (marked in red) arising from the stairs in this image represents a "staircase" pattern.

We have designed a filtering sequence that attempts to extract selective measurements. The first stage of filtering uses masks which signal primitive events such as oriented edges and bars. Each successive stage of filtering uses the same set of filters but of increasing support (the ratio of the filter and image sizes) and is applied to the response images of the previous stage. The idea is to capture the structural organization of lower level responses.

Note that by itself, the ability to uniquely identify every image is not useful. In fact the array-of-pixels representation is already adequate for that purpose. However it does not suggest an obvious way to generalize (i.e., to group a set of images as members of a class). We can view this most primitive representation as "overfitting" particular images. In machine learning, a classifier is a system which, when given an input, outputs the corresponding class label. Just as we can trivially build a classifier with zero training error by simply memorizing the training samples with a lookup table, we can just as easily assign a unique label (e.g., 1, 2, ..., N) to each sample. However it is unlikely that two images will ever have the exact same array of pixel values, so this method will not allow us to label new images. For example, a slight translation of all the pixels to the right will cause a change in almost every one of the array values. However, we would still regard the slightly shifted image as very similar to the original image. We need measurements which occur at an intermediate level of representation. This will enable our system to compare these measurements and use them to identify a variety of image classes. We have designed our selective measurements to fill this need. An added benefit of selective measurements is that they can speed up learning. Selective measurements will induce a sparse representation for images. In other words, only a few measurements will be active for a particular class of images. Thus a learning algorithm need only focus on those particular measurements useful for identifying images in the class.
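The filtering sequence itself is specified in Chapter 3; the sketch below is only meant to convey the idea. It uses two made-up 3x3 masks (a vertical edge and a vertical bar), and it approximates the growing filter support by rectifying and subsampling the response maps between stages, so that the same small kernels see increasingly global structure.

```python
import numpy as np
from scipy.signal import convolve2d

# Two illustrative primitive masks (the actual system uses a larger filter set).
KERNELS = [
    np.array([[-1, 0, 1], [-1, 0, 1], [-1, 0, 1]], dtype=float),     # vertical edge
    np.array([[-1, 2, -1], [-1, 2, -1], [-1, 2, -1]], dtype=float),  # vertical bar
]

def filter_tree(image, depth=3):
    """Repeatedly filter, rectify, and subsample; sum the final maps into scalars."""
    maps = [image.astype(float)]
    for _ in range(depth):
        new_maps = []
        for m in maps:
            for k in KERNELS:
                r = np.abs(convolve2d(m, k, mode="same"))  # rectified response map
                new_maps.append(r[::2, ::2])               # subsample: larger support
        maps = new_maps
    return np.array([m.sum() for m in maps])  # one measurement per filtering path
```

With two kernels and three stages, each (grayscale) image yields 2^3 = 8 measurements; the real system uses more filters and keeps only the most selective filtering paths.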
1.6 Learning a Query

To take advantage of the selective measurements representation of images, we will use a "boosting" framework to learn the concept implied in a class of images. In our case we would like a classifier which will tell us whether an image belongs to, say, the class of waterfall images (i.e., is similar to a set of example images of waterfalls). Often it is quite easy to construct a classifier which performs better than chance simply by using some sort of information in the image, such as whether or not there exists a striped pattern. Using other measurements we can build more weak classifiers which perform better than random, but not much better. Boosting [20] is a general method for improving a weak classifier by iteratively constructing a set of them and linearly combining their outputs into a single boosted classification. Since only a few key measurements will be meaningful for a particular image class, the goal of the learning algorithm is to find these key measurements. Boosting starts by constructing a classifier which uses only one measurement. It then modifies the distribution of training example images so that incorrectly classified ones are more heavily weighted. The process is then repeated by selecting another measurement and building a second classifier. After a few iterations, we have a set of classifiers, each of which is like a rule of thumb for classifying an image. Boosting ensures that when all the classifiers are linearly combined, the weighted classification is more accurate. Figure 1-10 shows a schematic of the boosting process. Boosting takes advantage of the learning algorithm's ability to adapt to different training data and explicitly reweights specific training examples to alter the distribution of the training data. Boosting enables the learning algorithm to select the key measurements most relevant for the given query and ignore the rest.

Figure 1-10: The boosting process iteratively constructs a set of classifiers and combines them into a single boosted classifier.

After this initial training phase, querying the entire database only requires computations with T measurements, where T is the number of boosting iterations. Empirically we have found that 10 to 50 measurements give reasonable results. As a practical advantage, this makes storing the image representations in an inverted file more efficient. An inverted file indexes the database by measurement instead of by image. For example, file 1 contains measurement 1 for all images, file 2 contains measurement 2 for all images, etc. This type of organization is more useful for the boosting procedure, which looks at one measurement in multiple images at a time.
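As an illustration of this procedure, here is a minimal AdaBoost-style sketch with decision stumps, each thresholding a single measurement. The algorithm actually used for learning a query is specified in Table 4.1, so treat the details below (the exhaustive threshold search, the weight formula) as one standard instantiation rather than the thesis's exact method.

```python
import numpy as np

def boost(X, y, T=20):
    """X: (n, d) measurement vectors; y: labels in {-1, +1}; T: boosting rounds."""
    n, d = X.shape
    w = np.ones(n) / n                          # uniform weights over examples
    stumps = []
    for _ in range(T):
        best = None
        for j in range(d):                      # pick the single best measurement,
            for thresh in np.unique(X[:, j]):   # threshold, and polarity
                for sign in (1, -1):
                    h = sign * np.where(X[:, j] > thresh, 1, -1)
                    err = w[h != y].sum()
                    if best is None or err < best[0]:
                        best = (err, j, thresh, sign)
        err, j, thresh, sign = best
        err = np.clip(err, 1e-10, 1 - 1e-10)
        alpha = 0.5 * np.log((1 - err) / err)   # vote grows as error shrinks
        h = sign * np.where(X[:, j] > thresh, 1, -1)
        w *= np.exp(-alpha * y * h)             # upweight the mistakes
        w /= w.sum()
        stumps.append((alpha, j, thresh, sign))

    def classify(x):
        return np.sign(sum(a * s * (1 if x[j] > t else -1)
                           for a, j, t, s in stumps))
    return classify
```

The returned classifier is the weighted vote of the T stumps; ranking a database then needs only the T selected measurements per image, which is what makes the inverted file organization effective.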
Many people would agree that most of the retrieved images are similar to the example images. We will carefully construct some test databases and evaluate performance using standard measures borrowed from information retrieval [26]. 1.8 Impact In addition to being a practical tool for image indexing, search, and categorization, an image retrieval system must necessarily address important theoretical questions in vision and learning. The task of looking for images similar to an example image implies a definition of similarity for images. This definition in turn relies on an understanding of how to represent and explain images. Humans can retrieve images with ease. By formalizing this problem and developing a system to solve it, we can gain insight into the brain's solution. Note that the different types of possible queries (or concepts) is very large. The image retrieval system must be general enough to measure similarity for very different types of images, ranging from images of sunsets to crowds of people. Although developing measurement detectors for eyes, noses, and mouths will enable queries for faces or people, they will not work for queries of cars, waterfalls, etc. It is difficult to know how many image concepts need to be supported. In addition, the desired image class remains unknown until a query is specified. What 21 Figure 1-11: A query on the system described in this thesis for images similar to the three example images of airplanes at the top. 22 this calls for is a set of measurements that is general enough for a large class of images. The problem with using many different measurements is that learning structure in large measurement spaces is more difficult. This problem is exacerbated by the fact that we can only expect users to provide a few example images of the class. Many traditional machine learning techniques require hundreds or thousands of examples to learn a concept. Thus the image retrieval problem is a practical need, and presents important challenges to computer vision and machine learning. 1.9 Why Try to Solve Image Retrieval Today? Our approach to image retrieval describes images with visual content measurements. Finding the measurements themselves is a difficult problem and currently most representations use undiscriminating measurements such as color histograms. In addition, learning the query with only a few example images is a difficult task. Despite these shortcomings the advantage of this approach is that it deals with visual content directly making the system more intuitive and natural. It also allows for an investigation into the statistics of real images and learning with only a few examples. Thus instead of waiting for a complete general theory of vision to be found or remaining content with text annotation for images, our exploration with image retrieval may yield some interesting results and point out other problems that need to be solved. 1.10 Thesis Organization In this introduction we provided the reader with an overview of the image retrieval problem and proposed an approach motivated by the statistics of real images. We also briefly introduced ideas in computer vision, machine learning, and information retrieval which are relevant to this research. The rest of the thesis details the ideas presented here. Chapter 2 surveys the current approaches to image retrieval. Both historical approaches and current state of the art methods will be examined. We will compare our approach to previous methods and point out where our primary contribution lies. 
We will also briefly mention related work. Chapter 3 begins the discussion of the approach to the problem. We describe a method of extracting selective measurements from images and selecting particular measurements for the retrieval task. In Chapter 4, we describe how the image representation developed in Chapter 3 is used for image retrieval. We will present the approach for learning image concepts with very little training data. In Chapter 5, we will show the results of experiments using our approach for image retrieval. We also discuss our method of selecting and using image data sets and various performance measures. Chapter 6 summarizes our work and discusses future research directions.

Chapter 2
Previous Approaches

Approaches to image retrieval have been primarily exploratory. There is no satisfactory theory for why certain measurements and classifiers should be used. Although there has been an explosive growth of digital image acquisition and storage technology, researchers have only begun to improve image retrieval technology. These systems can be divided into feature/measurement-based approaches such as [46, 23, 6, 29, 38, 19, 43, 25, 10, 1, 34, 24] and database management system approaches such as [33, 3]. Some common characteristics of many previous approaches are:

* Only a single positive example is used.
* Negative examples are often not considered.
* The user is required to adjust the weights (importance) of various features.
* The user often needs to be experienced with an intricate query language.
* Various features are used without a principled way of combining them.
* There is no query learning, only a fixed similarity function.

These characteristics put a heavy burden on the user to be familiar with the intricacies of the retrieval system. Often a particular example image and set of weights will work well, but a slightly different setting of the weights or a different example will drastically alter the retrieval. Many of the early systems were tested on very small databases (on the order of a hundred images). This is a small number compared to the typical size of current image collections, and a minuscule fraction of the number of images on the World Wide Web.

2.1 Color Indexing

2.1.1 Histograms

One of the earliest approaches to image indexing used color histograms [46]. A color space such as RGB (red, green, blue) is discretized into a fixed set of colors $c_1, \ldots, c_d$ to be used as the bins of the histogram. The color histogram for an image is simply a table of the counts of the number of pixels in each color bin. [46] used the following "histogram intersection" similarity function:

$$s(x, t) = \frac{\sum_{i=1}^{d} \min(x_i, t_i)}{\sum_{i=1}^{d} t_i} \qquad (2.1)$$

where $x$ is the test image, $t$ is the model image, and $i$ indexes the bins in the histogram. This gives the sum of the fractional matches between the test image and the model. If objects in $x$ and $t$ are segmented from the background, then this is equivalent to the $L_1$ distance (sum of absolute values) of the histograms treated as Euclidean vectors. Color histograms work well for a database with a known set of segmented objects which are distinguishable by color (i.e., the magnitude of the color variation of the object under different photometric conditions is within the color quantization size).

Figure 2-1: The color histograms for both images are very similar because they both contain similar amounts of "yellow", "green", and "brown". However most people would agree that these images represent very different visual concepts.
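In code, building the histogram and computing the intersection are both short; the numpy sketch below assumes 8-bit RGB images and a uniform discretization into bins^3 colors, with all names our own.

```python
import numpy as np

def color_histogram(image, bins=8):
    """image: (H, W, 3) uint8 RGB. Discretize each channel into `bins` levels."""
    q = (image.astype(int) * bins) // 256              # per-channel bin index, 0..bins-1
    idx = (q[..., 0] * bins + q[..., 1]) * bins + q[..., 2]
    return np.bincount(idx.ravel(), minlength=bins ** 3)

def histogram_intersection(x, t):
    """Equation 2.1: sum of fractional matches of test x against model t."""
    return np.minimum(x, t).sum() / t.sum()
```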
The advantages of color histograms that motivate their use are:

* Histograms are easy to compute, in O(N) time using O(d) space, where N is the number of pixels.
* They are invariant to translation and rotation in the image plane, and to pixel resolution.
* Color is a much simpler and more consistent descriptor of deformable objects (e.g., a cloud, a sweater) than rigid shape-based representations.

The disadvantage of histograms is that we lose information about the distribution of color or the shape of colored regions. This is because the histogram is global and no attempt is made to capture any spatial information. By definition, the histogram can be computed by treating all pixels independently, since color is a pixel-level property. These extremely local measurements are then combined into global counts. The implicit assumption is that the spatial distribution of color is unimportant. However the simple example in Figure 2-1 shows that this is not the case. Two other assumptions are: (1) all colors are equally important, and (2) colors in the same discrete color bin (uniformly discretized) are similar. [46] performed experiments in which the model image was assumed to be segmented so that the histograms were not corrupted by noisy backgrounds. They used a small database of 66 images of objects well differentiated by color (e.g., different cereal boxes and shirts). For a practical system, users would be required to segment objects in the model image. This would require that the histograms be computed online, slowing down the overall retrieval process. Note that no attempt was made to use multiple examples and negative examples. Also there was no machine learning of queries. Although there are an exponential number of possible histograms (in the number of colors m), for real images this space is effectively much smaller because some combinations of colors are highly unlikely. Also, many dissimilar objects such as red cars and red apples will be tightly clustered, while similar objects such as red cars and black cars will be distantly separated in the color histogram space. So color histograms naturally only work well when the query is adequately described by global counts of colors, and these counts are unique to images relevant to the query.

2.1.2 Correlograms

Color correlograms augment global color histograms to describe the spatial distributions of colors [23]. The correlogram of an image is the set of pairwise distance relationships between colors:

$$\gamma_{c_i,c_j,k}(x) = \Pr_{p_1 \in c_i,\, p_2} \left[\, p_2 \in c_j \,\middle|\, \|p_1 - p_2\|_\infty = k \,\right]. \qquad (2.2)$$

It gives the probability that any pixel $p_2$ colored $c_j$ is within an $L_\infty$ distance (i.e., maximum vertical or horizontal distance) $k$ of any pixel $p_1$ colored $c_i$. Note that the correlogram is still global in a way because it describes how a particular color $c_i$ is distributed across the image.
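A direct, unoptimized sketch of the auto-correlogram (the case $c_i = c_j$ of equation 2.2) is given below, assuming the image has already been quantized to m color indices. Each L-infinity ring of radius k is enumerated by shifting the label image, which gives the flavor of the O(m^2 N |D|) computation; the names are our own.

```python
import numpy as np

def auto_correlogram(labels, m, distances=(1, 3, 5, 7)):
    """labels: (H, W) color indices in 0..m-1 (already quantized).
    Returns g[c, k] ~ Pr[p2 has color c | p1 has color c, L_inf dist(p1, p2) = k]."""
    H, W = labels.shape
    g = np.zeros((m, len(distances)))
    for ki, k in enumerate(distances):
        # all offsets on the L_inf ring of radius exactly k
        ring = [(dx, dy) for dx in range(-k, k + 1) for dy in range(-k, k + 1)
                if max(abs(dx), abs(dy)) == k]
        match, total = np.zeros(m), np.zeros(m)
        for dx, dy in ring:
            a = labels[max(0, dx):H + min(0, dx), max(0, dy):W + min(0, dy)]
            b = labels[max(0, -dx):H + min(0, -dx), max(0, -dy):W + min(0, -dy)]
            total += np.bincount(a.ravel(), minlength=m)        # candidate pairs
            match += np.bincount(a[a == b].ravel(), minlength=m)  # same-color pairs
        g[:, ki] = match / np.maximum(total, 1)
    return g
```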
[23] also augmented the simple L 1 distance by considering a "relative" L 1 distance where the degree of difference in a bin is inversely proportional to the average size of the bin counts (to account for Weber's Law 1): d(x, t) = Z j','k(X) -i,,k ( ij.k gci,cj,k (t) + 9ci,cj,k (X) + . 1 (2.3) [23] demonstrated reasonable retrieval results on experiments where a scene was imaged under different photometric conditions and underwent small transformations such as translation and rotation. Since color is fairly stable under different lightning conditions and translation and rotation, correlograms work well for retrieving images of the same scene. However for general queries of different scenes which are similar (e.g., images of cars of different colors), color alone is not a discriminative enough measurement as previously discussed. 2.2 2.2.1 Color, Shape, Texture QBIC QBIC (Query By Image Content) [19] is a highly integrated image retrieval system that incorporates many types of features/measurements. Queries may be specified with an example image or by a sketch. Text annotations can be used as well. QBIC can also process video sequences by breaking them into representative frames and using the still image tools to process each frame. Users define a query using average color, color histograms, shape, and texture (although these can only be selected from a predefined set of sample textures). Users are also required to weight the relative importance of these features. Shape is described by first doing a foreground/background analysis to extract objects. Then edge boundaries of the object are found. Comparing shapes is fairly expensive even though QBIC uses dynamic programming [12]. In addition, histogram similarity is measured using a quadratic distance function d(x, y)= (X y)TA(x - y) (2.4) where the matrix A specifies some notion of similarity between pairs of colors. A few prefiltering schemes are used to attempt to speed up retrieval since computing a quadratic distance can be slow. Although QBIC offers a slew of features, they are not much more discriminative than those used by color indexing approaches. Users must also be intimately acquainted with the system to properly weight the relative importance of features. 2.2.2 CANDID CANDID (Comparison Algorithm for Navigating Digital Image Databases) [25] attempts to probabilistically model the distribution of color, shape, and texture in an image. At each pixel in an image, localized color, shape, texture features are computed. A mixture of gaussians probability density is used to describe the spatial distribution of these features in the image. A mixture density 'Results from psychophysics experiments on sensory discrimination show that for a wide range of values, the ratio of the "just noticeable difference" to the stimulus intensity is constant [27]. 26 is M p(x) = Zp(xj)P(j) j=1 (2.5) where P(j) is the prior probability of x being generated from gaussian j. In particular P(j) must P(j) = 1. The parameters for the mixture model are estimated using the K-means satisfy E, clustering algorithm [8]. The similarity of two images I, and 12 is measured using a normalized inner product similarity function nsi(R = f1 P2) [fP P11 (x)PI2 (x)dx (x) dx fR PI2(xd]/ (2.6)1 which is the cosine of the angle between the two distribution functions. The mixture of gaussians model avoids the need to arbitrarily designate discrete bins when using histograms. It does require choosing in advance the number of components or clusters M. 
An added advantage of modeling the distribution is that it is possible to visualize the relative contribution of individual pixels to the overall similarity score. [25] achieved good results on restricted databases of satellite and pulmonary CT scan images, with 100 and 200 total images respectively. The primary disadvantage of CANDID is that both the density estimation and querying phases are relatively slow and will not scale well to larger databases.

2.2.3 BlobWorld

In the BlobWorld [6] system, images are represented as a set of "blobs", or 2D ellipses of arbitrary eccentricity, orientation, and size. Blobs are constrained to be approximately homogeneous in color and texture. This representation is designed to reduce the dimensionality of the input image while retaining discriminatory information (i.e., the assumption is that homogeneous blobs are adequate for discrimination). By using blobs of coherent color instead of global color histograms, some local spatial information is preserved. In this respect, blobs are similar to correlograms with small distances. Texture is measured by the moment matrix of the horizontal and vertical image gradients, and a "polarity" term which measures the extent to which the gradients agree. To cluster points along the color and texture dimensions, the expectation-maximization (EM) algorithm [8] is used to fit a mixture of gaussians model. The algorithm iteratively tries to find the parameters for each gaussian such that the log likelihood of the data under the parameters is maximized. The number of clusters ranges from two to five, and is selected as the minimum number of clusters which fit the data adequately (this means using one fewer cluster if the log likelihood does not drop too much). In effect, this clustering determines a discrete set of prototype colors and textures. After clusters are found, a majority voting and connected components algorithm is used to group pixels into blobs. EM is then used again to find the two dominant colors and mean texture within a blob. The spatial centroid and scatter of the blob are also computed. To formulate a query, the user first submits an image. The system returns the blob representation of the image. The user then chooses some blobs to match and weights their relative importance. Each blob is ranked using a diagonal Mahalanobis distance, and the total score is combined using fuzzy logic operations on the blob matches. [6] shows experiments in which BlobWorld outperforms the simple color histogram approach of [46].

2.2.4 Photobook

The philosophy of Photobook [34] is to index images in a way that preserves enough information to reconstruct the original image. To achieve this, the system is divided into three separate subsystems. The first subsystem attempts to capture the overall "appearance" of the image. This is done using principal components analysis [48]. This method describes an image by its deviation from an average image along the principal axes of variation. The second subsystem captures information about an image's 2D shape. Color differencing is used to extract the foreground object. An interconnected shape model is then built for the object. A canonical shape is determined from the stiffness matrix using finite element methods [34]. The eigenvectors of this matrix encode the deformations from the canonical shape. The texture subsystem encodes images with a Wold decomposition [34]. This roughly corresponds to the periodicity, directionality, and randomness of the texture.
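Photobook's appearance subsystem is essentially principal components analysis on registered images. A minimal "eigenimage" sketch is shown below, computed via SVD on mean-centered image vectors; the image size, count, and the random stand-in data are assumptions for illustration.

```python
import numpy as np

def eigenimages(images, num_components=10):
    """images: (num_images, h*w) flattened, aligned images.
    Returns the mean image and the top principal axes of variation."""
    mean = images.mean(axis=0)
    centered = images - mean
    # SVD avoids forming the (h*w x h*w) covariance matrix explicitly.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return mean, vt[:num_components]

def appearance_code(image, mean, axes):
    """Describe an image by its deviation from the average image
    along the principal axes (a Photobook-style appearance code)."""
    return axes @ (image - mean)

faces = np.random.rand(100, 48 * 48)       # toy stand-in for aligned faces
mean, axes = eigenimages(faces)
code = appearance_code(faces[0], mean, axes)
```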
Although Photobook has been shown to perform well, the tests were on limited types of images such as faces of white males over 40, mechanical tools, and cloth samples for curtains. The primary disadvantage of Photobook is that an image must first be assigned strictly to one of the subsystems. Queries are limited to within a subsystem. This essentially turns the image retrieval problem into a classification problem where the three umbrella classes of appearance, 2D shape, and texture are defined a priori. This pigeonholes many types of images, such as those of animals, which have both characteristic shapes and textures. Photobook resorts to complicated detection algorithms which try to identify particular types of images such as faces in order to accomplish this classification. Sophisticated techniques for aligning and scaling objects to establish correspondences between pixel locations and image features are also critical for principal components analysis. The representation used by Photobook actually preserves enough information to reconstruct the image. However, much of this information may not be relevant for a query.

2.2.5 JACOB

JACOB (Just A COntent-Based query system for video databases) [10] is a two step retrieval system using color histograms and normalized axial moments to describe shape. Queries are specified by example. The first step uses a weighted $L_1$ distance on the histograms to filter the database and prune very "dissimilar" images. The second step compares shape using a normalized shape correlation function [10]. An edge density measure, simply the proportion of intensity gradients above a certain threshold, was proposed as a texture feature. Much of the research with JACOB involved identifying representative image frames in video sequences using motion information and a neural network.

2.2.6 CONIVAS

CONIVAS (CONtent-based Image and Video Access System) [1] lets users specify queries using an example or a sketch. Local color and edges are used as features in an information theoretic similarity function [1].

2.3 Wavelets

Wavelets represent a computer graphics approach to image retrieval. Wavelet representations are an effective way to compress images [44]. [24] used the Haar wavelet transform to index images. Treating an image as a function $x(m, n)$ that specifies the pixel value at location $(m, n)$, the transform decomposes the image into a set of basis coefficients $c_{j,k}$, each of which represents the contribution of a basis function $\psi_{j,k}(\cdot)$. Each basis function is a dilation and translation of a single "mother" wavelet. This makes the transform particularly simple and efficient to compute. The Haar mother wavelet is defined as:

$$\psi(x) = \begin{cases} 1 & \text{if } 0 \le x < 1/2 \\ -1 & \text{if } 1/2 \le x < 1 \\ 0 & \text{otherwise.} \end{cases}$$

The Haar basis consists of the functions $\psi_{j,k}(x) = 2^{j/2}\, \psi(2^j x - k)$ for $j, k = \ldots, -2, -1, 0, 1, 2, \ldots$. These are the translated and dilated versions of the mother wavelet. Note that the Haar transform essentially computes the intensity difference between the two halves of the image. The different basis functions compute these differences at different locations and scales.

[24] truncated small coefficients to zero and used a weighted $L_1$ distance on the vectors of coefficients:

$$d(x, y) = \sum_{j,k} w_{j,k} \left| c^x_{j,k} - c^y_{j,k} \right| \qquad (2.7)$$

The weights $w_{j,k}$ were determined by optimizing the distances with respect to a preselected set of training images. The primary motivation for using wavelets as an image representation is that the largest coefficients can be used to compress the image.
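A minimal sketch of one-dimensional Haar decomposition follows (the 2D transform applies it along rows and then columns). The orthonormal $1/\sqrt{2}$ normalization and the top-coefficient truncation are standard conventions assumed here, which may differ in detail from [24].

```python
import numpy as np

def haar_1d(signal):
    """Full 1D Haar transform of a length-2^n array: repeatedly split
    into pairwise averages (coarse part) and differences (details)."""
    out = np.asarray(signal, dtype=float).copy()
    n = len(out)
    while n > 1:
        half = n // 2
        avg = (out[:n:2] + out[1:n:2]) / np.sqrt(2)   # coarse part
        diff = (out[:n:2] - out[1:n:2]) / np.sqrt(2)  # detail coefficients
        out[:half], out[half:n] = avg, diff
        n = half
    return out

coeffs = haar_1d([9, 7, 3, 5])
# Keep only the largest-magnitude coefficients, as in [24]:
threshold = np.partition(np.abs(coeffs), -2)[-2]
truncated = np.where(np.abs(coeffs) >= threshold, coeffs, 0.0)
```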
Thus the intuition is that we can measure the similarity between images by calculating the difference between their largest coefficients. The Haar basis approximates an "edge" detector by assigning large coefficients to high frequency areas of the image. This is done across all scales of the image. The decomposition is also fairly efficient, computable in O(N) time where N is the number of pixels in the image. Although Haar wavelets are well suited for storing and reproducing images at multiple resolutions, retrieving images does not explicitly require generating them. Also, the link between optimal compression and optimal discrimination is tenuous. One of the major problems of the Haar transform is that it is not invariant to translation. Thus shifting an image will make it very "different" from the unshifted image. Besides that, edges may not be discriminating enough for image retrieval.

2.4 Templates

2.4.1 VisualSEEk

VisualSEEk [43] compares images using spatial arrangements of colored regions. A query is broken into a fast phase using only color features and a slow phase which compares spatial relationships between colored regions. Instead of color histograms, VisualSEEk uses "color sets", which are bit vectors indexing a color space. A color set can be computed from a histogram by thresholding the counts in each bin to {0, 1}. The use of color sets is predicated on the assumption that only a few colors dominate an image region. It also speeds up distance calculations. Regions for a color set are found by labeling the pixels of the image according to their color bins. This labeled image is filtered, and then minimum bounding rectangles are designated as regions. The first phase of querying acts primarily as a filter which prunes away images that do not satisfy the color criterion. It uses a quadratic distance function similar to that of QBIC. Queries may be formed using the area, spatial extent, or absolute location of a region based on its centroid and minimum bounding rectangle. Matching spatial relationships in the second phase is made tractable by using a "2D string" representation. An example is "(region 1 < region 2, region 3 > region 4; region 1 > region 3)", where "</>" may mean "left/right of" in the first part of the string and "above/below" in the second part. In VisualSEEk, the user is required to sketch the regions and the spatial relationships between regions. For some queries in a 3,100 image database, using color sets alone performed almost as well as color histograms. Adding spatial relationship constraints improved queries for images such as sunsets, where well defined colored regions exist. However, tests with a synthetic database of images with uniform color regions found that the method of finding colored regions and extracting spatial relationships performed significantly worse than using manually entered ground truth regions and relationships.

2.4.2 Flexible Templates

The absolute values of the intensities in an image vary greatly under different illumination conditions. This makes simple template matching ineffective for recognition. However, the intensity ratios between different regions in an image are more stable. This idea of a ratio template has been used to define image invariants for object recognition [42]. It has also been applied to natural scene classification, where the templates capture the qualitative spatial and color relationships between image regions [29].

Figure 2-2: A hand-crafted template for images of waterfall scenes. Here the waterfall concept is defined to be a white region in between two green regions with a blue region on top.
The reasoning behind flexible templates is that global, low frequency qualitative measurements are adequate for describing many image concepts. For example, a waterfall template can be defined as a white region in between two green regions with a blue region on top, as shown in Figure 2-2. Specifically, an image is first divided into regions of coherent color. Then the differential color and spatial relationships between these regions are extracted as measurements for classification. [29] hand-crafted flexible templates for natural scene classification (e.g., mountains, fields, waterfalls) and found that simple relationships were sufficient for effective results. This approach can be used with other measurements of image regions besides color, such as dominant edge orientation. Interestingly, we can also get image intensity ratios by taking the difference between log intensities: $\log x - \log y = \log(x/y)$.

2.4.3 Multiple Instance Learning and Diverse Density

To use flexible templates for image retrieval, templates must be learned online, because it is impractical to predict what types of image concepts will be queried. [38] used a multiple instance learning framework to automatically construct templates for natural scene classification. They observed that an image can be considered a "bag" of instances. In particular, rows, blobs, and blobs with neighbors (all consisting of pixels) were used as the possible types of instances. Images relevant to a concept are labeled positive, because somewhere in the image is an instance of the target concept (e.g., a blob of white in between two blobs of green for a waterfall). Bags are labeled negative if all their instances are irrelevant to the concept. The goal is to find the instances consistent across the positive bags and different from the negative bags. To do this, the learning algorithm searches for the point in instance space with the highest "diverse density". This point is close to at least one instance in every positive bag and far from every negative instance. A test image is relevant if one of its instances is close to this diverse density point. More formally, the maximum diverse density point $t$ is found as:

$$\arg\max_t \prod_i \Pr(t \mid B_i^+) \prod_i \Pr(t \mid B_i^-) \qquad (2.8)$$

where $B_i^+$ is a positive bag and $B_i^-$ is a negative bag. This is equivalent to maximizing the likelihood $\Pr(B_1^+, \ldots, B_n^+, B_1^-, \ldots, B_m^- \mid t)$ under the assumption of a uniform prior on the maximum diverse density point and conditional independence of the bags given the point. In practice, the conditional probabilities are modeled as gaussians, and $t$ is found by using gradient ascent from multiple starting points. [38] found that their results improved if there was feedback from the user indicating further positive and negative examples. They also found that more complex instances improved the number of relevant images retrieved, but took a very long time to learn. A key observation made by [38] is that more discriminating measurements make concept learning simpler.
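The search in Equation 2.8 can be sketched as follows. The gaussian "noisy-or" form for $\Pr(t \mid B)$ follows the spirit of [38], but the exact probability model, the scale parameter, and the omission of the gradient ascent refinement are simplifications assumed here.

```python
import numpy as np

def log_diverse_density(t, pos_bags, neg_bags, scale=1.0):
    """Log of Eq. 2.8 under a gaussian noisy-or style model:
    Pr(t | B+) is high if some instance in B+ is near t;
    Pr(t | B-) is high if every instance in B- is far from t."""
    def near(bag):  # Pr(at least one instance of the bag is close to t)
        d2 = np.sum((bag - t) ** 2, axis=1) / scale ** 2
        return 1 - np.prod(1 - np.exp(-d2))
    ll = sum(np.log(near(b) + 1e-12) for b in pos_bags)
    ll += sum(np.log(1 - near(b) + 1e-12) for b in neg_bags)
    return ll

def max_diverse_density(pos_bags, neg_bags):
    """Evaluate every positive instance as a starting candidate; [38]
    would additionally refine the best one with gradient ascent."""
    starts = np.vstack(pos_bags)
    scores = [log_diverse_density(t, pos_bags, neg_bags) for t in starts]
    return starts[int(np.argmax(scores))]
```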
Figure 2-3: An E-R diagram for an image of a sailboat.

2.5 Database Management Systems

2.5.1 Chabot

Chabot [33] is primarily a database management system that incorporates color histograms. All the other features are manually entered keywords such as name, place, and short descriptions. Its goal was to integrate textual meta data with visual content. By using a powerful relational database, the system enables complex queries using a rich query language. However, the database only returns exact matches and does not rank images according to similarity. Since the only type of direct visual content is color, the system is often effective only when used in conjunction with keywords. In fact, precision was as low as 5.8%, indicating very low discrimination when using color alone. Using keywords exclusively often made the system too stringent, with recalls as low as 18.1%. In practice, it was found that users were required to have experience with the database and knowledge of how to formulate effective queries.

2.5.2 SCORE

SCORE (System for COntent-based approximate image REtrieval) is also primarily a database management system. In fact, it does not use any direct visual properties. The visual content of an image is described with manually entered attributes in the form of an entity-relationship (E-R) diagram, an example of which is shown in Figure 2-3. Although the E-R diagram appears descriptive, note that it has to be manually entered. Also, the diagrams are often inconsistent, in that some users will describe the image in Figure 2-3 as a "ship on the sea". Queries must also contain the exact words used in the E-R diagram, such as "azure sky" instead of "blue sky". Some success in grouping adjectives which describe similar visual content has been achieved with the use of a thesaurus. As with Chabot, example-based queries are not possible. Instead, users are required to be familiar with SCORE's specific query language.

2.6 Summary of Previous Approaches

The primary limitation of the approaches described is that the visual measurements are not very discriminative. For example, in arbitrary real images, red can occur in scenes of sunsets, red sports cars, roses, clothing, etc. Color is not discriminative enough with respect to a large set of real images. So even complex relationships between colors will be inappropriate for many query concepts. Other low level features such as horizontal and vertical edges are also undiscriminative. In addition, they are also unselective, since almost all images will contain some horizontal and vertical edges. Many early approaches used a static similarity function. This makes it difficult to automatically tune the system for a particular query. More recent approaches apply some sort of query learning, but usually the learning is necessarily complicated, and therefore slow, because the measurements are not discriminating enough. Our approach computes highly selective measurements and leverages the strong discriminatory nature of these measurements to use simple learning with boosting.

system              features                     spatial         queries
[46]                color                        global          example
correlograms        color                        global          example
BlobWorld           color, shape, texture        local           example, weighting
flexible templates  color                        relationships   positives/negatives
QBIC                color, shape, texture        global          example, weighting
VisualSEEk          color                        relationships   sketch
CANDID              color, shape, texture        local           example
JACOB               color, shape                 global          example
CONIVAS             color, edges                 local           example, sketch
Photobook           appearance, shape, texture   global          example
wavelets            texture                      local           example, sketch
Chabot              color, keywords              global          database query
SCORE               keywords                     global          database query

Table 2.1: A comparison of different image retrieval systems.

2.7 Related Work

2.7.1 Information Retrieval

One way to represent an image is via a list of words.
For example, an image of the Eiffel Tower might be indexed with words such as 'Eiffel', 'Tower', and 'Paris'. Other possible words are 'tourism', 'tour' (French for 'tower'), and '1889' (the year it was built). In accordance with the familiar phrase, "a picture is worth more than a thousand words", many thousands of words can be associated with an image. Unfortunately, there is currently no method of extracting all the associated words given an image. Moving from the visual input of an image to a textual representation will necessarily result in a loss of visual information. We could label images with words in this way if we had a system to perform object recognition on an image. However, even with this approach the words would only give a list of the objects in the image; for an image of the Eiffel Tower, the words that such a system may provide could be "tower, sky, ground." Descriptions such as "metal, skeletal structure located in the middle of a plaza" would be hard to come by. Developing a rich word-based description of images would require extracting the relationships between objects in an image, describing the background, etc. An entire article on the Eiffel Tower would be needed for a single image. Evidence for the inadequacy of using only text to describe an image was shown in the database management systems approaches to image retrieval. Ultimately, text may neither be rich enough, nor are users consistent enough in choosing effective keywords.

Much of the success in text document information retrieval is due to the rich semantic content of certain words. For example, given the word "Eiffel" in a document, there is a very high probability that the document is relevant to the Eiffel Tower. An image search on the commercial AltaVista [2] system with the keyword "Eiffel" is shown in Figure 2-4. Although the query is successful, a further query for other images which are "visually similar" to the first image of the Eiffel Tower returns an unsatisfying result, as shown in Figure 1-7.

Figure 2-4: A query on the AltaVista system with the keyword "Eiffel" that illustrates the rich semantic content of particular words. Since an image would probably be labeled "Eiffel" only if it contained an image of the Eiffel Tower, text is effective for this query.

We may try to use text indexing for image retrieval, but often intuitive keywords for searching turn out to be unsuccessful. An example of a query using the keyword "jets" on the same system is shown in Figure 2-5. From this example we can see that the word "jets" is not descriptive enough to select images of air transport vehicles. Since images are digitally represented as collections of pixel intensities, we cannot say much about an image given the value of a single pixel. Like the word "jets", but in a more extreme way, whether or not a particular pixel is green will not tell us much about the visual content of the image. However, given a more selective image measurement, we can approximate the kind of semantic content present in words more closely.

Figure 2-5: A query on the AltaVista system with the keyword "jets". Although we wanted images of jet propulsion vehicles, the system retrieved images of the Jets football team.
So instead of trying to create a text-based description of an image, we will try to generate a representation based directly on visual content.

2.7.2 Object Recognition

A related area of computer vision research is object recognition. Indeed, this is still the subject of much work today. The two primary approaches to this problem are: (1) building 3D models for recognition [30], and (2) recognizing objects directly using only 2D views [35]. Unfortunately, much of this research is only applicable when the image being considered contains a single object. Object recognition systems often assume segmentation as a preprocessing step, where each object in the image has been segmented into its own spatial region. Within a single real image, many objects may be found, and objects often occlude one another. Also, some objects are perhaps better considered as mass objects, such as sand, grass, and sky. They defy a simple 3D model and may be more appropriately labeled as texture. In addition, heavily deformable objects such as clothing can be extremely difficult to model. Even with a perfect object recognition system, one still needs to develop a good representation for the relationships between different objects in an image in order to provide a complete scene description. Work on object detection, such as face detection, has met with much success, achieving higher than 90% accuracy rates on various test data [45, 40]. However, these systems are often quite complicated and finely tuned for detecting only one type of object such as faces. Building a detector for every type of object may be possible, but current approaches make it impractical to combine all of these detectors (we might need thousands of them) into an image retrieval system fast enough for online querying.

2.7.3 Segmentation

Another area of historical computer vision research that remains active today is image segmentation [30]. This can be considered the dual of edge detection, because the perfect edge detector should find all the boundaries between different regions, while the perfect segmentation algorithm should label all the different regions bounded by the edges. For example, in a world with only polyhedral objects, the segmentation algorithm should find all the regions which correspond to the faces of the polyhedra. In real images of scenes such as the one shown in Figure 1-5, it is often difficult to precisely formulate the segmentation problem. Should each suspension bridge cable be segmented into a different region, or should they all be considered part of a single spatially disconnected region? Should each leaf of the bushes be its own region, or should the entire bushy area be grouped into one region? Often the "right" segmentation is not well defined and depends on the subsequent use of the segmentation. If the goal is to find a particular leaf, then each leaf should be segmented into its own region.
However, if the goal is to identify bushy areas, then all the leaves should be combined into a single region. Segmentation may aid an image retrieval system by breaking down the overall similarity into a sum of the similarities between image regions, but the desired segmentation is ill defined and comparing multiple regions may still be a difficult task.

2.7.4 Line Drawing Interpretation

There is a large body of research in computer vision on problems that are closely related to the abstract problem of image understanding. Much of the early work in computer vision concentrated on finding edges within images, with the assumption that edges provided the most useful information for explaining an image. Indeed, this is true for line drawings, and success was achieved in interpreting scenes of polyhedral objects. With information about the intensities of the regions between lines, it is possible to extract 3D models of polyhedral objects [30]. This is mainly due to the simplifications introduced by considering only polyhedral (as opposed to curved) objects with faces (regions) of uniform brightness. The constraints of the polyhedral image world allow for a relatively complete and compact description of scenes, and are well suited to edge detection methods which depend on sharp intensity discontinuities at an edge. We have a good theory of polyhedral scenes and practical applications such as the Copy Demo (a system that analyzed images of toy blocks and was able to control a manipulator to move these blocks). However, it is difficult to apply these methods to image retrieval, because real images are much more complicated than line drawings of polyhedral objects. Most objects in the world are not strictly polyhedral, and although we may be able to interpret a line drawing to some degree, producing the line drawing from an actual image remains unsolved. In real images, regions seldom have constant brightness (i.e., they are often highly textured), and discontinuities between regions are often smoothed out by noise. Successes in scene analysis have been limited to image domains such as line drawings and images of artificial objects such as toy blocks [18], making this research not directly applicable to the image retrieval problem.

Chapter 3

Selective Measurements

3.1 Motivation

Chapter 2 reviewed various approaches to image retrieval. The primary measurements used to represent images often centered on color (e.g., histograms, correlograms, spatial relationships between regions) [46, 23, 29, 38, 19, 43]. Some methods also extracted measurements based on textures and shapes [19, 25, 10, 1, 34, 24]. These measurements were of limited sophistication and often heavily constrained to particular types of images such as faces. Various methods of computing the similarity between images were used, from simple Euclidean distances to dot products of probability distributions. Different methods were also used to learn what measurements were important to a particular query, from hand-weighting to multiple instance learning. And although these approaches show promise in many areas, image retrieval technology is still not at the level demanded by users. This situation calls for an approach that does not automatically assume that color is the best measurement to make. It is no surprise that different approaches using similar measurements have hit an impasse.
One conclusion is that no matter how sophisticated the classifier, retrieval effectiveness is inherently limited by the discriminative power of the measurements. Thus, it may be fruitful to search for better measurements. With more powerful measurements, perhaps even a very simple classifier might do the job. Our approach builds on the ideas in [16].

3.1.1 Preprocessing vs. Postprocessing

Before an image is presented to a classifier, it is often preprocessed into a more suitable representation. This is commonly done to reduce the input to a more tractable dimensionality¹. Preprocessing often takes the form of feature/measurement extraction. This is usually guided by domain knowledge specific to the problem. For example, it might be useful to detect edges in an image. Usually, the more preprocessing we do, the less complex we have to make the classifier. If the measurements correspond exactly to the class labels, the classifier is trivial. On the other hand, if the classifier must work with the raw input, it might need more complex computations to achieve a similar level of performance. This issue manifests itself in image retrieval as how much image indexing should be done offline versus at query time. Since many image processing operations are computationally expensive even for today's computers, it is impractical to defer all measurements until a query is made. This would force the same measurements to be recomputed for every image in the database each time a query is made. Thus the image database is really a database of image re-presentations, often in a very different form than the raw "array-of-pixels" representation. Besides deciding how much preprocessing to do, it is more important to decide what type of preprocessing should be done. Many current image retrieval systems do not stress the importance of having the right measurements in the first place. For example, systems which only use color histograms automatically limit retrieval effectiveness, because color alone cannot distinguish between, say, an image of a red sports car and a red rose. In the limit, we could design measurements to detect particular visual concepts such as cars or flowers. However, this would render our system capable of handling only those two concepts. So although hard, specific decisions at the outset may make image retrieval more efficient, they reduce flexibility.

¹A 512x512 pixel image has 262,144 dimensions, one for each pixel.

Figure 3-1: The image on the right is a permutation of the pixels of the image of the car (left). Using too general an image representation allows for these kinds of unrealistic images.

3.1.2 Image Representations

A completely generic representation of images is of little use. A very general representation is to choose each pixel as a dimension in Euclidean $R^d$ space, where $d$ is the number of pixels². The problem with this representation is that it is overly general, because it can model arbitrary images, not just real ones. For example, we can take an image of a car and permute the pixels as shown in Figure 3-1. The result is a valid point in $R^d$ space but bears no resemblance to any image one would normally see. This representation is also completely local, since it considers only pointwise measurements (where we take each pixel or dimension as a measurement). We stated that a highly specific representation could make machine learning easy, but would limit the system to only the concepts implied in the measurements.
A general representation is highly underconstrained and makes learning more difficult. Using a single measurement for each concept severely limits the total number of distinguishable concepts. With $d$ measurements, only $d$ concepts are represented⁴. The pattern for a particular concept is represented as completely local activity in one and only one measurement. Since the classifier only has to check which measurement is active to identify the concept, the measurements must take into account the entire image (i.e., the support for the measurement is global). If the measurement is only influenced by part of an image (i.e., local support), then it cannot guarantee that the visual content in the other parts does not affect the image concept. In addition, the concepts must be known a priori in order to construct detectors for them. This information is not available in image retrieval. Too much a priori modeling of the measurements drastically limits the capacity of the representation to capture new objects. In fact, an image can only belong to one concept; hierarchical concepts are not possible. These measurements are too specific and overconstrained.

At the other extreme is a completely distributed representation. This maximizes the capacity of the measurement space, since concepts are defined by activity across all the measurements. With a completely distributed binary code, $2^d$ concepts are possible. The measurements can have completely local support, because the classifier will take into account information from all the measurements. The problem with this type of representation is that it can make learning a discriminant difficult, because the classifier must always consider all of the measurements. Some measurements may be correlated while others may not be discriminating. The crosstalk between patterns is larger, since all measurements are likely to be active.

²In fact, it is maximally general. Each pixel dimension can be adjusted independently to yield all possible images.
⁴This is the proverbial "grandmother cell" which fires only when you see your grandmother [22].
A sparse representation closely approximates a parts-based model that can account for many objects but not all possible things since we want to restrict ourselves to real images. As an analogy, consider a text document. Individual characters are very local measurements, while the entire string of characters is a completely global measurement. It turns out that an intermediate level representation such as words is more useful for retrieval. Many possible words (tens of thousands) can occur in a document, but a particular document will only contain a few (hundred) unique words. Our goal is to capture the visual events (words) in an image (document). In image retrieval, the classifier must be trained at query time because the target class remains unknown until a query is defined. Since human users will only tolerate a certain amount of response delay, training must be fast. Thus we can choose which measurements to use not only by their discriminative power but also by how well they facilitate fast learning. We can also try to use classifiers which will learn quickly for many types of measurements. In other words, we will use domain knowledge to guide both the measurements we use and our choice of classifier. A sparse representation trades off the advantages and disadvantages of strictly local-specific and global-general representations. The image retrieval problem may not permit a small set of dense measurements to represent the many possible image concepts. A high dimensional representation will increase the capacity but might make learning too difficult because only a few example images are available. Our strategy will be to start off will a large number of possible measurements and then quickly pick a few which are discriminating for the desired concept. 3.2 Design The goal of selective measurements is to signify interesting visual content in an image. Clearly a single pixel does not convey much interesting information about the image (i.e., many images will contain a green pixel). We can consider image patches of pixels (e.g., 9x9 neighborhood) and look at their principal components [22]. These are the linear projection axes that maximize the variance over all image patches and for a gaussian distribution also preserve the most information about the patches [28]. These measurements are local (with respect to the entire image) and linear, and so do not explicitly capture the global structure of the image. We can capture more global structure by computing the principal components of the rectified responses to the small scale principal components. To capture more large scale structure, the responses can be subsampled before being processed. This operation can be repeated until we have captured the global structure of the image. So instead of just finding horizontal and vertical edges, we measure particular arrangements of these edges so that we have good detectors for gratings and other more complex and interesting visual content. We have observed that the principal components of the first level responses are qualitatively similar to the first level principal components (i.e., horizontal arrangements of horizontal edges are common since these signify horizontal lines in an image). This was done by obtaining 1000 random 38 Figure 3-2: On the left are the principal components of image patches. On the right are the principal components of the responses to the fourth principal component on the left (a bar-like filter). 9x9 patches from a set of natural images. 
The patches were packed into 81 dimensional vectors. The principal components were found by taking the eigenvectors corresponding to the largest eigenvalues of the covariance matrix of the vectors. For vectors $x_1, \ldots, x_N$, the sample covariance matrix is:

$$\Sigma = \frac{1}{N} \sum_{n=1}^{N} (x_n - \bar{x})(x_n - \bar{x})^T \qquad (3.1)$$

where $\bar{x}$ is the sample mean. The eigenvectors $u$ are the vectors that satisfy:

$$\Sigma u = \lambda u \qquad (3.2)$$

where $\lambda$ is the corresponding eigenvalue. A set of response images was generated by filtering (convolving) the same set of images with the fourth principal component. The principal components of these response images were then found as before. Figure 3-2 shows the similarities between the first level principal components and the principal components of the first level responses. Similar results were obtained when filtering with other principal components. Since the filters at different levels are similar, the entire computation is simplified, because the same filters and computations can be repeated.

3.2.1 Filters

The filters that we use are local linear averaging and difference filters, as shown in Figure 3-3. We chose the microstructure filters described in [36], which have been shown to be useful for texture discrimination. They are generated by all possible combinations of three basis filters: [1,2,1]/4, [-1,0,1]/2, and [1,-2,1]/4. These are qualitatively similar to the principal components that we and [4] measured in natural images. Intuitively, this amounts to finding edges and bars in an image. The bases are normalized so that the total energy passed remains unchanged. The coefficients of the edge and bar bases sum to zero, so that constant regions do not produce a response. These filters are separable, so that the computations can be done more efficiently. A 2D function $x(i, j)$ is separable if it can be written as $x(i, j) = y(i)z(j)$ [27]. Separable filters reduce a full convolution to separate rowwise and columnwise convolutions. For a full convolution, $O((N + M - 1)^2 M^2)$ arithmetic operations are needed, where $N$ is the size of the image and $M$ is the size of the filter. With separable filters, only $O(2M(N + M - 1)^2)$ arithmetic operations are needed. This reduces the number of computations by about a factor of $M/2$. To eliminate spurious responses at image boundaries, only the valid part of the convolution was kept (i.e., the response image size is $N - M + 1$).

Figure 3-3: The 9 primitive filters used in computing the measurement maps. In practice, these can be efficiently computed with separable horizontal and vertical convolutions.
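As a quick check of the separability claim, the sketch below builds one of the nine 3x3 filters as an outer product of two of the 1D bases and verifies that two 1D 'valid' convolutions reproduce the full 2D convolution. SciPy is used only for the comparison; the random test image is an assumption for the example.

```python
import numpy as np
from scipy.signal import convolve2d

# The three 1D basis filters (average, edge, bar).
L = np.array([1, 2, 1]) / 4.0
E = np.array([-1, 0, 1]) / 2.0
B = np.array([1, -2, 1]) / 4.0

def separable_filter(img, row_basis, col_basis):
    """Apply the 2D filter outer(col_basis, row_basis) as two 1D
    'valid' convolutions: first along rows, then along columns."""
    rows = np.apply_along_axis(np.convolve, 1, img, row_basis, mode='valid')
    return np.apply_along_axis(np.convolve, 0, rows, col_basis, mode='valid')

img = np.random.rand(32, 32)
fast = separable_filter(img, E, L)                    # edge along rows, average along columns
full = convolve2d(img, np.outer(L, E), mode='valid')  # same result, 2D kernel
assert np.allclose(fast, full)
```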
3.2.2 Filtering Tree

An image is processed by a filtering tree, which consists of convolving the image with each of $M$ filters. The $M$ output images $x_1, \ldots, x_M$ are then rectified by taking the absolute value. Each is then subsampled by a factor of two, since we want to capture larger scale structure on the next round of filtering. The same filtering process is then applied to each response image. For the resolution of our images, four levels of filtering were possible. Using 9 local filters over the red, green, and blue channels of an image gave a total of $3 \cdot 9^4 = 19{,}683$ response images. A sum was then taken over the pixel values of each final response image, for a total of 19,683 measurements. The final summation is a measure of the total response energy of the measurement. Usually, energy is defined as the sum of the squares of the measurements, but since our measurements are strictly positive, we have implicitly used a sum of the absolute values of the measurements to define energy. The steps of the algorithm in summary are listed below:

1. convolve
2. rectify
3. subsample by 2
4. summation (after four levels of steps 1-3)

Formally, the filter tree computes each measurement as:

$$g_{i,j,k,l}(x) = \sum_{\text{pixels}} x_{i,j,k,l} \qquad (3.3)$$

where

$$x_{i,j,k,l} = \downarrow_2\left(\left| f_l * x_{i,j,k} \right|\right) \qquad (3.4)$$

$$x_{i,j,k} = \downarrow_2\left(\left| f_k * x_{i,j} \right|\right) \qquad (3.5)$$

$$x_{i,j} = \downarrow_2\left(\left| f_j * x_i \right|\right) \qquad (3.6)$$

$$x_i = \downarrow_2\left(\left| f_i * x \right|\right) \qquad (3.7)$$

$x$ is the image; $i, j, k, l$ are indices over the repeated linear filters; $|\cdot|$ is the absolute value function; $\downarrow_2$ is subsampling by a factor of two; and $x_i, x_{i,j}, x_{i,j,k}, x_{i,j,k,l}$ are the response images. Figure 3-4 is a schematic of the filtering tree.

Figure 3-4: A schematic of the filtering tree where a set of filters is repeatedly applied to an image to capture more global structure.

For many of the features, we found that the response histogram within an image was surprisingly kurtotic, with kurtoses much larger than 3 (gaussian). The sample kurtosis is defined as:

$$\kappa = \frac{1}{N} \sum_{n=1}^{N} \left( \frac{x_n - \bar{x}}{\sigma} \right)^4 \qquad (3.8)$$

where $N$ is the total number of samples, $\bar{x}$ is the sample mean, and $\sigma$ is the sample standard deviation [37]. A gaussian distribution is often used as a baseline measure of kurtosis; its kurtosis ($\kappa = 3$) is designated as 'low' since its tails fall off rapidly⁵. The kurtosis of a distribution can be compared to that of a gaussian with the same standard deviation. Intuitively, kurtosis measures the 'peakedness' of a distribution. Thus a measurement with high kurtosis is sparse within the image. This is a different type of sparsity than the sparse representation mentioned earlier, because it is within an image instead of between images. We believe it is also important because it signals an interesting visual event in the image. A measurement of a green pixel in an image would not be sparse, since many pixels might be green. This motivates taking the sum of the pixels as the final measurement of the image, because most of the pixels have negligible responses. Summing also provides desired translation invariance, since we assume the location of the response to be unimportant. Figure 3-5 shows the sequence of response images generated by a particular feature on the image of a tiger. The filtering sequence has responded strongly to the body of the tiger where the stripes are prominent.

Figure 3-5: Response of an image of a tiger to a particular filtering sequence. Note that this feature has detected a strong peak corresponding to the arrangement of the stripes on the body of the tiger.

3.2.3 Color Space

The red, green, and blue channels of RGB color space are non-negligibly correlated [27]. Overall intensity variations will affect all of the channels. To reduce this correlation, we decompose color images into the opponent color axes: $(R+G+B)/3$, $R-G$, and $2B-R-G$. The first component is the intensity. Since the other two components are differences between RGB channels, they are less sensitive to overall changes in intensity.

3.2.4 Normalization

As further preprocessing, we normalize our measurements by dividing each measurement for an image by the sum of all the measurements for the image. This normalizes the total measurement energy in an image to unity. In practice, we have found that this type of normalization helps account for large variations in the overall contrast in images.

⁵The tails of a gaussian fall off as $e^{-x^2}$.
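To make Sections 3.2.1 and 3.2.2 concrete, here is a minimal single-channel sketch of the convolve-rectify-subsample recursion of Equations 3.3-3.7. It enumerates all filter paths naively for clarity (a real implementation would share the common prefixes of the tree) and omits the color decomposition and normalization steps.

```python
import numpy as np
from scipy.signal import convolve2d
from itertools import product

# Nine separable 3x3 filters from the three 1D bases.
bases = [np.array([1, 2, 1]) / 4.0,
         np.array([-1, 0, 1]) / 2.0,
         np.array([1, -2, 1]) / 4.0]
filters = [np.outer(a, b) for a in bases for b in bases]

def level(img, f):
    """One tree level: convolve (valid part only), rectify, subsample by 2."""
    return np.abs(convolve2d(img, f, mode='valid'))[::2, ::2]

def filter_tree_measurements(channel, depth=4):
    """Return the 9^depth measurements g_{i,j,k,l} for one color channel."""
    feats = []
    for path in product(range(9), repeat=depth):
        x = channel
        for idx in path:
            x = level(x, filters[idx])
        feats.append(x.sum())          # total response energy (Eq. 3.3)
    return np.array(feats)

img = np.random.rand(72, 108)          # one opponent-color channel
g = filter_tree_measurements(img)      # 9**4 = 6561 measurements per channel
```

Applied to all three opponent-color channels, this yields the 3 * 9^4 = 19,683 measurements described above.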
3.3 Measurement Selection

3.3.1 Selectivity

We now possess a method for extracting selective measurements from an image which signify interesting visual events in the image, such as horizontal arrangements of vertical lines as in the tiger image. Our primary goal is to find measurements which will be useful in discriminating different image concepts. If we had large sets of labeled data, then we could use many conventional machine learning feature selection techniques to find discriminating projections. Since we do not have labels before a query is made online, we need a new criterion for choosing which measurements to keep. The task is to eliminate measurements which are not discriminating. To this end, we choose measurements which have sparse distributions over all images. Intuitively, although a measurement might be interesting in an image, it might occur in many images, and thus the response histogram over all images will not be sparse. So we can once again use kurtosis as a guide in choosing which measurements to keep. Measurements for which most images have no response, but a few images have large responses, will have high kurtosis. These measurements will allow a learning algorithm to separate most images from the few images that respond strongly. A highly kurtotic distribution also has low entropy, since most of the probability is tightly concentrated around the peak. One can show that for a fixed variance, the gaussian distribution has the highest entropy [14]. Thus we want measurements with non-gaussian distributions over all images. Low entropy codes and non-gaussian distributions have been proposed as interesting projections and useful in factorial coding [5, 17]. Thus we will choose measurements with the largest kurtoses. Another measure of sparseness is the selectivity defined in [7]. Selectivity is defined as:

$$\text{selectivity} = 1 - \frac{\bar{x}}{\max(x)} \qquad (3.9)$$

Selectivity ranges from 0 to 1. It is high when the measurement does not respond appreciably to most inputs (resulting in a small average $\bar{x}$), but responds very strongly to a few particular inputs (increasing $\max(x)$). Empirically, we have found that choosing measurements with high selectivity results in similar performance to using highly kurtotic measurements.

Figure 3-6: Histograms for a highly selective (left) and an unselective measurement (right). The highly selective distribution had a selectivity of 0.9521 and a kurtosis of 132.0166. The unselective measurement had a selectivity of 0.4504 and a kurtosis of 2.3703.

Figure 3-6 shows the response histograms for a highly selective and an unselective measurement. Note that most of the values for the highly selective measurement are close to zero, with a few large ones at the tail. The unselective measurement has large values for most of the images. Empirically, using the 5000 most selective measurements gave good results.
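The offline selection step is a direct reading of Equations 3.8 and 3.9 over the whole database; in the sketch below, the column layout of the response matrix and the epsilon guards are assumptions for illustration.

```python
import numpy as np

def select_measurements(responses, keep=5000):
    """responses: (num_images, num_measurements) array of non-negative
    measurement values over the whole database. Returns the column
    indices of the `keep` most selective measurements (Eq. 3.9)."""
    mean = responses.mean(axis=0)
    peak = responses.max(axis=0)
    selectivity = 1.0 - mean / np.maximum(peak, 1e-12)

    # Sample kurtosis (Eq. 3.8), an alternative ranking criterion.
    std = responses.std(axis=0) + 1e-12
    kurtosis = np.mean(((responses - mean) / std) ** 4, axis=0)

    order = np.argsort(-selectivity)
    return order[:keep], selectivity, kurtosis
```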
Chapter 4

Learning Queries Online

The work by [38] shows how learning a query can lead to much better results than using a static similarity function. The disadvantage of supervised learning is that it can be slow and require many training examples. This is especially true when the measurements have little discriminatory power. On the other hand, using highly selective measurements can make it possible to learn with only a few training examples. It can also simplify and speed up training time.

4.1 Image Retrieval as a Classification Task

We can think of image retrieval in terms of classification by defining a positive (relevant) class $C_1$ and a negative (irrelevant) class $C_0$. The two classes are related so that $C_0 = \neg C_1$. For example, if $C_1$ is "images of waterfalls", then $C_0$ is "images without waterfalls". However, a priori, $C_1$ and $C_0$ are undefined; the concept implied in these classes is unknown until some examples are chosen. So although $C_1$ might be "images of waterfalls" for one query, it might be "images of mountains" for another. Thus classification into positive and negative classes is possible only after a query has been specified. All of the images $x^n$ in the positive class $C_1$ are assumed to have common visual properties, such as an image of a waterfall. However, the set of images in the negative class $C_0$ is usually heterogeneous. That is, $C_0$ may contain images of mountains, fields, sunsets, etc., as long as there are no waterfalls. So although all items in $C_0$ must be assigned to the same class, they will not necessarily have many features in common¹. This often makes the negative class more difficult to model directly. Figure 4-1 shows example images from a "waterfall" concept.

¹The positive class can be regarded as a simple hypothesis, while the negative class is a composite hypothesis.

Figure 4-1: Some typical images exhibiting the "waterfalls" concept. Note the diversity in the images, as some contain sky, some contain flora, while others are mostly rock, etc.

Recently, there has been success using simple classifiers such as hyperplanes in a high dimensional measurement space [49]. These measurements are a more complex representation of the input. Simple classifiers such as nearest neighbor [18] can delineate arbitrary decision boundaries even with naive measurements. However, both of these techniques often require a large amount of labeled training data, which is unavailable in image retrieval. Armed with selective measurements, we will use a classifier and learning method that works well with a small number of training examples. We also want training to be fast.

4.2 Aggregating Weak Learners

Recall that in machine learning, we are given a set of classified training examples and would like to induce a rule that is useful in classifying unseen test examples. Often the classifier is trained by adjusting some internal parameters based on the training data. For example, let a vector $x^n$ have a corresponding classification $t^n \in \{0, 1\}$. Define $C_0 = \{x^n \mid t^n = 0\}$ and $C_1 = \{x^n \mid t^n = 1\}$. Let the internal parameters be vectors $\mu_0$ and $\mu_1$ with the same dimensionality as $x$. The classifier is trained by setting

$$\mu_i = \frac{1}{|C_i|} \sum_{x^n \in C_i} x^n \quad \text{for } i = 0, 1 \qquad (4.1)$$

In other words, parameters $\mu_0$ and $\mu_1$ are set to the averages of the training examples in $C_0$ and $C_1$ respectively. A new example is classified as:

$$x \in \begin{cases} C_0 & \text{if } |x - \mu_0| < |x - \mu_1| \\ C_1 & \text{otherwise} \end{cases}$$

so that it is assigned to the class with the average that is closest (most "similar"). Although the above classifier is both simple to train and simple to use, it may not classify unseen test examples, or even the training examples, well. This might be because there are very few training examples (such as in image retrieval), so that the average is biased. It could also be that the classifier model, as determined by the classification rule and parameters, does not fit the data. Often, though, it is relatively easy to build classifiers which perform better than chance. These are called weak learners.
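Section 4.4 will restrict this minimum distance classifier to a single measurement dimension. A minimal sketch of such a one-dimensional weak learner is below; the support for per-example weights anticipates the reweighting of Section 4.3 and is an assumed convenience.

```python
import numpy as np

class NearestMeanStump:
    """Weak hypothesis on one measurement: assign an example to the
    class whose (weighted) mean response is closer (Eq. 4.1)."""
    def fit(self, x, t, w):
        self.mu0 = np.average(x[t == 0], weights=w[t == 0])
        self.mu1 = np.average(x[t == 1], weights=w[t == 1])
        return self

    def predict(self, x):
        return (np.abs(x - self.mu1) < np.abs(x - self.mu0)).astype(int)

    def weighted_error(self, x, t, w):
        return w[self.predict(x) != t].sum()
```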
A natural extension is to train a set of classifiers and aggregate their outputs. The questions are how to build a set of classifiers and how the classifications should be combined. Two deterministic classifiers will learn the same rule if they are trained on the same data. One method that has been empirically shown to improve classifier error rates is Bagging (Bootstrap Aggregation) [9]. The training data is randomly sampled with replacement to create different bootstrap training sets. This enables us to generate a set of different classifiers. The final classification is taken as a majority vote of all the classifiers.

4.3 Boosting Weak Learners

Another approach is to train a set of classifiers iteratively. On each round of constructing a new classifier, every training example is reweighted to effectively change the distribution of the training data. The final classifier is a linear combination of the component classifiers. This is the method of Boosting [20]. Consider a set of weak hypotheses (classifiers) $h_1(\cdot), \ldots, h_T(\cdot)$ (i.e., each has error $\epsilon_i \le 1/2$) with outputs $y_i = h_i(x)$, where $h_i(\cdot) \in \{0, 1\}$. If none of the hypotheses alone achieves an acceptable error rate, we can use a linear combination of all of them:

$$h(x) = \sum_{i=1}^{T} \alpha_i h_i(x) \qquad (4.2)$$

where $h(x)$ is the boosted hypothesis. If all the $\alpha_i$ are equal, then this is a majority vote. The key with boosting is that the training distribution is reweighted before the next hypothesis is constructed. Incorrectly classified examples are reweighted more heavily, so that the next hypothesis chosen will tend to classify these examples correctly. The $\alpha_i$ are set inversely proportional to the $\epsilon_i$, so that the votes of more accurate hypotheses are more heavily weighted. The efficacy of boosting critically depends on the ability of the $h_i(\cdot)$ to fit the training data. By changing the training distribution via reweighting, we can get a series of different classifiers such that each successive one is driven to correctly classify the examples the preceding one classified incorrectly. Theoretically, [20] have shown that boosting will eventually drive the training error to zero, exponentially fast in the number of boosting iterations. In practice, we have found this to be true after approximately 20 iterations. Although it might seem that this would drastically overfit the training data, empirically the boosted classifier seems to perform well on the unseen test data. It appears that boosting may be increasing the margin of separation between the positive and negative examples. As shown by [49], this generally improves performance on test data.

4.4 Learning a Query

Recall that we chose our measurements in an unsupervised fashion using a high selectivity criterion. At query time, we have a set of training images, so we can find measurements which are relevant for the particular query. The idea is to build a series of hypotheses, each one using a single measurement. Many image retrieval systems use only a single example image to define a query and never consider negative examples. This is an artificially imposed constraint, since it is often the case that the user can pick a few positive and negative examples. To enlarge the set of training examples, we can also augment the examples designated by the user with an initial set of presumed negative, random examples. Although it is possible that we may randomly choose an image that should really be positive, for large databases the number of images in any particular image concept is small.
This gives us a set of labeled training images from which to build our classifier. We initially weight the positive images (which are chosen by the user) higher, so that the sum of the weights of the positive examples equals the sum of the weights of the random negative examples. This primes the system for correctly classifying the positive examples. We chose a simple minimum distance classifier in which an image is assigned to the closest concept in the single measurement dimension. Thus $x$ is in class 1 if $|x - \mu_1| < |x - \mu_2|$, where $\mu_k$ is the sample mean of images in class $k$. To construct the first hypothesis (choose the first measurement to keep), we build hypotheses for each measurement and choose the one with the lowest training error. After the first hypothesis is found, the training examples are reweighted in order to emphasize the incorrectly labeled examples. In particular, we use the AdaBoost algorithm [20] shown in Table 4.1. The $x_n$ are images, $t_n$ is the corresponding class label for the image, and $w_{i,n}$ is the weighting for image $n$ on boosting round $i$. The process is repeated for the desired number of iterations $T$. In our experiments, 10 to 50 rounds of boosting were sufficient for reasonable retrieval. Note that our boosting measurement selection requires $O(Td)$ time, since each of the $d$ measurements must be checked on each round of boosting (except for those that are already chosen). However, after this initial learning process, only $T$ measurements are needed to query the database. For large databases this uses much less computation than using all of the measurements. It is possible that using $k > 1$ measurements for each classifier may lead to better performance. However, this creates $n!/k!(n-k)!$ possible subsets of measurements to check. Choosing two measurements out of a thousand ($n = 1000$, $k = 2$) would require looking at 499,500 subsets. This is too slow for a practical image retrieval system.

4.5 Observations

By looking at the measurements selected for various queries, we can see that different measurements are indeed needed to handle different types of queries. Figure 4-2 shows how the measurements chosen by boosting better separate images of mountains from other natural images than randomly selected measurements do.

Boosting Algorithm

* Given example images $(x_1, t_1), \ldots, (x_N, t_N)$ where $t_n = 0, 1$ for negative and positive examples respectively.

* Initialize weights $w_{1,n} = \frac{1}{2L}, \frac{1}{2M}$ for $t_n = 0, 1$ respectively, where $L$ and $M$ are the number of negatives and positives respectively.

* For $i = 1, \ldots, T$:
  1. Train one hypothesis $h_j$ for each feature $j$ using $w_i$, with error $\epsilon_j = \Pr^{w_i}(h_j(x_n) \ne t_n)$.
  2. Choose $h_i(\cdot)$ such that $\forall j \ne i, \epsilon_i < \epsilon_j$ (i.e., the hypothesis with the lowest error).
  3. Update $w_{i+1,n} = w_{i,n} \beta_i^{1-e_n}$, where $e_n = 0, 1$ for example $x_n$ classified correctly or incorrectly respectively, and $\beta_i = \frac{\epsilon_i}{1 - \epsilon_i}$. Normalize $w_{i+1,n} \leftarrow \frac{w_{i+1,n}}{\sum_{n'} w_{i+1,n'}}$ so that $w_{i+1}$ is a distribution.

* Output the final hypothesis:

$$h(x) = \frac{\sum_{i=1}^{T} \alpha_i h_i(x)}{\sum_{i=1}^{T} \alpha_i} \quad \text{where } \alpha_i = \log \frac{1}{\beta_i}$$

Table 4.1: The boosting algorithm for learning a query online. $T$ hypotheses are constructed, each using a single feature. The final hypothesis is a linear combination of the $T$ hypotheses where the weights are inversely proportional to the training errors.

Figure 4-2: The left plot shows how boosting finds measurements which separate images of mountains from other images better than choosing random measurements (right).
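Putting the pieces together, here is a sketch of the query-learning loop of Table 4.1 on top of the single-measurement minimum distance classifier of Section 4.2. The variable names, tie-breaking, and the clipping of $\epsilon$ away from zero are my own choices, not prescribed by the table.

```python
import numpy as np

def boost_query(X, t, T=20):
    """X: (N, d) selected measurements; t: (N,) labels in {0, 1}.
    Returns chosen feature indices, their stump means, and the alphas."""
    N, d = X.shape
    L, M = (t == 0).sum(), (t == 1).sum()
    w = np.where(t == 0, 1.0 / (2 * L), 1.0 / (2 * M))
    chosen, stumps, alphas = [], [], []
    for _ in range(T):
        best = None
        for j in range(d):
            if j in chosen:
                continue
            mu0 = np.average(X[t == 0, j], weights=w[t == 0])
            mu1 = np.average(X[t == 1, j], weights=w[t == 1])
            pred = np.abs(X[:, j] - mu1) < np.abs(X[:, j] - mu0)
            err = w[pred != (t == 1)].sum()
            if best is None or err < best[0]:
                best = (err, j, mu0, mu1, pred)
        err, j, mu0, mu1, pred = best
        eps = np.clip(err, 1e-10, 0.5 - 1e-10)
        beta = eps / (1 - eps)
        # Correctly classified examples get exponent 1 - e_n = 1,
        # i.e., they are down-weighted relative to the mistakes.
        w = w * beta ** (pred == (t == 1)).astype(float)
        w /= w.sum()
        chosen.append(j)
        stumps.append((mu0, mu1))
        alphas.append(np.log(1 / beta))
    return chosen, stumps, np.array(alphas)
```

At query time, only the chosen T measurements need to be evaluated for each database image, which is what makes the approach fast on large databases.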
Although the particular measurements chosen for the mountain query may be useful there, Figure 4-3 shows that the same measurements are not as discriminating for images of lakes. Thus different measurements are needed to handle different queries, and boosting selects those which are useful for a particular query.

Figure 4-3: The boosted measurements which separate images of mountains well do not discriminate images of lakes as well.

4.6 Relevance Feedback

Since the image retrieval system is used interactively, there is an opportunity for the user to improve the query by iteratively adding positive and negative examples. In information retrieval this is called relevance feedback, and it is a natural way for the system to obtain further information from the user. Research in this area of image retrieval is becoming very active [31, 15]. We view this situation as an opportunity for the image retrieval system to do some active machine learning [11]. That is, instead of waiting for the user to hand it training examples, the system can suggest examples for the user to label. It is often useful for the system to query those examples it is most unsure of. One simple example of useful feedback in image retrieval is that the initial query results often contain many highly ranked false positives. Since these are readily seen by the user, the user can add the most egregious (most highly ranked) of these as negative examples. Figure 4-4 shows how performance improves with this feedback from the user.

Figure 4-4: At left is the initial performance. On the right is the improved performance after the four most egregious false positives were added as negative examples. This example is on a data set containing five classes with 100 images in each class (see Chapter 5).

4.6.1 Eliminating Initial False Negatives

Sometimes a few of the randomly chosen initial negative examples turn out to be true positives. To eliminate these, we return the negatives in the training set which are closest to the threshold for being classified as positive. This allows the user to remove those images from the randomly chosen negative training set.

4.6.2 Margin Boosting by Querying the User

Another source of feedback is to present users with images in the test set which are near the threshold. The user can then add any true positives which are below threshold in order to provide a wider range of examples for the learning algorithm. This is useful because a positive example which is already well above threshold will not change the input distribution as much as one which is still classified as negative.

Chapter 5

Experimental Results

5.1 Performance Measures

The standard measures of retrieval performance are precision and recall. Let $N$ be the number of images returned for a query, $N_r$ of which are relevant. Let $N_t$ be the total number of relevant images in the database for this query. Recall is defined as:

$$re = \frac{N_r}{N_t} \qquad (5.1)$$

Precision is defined as:

$$pr = \frac{N_r}{N} \qquad (5.2)$$

So recall is the proportion of relevant images returned, while precision is the proportion of returned images which are relevant.
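As a small worked example of these definitions, the sketch below computes precision and recall at every cutoff of a ranked result list (hypothetical names; a set of relevant indices stands in for the ground-truth class):

```python
import numpy as np

def precision_recall_curve(ranked_ids, relevant_ids):
    """Precision and recall at each cutoff N of a ranked result list.

    ranked_ids   : database indices sorted by decreasing query score
    relevant_ids : set of indices of the truly relevant images (N_t of them)
    """
    relevant = np.isin(ranked_ids, list(relevant_ids))
    n_rel_cum = np.cumsum(relevant)                 # N_r at each cutoff
    n_returned = np.arange(1, len(ranked_ids) + 1)  # N at each cutoff
    precision = n_rel_cum / n_returned              # pr = N_r / N
    recall = n_rel_cum / len(relevant_ids)          # re = N_r / N_t
    return precision, recall

# Example: 2 of the top 3 returned images are relevant (N_t = 3).
p, r = precision_recall_curve(np.array([4, 7, 1, 9]), {4, 1, 5})
print(p)  # [1.0, 0.5, 0.667, 0.5]
print(r)  # [0.333, 0.333, 0.667, 0.667]
```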
For image retrieval, precision may be more important than recall, since users will often only want to see the top 10-20 ranked images. To maximize the number of relevant images returned, we want to maximize the precision within the top ranked images. It is useful to combine the two performance measures into a single precision-recall graph, which shows the precision achieved as a function of recall. In an ideal system, precision would remain at 1.0 from 0 to 1.0 recall. Random performance would stay at a steady low precision across all recalls. In practice, retrieval typically starts at high precision and low recall; as recall increases, precision usually drops, since the threshold for returning images as relevant is successively lowered until all images are returned.

5.2 Natural Scene Classification

To test the classification performance of the system, we constructed five classes of natural images (sunsets, mountains, lakes, waterfalls, and fields) using the Corel Stock Photo¹ image sets 1, 26, 27, 28, and 114 respectively [13, 38]. The images had dimensions of 72x108 or 108x72. Each class contains 100 images. Each class was split into 10 subsets of 10 training examples. In addition, one example from each negative class was added, for a total of 4 negative examples. The results shown are the average of the 10 trials. The training examples were not considered during testing. Figure 5-1 shows a representative image from each class.

¹This publication includes images from the Corel Stock Photo images which are protected by the copyright laws of the U.S., Canada and elsewhere. Used under license.

Figure 5-1: An example image from each class of sunsets, mountains, lakes, waterfalls, and fields.

We tested whether more selective measurements improved performance. Measurements were computed using one and four layers of filtering. Each experimental trial used the same number of features. Figure 5-2 shows how using more layers of filtering improved performance. For natural scenes we have used the following abbreviations in the figures: ss (sunsets), mt (mountains), lk (lakes), wf (waterfalls), and fd (fields). The selectivities for four-layer measurements ranged from a minimum of 0.5086 to a maximum of 0.8623; the selectivities for one-layer measurements ranged from a minimum of 0.4242 to a maximum of 0.8366.

Figure 5-2: Measurements with four layers of filtering (right) performed better than using only one layer of filtering (left).

Figure 5-3 compares the performance of the image retrieval system proposed in this thesis with a global color histogram using a chi-square measure of similarity. The chi-square statistic measures the difference between two distributions and is based on the distribution of a sum of squares of Gaussian random variables [37]. The RGB color space was discretized into 64 bins for the histogram. For this limited test set of five classes of natural images, color turns out to be fairly discriminative. This is obvious for the sunsets class, which contains mostly orange color. Also, the fields class contains mostly green, while the mountains class contains mostly blue. Interestingly, we can do just as well using color histograms and boosting, as shown in Figure 5-4. The advantage is that instead of the full 64 bins, boosting gives comparable performance using only 32 bins.
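For reference, here is a minimal sketch of the color-histogram baseline described above (hypothetical names; the 64-bin histogram is assumed to quantize each RGB channel to 4 levels, and one common form of the chi-square statistic is shown — the thesis does not spell out its exact normalization):

```python
import numpy as np

def rgb_histogram(image, bins_per_channel=4):
    """Global color histogram: 4 levels per RGB channel gives 4**3 = 64 bins.

    image: (h, w, 3) uint8 array.
    """
    q = (image.astype(int) * bins_per_channel) // 256            # per-channel bin in 0..3
    idx = (q[..., 0] * bins_per_channel + q[..., 1]) * bins_per_channel + q[..., 2]
    hist = np.bincount(idx.ravel(), minlength=bins_per_channel ** 3)
    return hist / hist.sum()                                     # normalize to a distribution

def chi_square(h1, h2, eps=1e-10):
    """Chi-square difference between two normalized histograms."""
    return np.sum((h1 - h2) ** 2 / (h1 + h2 + eps))
```

Ranking the database then amounts to sorting images by `chi_square` distance to the query's histogram, smallest first.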
Using only half as many bins enables the system to rank each image with half as many arithmetic operations; on a very large database, reducing the number of computations by a factor of two is a substantial savings.

For selective measurements, the best performance is achieved on the sunsets class. The images in this class are very homogeneous, typified by a global orange color. Besides the sunset itself and perhaps the horizon, very little other significant visual content is found in these images. The waterfall images are moderately homogeneous, since an image of a waterfall is fairly salient. Similarly, the mountains class contains mainly snow-capped mountains. Lower performance is achieved on the fields and lakes classes because the images in those classes are very heterogeneous. Visually, a field may contain flowers, bushes, grass, and hills. Many images of lakes had what most people would consider mountains and fields in the background.

Figure 5-3: A comparison between using color histograms with the chi-square similarity function (left) and using selective measurements and boosting (right).

Although color is fairly discriminative for these natural images, it is not adequate for other classes such as sports cars. In fact, most man-made objects may be from the same class yet be colored very differently. Also, with a larger database, more images will have orange and blue colors, making color undiscriminative. Figures 5-5 and 5-6 show poor results using color histograms and the chi-square function on a 3000-image database. Figures 5-7 and 5-8 show poor results even when color histograms are used with boosting on the same queries.

5.3 Principal Components Analysis

Principal components analysis (PCA) is a technique that rotates data onto the axes of highest variance [8]. We saw in Chapter 3 how it was used to guide the design of the selective measurements. PCA removes second-order correlations and is often used to find a lower-dimensional representation of data (its intrinsic dimensionality). The correlation matrix is defined as:

$$C_{ij} = \frac{\Sigma_{ij}}{\sqrt{\Sigma_{ii}\,\Sigma_{jj}}} \qquad (5.3)$$

where $\Sigma_{ij}$ is the $(i, j)$'th element of the covariance matrix. We performed PCA on our selective measurements and tested performance with the reduced-dimension projections (a minimal code sketch appears just before Section 5.5). Figure 5-9 shows the reduction in correlation between the measurements after PCA. This reduction in correlation and dimensionality, however, also results in a reduction in performance, as shown in Figure 5-10.

5.4 Retrieval

To test the performance of our system on a larger database, we selected Corel image sets 1-30 for a 3000-image database. The images had dimensions of 72x108 or 108x72. Figures 5-11 through 5-14 show the results from some example queries. The typical number of images relevant to a particular query is about 100 (3% of the images in the database). Typically 40 rounds of boosting were used to select 40 measurements for querying. An initial set of 100 randomly selected negative examples was used. The example queries shown typically used 1-5 extra example images selected from the most egregious (highly ranked) false positives from an initial query using just the positive images.

Figure 5-4: Here we show that boosting can be used with color histograms to give comparable performance. Boosting achieves this using only half of the histogram measurements, which results in a substantial computational savings on a large database.
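Returning to Equation 5.3, this is a minimal sketch of computing the correlation matrix and projecting the measurements onto their principal components (hypothetical names; the measurements are assumed to form the columns of a data matrix with one image per row):

```python
import numpy as np

def correlation_matrix(X):
    """Equation 5.3: C_ij = Sigma_ij / sqrt(Sigma_ii * Sigma_jj).

    X: (n_images, n_measurements) matrix of measurements.
    """
    sigma = np.cov(X, rowvar=False)              # covariance matrix Sigma
    d = np.sqrt(np.diag(sigma))
    return sigma / np.outer(d, d)

def pca_project(X, n_components):
    """Project measurements onto the axes of highest variance."""
    Xc = X - X.mean(axis=0)                      # center the data
    sigma = np.cov(Xc, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(sigma)     # eigenvalues in ascending order
    top = eigvecs[:, ::-1][:, :n_components]     # keep the top principal axes
    return Xc @ top
```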
5.5 Face Detection

To test whether selective measurements and boosting are useful for detection, we tested the system on a database of 923 images of faces and 923 images of non-faces. Initially, a large number of images were collected from the World Wide Web. Each frontal face image was obtained by hand-segmenting the original image. A non-face image was generated by selecting a random patch of the same size as the face segment from the same image. This ensures that the set of non-face images comes from a distribution of images similar to that of the face images. All face and non-face images were then resized to 64x64. We split the data set into 92 sets of 10 face training images and 10 non-face training images. Once again, the 10 most egregious false positives were added as additional negative examples after an initial round of querying. The average results for the 92 trials are shown in Figure 5-15. Performance was quite satisfying, with only 10% false alarms at over 95% detection. In addition, only 20 measurements were used. Figure 5-16 shows the results of a typical query using only three example face images and an initial set of 100 random non-face images. It would be interesting future research to test the ability of selective measurements and boosting to detect other types of objects across different scales and poses. For a more generic detection system, only the key measurements selected by boosting need to be computed in the first place.

5.6 Digit Classification

We tested the performance of our image retrieval method on a database of ten digits with 100 examples per class. One example image from each class is shown in Figure 5-17. These images are part of a set of training examples from the National Institute of Standards and Technology (NIST) handwriting database. Precision on this data set degrades very quickly as recall increases. It seems that our measurements are unable to capture the nature of the digit classes and are not invariant to the various transformations from one image to another.

Figure 5-5: A query for waterfalls using color histograms and the chi-square function. Note that global color histograms cannot capture the flowing vertical shape of waterfalls.

Figure 5-6: A query for sports cars using color histograms and the chi-square function. Note, unsurprisingly, that the few cars found are all red; cars of other colors in the database are not ranked highly.

Figure 5-7: A query for waterfalls using color histograms and boosting. Note that boosting finds images with similar color distributions, but most of those images are not of waterfalls.

Figure 5-8: A query for sports cars using color histograms and boosting. Note that boosting finds images with similar color distributions, but most of those images are not of cars.

Figure 5-9: The principal components correlations (left) are much smaller than the original measurement correlations (right).

Figure 5-10: Performance is reduced using the principal components of the measurements.

Figure 5-11: A query for sunsets. The positive examples are shown on the left.

Figure 5-12: A query for waterfalls. The positive examples are shown on the left.

Figure 5-13: A query for sports cars. The positive examples are shown on the left.
Figure 5-14: A query for cacti. The positive examples are shown on the left.

Figure 5-15: The average receiver operating characteristic curve (left) and precision-recall curve (right) for detecting faces.

Figure 5-16: A query for faces. The positive examples are shown on the left.

Figure 5-17: An example image from each of the ten digit classes.

Figure 5-18: Precision-recall results for images of the ten digits.

Chapter 6

Discussion

6.1 Summary

We have described a system for image retrieval which uses highly selective measurements and boosting to learn queries. The measurements are computed by a tree of repeated filtering with simple local difference filters. This leads to a large number of potentially useful measurements for various types of queries. Boosting learns which measurements are discriminative for a particular class of images and selects only those for querying the database. By using many measurements, the system is flexible enough to handle many different types of queries. And since boosting finds a small set of discriminative features, searching the database of images can be made more efficient. Although it is difficult to evaluate the performance of any retrieval system, we have made an attempt here by testing the system against a variety of queries.

Our main contribution lies in outlining a theory for image retrieval. This theory suggests that many measurements may be needed to explain different images; however, for any particular class of images, only a few measurements may be important. This idea mirrors the sparse nature of real images. We proposed to find these discriminative measurements with a two-step process: (1) selecting measurements with sparse distributions, and (2) choosing the few measurements relevant to a particular query with boosting. In summary, we have tried to finesse the image retrieval problem by solving it directly instead of solving object recognition or resorting to text-based representations.

6.2 Applications

There are many applications which can arise from an image retrieval system. Some possible ones are listed below:

* art galleries and museums
* architecture and manufacturing design
* remote sensing and resource management
* geographic information systems
* scientific database management
* retailing
* fabric and fashion design
* trademark and copyright database management
* law enforcement
* library archiving.

6.3 Research Areas

Image retrieval borrows techniques and ideas from knowledge-based systems, cognitive science, user modeling, computer graphics, image processing, pattern recognition, database management systems, and information retrieval. For a commercial system, many practical issues such as the user interface and database organization need to be addressed. Some of these research areas are listed below:

* query specification
* file structures for storage and retrieval
* relevance feedback
* distributed databases
* performance measures
* level of abstraction
* domain independence.

6.4 Connections to Biology

Qualitatively, our approach is in rough accord with the biology of the human visual system up to the striate cortex.
Light enters the eye and impinges upon the photoreceptors, which in turn activate retinal ganglion cells. There are about 10^6 ganglion cells. The nerve impulses travel to the roughly 1.5x10^6 cells in the lateral geniculate nucleus (LGN). These signals then move upstream to the striate cortex, which contains about 2x10^8 neurons. If we think of each cell as a dimension, then the image is represented in a higher-dimensional space each time the signal moves upstream. There is also evidence that cells at higher levels are more selective than those at lower levels. For example, simple cells in striate cortex respond only to bars, while retinal ganglion cells respond to spots or parts of bars. In addition to increased selectivity, higher-level neurons also have larger receptive fields. The subsampling operations in our filtering tree mirror this structure. The response histograms also show selectivity increasing as measurements become more complicated. We have used opponent color axes mirroring those in the visual system [32]. Although the visual system as a whole is non-linear, many simple cell responses can be modeled as roughly linear; thus the succession of linear filters we use is in accord with this observation. It is also known that orientation-selective cells are arranged in hypercolumns, so that the same visual area projects to all the cells in a hypercolumn. Thus we also use filters at different orientations in each image area. Of course, an image retrieval system need not mirror the architecture used by the brain. However, it would be foolish to ignore advances made in neuroscience, since we know that the brain can perform image retrieval very successfully. Of course, the visual cortex contains more sophisticated machinery than just simple cells. For example, neurons have been found in the inferotemporal cortex which appear to respond preferentially to shapes of hands [21]. Regardless of how objects may be coded in the brain, it is clear that some aspects of object recognition are useful in determining whether two images are similar. Although we are still far from being able to engineer a computer to recognize objects, there is experimental evidence of a preattentive type of similarity [41]: certain patterns appear similar even before we recognize any objects. We have tried to start our research from this level of similarity. Future research will entail moving to more advanced forms of similarity.

6.5 Future Research

One conclusion we can make is that image database retrieval research is still in its infancy. And since image retrieval is intrinsically tied to so many other areas of research, there is much room for future work. The primary goal of such work would be to formulate a quantitative theory of what types of measurements to compute and how best to use them. We have hinted at using highly selective measurements and boosting.

Figure 6-1: An image with various possible classifications based on visual content, context, prior knowledge, etc.
Below is a list of directions for future research:

* restructuring measurements for faster and more accurate retrieval
* new datasets and criteria for evaluating performance
* analyzing the relationship to psychophysics
* clustering and organizing the image database
* learning associations from the overall nature of the database.

Although it may take some time to get there, our goal is to be able to interpret Figure 6-1 as an image of snowy mountains, pine trees, a ski lodge, daytime, cold weather, skiing, snow sports, vacation, serenity, etc.

Appendix A

Implementation of the Image Retrieval System

A.1 Architecture

Figure A-1 shows our implementation of an image retrieval system based on the ideas in this thesis. The retrieval engine is a self-contained program which runs as a server. Multiple clients can connect to the server and request similar images. Users interact exclusively with the client program to designate queries and view retrieved results.

A.2 User Interface

Figure A-2 shows the user interface to our image retrieval system. Users can cycle through random images until some relevant examples are found. These may be added as positive or negative examples. The interface also allows the user to select how many measurements to use. The client program also has the ability to connect to different servers. Retrieved results, as well as other information from the image retrieval server engine, are displayed in various windows of the client.

A.3 Online Demo

An online demo of the system is available on the World Wide Web at: http://www.ai.mit.edu/projects/lv/projects/ImageDatabase.html.

Figure A-1: Image retrieval system architecture. Clients handle image display and the user interface; the server holds the stored representations and performs the retrieval computations.

Figure A-2: Image retrieval interface.

Bibliography

[1] M. Abdel-Mottaleb, N. Dimitrova, R. Desai, and J. Martino. CONIVAS: Content-based image and video access system. In ACM Multimedia Conference, pages 427-428, 1996.
[2] AltaVista. Web: http://www.altavista.com.
[3] Y. A. Aslandogan, C. Thier, and C. Yu. A system for effective content based image retrieval. In ACM Multimedia Conference, pages 429-430, 1996.
[4] R. J. Baddeley and P. J. B. Hancock. Principal components of natural images. Network: Comp. in Neur. Sys., 3:61-70, 1992.
[5] H. B. Barlow. Unsupervised learning. Neur. Comp., 1(1):295-311, 1989.
[6] S. Belongie, C. Carson, H. Greenspan, and J. Malik. Color- and texture-based image segmentation using EM and its application to content-based image retrieval. In Int. Conf. Comp. Vis., 1998.
[7] E. L. Bienenstock, L. N. Cooper, and P. W. Munro. Theory for the development of neuron selectivity: orientation specificity and binocular interaction in visual cortex. J. Neuro., 2:32-48, 1982.
[8] C. M. Bishop. Neural Networks for Pattern Recognition. Clarendon Press, 1995.
[9] L. Breiman. Bagging predictors. Technical Report 421, UCB, Dept. Stat., 1994.
[10] M. La Cascia and E. Ardizzone. JACOB: Just a content-based query system for video databases. In Int. Conf. Acous., Speech, Sig. Proc., 1996.
[11] D. Cohn, L. Atlas, and R. Ladner. Improving generalization with active learning. Mach. Learn., 15(2):201-221, 1994.
[12] T. H. Cormen, C. E. Leiserson, and R. L. Rivest. Introduction to Algorithms. MIT Press, 1990.
[13] Corel Corporation. Corel stock photo images. http://www.corel.com.
[14] T. M. Cover and J. A. Thomas. Elements of Information Theory. John Wiley & Sons, 1991.
[15] I. Cox, M. Miller, T. Minka, and P. N. Yianilos.
An optimized interaction strategy for Bayesian relevance feedback. In IEEE Comp. Vis. Patt. Recog. Conf., pages 553-558, 1998.
[16] J. S. DeBonet and P. Viola. Structure driven image database retrieval. In Adv. Neur. Info. Proc. Sys., volume 10, 1998.
[17] P. Diaconis and D. Freedman. Asymptotics of graphical projection pursuit. Ann. Stat., 12:793-815, 1984.
[18] R. O. Duda and P. E. Hart. Pattern Classification and Scene Analysis. John Wiley & Sons, 1973.
[19] M. Flickner, H. Sawhney, W. Niblack, J. Ashley, Q. Huang, B. Dom, M. Gorkani, J. Hafner, D. Lee, D. Petkovic, D. Steele, and P. Yanker. Query by image and video content: The QBIC system. IEEE Computer, 28(9):23-32, 1995.
[20] Y. Freund and R. E. Schapire. A decision-theoretic generalization of online learning and an application to boosting. J. Comp. & Sys. Sci., 55(1):119-139, 1997.
[21] C. G. Gross, C. E. Rocha-Miranda, and D. B. Bender. Visual properties of neurons in inferotemporal cortex of the macaque. J. Neurophys., 35:96-111, 1972.
[22] J. Hertz, A. Krogh, and R. G. Palmer. Introduction to the Theory of Neural Computation. Addison-Wesley, 1991.
[23] J. Huang, S. R. Kumar, M. Mitra, W. Zhu, and R. Zabih. Image indexing using color correlograms. In IEEE Comp. Vis. Patt. Recog. Conf., 1997.
[24] C. E. Jacobs, A. Finkelstein, and D. H. Salesin. Fast multi-resolution image querying. In SIGGRAPH, 1995.
[25] M. Kelly, T. M. Cannon, and D. R. Hush. Query by image example: the CANDID approach. In SPIE Stor. and Ret. Image & Video Databases III, volume 2420, pages 238-248, 1995.
[26] R. R. Korfhage. Information Storage and Retrieval. John Wiley & Sons, 1997.
[27] J. S. Lim. Two-Dimensional Signal and Image Processing. Prentice-Hall, 1990.
[28] R. Linsker. Self-organization in a perceptual network. Computer, pages 105-117, March 1988.
[29] P. Lipson, E. Grimson, and P. Sinha. Context and configuration-based scene classification. In IEEE Comp. Vis. Patt. Recog. Conf., 1997.
[30] V. S. Nalwa. A Guided Tour of Computer Vision. Addison-Wesley, 1993.
[31] C. Nastar, M. Mitschke, and C. Meilhac. Efficient query refinement for image retrieval. In IEEE Comp. Vis. Patt. Recog. Conf., pages 547-552, 1998.
[32] J. G. Nicholls, A. R. Martin, and B. G. Wallace. From Neuron To Brain. Sinauer Associates, Inc., 3rd edition, 1992.
[33] V. Ogle and M. Stonebraker. Chabot: Retrieval from a relational database of images. IEEE Computer, 28(9):40-48, 1995.
[34] A. Pentland, R. W. Picard, and S. Sclaroff. Photobook: content-based manipulation of image databases. Int. J. Comp. Vis., 18(3):233-254, 1996.
[35] T. Poggio and S. Edelman. A neural network that learns to recognize three-dimensional objects. Nature, 343:263-266, 1990.
[36] W. Pratt. Digital Image Processing. John Wiley & Sons, 1991.
[37] W. H. Press, B. P. Flannery, S. A. Teukolsky, and W. T. Vetterling. Numerical Recipes in C. Cambridge Univ. Press, 2nd edition, 1992.
[38] A. L. Ratan and O. Maron. Multiple instance learning for natural scene classification. In Int. Conf. Mach. Learn., pages 341-349, 1998.
[39] E. Rosch. Principles of categorization. In Cognition and Categorization. Lawrence Erlbaum Assoc., Inc., 1978.
[40] H. A. Rowley, S. Baluja, and T. Kanade. Neural network-based face detection. IEEE Patt. Anal. Mach. Intell., 20(1):23-28, 1998.
[41] S. Santini and R. Jain. Gabor space and the development of pre-attentive similarity. In Int. Conf. Patt. Recog., 1996.
[42] P. Sinha. Image invariants for object recognition. Invest. Opth. & Vis. Sci., 34(6), 1994.
[43] J. R. Smith and S.-F. Chang.
VisualSEEk: a fully automated content-based image query system. In ACM Multimedia Conference, pages 87-98, 1996.
[44] G. Strang and T. Nguyen. Wavelets and Filter Banks. Wellesley Cambridge Press, 1996.
[45] K. Sung and T. Poggio. Example-based learning for view-based human face detection. Technical Report 1521, MIT AI Lab Memo, 1994.
[46] M. J. Swain and D. H. Ballard. Color indexing. Int. J. Comp. Vis., 7(1):11-32, 1991.
[47] M. J. Swain, C. Frankel, and V. Athitsos. WebSeer: an image search engine for the World Wide Web. Technical Report TR-96-14, Univ. of Chicago, 1996.
[48] M. Turk and A. Pentland. Eigenfaces for recognition. J. Cog. Neuro., 3(1):71-86, 1991.
[49] V. Vapnik. Statistical Learning Theory. John Wiley & Sons, 1998.