Boosting Sparse Representations for Image Retrieval
by
Kinh H. Tieu
Submitted to the Department of Electrical Engineering and Computer Science
in partial fulfillment of the requirements for the degree of
Master of Science in Computer Science and Engineering
at the
MASSACHUSETTS INSTITUTE OF TECHNOLOGY
February 2000
© Massachusetts Institute of Technology 2000. All rights reserved.

Author: Department of Electrical Engineering and Computer Science
January 31, 2000
Certified by: Paul Viola
Associate Professor, Department of Electrical Engineering and Computer Science
Thesis Supervisor
Accepted by: Arthur C. Smith
Chairman, Department Committee on Graduate Students
Boosting Sparse Representations for Image Retrieval
by
Kinh H. Tieu
Submitted to the Department of Electrical Engineering and Computer Science
on January 31, 2000, in partial fulfillment of the
requirements for the degree of
Master of Science in Computer Science and Engineering
Abstract
In this thesis, we developed and implemented a method for creating sparse representations of real
images for image retrieval. Feature selection occurs both offline by choosing highly selective features
and online via "boosting". A tree of repeated filtering with simple kernels is used to compute the
initial set of features. A lower dimensional representation is then found by selecting the most selective
of these features. At query time, boosting selects a few of the features useful for the particular query
and ranks the images in the database by taking the weighted vote of an ensemble of classifiers each
using a single feature. This method allows for a large number of potential queries and facilitates fast
retrieval on very large image databases. The method is tested on various image sets using standard
measures of retrieval performance. An online demo of the system is available via the World Wide
Web¹.
Thesis Supervisor: Paul Viola
Title: Associate Professor, Department of Electrical Engineering and Computer Science
¹http://www.ai.mit.edu/projects/lv/
Acknowledgments
This research was supported in part by Nippon Telegraph and Telephone.
First I thank this great nation for giving me and my family hope. My life would be drastically
different without the incredible opportunities I have received here.
Thanks to the Massachusetts Institute of Technology and the Artificial Intelligence Laboratory
for supporting my studies and providing superb resources for my research.
I thank all the members of the AI Lab who have helped me along the way, especially the Learning
and Vision Group. Special thanks to John Winn, Dan Snow, Mike Ross, Nick Matsakis, Jeremy
De Bonet, Christian Shelton, John Fisher, and Chris Stauffer. Extra special thanks to Erik Miller for
reading the thesis and offering helpful comments.
Thanks to Professor Eric Grimson for advice, support and helping me to fulfill my academic
requirements in a timely and worthwhile manner.
Finally, I thank my advisor Professor Paul Viola for his care, encouragement, support, and
insight. He has greatly enriched my understanding of this work and of how to do research. It is
a rewarding and pleasurable experience working with Paul, and I appreciate the time, energy, and
thought that he provides.
For my family, I would not be here without your unconditional encouragement, support, and
love.
Contents

1 Introduction
   1.1 Thesis Summary
   1.2 Image Retrieval
       1.2.1 The Problem Model
       1.2.2 Specifications
   1.3 Scene Analysis
   1.4 Image Indexing
       1.4.1 Measurement Space
       1.4.2 Similarity Function
   1.5 Selective Measurements
       1.5.1 Image Generation
       1.5.2 Measurement Design
   1.6 Learning a Query
   1.7 Performance Evaluation
   1.8 Impact
   1.9 Why Try to Solve Image Retrieval Today?
   1.10 Thesis Organization

2 Previous Approaches
   2.1 Color Indexing
       2.1.1 Histograms
       2.1.2 Correlograms
   2.2 Color, Shape, Texture
       2.2.1 QBIC
       2.2.2 CANDID
       2.2.3 BlobWorld
       2.2.4 Photobook
       2.2.5 JACOB
       2.2.6 CONIVAS
   2.3 Wavelets
   2.4 Templates
       2.4.1 VisualSEEk
       2.4.2 Flexible Templates
       2.4.3 Multiple Instance Learning and Diverse Density
   2.5 Database Management Systems
       2.5.1 Chabot
       2.5.2 SCORE
   2.6 Summary of Previous Approaches
   2.7 Related Work
       2.7.1 Information Retrieval
       2.7.2 Object Recognition
       2.7.3 Segmentation
       2.7.4 Line Drawing Interpretation

3 Selective Measurements
   3.1 Motivation
       3.1.1 Preprocessing vs. Postprocessing
       3.1.2 Image Representations
       3.1.3 Sparse Representations
   3.2 Design
       3.2.1 Filters
       3.2.2 Filtering Tree
       3.2.3 Color Space
       3.2.4 Normalization
   3.3 Measurement Selection
       3.3.1 Selectivity

4 Learning Queries Online
   4.1 Image Retrieval as a Classification Task
   4.2 Aggregating Weak Learners
   4.3 Boosting Weak Learners
   4.4 Learning a Query
   4.5 Observations
   4.6 Relevance Feedback
       4.6.1 Eliminating Initial False Negatives
       4.6.2 Margin Boosting by Querying the User

5 Experimental Results
   5.1 Performance Measures
   5.2 Natural Scene Classification
   5.3 Principal Components Analysis
   5.4 Retrieval
   5.5 Face Detection
   5.6 Digit Classification

6 Discussion
   6.1 Summary
   6.2 Applications
   6.3 Research Areas
   6.4 Connections to Biology
   6.5 Future Research

A Implementation of the Image Retrieval System
   A.1 Architecture
   A.2 User Interface
   A.3 Online Demo
List of Figures

1-1 An airplane and a race car may be classified as belonging to the class of vehicles using the functional property "transportation".
1-2 Example images from an airplanes class and a race cars class defined by visual properties.
1-3 Schematic of an image retrieval system.
1-4 A set of random images (that can be found on the World Wide Web) that illustrates the diversity of visual content in images.
1-5 An example of a scene with many objects and complex relationships between the objects. For example, is the Golden Gate Bridge the only important object in the image? Are the mountains in the background important? Should it be classified as a scene of a coastline? Are the people in the foreground important? There are many ways to describe this image; the difficulty lies in being able to represent all these relationships in a tractable manner.
1-6 A query on the AltaVista system (which uses color histogram measurements) for images similar to the sunset image in the upper left corner. Here color is effective because orange colors dominate in images of sunsets but not in other images.
1-7 A query on the AltaVista system for images similar to the image of the Eiffel Tower in the upper left corner. Color is ineffective for queries such as this one where background colors (i.e., the blue sky here) dominate.
1-8 Possible renderings of two images generated by picking a few specific items from a large dictionary of visual events.
1-9 The diagonal pattern of vertical edges (marked in red) arising from the stairs in this image represents a "staircase" pattern.
1-10 The boosting process iteratively constructs a set of classifiers and combines them into a single boosted classifier.
1-11 A query on the system described in this thesis for images similar to the three example images of airplanes at the top.
2-1 The color histograms for both images are very similar because they both contain similar amounts of "yellow", "green", and "brown". However most people would agree that these images represent very different visual concepts.
2-2 A hand-crafted template for images of waterfall scenes. Here the waterfall concept is defined to be a white region in between two green regions with a blue region on top.
2-3 An E-R diagram for an image of a sailboat.
2-4 A query on the AltaVista system with the keyword "Eiffel" that illustrates the rich semantic content of particular words. Since an image would probably be labeled "Eiffel" only if it contained an image of the Eiffel Tower, text is effective for this query.
2-5 A query on the AltaVista system with the keyword "jets". Although we wanted images of jet propulsion vehicles, the system retrieved images of the Jets football team.
3-1 The image on the right is a permutation of the pixels of the image of the car (left). Using too general an image representation allows for these kinds of unrealistic images.
3-2 On the left are the principal components of image patches. On the right are the principal components of the responses to the fourth principal component on the left (a bar-like filter).
3-3 The 9 primitive filters used in computing the measurement maps. In practice, these can be efficiently computed with separable horizontal and vertical convolutions.
3-4 A schematic of the filtering tree where a set of filters is repeatedly applied to an image to capture more global structure.
3-5 Response of an image of a tiger to a particular filtering sequence. Note that this feature has detected a strong peak corresponding to the arrangement of the stripes on the body of the tiger.
3-6 Histograms for a highly selective (left) and unselective (right) measurement. The highly selective distribution had a selectivity of 0.9521 and a kurtosis of 132.0166. The unselective measurement had a selectivity of 0.4504 and a kurtosis of 2.3703.
4-1 Some typical images exhibiting the "waterfalls" concept. Note the diversity in the images as some contain sky, some contain flora, while others are mostly rock, etc.
4-2 The left plot shows how boosting finds measurements which separate images of mountains from other images better than choosing random measurements (right).
4-3 The boosted measurements which separate images of mountains well do not discriminate images of lakes as well.
4-4 At left is the initial performance. On the right is the improved performance after the four most egregious false positives were added as negative examples. This example is on a data set containing five classes with 100 images in each class (see Chapter 5).
5-1 An example image from each class of sunsets, mountains, lakes, waterfalls, and fields.
5-2 Measurements with four layers of filtering (right) performed better than using only one layer of filtering (left).
5-3 A comparison between using color histograms with the chi-square similarity function (left) and using selective measurements and boosting (right).
5-4 Here we show that boosting can be used with color histograms to give comparable performance. Boosting achieves this using only half of the histogram measurements. This results in a substantial computational savings on a large database.
5-5 A query for waterfalls using color histograms and the chi-square function. Note global color histograms cannot capture the flowing vertical shape of waterfalls.
5-6 A query for sports cars using color histograms and the chi-square function. Note unsurprisingly that the few cars found are all red; cars of other colors in the database are not ranked highly.
5-7 A query for waterfalls using color histograms and boosting. Note that boosting finds images with similar color distributions, but most of those images are not of waterfalls.
5-8 A query for sports cars using color histograms and boosting. Note that boosting finds images with similar color distributions, but most of those images are not of cars.
5-9 The principal component correlations (left) are much smaller than the original measurement correlations (right).
5-10 Performance is reduced using the principal components of the measurements.
5-11 A query for sunsets. The positive examples are shown on the left.
5-12 A query for waterfalls. The positive examples are shown on the left.
5-13 A query for sports cars. The positive examples are shown on the left.
5-14 A query for cacti. The positive examples are shown on the left.
5-15 The average receiver operating curve (left) and precision-recall curve (right) for detecting faces.
5-16 A query for faces. The positive examples are shown on the left.
5-17 Results for images of the ten digits.
5-18 Results for images of the ten digits.
6-1 An image with various possible classifications based on visual content, context, prior knowledge, etc.
A-1 Image retrieval system architecture.
A-2 Image retrieval interface.
List of Tables

2.1 A comparison of different image retrieval systems.
4.1 The boosting algorithm for learning a query online. T hypotheses are constructed, each using a single feature. The final hypothesis is a linear combination of the T hypotheses where the weights are inversely proportional to the training errors.
Chapter 1
Introduction
1.1 Thesis Summary
Today, with digital cameras and inexpensive mass storage devices, digitizing and storing images is
easier than ever. It is not uncommon to find databases with over 500,000 images [33]. Moreover,
multimedia content on the Internet continues to grow dramatically: in 1995, there were an estimated
30 million images on the World Wide Web [47], and today that number is certainly much higher. This
explosion of images creates a need to be able to index these databases so that users can easily
and quickly find particular images. The goal of an image retrieval system is to provide querying
capabilities comparable to those which exist for text document databases such as [2]. The problem
involves designing an image representation suitable for retrieval and an algorithm for learning a
query. The representation must somehow capture the visual content in images. Since the images
desired are not known ahead of time, it is difficult to provide the learning algorithm with many
training examples. So training often occurs with only the few training examples provided by the
user. A further requirement of the entire system is that it must operate in real time so that it can
be used interactively.
This thesis presents an approach to the image retrieval problem that is motivated by the statistics
of real images (i.e., images of real scenes, not computer generated ones). The task is: "Given some
description of visual content, find images which fit the description." In other words, given a concept
or class of images, find other images exhibiting the same concept or belonging to the same class. This
has important implications for computer vision because being able to find images which match an
arbitrary description implies being able to concretely describe an image. In other words, there must
be an explicit description of every image so that the description can be compared (by a computer)
to the target concept.
Humans can retrieve images easily and almost unconsciously especially because we can compare
the functional properties of objects in an image. For example both the airplane and race car in
Figure 1-1 could be assigned to a "vehicle" class since both provide transportation. This type
of categorization may be regarded as "higher" level cognition and resembles [39]'s superordinate
categories. Functional categorization requires more than just visual features of an image; some
characterization of the function "transportation" is needed. This type of characterization may also
involve some assessment of the utility of objects. Strategies for using functional properties are not
well developed, even in more constrained domains such as text document retrieval. It is exactly
the ease and mystery with which our brains perform this task that makes it extremely difficult to
program a computer (which currently requires explicit instructions) to do it¹. We will sidestep this
problem by only considering information which can be extracted from the "visual content" of images.
For example, although our system should be able to identify the Eiffel Tower as a tower, it is not
expected to infer meta properties such as "(location Paris)". Thus, it is reasonable to expect our
system to be able to categorize airplanes as one class, and race cars as another, since each of those
¹This is often a good definition of current artificial intelligence problems.
Figure 1-1: An airplane and a race car may be classified as belonging to the class of vehicles using
the functional property "transportation".
Figure 1-2: Example images from an airplanes class and a race cars class defined by visual properties.
classes share many visual characteristics as shown in Figure 1-2. This formulation of the problem is
commonly known as "query by image content". This more closely corresponds to the "basic level"
categorization of [39]. We can thus formulate image retrieval as the following task:
Given very few example images, rapidly learn to retrieve other examples.
For over 20 years, computer vision has tried to develop systems which take an arbitrary input
image and generate a scene description. This general problem has been decomposed into more
specific subproblems such as image segmentation, object detection and object recognition. The idea
is that if we could solve the subproblems and combine the results, we could compute a complete
description of an image and hence compare images by comparing descriptions. Although progress
has been made in each of the subproblems, each remains unsolved. Part of the reason for this
is that the subproblems are intimately related in subtle ways since having a perfect segmentation
would help isolate areas in an image for detection or recognition, and having a perfection detection
or recognition system would help us find the segmented regions corresponding to different objects.
Research in fusing the results of these subsystems is less developed.
We will attack the image retrieval problem without trying to explicitly solve the computer vision
subproblems. This reasoning is motivated by the observation that one should not try to solve a more
difficult intermediate problem if one can more easily solve the problem directly [49]. We will exploit
the statistics of real images by presuming that although many possible events (i.e., visual events
such as an image of the Eiffel Tower) may occur in an image, in any particular image, only a small
set of events will actually happen. This observation suggests a sparse distributed representation for
images. We will design a set of features such that for any particular image, only a few key ones
will be active. However over all possible images, each feature should have about equal probability
of being active. These features will necessarily be selective, meaning the average response over
all images is much lower than the strong responses for a few particular images. To measure the
similarity between images, a system can learn which are the important features and compare just
those features.
To select the most relevant subset of features with only a few example images, "boosting" [20]
is used to create a series of learners. Each successive learner tries to do better on examples that the
previous one did poorly on. This approach is well suited to image retrieval since the selective features
can be computed offline once and for all. The learning can be performed quickly with just a few
training examples because the representation is sparse. Since only a few relevant features will be
selected, the entire database of images can be searched with fewer computations per image.
Besides being a possible solution for image retrieval, this thesis suggests a theory for useful generic
intermediate level image representations suitable for simple, fast learning with a small number of
training examples.
1.2 Image Retrieval

1.2.1 The Problem Model
Image retrieval can be decomposed into two parts as shown in Figure 1-3:
image indexing storing images in a representation that facilitates retrieval.
querying learning the desired query concept, and searching the database for images matching the
concept.
Image indices are typically precomputed offline so that they do not have to be recomputed for each
query. Depending on the type of index used, this may limit the kinds of queries which are possible,
but improves retrieval speed. For example if only color information is stored in an index, then
queries based on shape will not be possible. However for queries involving color, only the index is
required, and the image never needs to be rescanned. Querying is typically performed online as this
is part of the interactive user interface of the retrieval system.
In machine learning, a concept is induced or "learned" given a set of training examples. The
concept could be a rule to discriminate between images of airplanes and race cars. Machine learning
is often applied to classification problems where the problem is to assign a label given an input. The
training examples are usually given as a large set of labeled data $(x^n, t^n)$ where $x^n$ is the $n$th input image and $t^n$ is the corresponding target label. The goal is to learn the concept so that unseen
test examples can be labeled correctly. Traditional machine learning methods for classification are
difficult to apply to retrieval because they often require a small number of known classes and many
labeled training examples. Retrieval differs from classification because the target class is unknown
in advance, and only defined at query time. In addition, it is possible for a single image to be
relevant for two different queries (e.g., an image picturing a sailboat off the coast of France may be
relevant both for a query of boats and coasts). Making matters worse, image databases are often
huge (thousands or millions of images) and usually contain a diverse set of images as shown in Figure
1-4.
Formally, the task is to rank all the images in a database according to how similar they are to a query concept using a ranking function:

$$r(x, Q) : x \mapsto [0, 1] \qquad (1.1)$$

where $x$ is an image to be ranked, $Q = q^1, \ldots, q^N$ are example images of the query concept, 0 stands for most dissimilar, and 1 stands for most similar. Note that $r(x, Q)$ depends on the particular query $Q$.
Figure 1-3: Schematic of an image retrieval system.
Figure 1-4: A set of random images (that can be found on the World Wide Web) that illustrates
the diversity of visual content in images.
Typically $N$ ranges from 1 to 5. This situation calls for a way to index the images in the database
to allow for fast (because online users will not tolerate a long response delay) and flexible (because
different users will want to retrieve different types of images) machine learning of r(x, Q) using very
few training examples.
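As a concrete illustration of this interface, here is a minimal sketch in Python/NumPy (not code from the thesis; the names rank_database and learn_ranker are hypothetical) of how a learned ranking function would be applied to a database of precomputed measurement vectors:

import numpy as np

def rank_database(index, examples, learn_ranker):
    """Rank all database images against a query concept Q.

    index:        (num_images, d) array of precomputed measurement vectors
    examples:     (N, d) measurement vectors of the query examples Q, N ~ 1-5
    learn_ranker: learning algorithm mapping the examples to a scoring
                  function r(., Q) with values in [0, 1]
    """
    r = learn_ranker(examples)                 # learn r(., Q) at query time
    scores = np.array([r(x) for x in index])   # score every database image
    return np.argsort(-scores)                 # indices, most similar first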
1.2.2 Specifications
Image retrieval is a difficult problem because it is not clear how to best represent images. For text,
because words have such rich semantic content, significant results have been achieved by merely
considering documents as lists of word counts without any regard for word order. An image retrieval
system must also be efficient in order for it to be practical on large databases. Below are the two
primary requirements of an image retrieval system:
fast search through a large database (thousands or millions of images) quickly (milliseconds or
seconds).
flexible handle a large number (hundreds or thousands) of different image queries (concepts) such
as sunsets, cars, crowds of people, etc.
As noted in [46], an image retrieval system must operate in real time. This allows the system to
make use of relevance feedback from the online user. Feedback generally consists of the user choosing
more examples and eliminating unwanted ones.
The method of specifying queries determines the type of user interface an image retrieval system
should have. We have chosen an interface where a query is specified by choosing a set of example
images representative of the class of images implied by the query. This is commonly known as "query
by example". It frees the user from having to manually draw an image or from trying to describe
the visual content with words². In particular a system should not demand too much of the user
except for a few simple tasks such as:
2 "A picture is worth more than a thousand words." - anonymous
Figure 1-5: An example of a scene with many objects and complex relationships between the objects.
For example, is the Golden Gate Bridge the only important object in the image? Are the mountains
in the background important? Should it be classified as a scene of a coastline? Are the people in
the foreground important? There are many ways to describe this image, the difficulty lies in being
able to represent all these relationships in a tractable manner.
browsing cycling through random images.
positives choosing some positive examples of the desired concept.
negatives possibly choosing some negative examples.
feedback possibly doing a few rounds of relevance feedback, corresponding to picking more positives and negatives.
1.3 Scene Analysis
Based on the formulation of the image indexing problem, it seems appropriate to represent each
image as a description of the scene that was imaged. This description would include the various
objects in the image and how they are related spatially and by other relations. To determine the
similarity of two images, we simply determine the similarity of the scene descriptions. There are
several reasons that this type of system does not exist today. First, there does not exist a robust
method for extracting scene descriptions given an image. To do that we must first be able to recognize
arbitrary objects in an image. Object recognition is an ongoing goal of computer vision and continues
to defy a general solution. Second, it is not clear what should be considered an object (i.e., should
the mountains in Figure 1-5 be recognized as one object or multiple mountains?). Finally given two
scene descriptions, it is not clear how similarity should be measured. Do we find how many objects
images x and y have in common, or are the spatial and other relationships between the objects more
important?
1.4 Image Indexing
The first step to creating an image retrieval system is to develop a representation for images that
facilitates retrieval. Image indexing takes an input image (initially represented as an array of pixel
intensities) and produces a new representation that makes it easy to find a particular set of images
in a database. One way to think of this problem is to consider the analogous task of making indices
for books in a library. For example, a book might be indexed by author, title, publication date,
as well as by an abstract. The goal of the index or representation is to permit quick access of a
particular book or class of books.
1.4.1 Measurement Space
To index images, we need to take some measurements from the input image. Let us define a
measurement as some function of the original pixel values of the image. Assume that we have some
way of extracting a set of measurements M from an input image. For example, one measurement
could be the count of the number of pixels of a certain color. Assuming the elements $x_i$ of $M$ are scalars (i.e., $x_i \in \mathbb{R}$), we can arrange them into a measurement vector $x = [x_1, \ldots, x_d]^T$ where $d$ is the total number of measurements. We can thus consider an image as an element of the abstract space $\mathbb{R}^d$. This is the multidimensional input (usually called a "feature" vector, although that connotes
binary measurements) commonly assumed by many classical pattern recognition techniques [18].
1.4.2 Similarity Function
We can define the similarity $s(x, y)$ between two images (vectors) $x$ and $y$ simply as a function of the $L_2$ distance between the vectors:

$$s(x, y) = e^{-d(x, y)^2} \qquad (1.2)$$

where

$$d(x, y) = \sqrt{\sum_{i=1}^{d} (x_i - y_i)^2} \qquad (1.3)$$

is the $L_2$ distance. Points $x$ and $y$ are maximally similar if they are represented as the same point in $\mathbb{R}^d$ since the $L_2$ distance between them will be 0 and their similarity will be the maximum value of 1 (note that $d(x, y) \geq 0$). As the distance between two images in measurement space linearly increases, the similarity will decrease exponentially. This formulation is equivalent to saying that similarity is proportional to the density of $x$ under a spherical gaussian centered on $y$ (or vice versa) since $s(\cdot, \cdot)$ is in the form of a gaussian density:

$$p(x) = \frac{1}{(2\pi)^{d/2} |\Sigma|^{1/2}} e^{-\frac{1}{2}(x - \mu)^T \Sigma^{-1} (x - \mu)}. \qquad (1.4)$$
Note that the similarity function is probably not the correct or "natural" metric for images in
measurement space. For example, the triangle inequality property of the distance functions may
not be valid (e.g., red is similar to orange and orange is similar to yellow, but the sum of these two
distances can be smaller than the distance between red and yellow since these colors may appear
very dissimilar). In addition, it is unclear whether or not similarity should decrease exponentially
with distance.
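A direct transcription of Equations 1.2 and 1.3 (a sketch in Python/NumPy, not code from the thesis):

import numpy as np

def l2_distance(x, y):
    # Equation 1.3: L2 (Euclidean) distance in measurement space
    return np.sqrt(np.sum((x - y) ** 2))

def similarity(x, y):
    # Equation 1.2: similarity decays exponentially with squared distance,
    # i.e., it is proportional to a spherical gaussian density centered on y
    return np.exp(-l2_distance(x, y) ** 2)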
Based on sample query results, many commercial image retrieval systems today appear to use color histograms (pixel color counts as measurements) and a static ranking function such as:

$$r(x, Q) = s(x, \mu) \qquad (1.5)$$

where

$$\mu = \frac{1}{N} \sum_{n=1}^{N} q^n \qquad (1.6)$$

is the sample mean of the measurements of the example images. Once again, this is equivalent to similarity being proportional to the density of the gaussian centered at $\mu$. In these systems, first a color histogram representation would be generated for every image in the database. A user specifies a query by choosing an example image. The system then retrieves the most similar images by finding the images which are closest to the example image in the $L_2$ sense. Color histograms work well when the amount of color happens to be a discriminating measurement for a particular class of images
Figure 1-6: A query on the AltaVista system (which uses color histogram measurements) for images
similar to the sunset image in the upper left corner. Here color is effective because orange colors
dominate in images of sunsets but not in other images.
such as sunsets, as shown in Figure 1-6. However, as shown in Figure 1-7, color is inadequate for a query of images of the Eiffel Tower. In fact the system has done a better job of finding blue sky than tall, steel framed structures. Even with useful measurements, using a single static similarity function may not be appropriate for every type of query. For example, although blue sky was not an important feature for the Eiffel Tower query, it may be useful in queries for airplanes.
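For comparison with the learned approach developed later, this static mean-based ranking (Equations 1.5 and 1.6) over a whole database might look like the following sketch (Python/NumPy; a hypothetical helper, not code from any of the systems discussed):

import numpy as np

def static_rank(index, examples):
    """index: (num_images, d) histograms; examples: (N, d) query histograms.
    Score every image by its gaussian similarity to the sample mean of the
    examples (Equations 1.5-1.6); no per-query learning is involved."""
    mu = examples.mean(axis=0)                # Equation 1.6: sample mean
    d2 = np.sum((index - mu) ** 2, axis=1)    # squared L2 distances to mu
    return np.exp(-d2)                        # Equation 1.5 via Equation 1.2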
1.5 Selective Measurements

1.5.1 Image Generation
Consider the following observation of real images:
Many possible visual events can occur in images, but in any particular image, only a few will actually happen.

Suppose we want to generate some image. We can pick from a very large "dictionary" of visual events (e.g., clear or cloudy sky; grassy, sandy, or rocky ground; a dalmatian; a sports car; the Taj Mahal; etc.). For example we can generate an image of a sports car on the beach. In the background would be the ocean and perhaps some waves and a few mossy rocks. We could have generated an entirely different picture such as one of grazing buffalos. Possible renderings of these images are shown in Figure 1-8. Note however that we cannot have too many things in any single image: it
Figure 1-7: A query on the AltaVista system for images similar to the image of the Eiffel Tower in the upper left corner. Color is ineffective for queries such as this one where background colors (i.e., the blue sky here) dominate.
Figure 1-8: Possible renderings of two images generated by picking a few specific items from a large dictionary of visual events.
is very unlikely to have both the buffalos and sports car in the same scene. The statistics of real
images allow for many types of images but not too many types of visual events in one particular
image (e.g., we are very unlikely to find an image of an aircraft carrier and a herd of camels on the
slopes of Mt. Everest).
Now suppose we want to know how similar image x and y are. Since we hypothesize that images
are generated by selecting visual events, one reasonable approach is to determine the similarity
between the events in x and y. So if both x and y contain a sports car on the beach, we might
say x is very similar to y. Of course, remember that we are given images in the form of arrays
of pixel intensities. Since there is no explicit representation of visual events, we will try to make
measurements which respond when these events are present. It is as if we must somehow explain
how an image is produced from the abstract generative model of images previously described. The
key question is how to design our measurements.
1.5.2 Measurement Design
Our approach is based on extracting "selective measurements" of images which correspond to the
visual events that we are interested in. Intuitively these are relatively global, high level structural
organizations of more primitive measurements. A primitive measurement such as an edge can occur
in many places in a particular image³. We are more concerned with higher order events such as
a specific pattern of edges as in tiger or zebra stripes or a "staircase" pattern. Figure 1-9 shows
how a diagonal arrangement of vertical edges corresponds to a staircase pattern. We believe that a key characteristic of these events is that they are sparse. Measurements for these events will be
selective since they will only respond when a group of pixels are arranged in a specific pattern. A
measurement which detects whether a pixel is green or not will not be selective since many pixels
of an image may be green and furthermore many images will contain green pixels.
We have designed a filtering sequence that attempts to extract selective measurements. The first
stage of filtering uses masks which signal primitive events such as oriented edges and bars. Each
successive stage of filtering uses the same set of filters but of increasing support (the ratio of the
filter and image sizes) and is applied to the response images of the previous stage. The idea is to
capture the structural organization of lower level responses.
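The following sketch conveys this repeated-filtering idea (Python/SciPy; the rectification and the downsampling used here to grow the effective support are illustrative assumptions, and the primitive kernels are left as parameters rather than being the thesis's actual filter set):

import numpy as np
from scipy.ndimage import convolve

def filtering_tree(image, kernels, depth):
    """Each level convolves every response map from the previous level with
    every primitive kernel, so deeper levels respond to increasingly global
    arrangements of primitive events (edges, bars, etc.). The number of maps
    grows as len(kernels) ** depth, one per root-to-leaf filtering sequence."""
    responses = [image.astype(float)]
    for _ in range(depth):
        next_level = []
        for r in responses:
            for k in kernels:
                out = np.abs(convolve(r, k))      # rectified filter response
                next_level.append(out[::2, ::2])  # downsample: larger support
        responses = next_level
    return responses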
Note that by itself, the ability to uniquely identify every image is not useful. In fact the array of
pixels representation is already adequate for that purpose. However it does not suggest an obvious
way to generalize (i.e., grouping a set of images as members of a class). We can view this most
primitive representation as "overfitting" particular images. In machine learning, a classifier is a
system which when given an input outputs the corresponding class label. Just as we can trivially
build a classifier with zero training error by simply memorizing the training samples with a lookup
³For example, in any region where there is a strong intensity gradient.
Figure 1-9: The diagonal pattern of vertical edges (marked in red) arising from the stairs in this
image represents a "staircase" pattern.
table, we can just as easily assign a unique label (e.g., 1, 2, ... , N) to each sample. However it is
unlikely that two images will ever have the exact same array of pixel values, so this method will
not allow us to label new images. For example, a slight translation of all the pixels to the right
will cause a change in almost every one of the array values. However, we would still regard the
slightly shifted image as very similar to the original image. We need measurements which occur at
an intermediate level of representation. This will enable our system to compare these measurements
and use them to identify a variety of image classes. We have designed our selective measurements
to fill this need. An added benefit of selective measurements is that they can speed up learning.
Selective measurements will induce a sparse representation for images. In other words, only a few
measurements will be active for a particular class of images. Thus a learning algorithm need only
focus on those particular measurements useful for identifying images in the class.
1.6 Learning a Query
To take advantage of the selective measurements representation of images, we will use a "boosting"
framework to learn the concept implied in a class of images. In our case we would like a classifier which will tell us whether an image belongs to, say, the class of waterfall images (i.e., is similar to a set of example images of waterfalls). Often it is quite easy to construct a classifier which performs
better than chance simply by using some sort of information in the image such as whether or not
there exists a striped pattern. Using other measurements we can build more weak classifiers which
perform better than random, but not much better. Boosting [20] is a general method for improving
a weak classifier by iteratively constructing a set of them and linearly combining their outputs into a
single boosted classification. Since only a few key measurements will be meaningful for a particular
image class, the goal of the learning algorithm is to find these key measurements. Boosting starts
by constructing a classifier which uses only one measurement. It then modifies the distribution of
training example images so that incorrectly classified ones are more heavily weighted. The process
is then repeated by selecting another measurement and building a second classifier. After a few
iterations, we have a set of classifiers, each of which is like a rule of thumb for classifying an image.
Boosting ensures that when all the classifiers are linearly combined, the weighted classification is
more accurate. Figure 1-10 shows a schematic of the boosting process.
Boosting takes advantage of the learning algorithm's ability to adapt to different training data
and explicitly reweights specific training examples to alter the distribution of the training data.
Boosting enables the learning algorithm to select the key measurements most relevant for the given
query and ignores the rest. After this initial training phase, querying the entire database only requires
Figure 1-10: The boosting process iteratively constructs a set of classifiers and combines them into
a single boosted classifier.
computations with T measurements, where T is the number of boosting iterations. Empirically we
have found that 10 to 50 measurements gives reasonable results. As a practical advantage, this
makes storing the image representations in an inverted file more efficient. An inverted file indexes
the database by measurement instead of by image. For example file 1 contains measurement 1 for
all images, file 2 contains measurement 2 for all images, etc. This type of organization is more useful
for the boosting procedure which looks at one measurement in multiple images at a time.
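A minimal sketch of boosting with single-measurement classifiers follows (Python/NumPy). It implements the standard AdaBoost weight update with decision stumps [20]; the exact weak learner and update used in this thesis may differ:

import numpy as np

def boost(X, y, T):
    """X: (n, d) measurement vectors; y: labels in {-1, +1}; T: rounds.
    Returns (feature, threshold, polarity, alpha) tuples, one per round."""
    n, d = X.shape
    w = np.full(n, 1.0 / n)                    # uniform example weights
    rules = []
    for _ in range(T):
        best = None
        for j in range(d):                     # one stump per measurement
            for t in np.unique(X[:, j]):
                for p in (1.0, -1.0):
                    pred = p * np.sign(X[:, j] - t + 1e-12)
                    err = w[pred != y].sum()
                    if best is None or err < best[0]:
                        best = (err, j, t, p, pred)
        err, j, t, p, pred = best
        err = np.clip(err, 1e-10, 1.0 - 1e-10)
        alpha = 0.5 * np.log((1.0 - err) / err)  # low error earns a big vote
        w *= np.exp(-alpha * y * pred)           # reweight: emphasize mistakes
        w /= w.sum()
        rules.append((j, t, p, alpha))
    return rules

def boosted_score(x, rules):
    # weighted vote of the T single-measurement classifiers
    return sum(a * p * np.sign(x[j] - t + 1e-12) for j, t, p, a in rules)

Only the T selected measurements are touched when scoring the database, which is what makes the inverted-file organization described above effective.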
1.7 Performance Evaluation
It is difficult to evaluate image retrieval performance because ground truth does not exist. Intuitively
we all know how a good system should perform, but there does not exist an image database such
that for any given query, the optimal results are well defined. The enormity of databases and the
multitude of possible queries make constructing a standard test database difficult. Ultimately, the
best measure is correlation with human judgments of similarity. Figure 1-11 shows an example query
of the image retrieval system using selective measurements and boosting. Many people would agree
that most of the retrieved images are similar to the example images. We will carefully construct
some test databases and evaluate performance using standard measures borrowed from information
retrieval [26].
1.8 Impact
In addition to being a practical tool for image indexing, search, and categorization, an image retrieval
system must necessarily address important theoretical questions in vision and learning. The task of
looking for images similar to an example image implies a definition of similarity for images. This
definition in turn relies on an understanding of how to represent and explain images. Humans can
retrieve images with ease. By formalizing this problem and developing a system to solve it, we can
gain insight into the brain's solution. Note that the different types of possible queries (or concepts)
is very large. The image retrieval system must be general enough to measure similarity for very
different types of images, ranging from images of sunsets to crowds of people. Although developing
measurement detectors for eyes, noses, and mouths will enable queries for faces or people, they will
not work for queries of cars, waterfalls, etc. It is difficult to know how many image concepts need to
be supported. In addition, the desired image class remains unknown until a query is specified. What
Figure 1-11: A query on the system described in this thesis for images similar to the three example
images of airplanes at the top.
this calls for is a set of measurements that is general enough for a large class of images. The problem
with using many different measurements is that learning structure in large measurement spaces is
more difficult. This problem is exacerbated by the fact that we can only expect users to provide a
few example images of the class. Many traditional machine learning techniques require hundreds or
thousands of examples to learn a concept. Thus the image retrieval problem is a practical need, and
presents important challenges to computer vision and machine learning.
1.9 Why Try to Solve Image Retrieval Today?
Our approach to image retrieval describes images with visual content measurements. Finding the
measurements themselves is a difficult problem and currently most representations use undiscriminating measurements such as color histograms. In addition, learning the query with only a few
example images is a difficult task. Despite these shortcomings the advantage of this approach is that
it deals with visual content directly making the system more intuitive and natural. It also allows
for an investigation into the statistics of real images and learning with only a few examples. Thus
instead of waiting for a complete general theory of vision to be found or remaining content with text
annotation for images, our exploration with image retrieval may yield some interesting results and
point out other problems that need to be solved.
1.10 Thesis Organization
In this introduction we provided the reader with an overview of the image retrieval problem and
proposed an approach motivated by the statistics of real images. We also briefly introduced ideas
in computer vision, machine learning, and information retrieval which are relevant to this research.
The rest of the thesis details the ideas presented here.
Chapter 2 surveys the current approaches to image retrieval. Both historical approaches and
current state of the art methods will be examined. We will compare our approach to previous
methods and point out where our primary contribution lies. We will also briefly mention related
work.
Chapter 3 begins the discussion of the approach to the problem. We describe a method of extracting selective measurements from images and selecting particular measurements for the retrieval
task.
In Chapter 4, we describe how the image representation developed in Chapter 3 is used for image
retrieval. We will present the approach for learning image concepts with very little training data.
In Chapter 5, we will show the results of experiments using our approach for image retrieval. We
also discuss our method of selecting and using image data sets and various performance measures.
Chapter 6 summarizes our work and discusses future research directions.
Chapter 2
Previous Approaches
Approaches to image retrieval have been primarily exploratory. There is no satisfactory theory
for why certain measurements and classifiers should be used. Although there has been an explosive growth of digital image acquisition and storage technology, researchers have only begun to
improve image retrieval technology. These systems can be divided into feature/measurement based
approaches such as [46, 23, 6, 29, 38, 19, 43, 25, 10, 1, 34, 24] and database management system
approaches such as [33, 3]. Some common characteristics of many previous approaches are:
* Only a single positive example is used.
* Negative examples are often not considered.
* The user is required to adjust the weights (importance) of various features.
* The user often needs to be experienced with an intricate query language.
* Various features are used without a principled way of combining them.
* There is no query learning, only a fixed similarity function.
These characteristics put a heavy burden on the user to be familiar with the intricacies of the
retrieval system. Often a particular example image and set of weights will work well, but a slightly
different setting of the weights or a different example will drastically alter the retrieval. Many of
the early systems were tested on very small databases (on the order of a hundred images). This is a
small number compared to the typical size of current image collections, and a miniscule fraction of
the number of images on the World Wide Web.
2.1 Color Indexing

2.1.1 Histograms
One of the earliest approaches to image indexing used color histograms [46]. A color space such as
RGB (red, green, blue) is discretized into a fixed set of colors $c_1, \ldots, c_d$ to be used as the bins of the
histogram. The color histogram for an image is simply a table of the counts of the number of pixels
in each color bin. [46] used the following "histogram intersection" similarity function:
$$s(x, t) = \frac{\sum_{i=1}^{d} \min(x_i, t_i)}{\sum_{i=1}^{d} t_i} \qquad (2.1)$$
where x is the test image, t is the model image, and i indexes the bins in the histogram. This gives
the sum of the fractional matches between the test image and the model. If objects in x and t are
segmented from the background, then this is equivalent to the $L_1$ distance (sum of absolute values)
of the histograms treated as Euclidean vectors. Color histograms work well for a database with a
Figure 2-1: The color histograms for both images are very similar because they both contain similar amounts of "yellow", "green", and "brown". However most people would agree that these images
represent very different visual concepts.
known set of segmented objects which are distinguishable by color (i.e., the magnitude of the color
variation of the object under different photometric conditions is within the color quantization size).
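A sketch of histogram construction and the intersection measure of Equation 2.1 (Python/NumPy; the uniform 8-bins-per-channel quantization is an arbitrary illustrative choice, not [46]'s exact color space):

import numpy as np

def color_histogram(image, bins=8):
    """image: (H, W, 3) uint8 RGB array. Uniformly quantize each channel
    into `bins` levels and count pixels per color bin."""
    q = image.astype(np.int64) // (256 // bins)          # per-channel bin index
    codes = (q[..., 0] * bins + q[..., 1]) * bins + q[..., 2]
    return np.bincount(codes.ravel(), minlength=bins ** 3)

def histogram_intersection(x, t):
    # Equation 2.1: fraction of the model histogram t matched by the test x
    return np.minimum(x, t).sum() / t.sum()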
The advantages of color histograms that motivate their use are:

* Histograms are easy to compute (in $O(N)$ time using $O(d)$ space, where $N$ is the number of pixels).
* They are invariant to translation and rotation in the image plane, and to pixel resolution.
* Color is a much simpler and more consistent descriptor of deformable objects (e.g., a cloud, a sweater) than rigid shape-based representations.
The disadvantage of histograms is that we lose information about the distribution of color or
shape of colored regions. This is because the histogram is global and no attempt is made to capture any spatial information. By definition, the histogram can be computed by treating all pixels
independently, since color is a pixel level property. These extremely local measurements are then
combined into global counts. The implicit assumption is that the spatial distribution of color is
unimportant. However the simple example in Figure 2-1 shows that this is not the case. Two other
assumptions are: (1) all colors are equally important, (2) colors in the same discrete color bin (uniformly discretized) are similar. [46] performed experiments in which the model image was assumed
to be segmented so that the histograms were not corrupted by noisy backgrounds. They used a
small database of 66 images of objects well differentiated by color (e.g., different cereal boxes and
shirts). For a practical system, users would be required to segment objects in the model image. This
would require that the histograms be computed online, slowing down the overall retrieval process.
Note that no attempt was made to use multiple examples and negative examples. Also there was
no machine learning of queries. Although there are an exponential number of possible histograms
(in the number of colors m), for real images, this space is effectively much smaller because some
combinations of colors are highly unlikely. Also many dissimilar objects such as red cars and red
apples will be tightly clustered, while similar objects such as red cars and black cars will be distantly
separated in the color histogram space. So color histograms naturally only work well when the query
is adequately described by global counts of colors, and these counts are unique to images relevant
to the query.
2.1.2 Correlograms
Color correlograms augment global color histograms to describe the spatial distributions of colors [23]. The correlogram of an image is the set of pairwise distance relationships between colors:

$$\gamma_{c_i, c_j}^{(k)}(x) = \Pr_{p_1 \in c_i,\, p_2} \left[ p_2 \in c_j,\ |p_1 - p_2| = k \right]. \qquad (2.2)$$

It gives the probability that any pixel $p_1$ colored $c_i$ is within an $L_\infty$ distance (i.e., maximum vertical or horizontal distance) $k$ from any pixel $p_2$ colored $c_j$. Note that the correlogram is still global
in a way because it describes how a particular color $c_i$ is distributed across the image. It is a generalization of the simple color histogram because we can always get the pixel color counts by marginalizing the "auto correlogram" $\gamma_{c_i, c_i}^{(k)}$ over all distances $k$. To keep the correlograms small and tractable, in practice $k$ ranges discretely over a set $D$ such as $\{1, 3, 5, 7\}$ (i.e., the assumption is that large spatial correlations are not useful for similarity). In this way, color correlograms can be computed in $O(m^2 N |D|)$ time (without any optimizations) where $|D|$ is the cardinality of $D$. [23] also augmented the simple $L_1$ distance by considering a "relative" $L_1$ distance where the degree of difference in a bin is inversely proportional to the average size of the bin counts (to account for Weber's Law¹):
$$d(x, t) = \sum_{i,j,k} \frac{\left| \gamma_{c_i,c_j}^{(k)}(x) - \gamma_{c_i,c_j}^{(k)}(t) \right|}{1 + \gamma_{c_i,c_j}^{(k)}(x) + \gamma_{c_i,c_j}^{(k)}(t)}. \qquad (2.3)$$
[23] demonstrated reasonable retrieval results on experiments where a scene was imaged under different photometric conditions and underwent small transformations such as translation and
rotation. Since color is fairly stable under different lighting conditions and translation and rotation, correlograms work well for retrieving images of the same scene. However for general queries
of different scenes which are similar (e.g., images of cars of different colors), color alone is not a
discriminative enough measurement as previously discussed.
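A direct, unoptimized sketch of the auto-correlogram follows, assuming a 2D array of quantized color indices; the simplified boundary handling and the name auto_correlogram are ours, not details from [23]:

import numpy as np

def auto_correlogram(img, n_colors, distances=(1, 3, 5, 7)):
    """gamma[c, k]: probability that a pixel at L_inf distance k from a
    pixel of color c also has color c. img holds integer color indices."""
    h, w = img.shape
    counts = np.bincount(img.ravel(), minlength=n_colors).astype(float)
    gamma = np.zeros((n_colors, len(distances)))
    for ki, k in enumerate(distances):
        # all displacements whose L_inf length is exactly k
        offsets = [(dy, dx) for dy in range(-k, k + 1)
                   for dx in range(-k, k + 1) if max(abs(dy), abs(dx)) == k]
        pairs = np.zeros(n_colors)
        for dy, dx in offsets:
            a = img[max(0, dy):h + min(0, dy), max(0, dx):w + min(0, dx)]
            b = img[max(0, -dy):h + min(0, -dy), max(0, -dx):w + min(0, -dx)]
            same = a[a == b]                 # co-occurring equal colors
            pairs += np.bincount(same, minlength=n_colors)
        # denominator ignores boundary truncation for brevity
        denom = counts * len(offsets)
        gamma[:, ki] = np.where(denom > 0, pairs / denom, 0.0)
    return gamma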
2.2 Color, Shape, Texture

2.2.1 QBIC
QBIC (Query By Image Content) [19] is a highly integrated image retrieval system that incorporates
many types of features/measurements. Queries may be specified with an example image or by a
sketch. Text annotations can be used as well. QBIC can also process video sequences by breaking
them into representative frames and using the still image tools to process each frame. Users define a
query using average color, color histograms, shape, and texture (although these can only be selected
from a predefined set of sample textures). Users are also required to weight the relative importance
of these features. Shape is described by first doing a foreground/background analysis to extract
objects. Then edge boundaries of the object are found. Comparing shapes is fairly expensive even
though QBIC uses dynamic programming [12]. In addition, histogram similarity is measured using
a quadratic distance function
d(x, y) = (x - y)^T A (x - y) \qquad (2.4)
where the matrix A specifies some notion of similarity between pairs of colors. A few prefiltering
schemes are used to attempt to speed up retrieval since computing a quadratic distance can be slow.
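A sketch of the quadratic comparison in Equation 2.4 follows; the construction of A from representative bin colors is one common choice and only an assumption here, not necessarily QBIC's exact matrix:

import numpy as np

def quadratic_distance(x, y, A):
    """d(x, y) = (x - y)^T A (x - y) over histogram vectors (Equation 2.4)."""
    d = x - y
    return float(d @ A @ d)

def similarity_matrix(bin_colors):
    """One plausible A: a_ij = 1 - d_ij / d_max, where d_ij is the distance
    between the representative colors of bins i and j."""
    diff = bin_colors[:, None, :] - bin_colors[None, :, :]
    dist = np.linalg.norm(diff, axis=-1)
    return 1.0 - dist / dist.max()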
Although QBIC offers a slew of features, they are not much more discriminative than those used by
color indexing approaches. Users must also be intimately acquainted with the system to properly
weight the relative importance of features.
2.2.2 CANDID
CANDID (Comparison Algorithm for Navigating Digital Image Databases) [25] attempts to probabilistically model the distribution of color, shape, and texture in an image. At each pixel in an
image, localized color, shape, texture features are computed. A mixture of gaussians probability
density is used to describe the spatial distribution of these features in the image. A mixture density
is

p(x) = \sum_{j=1}^{M} p(x \mid j)\, P(j) \qquad (2.5)
where P(j) is the prior probability of x being generated from gaussian j. In particular, P(j) must
satisfy $\sum_j P(j) = 1$. The parameters for the mixture model are estimated using the K-means
clustering algorithm [8]. The similarity of two images $I_1$ and $I_2$ is measured using a normalized
inner product similarity function
\mathrm{nsim}(P_{I_1}, P_{I_2}) = \frac{\int_R P_{I_1}(x)\, P_{I_2}(x)\, dx}{\left[ \int_R P_{I_1}(x)^2\, dx \int_R P_{I_2}(x)^2\, dx \right]^{1/2}} \qquad (2.6)
which is the cosine of the angle between the two distribution functions. The mixture of gaussians
model avoids the need to arbitrarily designate discrete bins when using histograms. It does require
choosing in advance the number of components or clusters M. An added advantage of modeling
the distribution is that it is possible to visualize the relative contribution of individual pixels to the
overall similarity score. [25] achieved good results on restricted databases of satellite and pulmonary
CT scan images with 100 and 200 total images respectively. The primary disadvantage of CANDID
is that both the density estimation and querying phases are relatively slow and will not scale well
to larger databases.
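A discrete stand-in for Equation 2.6: if both densities are evaluated on a common grid of feature-space points, the normalized inner product reduces to a cosine between two vectors (the grid evaluation is our simplification):

import numpy as np

def normalized_similarity(p1, p2):
    """Cosine of the angle between two density functions sampled on the
    same grid; 1.0 when the distributions are proportional."""
    den = np.sqrt(np.sum(p1 ** 2) * np.sum(p2 ** 2))
    return float(np.sum(p1 * p2) / den) if den > 0 else 0.0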
2.2.3 BlobWorld
In the BlobWorld [6] system, images are represented as a set of "blobs" or 2D ellipses of arbitrary
eccentricity, orientation, and size. Blobs are constrained to be approximately homogeneous in color
and texture. This representation is designed to reduce the dimensionality of the input image while
retaining discriminatory information (i.e., the assumption that homogeneous blobs are adequate for
discrimination). By using blobs of coherent color instead of global color histograms, some local spatial information is preserved. In this respect, blobs are similar to correlograms with small distances.
Texture is measured by the moment matrix of the horizontal and vertical image gradients, and a
''polarity" term which measures the extent to which the gradients agree.
To cluster points along the color and texture dimensions, the expectation-maximization (EM)
algorithm [8] is used to fit a mixture of gaussians model. The algorithm iteratively tries to find
the parameters for each gaussian such that the log likelihood of the data under the parameters is
maximized. The number of clusters chosen ranges from two to five, and is selected based on the
minimum number of clusters which fit the data adequately (this means using one fewer cluster if
the log likelihood does not drop too much). In effect, this clustering determines a discrete set of
prototype colors and textures. After clusters are found, a majority voting and connected components
algorithm is used to group pixels into blobs. EM is then used again to find the two dominant colors
and mean texture within a blob. The spatial centroid and scatter of the blob are also computed.
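A sketch of the cluster-count selection just described, using scikit-learn's EM-based GaussianMixture; the tolerance parameter tol is a hypothetical knob standing in for "does not drop too much":

import numpy as np
from sklearn.mixture import GaussianMixture

def fit_blob_clusters(features, max_components=5, tol=0.05):
    """Fit 2..max_components gaussians with EM and keep the smallest model
    whose mean log likelihood is within tol of the next larger model."""
    models = [GaussianMixture(n_components=k, random_state=0).fit(features)
              for k in range(2, max_components + 1)]
    scores = [m.score(features) for m in models]   # mean log likelihood
    best = len(models) - 1
    while best > 0 and scores[best - 1] >= scores[best] - tol:
        best -= 1                                  # prefer one fewer cluster
    return models[best]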
To formulate a query, the user first submits an image. The system returns the blob representation
of the image. The user then chooses some blobs to match and weights their relative importance.
Each blob is ranked using a diagonal Mahalanobis distance, and the total score is combined using
fuzzy logic operations on the blob matches.
[6] shows experiments in which BlobWorld outperforms the simple color histogram approach of
[46].
2.2.4 Photobook
The philosophy of Photobook [34] is to index images in a way that preserves enough information to
reconstruct the original image. To achieve this, the system is divided into three separate subsystems.
The first subsystem attempts to capture the overall "appearance" of the image. This is done using
principal components analysis [48]. This method describes an image by the deviation from an average
image along the principal axes of variation. The second subsystem captures information about an
image's 2D shape. Color differencing is used to extract the foreground object. An interconnected
shape model is then built for the object. A canonical shape is determined from the stiffness matrix
using finite element methods [34]. The eigenvectors of this matrix encode the deformations from
the canonical shape. The texture subsystem encodes images with a Wold decomposition [34]. This
roughly corresponds to the periodicity, directionality and randomness of the texture.
Although Photobook has been shown to perform well, the tests were on limited types of images
such as faces of white males over 40, mechanical tools, and cloth samples for curtains. The primary
disadvantage of Photobook is that an image must first be assigned strictly to one of the subsystems.
Queries are limited within a subsystem. This essentially turns the image retrieval problem into a
classification problem where the three umbrella classes of appearance, 2D shape, and texture are
defined a priori. This pigeonholes many types of images such as those of animals which have both
characteristic shapes and textures. Photobook resorts to complicated detection algorithms which
try to identify particular types of images such as faces in order to accomplish this classification.
Sophisticated techniques for aligning and scaling objects to establish correspondences between pixel
locations and image features are also critical for principal component analysis. The representation
used by Photobook actually preserves enough information to reconstruct the image. However, much
of this information may not be relevant for a query.
2.2.5 JACOB
JACOB (Just A COntent-Based query system for video databases) [10] is a two step retrieval system
using color histograms and normalized axial moments to describe shape. Queries are specified by
example. The first step uses a weighted $L_1$ distance on the histograms to filter the database and prune
very "dissimilar" images. The second step compares shape using a normalized shape correlation
function [10]. An edge density measure that is simply the proportion of intensity gradients above
a certain threshold was proposed as a texture feature. Much of the research with JACOB involved
identifying representative image frames in video sequences using motion information and a neural
network.
2.2.6 CONIVAS
CONIVAS (CONtent-based Image and Video Access System) [1] lets users specify queries using an
example or a sketch. Local color and edges are used as features in an information theoretic similarity
function [1].
2.3 Wavelets
Wavelets represent a computer graphics approach to image retrieval. Wavelet representations are
an effective way to compress images [44]. [24] used the Haar wavelet transform to index images.
Treating an image as a function x(m, n) that specifies the pixel value at location (m, n), the transform
decomposes the image into a set of basis coefficients $c_{j,k}$, each of which represents the value of a basis
function $\psi_{j,k}(\cdot)$. Each basis is a dilation and translation of a single "mother" wavelet. This makes the transform
particularly simple and efficient to compute. The Haar wavelet is defined as:
\psi(x) = \begin{cases} 1 & \text{if } 0 \le x < 1/2 \\ -1 & \text{if } 1/2 \le x < 1 \\ 0 & \text{otherwise} \end{cases}
The Haar basis consists of the functions $\psi_{j,k}(x) = 2^{j/2}\,\psi(2^j x - k)$ for $j, k = \ldots, -2, -1, 0, 1, 2, \ldots$. These
are the translated and dilated versions of the mother wavelet. Note that the Haar transform essentially computes the intensity difference between the two halves of the image. The different basis
functions compute these differences at different locations and scales.
[24] truncated small coefficients to zero and used a weighted $L_1$ distance on the vectors of coefficients:

d(x, y) = \sum_{j,k} w_{j,k} \left| c^x_{j,k} - c^y_{j,k} \right| \qquad (2.7)

The weights $w_{j,k}$ were determined by optimizing the distances with respect to a preselected set of
training images.
The primary motivation for using wavelets as an image representation is that the largest coefficients can be used to compress the image. Thus the intuition is that we can measure the similarity
between images by calculating the difference between their largest coefficients. The Haar basis approximates an "edge" detector by assigning large coefficients to high frequency areas of the image.
This is done across all scales of the image. The decomposition is also fairly efficient to compute in
O(N) time where N is the number of pixels in the image.
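A sketch of the one-dimensional Haar decomposition, in the unnormalized average/difference form (a 2D image transform applies it along rows and then columns); the recursion makes the O(N) cost visible:

import numpy as np

def haar_1d(signal):
    """Recursively split a power-of-two-length signal into averages and
    differences; the differences are the detail coefficients per scale."""
    x = np.asarray(signal, dtype=float)
    details = []
    while len(x) > 1:
        details.append((x[0::2] - x[1::2]) / 2.0)  # detail at this scale
        x = (x[0::2] + x[1::2]) / 2.0              # coarser approximation
    details.append(x)                              # overall average
    return np.concatenate(details[::-1])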
Although Haar wavelets are well suited for storing and reproducing images at multiple resolutions, retrieving images does not explicitly require generating them. Also, the link between optimal
compression and optimal discrimination is tenuous. One of the major problems of the Haar transform is that it is not invariant to translation. Thus shifting an image will make it very "different"
from the unshifted image. Besides that, edges may not be discriminating enough for image retrieval.
2.4 Templates

2.4.1 VisualSEEk
VisualSEEk [43] compares images using spatial arrangements of colored regions. A query is broken
into a fast phase using only color features and a slow phase which compares spatial relationships
between colored regions. Instead of color histograms, VisualSEEk uses "color sets" which are bit
vectors which index a color space. A color set can be computed from a histogram by thresholding
the counts in each bin to {0, 1}. The use of color sets is predicated on the assumption that only
a few colors dominate an image region. It also speeds up distance calculations. Regions for a
color set are found by labeling the pixels of the image according to their color bins. This labeled image is
then filtered and then minimum bounding rectangles are designated as regions. The first phase of
querying acts primarily as a filter which prunes away images which do not satisfy the color criterion.
It uses a quadratic distance function similar to that of QBIC. Queries may be formed using the
area, spatial extent, or absolute location of a region based on its centroid and minimum bounding
rectangle. Matching spatial relationships in the second phase is made tractable by using a "2D
string" representation. An example is "(region 1 < region 2, region 3 > region 4; region 1 > region
3 )" where "< / >" may mean "left/right of" in the first part of the string and "above/below" in
the second part.
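A sketch of the color-set construction, assuming a precomputed histogram; the threshold value is illustrative:

import numpy as np

def color_set(histogram, threshold=0.05):
    """Binarize a histogram into a color set: 1 for every bin whose
    normalized count exceeds the threshold."""
    h = histogram / histogram.sum()
    return (h > threshold).astype(np.uint8)

def may_match(set_a, set_b):
    """First-phase filter: two regions can only match if their color
    sets share at least one active bin."""
    return bool(np.any(set_a & set_b))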
In VisualSEEk, the user is required to sketch the regions and spatial relationships between
regions. For some queries in a 3,100 image database, using color sets alone performed almost as well
as color histograms. Adding spatial relationship constraints improved queries for images such as
sunsets where well defined colored regions exist. However tests with a synthetic database of images
with uniform color regions found that the method of finding colored regions and extracting spatial
relationships performed significantly worse than using manually entered ground truth regions and
relationships.
2.4.2 Flexible Templates
The absolute values of the intensities in an image vary greatly under different illumination conditions. This makes simple template matching ineffective for recognition. However, the intensity
ratios between different regions in an image are more stable. This idea of a ratio template has been
used to define image invariants for object recognition [42]. It has also been applied to natural scene
classification where the templates capture the qualitative spatial and color relationships between
Figure 2-2: A hand-crafted template for images of waterfall scenes. Here the waterfall concept is
defined to be a white region in between two green regions with a blue region on top.
image regions [29]. The reasoning behind flexible templates is that global, low frequency qualitative
measurements are adequate for describing many image concepts. For example, a waterfall template
can be defined as a white region in between two green regions with a blue region on top as shown
in Figure 2-2. Specifically, an image is first divided into regions of coherent color. Then the differential color and spatial relationships between these regions are extracted as measurements for
classification. [29] hand-crafted flexible templates for natural scene classification (e.g., mountains,
fields, waterfalls) and found that simple relationships were sufficient for effective results. This approach can be used with other measurements of image regions besides color such as dominant edge
orientation. Interestingly, we can also obtain image log intensity ratios by taking the difference between
log intensities: $\log x - \log y = \log(x/y)$.
2.4.3 Multiple Instance Learning and Diverse Density
To use flexible templates for image retrieval, templates must be learned online because it is impractical to predict what types of image concepts will be queried. [38] used a multiple instance learning
framework to automatically construct templates for natural scene classification. They observed that
an image can be considered a "bag" of instances. In particular, rows, blobs, and blobs with neighbors (all sets of pixels) were used as the possible types of instances. Images relevant to a concept are
labeled positive because somewhere in the image is an instance of the target concept (e.g., a blob of
white in between two blobs of green for a waterfall). Bags are labeled negative if all the instances
are irrelevant to the concept. The goal is to find the instances consistent across the positive bags
and different from the negative bags. To do this the learning algorithm searches for the point in
instance space with highest "diverse density". This point is close to at least one instance in every
positive bag and far from every negative instance. A test image is relevant if one of its instances is
close to this diverse density point. More formally, the maximum diverse density point t is found as:
\arg\max_t \prod_i \Pr(t \mid B_i^+) \prod_i \Pr(t \mid B_i^-) \qquad (2.8)
where $B_i^+$ is a positive bag and $B_i^-$ is a negative bag. This is equivalent to maximizing the likelihood
$\Pr(B_1^+, \ldots, B_n^+, B_1^-, \ldots, B_m^- \mid t)$ under the assumption of a uniform prior on the maximum diverse
density point and conditional independence of the bags given the point. In practice, the conditional
probabilities are modeled as gaussians, and t is found by using gradient ascent from multiple starting
points. [38] found that their results improved if there was feedback from the user indicating further
positive and negative examples. They also found that more complex instances improved the number
of relevant images retrieved, but took a very long time to learn. A key observation made by [38] is
that more discriminating measurements make concept learning simpler.
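A sketch of evaluating the diverse density objective at a candidate point t, under the common noisy-or instantiation with gaussian instance kernels (the noisy-or model and the scale parameter are assumptions, one standard way to realize Equation 2.8):

import numpy as np

def diverse_density(t, pos_bags, neg_bags, scale=1.0):
    """Objective of Equation 2.8 at point t. Each bag is an array of
    instance vectors; higher values mean t is near an instance in every
    positive bag and far from all negative instances."""
    def pr_concept(bag):
        # noisy-or: probability that some instance in the bag matches t
        p = np.exp(-scale * np.sum((bag - t) ** 2, axis=1))
        return 1.0 - np.prod(1.0 - p)
    dd = 1.0
    for bag in pos_bags:
        dd *= pr_concept(bag)
    for bag in neg_bags:
        dd *= 1.0 - pr_concept(bag)
    return dd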
Figure 2-3: An E-R diagram for an image of a sailboat.
2.5 Database Management Systems

2.5.1 Chabot
Chabot [33] is primarily a database management system that incorporates color histograms. All the
other features are manually entered keywords such as name, place, and short descriptions. Its goal
was to integrate textual meta data with visual content. By using a powerful relational database, the
system enables complex queries using a rich query language. However, the database only returns
exact matches and does not rank images according to similarity. Since the only type of direct visual
content is color, the system is often effective only when used in conjunction with keywords. In
fact, precision was as low as 5.8% indicating very low discrimination when using color alone. Using
keywords exclusively often made the system too stringent with recalls as low as 18.1%. In practice,
it was found that users were required to have experience with the database and knowledge of how
to formulate effective queries.
2.5.2 SCORE
SCORE (System for COntent-based approximate image REtrieval) is also primarily a database
management system. In fact, it does not use any direct visual properties. The visual content of
an image is described with manually entered attributes in the form of an entity-relationship (E-R)
diagram, an example of which is shown in Figure 2-3. Although the E-R diagram appears descriptive,
note that it has to be manually entered. Also the diagrams are often inconsistent in that some users
will describe the image in Figure 2-3 as a "ship on the sea". Queries must also contain the exact
words used in the E-R diagram such as "azure sky" instead of "blue sky". Some success in grouping
adjectives which describe similar visual content has been achieved with the use of a thesaurus. As
with Chabot, example-based queries are not possible. Instead users are required to be familiar with
SCORE's specific query language.
2.6 Summary of Previous Approaches
The primary limitation of the approaches described is that the visual measurements are not very
discriminative. For example, in arbitrary real images, red can occur in scenes of sunsets, red sports
cars, roses, clothing, etc. Color is not discriminative enough with respect to a large set of real images.
So even complex relationships between colors will be inappropriate for many query concepts. Other
low level features such as horizontal and vertical edges are also undiscriminative. In addition, they are also unselective since almost all images will contain some horizontal and vertical edges.
system               features                     spatial         queries
[46]                 color                        global          example
correlograms         color                        global          example
BlobWorld            color, shape, texture        local           example, weighting
flexible templates   color                        relationships   positives/negatives
QBIC                 color, shape, texture        global          example, weighting
VisualSEEk           color                        relationships   sketch
CANDID               color, shape, texture        local           example
JACOB                color, shape                 global          example
CONIVAS              color, edges                 local           example, sketch
Photobook            appearance, shape, texture   global          example
wavelets             texture                      local           example, sketch
Chabot               color, keywords              global          database query
SCORE                keywords                     global          database query

Table 2.1: A comparison of different image retrieval systems.
Many
early approaches used a static similarity function. This makes it difficult to automatically tune
the system for a particular query. More recent approaches apply some sort of query learning, but
usually the learning is necessarily complicated and therefore slow because the measurements are not
discriminating enough. Our approach computes highly selective measurements and leverages the
strong discriminatory nature of these measurements to use simple learning with boosting.
2.7 Related Work

2.7.1 Information Retrieval
One way to represent an image is via a list of words. For example, an image of the Eiffel Tower
might be indexed with words such as 'Eiffel', 'Tower', and 'Paris'. Other possible words are 'tourism',
'tour' (French for 'tower'), and '1889' (the year it was built). In accordance with the familiar phrase, "a picture
is worth more than a thousand words", many thousands of words can be associated with an image.
Unfortunately, currently there is no method of extracting all the associated words given an image.
Moving from the visual input of an image to a textual representation will necessarily result in a loss
of visual information. We could label images with words in this way if we had a system to perform
object recognition on an image. However, even with this approach the words would only give a list
of the objects in the image; for an image of the Eiffel Tower, the words that such a system may
provide could be "tower, sky, ground." Descriptions such as "metal, skeletal structure located in the
middle of a plaza" would be hard to come by. Developing a rich word-based description of images
would require extracting the relationships between objects in an image, describing the background,
etc. An entire article on the Eiffel Tower would be needed for a single image. Evidence for the
inadequacy of using only text to describe an image was shown in the database management systems
approaches to image retrieval. Ultimately, text may not be rich enough, nor are users consistent
enough in choosing effective keywords.
Much of the success in text document information retrieval is due to the rich semantic content of
certain words. For example, given the word "Eiffel" in a document, there is a very high probability
that the document is relevant to the Eiffel Tower. For example, an image search on the commercial
AltaVista [2] system with the keyword "Eiffel" is shown in Figure 2-4. Although the query is
successful, a further query for other images which are "visually similar" to the first image of the
Eiffel Tower returns an unsatisfying result as shown in Figure 1-7. We may try to use text indexing
Figure 2-4: A query on the AltaVista system with the keyword "Eiffel" that illustrates the rich
semantic content of particular words. Since an image would probably be labeled "Eiffel" only if it
contained an image of the Eiffel Tower, text is effective for this query.
for image retrieval but often intuitive keywords for searching turn out to be unsuccessful. An example
of a query using the keyword "jets" on the same system is shown in Figure 2-5. From this example
we can see that the word "jets" is not descriptive enough to select images of air transport vehicles.
Since images are digitally represented as collections of pixel intensities, we cannot say much about an
image given the value of a single pixel. Like the word "jets", but in a more extreme way, whether or
not a particular pixel is green will not tell us much about the visual content of the image. However
given a more selective image measurement, we can approximate the kind of semantic content present
in words more closely. So instead of trying to create a text-based description of an image, we will
try to generate a representation based directly on visual content.
2.7.2 Object Recognition
A related area of computer vision research is object recognition. Indeed this is still the subject of
much work today. The two primary approaches to this problem are: (1) building 3D models for
recognition [30], and (2) recognizing objects directly using only 2D views [35].
Unfortunately much of this research is only applicable when the image being considered contains
a single object. Object recognition systems often assume segmentation as a preprocessing step
where each object in the image has been segmented into its own spatial region. Within a single
real image, many objects may be found and objects often occlude one another. Also, some objects
Figure 2-5: A query on the AltaVista system with the keyword "jets". Although we wanted images
of jet propulsion vehicles, the system retrieved images of the Jets football team.
are perhaps better considered as mass objects such as sand, grass, and sky. They defy a simple 3D
model and may be more appropriately labeled as texture. In addition, heavily deformable objects
such as clothing can be extremely difficult to model. Even with a perfect object recognition system,
one still needs to develop a good representation for the relationships between different objects in an
image in order to provide a complete scene description.
Work on object detection such as face detection has met with much success, achieving higher than
90% accuracy rates on various test data [45, 40]. However these systems are often quite complicated
and finely tuned for detecting only one type of object such as faces. Building a detector for every
type of object may be possible, but current approaches make it impractical to combine all of these
detectors (we might need thousands of these) into an image retrieval system fast enough for online
querying.
2.7.3 Segmentation
Another area of historical computer vision research that remains active today is image segmentation
[30]. This can be considered the dual of edge detection because the perfect edge detector should
find all the boundaries between different regions while the perfect segmentation algorithm should
label all the different regions bounded by the edges. For example, in a world with only polyhedral
objects, the segmentation algorithm should find all the regions which correspond to the faces of the
polyhedrons. In real images of scenes such as the one shown in Figure 1-5 it is often difficult to
precisely formulate the segmentation problem. Should each suspension bridge cable be segmented
into a different region or should they all be considered part of a single spatially disconnected region?
Should each leaf of the bushes be its own region or should the entire bushy area be grouped into
one region? Often the "right" segmentation is not well defined and depends on the subsequent use
of the segmentation. If the goal is to find a particular leaf, then each leaf should be segmented into
its own region. However if the goal is to identify bushy areas then all the leaves should be combined
into a single region. Segmentation may aid an image retrieval system by breaking down the overall
similarity into a sum of the similarities between image regions, but the desired segmentation is ill
defined and comparing multiple regions may still be a difficult task.
2.7.4 Line Drawing Interpretation
There is a large body of research in computer vision on problems that are closely related to the
abstract problem of image understanding. Much of the early work in computer vision concentrated on
finding edges within images, with the assumption that edges provided the most useful information
for explaining an image. Indeed this is true for line drawings, and success was achieved in interpreting
scenes of polyhedral objects. With information about the intensities of the regions between lines, it
is possible to extract 3D models of polyhedral objects [30]. This is mainly due to the simplifications
introduced by considering only polyhedral (as opposed to curved) objects with faces (regions) of
uniform brightness. The constraints in the polyhedral image world allow for a relatively complete
and compact description of scenes and are well suited for edge detection methods which depend on
sharp intensity discontinuities at an edge. We have a good theory of polyhedral scenes and practical
applications such as the Copy Demo (a system that analyzed images of toy blocks and was able to
control a manipulator to move these blocks). However, it is difficult to apply these methods to image
retrieval because real images are much more complicated than line drawings of polyhedral objects.
Most objects in the world are not strictly polyhedral, and although we may be able to interpret a line
drawing to some degree, producing the line drawing from an actual image remains unsolved. In real
images, regions seldom have constant brightness (i.e., they often have highly textured regions) and
discontinuities between regions are often smoothed out by noise. Successes in scene analysis have
been limited to image domains such as line drawings and images of artificial objects such as toy
blocks [18] making this research not directly applicable to the image retrieval problem.
Chapter 3
Selective Measurements
3.1 Motivation
Chapter 2 reviewed various approaches to image retrieval. The primary measurements used to represent images often centered on color (e.g., histograms, correlograms, spatial relationships between
regions) [46, 23, 29, 38, 19, 43]. Some methods also extracted measurements based on textures
and shapes [19, 25, 10, 1, 34, 24]. These measurements were of limited sophistication and often
heavily constrained to particular types of images such as faces. Various methods of computing the
similarity between images were used, from simple Euclidean distances to dot products of probability
distributions. Different methods were also used to learn what measurements were important to a
particular query from hand-weighting to multiple instance learning. And although these approaches
show promise in many areas, image retrieval technology is still not at the level demanded by users.
This situation calls for an approach that does not automatically assume that color is the best
measurement to make. It is no surprise that different approaches using similar measurements have hit
an impasse. One conclusion is that no matter how sophisticated the classifier, retrieval effectiveness
is inherently limited by the discriminative power of the measurements. Thus, it may be fruitful to
search for better measurements. With more powerful measurements, perhaps even a very simple
classifier might do the job. Our approach builds on the ideas in [16].
3.1.1 Preprocessing vs. Postprocessing
Before an image is presented to a classifier, it is often preprocessed into a more suitable representation. This is commonly done to reduce the input to a more tractable dimensionality¹. Preprocessing
often takes the form of feature/measurement extraction. This is usually guided by the domain
knowledge specific to the problem. For example, it might be useful to detect edges in an image.
Usually, the more preprocessing we do, the less complex we have to make the classifier. If the measurements
correspond exactly to the class labels, the classifier is trivial. On the other hand, if the classifier
must work with the raw input, it might need more complex computations to achieve a similar level
of performance.
This issue manifests itself in image retrieval as how much image indexing should be done offline
versus at query time. Since many image processing operations are computationally expensive even
for today's computers, it is impractical to defer all measurements until a query is made. This
would force the same measurements to be recomputed for every image in the database each time
a query is made. Thus the image database is really a database of image re-presentations, often
in a very different form than the raw "array-of-pixels" representation. Besides deciding how much
preprocessing to do, it is more important to decide what type of preprocessing should be done. Many
current image retrieval systems do not stress the importance of having the right measurements in
the first place. For example, systems which only use color histograms automatically limit retrieval
¹A 512x512 pixel image has 262,144 dimensions, one for each pixel.
Figure 3-1: The image on the right is a permutation of the pixels of the image of the car (left).
Using too general an image representation allows for these kinds of unrealistic images.
effectiveness because color alone cannot distinguish between say an image of a red sports car and a
red rose. In the limit, we could design measurements to detect particular visual concepts such as
cars or flowers. However this would render our system capable of handling only those two concepts.
So although hard, specific decisions at the outset may make image retrieval more efficient, it reduces
flexibility.
3.1.2 Image Representations
A completely generic representation of images is of little use. A very general representation is to
choose each pixel as a dimension in Euclidean $R^d$ space where d is the number of pixels. The
problem with this representation is that it is overly general² because it can model arbitrary images,
not just real ones³. For example, we can take an image of a car and permute the pixels as shown
in Figure 3-1. The result is a valid point in Rd space but bears no resemblance to any image one
would normally see. This representation is also completely local since it considers only pointwise
measurements (where we take each pixel or dimension as a measurement). We stated that a highly
specific representation could make machine learning easy, but would limit the system to only the
concepts implied in the measurements. A general representation is highly underconstrained and
makes learning more difficult.
Using a single measurement for each concept severely limits the total number of distinguishable
concepts. With d measurements, only d concepts are represented⁴. The pattern for a particular
concept is represented as completely local activity in one and only one measurement. Since the
classifier only has to check which measurement is active to identify the concept, the measurements
must take into account the entire image (i.e., the support for the measurement is global). If the
measurement is only influenced by part of an image (i.e., local support), then it cannot guarantee
that the visual content in the other parts does not affect the image concept. In addition, the concepts
must be known a priori in order to construct detectors for them. This information is not available
in image retrieval. Too much a priori modeling of the measurements drastically limits the capacity
of the representation to capture new objects. In fact, an image can only belong to one concept; hierarchical concepts are not possible. These measurements are too specific and overconstrained.
At the other extreme is a completely distributed representation. This maximizes the capacity of
the measurement space since concepts are defined by activity across all the measurements. With a
completely distributed binary code, $2^d$ concepts are possible. The measurements can have completely
local support because the classifier will take into account information from all the measurements.
The problem with this type of representation is that it can make learning a discriminant difficult
because the classifier must always consider all of the measurements. Some measurements may be
correlated while others may not be discriminating. The crosstalk between patterns is larger since
all measurements are likely to be active.
²In fact, it is maximally general.
³Each pixel dimension can be adjusted independently to yield all possible images.
⁴This is the proverbial "grandmother cell" which fires only when you see your grandmother [22].
3.1.3 Sparse Representations
One solution is to develop a set of measurements that is neither too local nor too global in terms
of image support and neither too constrained nor too general in terms of representation. Each
measurement should measure some part of the visual content in an image. Concepts should be
based on some of these measurements but not all of them. These are the characteristics of a sparse
representation where an image is coded as the activity of a set of measurements. In a sparse code,
an image is represented with only a few measurements. For the code to be useful in distinguishing
between images, it must also be distributed. That is, for one image, measurement A might return
large values, but for another image, A could be small. The average value of each measurement over
all images is then approximately equal. This type of intermediate representation is motivated by
the sparse causal structure of images. Note that of all the possible visual items that may occur in
an image, for any particular image, only a few will be present. For example, although an image
could be of a car, a boat, an airplane, etc., a particular image will probably contain only one of
those objects. If we regard the items as causes, we should design our measurements to respect
the statistics of these causes. This is based on the assumption that the similarity between images
corresponds to the similarity between the causes. A sparse representation closely approximates a
parts-based model that can account for many objects but not all possible things since we want to
restrict ourselves to real images. As an analogy, consider a text document. Individual characters are
very local measurements, while the entire string of characters is a completely global measurement. It
turns out that an intermediate level representation such as words is more useful for retrieval. Many
possible words (tens of thousands) can occur in a document, but a particular document will only
contain a few (hundred) unique words. Our goal is to capture the visual events (words) in an image
(document).
In image retrieval, the classifier must be trained at query time because the target class remains
unknown until a query is defined. Since human users will only tolerate a certain amount of response
delay, training must be fast. Thus we can choose which measurements to use not only by their
discriminative power but also by how well they facilitate fast learning. We can also try to use
classifiers which will learn quickly for many types of measurements. In other words, we will use
domain knowledge to guide both the measurements we use and our choice of classifier. A sparse
representation trades off the advantages and disadvantages of strictly local-specific and global-general
representations. The image retrieval problem may not permit a small set of dense measurements
to represent the many possible image concepts. A high dimensional representation will increase the
capacity but might make learning too difficult because only a few example images are available. Our
strategy will be to start off with a large number of possible measurements and then quickly pick a
few which are discriminating for the desired concept.
3.2 Design
The goal of selective measurements is to signify interesting visual content in an image. Clearly a
single pixel does not convey much interesting information about the image (i.e., many images will
contain a green pixel). We can consider image patches of pixels (e.g., 9x9 neighborhood) and look
at their principal components [22]. These are the linear projection axes that maximize the variance
over all image patches and for a gaussian distribution also preserve the most information about
the patches [28]. These measurements are local (with respect to the entire image) and linear, and
so do not explicitly capture the global structure of the image. We can capture more global structure by computing the principal components of the rectified responses to the small scale principal
components. To capture more large scale structure, the responses can be subsampled before being
processed. This operation can be repeated until we have captured the global structure of the image.
So instead of just finding horizontal and vertical edges, we measure particular arrangements of these
edges so that we have good detectors for gratings and other more complex and interesting visual
content. We have observed that the principal components of the first level responses are qualitatively
similar to the first level principal components (i.e., horizontal arrangements of horizontal edges are
common since these signify horizontal lines in an image). This was done by obtaining 1000 random
Figure 3-2: On the left are the principal components of image patches. On the right are the principal
components of the responses to the fourth principal component on the left (a bar-like filter).
9x9 patches from a set of natural images. The patches were packed into 81 dimensional vectors. The
principal components were found by taking the eigenvectors corresponding to the largest eigenvalues
of the covariance matrix of the vectors. For vectors $x_1, \ldots, x_N$, the sample covariance matrix is:

\Sigma = \frac{1}{N} \sum_{n=1}^{N} (x_n - \bar{x})(x_n - \bar{x})^T \qquad (3.1)

where $\bar{x}$ is the sample mean. The eigenvectors u are the vectors that satisfy:

\Sigma u = \lambda u \qquad (3.2)

where $\lambda$ is the corresponding eigenvalue. A set of response images was generated by filtering (convolving) the same set of images with the fourth principal component. The principal components of
these response images were then found as before. Figure 3-2 shows the similarities between the principal components and the first level response principal components. Similar results were obtained
when filtering with other principal components. Since the filters at different levels are similar, the
entire computation is simplified because the same filters and computations can be repeated.
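A sketch of the patch experiment just described (Equations 3.1 and 3.2): sample random 9x9 patches, form the covariance matrix, and take the eigenvectors with the largest eigenvalues.

import numpy as np

def patch_principal_components(images, n_patches=1000, size=9, seed=0):
    """Principal components of random size x size patches from a list of
    grayscale images (2D arrays)."""
    rng = np.random.default_rng(seed)
    patches = []
    for _ in range(n_patches):
        img = images[rng.integers(len(images))]
        y = rng.integers(img.shape[0] - size + 1)
        x = rng.integers(img.shape[1] - size + 1)
        patches.append(img[y:y + size, x:x + size].ravel())
    X = np.asarray(patches, dtype=float)
    X -= X.mean(axis=0)                    # center: subtract the sample mean
    cov = X.T @ X / len(X)                 # sample covariance (Eq. 3.1)
    evals, evecs = np.linalg.eigh(cov)     # solves cov u = lambda u (Eq. 3.2)
    order = np.argsort(evals)[::-1]        # largest eigenvalues first
    return evecs[:, order], evals[order]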
3.2.1 Filters
The filters that we use are local linear averaging and difference filters as shown in Figure 3-3.
We chose the microstructure filters described in [36] which have been shown to be useful for texture
discrimination. They are generated by all possible combinations of three basis filters: [1, 2, 1]/4, [1, 0, -1]/2, and [1, -2, 1]/4. These are qualitatively similar to the principal components we and [4] measured
in natural images. Intuitively this amounts to finding edges and bars in an image. The bases are
normalized so that the total energy passed remains unchanged. The coefficients of the edge and bar
bases sum to zero so that constant regions do not produce a response. These filters are separable
so that the computations can be done more efficiently. A 2D function x(i, j) is separable if it can
be written as x(i, j) = y(i)z(j) [27]. Separable filters reduce a full convolution to separate rowwise
and columnwise convolutions. For a full convolution, $O((N + M - 1)^2 M^2)$ arithmetic operations are
needed, where N is the size of the image and M is the size of the filter. With separable filters, only
$O(2M(N + M - 1)^2)$ arithmetic operations are needed. This reduces the number of computations
by about $O(M/2)$. To eliminate spurious responses at image boundaries, only the valid part of the
convolution was kept (i.e., the response image size is N - M + 1).
Figure 3-3: The 9 primitive filters used in computing the measurement maps. In practice, these can
be efficiently computed with separable horizontal and vertical convolutions.
3.2.2 Filtering Tree
An image is processed by a filtering tree which consists of convolving the image with each of M
filters. The M output images $x_1, \ldots, x_M$ are then rectified by taking the absolute value. Each response is then
subsampled by a factor of two since we want to capture larger scale structure on the next round of
filtering. The same filtering process is then applied to each response image. For the resolution of
our images, four levels of filtering were possible. Using 9 local filters over the red, green, and blue
channels of an image gave a total of $3 \cdot 9^4 = 19{,}683$ response images. A sum was then taken over the
pixel values of the final response images for a total of 19,683 measurements. The final summation
is a measure of the total response energy of the measurement. Usually, energy is defined as the
sum of the squares of the measurements, but since our measurements are strictly positive, we have
implicitly used a sum of the absolute values of the measurements to define energy. The steps of the
algorithm in summary are listed below:
1. convolve
2. rectify
3. subsample by 2
4. summation (after four levels of steps 1-3)
Formally, the filter tree computes each measurement as:
g_{i,j,k,l}(x) = \sum_{\text{pixels}} x_{i,j,k,l} \qquad (3.3)

where

x_{i,j,k,l} = \downarrow_2\!\left( \left| f_l * x_{i,j,k} \right| \right) \qquad (3.4)

x_{i,j,k} = \downarrow_2\!\left( \left| f_k * x_{i,j} \right| \right) \qquad (3.5)

x_{i,j} = \downarrow_2\!\left( \left| f_j * x_i \right| \right) \qquad (3.6)

x_i = \downarrow_2\!\left( \left| f_i * x \right| \right) \qquad (3.7)

x is the image; i, j, k, l are indices over the repeated linear filters; $|\cdot|$ is the absolute value function; $\downarrow_2$ is subsampling by a factor of two; $*$ is convolution; and $x_i$, $x_{i,j}$, $x_{i,j,k}$, $x_{i,j,k,l}$ are the response images. Figure 3-4
is a schematic of the filtering tree.
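A direct sketch of Equations 3.3-3.7 for a single channel follows. For clarity it applies the nine 3x3 filters as full 2D convolutions; the thesis implementation exploits their separability. The valid-part convolution and factor-of-two subsampling follow the text.

import numpy as np
from scipy.signal import convolve2d

# nine filters: outer products of the three 1D bases
BASES = [np.array([1, 2, 1]) / 4.0,    # average
         np.array([1, 0, -1]) / 2.0,   # edge
         np.array([1, -2, 1]) / 4.0]   # bar
FILTERS = [np.outer(v, h) for v in BASES for h in BASES]

def filter_tree_measurements(channel, levels=4):
    """Sum of rectified, subsampled responses along every filter path
    of the given depth: 9**levels measurements per channel."""
    responses = [np.asarray(channel, dtype=float)]
    for _ in range(levels):
        nxt = []
        for r in responses:
            for f in FILTERS:
                out = np.abs(convolve2d(r, f, mode='valid'))  # filter + rectify
                nxt.append(out[::2, ::2])                     # subsample by 2
        responses = nxt
    return np.array([r.sum() for r in responses])             # Equation 3.3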
Figure 3-4: A schematic of the filtering tree where a set of filters is repeatedly applied to an image
to capture more global structure.
For many of the features we found that the response histogram within an image was surprisingly
kurtotic with kurtoses much larger than 3 (gaussian). The sample kurtosis is defined as:
\kappa = \frac{1}{N} \sum_{n=1}^{N} \left( \frac{x_n - \bar{x}}{\hat{\sigma}} \right)^4 \qquad (3.8)
where N is the total number of samples, $\bar{x}$ is the sample mean, and $\hat{\sigma}$ is the sample standard
deviation [37]. A gaussian distribution is often used as a baseline measure of kurtosis. Its kurtosis
($\kappa = 3$) is designated as 'low' since the tails fall off rapidly⁵. The kurtosis of a distribution can be
compared to the one for a gaussian with the same standard deviation. Intuitively, kurtosis measures
the 'peakedness' of a distribution. Thus a measurement with high kurtosis is sparse within the
image. This is a different type of sparsity than the sparse representation mentioned earlier because
it is within an image instead of between images. We believe it is also important because it signals
an interesting visual event in the image. A measurement of a green pixel in an image would not
be sparse since many pixels might be green. This motivates taking the sum of the pixels as the
final measurement of the image, because most of the pixels have negligible responses. Summing
also provides desired translation invariance since we assume the location of the response to be
unimportant. Figure 3-5 shows the sequence of response images generated by a particular feature
on the image of a tiger. The filtering sequence has responded strongly to the body of the tiger
where the stripes are prominent.
3.2.3 Color Space
The red, green, and blue channels of RGB color space are non-negligibly correlated [27]. Overall
intensity variations will affect all of the channels. To reduce this correlation we decompose color
images into the opponent color axes: (R+G+B)/3, R-G, and 2B-R-G. The first component is the
intensity. Since the other two components are differences between RGB channels, they are less
sensitive to overall changes in intensity.
⁵The tails of a gaussian fall off as $e^{-x^2/(2\sigma^2)}$.
Figure 3-5: Response of an image of a tiger to a particular filtering sequence. Note that this feature
has detected a strong peak corresponding to the arrangement of the stripes on the body of the tiger.
3.2.4 Normalization
As further preprocessing, we normalize our measurements by dividing each measurement for an
image by the sum of all the measurements for the image. This normalizes the total measurement
energy in an image to unity. In practice we have found that this type of normalization helps account
for large variations in the overall contrast in images.
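A short sketch of the two preprocessing steps in Sections 3.2.3 and 3.2.4:

import numpy as np

def opponent_channels(rgb):
    """Decompose an H x W x 3 RGB image into the opponent color axes."""
    r, g, b = (rgb[..., i].astype(float) for i in range(3))
    return (r + g + b) / 3.0, r - g, 2.0 * b - r - g

def normalize_measurements(m):
    """Divide by the total so every image has unit measurement energy."""
    m = np.asarray(m, dtype=float)
    return m / m.sum()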
3.3 Measurement Selection

3.3.1 Selectivity
We now possess a method for extracting selective measurements from an image which signify interesting visual events in the image such as horizontal arrangements of vertical lines as in the tiger
image. Our primary goal is to find measurements which will be useful in discriminating different
image concepts. If we had large sets of labeled data, then we could use many conventional machine learning feature selection techniques to find discriminating projections. Since we do not have
labels before a query is made online, we need a new criterion for choosing which measurements to
keep. The task is to eliminate measurements which are not discriminating. To this end, we choose
measurements which have sparse distributions over all images. Intuitively, although a measurement
might be interesting in an image, it might occur in many images, thus the response histogram over
all images will not be sparse. So we can once again use kurtosis as a guide in choosing which
measurements to keep. Measurements for which most images have no response, but a few images
have large responses will have high kurtosis. These measurements will allow a learning algorithm
to separate most images from the few images that respond strongly. A highly kurtotic distribution
also has low entropy since most of the probability is tightly concentrated around the peak. One can
show that for a fixed variance, the gaussian distribution has the highest entropy [14]. Thus we want
measurements with non-gaussian distributions over all images. Low entropy codes and non-gaussian
distributions have been proposed as interesting projections and useful in factorial coding [5, 17].
Thus we will choose measurements with the largest kurtoses.
Another measure of sparseness is the selectivity defined in [7]:

\text{selectivity} = 1 - \frac{\bar{x}}{\max(x)} \qquad (3.9)
Selectivity ranges from 0 to 1. It is high when the measurement does not respond appreciably to
most inputs (resulting in a small average), but responds very strongly to a few particular inputs
(increasing max(x)). Empirically we have found that choosing measurements with high selectivity
results in similar performance as using highly kurtotic measurements. Figure 3-6 shows the response
Figure 3-6: Histograms for a highly selective (left) and unselective measurement (right). The highly
selective distribution had a selectivity of 0.9521 and a kurtosis of 132.0166. The unselective measurement had a selectivity of 0.4504 and a kurtosis of 2.3703.
histograms for a highly selective and an unselective measurement. Note that most of the values for
the highly selective measurement are close to zero with a few large ones at the tail. The unselective measurement has large values for most of the images. Empirically, using the 5000 most selective
measurements gave good results.
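A sketch of this offline selection step, given a matrix with one row per image and one column per measurement (the column-wise loop is a simplification):

import numpy as np

def sample_kurtosis(x):
    """Equation 3.8: fourth moment about the mean, in units of sigma^4."""
    x = np.asarray(x, dtype=float)
    return float(np.mean(((x - x.mean()) / x.std()) ** 4))

def selectivity(x):
    """Equation 3.9: near 1 when the mean response is tiny relative to the peak."""
    x = np.asarray(x, dtype=float)
    return 1.0 - x.mean() / x.max()

def select_measurements(M, n_keep=5000):
    """Keep the indices of the n_keep most selective columns of M."""
    scores = np.array([selectivity(M[:, j]) for j in range(M.shape[1])])
    return np.argsort(scores)[::-1][:n_keep]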
Chapter 4
Learning Queries Online
The work by [38] shows how learning a query can lead to much better results than using a static
similarity function. The disadvantage of supervised learning is that it can be slow and require many
training examples. This is especially true when the measurements have little discriminatory power.
On the other hand, using highly selective measurements can make it possible to learn with only a
few training examples. It can also simplify and speed up training time.
4.1 Image Retrieval as a Classification Task
We can think of image retrieval in terms of classification by defining a positive (relevant) class C1
and a (negative) irrelevant class $C_0$. The two classes are related so that $C_0 = \neg C_1$. For example, if
C1 is "images of waterfalls", then Co is "images without waterfalls". However, a priori, C1 and Co
are undefined-the concept implied in these classes is unknown until some examples are chosen. So
although C1 might be "images of the waterfalls" for one query, it might be "images of mountains"
for another. Thus classification into positive and negative classes is possible only after a query has
been specified.
All of the images $x^n$ in the positive class $C_1$ are assumed to have common visual properties
such as an image of a waterfall. However the set of images in the negative class CO is usually
heterogeneous. That is Co may contain images of mountains, fields, sunsets, etc.-as long as there
are no waterfalls. So although all items in Co must be assigned to the same class, they will not
necessarily have many features in common¹. This often makes the negative class more difficult to
model directly. Figure 4-1 shows example images from a "waterfall" concept.
¹The positive class can be regarded as a simple hypothesis while the negative class is a composite hypothesis.
Figure 4-1: Some typical images exhibiting the "waterfalls" concept. Note the diversity in the images
as some contain sky, some contain flora while others are mostly rock, etc.
Recently, there has been success using simple classifiers such as hyperplanes in a high dimensional
measurement space [49]. These measurements are a more complex representation of the input.
Simple classifiers such as nearest neighbor [18] can delineate arbitrary decision boundaries even with
naive measurements. However, both of these techniques often require a large amount of labeled
training data, which is unavailable in image retrieval. Armed with selective measurements, we will
use a classifier and learning method that works well with a small number of training examples. We
also want training to be fast.
4.2 Aggregating Weak Learners
Recall that in machine learning, we are given a set of classified training examples and would like
to induce a rule that is useful in classifying unseen test examples. Often the classifier is trained
by adjusting some internal parameters based on the training data. For example, let the data be
a vector $x^n$ and its corresponding classification $t^n \in \{0, 1\}$. Define $C_0 = \{x^n \mid t^n = 0\}$ and
$C_1 = \{x^n \mid t^n = 1\}$. Let the internal parameters be vectors $\mu_0$ and $\mu_1$ with the same dimensionality
as x. The classifier is trained by setting

\mu_i = \frac{1}{|C_i|} \sum_{x^n \in C_i} x^n \quad \text{for } i = 0, 1 \qquad (4.1)

In other words, parameters $\mu_0$ and $\mu_1$ are set to the averages of the training examples in $C_0$ and $C_1$
respectively. A new example is classified as:
x \in \begin{cases} C_0 & \text{if } |x - \mu_0| < |x - \mu_1| \\ C_1 & \text{otherwise} \end{cases}
so that it is assigned to the class with the average that is closest (most "similar").
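A sketch of this minimum-distance classifier (Equation 4.1 and the rule above):

import numpy as np

def train_nearest_mean(X, t):
    """Set each class parameter to the mean of its training examples."""
    X, t = np.asarray(X, dtype=float), np.asarray(t)
    return X[t == 0].mean(axis=0), X[t == 1].mean(axis=0)

def classify_nearest_mean(x, mu0, mu1):
    """Assign x to the class whose mean is closer."""
    return 0 if np.linalg.norm(x - mu0) < np.linalg.norm(x - mu1) else 1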
Although the above classifier is both simple to train and simple to use, it may not classify unseen
test examples or even the training examples well. This might be because there are very few training
examples (such as in image retrieval) so that the average is biased. It could also be that the classifier
model as determined by the classification rule and parameters do not fit the data. Often though,
it is relatively easy to build classifiers which perform better than chance. These are called weak
learners.
A natural extension is to train a set of classifiers and aggregate their outputs. The questions are
how to build a set of classifiers and how should the classifications be combined. Two deterministic
classifiers will learn the same rule if they are trained on the same data. One method that has
been empirically shown to improve classifier error rate is Bagging (Bootstrap Aggregation) [9]. The
training data is randomly sampled with replacement to create different bootstrap training examples.
This enables us to generate a set of different classifiers. The final classification is taken as a majority
vote of all the classifiers.
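A sketch of Bagging, with hypothetical train_fn and predict_fn hooks standing in for any weak learner:

import numpy as np

def bagging_vote(X, t, X_test, train_fn, predict_fn, n_rounds=25, seed=0):
    """Train one classifier per bootstrap resample; majority-vote the
    {0, 1} predictions on X_test."""
    rng = np.random.default_rng(seed)
    n = len(X)
    votes = np.zeros(len(X_test))
    for _ in range(n_rounds):
        idx = rng.integers(n, size=n)        # sample with replacement
        model = train_fn(X[idx], t[idx])
        votes += predict_fn(model, X_test)
    return (votes > n_rounds / 2).astype(int)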
4.3 Boosting Weak Learners
Another approach is to train a set of classifiers iteratively. On each round of constructing a new
classifier, every training example is reweighted to effectively change the distribution of the training
data. The final classifier is a linear combination of the component classifiers. This is the method of
Boosting [20].
Consider a set of weak hypotheses (classifiers) $h_1(\cdot), \ldots, h_T(\cdot)$ (i.e., each has error $\epsilon_i \le 1/2$) with
outputs $y_i = h_i(x)$ where $h_i(\cdot) \in \{0, 1\}$. If none of the hypotheses alone achieve acceptable error
rates, we can use a linear combination of all of them:

h(x) = \sum_{i=1}^{T} \alpha_i h_i(x) \qquad (4.2)
where h(x) is the boosted hypothesis. If all the $\alpha_i$ are equal then this is a majority vote. The key with
boosting is that the training distribution is reweighted before the next hypothesis is constructed.
Incorrectly classified examples are reweighted more heavily so that the next hypothesis chosen will
tend to classify these examples correctly. The $\alpha_i$ are set inversely proportional to the $\epsilon_i$ so that
the votes of more accurate hypotheses are more heavily weighted. The efficacy of boosting critically
depends on the ability of the $h_i(\cdot)$ to overfit the training data. By changing the training distribution
via reweighting we can get a series of different classifiers so that each successive one is driven to
correctly classify the examples the preceding one classified incorrectly.
Theoretically, [20] have shown that boosting will eventually drive the training error to zero,
exponentially fast in the number of boosting iterations. In practice, we have found this to be true
after approximately 20 iterations. Although it might seem that this would drastically overfit the
training data, empirically, the boosted classifier seems to perform well on the unseen test data. It
appears that boosting may be increasing the margin or separation between the positive and negative
examples. As shown by [49], this generally improves performance on test data.
4.4 Learning a Query
Recall that we chose our measurements in an unsupervised fashion using a high selectivity criterion.
At query time, we have a set of training images so we can find measurements which are relevant for
the particular query. The idea is to build a series of hypotheses, each one using a single measurement.
Many image retrieval systems use only a single example image to define a query and never consider
negative examples. This is an artificially imposed constraint since it is often the case that the user
can pick a few positive and negative examples. To enlarge the set of training examples, we can
also augment the examples designated by the user by an initial set of presumed negative, random
examples. Although it is possible that we may randomly choose an image that should really be
positive, for large databases, the number of images in any particular image concept is small. This
gives us a set of labeled training images from which to build our classifier. We initially weight the
positive images (which are chosen by the user) higher so that the sum of the weights of the positive
examples equals the sum of the weights of the random negative examples. This primes the system
for correctly classifying the positive examples.
We chose a simple minimum distance classifier in which an image is assigned to the closest concept
in the single measurement dimension. Thus $x$ is in class 1 if $|x - \mu_1| < |x - \mu_2|$, where $\mu_k$ is the
sample mean of images in class $k$. To construct the first hypothesis (choose the first measurement to
keep), we build hypotheses for each measurement and choose the one with the lowest training error.
After the first hypothesis is found, the training examples are reweighted in order to emphasize the
incorrectly labeled examples. In particular we use the AdaBoost algorithm [20] shown in Table 4.1.
The $x_n$ are images, $t_n$ is the corresponding class label for the image, and $w_{i,n}$ is the weight for
image $n$ on boosting round $i$. The process is repeated for the desired number of iterations $T$. In
our experiments, 10 to 50 rounds of boosting were sufficient for reasonable retrieval. Note that our
boosting measurement selection requires O(Td) time since each of d measurements must be checked
on each round of boosting (except for those that are already chosen). However after this initial
learning process, only T measurements are needed to query the database. For large databases this
uses much less computation than using all of the measurements.
It is possible that using $k > 1$ measurements for each classifier may lead to better performance.
However this creates $\binom{n}{k} = n!/(k!(n-k)!)$ possible subsets of measurements to check. Choosing two
measurements out of a thousand ($n = 1000$, $k = 2$) would require looking at 499,500 subsets. This is too
slow for a practical image retrieval system.
4.5 Observations
By looking at the measurements selected for various queries, we can see that different measurements
are indeed needed to handle different types of queries. Figure 4-2 shows how the measurements
chosen by boosting better separate images of mountains from other natural images than randomly selected measurements.
Boosting Algorithm

* Given example images $(x_1, t_1), \ldots, (x_N, t_N)$ where $t_n = 0, 1$ for negative and positive examples respectively.

* Initialize weights $w_{1,n} = \frac{1}{2L}, \frac{1}{2M}$ for $t_n = 0, 1$ respectively, where $L$ and $M$ are the number of negatives and positives respectively.

* For $i = 1, \ldots, T$:

  1. Train one hypothesis $h_j$ for each feature $j$ using $w_i$, with error $\epsilon_j = \Pr^{w_i}[h_j(x_n) \ne t_n]$.

  2. Choose $h_i(\cdot)$ such that $\forall j \ne i$, $\epsilon_i < \epsilon_j$ (i.e., the hypothesis with the lowest error).

  3. Update

  $$w_{i+1,n} = w_{i,n}\,\beta_i^{1-e_n},$$

  where $e_n = 0, 1$ for example $x_n$ classified correctly or incorrectly respectively, and $\beta_i = \frac{\epsilon_i}{1-\epsilon_i}$. Normalize $w_{i+1,n}$ so that $w_{i+1}$ is a distribution.

* Output the final hypothesis:

$$h(x) = \sum_{i=1}^{T} \alpha_i h_i(x) \bigg/ \sum_{i=1}^{T} \alpha_i, \quad \text{where } \alpha_i = \log \frac{1}{\beta_i}.$$
Table 4.1: The boosting algorithm for learning a query online. T hypotheses are constructed each
using a single feature. The final hypothesis is a linear combination of the T hypotheses where the
weights are inversely proportional to the training errors.
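The loop in Table 4.1 can be sketched as follows, specialized to the single-measurement minimum-distance hypotheses of Section 4.4. The variable names are ours, and the update $\beta_i = \epsilon_i/(1-\epsilon_i)$ is the standard AdaBoost choice from [20]; treat this as an illustration, not the thesis's exact implementation.

    import numpy as np

    def boost(X, t, T):
        """X: N x d measurement matrix; t: labels in {0, 1}; T: rounds."""
        N, d = X.shape
        L, M = np.sum(t == 0), np.sum(t == 1)
        # Positives and negatives each start with half the total weight.
        w = np.where(t == 0, 1.0 / (2 * L), 1.0 / (2 * M))
        # One minimum-distance hypothesis per measurement (Section 4.4).
        mu0, mu1 = X[t == 0].mean(axis=0), X[t == 1].mean(axis=0)
        preds = (np.abs(X - mu1) < np.abs(X - mu0)).astype(int)  # h_j(x_n)
        chosen, alphas = [], []
        for _ in range(T):
            errs = (w[:, None] * (preds != t[:, None])).sum(axis=0)
            errs[chosen] = np.inf               # skip features already chosen
            j = int(np.argmin(errs))
            err = float(errs[j])
            beta = max(err, 1e-10) / max(1.0 - err, 1e-10)
            e = (preds[:, j] != t).astype(int)  # e_n = 0 correct, 1 incorrect
            w = w * beta ** (1 - e)             # downweight correct examples
            w /= w.sum()                        # keep w a distribution
            chosen.append(j)
            alphas.append(np.log(1.0 / beta))
        return chosen, np.array(alphas), mu0, mu1

    def score(X, chosen, alphas, mu0, mu1):
        """Rank images by the weighted vote of the T hypotheses."""
        h = (np.abs(X[:, chosen] - mu1[chosen]) <
             np.abs(X[:, chosen] - mu0[chosen])).astype(int)
        return h @ alphas

Note that after training, only the chosen measurements enter the score, which is the source of the savings at query time discussed above.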
Figure 4-2: The left plot shows how boosting finds measurements which separate images of
mountains from other images better than choosing random measurements (right).
Figure 4-3: The boosting measurements which separate images of mountains well do not discriminate
images of lakes as well.
Although these particular measurements may be useful in queries
for mountains, Figure 4-3 shows that the same measurements are not as discriminating for images
of lakes. Thus different measurements are needed to handle different queries, and boosting selects
those which are useful for a particular query.
4.6 Relevance Feedback
Since the image retrieval system is used interactively, there is an opportunity for the user to improve
the query by iteratively adding positive and negative examples. In information retrieval this is called
relevance feedback, and it is a natural way for the system to obtain further information from the user.
Research in this area of image retrieval is becoming very active [31, 15]. We view this situation as an
opportunity for the image retrieval system to do some active machine learning [11]. That is, instead
of the system waiting for the user to hand it some training examples, the system can suggest some
examples for the user to label. It is often useful for the system to query those examples it is most
unsure of. One simple example of useful feedback in image retrieval is that the initial query results
often contain many highly ranked false positives. Since these are readily seen by the user, the user
can add the most egregious (highly ranked) of these as negative examples. Figure 4-4 shows how performance improves with feedback from the user.
Figure 4-4: At left is the initial performance. On the right is the improved performance after the
four most egregious false positives were added as negative examples. This example is on a data
set containing five classes with 100 images in each class (see Chapter 5).
4.6.1 Eliminating Initial False Negatives
Sometimes a few of the randomly chosen initial negative examples turn out to be true positives.
To eliminate these we return the negatives in the training set which are closest to the threshold for
being classified as positive. This allows the user to eliminate those images from the randomly chosen
negative example training set.
4.6.2 Margin Boosting by Querying the User
Another source of feedback is to present users with images in the test set which are near the threshold.
The user can then add any true positives which are below threshold in order to provide a wider range
of examples for the learning algorithm. This is useful because any positive examples added which
are already well above threshold will not change the input distribution as much as one which is still
classified as negative.
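Both forms of feedback amount to returning the examples whose boosted scores lie nearest the decision threshold. A minimal sketch, assuming the weighted-vote score from the boosting sketch above and a threshold of half the total vote (the conventional AdaBoost choice; the exact value used here is not specified in the text):

    import numpy as np

    def nearest_to_threshold(scores, alphas, k=10):
        """Indices of the k images whose scores are closest to the threshold."""
        threshold = 0.5 * alphas.sum()  # assumed decision point
        return np.argsort(np.abs(scores - threshold))[:k]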
Chapter 5
Experimental Results
5.1 Performance Measures
The standard measures of retrieval performance are precision and recall. Let N be the number of
images returned for a query, $N_r$ of which are relevant. Let $N_t$ be the total number of relevant images
in the database for this query. Recall is defined as:

$$\mathrm{re} = \frac{N_r}{N_t}. \qquad (5.1)$$

Precision is defined as:

$$\mathrm{pr} = \frac{N_r}{N}. \qquad (5.2)$$
So recall is the proportion of relevant images returned, while precision is the proportion of images
returned which are relevant. For image retrieval, precision may be more important than recall since
users will often only want to see the top 10-20 ranked images. To maximize the number of relevant
images returned, we want to maximize the precision within the top ranked images. It is useful to
combine the two performance measures into a single precision-recall graph which shows the precision
achieved as a function of recall. In an ideal system, precision would remain maximal at 1.0 from 0
to 1.0 recall. Random performance would stay at steady low precision across recalls. In practice,
retrieval typically starts at high precision with low recall. As recall increases, precision usually
drops since the threshold for returning images as relevant is successively lowered until all images are
returned.
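Given a ranked list, the curve can be computed directly from equations 5.1 and 5.2, as in the following sketch (the helper is ours; it assumes every relevant image appears somewhere in the ranking):

    import numpy as np

    def precision_recall(relevant_sorted):
        """relevant_sorted: booleans for the ranked images, best first."""
        hits = np.cumsum(relevant_sorted)           # N_r within the top n
        n = np.arange(1, len(relevant_sorted) + 1)  # N, the number returned
        return hits / hits[-1], hits / n            # recall, precision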
5.2 Natural Scene Classification
To test the classification performance of the system we constructed five classes of natural images
(sunsets, mountains, lakes, waterfalls, and fields) using the Corel Stock Photo1 image sets 1, 26,
27, 28, and 114 respectively [13, 38]. The images had dimensions of 72x108 or 108x72. Each class
contains 100 images. Each class was split into 10 subsets of 10 training examples. In addition, one
example from each negative class was added for a total of 4 negative examples. The results shown
are the average of the 10 trials. The training examples were not considered during testing. Figure
5-1 shows a representative image from each class.
We tested to see if more selective measurements improved performance. Measurements were
computed using one and four layers of filtering. Each experimental trial used the same number of
features. Figure 5-2 shows how using more layers of filtering improved performance.
'This publication includes images from the Corel Stock Photo images which are protected by the copyright laws
of the U.S., Canada and elsewhere. Used under license.
Figure 5-1: An example image from each class of sunsets, mountains, lakes, waterfalls, and fields.
Figure 5-2: Measurements with four layers of filtering (right) performed better than using only one
layer of filtering (left).
For natural scenes we have used the following abbreviations in the figures: ss (sunsets), mt (mountains), lk (lakes),
wf (waterfalls), and fd (fields). The selectivities for four layer measurements ranged from a minimum
of 0.5086 to a maximum of 0.8623. The selectivities for one layer measurements ranged from a
minimum of 0.4242 to a maximum of 0.8366.
Figure 5-3 compares the performance of the image retrieval system proposed in this thesis with
the global color histogram using a chi-square measure of similarity. The chi-square statistic measures
the difference between two distributions and is based on the distribution of the sum of squares of
Gaussian random variables [37]. The RGB color space was discretized into 64 bins for the histogram.
For this limited test set of five classes of natural images, color turns out to be fairly discriminative.
This is obvious for the sunsets class which contains mostly orange color. Also, the fields class contains
mostly green, while the mountains class contains mostly blue. Interestingly, we can do just as well
using color histograms and boosting as shown in Figure 5-4. The advantage here is that instead of the full
64 bins, boosting gives comparable performance using only 32 bins. This enables the system to rank
each image with half as many arithmetic operations. On a very large database, reducing the number
of computations by a factor of two is a substantial savings.
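For reference, the chi-square comparison of two histograms used as the baseline can be sketched as below; the exact form and normalization in [37] may differ slightly, so this is an illustrative version only.

    import numpy as np

    def chi_square(h1, h2, eps=1e-10):
        """Chi-square distance between two histograms (e.g., 64-bin RGB)."""
        return float(np.sum((h1 - h2) ** 2 / (h1 + h2 + eps)))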
For selective measurements, best performance is achieved for the sunsets class. The images in
this class are very homogeneous, typified by a global orange color. Besides an image of a sunset and
perhaps the horizon, very little other significant visual content is found in these images. The waterfall
images are moderately homogeneous since an image of a waterfall is fairly salient. Similarly the
mountains class contains mainly snow-capped mountains. Lower performance is achieved for the
fields and lakes classes because the images in those classes are very heterogeneous. Visually, a field
may contain flowers, bushes, grass, and hills. Many images of lakes had what most people would
consider mountains and fields in the background.
Figure 5-3: A comparison between using color histograms with the chi-square similarity function
(left) and using selective measurements and boosting (right).
Although color is fairly discriminative for these natural images, it is not adequate for other classes
such as sports cars. In fact, most man-made objects may be from the same class but be colored
very differently. Also, with a larger database, more images will have orange and blue colors,
thus making color less discriminative. Figures 5-5 and 5-6 show poor results using color histograms
and the chi-square function on a 3000 image database. Figures 5-7 and 5-8 show poor results even
when color histograms are used with boosting on the same queries.
5.3 Principal Components Analysis
Principal components analysis (PCA) is a technique that rotates data onto the axes of highest variance
[8]. We saw in Chapter 3 how it was used to guide the design of the selective measurements. PCA
removes second-order correlations and is often used to find a lower-dimensional representation of
data (intrinsic dimensionality). The correlation matrix is defined as:

$$C_{ij} = \frac{\Sigma_{ij}}{\sqrt{\Sigma_{ii}\,\Sigma_{jj}}}, \qquad (5.3)$$

where $\Sigma_{ij}$ is the $(i, j)$'th element of the covariance matrix. We performed PCA on our selective
measurements and tested performance with the reduced dimension projections. Figure 5-9 shows
the reduction in correlation between the measurements after PCA. This reduction in correlation and
dimensionality, however, also results in a reduction in performance, as shown in Figure 5-10.
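A sketch of equation 5.3 and the projection used in this experiment, with X an images-by-measurements matrix (function names are ours, for illustration):

    import numpy as np

    def correlation_matrix(X):
        """C_ij = Sigma_ij / sqrt(Sigma_ii * Sigma_jj), equation 5.3."""
        cov = np.cov(X, rowvar=False)
        d = np.sqrt(np.diag(cov))
        return cov / np.outer(d, d)

    def pca_project(X, k):
        """Project the centered data onto the top-k principal axes."""
        Xc = X - X.mean(axis=0)
        _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
        return Xc @ Vt[:k].T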
5.4 Retrieval
To test the performance of our system on a larger database, we selected Corel image sets 1-30 for
a 3000 image database. The images had dimensions of 72x108 or 108x72. Figures 5-11 through 5-14 show
the results from some example queries. The typical number of images relevant to a particular query
is about 100 (3% of the images in the database). Typically 40 rounds of boosting were used to
select 40 measurements for querying. An initial set of 100 randomly selected negative examples were
used. The example queries shown typically used 1-5 extra example images selected from the most
egregious (highly ranked) false positives from an initial query using just the positive images.
Figure 5-4: Here we show that boosting can be used with color histograms to give comparable
performance. Boosting achieves this using only half of the histogram measurements. This results in
a substantial computational savings on a large database.
5.5 Face Detection
To test if selective measurements and boosting are useful for detection, we tested the system on a
database of 923 images of faces and 923 images of non-faces. Initially a large number of images were
collected from the World Wide Web. Each frontal face image was obtained by hand segmenting the
original image. A non-face image was generated by selecting a random patch of the same size as the
face segment from the same image. This ensures that the set of non-face images comes from a
similar distribution of images as that of the face images. All face and non-face images were then
resized to 64x64. We split the data set into 92 sets of 10 face training images and 10 non-face training
images. Once again, the 10 most egregious false positives were added as additional negative examples
after an initial round of querying. The average results for the 92 trials are shown in Figure 5-15.
Performance was quite satisfying with only 10% false alarms for over 95% detection. In addition,
only 20 measurements were used. Figure 5-16 shows the results of a typical query using only three
example face images and an initial set of 100 random non-face images. It would be interesting future
research to test the ability of selective measurements and boosting to detect other types of objects
across different scales and poses. For a more generic detection system, only the key measurements
selected by boosting need to be computed in the first place.
5.6 Digit Classification
We tested the performance of our image retrieval method on a database of ten digits with 100 examples
per class. One example image from each class is shown in Figure 5-17. These images are part of a
set of training examples of the National Institute of Standards and Technology (NIST) handwriting
database. Precision on this data set degrades very quickly as recall increases. It seems that our
measurements are unable to capture the nature of the digit classes and are not invariant to the various
transformations from one image to another.
Figure 5-5: A query for waterfalls using color histograms and the chi-square function. Note global
color histograms cannot capture the flowing vertical shape of waterfalls.
Figure 5-6: A query for sports cars using color histograms and the chi-square function. Note,
unsurprisingly, that the few cars found are all red; cars of other colors in the database are not
ranked highly.
Figure 5-7: A query for waterfalls using color histograms and boosting. Note that boosting finds
images with similar color distributions, but most of those images are not of waterfalls.
Figure 5-8: A query for sports cars using color histograms and boosting. Note that boosting finds
images with similar color distributions, but most of those images are not of cars.
Figure 5-9: The correlations between the principal components (left) are much smaller than the correlations between the original measurements (right).
Figure 5-10: Performance is reduced when using the principal components of the measurements.
Figure 5-11: A query for sunsets. The positive examples are shown on the left.
Figure 5-12: A query for waterfalls. The positive examples are shown on the left.
Figure 5-13: A query for sports cars. The positive examples are shown on the left.
Figure 5-14: A query for cacti. The positive examples are shown on the left.
Figure 5-15: The average receiver operating characteristic curve (left) and precision-recall curve (right) for detecting faces.
Figure 5-16: A query for faces. The positive examples are shown on the left.
Figure 5-17: An example image from each of the ten digit classes.
Figure 5-18: Precision-recall results for images of the ten digits.
Chapter 6
Discussion
6.1 Summary
We have described a system for image retrieval which uses highly selective measurements and boosting to learn queries. The measurements are computed by a tree of repeated filtering with simple
local difference filters. This leads to a large number of potentially useful measurements for various
types of queries. Boosting learns which measurements are discriminative for a particular class of
images and selects only those for querying the database. By using many measurements the system
is flexible enough to handle many different types of queries. And since boosting finds a small set of
discriminative features, searching the database of images can be made more efficient. Although it
is difficult to evaluate the performance of any retrieval system, we have made an attempt here by
testing the system against a variety of queries.
Our main contribution lies in outlining a theory for image retrieval. This theory suggests that
many measurements may be needed to explain different images. However, for any particular class
of images, only a few measurements may be important. This idea mirrors the sparse nature of real
images. We proposed to find these discriminative measurements with a two step process: (1) selecting
measurements with sparse distributions, (2) choosing the few measurements relevant to a particular
query with boosting. In summary, we have tried to finesse the image retrieval problem by trying to
solve it directly instead of solving object recognition or resorting to text-based representations.
6.2 Applications
There are many applications which can arise from an image retrieval system. Some possible ones
are listed below:
* art galleries and museums
* architecture and manufacturing design
* remote sensing and resource management
* geographic information systems
* scientific database management
* retailing
* fabric and fashion design
* trademark and copyright database management
* law enforcement
* library archiving.
6.3 Research Areas
Image retrieval borrows techniques and ideas from knowledge based systems, cognitive science, user
modeling, computer graphics, image processing, pattern recognition, database management systems,
and information retrieval. For a commercial system, many practical issues such as user interface
and database organization need to be addressed. Some of these research areas are listed below:
* query specification
* file structures for storage and retrieval
* relevance feedback
* distributed databases
* performance measures
* level of abstraction
* domain independence.
6.4 Connections to Biology
Qualitatively our approach is in rough accord with the biology of the human visual system up to the
striate cortex. Light enters the eye and impinges upon the photoreceptors which in turn activate
retinal ganglion cells. There are about $10^6$ ganglion cells. The nerve impulses travel to the roughly
$1.5 \times 10^6$ cells in the lateral geniculate nucleus (LGN). These signals then move upstream to the
striate cortex which contains about $2 \times 10^8$ neurons. If we think of each cell as a dimension, then
the image is being represented with a higher dimensional representation each time the signal moves
upstream. There is also evidence that cells at higher levels are more selective than those at lower
levels. For example, simple cells in striate cortex only respond to bars, while retinal ganglion cells
respond to spots or parts of bars. In addition to increased selectivity, higher-level neurons also have
larger receptive fields. The subsampling operations in our filtering tree mirror this structure. The
response histograms also show selectivity increasing as measurements become more complicated. We
have used the opponent color axes mirroring those in the visual system [32].
Although the visual system as a whole is non-linear, many simple cell responses can be modeled
as roughly linear. Thus the succession of linear filters we use is in accord with this observation. It
is also known that orientation selective cells are arranged in hypercolumns so that the same visual
area projects to all the cells in a hypercolumn. Thus we also use filters at different orientations in
each image area. Of course an image retrieval system need not mirror the architecture used by the
brain. However, it would be foolish to ignore any advances made in neuroscience since we know that
the brain can perform image retrieval very successfully.
Of course the visual cortex contains more sophisticated machinery than just simple cells. For example, neurons have been found in the inferotemporal cortex which appear to respond preferentially
to shapes of hands [21]. Regardless of how objects may be coded in the brain, it is clear that some
aspects of object recognition are useful in determining whether two images are similar. Although
we are still far from being able to engineer a computer to recognize objects, there is experimental
evidence of a preattentive type of similarity [41]. Certain patterns appear similar even before we
recognize any objects. We have tried to start our research from this level of similarity. Future
research will entail moving to more advanced forms of similarity.
6.5 Future Research
One conclusion we can make is that image database retrieval research is still in its infancy. And
since image retrieval is intrinsically tied to so many other areas of research, there is much room for
future research. The primary goal of such work would be to formulate a quantitative theory for
Figure 6-1: An image with various possible classifications based on visual content, context, prior
knowledge, etc.
what type of measurements to compute and how to best use them. We have hinted at using highly
selective measurements and boosting. Below is a list of directions for future research:
* restructuring measurements for faster and more accurate retrieval
* new datasets and criteria for evaluating performance
* analyzing the relationship to psychophysics
* clustering and organizing the image database
* learning associations from the overall nature of the database
Although it may take some time to get there, our goal is to be able to interpret Figure 6-1 as
an image of snowy mountains, pine trees, a ski lodge, daytime, cold weather, skiing, snow sports,
vacation, serenity, etc.
Appendix A
Implementation of the Image Retrieval System
A.1 Architecture
Figure A-1 shows our implementation of an image retrieval system based on the ideas in this thesis.
The retrieval engine is a self-contained program which runs as a server. Multiple clients can connect
to the server and request similar images. Users interact exclusively with the client program to
designate queries and view retrieved results.
A.2 User Interface
Figure A-2 shows the user interface to our image retrieval system. Users can cycle through random
images until some relevant examples are found. These may be added as positive or negative examples.
The interface also allows the user to select how many measurements to use. The client program also
has the ability to connect to different servers. Retrieved results as well as other information from
the image retrieval server engine are displayed in various windows in the client.
A.3 Online Demo
An online demo of the system is available on the World Wide Web at:
http://www.ai.mit.edu/projects/lv/projects/ImageDatabase.html.
[Figure A-1 diagram labels: clients (image display, user interface) connect to the server (stored representations, retrieval computations).]
Figure A-1: Image retrieval system architecture.
Figure A-2: Image retrieval interface.
Bibliography
[1] M. Abdel-Mottaleb, N. Dimitrova, R. Desai, and J. Martino. Conivas: Content-based image
and video access system. In ACM Multimedia Conference, pages 427-428, 1996.
[2] AltaVista. Altavista. Web: http://www.altavista.com.
[3] Y. A. Aslandogan, C. Thier, and C. Yu. A system for effective content based image retrieval.
In ACM Multimedia Conference, pages 429-430, 1996.
[4] R. J. Baddeley and P. J. B. Hancock. Principal components of natural images. Network: comp.
in neur. sys., 3:61-70, 1992.
[5] H. B. Barlow. Unsupervised learning. Neur. Comp., 1(1):295-311, 1989.
[6] S. Belongie, C. Carson, H. Greenspan, and J. Malik. Color- and texture-based image segmentation using EM and its application to content-based image retrieval. In Int. Conf. Comp. Vis.,
1998.
[7] E. L. Bienenstock, L. N. Cooper, and P. W. Munro. Theory for the development of neuron
selectivity: orientation specificity and binocular interaction in visual cortex. J. Neuro., 2:32-48, 1982.
[8] C. M. Bishop. Neural Networks for Pattern Recognition. Clarendon Press, 1995.
[9] L. Breiman. Bagging predictors. Technical Report 421, UCB, Dept. Stat., 1994.
[10] M. La Cascia and E. Ardizzone. Jacob: Just a content-based query system for video databases.
In Int. Conf. Acous., Speech, Sig. Proc., 1996.
[11] D. Cohn, L. Atlas, and R. Ladner. Improving generalization with active learning. Mach. Learn.,
15(2):201-221, 1994.
[12] T. H. Cormen, C. E. Leiserson, and R. L. Rivest. Introduction to Algorithms. MIT Press, 1990.
[13] Corel Corporation. Corel stock photo images. http://www.corel.com.
[14] T. M. Cover and J. A. Thomas. Elements of Information Theory. John Wiley & Sons, 1991.
[15] I. Cox, M. Miller, T. Minka, and P. N. Yianilos. An optimized interaction strategy for Bayesian
relevance feedback. In IEEE Comp. Vis. Patt. Recog. Conf., pages 553-558, 1998.
[16] J. S. DeBonet and P. Viola. Structure driven image database retrieval. In Adv. Neur. Info.
Proc. Sys., volume 10, 1998.
[17] P. Diaconis and D. Freedman. Asymptotics of graphical projection pursuit. Ann. Stat., 12:793-815, 1984.
[18] R. O. Duda and P. E. Hart. Pattern Classification and Scene Analysis. John Wiley & Sons,
1973.
[19] M. Flickner, H. Sawhney, W. Niblack, J. Ashley, Q. Huang, B. Dom, M. Gorkani, J. Hafner,
D. Lee, D. Petkovic, D. Steele, and P. Yanker. Query by image and video content: The QBIC
system. IEEE Computer, 28(9):23-32, 1995.
[20] Y. Freund and R. E. Schapire. A decision-theoretic generalization of online learning and an
application to boosting. J. Comp. & Sys. Sci., 55(1):119-139, 1997.
[21] C. G. Gross, C. E. Rocha-Miranda, and D. B. Bender. Visual properties of neurons in inferotemporal cortex of the macaque. J. Neurophys., 35:96-111, 1972.
[22] J. Hertz, A. Krogh, and R. G. Palmer. Introduction to the Theory of Neural Computation. Addison-Wesley, 1991.
[23] J. Huang, S.R. Kumar, M. Mitra, W. Zhu, and R. Zabih. Image indexing using color correlograms. In IEEE Comp. Vis. Patt. Recog. Conf., 1997.
[24] C. E. Jacobs, A. Finkelstein, and D. H. Salesin. Fast multi-resolution image querying. In
SIGGRAPH, 1995.
[25] M. Kelly, T. M. Cannon, and D. R. Hush. Query by image example: the candid approach. In
SPIE Stor. and Ret. Image & Video Databases III, volume 2420, pages 238-248, 1995.
[26] R. R. Korfhage. Information Storage and Retrieval. John Wiley & Sons, 1997.
[27] J. S. Lim. Two-Dimensional Signal and Image Processing. Prentice-Hall, 1990.
[28] R. Linsker. Self-organization in a perceptual network. Computer, pages 105-117, March 1988.
[29] P. Lipson, E. Grimson, and P. Sinha. Context and configuration-based scene classification. In
IEEE Comp. Vis. Patt. Recog. Conf., 1997.
[30] V. S. Nalwa. A Guided Tour of Computer Vision. Addison-Wesley, 1993.
[31] C. Nastar, M. Mitschke, and C. Meilhac. Efficient query refinement for image retrieval. In
IEEE Comp. Vis. Patt. Recog. Conf., pages 547-552, 1998.
[32] J. G. Nicholls, A. R. Martin, and B. G. Wallace. From Neuron To Brain. Sinauer Associates,
Inc., 3 edition, 1992.
[33] V. Ogle and M. Stonebraker. Chabot: Retrieval from a relational database of images. IEEE
Computer, 28(9):40-48, 1995.
[34] A. Pentland, R. W. Picard, and S. Sclaroff. Photobook: content-based manipulation of image
databases. Int. J. Comp. Vis., 18(3):233-254, 1996.
[35] T. Poggio and S. Edelman. A neural network that learns to recognize three-dimensional objects.
Nature, 343:263-266, 1990.
[36] W. Pratt. Digital Image Processing. John Wiley & Sons, 1991.
[37] W. H. Press, B. P. Flannery, S. A. Teukolsky, and W. T. Vetterling. Numerical Recipes in C.
Cambridge Univ. Press, 2 edition, 1992.
[38] A. L. Ratan and O. Maron. Multiple instance learning for natural scene classification. In Int.
Conf. Mach. Learn., pages 341-349, 1998.
[39] E. Rosch. Principles of categorization. In Cognition and Categorization. Lawrence Erlbaum
Assoc., Inc., 1978.
[40] H. A. Rowley, S. Baluja, and T. Kanade. Neural network-based face detection. IEEE Patt.
Anal. Mach. Intell., 20(1):23-28, 1998.
[41] S. Santini and R. Jain. Gabor space and the development of pre-attentive similarity. In Int.
Conf. Patt. Recog., 1996.
[42] P. Sinha. Image invariants for object recognition. Invest. Opth. & Vis. Sci., 34(6), 1994.
[43] J. R. Smith and S.-F. Chang. Visualseek: a fully automated content-based image query system.
In ACM Multimedia Conference, pages 87-98, 1996.
[44] G. Strang and T. Nguyen. Wavelets and Filter Banks. Wellesley Cambridge Press, 1996.
[45] K. Sung and T. Poggio. Example-based learning for view-based human face detection. Technical
Report 1521, MIT AI Lab Memo, 1994.
[46] M. J. Swain and D. H. Ballard. Color indexing. Int. J. Comp. Vis., 7(1):11-32, 1991.
[47] M. J. Swain, C. Frankel, and V. Athitsos. Webseer: an image search engine for the world wide
web. Technical Report TR-96-14, Univ. of Chicago, 1996.
[48] M. Turk and A. Pentland. Eigenfaces for recognition. J. Cog. Neuro., 3(1):71-86, 1991.
[49] V. Vapnik. Statistical Learning Theory. John Wiley & Sons, 1998.