Indexing and Retrieval
James Hill, Ozcan Ilikhan, Mark Lenz
{jshill4, ilikhan, mlenz} @cs.wisc.edu
CS 766: Computer Vision
Computer Sciences Department, University of Wisconsin-Madison
1
Presentation Outline
1- Introduction
2- Common methods used in the papers
* SIFT descriptor
* k-means clustering
* TF-IDF weight
3- Video Google
4- Scalable Recognition with a Vocabulary Tree
5- City-Scale Location Recognition
2
Introduction
Find identical objects in multiple images
Difficulties with changes in
– Scale
– Orientation
– Viewpoint
– Lighting
Search time and storage space
3
Indexing and Retrieval
Common Solutions
Invariant features (e.g. SIFT)
kd-trees
Best Bin First
4
SIFT: Scale-Invariant Feature Transform
Key Steps
1) Difference of Gaussians in scale space
2) Maxima and minima are feature points
3) Remove low-contrast and non-robust edge points
4) Assign each point an orientation
5) Create a descriptor from the windowed region
5
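As a concrete illustration of these steps, here is a minimal sketch using OpenCV's built-in SIFT (an illustration only, not the pipeline from the papers below; the file name frame.jpg is a placeholder):

import cv2

# Load an image and convert to grayscale; SIFT operates on intensity values.
img = cv2.imread("frame.jpg")
gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)

# Internally this builds the DoG scale space, finds extrema, filters
# low-contrast and edge-like points, and assigns orientations.
sift = cv2.SIFT_create()
keypoints, descriptors = sift.detectAndCompute(gray, None)

# descriptors is a (number of keypoints) x 128 array of SIFT descriptors.
print(len(keypoints), descriptors.shape)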
SIFT: Scale-Invariant Feature Transform
Key Benefits
Feature points invariant to scale and translation
Orientations provide invariance to rotation
Distinctive descriptors are partially invariant to changes in illumination and viewpoint
Robust to background clutter and occlusion
6
k-means clustering
Motivation: what are we trying to do?
We want to develop a method for finding the
centers of different clusters in a set of data.
7
k-means clustering
How do we find these means?
We need to perform a minimization on:
\sum_{i=1}^{k} \sum_{x_j \in S_i} \lVert x_j - \mu_i \rVert^2
12
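To make the minimization concrete, here is a minimal NumPy sketch of Lloyd's algorithm for k-means (the standard alternation between assignment and mean-update steps; not the implementation used in the papers):

import numpy as np

def kmeans(X, k, iters=20, seed=0):
    rng = np.random.default_rng(seed)
    # Initialize the means with k randomly chosen data points.
    centers = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    for _ in range(iters):
        # Assignment step: each point joins the cluster with the nearest mean.
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: each mean becomes the centroid of its assigned points.
        for i in range(k):
            if np.any(labels == i):
                centers[i] = X[labels == i].mean(axis=0)
    return centers, labels

Each SIFT descriptor would be a row of X; the returned centers are the cluster means used below.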
k-means clustering
How do we extend this?
With Hierarchical k-means Clustering!
13
k-means clustering
Now that we can cluster our data, how can we
use this information to quickly find the closest
vector in our data given some test vector?
17
k-means clustering
We will build a vocabulary tree using this
clustering method.
Each vector in our data (including the means)
will be considered a “word” in our vocabulary.
We will build a tree using the means of our data.
18
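A minimal sketch of building such a vocabulary tree by recursive k-means (reusing the kmeans function sketched earlier; the branching factor and depth here are illustrative defaults, not the papers' settings):

def build_vocab_tree(X, branch=10, depth=4):
    # Each node stores the mean of the descriptors that reached it.
    node = {"center": X.mean(axis=0), "children": []}
    if depth == 0 or len(X) < branch:
        return node  # leaf node: this mean acts as a visual word
    _, labels = kmeans(X, branch)
    for i in range(branch):
        subset = X[labels == i]
        if len(subset) > 0:
            # Recursively cluster the descriptors assigned to child i.
            node["children"].append(build_vocab_tree(subset, branch, depth - 1))
    return node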
TF-IDF
Term frequency–inverse document frequency (tf–idf) is a statistical measure used to evaluate how important a word is to a document in a collection or corpus.
It is a standard weighting scheme in information retrieval and text mining.
22
TF-IDF
n_id : the number of occurrences of word i in document d.
n_d : the total number of words in document d.
N_i : the number of documents containing term i.
N : the total number of documents in the whole database.
23
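Putting these together, the weight of word i in document d is the standard tf-idf product (this is the weighting used in the Video Google paper):

t_i = \frac{n_{id}}{n_d} \log \frac{N}{N_i}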
TF-IDF
word frequency × inverse document frequency
Each document is represented by a vector of these weights.
The vectors are then organized into an inverted file.
24
TF-IDF
Image credit: http://www.lovdata.no/litt/hand/hand-19912.html
25
Video Google
A Text Retrieval Approach to Object Matching
in Videos
Josef Sivic and Andrew Zisserman
Visual Geometry Group,
Department of Engineering Science
University of Oxford, United Kingdom
Proceedings of the International Conference on
Computer Vision (2003)
26
Video Google
Efficient Visual Search of Videos Cast as Text Retrieval
IEEE Transactions on Pattern Analysis and Machine Intelligence, Volume 31, Number 4, pages 591-606, 2009
Fundamental idea of the paper:
Retrieve key frames and shots of a video containing a particular object with the ease, speed, and accuracy with which Google retrieves text documents (web pages) containing particular words.
27
Video Google
Recall Text Retrieval (preprocessing)
1. Parse documents into words
2. Stemming: "walk" = {"walk", "walking", "walks", …}
3. Stop list to reject very common words, such as "the" and "an"
4. Each document is represented by a vector with components given by the frequency of occurrence of the words the document contains
5. Store the vectors in an inverted file
28
Video Google
Can we treat video the same way?
What and where are the "words" of a video?
29
Video Google
The Video Google algorithm:
a) Pre-processing (off-line):
1. Detect affine covariant regions in each key frame of the video
2. Reject unstable regions
3. Build the visual vocabulary
4. Remove stop-listed words
5. Compute weighted document frequency vectors
6. Build the index (inverted file)
30
Video Google
Building a Visual Vocabulary
Step 1. Calculate viewpoint invariant regions:
Shape Adapted (SA) regions: centered on corner-like features
Maximally Stable (MS) regions: correspond to blobs of high contrast with respect to their surroundings, such as a dark window on a gray wall
A 720 x 576 pixel video frame yields ≈ 1200 regions
Each region is represented by a 128-dimensional vector using the SIFT descriptor
31
Video Google
32
Video Google
Step 2. Reject unstable regions:
Any region that does not survive for more than 3 frames is rejected.
This “stability check” significantly reduces the number of regions to
about 600 per frame.
33
Video Google
Step 3. Build Visual Vocabulary:
Use K-Means clustering to vector quantize descriptors into clusters
Mahalanobis distance:
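The standard definition, with Sigma the covariance matrix estimated from the training descriptors, is:

d(x_1, x_2) = \sqrt{(x_1 - x_2)^{\top} \Sigma^{-1} (x_1 - x_2)}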
34
Video Google
Step 4. Remove stop-listed visual words:
The most frequent visual words that occur in almost all images,
such as highlights which occur in many frames, are rejected.
35
Video Google
Step 5. Compute tf-idf weighted document frequency vector:
Variations of tf-idf may be used.
Step 6. Build inverted-file indexing structure:
36
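A minimal sketch of what the inverted file might look like (illustrative Python, not the authors' code): each visual word id maps to the list of frames containing it, so a query only touches the words it actually uses.

from collections import defaultdict

# inverted_index[word_id] -> list of (frame_id, tf-idf weight) pairs
inverted_index = defaultdict(list)

def index_frame(frame_id, word_weights):
    # word_weights: dict mapping visual word id -> tf-idf weight in this frame.
    for word_id, weight in word_weights.items():
        inverted_index[word_id].append((frame_id, weight))

def query(word_weights):
    # Score every frame that shares at least one visual word with the query.
    scores = defaultdict(float)
    for word_id, q_weight in word_weights.items():
        for frame_id, d_weight in inverted_index[word_id]:
            scores[frame_id] += q_weight * d_weight
    return sorted(scores.items(), key=lambda kv: -kv[1])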
Video Google
The Video Google algorithm:
b) Run-time (on-line):
1. Determine the set of visual words within the query region
2. Retrieve keyframes based on visual word frequencies
3. Re-rank the top keyframes using spatial consistency
37
Video Google
Spatial consistency:
Matched covariant regions in the retrieved frames should have a similar
spatial arrangement to those of the outlined region in the query image.
38
Video Google
How it works:
Query region and its close-up.
39
Video Google
How it works:
Original matches based on visual words
40
Video Google
How it works:
Original matches based on visual words
41
Video Google
How it works:
Matches after using the stop-list
42
Video Google
How it works:
Final set of matches after filtering on spatial consistency
43
Video Google
Real-time demo
46
Scalable Recognition with a Vocabulary Tree
James Hill, Ozcan Ilikhan, Mark Lenz
{jshill4, ilikhan, mlenz} @cs.wisc.edu
CS 766: Computer Vision
Computer Sciences Department, University of Wisconsin-Madison
47
The Paper
Scalable Recognition with a Vocabulary Tree
David Nister and Henrik Stewenius
Center for Visualization and Virtual Environments
Department of Computer Science, University of Kentucky
Appeared in: 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR 2006)
48
What are we trying to do?
Provide an indexing scheme that:
Scales to large image databases (on the order of 1 million images)
Retrieves matching images in an acceptable amount of time
49
Inspiration
Sivic and Zisserman (what you just saw)
Used k-means to partition the descriptors from many images into visual words
Used TF-IDF to score an image and find a close match
50
What’s new?
The idea of a vocabulary tree.
Using a larger vocabulary tree speeds things up and improves match quality
Can use many more training images (35,000 vs. 400)
Can insert new images into the database quickly (0.2 s vs. 10 s)
51
How do we do it?
Follow these three steps:
1. Build the vocabulary tree using the image
descriptors.
2. Generate a score for a given query image.
3. Find the images in the database that best
match that score.
52
Recap the Vocabulary Tree
1. For each image in our database, we calculate a set of feature point descriptors.
2. Each of these descriptors is a vector of numbers living in a 128-dimensional space.
3. Consider each of these vectors to be a "word" in the vocabulary of our database.
53
Recap the Vocabulary Tree
Build the vocabulary tree using hierarchical k-means clustering.
54
What’s it good for?
Now that we have a vocabulary tree, each descriptor generates a path down the tree, which is encoded as an integer and used for scoring.
At each level of the tree, the descriptor is compared to each of the k children using a dot product; the closest child determines the path that is followed.
58
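A minimal sketch of that descent (illustrative; it assumes the dict-based tree from the build_vocab_tree sketch above and unit-normalized descriptors, so the dot product acts as a similarity):

import numpy as np

def descend(tree, desc):
    # Follow the most similar child at each level; record the chosen indices.
    path = []
    node = tree
    while node["children"]:
        sims = [np.dot(desc, child["center"]) for child in node["children"]]
        best = int(np.argmax(sims))
        path.append(best)
        node = node["children"][best]
    return path  # the sequence of child indices can be packed into one integer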
Scoring
We have a bunch of paths through the tree; how do we compare the query image to a database image?
At each node i, we define a weight w_i.
The paper suggests two methods:
• Use a constant weighting scheme.
• Use an entropy weighting scheme such as
w_i = \ln(N / N_i)
59
Scoring (continued)
w_i = \ln(N / N_i)
where
N is the number of images in the database
N_i is the number of images in the database with at least one descriptor vector path through node i.
60
Scoring (continued)
This scoring mechanism results in a TF-IDF
scheme.
So we should see a higher score if more nodes
are shared by more descriptors.
61
Scoring (continued)
To compare a query image to a database image, we use the normalized difference between the query and database weight vectors q and d:
s(q, d) = \left\lVert \frac{q}{\lVert q \rVert} - \frac{d}{\lVert d \rVert} \right\rVert
62
Scoring (continued)
The researchers found that the most important factors for retrieval quality were:
• A large vocabulary tree.
• Stronger weights towards the leaves of the tree.
• Using the L1 norm in the previous equation.
63
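A minimal sketch of that comparison (illustrative; q and d are sparse dicts of accumulated per-node weights for the query and a database image, and the L1 norm is used as the authors recommend):

def normalized_score(q, d):
    # L1 distance between L1-normalized sparse weight vectors.
    # Smaller scores mean better matches.
    q_norm = sum(abs(v) for v in q.values())
    d_norm = sum(abs(v) for v in d.values())
    keys = set(q) | set(d)
    return sum(abs(q.get(k, 0.0) / q_norm - d.get(k, 0.0) / d_norm) for k in keys)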
Scoring Implementation
Scoring is implemented using inverted files
• At each node, create an inverted file.
• Each file contains a list of the images in which the current node appears.
• The inverted file of an inner node is simply the concatenation of its children's inverted files.
• Database image scores are pre-computed and pre-normalized.
64
Testing
This method was tested using a database of 40,000 CD album covers.
Pictures of CD album covers were then used as query images and run against the database.
It was also tested using 6,376 images in groups of 4: each image was queried in the hope that the other 3 images in its group would produce the top scores.
The method has also been run on databases with as many as 1 million images (the largest at the time of writing).
65
Testing
66
Results
67
Conclusions
The main conclusions of the paper are:
• Using a larger vocabulary tree improves retrieval quality.
• Using the L1 norm in the normalized difference of the scores produces better results than the L2 norm.
• The method scales to 1 million images and still runs in near real time.
68
City-Scale Location Recognition
James Hill, Ozcan Ilikhan, Mark Lenz
{jshill4, ilikhan, mlenz} @cs.wisc.edu
CS 766: Computer Vision
Computer Sciences Department, University of Wisconsin-Madison
69
City-Scale Location Recognition
Estimate location by matching features from a large set of images
70
City-Scale Location Recognition
City-wide database of photos labeled with location
71
Image Features
SIFT features invariant to
– Translation
– Scale
– Orientation
– Illumination (partially)
72
Difficulties Matching Features
Storage space
– 30,000 images ≈ 100,000,000 SIFT features ≈ 12 GB
Search time
– kd-trees and Best Bin First require storing the full descriptors
73
Method
Cluster features into visual words
Build vocabulary tree from clusters
Search tree to score matches
Return the location of the image with the top score
74
Method
Build trees with informative features
Create trees of varying branching factor
Vary number of comparisons during search
75
Vocabulary Tree
Visual word = region of an object
Just need the distance between a query feature
and each node
Only leaf nodes are words
76
Informative Features
Cluster small subsets into visual words
Compute information gain of features
Select most informative features to build tree
77
Information Gain
Informative Feature
– Found in all images of a location
– Not in any image of another location
Information gain: measure of how much new
information reduces uncertainty
78
Information Gain
N_DB = number of images in the database
N_L = number of images at location l_i
a = number of images in which visual word w_j occurs at location l_i
b = number of images in which visual word w_j occurs at other locations
79
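The exact expression from the original slide is not reproduced here; it instantiates the standard definition of information gain (mutual information) between the location and the visual word,

I(L_i ; W_j) = H(L_i) - H(L_i \mid W_j)

with the probabilities, and hence the entropies, estimated from the counts a, b, N_L, and N_DB above.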
Building the Tree
Hierarchical k-means to cluster features
Nodes are the centroids
Leaves are the visual words
80
Branching Factor
Vary number of nodes compared to increase
search accuracy
Fixed vocabulary size M
Branching factor k, depth L
k^L ≈ M
81
Greedy N-Best Paths
Approximate nearest neighbor
Similar to Best Bin First
Generalization of vocab tree search
Search multiple branches at each level
82
Greedy N-Best Paths
k + kN(L-1) comparisons per query feature
83
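A minimal sketch of Greedy N-Best Paths (illustrative, reusing the dict-based tree from the earlier sketches): instead of following only the single best child, keep the N most similar nodes at each level and expand all of their children.

import numpy as np

def greedy_n_best_paths(tree, desc, n=3):
    # Return the leaf nodes reached while keeping the n best candidates per level.
    frontier = [tree]
    leaves = []
    while frontier:
        children = []
        for node in frontier:
            if node["children"]:
                children.extend(node["children"])
            else:
                leaves.append(node)  # reached a visual word
        if not children:
            break
        # Keep the n children whose centers are most similar to the descriptor.
        sims = [np.dot(desc, c["center"]) for c in children]
        order = np.argsort(sims)[::-1][:n]
        frontier = [children[i] for i in order]
    return leaves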
Matching
Votes for image d = C_d
Computed in time linear in the number of features
84
Results
30,000 images covering 20 km
278 GPS-labelled query images
Performance = % query images within 10m of
ground truth
85
Results
Informative Features vs. Uniform
86
Results
Greedy N-Best Paths vs. Best Bin First
87
Results
Top n matches
88
Conclusion
Vocabulary tree structure affects performance
in recognition tasks
Structure becomes more critical as database
size increases
Number of comparisons drives performance,
not branching factor
89
Q & A, Discussion
Monday, November 29, 2010
CS 766: Computer Vision
Computer Sciences Department, University of Wisconsin-Madison
Acknowledgements
Many thanks to Prof. Andrew Zisserman and
Dr. Josef Sivic for providing us with extra
materials for presentation.
CS 766: Computer Vision
Computer Sciences Department, University of Wisconsin-Madison
91