Approximate Nearest Neighbor -
Applications to Vision & Matching
Lior Shoval
Rafi Haddad
Approximate Nearest Neighbor
Applications to Vision & Matching
1. Object matching in 3D: Recognizing cars in cluttered scanned images
(A. Frome, D. Huber, R. Kolluri, T. Bulow, and J. Malik)
2. Video Google: A Text Retrieval Approach to Object Matching in Videos
(Sivic, J. and Zisserman, A.)
Object Matching
Input: an object and a dataset of models
Output: the most "similar" model
Two methods will be presented:
1. Voting based method
2. Cost based method
[Figure: object S_q compared against models S_1, S_2, ..., S_n]
A descriptor based Object matching - Voting
Every descriptor votes for the model that gave the closest descriptor
Choose the model with the most votes
Problem:
The hard vote discards the relative distances between descriptors
[Figure: object S_q compared against models S_1, S_2, ..., S_n]
A descriptor based Object matching - Cost
Compare all object descriptors to all target model descriptors cos t ( S q
, S i
)
k
{ 1 ,.., K } min m
{ 1 ,.., M } dist ( q k
, p m
)
Object S q
Model S
1
Model S
2
… Model S n
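Below is a minimal numpy sketch of the two matching schemes above. The array shapes, sizes, and the Euclidean dist are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def pairwise_dist(Q, P):
    """Euclidean distance between every object descriptor (rows of Q)
    and every model descriptor (rows of P)."""
    return np.linalg.norm(Q[:, None, :] - P[None, :, :], axis=2)  # (K, M)

def match_by_voting(Q, models):
    """Each object descriptor votes for the model holding its closest descriptor."""
    votes = np.zeros(len(models), dtype=int)
    for q in Q:
        nearest = [np.min(np.linalg.norm(P - q, axis=1)) for P in models]
        votes[int(np.argmin(nearest))] += 1
    return int(np.argmax(votes))              # model with the most votes

def match_by_cost(Q, models):
    """cost(S_q, S_i) = sum_k min_m dist(q_k, p_m); the lowest total cost wins,
    keeping the relative distances that the hard vote discards."""
    costs = [pairwise_dist(Q, P).min(axis=1).sum() for P in models]
    return int(np.argmin(costs))

# Toy usage with random 128-D descriptors (sizes are illustrative)
rng = np.random.default_rng(0)
Q = rng.normal(size=(160, 128))                       # object descriptors
models = [rng.normal(size=(500, 128)) for _ in range(5)]
print(match_by_voting(Q, models), match_by_cost(Q, models))
```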
Application to car matching
Matching - Nearest Neighbor
To match the object to the right model, a nearest neighbor (NN) algorithm is implemented
Every descriptor in the object is compared to all descriptors in the model
The computational cost is very high
Experiment 1 – Model matching
Experiment 2 – Cluttered scenes
Matching - Nearest Neighbor
Example:
Q – 160 descriptors in the object
N – 83,640 reference descriptors × 12 rotations ≈ 1E6 descriptors in the models
Exact NN takes 7.4 sec per object descriptor on a 2.2 GHz processor
Speeding search with LSH
Fast search techniques such as LSH (Locality-Sensitive Hashing) can reduce the search space by orders of magnitude
Tradeoff between speed and accuracy
LSH – dividing the high-dimensional feature space into hypercubes by k randomly chosen axis-parallel hyperplanes, with l independently chosen sets of hyperplanes
LSH – k=4; l=1
LSH – k=4; l=2
LSH – k=4; l=3
LSH - Results
Taking the best 80/160 descriptors
Achieving close results with fewer descriptors
Descriptor based Object matching – Reducing Complexity
Approximate nearest neighbor
Dividing the problem into two stages:
1. Preprocessing
2. Querying
Locality-Sensitive Hashing (LSH)
Or...
Video Google
A Text Retrieval Approach to Object Matching in Videos
Query
Results
Interesting facts on Google
The most used search engine on the web
Who wants to be a Millionaire?
How many pages does Google search?
a. Around half a billion  b. Around 4 billion  c. Around 10 billion  d. Around 50 billion
How many machines does Google use?
a. 10  b. A few hundred  c. A few thousand  d. Around a million
Video Google: On-line Demo
Samples
Run Lola Run:
Supermarket logo (Bolle) – frame/shot 72325/824
Red cube logo – entry frame/shot 15626/174
Roulette #20 – frame/shot 94951/988
Groundhog Day:
Bill Murray's ties – frame/shot 53001/294, frame/shot 40576/208
Phil's home – entry frame/shot 34726/172
Query
Occluded!!!
Video Google
Text Google
Analogy from text to video
Video Google processes
Experimental results
Summary and analysis
Text retrieval overview
Word & Document
Vocabulary
Weighting
Inverted file
Ranking
Words & Documents
Documents are parsed into words
Common words are ignored (the, an, etc.); this is called a 'stop list'
Words are represented by their stems:
'walk', 'walking', 'walks' → 'walk'
Each word is assigned a unique identifier
A document is represented by a vector whose components are the frequencies of occurrence of the words it contains
Vocabulary
The vocabulary contains K words
Each document is represented by a K-component vector of word frequencies:
(0, 0, …, 3, …, 4, …, 5, 0, 0)
Example:
"…Representation, detection and learning are the main issues that need to be tackled in designing a visual system for recognizing object categories…"
Parsed and cleaned:
represent, detect, learn, main, issue, tackle, design, visual, system, recognize, category, …
Creating the document vector
Assign a unique ID to each word
Create a document vector of size K with word frequencies:
(3, 7, 2, ………) / 789
Or, compactly, with the original order and position:

Word        ID   Position
represent   1    1, 12, 55
detect      2    2, 32, 44, …
learn       3    3, 11
……          …    ….
Total: 789 words
Weighting
The vector components are weighted in various ways:
Naive – frequency of each word
Binary – 1 if the word appears, 0 if not
tf-idf – 'Term Frequency – Inverse Document Frequency':

t_i = \frac{n_{id}}{n_d} \log \frac{N}{n_i}

V_d = (t_1, \dots, t_i, \dots, t_K)^T

n_id – the number of occurrences of word i in document d
n_d – the total number of words in the document
N – the number of documents in the whole database
n_i – the number of occurrences of term i in the whole database

=> "Word frequency" × "Inverse document frequency"
=> All documents are weighted equally, regardless of their length!
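A small sketch of the tf-idf weighting defined above, in plain Python over a toy corpus; the variable names mirror the slide's notation, and the corpus is made up for illustration.

```python
import math
from collections import Counter

def tfidf_vector(doc_words, corpus, vocabulary):
    """t_i = (n_id / n_d) * log(N / n_i) for each vocabulary word i."""
    N = len(corpus)                                  # documents in the database
    n_d = len(doc_words)                             # total words in this document
    n_id = Counter(doc_words)                        # occurrences per word in doc d
    n_i = Counter(w for doc in corpus for w in doc)  # occurrences in whole database
    return [(n_id[w] / n_d) * math.log(N / n_i[w]) if n_i[w] else 0.0
            for w in vocabulary]

corpus = [["represent", "detect", "learn"],
          ["detect", "visual", "system"],
          ["learn", "recognize", "category"]]
vocab = sorted({w for doc in corpus for w in doc})
print(tfidf_vector(corpus[0], corpus, vocab))
```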
Inverted File – Index
Crawling stage:
Parsing all documents to create document-representing vectors
Creating word indices:
An entry for each word in the corpus, followed by a list of all documents (and positions in them) that contain it

[Index table: each word ID (1…K) maps to the list of document IDs (1…N) containing it]
Querying
1. Parse the query to create a query vector
Query: "Representation learning" → query vector = (1, 0, 1, 0, 0, …)
2. Retrieve the IDs of all documents containing at least one of the query word IDs (using the inverted file index)
3. Calculate the distance between the query and document vectors (angle between vectors)
4. Rank the results
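A compact sketch of steps 1-4: build the inverted file once, then use it to fetch only the candidate documents before scoring them by the angle between vectors (cosine similarity). The corpus and names are illustrative.

```python
import math
from collections import defaultdict

def build_inverted_index(corpus):
    """Map each word to the set of document IDs containing it (the inverted file)."""
    index = defaultdict(set)
    for doc_id, words in enumerate(corpus):
        for w in words:
            index[w].add(doc_id)
    return index

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def run_query(index, corpus, vocab, query_words):
    # Step 1: query vector; Step 2: candidate docs from the inverted index
    q = [query_words.count(w) for w in vocab]
    candidates = set().union(*(index[w] for w in query_words if w in index))
    # Steps 3-4: score only the candidates and rank, best first
    scored = [(cosine(q, [corpus[d].count(w) for w in vocab]), d) for d in candidates]
    return sorted(scored, reverse=True)

corpus = [["represent", "detect"], ["learn", "detect"], ["visual", "system"]]
vocab = sorted({w for doc in corpus for w in doc})
print(run_query(build_inverted_index(corpus), corpus, vocab, ["represent", "learn"]))
```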
Ranking the query results
1. Page Rank (PR)
Assume page A has pages T_1, T_2, …, T_n linking to it
Define C(X) as the number of links on page X; d is a weighting factor (0 ≤ d ≤ 1)

PR(A) = (1 - d) + d \sum_{i=1}^{n} \frac{PR(T_i)}{C(T_i)}

2. Word order
3. Font size, font type and more
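A small sketch of iterating the PR formula above to a fixed point, on a hypothetical three-page link graph with the commonly used d = 0.85; the graph and iteration count are illustrative.

```python
def pagerank(links, d=0.85, iters=50):
    """links[p] = set of pages that p links to; C(p) = len(links[p])."""
    pages = list(links)
    pr = {p: 1.0 for p in pages}
    for _ in range(iters):
        # PR(A) = (1 - d) + d * sum over pages T linking to A of PR(T)/C(T)
        pr = {a: (1 - d) + d * sum(pr[t] / len(links[t])
                                   for t in pages if a in links[t])
              for a in pages}
    return pr

# Hypothetical graph: A and B link to each other, C links to A
print(pagerank({"A": {"B"}, "B": {"A"}, "C": {"A"}}))
```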
The Visual Analogy

Text        Visual
Word        ???
Stem        ???
Document    Frame
Corpus      Film
Detecting "Visual Words"
"Visual word" = descriptor
What is a good descriptor?
Invariant to different viewpoints, scale, illumination, shift and transformation
Local versus global
How to build such a descriptor?
1. Finding invariant regions in the frame
2. Representing each region by a descriptor
Finding invariant regions
Two types of 'viewpoint covariant regions' are computed for each frame:
1. SA – Shape Adapted
2. MS – Maximally Stable

1. SA – Shape Adapted
Finding interest points using the Harris corner detector
Iteratively determining the ellipse center, scale and shape around the interest point
Reference – Baumberg

2. MS – Maximally Stable
Intensity watershed image segmentation
Iteratively determining the ellipse center, scale and shape
Reference – Matas
Why two types of detectors?
They are complementary representations of a frame:
SA regions tend to be centered on corner-like features
MS regions correspond to blobs of high contrast (such as a dark window on a gray wall)
Each detector describes a different "vocabulary" (e.g. the building design and the building specification)

MS – SA example
[Figure: MS regions in yellow, SA regions in cyan; zoomed view]
Building the Descriptors
SIFT – Scale Invariant Feature Transform
Each elliptical region is represented by a 128-dimensional vector [Lowe]
SIFT is invariant to a shift of a few pixels (which often occurs)

Building the Descriptors
Removing noise – tracking & averaging:
Regions are tracked across a sequence of frames using a "Constant Velocity Dynamical model"
Any region which does not survive for more than three frames is rejected
Descriptors throughout the tracks are averaged to improve SNR
Descriptors with large covariance are rejected
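A small sketch of the track-based noise removal just described: reject short tracks, average the survivors, and drop tracks whose descriptors vary too much. The data layout and the covariance threshold are assumptions for illustration.

```python
import numpy as np

def stable_descriptors(tracks, min_frames=3, max_cov_trace=5.0):
    """tracks: list of (T_i, 128) arrays, one SIFT descriptor per frame of a track."""
    keep = []
    for track in tracks:
        if len(track) <= min_frames:          # must survive more than three frames
            continue
        if np.trace(np.cov(track, rowvar=False)) > max_cov_trace:
            continue                          # large-covariance track: too noisy
        keep.append(track.mean(axis=0))       # average along the track (better SNR)
    return keep
```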
The Visual Analogy

Text        Visual
Word        ???
Stem        ???
Document    Frame
Corpus      Film
Building the "Visual Stems"
Cluster the descriptors into K groups using the K-means clustering algorithm
Each cluster represents a "visual word" in the "visual vocabulary"
Result:
10K SA clusters
16K MS clusters
K-Means Clustering
Input:
A set of n unlabeled examples D = {x_1, x_2, …, x_n} in a d-dimensional feature space
Find the partition of D into K non-empty, disjoint subsets D_1, …, D_K:

D = \bigcup_{j=1}^{K} D_j, \qquad D_i \cap D_j = \emptyset \ \ (i \neq j)

so that the points in each subset are coherent according to a certain criterion
K-means clustering - algorithm
Step 1: Initialize a partition of D
a. Randomly choose K equal-size sets and calculate their centers m_1, …, m_K
Example: D = {a, b, …, k, l}; n = 12; K = 4; d = 2
b. Every other point y is put into subset D_j if m_j is the closest of the K centers to y
D_1 = {a,c,l}; D_2 = {e,g}; D_3 = {d,h,i}; D_4 = {b,f,k}

Step 2: Repeat till no update
a. Compute the mean (mass center) m_j of each cluster D_j
b. Assign each x_i to the cluster with the closest center
D_1 = {a,c,l}; D_2 = {e,g}; D_3 = {d,h,i}; D_4 = {b,f,k}

K-means algorithm
Final result
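A self-contained numpy sketch of the two steps above: initialize centers, then alternate assignment and mean computation until there is no update. The data and K are illustrative.

```python
import numpy as np

def kmeans(X, K, seed=0, max_iters=100):
    """Basic K-means on X of shape (n, d); returns labels and centers."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=K, replace=False)]  # Step 1: initial centers
    labels = np.full(len(X), -1)
    for _ in range(max_iters):
        # assign every point to its closest center
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)
        if np.array_equal(new_labels, labels):
            break                                            # no update: converged
        labels = new_labels
        for j in range(K):                                   # Step 2a: mass centers
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(axis=0)
    return labels, centers

X = np.random.default_rng(1).normal(size=(12, 2))            # n=12 points, d=2
labels, centers = kmeans(X, K=4)
print(labels)
```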
K-means clustering - Cons
Sensitive to the selection of the initial grouping and the metric
Sensitive to the order of the input vectors
The number of clusters, K, must be determined beforehand
Each attribute has the same weight

K-means clustering - Resolution
Run with different groupings and orderings
Run for different K values
Problem? Complexity!
MS and SA "Visual Words"
[Figure: example SA and MS visual-word clusters]
The Visual Analogy

Text        Visual
Word        Descriptor
Stem        Centroid
Document    Frame
Corpus      Film
Visual "Stop List"
The most frequent visual words, which occur in almost all images, are suppressed
[Figures: visual word matches before and after applying the stop list]
Ranking Frames
1. Distance between vectors (as with words/documents)
2. Spatial consistency (= word order in the text)
Visual Google process
Preprocessing:
Vocabulary building
Crawling Frames
Creating Stop list
Querying
Building query vector
Ranking results
Vocabulary building
A subset of 48 shots is selected (10k frames = 10% of the movie) – the ground truth set
Region construction (SA + MS): 10k frames × ~1600 regions = 1.6E6 regions
SIFT descriptor representation
Tracking through the frames and rejecting unstable regions: 1.6E6 → ~200k regions
Clustering the descriptors using the K-means algorithm
Crawling Implementation
To reduce complexity, one keyframe per second is selected (100-150k frames → ~5k keyframes); these frames were not included in forming the clusters
Descriptors are computed for the stable regions in each keyframe
Mean values are computed using two frames on each side of the keyframe
Vocabulary: vector quantization – each descriptor is assigned to its nearest cluster center (the clusters found from the ground truth set) using the nearest neighbor algorithm
Crawling movies summary
1. Keyframe selection (→ 5k frames)
2. Region construction (SA + MS)
3. SIFT descriptor representation
4. Frame tracking
5. Rejecting unstable regions
6. Nearest neighbor vector quantization
7. Stop list
8. tf-idf weighting
9. Indexing
"Google like" Query
1. Object query: generate the query descriptors
2. Use the nearest neighbor algorithm to build the query vector
3. Use the inverted index to find the relevant frames (doc vectors are sparse → small set)
4. Calculate the distance to the relevant frames
5. Rank the results
~0.1 seconds with a Matlab implementation
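A sketch of the query side, assuming the K-means cluster centers are available as the vocabulary: quantize the query region's descriptors to visual-word IDs by nearest neighbor, then use an inverted index from word to frames. Names and sizes are illustrative, not the paper's code.

```python
import numpy as np

def quantize(descriptors, centers):
    """Nearest-neighbor vector quantization: descriptor -> visual-word ID."""
    dists = np.linalg.norm(descriptors[:, None, :] - centers[None, :, :], axis=2)
    return dists.argmin(axis=1)

def candidate_frames(query_desc, centers, frame_index):
    """frame_index: visual-word ID -> set of frames containing that word.
    Sparse doc vectors mean only frames sharing a word need distance scoring."""
    words = set(quantize(query_desc, centers))
    return words, set().union(*(frame_index.get(w, set()) for w in words))

# Toy usage: a 3-word vocabulary in 128-D and a hypothetical 4-frame index
rng = np.random.default_rng(0)
centers = rng.normal(size=(3, 128))
frame_index = {0: {0, 2}, 1: {1}, 2: {2, 3}}
print(candidate_frames(rng.normal(size=(5, 128)), centers, frame_index))
```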
Experimental results
The experiment was conducted in two stages:
1. Scene location matching
2. Object retrieval

Scene Location matching
Goal:
Evaluate the method by matching scene locations within a closed world of shots (= the 'ground truth set')
Ground truth set:
164 frames, from 48 shots, taken at 19 3D locations in the movie 'Run Lola Run' (4-9 frames from each location)
There are significant viewpoint changes between frames of the same location
Ground Truth Set
Location matching
The entire frame is used as the query region
The performance is measured over all 164 frames
The correct results were determined by hand
Rank calculation
Location matching rank:

Rank = \frac{1}{N \cdot N_{rel}} \left( \sum_{i=1}^{N_{rel}} R_i - \frac{N_{rel}(N_{rel}+1)}{2} \right)

Rank – ordering quality (0 ≤ Rank ≤ 1); 0 is best
N – the size of the image set (164)
N_rel – the number of relevant images
R_i – the position of the i-th relevant image in the result (1 ≤ R_i ≤ N)
\sum_{i=1}^{N_{rel}} R_i = \frac{N_{rel}(N_{rel}+1)}{2} if all the relevant images are returned first, giving Rank = 0
Location matching - Example
Frame 6 is the current query frame
Frames 13, 17, 29, 135 contain the same scene location, so N_rel = 5
The result was: {17, 29, 6, 142, 19, 135, 13, …}

Frame number   6   13   17   29   135   Total
Query Rank     3    7    1    2     6    19
Best Rank      1    2    3    4     5    15

Rank = \frac{1}{164 \cdot 5} (19 - 15) = 0.00487
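A tiny check of the worked example using the Rank formula above (pure Python; the retrieval order is the one from the slide):

```python
def rank_score(result_order, relevant, N):
    """Rank = (sum of relevant positions - best possible sum) / (N * N_rel)."""
    n_rel = len(relevant)
    positions = [result_order.index(f) + 1 for f in relevant]  # 1-based R_i
    best = n_rel * (n_rel + 1) // 2                            # 1 + 2 + ... + N_rel
    return (sum(positions) - best) / (N * n_rel)

result = [17, 29, 6, 142, 19, 135, 13]      # retrieval order from the slide
relevant = [6, 13, 17, 29, 135]             # frames showing the queried location
print(rank_score(result, relevant, N=164))  # -> 0.00487...
```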
[Figure: rank of the relevant frames for frames 61-64]
Object retrieval
Goal:
Searching for objects throughout the entire movie
The object of interest is specified by the user as a sub-part of any frame
Object query results (1)
Run Lola Run results

Object query results (2)
The expressive power of the visual vocabulary:
The visual words learnt for 'Run Lola Run' are used unchanged for the 'Groundhog Day' retrieval!
Groundhog Day results

Analysis:
Both the actual frames returned and the ranking are excellent
No frames containing the object are missed (no false negatives)
The highly ranked frames all contain the object (good precision)
Google Performance Analysis vs. Object Matching
Q – number of queried descriptors (~10^2)
M – number of descriptors per frame (~10^3)
N – number of key frames per movie (~10^4)
D – descriptor dimension (128 ~ 10^2)
K – number of "words" in the vocabulary (16×10^3)
α – ratio of documents that contain at least one of the Q "words" (~0.1)

Brute force NN: cost = Q·M·N·D ~ 10^11
Google: query vector quantization + distance calculation
= Q·K·D + K·N
→ (sparse doc vectors) Q·K·D + Q·(α·N) ~ 10^7 + 10^5

Improvement factor: ~10^4 to 10^6
Video Google Summary
Immediate run-time object retrieval
Visual word and vocabulary analogy
Modular framework
Demonstration of the expressive power of the visual vocabulary
Open issues:
Automatic ways of building the vocabulary are needed
A method for ranking retrieval results, as Google does
Extension to non-rigid objects, like faces
Future thoughts
Using this method for higher-level analysis of movies:
Finding the content of a movie by the "words" it contains
Finding the important objects (e.g. a star) in a movie
Finding the location of unrecognized video frames
More?
What is the meaning of the word Google?
$1 Million!!!
a. The number 1E10  b. Very big data  c. The number 1E100  d. A simple clean search
References
1. Sivic, J. and Zisserman, A. Video Google: A Text Retrieval Approach to Object Matching in Videos. In Proceedings of the International Conference on Computer Vision, 2003.
2. S. Brin and L. Page. The anatomy of a large-scale hypertextual web search engine. In 7th Int. WWW Conference, 1998.
3. K. Mikolajczyk and C. Schmid. An affine invariant interest point detector. In Proc. ECCV. Springer-Verlag, 2002.
4. A. Frome, D. Huber, R. Kolluri, T. Bulow, and J. Malik. Recognizing Objects in Range Data Using Regional Point Descriptors. In Proc. European Conference on Computer Vision, Prague, Czech Republic, 2004.
5. D. Lowe. Object recognition from local scale-invariant features. In Proc. ICCV, pages 1150-1157, 1999.
6. F. Schaffalitzky and A. Zisserman. Automated Location Matching in Movies.
7. J. Matas, O. Chum, M. Urban, and T. Pajdla. Robust wide baseline stereo from maximally stable extremal regions. In Proceedings of the British Machine Vision Conference, pages 384-393, 2002.
Parameter tuning
K – number of clusters for each region type
The initial cluster center values
Minimum tracking length for stable features
The proportion of unstable descriptors to reject, based on their covariance
Locality-Sensitive Hashing (LSH)
Divide the high-dimensional feature space into hypercubes, using k randomly chosen axis-parallel hyperplanes
Each hypercube is a hash bucket
The probability that two nearby points are separated is reduced by independently choosing l different sets of k hyperplanes
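A minimal sketch of this bucketing scheme, assuming data scaled to [0, 1] and cut thresholds drawn uniformly at random; real LSH implementations differ in how the hash family is chosen.

```python
import numpy as np
from collections import defaultdict

class AxisParallelLSH:
    """l hash tables, each cutting space with k random axis-parallel hyperplanes."""
    def __init__(self, dim, k=4, l=3, seed=0):
        rng = np.random.default_rng(seed)
        self.axes = rng.integers(0, dim, size=(l, k))   # coordinate each plane cuts
        self.thresholds = rng.random(size=(l, k))       # where it cuts (data in [0,1])
        self.tables = [defaultdict(list) for _ in range(l)]

    def _key(self, x, t):
        # bit pattern: which side of each of the k hyperplanes x falls on
        return tuple(x[self.axes[t]] > self.thresholds[t])

    def insert(self, idx, x):
        for t in range(len(self.tables)):
            self.tables[t][self._key(x, t)].append(idx)

    def candidates(self, q):
        """Union of q's buckets; l tables reduce the chance of missing a near point."""
        out = set()
        for t in range(len(self.tables)):
            out.update(self.tables[t][self._key(q, t)])
        return out

# Toy usage: index 100 random points in 8-D, then query one of them
rng = np.random.default_rng(1)
pts = rng.random(size=(100, 8))
lsh = AxisParallelLSH(dim=8, k=4, l=3)
for i, p in enumerate(pts):
    lsh.insert(i, p)
print(lsh.candidates(pts[0]))                 # candidate set containing index 0
```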
ε-Nearest Neighbor Search
Also called approximate nearest neighbor searching
Reports nearest neighbors to the query point q with distances possibly greater than the true nearest neighbor distance:

d(q, p) \leq (1 + \epsilon) \, d(q, P)

ε – the maximum allowed 'error'
d(q, P) – the distance from q to the closest point in the data set P
p – the member of P that is retrieved (or not)
d(q, p) – the distance between p and q in Euclidean space:

d(q, p) = \left( \sum_i (x_i - y_i)^2 \right)^{1/2}
ε-Nearest Neighbor Search
Goal:
The goal is not to get the exact answer, but a good approximate answer
There are many applications of nearest neighbor search where an approximate answer is good enough
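As a concrete instance, SciPy's KD-tree exposes exactly this trade-off through the eps argument of cKDTree.query (a real API); with eps > 0 the returned point is guaranteed to be within a factor (1 + eps) of the true nearest distance. The data here is random, for illustration only.

```python
import numpy as np
from scipy.spatial import cKDTree

rng = np.random.default_rng(0)
P = rng.random((10_000, 16))                 # database points
q = rng.random(16)                           # query point

tree = cKDTree(P)
d_exact, _ = tree.query(q, k=1, eps=0.0)     # exact nearest neighbor
d_approx, _ = tree.query(q, k=1, eps=0.5)    # guaranteed: d_approx <= 1.5 * d_exact
print(d_exact, d_approx)
```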
ε-Nearest Neighbor Search
What is currently out there?
Arya and Mount presented an algorithm:
Query time: O(exp(d) · ε^(-d) · log n)
Pre-processing: O(n log n)
Clarkson improved the dependence on ε: exp(d) · ε^(-(d-1)/2)
Both grow exponentially with d
ε-Nearest Neighbor Search
Striking observation:
The "brute force" algorithm provides a faster query time in high dimensions
It simply computes the distance from the query to every point in P
Analysis: O(dn)
Arya and Mount:
"…if the dimension is significantly larger than log n (as it is for a number of practical instances), there are no approaches we know of that are significantly faster than brute-force search"
High Dimensions
What is the problem?
Many applications of nearest neighbor (NN) search have a high number of dimensions
Current algorithms do not perform much better than a brute-force linear search
Much work has been done on dimension reduction
Dimension Reduction
Principal Component Analysis (PCA):
Transforms a number of correlated variables into a smaller number of uncorrelated variables
Can anyone explain this further?
Latent Semantic Indexing (LSI):
Used in the document indexing process
Looks at the entire document, to see which other documents contain some of the same words
Descriptor based Object matching - Complexity
Finding, for each object descriptor, the nearest descriptor in the model can be a costly operation:

\min_{m \in \{1,\dots,M\}} dist(q_k, p_m)

Descriptor dimension ~ 1E2
1000 object descriptors
1E6 descriptors per model
56 models
Brute force nearest neighbor: 1E3 × 1E6 × 56 × 1E2 ~ 1E12 operations