
Approximate Nearest Neighbor -

Applications to Vision & Matching

Lior Shoval

Rafi Haddad

Approximate Nearest Neighbor

Applications to Vision & Matching

1. Object matching in 3D – Recognizing cars in cluttered scanned images
(A. Frome, D. Huber, R. Kolluri, T. Bulow, and J. Malik)

2. Video Google – A Text Retrieval Approach to Object Matching in Videos
(Sivic, J. and Zisserman, A.)

Object Matching

Input:

An object and a dataset of models

Output:

The most "similar" model

Two methods will be presented:

1. Voting-based method
2. Cost-based method

[Diagram: query object S_q compared against models S_1, S_2, …, S_n]

Descriptor-based object matching – Voting

Every object descriptor votes for the model that contains its closest descriptor

Choose the model with the most votes

Problem:

The hard vote discards the relative distances between descriptors

[Diagram: query object S_q compared against models S_1, S_2, …, S_n]
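A minimal NumPy sketch of this hard-vote scheme (illustrative array names and helper, not the paper's code):

```python
import numpy as np

def match_by_voting(object_desc, models):
    """Hard voting: each object descriptor votes for the model that
    contains its single closest descriptor (Euclidean distance).

    object_desc: (K, d) array of query descriptors
    models:      list of (M_i, d) arrays, one per candidate model
    """
    votes = np.zeros(len(models), dtype=int)
    for q in object_desc:
        # distance from q to the closest descriptor of every model
        closest = [np.linalg.norm(m - q, axis=1).min() for m in models]
        votes[int(np.argmin(closest))] += 1          # one hard vote
    return int(np.argmax(votes)), votes              # winner, vote counts
```

Note how only the argmin survives each iteration; the actual distance values are thrown away, which is exactly the problem noted above.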

Descriptor-based object matching – Cost

Compare all object descriptors to all target model descriptors:

$$\mathrm{cost}(S_q, S_i) = \sum_{k \in \{1,\dots,K\}} \min_{m \in \{1,\dots,M\}} \mathrm{dist}(q_k, p_m)$$

[Diagram: query object S_q compared against models S_1, S_2, …, S_n]
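The cost function above translates directly into a few lines of NumPy (a sketch under the same assumed array layout as before):

```python
import numpy as np

def matching_cost(object_desc, model_desc):
    """cost(S_q, S_i): for every object descriptor q_k, take the distance
    to its closest model descriptor p_m, and sum over k."""
    # pairwise distances, shape (K, M)
    dists = np.linalg.norm(object_desc[:, None, :] - model_desc[None, :, :], axis=2)
    return dists.min(axis=1).sum()

def match_by_cost(object_desc, models):
    """The model with the lowest total cost wins; unlike hard voting,
    relative distances now contribute to the score."""
    costs = [matching_cost(object_desc, m) for m in models]
    return int(np.argmin(costs)), costs
```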

Application to car matching

Matching – Nearest Neighbor

To match the object to the right model, a nearest-neighbor (NN) algorithm is used

Every descriptor in the object is compared to all descriptors in the model

The computational cost is very high

Experiment 1 – Model matching
Experiment 2 – Cluttered scenes

Matching – Nearest Neighbor

Example:

Q – 160 descriptors in the object
N – 83,640 reference descriptors × 12 rotations ≈ 1E6 descriptors in the models

Exact NN takes 7.4 sec per object descriptor on a 2.2 GHz processor

Speeding up the search with LSH

Fast search techniques such as LSH (locality-sensitive hashing) can reduce the search space by an order of magnitude

There is a tradeoff between speed and accuracy

LSH divides the high-dimensional feature space into hypercubes using k randomly chosen axis-parallel hyperplanes; l independently chosen sets of hyperplanes reduce the chance that nearby points are separated

LSH – k=4; l=1

LSH – k=4; l=2

LSH – k=4; l=3
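A toy sketch of the scheme the figures above illustrate (my own illustrative code, not the authors'): each of the l tables hashes a point by which side of its k axis-parallel hyperplanes the point falls on, and a query scans only the union of its buckets.

```python
import numpy as np
from collections import defaultdict

class AxisParallelLSH:
    """Toy LSH index: l hash tables, each defined by k randomly chosen
    axis-parallel hyperplanes (a random axis plus a random threshold)."""

    def __init__(self, points, k=4, l=3, seed=0):
        rng = np.random.default_rng(seed)
        lo, hi = points.min(axis=0), points.max(axis=0)
        self.points = points
        self.tables = []                       # [(axes, thresholds, buckets)]
        for _ in range(l):
            axes = rng.integers(0, points.shape[1], size=k)
            thr = rng.uniform(lo[axes], hi[axes])
            buckets = defaultdict(list)
            for idx, p in enumerate(points):
                buckets[self._key(p, axes, thr)].append(idx)
            self.tables.append((axes, thr, buckets))

    @staticmethod
    def _key(p, axes, thr):
        return tuple((p[axes] > thr).astype(int))   # side of each hyperplane

    def query(self, q):
        """Merge the query's buckets across the l tables, then do an exact
        scan of that (much smaller) candidate set."""
        cand = set()
        for axes, thr, buckets in self.tables:
            cand.update(buckets.get(self._key(q, axes, thr), ()))
        if not cand:
            return None                        # approximate: may miss the true NN
        idx = np.fromiter(cand, dtype=int)
        return idx[np.linalg.norm(self.points[idx] - q, axis=1).argmin()]
```

Increasing l lowers the chance that the true neighbor is separated from the query in every table, at the price of more candidates to scan: the speed/accuracy tradeoff noted above.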

LSH – Results

Taking the best 80 of 160 descriptors

Close results are achieved with fewer descriptors

Descriptor-based object matching – Reducing complexity

Approximate nearest neighbor:

1. Divide the problem into two stages: preprocessing and querying
2. Locality-Sensitive Hashing (LSH)

Or...

Video Google

A Text Retrieval Approach to Object Matching in Videos

Query

Results

Interesting facts about Google

The most-used search engine on the web

Who wants to be a Millionaire?

How many pages does Google search?
a. Around half a billion
b. Around 4 billion
c. Around 10 billion
d. Around 50 billion

How many machines does Google use?
a. 10
b. A few hundred
c. A few thousand
d. Around a million

Video Google: On-line Demo

Samples

Run Lola Run:

Supermarket logo (Bolle) – frame/shot 72325 / 824
Red cube logo – entry frame/shot 15626 / 174
Roulette #20 – frame/shot 94951 / 988

Groundhog Day:

Bill Murray's ties – frame/shot 53001 / 294 and frame/shot 40576 / 208
Phil's home – entry frame/shot 34726 / 172

Query

Occluded!!!

Video Google

Text Google

Analogy from text to video

Video Google processes

Experimental results

Summary and analysis

Text retrieval overview

Word & Document

Vocabulary

Weighting

Inverted file

Ranking

Words & Documents

Documents are parsed into words

Common words (the, an, etc.) are ignored; this is called a 'stop list'

Words are represented by their stems:
'walk', 'walking', 'walks' → 'walk'

Each word is assigned a unique identifier

A document is represented by a vector whose components are the frequencies of occurrence of the words it contains
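A minimal sketch of this pipeline (the stop list and the crude suffix-stripping stemmer below are illustrative stand-ins for real ones such as Porter's):

```python
from collections import Counter

# toy stop list; a real system would use a much larger one
STOP_LIST = {"the", "an", "a", "and", "are", "that", "to", "be", "in", "for"}

def stem(word):
    # crude suffix stripping, for illustration only
    for suffix in ("ing", "ation", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def document_vector(text, vocabulary):
    """Parse a document into stems, drop stop-list words, and return a
    K-component vector of word frequencies (K = len(vocabulary))."""
    words = [stem(w.strip(".,;:!?\"'")) for w in text.lower().split()]
    counts = Counter(w for w in words if w and w not in STOP_LIST)
    return [counts.get(term, 0) for term in vocabulary]

# 'walk', 'walking', 'walks' all land on the stem 'walk'
print(document_vector("He walks the walk, walking daily.", ["walk", "daily"]))
```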

Vocabulary

The vocabulary contains K words

Each document is represented by a K-component vector of word frequencies:

(0, 0, …, 3, …, 4, …, 5, 0, 0)

Example:

"… Representation, detection and learning are the main issues that need to be tackled in designing a visual system for recognizing object categories …"

After parsing and cleaning:

"represent detect learn main issue tackle design visual system recognize category"

Creating the document vector

Assign a unique ID to each word

Create a document vector of size K with word frequencies:

(3, 7, 2, …) / 789

Or, compactly, with the original order and position:

ID   Word        Position
1    represent   1, 12, 55
2    detect      2, 32, 44, …
3    learn       3, 11
…    …           …
     Total:      789

Weighting

The vector components are weighted in various ways:

Naive – the frequency of each word

Binary – 1 if the word appears, 0 if not

tf-idf – 'Term Frequency – Inverse Document Frequency':

$$t_i = \frac{n_{id}}{n_d} \, \log \frac{N}{n_i}$$

tf-idf Weighting

$$V_d = (t_1, \dots, t_i, \dots, t_k)^T, \qquad t_i = \frac{n_{id}}{n_d} \, \log \frac{N}{n_i}$$

n_id – the number of occurrences of word i in document d
n_d – the total number of words in the document
N – the number of documents in the whole database
n_i – the number of occurrences of term i in the whole database

=> t_i = "word frequency" × "inverse document frequency"

=> The normalization by n_d makes all documents equal, regardless of length!
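The weighting is a one-liner per term; a sketch using the slide's notation (n_id, n_d, N, n_i):

```python
import math

def tfidf_vector(doc_counts, doc_len, corpus_counts, n_docs):
    """Compute t_i = (n_id / n_d) * log(N / n_i) for every term i.

    doc_counts[i]    = n_id : occurrences of word i in this document
    doc_len          = n_d  : total number of words in the document
    corpus_counts[i] = n_i  : occurrences of term i in the whole database
    n_docs           = N    : number of documents in the database
    """
    return [
        (n_id / doc_len) * math.log(n_docs / n_i) if n_i > 0 else 0.0
        for n_id, n_i in zip(doc_counts, corpus_counts)
    ]

# usage: 3 occurrences out of 789 words; term seen twice across 10 documents
print(tfidf_vector([3], 789, [2], 10))
```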

Inverted File – Index

Crawling stage:

All documents are parsed to create the document-representing vectors

Creating the word indices:

An entry for each word in the corpus, followed by a list of all the documents (and the positions within them) that contain it

[Diagram: inverted index – each word ID (1 … K) points to the list of document IDs (1 … N) containing it]
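A sketch of such an index in Python (illustrative data layout, not Google's): each word ID maps to its posting list of (document ID, position) pairs.

```python
from collections import defaultdict

def build_inverted_index(documents):
    """documents: {doc_id: ordered list of word ids}.
    Returns {word_id: [(doc_id, position), ...]}, so a query word maps
    directly to every document (and position) that contains it."""
    index = defaultdict(list)
    for doc_id, word_ids in documents.items():
        for pos, word_id in enumerate(word_ids):
            index[word_id].append((doc_id, pos))
    return index

# usage: candidate documents for a query are the union of the posting lists
index = build_inverted_index({1: [3, 7, 3], 2: [7, 9], 3: [9, 9, 2]})
candidates = {doc for w in (3, 9) for doc, _ in index[w]}
print(candidates)   # {1, 2, 3}
```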

Querying

1. Parse the query to create the query vector
   Query: "Representation learning" → query vector = (1, 0, 1, 0, 0, …)
2. Retrieve the IDs of all documents containing any of the query word IDs (using the inverted file index)
3. Calculate the distance between the query and document vectors, i.e. the angle between the vectors (see the sketch below)
4. Rank the results
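A minimal sketch of step 3's angle-based distance:

```python
import math

def cosine_distance(u, v):
    """1 - cos(angle) between two term vectors; 0 = same direction."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (nu * nv) if nu and nv else 1.0

def rank_documents(query_vec, doc_vecs):
    """doc_vecs: {doc_id: vector}; returns doc ids, best match first."""
    return sorted(doc_vecs, key=lambda d: cosine_distance(query_vec, doc_vecs[d]))

print(rank_documents([1, 0, 1], {7: [2, 0, 1], 8: [0, 5, 0]}))   # [7, 8]
```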

Ranking the query results

1. PageRank (PR)

Assume pages T_1, T_2, …, T_n link to page A
Define C(X) as the number of links going out of page X
d is a weighting factor (0 ≤ d ≤ 1)

$$PR(A) = (1 - d) + d \sum_{i=1}^{n} \frac{PR(T_i)}{C(T_i)}$$

2. Word order

3. Font size, font type and more
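A compact iterative sketch of this recurrence (toy link graph, not a production crawler; this is the original non-normalized form of the formula above):

```python
def pagerank(links, d=0.85, iters=50):
    """links: {page: [pages it links to]}. Iterates
    PR(A) = (1 - d) + d * sum(PR(T_i) / C(T_i)) over pages T_i linking to A."""
    pages = set(links) | {p for outs in links.values() for p in outs}
    pr = {p: 1.0 for p in pages}
    for _ in range(iters):
        new = {p: 1.0 - d for p in pages}
        for page, outs in links.items():
            if outs:                       # C(page) = number of outgoing links
                share = d * pr[page] / len(outs)
                for target in outs:
                    new[target] += share
        pr = new
    return pr

# toy usage: three pages linking in a cycle, plus one extra in-link to A
print(pagerank({"A": ["B"], "B": ["C"], "C": ["A"], "D": ["A"]}))
```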

The Visual Analogy

Text        Visual
Word        ???
Stem        ???
Document    Frame
Corpus      Film

Detecting "Visual Words"

"Visual word" → descriptor

What is a good descriptor?

Invariant to different viewpoints, scale, illumination, shift and transformation

Local versus global

How to build such a descriptor?

1. Find invariant regions in the frame
2. Represent each region by a descriptor

Finding invariant regions

Two types of 'viewpoint covariant regions' are computed for each frame:

1. SA – Shape Adapted
2. MS – Maximally Stable

1. SA – Shape Adapted

Interest points are found using the Harris corner detector

The ellipse center, scale and shape around each interest point are determined iteratively

Reference – Baumberg

2. MS – Maximally Stable

Intensity watershed image segmentation

The ellipse center, scale and shape are determined iteratively

Reference – Matas

Why two types of detectors?

They are complementary representations of a frame:

SA regions tend to be centered on corner-like features

MS regions correspond to blobs of high contrast (such as a dark window on a gray wall)

Each detector describes a different "vocabulary" (e.g. the building design and the building specification)

MS – SA example

MS – yellow, SA – cyan (zoomed view)

Building the Descriptors

SIFT – Scale Invariant Feature Transform

Each elliptical region is represented by a 128-dimensional vector [Lowe]

SIFT is invariant to a shift of a few pixels (which often occurs)

Building the Descriptors

Removing noise – tracking & averaging:

Regions are tracked across a sequence of frames using a "constant velocity dynamical model"

Any region which does not survive for more than three frames is rejected

Descriptors throughout a track are averaged to improve the SNR

Descriptors with large covariance are rejected

The Visual Analogy

Text        Visual
Word        Descriptor
Stem        ???
Document    Frame
Corpus      Film

Building the “ Visual Stems ”

Descriptors are clustered into K groups using the K-means clustering algorithm

Each cluster represents a "visual word" in the "visual vocabulary"

Result:

10K SA clusters
16K MS clusters

K-Means Clustering

Input:

A set of n unlabeled examples D = {x_1, x_2, …, x_n} in a d-dimensional feature space

Goal:

Find a partition of D into K non-empty, disjoint subsets D_1, …, D_K:

$$D = \bigcup_{j=1}^{K} D_j, \qquad D_i \cap D_j = \emptyset \;\; (i \neq j)$$

each represented by its mean $m_j = \frac{1}{|D_j|} \sum_{x \in D_j} x$, so that the points in each subset are coherent according to a certain criterion

K-means clustering – algorithm

Step 1: Initialize a partition of D

a. Randomly choose K equal-size sets and calculate their centers m_1, …, m_K

Example: D = {a, b, …, k, l}; n = 12; K = 4; d = 2

K-means clustering – algorithm

Step 1: Initialize a partition of D

b. Every other point y is put into subset D_j if m_j is the closest of the K centers to y

D1 = {a, c, l}; D2 = {e, g}; D3 = {d, h, i}; D4 = {b, f, k}

K-means clustering – algorithm

Step 2: Repeat until there is no update

a. Compute the mean (mass center) m_j of each cluster D_j
b. Assign each x_i to the cluster with the closest center

D1 = {a, c, l}; D2 = {e, g}; D3 = {d, h, i}; D4 = {b, f, k}

K-means algorithm

Final result
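The two steps above fit in a short NumPy sketch (illustrative, with random initialization instead of the equal-size split of step 1a):

```python
import numpy as np

def kmeans(X, K, iters=100, seed=0):
    """Plain K-means: assign each point to the closest center (step 2b),
    recompute each cluster's mass center (step 2a), stop when nothing moves."""
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=K, replace=False)].copy()
    labels = None
    for _ in range(iters):
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)
        if labels is not None and np.array_equal(new_labels, labels):
            break                                  # no update -> converged
        labels = new_labels
        for j in range(K):
            if np.any(labels == j):                # keep old center if emptied
                centers[j] = X[labels == j].mean(axis=0)
    return labels, centers

# usage: 12 two-dimensional points into K=4 clusters, as in the toy example
pts = np.array([[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5],
                [0, 9], [1, 9], [0, 8], [9, 0], [9, 1], [8, 0]])
labels, centers = kmeans(pts, K=4)
```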

K-means clustering – Cons

Sensitive to the selection of the initial grouping and the metric

Sensitive to the order of the input vectors

The number of clusters, K, must be determined beforehand

Each attribute has the same weight

K-means clustering – Resolution

Run with different groupings and orderings

Run with different K values

The problem? Complexity!

MS and SA "Visual Words"

[Figure: example patches for SA and MS visual words]

The Visual Analogy

Text        Visual
Word        Descriptor
Stem        Centroid
Document    Frame
Corpus      Film

Visual “ Stop List ”

The most frequent visual words, which occur in almost all images, are suppressed

[Figure: matches before and after applying the stop list]

Ranking Frames

1. Distance between vectors (as with words/documents)
2. Spatial consistency (= word order in text)

Visual Google process

Preprocessing:

Vocabulary building
Crawling the frames
Creating the stop list

Querying:

Building the query vector
Ranking the results

Vocabulary building

1. A subset of 48 shots is selected: 10k frames ≈ 10% of the movie
2. Region construction (SA + MS): 10k frames × 1600 regions ≈ 1.6E6 regions
3. SIFT descriptor representation
4. Frame tracking and rejection of unstable regions: 1.6E6 → ~200k regions
5. Descriptor clustering using the K-means algorithm

Parameter tuning is done with the ground-truth set

Crawling Implementation

To reduce complexity, one keyframe per second is selected (100-150k frames → ~5k keyframes); these keyframes were not included in forming the clusters

Descriptors are computed for the stable regions in each keyframe

Mean values are computed using two frames on each side of the keyframe

Vocabulary: vector quantization using the nearest-neighbor algorithm (against the cluster centers found from the ground-truth set)

Crawling movies summary

1. Keyframe selection: 5k frames
2. Region construction (SA + MS)
3. SIFT descriptor representation
4. Frame tracking
5. Rejection of unstable regions
6. Nearest-neighbor vector quantization
7. Stop list
8. tf-idf weighting
9. Indexing

"Google-like" query

1. Generate the query object's descriptors
2. Use the nearest-neighbor algorithm to build the query vector (see the sketch below)
3. Use the inverted index to find the relevant frames (document vectors are sparse → only a small set must be scored)
4. Calculate the distance to the relevant frames
5. Rank the results

Takes 0.1 seconds with a Matlab implementation
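A sketch of steps 1-2, assuming the cluster centers from the vocabulary-building stage are available as a (K, 128) array (illustrative names, not the authors' code):

```python
import numpy as np
from collections import Counter

def quantize(descriptors, centers):
    """Map each descriptor (e.g. a 128-D SIFT vector) to the id of its
    nearest cluster center: its 'visual word' (step 2 above)."""
    d = np.linalg.norm(descriptors[:, None, :] - centers[None, :, :], axis=2)
    return d.argmin(axis=1)

def sparse_query_vector(word_ids, stop_list=frozenset()):
    """Sparse term-frequency vector {word_id: count}; stop words removed,
    so only frames sharing at least one surviving word need be scored."""
    return Counter(int(w) for w in word_ids if int(w) not in stop_list)
```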

Experimental results

The experiment was conducted in two stages:

Scene location matching

Object retrieval

Scene Location Matching

Goal:

Evaluate the method by matching scene locations within a closed world of shots (= the 'ground truth set')

Tune the system parameters

Ground truth set

164 frames, from 48 shots, were taken at 19 3D locations in the movie 'Run Lola Run' (4-9 frames from each location)

There are significant viewpoint changes between frames of the same location

Ground Truth Set

Location matching

The entire frame is used as the query region

Performance is measured over all 164 frames

The correct results were determined by hand

Location matching – Rank calculation

$$\mathrm{Rank} = \frac{1}{N \, N_{rel}} \left( \sum_{i=1}^{N_{rel}} R_i - \frac{N_{rel}(N_{rel}+1)}{2} \right)$$

Rank – ordering quality (0 ≤ Rank ≤ 1); 0 is best

N_rel – the number of relevant images

N – the size of the image set (164)

R_i – the position of the i-th relevant image in the result list (1 ≤ R_i ≤ N)

Rank = 0 if all the relevant images are returned first

Location matching – Example

– Frame 6 is the current query frame
– Frames 13, 17, 29, 135 contain the same scene location, so N_rel = 5 (including the query frame itself)
– The result was: {17, 29, 6, 142, 19, 135, 13, …}

Frame number    6   13   17   29   135   Total
Position R_i    3    7    1    2     6      19

Best possible sum of positions: N_rel(N_rel + 1)/2 = 15

$$\mathrm{Rank} = \frac{1}{164 \cdot 5} \,(19 - 15) = 0.00487$$

[Figure: rank of relevant frames, frames 61-64]
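The computation above as a small helper (reproducing the 0.00487 of the worked example):

```python
def normalized_rank(result_order, relevant, n_total):
    """Rank = (sum of positions of the relevant frames - best possible sum)
              / (N * N_rel); 0 when all relevant frames are returned first."""
    n_rel = len(relevant)
    positions = [i + 1 for i, f in enumerate(result_order) if f in relevant]
    assert len(positions) == n_rel, "all relevant frames must appear in the results"
    return (sum(positions) - n_rel * (n_rel + 1) / 2) / (n_total * n_rel)

order = [17, 29, 6, 142, 19, 135, 13]                      # truncated result list
print(normalized_rank(order, {6, 13, 17, 29, 135}, 164))   # 0.004878...
```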

Object retrieval

Goal:

Search for objects throughout the entire movie

The object of interest is specified by the user as a sub-part of any frame

Object query results (1)

Run Lola Run results

Object query results (2)

• The expressive power of the visual vocabulary

The visual words learnt on 'Run Lola Run' are used unchanged for the 'Groundhog Day' retrieval!

Groundhog Day results

Object query results (2)

Analysis:

Both the actual frames returned and their ranking are excellent

No frames containing the object are missed (no false negatives)

The highly ranked frames all contain the object (good precision)

Google Performance Analysis vs. Object Matching

Q – number of query descriptors (~10^2)
M – number of descriptors per frame (~10^3)
N – number of keyframes per movie (~10^4)
D – descriptor dimension (128 ≈ 10^2)
K – number of "words" in the vocabulary (16×10^3)
α – fraction of documents that contain at least one of the Q "words" (~0.1)

Brute-force NN: cost = Q·M·N·D ≈ 10^11

Google: query vector quantization + distance calculation = Q·K·D + K·N

Exploiting sparseness: Q·K·D + Q·(αN) ≈ 10^7 + 10^5

Improvement factor: ~10^4 to 10^6

Video Google Summary

Immediate run-time object retrieval

Visual word and vocabulary analogy

Modular framework

Demonstration of the expressive power of the visual vocabulary

Open issues

Automatic ways of building the vocabulary are needed

A method for ranking the retrieval results, as Google does

Extension to non-rigid objects, like faces

Future thoughts

Using this method for higher-level analysis of movies:

Finding the content of a movie by the "words" it contains

Finding the important objects (e.g. a star) in a movie

Finding the location of unrecognized video frames

More?

What is the meaning of the word Google?

$1 Million!!!

a. The number 1E10
b. Very big data
c. The number 1E100
d. A simple clean search

Reference

1. Sivic, J. and Zisserman, A. Video Google: A Text Retrieval Approach to Object Matching in Videos. In Proceedings of the International Conference on Computer Vision, 2003.
2. S. Brin and L. Page. The anatomy of a large-scale hypertextual web search engine. In 7th Int. WWW Conference, 1998.
3. K. Mikolajczyk and C. Schmid. An affine invariant interest point detector. In Proc. ECCV, Springer-Verlag, 2002.
4. A. Frome, D. Huber, R. Kolluri, T. Bulow, and J. Malik. Recognizing Objects in Range Data Using Regional Point Descriptors. In Proc. European Conference on Computer Vision, Prague, Czech Republic, 2004.
5. D. Lowe. Object recognition from local scale-invariant features. In Proc. ICCV, pages 1150-1157, 1999.
6. F. Schaffalitzky and A. Zisserman. Automated Location Matching in Movies.
7. J. Matas, O. Chum, M. Urban, and T. Pajdla. Robust wide baseline stereo from maximally stable extremal regions. In Proceedings of the British Machine Vision Conference, pages 384-393, 2002.

Parameter tuning

K – number of clusters for each region type

The initial cluster center values

Minimum tracking length for stable features

The proportion of unstable descriptors to reject, based on their covariance

Locality-Sensitive Hashing (LSH)

Divide the high-dimensional feature space into hypercubes using k randomly chosen axis-parallel hyperplanes

Each hypercube is a hash bucket

The probability that two nearby points are separated is reduced by independently choosing l different sets of hyperplanes

ε-Nearest Neighbor Search

• d(q, p) ≤ (1 + ε) d(q, P)

• d(q, p) is the distance between p and q in Euclidean space:

$$d(q, p) = \left( \sum_i \left( q^{(i)} - p^{(i)} \right)^2 \right)^{1/2}$$

• ε is the maximum allowed 'error'

• d(q, P) is the distance from q to the closest point in P

• Point p is the member of P that is retrieved (or not)

ε-Nearest Neighbor Search

Also called approximate nearest neighbor searching

Reports neighbors of the query point q whose distances may be greater than the true nearest-neighbor distance:

d(q, p) ≤ (1 + ε) d(q, P)
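The acceptance condition as a tiny checker (illustrative only):

```python
import math

def dist(u, v):
    # Euclidean distance, as defined above
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def is_valid_eps_nn(q, p, P, eps):
    """p is an acceptable epsilon-NN answer iff d(q, p) <= (1 + eps) * d(q, P),
    where d(q, P) is the distance from q to its true nearest neighbor in P."""
    return dist(q, p) <= (1 + eps) * min(dist(q, x) for x in P)
```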

ε-Nearest Neighbor Search

Goal:

The goal is not to get the exact answer, but a good approximate answer

There are many applications of nearest-neighbor search where an approximate answer is good enough

ε-Nearest Neighbor Search

What is currently out there?

Arya and Mount presented an algorithm:

Query time: O(exp(d) · ε^(-d) · log n)
Preprocessing: O(n log n)

Clarkson improved the dependence on ε:

• exp(d) · ε^(-(d-1)/2)

The cost still grows exponentially with d

ε-Nearest Neighbor Search

Striking observation:

The "brute force" algorithm provides a faster query time

It simply computes the distance from the query to every point in P

Analysis: O(dn)

Arya and Mount:

"… if the dimension is significantly larger than log n (as it is for a number of practical instances), there are no approaches we know of that are significantly faster than brute-force search"

High Dimensions

What is the problem?

Many applications of nearest-neighbor (NN) search have a high number of dimensions

Current algorithms do not perform much better than a brute-force linear search

Much work has been done on dimension reduction

Dimension Reduction

Principal Component Analysis:

Transforms a number of correlated variables into a smaller number of uncorrelated variables

Can anyone explain this further?

Latent Semantic Indexing:

Used within the document indexing process

Looks at the entire document to see which other documents contain some of the same words

Descriptor-based object matching – Complexity

Finding, for each object descriptor, the nearest descriptor in the model can be a costly operation:

$$\min_{m \in \{1,\dots,M\}} \mathrm{dist}(q_k, p_m)$$

Descriptor dimension ~1E2
1,000 object descriptors
1E6 descriptors per model
56 models

Brute-force nearest neighbor: ~1E12 distance computations
