Video Google: A Text Retrieval
Approach to Object Matching in
Videos
Josef Sivic and Andrew Zisserman
Robotics Research Group, Department of
Engineering Science
University of Oxford, United Kingdom
Goal
• To retrieve the key frames and shots of a video that contain a particular object
• With the ease, speed, and accuracy with which Google retrieves text documents containing particular words
Outline
• Introduction
– Object query
– Scene query
• Challenging problem
• Text retrieval overview
• Viewpoint invariant description
– Building the Descriptors
– Building the Visual Words
– The Visual Analogy
• Visual indexing using text retrieval methods
• Experimental evaluation of scene matching using visual words
• Object retrieval
– Stop list
– Spatial Consistency
• Summary and conclusions
• Video Google Demo
Introduction - Object query (1/2)
Introduction - Scene query (2/2)
Challenging problem (1/2)
• Changes in viewpoint, illumination, and partial occlusion
• Large amounts of data
• Real-world data
Challenging problem (2/2)
Text retrieval overview (1/2)
• The documents are parsed into words.
• Words are represented by their stems
– ‘walk’, ‘walking’, ‘walks’ -> ‘walk’
• A stop list filters out very common words (‘the’, ‘an’, …)
• The remaining words are represented as a vector of weighted word frequencies (sketch below)
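As a rough illustration (not from the paper), a minimal Python sketch of this pipeline, with a naive suffix-stripper standing in for a real stemmer such as Porter's:

```python
from collections import Counter

STOP_WORDS = {"the", "an", "a", "of", "and", "to", "in"}  # tiny illustrative stop list

def naive_stem(word):
    # Crude suffix stripping; a real system would use e.g. Porter stemming.
    for suffix in ("ing", "s", "ed"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def doc_to_term_counts(text):
    # Parse into words, stem, drop stop words, count frequencies.
    words = (naive_stem(w) for w in text.lower().split())
    return Counter(w for w in words if w not in STOP_WORDS)

print(doc_to_term_counts("The dog walks and the dogs walked"))
# Counter({'walk': 2, 'dog': 2}) (ordering of the printout may vary)
```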
Text retrieval overview (2/2)
• An inverted file is used for efficient retrieval (sketch below)
– An inverted file is structured like an ideal book index: each word maps to the documents that contain it
• A query is answered by computing its vector of word frequencies and returning the documents with the closest vectors
• The returned documents are ranked by this similarity
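A minimal sketch of an inverted file, assuming documents are already reduced to stemmed, stop-listed words:

```python
from collections import defaultdict

# Toy corpus: document id -> list of (already stemmed, stop-listed) words.
docs = {
    0: ["dog", "walk", "park"],
    1: ["cat", "sleep"],
    2: ["dog", "park", "ball"],
}

# Inverted file: word -> set of documents containing it.
inverted = defaultdict(set)
for doc_id, words in docs.items():
    for w in words:
        inverted[w].add(doc_id)

def candidates(query_words):
    # Only documents sharing at least one word with the query are touched,
    # which is what makes retrieval fast on large corpora.
    hits = set()
    for w in query_words:
        hits |= inverted.get(w, set())
    return hits

print(candidates(["dog", "ball"]))  # {0, 2}
```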
Viewpoint invariant description (1/2)
• Two types of viewpoint covariant regions are computed for each frame:
1. SA (Shape Adapted) – centered on corner-like features
2. MS (Maximally Stable) – blobs of high contrast with respect to their surroundings (detector sketch below)
• Regions are computed in grayscale
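As a hedged illustration: OpenCV's MSER detector computes regions of the Maximally Stable kind; the paper's SA regions come from an affine-adapted interest-point detector that OpenCV does not expose directly. The file name is a placeholder.

```python
import cv2

# Load a frame and convert to grayscale; regions are computed on intensity only.
frame = cv2.imread("frame.png")
gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)

# MSER finds blobs that stay stable over a range of intensity thresholds,
# corresponding closely to the MS regions described above.
mser = cv2.MSER_create()
regions, bboxes = mser.detectRegions(gray)
print(f"detected {len(regions)} maximally stable regions")
```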
Viewpoint invariant description (2/2)
The MS regions are in yellow. The SA regions are in cyan.
Building the Descriptors (1/2)
• SIFT – Scale Invariant Feature Transform (sketch below)
– Each elliptical region is represented by a 128-dimensional SIFT vector
– The descriptor is robust to region localization errors of a few pixels
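For illustration only, stock SIFT in OpenCV (which detects its own keypoints); the paper instead computes SIFT descriptors on the affine covariant regions from the previous step:

```python
import cv2

gray = cv2.imread("frame.png", cv2.IMREAD_GRAYSCALE)  # placeholder file name

# Stock SIFT: detects its own keypoints and describes each with a 128-D vector.
sift = cv2.SIFT_create()
keypoints, descriptors = sift.detectAndCompute(gray, None)
print(descriptors.shape)  # (number_of_keypoints, 128)
```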
Building the Descriptors (2/2)
• Removing noise – tracking & averaging (sketch below)
– Regions are tracked across a sequence of frames using a constant-velocity dynamical model
– Any region that does not survive for more than three frames is rejected
– The descriptors of each surviving region are averaged throughout its track
– Tracks whose descriptors have a large covariance are rejected
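A minimal sketch of the filtering and averaging steps, assuming tracking has already produced one stack of descriptors per region track; the variance threshold is an illustrative placeholder, not the paper's:

```python
import numpy as np

def filter_and_average_tracks(tracks, min_length=3, max_var=0.05):
    """tracks: list of (n_frames, 128) arrays, one per tracked region."""
    kept = []
    for track in tracks:
        # Reject unstable regions that do not survive for more than three frames.
        if track.shape[0] <= min_length:
            continue
        # Reject tracks whose descriptor varies too much along the track
        # (mean per-dimension variance as a simple stand-in for the covariance test).
        if track.var(axis=0).mean() > max_var:
            continue
        # Average the descriptors throughout the track to suppress noise.
        kept.append(track.mean(axis=0))
    return np.array(kept)

rng = np.random.default_rng(0)
tracks = [rng.normal(0, 0.01, size=(n, 128)) for n in (2, 5, 10)]
print(filter_and_average_tracks(tracks).shape)  # (2, 128): the 2-frame track is dropped
```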
Building the Visual Words (1/2)
• Descriptors are clustered into K groups using the K-means clustering algorithm (sketch below)
• Each cluster represents a “visual word” in the “visual vocabulary”
• MS and SA regions are clustered separately
– giving two different vocabularies for describing the same scene
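A hedged sketch using scikit-learn's K-means; the vocabulary size here is an arbitrary placeholder (the paper clusters SA and MS descriptors separately, into thousands of words each):

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
descriptors = rng.random((5000, 128)).astype(np.float32)  # stand-in for real SIFT descriptors

# Each of the K cluster centroids becomes one "visual word" in the vocabulary.
k = 1000
vocab = KMeans(n_clusters=k, n_init=1, random_state=0).fit(descriptors)

# Quantization: any new descriptor is assigned the id of its nearest centroid.
new_descriptors = rng.random((10, 128)).astype(np.float32)
word_ids = vocab.predict(new_descriptors)
print(word_ids)  # 10 visual-word indices in [0, k)
```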
Building the Visual Words (2/2)
[Figure: example regions belonging to SA and MS visual-word clusters]
The Visual Analogy
Text       | Visual
-----------|-----------------------
Word       | Descriptor
Stem       | Centroid (visual word)
Document   | Frame
Corpus     | Film
Visual indexing using text retrieval methods (1/2)
• tf-idf – ‘Term Frequency – Inverse Document Frequency’ weighting (sketch below)
• With a vocabulary of k visual words, each document (frame) is represented by a k-vector of weighted word frequencies
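In the standard tf-idf weighting, component i of the vector for document d is t_i = (n_id / n_d) · log(N / n_i), where n_id is the number of occurrences of word i in document d, n_d the total number of words in d, n_i the number of documents containing word i, and N the total number of documents. A minimal numpy sketch:

```python
import numpy as np

# counts[d, i] = number of times visual word i occurs in frame d (toy values).
counts = np.array([[2, 0, 1],
                   [0, 3, 1],
                   [1, 1, 1]], dtype=float)

N = counts.shape[0]                               # number of documents (frames)
tf = counts / counts.sum(axis=1, keepdims=True)   # n_id / n_d
n_i = (counts > 0).sum(axis=0)                    # documents containing word i
idf = np.log(N / n_i)                             # down-weights words common to many frames
tfidf = tf * idf
print(tfidf.round(3))  # word 2 occurs in every frame, so its idf (and weight) is 0
```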
Visual indexing using text retrieval methods (2/2)
• The query vector is given by the visual words contained in a user-specified sub-part of a frame
• All other frames are ranked according to the similarity of their weighted vectors to this query vector (sketch below)
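Continuing the sketch above, frames can be ranked by the normalized scalar product (cosine similarity) between the query vector and each frame's weighted vector; all numbers below are toy values:

```python
import numpy as np

def rank_frames(query_vec, tfidf):
    # Normalized scalar product (cosine similarity) between query and every frame.
    q = query_vec / np.linalg.norm(query_vec)
    frames = tfidf / np.linalg.norm(tfidf, axis=1, keepdims=True)
    scores = frames @ q
    return np.argsort(-scores)  # frame indices, best match first

tfidf = np.array([[0.2, 0.0, 0.4],
                  [0.0, 0.3, 0.1],
                  [0.1, 0.1, 0.1]])
query = np.array([0.3, 0.0, 0.5])  # built from words inside the user-outlined region
print(rank_frames(query, tfidf))   # [0 2 1]
```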
Experimental evaluation of scene matching using visual words (1/5)
• Goal
– Evaluate the method by matching scene locations within a closed world of shots (the ‘ground truth set’)
• Ground truth set
– 164 frames, from 48 shots, taken at 19 3D locations in the movie ‘Run Lola Run’ (4-9 frames per location)
– There are significant viewpoint changes among the frames of the same location
Experimental evaluation of scene matching using visual words (2/5)
Experimental evaluation of scene matching using visual words (3/5)
• The entire frame is used as the query region
• Performance is measured over all 164 frames
• The correct results were determined by hand
• Rank calculation: results are summarized by the average normalized rank of the relevant frames (formula and sketch below)
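The average normalized rank is R̃ = (Σ_i R_i − N_rel(N_rel+1)/2) / (N · N_rel), where N is the number of frames, N_rel the number of frames relevant to the query, and R_i the rank of the i-th relevant frame: R̃ = 0 when all relevant frames are returned first, and roughly 0.5 for a random ordering. A small sketch:

```python
def average_normalized_rank(relevant_ranks, n_frames):
    """relevant_ranks: 1-based ranks at which the relevant frames were returned."""
    n_rel = len(relevant_ranks)
    # 0.0 = all relevant frames ranked first; ~0.5 = random ordering.
    return (sum(relevant_ranks) - n_rel * (n_rel + 1) / 2) / (n_frames * n_rel)

print(average_normalized_rank([1, 2, 3], 164))   # 0.0: perfect retrieval
print(average_normalized_rank([1, 5, 20], 164))  # ~0.041: small but nonzero
```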
Experimental evaluation of scene matching using visual words (4/5)
Experimental evaluation of scene matching using visual words (5/5)
Object retrieval (1/7)
• Goal
– Searching for objects throughout the entire movie
– The object of interest is specified by the user as a sub-part of any frame
Object retrieval – Stop list (2/7)
• The most frequent and the rarest visual words are stopped, to reduce mismatches and the size of the inverted file while keeping a sufficiently rich visual vocabulary (sketch below).
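A hedged sketch of stopping visual words by corpus frequency; the cut-off fractions are illustrative placeholders, not the paper's exact values:

```python
import numpy as np

# occurrences[i] = total number of occurrences of visual word i across the corpus.
rng = np.random.default_rng(0)
occurrences = rng.zipf(1.5, size=10_000)  # heavy-tailed, like real word frequencies

# Stop the most frequent and the rarest words (fractions here are illustrative).
order = np.argsort(occurrences)           # word ids, rarest first
n = len(order)
stopped = set(order[: n // 10]) | set(order[-n // 20 :])  # bottom 10%, top 5%
print(f"stopped {len(stopped)} of {n} visual words")
```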
Object retrieval – Spatial consistency (3/7)
• When querying with a sub-part of an image, the matched covariant regions in retrieved frames should have a spatial arrangement similar to that of the regions in the outlined query region
• Retrieved frames are re-ranked accordingly: matches whose spatial neighbours also match cast votes for the frame (sketch below)
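A simplified sketch of neighbourhood voting, assuming each match is a pair of 2-D region positions (one in the query, one in the retrieved frame); the neighbourhood size is an illustrative choice:

```python
import numpy as np

def spatial_consistency_score(query_pts, frame_pts, k=5):
    """query_pts, frame_pts: (n, 2) arrays; row i is one matched region pair.

    Each match votes once for every other match that lies among its k nearest
    spatial neighbours in *both* the query region and the retrieved frame."""
    n = len(query_pts)
    k = min(k, n - 1)
    votes = 0
    for i in range(n):
        dq = np.linalg.norm(query_pts - query_pts[i], axis=1)
        df = np.linalg.norm(frame_pts - frame_pts[i], axis=1)
        nq = set(np.argsort(dq)[1 : k + 1])  # skip self at distance 0
        nf = set(np.argsort(df)[1 : k + 1])
        votes += len(nq & nf)
    return votes  # frames are re-ranked by this total

rng = np.random.default_rng(0)
pts = rng.random((20, 2))
print(spatial_consistency_score(pts, pts * 2 + 1))          # consistent layout: high score
print(spatial_consistency_score(pts, rng.random((20, 2))))  # scrambled layout: low score
```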
Object retrieval (4/7)
Object retrieval (5/7)
Object retrieval (6/7)
Object retrieval (7/7)
Summary and conclusions
• The visual word / visual vocabulary analogy with text retrieval
• Immediate run-time object retrieval
• Future work
– Automatic ways of building the vocabulary are needed
• Intriguing possibilities
– Latent semantic indexing to find content
– Automatic clustering to find the principal objects that occur throughout the movie
Video Google Demo
• http://www.robots.ox.ac.uk/~vgg/research/vgoogle/