Video Google: A Text Retrieval Approach to Object Matching in Videos

Josef Sivic and Andrew Zisserman

Robotics Research Group, Department of Engineering Science

University of Oxford, United Kingdom

Goal

• To retrieve those key frames and shots of a video that contain a particular object.

• With the ease, speed, and accuracy with which Google retrieves text documents containing a particular word.

Outline

• Introduction

– Object query

– Scene query

• Challenging problem

• Text retrieval overview

• Viewpoint invariant description

– Building the Descriptors

– Building the Visual Word

– The Visual Analogy

• Visual indexing using text retrieval methods

• Experimental evaluation of scene matching using visual words

• Object retrieval

– Stop list

– Spatial Consistency

• Summary and conclusions

• Video Google Demo

Introduction – Object query (1/2)

Introduction – Scene query (2/2)

Challenging problem (1/2)

• Changes in viewpoint, illumination and partial occlusion

• Large amounts of data

• Real-world data

Challenging problem (2/2)

Text retrieval overview (1/2)

• The documents are parsed into words.

• Words are represented by their stems

– ‘walk’, ‘walking’, ‘walks’ -> ‘walk’

• A stop list filters out common words (‘the’, ‘an’, …)

• The remaining words are represented as a vector of weighted word frequencies (see the sketch below)
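
As a rough illustration, here is a minimal sketch of this preprocessing pipeline. The stop list and the suffix-stripping "stemmer" are toy stand-ins; a real system would use a full stop list and, e.g., the Porter stemmer.

```python
# Toy sketch of the text preprocessing pipeline: parse, stem, stop-filter,
# then count word frequencies.
from collections import Counter

STOP_WORDS = {"the", "an", "a", "of", "and", "in", "to"}

def stem(word):
    # Strip a few common English suffixes (toy stemmer).
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def document_vector(text):
    """Parse a document into a stemmed, stop-filtered frequency vector."""
    words = [w.lower().strip(".,!?") for w in text.split()]
    return Counter(stem(w) for w in words if w and w not in STOP_WORDS)

print(document_vector("The dog walks and the dogs walked"))
# Counter({'dog': 2, 'walk': 2})
```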

8

Text retrieval overview (2/2)

• An inverted file is used to facilitate efficient retrieval.

– An inverted file is structured like an ideal book index.

• Text is retrieved by computing the query’s vector of word frequencies and returning the documents with the closest vectors

• Rank the returned documents
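
A minimal sketch of an inverted file, assuming documents have already been reduced to term counts (names and structure here are illustrative):

```python
# A minimal inverted file: for each word, the documents (with counts) in
# which it occurs. A query then only touches documents sharing at least
# one word with it, instead of scanning the whole corpus.
from collections import defaultdict

index = defaultdict(dict)                 # word -> {doc_id: count}

def add_document(doc_id, term_counts):
    for word, count in term_counts.items():
        index[word][doc_id] = count

def candidate_documents(query_terms):
    """Union of the posting lists of the query's words."""
    docs = set()
    for word in query_terms:
        docs.update(index.get(word, {}))  # adds the doc_ids (dict keys)
    return docs
```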

Viewpoint invariant description (1/2)

• Two types of viewpoint covariant regions are computed for each frame:

1. SA (Shape Adapted) – corner-like features

2. MS (Maximally Stable) – blobs of high contrast with respect to their surroundings

• Regions are computed in grayscale

Viewpoint invariant description (2/2)

The MS regions are in yellow. The SA regions are in cyan.

Building the Descriptors (1/2)

• SIFT – Scale Invariant Feature Transform

– Each elliptical region is represented by a 128-dimensional vector

– SIFT is invariant to a shift of the region by a few pixels
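
For illustration only, SIFT descriptors can be computed with OpenCV; note that OpenCV's stock SIFT uses its own (difference-of-Gaussian) keypoint detector rather than the paper's SA/MS affine-covariant regions, and the file name below is hypothetical.

```python
# Compute 128-D SIFT descriptors on a grayscale frame with OpenCV.
import cv2

img = cv2.imread("frame.png", cv2.IMREAD_GRAYSCALE)  # hypothetical frame; grayscale as in the paper
sift = cv2.SIFT_create()
keypoints, descriptors = sift.detectAndCompute(img, None)
print(descriptors.shape)  # (number_of_regions, 128)
```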

Building the Descriptors (2/2)

• Removing noise – tracking & averaging

– Regions are tracked across a sequence of frames using a constant-velocity dynamical model

– Any region which does not survive for more than three frames is rejected

– The descriptors are averaged throughout the track

– Tracks whose descriptors have a large covariance are rejected as unstable
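
A sketch of this rejection-and-averaging step, assuming each track is a list of 128-D descriptors for one region over consecutive frames; the variance threshold is an illustrative stand-in for the covariance test.

```python
# Reject short or unstable tracks; average the descriptors of the rest.
import numpy as np

MIN_TRACK_LENGTH = 3     # reject regions not surviving more than 3 frames
MAX_VARIANCE = 0.05      # illustrative stability threshold

def stable_descriptor(track):
    """Return the averaged descriptor of a stable track, else None."""
    if len(track) <= MIN_TRACK_LENGTH:
        return None                        # too short: likely noise
    track = np.asarray(track)              # shape (n_frames, 128)
    if track.var(axis=0).mean() > MAX_VARIANCE:
        return None                        # descriptor drifts: unstable
    return track.mean(axis=0)              # average over the track
```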

Building the Visual Word (1/2)

• Descriptors are clustered into K groups using the K-means clustering algorithm

• Each cluster represents a “visual word” in the “visual vocabulary”

• MS and SA regions are clustered separately

– giving different vocabularies for describing the same scene.
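
A minimal vocabulary-building sketch using scikit-learn's K-means; the vocabulary size and the input file are assumptions (the paper clusters SA and MS descriptors into separate vocabularies of several thousand words each).

```python
# Build a visual vocabulary by K-means clustering of region descriptors.
# MiniBatchKMeans scales better to large descriptor sets.
import numpy as np
from sklearn.cluster import MiniBatchKMeans

K = 6000                                       # illustrative vocabulary size
descriptors = np.load("sa_descriptors.npy")    # hypothetical (n, 128) array

kmeans = MiniBatchKMeans(n_clusters=K, random_state=0).fit(descriptors)
# Each cluster centre is one "visual word"; a descriptor is quantized to
# its nearest centre:
words = kmeans.predict(descriptors)            # visual-word index per region
```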

Building the Visual Word (2/2)

(Figure: example clusters of SA regions and of MS regions, each cluster corresponding to a single visual word.)

The Visual Analogy

Text        Visual
Word        Descriptor
Stem        Centroid
Document    Frame
Corpus      Film

Visual indexing using text retrieval methods (1/2)

• tf-idf – ‘Term Frequency – Inverse Document Frequency’ weighting

• With a vocabulary of k words, each document is represented by a k-vector of weighted word frequencies
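
The weighting used in the paper is the standard tf-idf. In this notation, component i of the vector for frame d is

    t_i = (n_id / n_d) * log(N / n_i)

where n_id is the number of occurrences of visual word i in frame d, n_d is the total number of visual words in frame d, n_i measures how often word i occurs across the whole database, and N is the number of frames in the database. The first factor weights words occurring often in a particular document, while the second down-weights words that appear often everywhere.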

Visual indexing using text retrieval methods (2/2)

• The query vector is given by the visual words contained in a user-specified sub-part of a frame

• The other frames are then ranked according to the similarity of their weighted vectors to this query vector.
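
A sketch of the ranking step, assuming the query and all frames are already encoded as tf-idf k-vectors (names here are illustrative); frames are ranked by the normalized scalar product (cosine similarity):

```python
# Rank frames by cosine similarity of tf-idf vectors to the query.
import numpy as np

def rank_frames(query_vec, frame_vecs):
    """Return frame indices ordered by similarity to the query, best first."""
    q = query_vec / np.linalg.norm(query_vec)
    F = frame_vecs / np.linalg.norm(frame_vecs, axis=1, keepdims=True)
    return np.argsort(-(F @ q))   # descending cosine similarity
```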

Experimental evaluation of scene matching using visual words (1/5)

• Goal

– Evaluate the method by matching scene locations within a closed world of shots (the ‘ground truth set’)

• Ground truth set

– 164 frames, from 48 shots, were taken at 19 3D locations in the movie ‘Run Lola Run’ (4–9 frames from each location)

– There are significant viewpoint changes in the frames for the same location

Experimental evaluation of scene matching using visual words (2/5)

Experimental evaluation of scene matching using visual words (3/5)

• The entire frame is used as a query region

• The performance is measured over all 164 frames

• The correct results were determined by hand

• Rank calculation: retrieval performance is scored by the ranks of the relevant frames (see below)
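
The paper scores each query with the average normalized rank of the relevant frames:

    Rank_norm = (1 / (N * N_rel)) * (sum_{i=1..N_rel} R_i - N_rel * (N_rel + 1) / 2)

where N is the number of frames in the ground truth set, N_rel is the number of frames relevant to the query, and R_i is the rank of the i-th relevant frame. The score is 0 when all relevant frames are returned first, and about 0.5 for random retrieval.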

Experimental evaluation of scene matching using visual words (4/5)

Experimental evaluation of scene matching using visual words (5/5)

Object retrieval (1/7)

• Goal

– Searching for objects throughout the entire movie

– The object of interest is specified by the user as a sub-part of any frame

Object retrieval – Stop list (2/7)

• A stop list on visual words reduces the number of mismatches and the size of the inverted file while keeping a sufficiently rich visual vocabulary.
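
A sketch of building such a stop list from visual-word frequencies: the most frequent (and rarest) visual words are dropped before indexing. The cut-off fractions below are illustrative, not the paper's exact values.

```python
# Build a visual stop list from per-word occurrence counts.
import numpy as np

def build_stop_list(word_counts, top_frac=0.05, bottom_frac=0.10):
    """word_counts: occurrences of each visual word over the whole film."""
    order = np.argsort(-np.asarray(word_counts))      # most frequent first
    k = len(order)
    stopped = set(order[: int(top_frac * k)])          # too common
    stopped |= set(order[k - int(bottom_frac * k):])   # too rare
    return stopped
```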

Object retrieval – Spatial Consistency (3/7)

• When querying by a sub-part of an image, matched covariant regions in the retrieved frames should have a spatial arrangement similar to that of the outlined region in the query image.
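
A rough sketch of one way to score spatial consistency, simplified from the paper's scheme (which casts votes using the 15 nearest neighbours of each match); the neighbourhood-overlap formulation here is an approximation.

```python
# Score a retrieved frame by how well the spatial layout of its matched
# regions agrees with the layout in the query region. query_xy and
# frame_xy are (n, 2) arrays of positions for the n matched regions.
import numpy as np

def spatial_score(query_xy, frame_xy, k=15):
    """Count, over all matches, neighbours that agree in both images."""
    score = 0
    for i in range(len(query_xy)):
        dq = np.linalg.norm(query_xy - query_xy[i], axis=1)
        df = np.linalg.norm(frame_xy - frame_xy[i], axis=1)
        q_nn = set(np.argsort(dq)[1 : k + 1])  # neighbours in the query
        f_nn = set(np.argsort(df)[1 : k + 1])  # neighbours in the frame
        score += len(q_nn & f_nn)              # consistent neighbours vote
    return score
```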

Object retrieval (4/7)

Object retrieval (5/7)

Object retrieval (6/7)

Object retrieval (7/7)

Summary and conclusions

• The visual word and visual vocabulary analogy with text retrieval

• Immediate run-time object retrieval

• Future work

– Automatic ways for building the vocabulary are needed

• Intriguing possibility

– Latent semantic indexing to find content

– Automatic clustering to find the principal objects that occur throughout the movie.

Video Google Demo

• http://www.robots.ox.ac.uk/~vgg/research/vgoogle/
