Video Google: A Text Retrieval Approach to Object Matching in Videos

Josef Sivic and Andrew Zisserman
Robotics Research Group, University of Oxford
Presented by Xia Li
Motivation
- Objective: to retrieve objects or scenes in a movie with the ease, speed and accuracy with which Google retrieves web pages containing particular words.
Object query
[Figure: a user-specified query region and its close-up; retrieved key-frames from three different shots.]
Scene location query
[Figure: the query frame; retrieved key-frames from three different shots taken at the same location.]
Why is it difficult?
- Viewpoint change
- Must be efficient at run time
- Difficult real-world data
- Example: the same object (a clock) appears at various points in the movie ‘Run Lola Run’. Note the partial occlusion and substantial changes in scale and viewpoint.
Outline
- Viewpoint invariant content-based image retrieval
- Visual indexing using text retrieval methods
- Experimental evaluation of scene matching and object retrieval using visual words
- Conclusion
Problem statement
Retrieve key frames containing the same object
(Slide borrowed from Josef Sivic.)
Approach
- Affine invariant regions:
  - Regions are detected independently in each image and cover the same surface area in the scene
  - Descriptor vectors are extracted from the region’s appearance
- Match descriptors between frames using the invariant vectors
- Disambiguate matches using spatial consistency
Affine invariant regions
Local invariance requirement
- Geometric: 2D affine transformation
- Photometric: 1D affine transformation
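Written out explicitly (standard notation, added here for clarity rather than taken from the slides):

```latex
% Geometric: region detection must be covariant with a 2D affine map of
% the image coordinates (approximates viewpoint change of a planar patch)
\[
  \mathbf{x}' = A\,\mathbf{x} + \mathbf{t}, \qquad A \in \mathbb{R}^{2\times2},\ \det A \neq 0
\]
% Photometric: descriptors must be invariant to a 1D affine map of the
% intensities (gain a and offset b)
\[
  I'(\mathbf{x}') = a\,I(\mathbf{x}) + b, \qquad a > 0
\]
```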
Finding invariant regions
- Two types of ‘viewpoint covariant regions’ are computed for each frame:
  - SA – Shape Adapted
  - MS – Maximally Stable
SA – Shape Adapted
- Constructed by elliptical shape adaptation about an interest point
- The ellipse center, scale and shape around the interest point are determined iteratively
- Reference: Mikolajczyk & Schmid ECCV 2002; Schaffalitzky & Zisserman ECCV 2002
MS - Maximally Stable
- Constructed by selecting areas from an intensity watershed image segmentation
- The regions are those for which the area is approximately stationary as the intensity threshold is varied
- Reference: Matas et al. BMVC 2002
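As an illustrative sketch rather than the authors' code: OpenCV's MSER detector implements this kind of extremal-region extraction. The image path is a placeholder.

```python
import cv2

# Load one frame as grayscale; "frame.png" is a placeholder path.
gray = cv2.imread("frame.png", cv2.IMREAD_GRAYSCALE)

# MSER keeps regions whose area stays nearly constant while the
# intensity threshold is swept -- the "maximally stable" criterion.
mser = cv2.MSER_create()
regions, _ = mser.detectRegions(gray)

# Fit an ellipse to each region so it can be described like the SA regions.
ellipses = [cv2.fitEllipse(pts) for pts in regions if len(pts) >= 5]
print(f"{len(ellipses)} maximally stable regions")
```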
Why two types of regions?
- The two types provide complementary representations of a frame:
  - SA regions tend to be centered on corner-like features
  - MS regions correspond to blobs of high contrast with respect to their surroundings
Building the Descriptors
- SIFT – Scale Invariant Feature Transform
  - Each elliptical region is represented by a 128-dimensional vector [Lowe]
  - SIFT is invariant to a shift of a few pixels, which often occurs
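A minimal sketch of computing 128-dimensional SIFT descriptors with OpenCV. Note that this detects its own keypoints, whereas the paper computes SIFT on the affine-normalized elliptical regions; the image path is a placeholder.

```python
import cv2

gray = cv2.imread("frame.png", cv2.IMREAD_GRAYSCALE)  # placeholder path

# Detect keypoints and compute their 128-dimensional SIFT descriptors.
sift = cv2.SIFT_create()
keypoints, descriptors = sift.detectAndCompute(gray, None)

print(descriptors.shape)  # (number of regions, 128)
```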
Example
Two frames showing the same scene from different camera viewpoints (from the film ‘Run Lola Run’).
Example
The two frames with detected affine invariant regions superimposed: ‘Maximally Stable’ (MS) regions in yellow, ‘Shape Adapted’ (SA) regions in cyan.
Example
The final matched regions after indexing and the spatial consistency ranking algorithm.
Aggregating information from multiple frames
- Any region which does not survive for more than three frames is rejected
- Each region of the track can be regarded as an independent measurement of a common scene region
- The estimate of the descriptor for this scene region is computed by averaging the descriptors throughout the track
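A minimal sketch of this aggregation step, assuming region tracks have already been formed by frame-to-frame matching (the tracking itself is omitted); all names and sizes are illustrative:

```python
import numpy as np

def aggregate_tracks(tracks, min_frames=3):
    """Average SIFT descriptors along each region track.

    tracks: list of (track_length, 128) arrays, one descriptor per frame
    the region survives in. Short tracks are treated as unstable.
    """
    stable = [t for t in tracks if len(t) > min_frames]   # reject short tracks
    return np.array([t.mean(axis=0) for t in stable])     # one 128-D estimate each

# Toy usage: one stable track (5 frames) and one unstable track (2 frames).
rng = np.random.default_rng(0)
print(aggregate_tracks([rng.normal(size=(5, 128)),
                        rng.normal(size=(2, 128))]).shape)  # (1, 128)
```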
Lessons from text retrieval – Google
- Words & documents
- Vocabulary
- Weighting
- Inverted file
- Ranking
Words & Documents
- Documents are parsed into words
- Common words are ignored (the, an, etc.)
  - This is called a ‘stop list’
- Words are represented by their stems
  - ‘walk’, ‘walking’, ‘walks’ → ‘walk’
- Each word is assigned a unique identifier
- A document is represented by a vector
  - with components given by the frequency of occurrence of the words it contains
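A toy sketch of these steps in Python; the stop list and the stemming map are tiny hand-written stand-ins for real ones:

```python
from collections import Counter

STOP_LIST = {"the", "a", "an", "is", "are", "and"}              # toy stop list
STEMS = {"walking": "walk", "walks": "walk", "walked": "walk"}  # toy stemmer

def doc_vector(text):
    words = [w.strip(".,").lower() for w in text.split()]
    words = [STEMS.get(w, w) for w in words if w not in STOP_LIST]
    return Counter(words)  # component = frequency of occurrence of each word

print(doc_vector("The dog walks and the dog is walking"))
# Counter({'dog': 2, 'walk': 2})
```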
Example
Sentence: “Representation, detection and learning are the main issues that need to be tackled in designing a visual system for recognizing object categories.”

After stop-listing and stemming: represent, detect, learn, main, issue, tackle, design, visual, system, recognize, category, …
Text retrieval techniques
- Weighting: tf-idf, ‘Term Frequency – Inverse Document Frequency’

  \[ t_i = \frac{n_{id}}{n_d} \, \log \frac{N}{n_i} \]

  where n_id is the number of occurrences of word i in document d, n_d is the total number of words in document d, n_i is the number of occurrences of word i in the whole database, and N is the number of documents in the database.

- Inverted file
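A compact sketch of both ideas on toy documents, with n_id, n_d, n_i and N as defined above (the data and names are illustrative):

```python
import math
from collections import Counter, defaultdict

# Toy database of three "documents" (bags of words).
docs = {0: ["red", "clock", "tower"], 1: ["clock", "face"], 2: ["red", "car"]}
N = len(docs)                                          # documents in the database
n_i = Counter(w for ws in docs.values() for w in ws)   # word counts, whole database

# Inverted file: word -> documents containing it.
inverted = defaultdict(set)
for d, ws in docs.items():
    for w in ws:
        inverted[w].add(d)

def tfidf(d):
    counts = Counter(docs[d])                   # n_id for each word i
    n_d = sum(counts.values())                  # total words in document d
    return {w: (c / n_d) * math.log(N / n_i[w]) for w, c in counts.items()}

print(tfidf(0))
print(inverted["clock"])   # only these documents need scoring for "clock"
```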
Our case
- The query vector is given by the visual words contained in a user-specified sub-part of a frame
- The other frames are ranked according to the similarity of their weighted vectors to this query vector
Building the “Visual Stems”
- Cluster the descriptors into K groups using the K-means clustering algorithm
- Each cluster represents a “visual word” in the “visual vocabulary”
- Result:
  - 10K SA clusters
  - 16K MS clusters
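A sketch of vocabulary building with scikit-learn's plain Euclidean K-means as a stand-in for the authors' clustering; the array sizes are toy stand-ins (the real vocabulary clusters ~200k descriptors into 10K/16K groups):

```python
import numpy as np
from sklearn.cluster import KMeans

# Stand-in for the track-averaged 128-D descriptors.
rng = np.random.default_rng(0)
descriptors = rng.normal(size=(5000, 128)).astype(np.float32)

# Cluster into K groups; each cluster center is one "visual word".
kmeans = KMeans(n_clusters=100, n_init=4, random_state=0).fit(descriptors)

# Quantizing a descriptor to its nearest center gives its visual word.
visual_words = kmeans.predict(descriptors)
print(visual_words[:10])
```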
Implementation
- Reject unstable regions
- A subset of 48 shots is selected
- A distance function is defined for comparing descriptor vectors
MS and SA “Visual Words”
Visual “Stop List”
- The most frequent visual words, those that occur in almost all images, are suppressed

[Figure: matched regions before (top) and after (bottom) applying the stop list.]
Visual “Stop List”
Frequency of MS visual words among all 3768 keyframes of ‘Run Lola Run’, (a) before and (b) after application of a stop list.
Spatial consistency
- Require that neighboring matches have the same spatial layout in the query region and the retrieved frame
- A search area is defined by the 15 nearest neighbors of each match
- Each region which also matches within this area casts a vote for that frame
- The total number of votes determines the rank of the frame
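One plausible reading of this voting scheme as code, assuming putative matches are given as paired 2-D positions; the 15-nearest-neighbor search area follows the slide, the rest is an assumption:

```python
import numpy as np

def consistency_votes(query_pts, frame_pts, k=15):
    """Votes for one candidate frame from matched region positions.

    Row i of query_pts / frame_pts is one putative match, giving the
    region's 2-D position in the query region and in the frame.
    """
    votes = 0
    for i in range(len(query_pts)):
        # k nearest neighbors of match i, found in each image separately.
        nn_q = set(np.argsort(np.linalg.norm(query_pts - query_pts[i], axis=1))[1:k + 1])
        nn_f = set(np.argsort(np.linalg.norm(frame_pts - frame_pts[i], axis=1))[1:k + 1])
        votes += len(nn_q & nn_f)  # matches that are neighbors in both images
    return votes

rng = np.random.default_rng(1)
q = rng.uniform(size=(20, 2))
f = q + rng.normal(scale=0.01, size=(20, 2))   # consistent layout -> many votes
print(consistency_votes(q, f))
```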
Ranking Frames
- Distance between weighted vectors (as between word/document vectors in text retrieval)
- Spatial consistency (analogous to word order in text)
Visual Google process
Vocabulary building
1. Select a subset of 48 shots (10k frames = 10% of the movie)
2. Construct regions (SA + MS): ~1600 regions per frame, ~1.6E6 regions in total
3. Represent each region by a SIFT descriptor
4. Track regions through the frames: 1.6E6 → ~200k regions
5. Reject unstable regions
6. Cluster the descriptors using the K-means algorithm
(Parameter tuning is done with the ground-truth set.)
Retrieval at query time
1. Query object: generate the query descriptors
2. Build the query vector by nearest-neighbor assignment of the descriptors to the visual vocabulary
3. Use the inverted index to find relevant frames (document vectors are sparse, so this gives a small candidate set)
4. Calculate the distance from the query vector to each relevant frame
5. Rank the results
(About 0.1 seconds with a Matlab implementation.)
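A self-contained sketch of steps 2–5 on toy data: tf-idf weighted frame vectors, an inverted file, and a normalized scalar product (cosine similarity) for ranking. All names and numbers here are illustrative, not from the paper:

```python
import numpy as np

# Toy database: tf-idf weighted vectors over a 6-word visual vocabulary.
frames = {
    "frame_001": np.array([0.5, 0.0, 0.2, 0.0, 0.0, 0.3]),
    "frame_002": np.array([0.0, 0.4, 0.0, 0.0, 0.6, 0.0]),
    "frame_003": np.array([0.4, 0.0, 0.1, 0.0, 0.0, 0.5]),
}
# Inverted file: visual word -> frames with a non-zero weight for it.
inverted = {w: [f for f, v in frames.items() if v[w] > 0] for w in range(6)}

def retrieve(q):
    # Sparse vectors: only frames sharing a visual word with the query matter.
    candidates = {f for w in np.nonzero(q)[0] for f in inverted[w]}
    def sim(a, b):  # normalized scalar product (cosine similarity)
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
    return sorted(((sim(q, frames[f]), f) for f in candidates), reverse=True)

print(retrieve(np.array([0.6, 0.0, 0.0, 0.0, 0.0, 0.4])))
```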
Experimental evaluation of scene matching using visual words

(a) Each row shows a frame from three different shots of the same location in the ground truth data set.
Experimental evaluation of scene matching using visual words

(b) Average normalized rank for location matching on the ground truth set. (c) Average precision-recall curve for location matching on the ground truth set.
Example: Run Lola Run
References
- D. Lowe. Object recognition from local scale-invariant features. In Proc. ICCV, pages 1150-1157, 1999.
- J. Matas, O. Chum, M. Urban, and T. Pajdla. Robust wide baseline stereo from maximally stable extremal regions. In Proc. BMVC, pages 384-393, 2002.
- K. Mikolajczyk and C. Schmid. An affine invariant interest point detector. In Proc. ECCV. Springer-Verlag, 2002.
- F. Schaffalitzky and A. Zisserman. Multi-view matching for unordered image sets, or “How do I organize my holiday snaps?”. In Proc. ECCV, volume 1, pages 414-431. Springer-Verlag, 2002.
- F. Schaffalitzky and A. Zisserman. Automated scene matching in movies. In Proc. CIVR 2002, LNCS 2383, pages 186-197. Springer-Verlag, 2002.
- J. Sivic and A. Zisserman. Video Google: A Text Retrieval Approach to Object Matching in Videos. In Proc. ICCV, 2003.