Efficient Visual Search for Objects in Videos
JOSEF SIVIC AND ANDREW ZISSERMAN
PRESENTERS: ILGE AKKAYA & JEANNETTE CHANG
MARCH 1, 2011
Introduction
Generalize text retrieval methods to non-textual information.
[Diagram: a text query returns documents as results; an image query returns video frames as results.]
State-of-the-Art before this paper…
 Text-based search for images (Google Images)
 Object recognition
    Barnard, et al. (2003): “Matching words and pictures”
    Sivic, et al. (2005): “Discovering objects and their location in images”
    Sudderth, et al. (2005): “Learning hierarchical models of scenes, objects, and parts”
 Scene classification
    Fei-Fei and Perona (2005): “A Bayesian hierarchical model for learning natural scene categories”
    Quelhas, et al. (2005): “Modeling scenes with local descriptors and latent aspects”
    Lazebnik, et al. (2006): “Beyond bags of features: Spatial pyramid matching for recognizing natural scene categories”
Introduction (cont.)
 Retrieve specific objects vs. categories of objects/scenes (the “Camry” logo vs. cars in general)
 Employ text retrieval techniques for visual search, with images as both queries and results
 Why a text retrieval approach?
    Matches are essentially precomputed, so there is no delay at run time
    Any object in the video can be retrieved without modifying the descriptors originally built for the video
Overview of the Talk
 Visual Search Algorithm
 Offline Pre-Processing
 Real-Time Query
 A Few Implementation Details
 Performance
 General Results
 Testing Individual Words
 Using External Images As Queries
 A Few Challenges and Future Directions
 Concluding Remarks
 Demo of the Algorithm
Pre-Processing (Offline)
1. For each frame, detect affine covariant regions.
2. Track the regions through the video and reject unstable regions.
3. Build the visual vocabulary.
4. Remove stop-listed visual words.
5. Compute tf-idf weighted document frequency vectors.
6. Build the inverted file-indexing structure.
Detection of Affine Covariant Regions
 Typically ~1200 regions per frame (720x576)
 Elliptical regions
 Each region is represented by a 128-dimensional SIFT vector
 SIFT features provide invariance against affine transformations
Two types of affine covariant regions (a detection sketch follows below):
1. Shape-Adapted (SA)
    Mikolajczyk et al.
    Elliptical shape adaptation about a Harris interest point
    Often centered on corner-like features
2. Maximally Stable (MS)
    Proposed by Matas et al.
    Intensity watershed image segmentation
    High-contrast blobs
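As a rough illustration of the MS half of this step, the sketch below detects MSER regions and computes 128-dimensional SIFT descriptors for them, assuming OpenCV is available; the file name is a placeholder, and placing the keypoint at the region centroid is a crude stand-in for fitting an ellipse to each region.

```python
# A minimal sketch of the region-description step, assuming OpenCV (cv2).
# MSER approximates the Maximally Stable (MS) regions; Shape-Adapted (SA)
# regions have no built-in OpenCV detector and are omitted here.
import cv2

img = cv2.imread("frame.png", cv2.IMREAD_GRAYSCALE)  # placeholder file name

# Detect maximally stable extremal regions (intensity watershed blobs).
mser = cv2.MSER_create()
regions, _ = mser.detectRegions(img)

# Describe each region by a 128-dimensional SIFT vector computed at a
# keypoint placed at the region's centroid.
sift = cv2.SIFT_create()
keypoints = [cv2.KeyPoint(float(x), float(y), 16.0)
             for (x, y) in (r.mean(axis=0) for r in regions)]
keypoints, descriptors = sift.compute(img, keypoints)
print(descriptors.shape)  # (number of regions, 128)
```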
Pre-Processing (Offline)
1.
2.
3.
4.
5.
6.
For each frame, detect affine covariant regions.
Track the regions through video and reject unstable
regions
Build visual vocabulary
Remove stop-listed visual words
Compute tf-idf weighted document frequency
vectors
Built inverted file-indexing structure
Tracking Regions Through the Video and Rejecting Unstable Regions
 Any region that does not survive for 3+ frames is rejected
 Such short-lived regions are unlikely to be interesting
 Reduces the number of regions per frame by approximately 50% (to ~600/frame); a sketch of this filter follows below
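A rough sketch of this stability filter, assuming frame-to-frame region matches have already been computed; `matches[f]` is a hypothetical dict mapping a region index in frame f to its matched region index in frame f+1 (the matching itself is not shown).

```python
# A rough sketch of the stability filter over hypothetical track data.
def stable_regions(matches, regions_per_frame, min_len=3):
    num_frames = len(regions_per_frame)
    # Forward pass: length of the track ending at each region.
    length = [{i: 1 for i in range(n)} for n in regions_per_frame]
    for f in range(num_frames - 1):
        for i, j in matches[f].items():
            length[f + 1][j] = max(length[f + 1][j], length[f][i] + 1)
    # Backward pass: propagate each track's final length to all its members.
    total = [dict(d) for d in length]
    for f in range(num_frames - 2, -1, -1):
        for i, j in matches[f].items():
            total[f][i] = max(total[f][i], total[f + 1][j])
    # Keep only regions lying on tracks that survive 3+ frames.
    return [{i for i, n in total[f].items() if n >= min_len}
            for f in range(num_frames)]
```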
Pre-Processing (Offline)
1. For each frame, detect affine covariant regions.
2. Track the regions through the video and reject unstable regions.
3. Build the visual vocabulary.
4. Remove stop-listed visual words.
5. Compute tf-idf weighted document frequency vectors.
6. Build the inverted file-indexing structure.
Visual Indexing Using Text-Retrieval Methods
TEXT                                                  | IMAGE
Represent words by their “stems”: ‘write’, ‘writing’, | Cluster similar regions into ‘visual words’
and ‘written’ are all mapped to ‘write’               |
Stop-list common words (‘a’, ‘an’, ‘the’)             | Stop-list common visual words
Rank search results according to how close the        | Use spatial information to check
query words occur within the retrieved document       | retrieval consistency
Visual Vocabulary
 Purpose: cluster regions from multiple frames into fewer groups called ‘visual words’
 Each descriptor is a 128-vector
 K-means clustering (described next)
 ~300K descriptors (600 regions/frame x ~500 frames) mapped into 16K visual words
 (6000 SA and 10000 MS clusters used)
K-Means Clustering
 Purpose: cluster N data points (SIFT descriptors) into K clusters (visual words)
 K = desired number of cluster centers (mean points)
 Step 1: Randomly guess K mean points
 Step 2: Assign each data point to the cluster with the nearest mean point
In this paper, the Mahalanobis distance is used to determine the ‘nearest cluster center’:

$d(x_1, x_2) = \sqrt{(x_1 - x_2)^\top \Sigma^{-1} (x_1 - x_2)}$

where $\Sigma$ is the covariance matrix of all descriptors, $x_2$ is the length-128 mean vector, and the $x_1$’s are the descriptor vectors (i.e., the data points).
 Step 3: Recalculate the cluster centers and distances; repeat until convergence (a code sketch follows below)
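A minimal sketch of this clustering step, assuming NumPy. Because the Mahalanobis distance with a shared covariance Σ equals the Euclidean distance after whitening, the descriptors are whitened once and standard k-means runs on the result; the data and cluster count here are illustrative stand-ins.

```python
import numpy as np

def kmeans(X, k, iters=20, seed=0):
    """Standard k-means: assign points to nearest centers, recompute means."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)].copy()
    for _ in range(iters):
        # Squared Euclidean distance from every point to every center.
        d = ((X ** 2).sum(1)[:, None] - 2 * X @ centers.T
             + (centers ** 2).sum(1)[None, :])
        labels = d.argmin(1)
        for j in range(k):
            pts = X[labels == j]
            if len(pts):
                centers[j] = pts.mean(0)
    return centers, labels

descriptors = np.random.rand(3000, 128)   # stand-in for the SIFT descriptors
# Whitening: Mahalanobis distance under Sigma becomes Euclidean distance.
cov = np.cov(descriptors, rowvar=False)
L = np.linalg.cholesky(np.linalg.inv(cov + 1e-6 * np.eye(128)))
whitened = descriptors @ L
centers, words = kmeans(whitened, k=64)    # the paper uses ~16K words
```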
Examples of Clusters of Regions
Samples of normalized affine covariant regions
Pre-Processing (Offline)
1. For each frame, detect affine covariant regions.
2. Track the regions through the video and reject unstable regions.
3. Build the visual vocabulary.
4. Remove stop-listed visual words.
5. Compute tf-idf weighted document frequency vectors.
6. Build the inverted file-indexing structure.
Remove Stop-Listed Words
Analogy to text retrieval:
 ‘a’, ‘and’, ‘the’, … are not distinctive words
 Common words cause mismatches
 The 5-10% most common visual words are stopped
 (800-1600 of the 16000 words)
(Upper row) Matches before stop-listing
(Lower row) Matches after stop-listing
Pre-Processing (Offline)
1. For each frame, detect affine covariant regions.
2. Track the regions through the video and reject unstable regions.
3. Build the visual vocabulary.
4. Remove stop-listed visual words.
5. Compute tf-idf weighted document frequency vectors.
6. Build the inverted file-indexing structure.
tf-idf Weighting
(term frequency-inverse document frequency weighting)
 n_id : number of occurrences of (visual) word i in document (frame) d
 n_d : total number of words in document d
 N_i : total number of documents containing term i
 N : number of documents in the database
 t_i : weighted word frequency,

$t_i = \frac{n_{id}}{n_d} \log \frac{N}{N_i}$

 Each document (frame) d is represented by the tf-idf vector $v_d = (t_1, \ldots, t_v)^\top$, where v is the number of visual words in the vocabulary (a code sketch follows below)
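A minimal sketch of the weighting, assuming NumPy and a hypothetical `counts` matrix of shape (number of frames, vocabulary size) holding the n_id values:

```python
import numpy as np

def tfidf(counts):
    """counts[d, i] = n_id, occurrences of visual word i in frame d."""
    n_d = counts.sum(axis=1, keepdims=True)      # total words per frame
    N = counts.shape[0]                          # frames in the database
    N_i = (counts > 0).sum(axis=0)               # frames containing word i
    idf = np.log(N / np.maximum(N_i, 1))         # guard against unused words
    return (counts / np.maximum(n_d, 1)) * idf   # t_i = (n_id / n_d) log(N / N_i)
```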
Inverted File Indexing
Visual Word Index | Found in Frames
1                 | 1, 4, 5
2                 | 1, 2, 10
…                 | …
N                 |
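A minimal sketch of this structure, assuming each frame's content is given as an iterable of visual word ids (a hypothetical input format):

```python
from collections import defaultdict

def build_inverted_index(frame_words):
    """frame_words[d] = iterable of visual word ids occurring in frame d."""
    index = defaultdict(list)
    for d, words in enumerate(frame_words):
        for w in set(words):          # one posting per (word, frame) pair
            index[w].append(d)
    return index

# At query time, only frames sharing at least one visual word with the
# query ever need to be scored:
# candidates = set().union(*(index[w] for w in query_words))
```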
Overview of the Talk
 Visual Search Algorithm
 Offline Pre-Processing
 Real-Time Query
 A Few Implementation Details
 Performance
 General Results
 Testing Individual Words
 Using External Images As Queries
 A Few Challenges and Future Directions
 Concluding Remarks
 Demo of the Algorithm
Real-Time Query
1. Determine the set of visual words found within the query region
2. Retrieve keyframes based on visual word frequencies (Ns = 500)
3. Re-rank the retrieved keyframes using spatial consistency

Retrieve keyframes based on visual word frequencies
 v_q, the vector of visual word frequencies corresponding to the query region, is computed
 the normalized scalar product of v_q with each v_d is computed:

$\mathrm{sim}(v_q, v_d) = \frac{v_q \cdot v_d}{\lVert v_q \rVert \, \lVert v_d \rVert}$
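A minimal sketch of this ranking step, assuming NumPy, a query tf-idf vector v_q, and a matrix V stacking the frame vectors v_d row by row:

```python
import numpy as np

def rank_frames(v_q, V, n_s=500):
    """Return the top n_s frames by normalized scalar product with v_q."""
    sims = (V @ v_q) / (np.linalg.norm(V, axis=1) * np.linalg.norm(v_q) + 1e-12)
    order = np.argsort(-sims)[:n_s]      # best-scoring frames first
    return order, sims[order]
```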
Spatial Consistency Voting
 Analogy: Google ranks text documents higher when the query words appear close together in the retrieved document
 Matched covariant regions in the retrieved frames should have a spatial arrangement similar to that of the query region
 Search area: the 15 nearest spatial neighbors of each match
 Each neighboring region that also matches in the retrieved frame casts a vote for that frame (a voting sketch follows below)
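A rough sketch of the voting, assuming SciPy for the nearest-neighbor search; `matches` is a hypothetical list of (query region index, frame region index) pairs, with region centers given as arrays of (x, y) coordinates:

```python
import numpy as np
from scipy.spatial import cKDTree

def spatial_votes(query_xy, frame_xy, matches, k=15):
    """Count votes: neighbours of a match that are themselves matched."""
    q_tree, f_tree = cKDTree(query_xy), cKDTree(frame_xy)
    matched = set(matches)
    votes = 0
    for qi, fi in matches:
        # k+1 because the query point itself is its own nearest neighbour.
        _, q_nbrs = q_tree.query(query_xy[qi], k + 1)
        _, f_nbrs = f_tree.query(frame_xy[fi], k + 1)
        f_set = set(int(j) for j in f_nbrs if j != fi)
        votes += sum((int(qn), fn) in matched
                     for qn in q_nbrs if qn != qi
                     for fn in f_set)
    return votes
```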
Spatial Consistency Voting
For a matched pair of words (A, B), each region in the defined search area of both frames casts a vote for the match (A, B).
(Upper row) Matches after stop-listing
(Lower row) Remaining matches after spatial consistency voting
Query Frame vs. Sample Retrieved Frame
[Figure panels: 1: query region; 2: close-up of 1; 3-4: initial matches; 5-6: matches after stop-listing; 7-8: matches after spatial consistency voting]
Overview of the Talk
 Visual Search Algorithm
 Offline Pre-Processing
 Real-Time Query
 A Few Implementation Details
 Performance
 General Results
 Testing Individual Words
 Using External Images As Queries
 A Few Challenges and Future Directions
 Concluding Remarks
 Demo of the Algorithm
Implementation Details
Offline Processing:
 100K-150K frames per typical feature-length film
 Refined to 4000-6000 keyframes
 Descriptors are computed for the stable regions in each keyframe
 Each region is assigned to a visual word
 The visual words over all keyframes are assembled into an inverted file structure
Algorithm Implementation
Real-Time Process:
 The user selects a query region
 Visual words are identified within the query region
 A short list of Ns = 500 keyframes is retrieved based on tf-idf vector similarity
 Similarity is recomputed with spatial consistency voting
Example Visual Search
Overview of the Talk
 Visual Search Algorithm
 Offline Pre-Processing
 Real-Time Query
 A Few Implementation Details
 Performance
 General Results
 Testing Individual Words
 Using External Images As Queries
 A Few Challenges and Future Directions
 Concluding Remarks
 Demo of the Algorithm
Retrieval Examples
Query Image | A Few Retrieved Matches
Retrieval Examples (cont.)
Query Image | A Few Retrieved Matches
Performance of the Algorithm
 Tried 6 object queries
(1) Red Clock
(2) Black Clock
(3) “Frame’s” Sign
(4) Digital Clock
(5) “Phil” Sign
(6) Microphone
Performance of the Algorithm (cont.)
 Evaluated at the level of shots rather than keyframes
 Measured using precision-recall plots

$\mathrm{Precision} = \frac{\#\ \text{of correctly retrieved shots}}{\text{Total}\ \#\ \text{of shots retrieved}}$

$\mathrm{Recall} = \frac{\#\ \text{of correctly retrieved shots}}{\text{Total}\ \#\ \text{of shots containing the object}}$

 Precision is a measure of fidelity or exactness
 Recall is a measure of completeness
Performance of the Algorithm (cont.)
 Ideally, precision = 1 for all recall values
 Average Precision (AP) summarizes the whole precision-recall curve; ideally AP = 1 (a computation sketch follows below)
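A minimal sketch of these measures, assuming NumPy; `ranked_hits` is a boolean sequence over the ranked shot list, True where the retrieved shot actually contains the object:

```python
import numpy as np

def precision_recall_ap(ranked_hits, num_relevant):
    """Precision/recall along the ranked list, plus Average Precision."""
    hits = np.cumsum(ranked_hits)
    precision = hits / np.arange(1, len(ranked_hits) + 1)
    recall = hits / num_relevant
    # AP: mean of the precision values at each correctly retrieved shot.
    ap = precision[np.asarray(ranked_hits, dtype=bool)].sum() / num_relevant
    return precision, recall, ap

p, r, ap = precision_recall_ap([True, False, True, True], num_relevant=5)
print(ap)  # ~0.483 for this toy ranking
```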
Examples of Missed Shots
 Extreme viewing angles
Original query object
Low-ranked shot
Examples of Missed Shots (cont.)
 Significant changes in scale and motion blurring
Original query object
Low-ranked shot
Qualitative Assessment of Performance
 General trends
    Higher precision at low recall levels
    Bias towards lightly textured regions detectable by the SA/MS detectors
    These limitations could be addressed by adding further types of covariant regions
 Other difficulties
    Textureless regions (e.g., a mug)
    Thin or wiry objects (e.g., a bike)
    Highly deformable objects (e.g., clothing)
Overview of the Talk
 Visual Search Algorithm
 Offline Pre-Processing
 Real-Time Query
 A Few Implementation Details
 Performance
 General Results
 Testing Individual Words
 Using External Images As Queries
 A Few Challenges and Future Directions
 Concluding Remarks
 Demo of the Algorithm
Quality of Individual Visual Words
 Use a single visual word as the query
 Tests the expressiveness of the visual vocabulary
 Sample query (see the lookup sketch below):
    Given an object of interest, select one of the visual words from that object
    Retrieve all frames that contain the visual word (no ranking)
    A retrieval is considered correct if the frame contains the object of interest
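Given the inverted index from the pre-processing sketch, this test is a single unranked lookup (reusing the hypothetical `index` structure from earlier):

```python
def query_single_word(index, word_id):
    """All frames containing the visual word, unranked, as in the test above."""
    return index.get(word_id, [])
```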
Examples of Individual Visual Words
Top row: Scale-normalized close-ups of elliptical regions overlaid on query image
Bottom row: Corresponding normalized regions
Results of Individual Word Searches
 Individual words are “noisy”
 Intuitively, this is because a word occurs on multiple objects and does not cover all occurrences of any one object
Quality of Individual Visual Words
 Unrealistic: require each word to occur on only one object (high precision); a growing number of objects would then require a growing number of words
 Realistic: visual words are shared across objects, with each object represented by a combination of words
Overview of the Talk
 Visual Search Algorithm
 Offline Pre-Processing
 Real-Time Query
 A Few Implementation Details
 Performance
 General Results
 Testing Individual Words
 Using External Images As Queries
 A Few Challenges and Future Directions
 Concluding Remarks
 Demo of the Algorithm
Searching for Objects From Outside of the Movie
 Used external query images from the internet
 Manually labeled all occurrences of the external queries in the movies
 Results:

External Query Image | No. of Occurrences | Rankings of Retrieved Occurrences | AP (Average Precision)
Sony logo            | 3                  | 1st, 4th, 35th                    | 0.53
Hollywood sign       | 1                  | 1st                               | 1
Notre Dame           | 1                  | 1st                               | 1
Sample External Query Results
 Potential Applications
Overview of the Talk
 Visual Search Algorithm
 Offline Pre-Processing
 Real-Time Query
 A Few Implementation Details
 Performance
 General Results
 Testing Individual Words
 Using External Images As Queries
 A Few Challenges and Future Directions
 Concluding Remarks
 Demo of the Algorithm
Challenge I: Visual Vocabularies for Very Large Scale Retrieval
 Current progress: a 150,000-frame feature movie is reduced to 6000 keyframes and then processed
 Ultimate goal: indexing billions of online images to build a visual search engine
 Should the vocabulary increase in size as the image archive grows?
 How discriminative should the words be?
 Do visual words built from one movie generalize to an outside database of images?
(a), (c) External images downloaded from the Internet
(b) Correct retrieval frame from the movie ‘Pretty Woman’
(d) Correct retrieval from the movie ‘Charade’
 Learning a universal visual vocabulary remains a challenge
Challenge II: Retrieval of 3D Objects
 The current algorithm detects objects successfully despite slight changes in viewpoint, illumination, and partial occlusion, thanks to the SIFT features
 However, full 3D retrieval is a fundamentally harder challenge
Proposed approach 1:
Automatic association of images using temporal information
 Group the front, side, and back of a car as it turns in a video
 Possible on the query side and/or the database side
 Query-side matching: associated query frames are computed and used for 3D object search
Query-side matching of associated frames
Proposed approach 1 (cont.)
 Grouping on the database side: a query on a single aspect is expected to retrieve the pregrouped frames associated with the 3D object
(Top row) Query image
(Bottom rows) Matching frames
Proposed approach 2:
Building an explicit 3-D model for each 3-D object in the video
 The focus is more on model building than on detection
 Only rigid objects are considered
Challenge III: Verification Using Spatial Structure
 Spatial consistency was helpful, but could be improved
 A few suggestions
    Use caution with measures that assume rigid geometry
    Reduce cost using a hierarchical approach
 Two complementary methods
    Ferrari et al. (2004): matching deformable objects
    Rothganger et al. (2003): matching 3D objects
Verification Using Spatial Structure (cont.)
 Method 1 (Ferrari)
 Based on spatial overlap of local
regions
 Requires regions to match individually
and pattern of intersection between
neighboring regions to be preserved
 Performance
 Pro: Works well with deformations
 Con: Computationally expensive
Verification Using Spatial Structure (cont.)
 Method 2 (Rothganger)
 Based on 3-D object model
 Requires consistency of local
appearance descriptors and
geometric consistency
 Performance
 Pro: Object can be matched in
diverse (even novel) poses
 Con: 3-D model built offline,
requires up to 20 images of object
taken from different viewpoints
Overview of the Talk
 Visual Search Algorithm
 Offline Pre-Processing
 Real-Time Query
 A Few Implementation Details
 Performance
 General Results
 Testing Individual Words
 Using External Images As Queries
 A Few Challenges and Future Directions
 Concluding Remarks
 Demo of the Algorithm
Conclusion
 Demonstrated a scalable object retrieval architecture which uses
    A visual vocabulary based on vector-quantized, viewpoint-invariant descriptors
    Efficient indexing techniques from text retrieval
 A few notable differences between document and image bag-of-words retrieval
    Spatial information
    The number of “words” in a query
    Matching requirements
Looking forward…
 TinEye (May 2008)
    Image-based search engine
    Given a query image, searches for altered versions of that image (e.g., scaled or cropped)
    1.86 billion images indexed
 Google Goggles (2009)
    Use a phone to take a photo; results are retrieved from the internet
    Limited categories
Overview of the Talk
 Visual Search Algorithm
 Offline Pre-Processing
 Real-Time Query
 A Few Implementation Details
 Performance
 General Results
 Testing Individual Words
 Using External Images As Queries
 A Few Challenges and Future Directions
 Concluding Remarks
 Demo of the Algorithm
Demo of Retrieval Algorithm
 Live demonstration
Main References
 D. Lowe. Distinctive Image Features from Scale-Invariant Keypoints. International Journal of Computer Vision, 60(2):91-110, 2004.
 J. Sivic and A. Zisserman. Efficient Visual Search for Objects in Videos. Proceedings of the IEEE, 96(4):548-566, 2008.
 W. Qian. “Video Google: A Text Retrieval Approach to Object Matching in Videos.” www.mriedel.ece.umn.edu/wiki/index.php/Weikang_Qian