Slide 1: Video Google: A Text Retrieval Approach to Object Matching in Videos
Josef Sivic and Andrew Zisserman, Robotics Research Group, University of Oxford
Presented by Xia Li

Slide 2: Motivation
- Objective: to retrieve objects or scenes in a movie with the ease, speed and accuracy with which Google retrieves web pages containing particular words.

Slide 3: Object query
[Figure: a query region and its close-up, with retrieved key-frames from three different shots]

Slide 4: Scene location query
[Figure: a query frame, with retrieved key-frames from three different shots taken at the same location]

Slide 5: Why is it difficult?
- Viewpoint change.
- Retrieval must be efficient at run time.
- Difficult real-world data.
- Example: the same object (a clock) appears at various points in the movie 'Run Lola Run'. Note the partial occlusion and the substantial changes in scale and viewpoint.

Slide 6: Outline
- Viewpoint invariant content-based image retrieval
- Visual indexing using text retrieval methods
- Experimental evaluation of scene matching and object retrieval using visual words
- Conclusion

Slide 7: Problem statement
- Retrieve the key frames containing the same object. (This slide is borrowed from Josef.)

Slide 8: Approach
- Affine invariant regions: regions are detected independently in each image and cover the same surface area in the scene.
- Descriptor vectors are extracted from each region's appearance.
- Descriptors are matched between frames using the invariant vectors.
- Matches are disambiguated using spatial consistency.

Slide 9: Affine invariant regions
[Figure]

Slide 10: Local invariance requirement
- Geometric: 2D affine transformation.
- Photometric: 1D affine transformation.

Slide 11: Finding invariant regions
Two types of 'viewpoint covariant regions' are computed for each frame:
- SA: Shape Adapted
- MS: Maximally Stable

Slide 12: SA - Shape Adapted
- Constructed by elliptical shape adaptation about an interest point.
- The ellipse centre, scale and shape around the interest point are determined iteratively.
- Reference: Mikolajczyk & Schmid, ECCV 2002; Schaffalitzky & Zisserman, ECCV 2002.

Slide 13: MS - Maximally Stable
- Constructed by selecting areas from an intensity watershed image segmentation.
- The regions are those for which the area is approximately stationary as the intensity threshold is varied.
- Reference: Matas et al., BMVC 2002.
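The MS detector described on slide 13 is the algorithm OpenCV ships as MSER (Matas et al., BMVC 2002). As a hedged illustration of the detection step, here is a minimal sketch that finds MS-style regions and fits ellipses to them so they can be described like the paper's elliptical regions; the image path is a placeholder, not from the original deck.

```python
# Minimal sketch: detecting Maximally Stable (MSER) regions with OpenCV.
# OpenCV's MSER implements Matas et al. (BMVC 2002), the MS detector
# referenced above. The image path below is a hypothetical placeholder.
import cv2

frame = cv2.imread("keyframe.png")          # hypothetical key-frame from the video
gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)

mser = cv2.MSER_create()                    # default stability/area thresholds
regions, bboxes = mser.detectRegions(gray)  # per-region pixel lists + bounding boxes

# Fit an ellipse to each region so it can be described like the paper's
# elliptical SA/MS regions (fitEllipse needs at least 5 points).
ellipses = [cv2.fitEllipse(r) for r in regions if len(r) >= 5]
print(f"{len(regions)} MS regions, {len(ellipses)} fitted ellipses")
```

A real pipeline would run this on every key-frame and pass the elliptical patches to the descriptor stage described next.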
Slide 14: Why two types of regions?
- They provide complementary representations of a frame.
- The SA regions tend to be centered on corner-like features.
- The MS regions correspond to blobs of high contrast with respect to their surroundings.

Slide 15: Building the descriptors
- SIFT: Scale Invariant Feature Transform [Lowe].
- Each elliptical region is represented by a 128-dimensional vector.
- SIFT is invariant to a shift of a few pixels, which often occurs.

Slide 16: Example
[Figure: two frames showing the same scene from different camera viewpoints, from the film 'Run Lola Run']

Slide 17: Example
[Figure: the two frames with detected affine invariant regions superimposed; 'Maximally Stable' (MS) regions in yellow, 'Shape Adapted' (SA) regions in cyan]

Slide 18: Example
[Figure: the final matched regions after indexing and the spatial consistency ranking algorithm]

Slide 19: Aggregating information from multiple frames
- Any region which does not survive for more than three frames is rejected.
- Each region of a track can be regarded as an independent measurement of a common scene region.
- The descriptor estimate for that scene region is computed by averaging the descriptors throughout the track.

Slide 20: Lessons from text retrieval (Google)
- Words & documents
- Vocabulary
- Weighting
- Inverted file
- Ranking

Slide 21: Words & documents
- Documents are parsed into words.
- Common words (the, an, etc.) are ignored; this is called a 'stop list'.
- Words are represented by their stems: 'walk', 'walking', 'walks' → 'walk'.
- Each word is assigned a unique identifier.
- A document is represented by a vector whose components are the frequencies of occurrence of the words it contains.

Slide 22: Example
- Sentence: "Representation, detection and learning are the main issues that need to be tackled in designing a visual system for recognizing object categories."
- Extracted stems: represent, detect, learn, tackle, design, recognize, category.

Slide 23: Text retrieval techniques
- Weighting: tf-idf ('term frequency - inverse document frequency'):
  t_i = (n_{id} / n_d) log(N / n_i),
  where n_{id} is the number of occurrences of word i in document d, n_d is the total number of words in document d, n_i is the number of documents containing word i, and N is the number of documents in the database.
- Inverted file.

Slide 24: Our case
- The query vector is given by the visual words contained in a user-specified sub-part of a frame.
- The other frames are ranked according to the similarity of their weighted vectors to this query vector.

Slide 25: Building the "visual stems"
- Cluster the descriptors into K groups using the k-means clustering algorithm.
- Each cluster represents a "visual word" in the "visual vocabulary".
- Result: 10K SA clusters and 16K MS clusters.

Slide 26: Implementation
- Reject unstable regions.
- A subset of 48 shots is selected.
- Distance function.

Slide 27: MS and SA "visual words"
[Figure]

Slide 28: Visual "stop list"
- The most frequent visual words, which occur in almost all images, are suppressed.
[Figure: matches before and after applying the stop list]

Slide 29: Visual "stop list"
[Figure: frequency of MS visual words among all 3768 key-frames of 'Run Lola Run', (a) before and (b) after application of a stop list]

Slide 30: Spatial consistency
- Require that neighboring matches have the same spatial layout in the query region and the retrieved frame.
- A search area is defined by the 15 nearest neighbours of each match.
- Each region which also matches within this area casts a vote for that frame.
- The total number of votes determines the rank of the frame.

Slide 31: Ranking frames
- Distance between vectors (as with words and documents).
- Spatial consistency (analogous to word order in text).

Slide 32: Visual Google process
[Figure]
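To ground the weighting, vocabulary-building and stop-list steps of slides 23-31, here is a minimal sketch in Python/NumPy. It assumes the SIFT descriptors and their frame indices are already stacked in arrays; the file names, the vocabulary size K and the 5% stop-list cutoff are illustrative assumptions, not the paper's settings, and SciPy's kmeans2 stands in for whatever clustering implementation the authors used.

```python
# Minimal sketch: visual vocabulary via k-means, a frequency stop list,
# and tf-idf weighting (t_i = (n_id / n_d) * log(N / n_i), as on slide 23).
import numpy as np
from scipy.cluster.vq import kmeans2

descriptors = np.load("sift_descriptors.npy")  # hypothetical (n_regions, 128) array
frame_ids = np.load("frame_ids.npy")           # hypothetical frame index per region

K = 1000                                       # illustrative; deck reports 10K SA + 16K MS
centroids, labels = kmeans2(descriptors.astype(np.float64), K, minit="++")

# Term-frequency matrix: one row per frame, one column per visual word.
n_frames = int(frame_ids.max()) + 1
tf = np.zeros((n_frames, K))
for frame, word in zip(frame_ids, labels):
    tf[frame, word] += 1

# Visual stop list: suppress the most frequent visual words.
doc_freq = (tf > 0).sum(axis=0)                # n_i: frames containing word i
stop = np.argsort(doc_freq)[-K // 20:]         # drop the top 5% (illustrative cutoff)
tf[:, stop] = 0

# tf-idf weighting: (n_id / n_d) * log(N / n_i), one vector per frame.
n_d = tf.sum(axis=1, keepdims=True)            # words per frame
idf = np.log(n_frames / np.maximum(doc_freq, 1))
tfidf = (tf / np.maximum(n_d, 1)) * idf
```

These `tfidf` rows are the weighted document vectors that the vocabulary-building and query slides below operate on.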
Slide 33: Vocabulary building
- A subset of 48 shots is selected: about 10k frames, 10% of the movie.
- Region construction (SA + MS): 10k frames × ~1600 regions per frame ≈ 1.6 × 10^6 regions.
- Regions are represented by SIFT descriptors.
- Tracking through frames reduces the 1.6 × 10^6 regions to ~200k; unstable regions are rejected.
- Descriptors are clustered using the k-means algorithm.
- Parameter tuning is done with the ground-truth set.

Slide 34: [Figure]

Slide 35: Query object
- Generate the query descriptors.
- Use a nearest-neighbour algorithm to build the query vector.
- Use the inverted index to find relevant frames; document vectors are sparse → only a small set of frames needs to be considered.
- Rank the results by calculating the distance to the relevant frames.
- Takes about 0.1 seconds with a Matlab implementation. (A sketch of this query pipeline appears after the references.)

Slide 36: Experimental evaluation of scene matching using visual words
[Figure: (a) each row shows a frame from three different shots of the same location in the ground-truth data set]

Slide 37: Experimental evaluation of scene matching using visual words
[Figure: (b) average normalized rank for location matching on the ground-truth set; (c) average precision-recall curve for location matching on the ground-truth set]

Slide 38: Example: Run Lola Run
[Figure]

Slide 39: References
- D. Lowe. Object recognition from local scale-invariant features. In Proc. ICCV, pages 1150-1157, 1999.
- J. Matas, O. Chum, M. Urban, and T. Pajdla. Robust wide baseline stereo from maximally stable extremal regions. In Proc. BMVC, pages 384-393, 2002.
- K. Mikolajczyk and C. Schmid. An affine invariant interest point detector. In Proc. ECCV. Springer-Verlag, 2002.
- F. Schaffalitzky and A. Zisserman. Multi-view matching for unordered image sets, or "How do I organize my holiday snaps?". In Proc. ECCV, volume 1, pages 414-431. Springer-Verlag, 2002.
- F. Schaffalitzky and A. Zisserman. Automated scene matching in movies. In Proc. CIVR 2002, LNCS 2383, pages 186-197. Springer-Verlag, 2002.
- J. Sivic and A. Zisserman. Video Google: a text retrieval approach to object matching in videos. In Proc. ICCV, 2003.
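As promised at slide 35, here is a hedged sketch of the query pipeline: assign the query-region descriptors to visual words by nearest-neighbour search, use an inverted file to collect only the frames that share a word with the query, and rank them by the similarity of their weighted vectors. The names `centroids`, `tfidf` and `idf` carry over from the vocabulary sketch above; everything else is an illustrative assumption.

```python
# Minimal sketch of the query pipeline (slide 35), reusing `centroids`,
# `tfidf` and `idf` from the vocabulary sketch. Query descriptors come
# from a hypothetical user-selected sub-region of a frame.
import numpy as np

def build_inverted_index(tfidf):
    """Map each visual word to the frames whose vectors use it."""
    index = {}
    for word in range(tfidf.shape[1]):
        frames = np.nonzero(tfidf[:, word])[0]
        if frames.size:
            index[word] = frames
    return index

def query(query_descriptors, centroids, tfidf, idf, index, top_k=10):
    # 1. Nearest-neighbour assignment of query descriptors to visual words.
    d2 = ((query_descriptors[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=-1)
    words = d2.argmin(axis=1)

    # 2. Weighted query vector: term frequency times inverse document frequency.
    q = np.bincount(words, minlength=centroids.shape[0]).astype(float)
    q = (q / max(q.sum(), 1.0)) * idf

    # 3. Inverted file: only frames sharing a visual word become candidates,
    #    so the sparse document vectors keep the candidate set small.
    hit_lists = [index[w] for w in set(words.tolist()) if w in index]
    if not hit_lists:
        return np.array([], dtype=int), np.array([])
    candidates = np.unique(np.concatenate(hit_lists))

    # 4. Rank candidates by normalized scalar product (cosine similarity).
    sims = tfidf[candidates] @ q
    norms = np.linalg.norm(tfidf[candidates], axis=1) * max(np.linalg.norm(q), 1e-12)
    scores = sims / np.maximum(norms, 1e-12)
    order = np.argsort(scores)[::-1][:top_k]
    return candidates[order], scores[order]
```

The spatial-consistency re-ranking of slide 30, which re-orders this list by counting matches that agree within the 15-nearest-neighbour search areas, is omitted here for brevity.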