Introduction to Computer Vision
Object Categorization I (Bag-of-Words Model)
Prof. Kyoung Mu Lee
Dept. of EECS, Seoul National University
(Most of the lecture slides are from B. Leibe.)

Topics of This Lecture
• Indexing with Local Features
  – Inverted file index
  – Visual words
  – Visual vocabulary construction
  – Use for content-based image/video retrieval
• Object Categorization
  – More exact task definition
  – Basic-level categories
  – Category representations
• Bag-of-Words Representations
  – Representation
  – Learning
  – Use for image classification

Indexing Local Features
• Each patch/region has a descriptor, which is a point in some high-dimensional feature space (e.g., SIFT).
(Figure credit: A. Zisserman)

Indexing Local Features
• Close points in feature space correspond to similar descriptors, which indicates similar local image content.
• This is of interest not only for 3D reconstruction, but also for retrieving images of similar objects.
(Figure credit: A. Zisserman)

Indexing Local Features
(Figure-only slide. Slide credit: Kristen Grauman)

Indexing Local Features
• With potentially thousands of features per image, and hundreds to millions of images to search, how do we efficiently find the features relevant to a new image?
• Low-dimensional descriptors (e.g., obtained through PCA):
  – Can use standard efficient data structures for nearest-neighbor search.
• High-dimensional descriptors:
  – Approximate nearest-neighbor search methods are more practical.
• Inverted file indexing schemes.
(Slide credit: Kristen Grauman)

Indexing Local Features: Inverted File Index
• For text documents, an efficient way to find all pages on which a word occurs is to use an index.
• We want to find all images in which a feature occurs.
• To use this idea, we'll need to map our features to "visual words".
(Slide credit: K. Grauman, B. Leibe)

Text Retrieval vs. Image Search
• What makes the two problems similar? What makes them different?
(Slide credit: Kristen Grauman)
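To make the inverted-index idea concrete, here is a minimal sketch in Python. It is my own illustration, not code from the lecture: once features have been quantized to word ids (covered in the next slides), we store, for each word, the list of images containing it, so a query only touches images that share at least one word with it.

```python
from collections import defaultdict

def build_inverted_index(image_words):
    """image_words: dict mapping image_id -> set of visual word ids.
    Returns dict mapping word id -> sorted list of image ids containing it."""
    index = defaultdict(set)
    for image_id, words in image_words.items():
        for w in words:
            index[w].add(image_id)
    return {w: sorted(ids) for w, ids in index.items()}

def candidate_images(index, query_words):
    """Return ids of database images sharing at least one word with the query."""
    candidates = set()
    for w in query_words:
        candidates.update(index.get(w, []))
    return candidates

# Toy usage: three images described by their visual word ids.
db = {0: {3, 17, 42}, 1: {17, 99}, 2: {5, 8}}
index = build_inverted_index(db)
print(candidate_images(index, {17, 5}))  # -> {0, 1, 2}
```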
Visual Words: Main Idea
• Extract some local features from a number of images, e.g., SIFT. In the descriptor space, each point is 128-dimensional.
• (Several animation slides step through this idea: each point in the descriptor space is one local descriptor, e.g., a SIFT vector.)
(Slide credit: David Nister)

Indexing with Visual Words
• Map high-dimensional descriptors to tokens/words by quantizing the feature space.
• Quantize via clustering; let the cluster centers be the prototype "words".
(Slide credit: Kristen Grauman)

Indexing with Visual Words
• Map high-dimensional descriptors to tokens/words by quantizing the feature space.
• Determine which word to assign to each new image region by finding the closest cluster center.
(Slide credit: Kristen Grauman)

Visual Words
• Example: each group of patches belongs to the same visual word.
(Slide credit: Kristen Grauman; figure from Sivic & Zisserman, ICCV 2003)

Visual Words: Texture Representation
• First explored for texture and material representations.
• Texton = cluster center of filter responses over a collection of images.
• Describe textures and materials based on the distribution of prototypical texture elements.
(Leung & Malik 1999; Varma & Zisserman 2002; Lazebnik, Schmid & Ponce 2003. Slide credit: Kristen Grauman)

Visual Words: Texture Representation
• Texture is characterized by the repetition of basic elements, or textons.
• For stochastic textures, it is the identity of the textons, not their spatial arrangement, that matters.
(Julesz 1981; Cula & Dana 2001; Leung & Malik 2001; Mori, Belongie & Malik 2001; Schmid 2001; Varma & Zisserman 2002, 2003; Lazebnik, Schmid & Ponce 2003. Slide credit: Svetlana Lazebnik)

Visual Words
• More recently used for describing scenes and objects for the sake of indexing or classification.
(Sivic & Zisserman 2003; Csurka, Bray, Dance & Fan 2004; many others. Slide credit: Kristen Grauman)

Inverted File Index for Images of Visual Words
• Table: word number → list of the image numbers in which that word occurs.
• When will this give us a significant gain in efficiency?
(Slide credit: Kristen Grauman)

Visual Vocabulary Formation
Issues:
• Sampling strategy: where to extract features?
• Clustering / quantization algorithm
• Unsupervised vs. supervised
• What corpus provides the features (universal vocabulary?)
• Vocabulary size, number of words
(Slide credit: Kristen Grauman)

Sampling Strategies
• Options: sparse (at interest points), dense (uniformly), random, or multiple complementary interest operators.
• To find specific, textured objects, sparse sampling from interest points is often more reliable.
• Multiple complementary interest operators offer more image coverage.
• For object categorization, dense sampling offers better coverage.
[See Nowak, Jurie & Triggs, ECCV 2006] (Slide credit: Kristen Grauman)

Clustering / Quantization Methods
• K-means (the typical choice), agglomerative clustering, mean-shift, … (a k-means sketch follows below)
• Hierarchical clustering: allows faster insertion / word assignment while still allowing large vocabularies.
  – Vocabulary tree [Nister & Stewenius, CVPR 2006] (see the sketch after the next code block)
(Slide credit: Kristen Grauman)
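As a concrete illustration of quantization via clustering, here is a plain Lloyd's k-means sketch under my own assumptions (function names, data shapes, and the random "descriptors" are hypothetical, not from the lecture): build a vocabulary from pooled descriptors, then assign each new descriptor to its nearest center.

```python
import numpy as np

def assign_words(descriptors, centers):
    """Quantization: index of the nearest cluster center for each descriptor."""
    # squared distances via ||x||^2 + ||c||^2 - 2 x.c (avoids an n*k*d array)
    d2 = ((descriptors ** 2).sum(1)[:, None]
          + (centers ** 2).sum(1)[None, :]
          - 2.0 * descriptors @ centers.T)
    return d2.argmin(axis=1)

def build_vocabulary(descriptors, k=50, iters=10, seed=0):
    """Plain Lloyd's k-means over pooled local descriptors.
    Returns a (k, d) array of cluster centers = the visual words."""
    rng = np.random.default_rng(seed)
    X = np.asarray(descriptors, dtype=np.float64)
    centers = X[rng.choice(len(X), size=k, replace=False)]  # random init
    for _ in range(iters):
        labels = assign_words(X, centers)
        for j in range(k):
            members = X[labels == j]
            if len(members):                  # keep old center if cluster emptied
                centers[j] = members.mean(axis=0)
    return centers

# Toy usage with random vectors standing in for pooled SIFT descriptors.
rng = np.random.default_rng(1)
train = rng.normal(size=(2000, 128))
vocab = build_vocabulary(train, k=50)
words = assign_words(rng.normal(size=(300, 128)), vocab)  # word id per region
```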
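And here is a minimal sketch of the hierarchical quantization behind the vocabulary tree that the next slides walk through. This is my own simplification of the [Nister & Stewenius] idea, reusing the hypothetical build_vocabulary/assign_words helpers from the sketch above: recursively cluster with a small branch factor, so that assigning a word costs only branch × depth comparisons instead of one per leaf. In a real system each node would also carry an inverted file and a tf-idf weight.

```python
import numpy as np

def build_tree(descriptors, branch=4, depth=3, seed=0):
    """Hierarchical k-means: recursively split the descriptors.
    Each node is {'centers': (branch, d) array, 'children': list of subtrees}."""
    if depth == 0 or len(descriptors) <= branch:
        return None                                    # leaf reached
    centers = build_vocabulary(descriptors, k=branch, seed=seed)
    labels = assign_words(descriptors, centers)
    children = [build_tree(descriptors[labels == i], branch, depth - 1, seed)
                for i in range(branch)]
    return {'centers': centers, 'children': children}

def lookup(tree, desc, branch=4):
    """Assign one descriptor: branch*depth comparisons instead of one per leaf."""
    node, path = tree, 0
    while node is not None:
        i = int(((node['centers'] - desc) ** 2).sum(axis=1).argmin())
        path = path * branch + i                       # encode root-to-leaf path
        node = node['children'][i]
    return path
```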
Example: Recognition with Vocabulary Tree
• Tree construction: hierarchically quantize the descriptor space.
(Slide credit: David Nister) [Nister & Stewenius, CVPR'06]

Vocabulary Tree
• Training: filling the tree. Each database image's descriptors are passed down the tree, and the visited nodes record which images reached them. (The original deck steps through this over several animation slides.)
(Slide credit: David Nister) [Nister & Stewenius, CVPR'06]

Vocabulary Tree
• Recognition: query descriptors are passed down the tree, matching database images are scored, and the top candidates are verified with RANSAC.
[Nister & Stewenius, CVPR'06]

Quiz Questions
• What is the computational advantage of the hierarchical representation for bag of words vs. a flat vocabulary?
• What dangers does such a representation carry?

Vocabulary Tree: Performance
• Evaluated on large databases:
  – Indexing with up to 1M images
• Online recognition for a database of 50,000 CD covers:
  – Retrieval in ~1 s
• Experimental finding that large vocabularies can be beneficial for recognition.
[Nister & Stewenius, CVPR'06]

Vocabulary Size
• Larger vocabularies can be advantageous…
• But what happens when the vocabulary gets too large?
  – Efficiency?
  – Robustness?
(Plot: retrieval performance vs. vocabulary size, with one curve per branch factor.)

tf-idf Weighting
• Term frequency – inverse document frequency.
• Describe a frame by the frequency of each word within it, and down-weight words that appear often in the database.
• (This is the standard weighting for text retrieval.)
• Weight of word i in document d:

      t_i = (n_id / n_d) · log(N / n_i)

  where n_id = number of occurrences of word i in document d, n_d = number of words in document d, N = total number of documents in the database, and n_i = number of occurrences of word i in the whole database.
(Slide credit: Kristen Grauman)
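A short sketch of this weighting applied to a matrix of bag-of-words counts; it is a direct transcription of the formula above, with the variable names and the divide-by-zero guards being my own additions.

```python
import numpy as np

def tfidf(histograms):
    """histograms: (num_docs, num_words) array of raw word counts per image.
    Returns tf-idf weighted vectors t_i = (n_id / n_d) * log(N / n_i)."""
    H = np.asarray(histograms, dtype=np.float64)
    n_d = H.sum(axis=1, keepdims=True)       # words per document
    n_i = H.sum(axis=0, keepdims=True)       # occurrences of each word, whole database
    N = H.shape[0]                           # number of documents
    tf = H / np.maximum(n_d, 1)
    idf = np.log(N / np.maximum(n_i, 1))     # guard against never-seen words
    return tf * idf

# Toy usage: 3 "images", 4 visual words.
print(tfidf([[2, 0, 1, 0], [1, 1, 0, 0], [0, 3, 0, 1]]))
```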
Application to Content-Based Image Retrieval
• What if the query of interest is only a portion of a frame?
(Slide credit: Andrew Zisserman) [Sivic & Zisserman, ICCV'03]

Video Google System
• Figure: query region and retrieved frames.
1. Collect all words within the query region.
2. Use the inverted file index to find relevant frames.
3. Compare word counts.
4. Spatial verification.
(Sivic & Zisserman, ICCV 2003)
• Demo online at: http://www.robots.ox.ac.uk/~vgg/research/vgoogle/index.html
(Slide credit: Kristen Grauman)

Example Results
• Queries and their retrieved shots.
(Slide credit: Andrew Zisserman)

Collecting Words Within a Query Region
• Example from Friends. Query region: pull out only the SIFT descriptors whose positions fall within the polygon (a point-in-polygon sketch follows below).
(Slide credit: Kristen Grauman)
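A minimal sketch of that region-filtering step. Matplotlib's point-in-polygon test is one convenient choice; the data layout (arrays of keypoint positions and word ids) is my assumption, not something specified in the lecture.

```python
import numpy as np
from matplotlib.path import Path

def words_in_region(keypoint_xy, word_ids, polygon_xy):
    """keypoint_xy: (n, 2) array of feature positions in the frame.
    word_ids:    (n,) visual word id of each feature.
    polygon_xy:  (m, 2) vertices of the user-drawn query polygon.
    Returns the word ids of features inside the polygon."""
    inside = Path(polygon_xy).contains_points(keypoint_xy)
    return word_ids[inside]

# Toy usage: 4 features, square query region.
xy = np.array([[10, 10], [50, 50], [200, 80], [55, 60]])
ids = np.array([3, 17, 99, 17])
poly = [(0, 0), (100, 0), (100, 100), (0, 100)]
print(words_in_region(xy, ids, poly))  # -> [ 3 17 17]
```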
Example Results / More Results
• Further queries and their retrieved shots (figure slides).
(Slide credit: Kristen Grauman)

Topics of This Lecture
• Indexing with Local Features
  – Inverted file index
  – Visual words
  – Visual vocabulary construction
  – Use for content-based image/video retrieval
• Object Categorization
  – More exact task definition
  – Basic-level categories
  – Category representations
• Bag-of-Words Representations
  – Representation
  – Learning
  – Use for image classification

Revisiting Object Categorization
• Find this particular object.
• Recognize ANY car.
• Recognize ANY cow.

Object Categorization
• Task description:
  – "Given a small number of training images of a category, recognize a-priori unknown instances of that category and assign the correct category label."
• Which categories are feasible visually?
  – Extensively studied in cognitive psychology, e.g. [Brown'58]: "Fido" → German shepherd → dog → animal → living being.

Visual Object Categories
• Basic level categories in human categorization [Rosch 76, Lakoff 87]:
  – The highest level at which category members have similar perceived shape
  – The highest level at which a single mental image reflects the entire category
  – The level at which human subjects are usually fastest at identifying category members
  – The first level named and understood by children
  – The highest level at which a person uses similar motor actions for interaction with category members

Visual Object Categories
• Basic-level categories in humans seem to be defined predominantly visually.
• There is evidence that humans (usually) start with basic-level categorization before doing identification.
  ⇒ Basic-level categorization is easier and faster for humans than object identification!
  ⇒ Most promising starting point for visual classification.
(Figure: category hierarchy from abstract levels (animal, quadruped) through the basic level (dog, cat, cow) down to the individual level (German shepherd, Doberman, "Fido").)

How Many Object Categories Are There?
(Biederman 1987. Source: Fei-Fei Li, Rob Fergus, Antonio Torralba)

Other Types of Categories
• Functional categories:
  – e.g., chairs = "something you can sit on"

Other Types of Categories
• Ad-hoc categories:
  – e.g., "something you can find in an office environment"

Challenges: Robustness
• Illumination, occlusions, object pose, intra-class appearance, clutter, viewpoint.
(Slide credit: Kristen Grauman)

Challenges: Context and Human Experience
• Context cues.
(Slide credit: Kristen Grauman)

Challenges: Learning with Minimal Supervision
• A spectrum from less to more supervision.
(Slide credit: Kristen Grauman)

Learning from One Example
(Two slides from Pietro Perona, 2004 Object Recognition workshop)

How to Arrive at a Category Representation?
• If a local image region is a visual word, how can we summarize an image (the document)?
• How can we build a representation that is suitable for representing an entire category?
  – Robust to intra-category variation
  – Robust to deformation, articulation, …
  – But still discriminative
(Slide credit: Kristen Grauman)

Topics of This Lecture
• Indexing with Local Features
  – Inverted file index
  – Visual words
  – Visual vocabulary construction
  – Use for content-based image/video retrieval
• Object Categorization Revisited
  – More exact task definition
  – Basic-level categories
  – Category representations
• Bag-of-Words Representations
  – Representation
  – Learning
  – Use for image classification

Analogy to Documents
• Document 1 (keywords highlighted on the slide: sensory, brain, visual, perception, retinal, cerebral cortex, eye, cell, optical, nerve, image, Hubel, Wiesel):
  "Of all the sensory impressions proceeding to the brain, the visual experiences are the dominant ones. Our perception of the world around us is based essentially on the messages that reach the brain from our eyes. For a long time it was thought that the retinal image was transmitted point by point to visual centers in the brain; the cerebral cortex was a movie screen, so to speak, upon which the image in the eye was projected. Through the discoveries of Hubel and Wiesel we now know that behind the origin of the visual perception in the brain there is a considerably more complicated course of events. By following the visual impulses along their path to the various cell layers of the optical cortex, Hubel and Wiesel have been able to demonstrate that the message about the image falling on the retina undergoes a step-wise analysis in a system of nerve cells stored in columns. In this system each cell has its specific function and is responsible for a specific detail in the pattern of the retinal image."
• Document 2 (keywords highlighted: China, trade, surplus, commerce, exports, imports, US, yuan, bank, domestic, foreign, increase, value):
  "China is forecasting a trade surplus of $90bn (£51bn) to $100bn this year, a threefold increase on 2004's $32bn. The Commerce Ministry said the surplus would be created by a predicted 30% jump in exports to $750bn, compared with an 18% rise in imports to $660bn. The figures are likely to further annoy the US, which has long argued that China's exports are unfairly helped by a deliberately undervalued yuan. Beijing agrees the surplus is too high, but says the yuan is only one factor. Bank of China governor Zhou Xiaochuan said the country also needed to do more to boost domestic demand so more goods stayed within the country. China increased the value of the yuan against the dollar by 2.1% in July and permitted it to trade within a narrow band, but the US wants the yuan to be allowed to trade freely. However, Beijing has made it clear that it will take its time and tread carefully before allowing the yuan to rise further in value."
(Slide credit: Li Fei-Fei)

Object ↔ Bag of 'words'
(Source: ICCV 2005 short course, Li Fei-Fei)
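The document analogy can be made literal in a few lines (a sketch; the vocabulary below is a subset of the keywords highlighted on the slide, and the function names are my own): represent each document by counts of vocabulary words, discarding their order entirely.

```python
import re
from collections import Counter

KEYWORDS = {"sensory", "brain", "visual", "perception", "retinal",
            "eye", "cell", "optical", "nerve", "image"}

def bag_of_words(text, vocabulary=KEYWORDS):
    """Histogram of vocabulary-word occurrences; word order is discarded."""
    tokens = re.findall(r"[a-z]+", text.lower())
    counts = Counter(t for t in tokens if t in vocabulary)
    return {w: counts.get(w, 0) for w in sorted(vocabulary)}

doc = "Our perception of the world is based on messages that reach the brain."
print(bag_of_words(doc))  # 'brain' and 'perception' get count 1, the rest 0
```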
Bags of Visual Words
• Summarize an entire image based on its distribution (histogram) of visual word occurrences.
• Analogous to the bag-of-words representation commonly used for text documents.
(Slide credit: Kristen Grauman)

Bag-of-Features Steps
1. Extract features.
2. Learn the "visual vocabulary".
3. Quantize features using the visual vocabulary.
4. Represent images by the frequencies of "visual words".
(A sketch of steps 3–4 follows after the histogram-comparison slides below.)

Similarly: Bags of Textons for Texture Representation
• Universal texton dictionary → histogram of texton occurrences.
(Julesz 1981; Cula & Dana 2001; Leung & Malik 2001; Mori, Belongie & Malik 2001; Schmid 2001; Varma & Zisserman 2002, 2003; Lazebnik, Schmid & Ponce 2003. Slide credit: Svetlana Lazebnik)

Comparing Bags of Words
• We build up histograms of word activations, so any histogram comparison measure can be used here.
• E.g., we can rank frames by the normalized scalar product between their (possibly weighted) occurrence counts:

      sim(d_j, q) = (d_j · q) / (‖d_j‖ ‖q‖),   e.g. d_j = [1 8 1 4]ᵀ, q = [5 1 1 0]ᵀ

  – Nearest-neighbor search for similar images.
(Slide credit: Kristen Grauman)

Other Functions for Comparing Histograms
• L1 distance:  D(h1, h2) = Σ_{i=1..N} |h1(i) − h2(i)|
• χ² distance:  D(h1, h2) = Σ_{i=1..N} (h1(i) − h2(i))² / (h1(i) + h2(i))
• Quadratic (cross-bin) distance:  D(h1, h2) = Σ_{i,j} A_ij (h1(i) − h2(j))²
• Histogram intersection (a similarity function):  I(h1, h2) = Σ_{i=1..N} min(h1(i), h2(i))
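To connect steps 3–4 of the pipeline above end to end, here is a compact sketch (the helper names and the epsilon guard are my own): bin an image's word assignments into one fixed-length vector, then rank images by the normalized scalar product from the slide.

```python
import numpy as np

def bow_histogram(word_ids, vocab_size, normalize=True):
    """Turn per-feature word ids into one fixed-length vector per image."""
    h = np.bincount(word_ids, minlength=vocab_size).astype(np.float64)
    if normalize and h.sum() > 0:
        h /= h.sum()  # frequencies rather than raw counts
    return h

def cosine_sim(q, d):
    """Normalized scalar product used to rank database images against a query."""
    return float(q @ d / (np.linalg.norm(q) * np.linalg.norm(d) + 1e-12))

# Toy usage, matching the slide's example vectors.
q  = np.array([5., 1., 1., 0.])
dj = np.array([1., 8., 1., 4.])
print(cosine_sim(q, dj))  # ~0.30
```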
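And the other comparison functions from the slide, written out directly; these are transcriptions of the formulas above, with the epsilon guard against empty bins being my addition.

```python
import numpy as np

def l1_distance(h1, h2):
    return float(np.abs(h1 - h2).sum())

def chi2_distance(h1, h2, eps=1e-12):
    # Bins where both histograms are empty contribute zero.
    return float(((h1 - h2) ** 2 / (h1 + h2 + eps)).sum())

def quadratic_distance(h1, h2, A):
    # Cross-bin distance: A_ij weights every pair of bins (i, j).
    diff = h1[:, None] - h2[None, :]
    return float((A * diff ** 2).sum())

def histogram_intersection(h1, h2):
    # A similarity, not a distance: larger means more alike.
    return float(np.minimum(h1, h2).sum())
```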
Learning/Recognition with BoW Histograms
• The bag-of-words representation makes it possible to describe an unordered point set with a single vector of fixed dimension across image examples.
• It provides an easy way to use the distribution of feature types with the various learning algorithms that require vector input.
(Slide credit: Kristen Grauman)

BoW for Object Categorization
• Example image labels: {face, flowers, building}.
• Works pretty well for image-level classification.
(Csurka et al. 2004; Willamowski et al. 2005; Grauman & Darrell 2005; Sivic et al. 2003, 2005. Slide credit: Svetlana Lazebnik)

Image Classification
• Given the bag-of-features representations of images from different classes, how do we learn a model for distinguishing them?
(Slide credit: Svetlana Lazebnik)

Discriminative Methods
• Learn a decision rule (classifier) assigning bag-of-features representations of images to different classes, e.g., a decision boundary between zebra and non-zebra.
(Slide credit: Svetlana Lazebnik)

Classification
• Assign an input vector to one of two or more classes.
• Any decision rule divides the input space into decision regions separated by decision boundaries.
(Slide credit: Svetlana Lazebnik)

Nearest Neighbor Classifier
• Assign the label of the nearest training data point to each test data point.
• (Figure: Voronoi partitioning of the feature space for 2-category 2D and 3D data.)
(Slide credit: David Lowe)

Recap: Classifier Construction Choices…
• Nearest neighbor (10⁶ examples): Shakhnarovich, Viola, Darrell 2003; Berg, Berg, Malik 2005; …
• Neural networks: LeCun, Bottou, Bengio, Haffner 1998; Rowley, Baluja, Kanade 1998; …
• Support Vector Machines: Guyon, Vapnik; Heisele, Serre, Poggio 2001; …
• Boosting: Viola, Jones 2001; Torralba et al. 2004; Opelt et al. 2006; …
• Conditional Random Fields: McCallum, Freitag, Pereira 2000; Kumar, Hebert 2003; …
(Slide credit: Kristen Grauman)

K-Nearest Neighbors
• For a new point, find the k closest points from the training data.
• The labels of the k points "vote" to classify.
• Works well provided there is lots of data and the distance function is good.
(Slide credit: David Lowe)

Recap: Support Vector Machines (SVMs)
• Discriminative classifier based on the optimal separating hyperplane (i.e., a line in the 2D case).
• Maximize the margin between the positive and negative training examples.
(Slide credit: Kristen Grauman)

Generative Methods
• Model the probability of a bag of features given a class.
(Slide credit: Svetlana Lazebnik)

The Naïve Bayes Model
• Assume that each feature is conditionally independent given the class:

      p(w1, …, wN | c) = Π_{i=1..N} p(wi | c)

• MAP decision:

      c* = argmax_c  p(c) Π_{i=1..N} p(wi | c)

  where p(c) is the prior probability of the object classes, and p(wi | c) is the likelihood of the i-th visual word given the class, estimated by the empirical frequencies of visual words in images from that class. (A sketch follows below.)
(Slide credit: Li Fei-Fei; Csurka et al. 2004)

BoW for Object Categorization
• On the Caltech6 dataset, bag-of-features methods show good performance for pure classification (object present/absent):
  – Better than more elaborate part-based models with spatial constraints…
  – What could be possible reasons why?
(Figure: Caltech6 results comparing bag-of-features against a parts-and-shape model. Slide credit: Svetlana Lazebnik)

BoW Representation: Spatial Information
• A bag of words is an orderless representation: it throws out the spatial relationships between features.
• Middle ground:
  – Visual "phrases": frequently co-occurring words
  – Semi-local features: describe configuration, neighborhood
  – Let position be part of each feature
  – Count bags of words only within sub-grids of an image
  – After matching, verify spatial consistency (e.g., look at neighbors – are they the same too?)
(Slide credit: Kristen Grauman)

Spatial Pyramid Matching
• Hierarchy of localized BoW histograms: histograms are computed over increasingly fine sub-grids of the image and combined.
(Image from Svetlana Lazebnik)
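Here is a sketch of the Naïve Bayes decision above applied to BoW count vectors. The Laplace smoothing is my addition (the slides do not specify it, but the empirical-frequency estimate needs it for words unseen in a class); function names and data shapes are assumptions.

```python
import numpy as np

def train_naive_bayes(histograms, labels, n_classes, alpha=1.0):
    """histograms: (n_images, vocab) raw word counts; labels: (n_images,) ints.
    Returns log p(c) and log p(w_i | c) with Laplace smoothing alpha."""
    H = np.asarray(histograms, dtype=np.float64)
    log_prior = np.log(np.bincount(labels, minlength=n_classes) / len(labels))
    log_lik = np.empty((n_classes, H.shape[1]))
    for c in range(n_classes):
        counts = H[labels == c].sum(axis=0) + alpha    # smoothed word counts
        log_lik[c] = np.log(counts / counts.sum())     # p(w_i | c)
    return log_prior, log_lik

def classify(histogram, log_prior, log_lik):
    """MAP decision: c* = argmax_c [ log p(c) + sum_i n_i * log p(w_i | c) ]."""
    scores = log_prior + log_lik @ np.asarray(histogram, dtype=np.float64)
    return int(scores.argmax())
```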
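Finally, a sketch of the spatial pyramid construction from the slide above, under my own assumptions: per-cell histograms over a 1x1, 2x2, 4x4, … grid are simply concatenated here; the level weighting used by Lazebnik et al. is omitted for brevity.

```python
import numpy as np

def spatial_pyramid(keypoint_xy, word_ids, image_wh, vocab_size, levels=2):
    """Concatenate localized BoW histograms over increasingly fine grids.
    keypoint_xy: (n, 2) feature positions; word_ids: (n,) their visual words;
    image_wh: (width, height) of the image."""
    w, h = image_wh
    parts = []
    for level in range(levels + 1):
        g = 2 ** level                                   # g x g grid at this level
        cx = np.minimum((keypoint_xy[:, 0] * g / w).astype(int), g - 1)
        cy = np.minimum((keypoint_xy[:, 1] * g / h).astype(int), g - 1)
        cell = cy * g + cx                               # flat cell index per feature
        for c in range(g * g):
            parts.append(np.bincount(word_ids[cell == c], minlength=vocab_size))
    return np.concatenate(parts).astype(np.float64)      # (1+4+16+...)*vocab dims
```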
Summary: Indexing Features
• Detect or sample features → describe features (a list of positions, scales, and orientations, with an associated list of d-dimensional descriptors) → then either index each descriptor into a pool of descriptors from previously seen images, or quantize to form a bag-of-words vector for the image.
(Slide credit: Kristen Grauman)

Summary: Bag-of-Words
• Pros:
  – Flexible to geometry / deformations / viewpoint
  – Compact summary of image content
  – Provides a vector representation for sets
  – Empirically good recognition results in practice
• Cons:
  – The basic model ignores geometry: must verify afterwards, or encode it via the features
  – Background and foreground are mixed when the bag covers the whole image
  – Interest points or sampling: no guarantee to capture object-level parts
  – Optimal vocabulary formation remains unclear
(Slide credit: Kristen Grauman)

Summary
• Local invariant features:
  – Distinctive matches are possible in spite of significant view change; useful not only to provide matches for multi-view geometry, but also to find objects and scenes.
  – To find correspondences among detected features, measure the distance between descriptors and look for the most similar patches.
• Bag-of-words representation:
  – Quantize the feature space to make a discrete set of visual words
  – Summarize an image by its distribution of words
  – Index individual words
  – Inverted index: pre-compute the index to enable faster search at query time
(Slide credit: Kristen Grauman)

References and Further Reading
• Bag-of-Words approaches are a relatively new development. I will look for a good reference paper and provide it on the webpage…
• Details about the Video Google system can be found in:
  – J. Sivic and A. Zisserman, "Video Google: A Text Retrieval Approach to Object Matching in Videos", ICCV 2003.
• Try the Oxford Video Google demo:
  – http://www.robots.ox.ac.uk/~vgg/research/vgoogle/index.html
• Try the available local feature detectors and descriptors:
  – http://www.robots.ox.ac.uk/~vgg/research/affine/detectors.html#binaries