790-133: Recognizing People, Objects, & Actions
Tamara Berg
Object Recognition – BoF models

Topic Presentations
• Hopefully you have met your topic presentation group members by now.
• Group 1 – see me to run through your slides this week, or Monday at the latest (I'm traveling Thursday/Friday). Send me links to 2-3 papers for the class to read.
• Sign up for the class Google group (790-133). To find the group, go to groups.google.com and search for "790-133" (sort the results by date). Use it to post and answer questions related to the class.

Bag-of-features models
• Represent an object as a bag of "features".
source: Svetlana Lazebnik

Exchangeability
• De Finetti's theorem of exchangeability (the "bag of words" theorem): the joint probability distribution underlying the data is invariant to permutation of the observations:
  p(x_1, x_2, ..., x_N) = \int p(\theta) \prod_{i=1}^{N} p(x_i | \theta) \, d\theta

Origin 2: Bag-of-words models
• Orderless document representation: frequencies of words from a dictionary (Salton & McGill, 1983).
• Example: US Presidential Speeches Tag Cloud, http://chir.ag/phernalia/preztags/
source: Svetlana Lazebnik

Bag of words for text
• Represent documents as "bags of words".

Example
• Doc1 = "the quick brown fox jumped"
• Doc2 = "brown quick jumped fox the"
• Would a bag-of-words model represent these two documents differently?

Bag of words for images
• Represent images as a "bag of features".

Bag of features: outline
1. Extract features
2. Learn a "visual vocabulary"
3. Represent images by the frequencies of their "visual words"
source: Svetlana Lazebnik

2. Learning the visual vocabulary
• Cluster the extracted feature descriptors; the resulting cluster centers form the visual vocabulary.
Slide credit: Josef Sivic

K-means clustering (reminder)
• We want to minimize the sum of squared Euclidean distances between points x_i and their nearest cluster centers m_k:
  D(X, M) = \sum_{k} \sum_{i \in \text{cluster } k} (x_i - m_k)^2
• Algorithm:
  – Randomly initialize K cluster centers
  – Iterate until convergence:
    – Assign each data point to the nearest center
    – Recompute each cluster center as the mean of all points assigned to it
source: Svetlana Lazebnik

Example visual vocabulary
Fei-Fei et al. 2005

3. Image representation
• For a query image:
  – Extract features.
  – Associate each feature with the nearest cluster center (visual word) in the visual vocabulary.
  – Accumulate visual word frequencies over the image into a histogram over the codewords.
source: Svetlana Lazebnik

4. Image classification
• Given the bag-of-features representations of images from different classes (e.g. CAR), how do we learn a model for distinguishing them?
source: Svetlana Lazebnik
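As a concrete illustration of steps 2 and 3 (learning the vocabulary and encoding an image), here is a minimal sketch assuming descriptors have already been extracted from each image (e.g. 128-dimensional SIFT vectors) and using scikit-learn's KMeans. The vocabulary size, normalization choice, and all variable names are illustrative assumptions, not taken from the slides.

```python
import numpy as np
from sklearn.cluster import KMeans

def build_vocabulary(train_descriptors, num_words=200, seed=0):
    """Cluster all training descriptors; the cluster centers are the visual words."""
    all_desc = np.vstack(train_descriptors)          # stack descriptors from every training image
    kmeans = KMeans(n_clusters=num_words, random_state=seed, n_init=10)
    kmeans.fit(all_desc)
    return kmeans                                    # kmeans.cluster_centers_ is the vocabulary

def bag_of_words(descriptors, kmeans):
    """Histogram of visual-word frequencies for one image."""
    words = kmeans.predict(descriptors)              # nearest cluster center for each descriptor
    hist = np.bincount(words, minlength=kmeans.n_clusters).astype(float)
    return hist / hist.sum()                         # optional: normalize by the number of features

# Usage with random stand-in "descriptors" (each image: an (n_i, 128) array):
rng = np.random.default_rng(0)
train_descriptors = [rng.normal(size=(rng.integers(50, 200), 128)) for _ in range(20)]
vocab = build_vocabulary(train_descriptors, num_words=50)
h = bag_of_words(train_descriptors[0], vocab)
print(h.shape)                                       # (50,) -- one bin per visual word
```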
Image Categorization
• What is this? Choose from many categories (e.g. helicopter).
• Approaches:
  – SVM / Naïve Bayes – Csurka et al. (Caltech 4/7)
  – Nearest Neighbor – Berg et al. (Caltech 101)
  – Kernel + SVM – Grauman et al. (Caltech 101)
  – Multiple Kernel Learning + SVMs – Varma et al. (Caltech 101)
  – …

Visual Categorization with Bags of Keypoints
Gabriella Csurka, Christopher R. Dance, Lixin Fan, Jutta Willamowski, Cédric Bray

Data
• Images in 7 classes: faces, buildings, trees, cars, phones, bikes, books
• Caltech 4 dataset: faces, airplanes, cars (rear and side), motorbikes, background

Method
Steps:
– Detect and describe image patches.
– Assign patch descriptors to a set of predetermined clusters (a visual vocabulary).
– Construct a bag of keypoints, which counts the number of patches assigned to each cluster.
– Apply a classifier (SVM or Naïve Bayes), treating the bag of keypoints as the feature vector.
– Determine which category or categories to assign to the image.

Bag-of-Keypoints Approach
• Pipeline: Interest Point Detection → Key Patch Extraction → Feature Descriptors → Bag of Keypoints → Multi-class Classifier
Slide credit: Yun-hsueh Liu

SIFT Descriptors
• Each extracted patch is described with a SIFT descriptor.
Slide credit: Yun-hsueh Liu

Bag of Keypoints (1)
• Construction of a vocabulary:
  – K-means clustering finds "centroids" (over all the descriptors extracted from all the training images).
  – Define a "vocabulary" as the set of "centroids", where every centroid represents a "word".
Slide credit: Yun-hsueh Liu

Bag of Keypoints (2)
• Histogram:
  – Counts the number of occurrences of each visual word in an image.
Slide credit: Yun-hsueh Liu

Multi-class Classifier
• In this paper, classification is based on conventional machine learning approaches:
  – Support Vector Machine (SVM)
  – Naïve Bayes
Slide credit: Yun-hsueh Liu

SVM

Reminder: Linear SVM
• Maximize the margin between the two classes; the training points lying on the margin are the support vectors.
• Discriminant function: g(x) = w^T x + b
• minimize (1/2) ||w||^2   subject to   y_i (w^T x_i + b) >= 1
Slide credit: Jinwei Gu

Nonlinear SVMs: The Kernel Trick
• Map the data into a higher-dimensional feature space via \phi(x). With this mapping, the discriminant function becomes:
  g(x) = w^T \phi(x) + b = \sum_{i \in SV} \alpha_i \phi(x_i)^T \phi(x) + b
• No need to know this mapping explicitly, because we only use dot products of feature vectors in both training and testing.
• A kernel function is defined as a function that corresponds to a dot product of two feature vectors in some expanded feature space:
  K(x_i, x_j) = \phi(x_i)^T \phi(x_j)
Slide credit: Jinwei Gu

Nonlinear SVMs: The Kernel Trick
• Examples of commonly used kernel functions:
  – Linear kernel: K(x_i, x_j) = x_i^T x_j
  – Polynomial kernel: K(x_i, x_j) = (1 + x_i^T x_j)^p
  – Gaussian (Radial Basis Function, RBF) kernel: K(x_i, x_j) = exp(-||x_i - x_j||^2 / (2\sigma^2))
  – Sigmoid kernel: K(x_i, x_j) = tanh(\beta_0 x_i^T x_j + \beta_1)
Slide credit: Jinwei Gu

SVM for image classification
• Train k binary 1-vs-all SVMs (one per class).
• For a test instance, evaluate it with each classifier.
• Assign the instance to the class with the largest SVM output.
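To make the 1-vs-all scheme concrete, the sketch below trains one binary SVM per class on bag-of-keypoints histograms and assigns a test image to the class whose classifier gives the largest output. It uses scikit-learn's SVC with an RBF kernel as a stand-in for the paper's SVM; the parameter values, data shapes, and variable names are illustrative assumptions.

```python
import numpy as np
from sklearn.svm import SVC

def train_one_vs_all(histograms, labels, classes, C=10.0, gamma="scale"):
    """Train one binary RBF-kernel SVM per class (class vs. rest)."""
    classifiers = {}
    for c in classes:
        y_binary = (labels == c).astype(int)          # 1 for this class, 0 for all others
        clf = SVC(kernel="rbf", C=C, gamma=gamma)
        clf.fit(histograms, y_binary)
        classifiers[c] = clf
    return classifiers

def predict_one_vs_all(classifiers, histogram):
    """Assign the image to the class whose SVM gives the largest output."""
    scores = {c: clf.decision_function(histogram.reshape(1, -1))[0]
              for c, clf in classifiers.items()}
    return max(scores, key=scores.get)

# Usage with random stand-in bag-of-keypoints histograms:
rng = np.random.default_rng(0)
X = rng.random((60, 50))                              # 60 training images, 50 visual words
y = rng.integers(0, 3, size=60)                       # 3 object classes
models = train_one_vs_all(X, y, classes=[0, 1, 2])
print(predict_one_vs_all(models, rng.random(50)))
```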
Naïve Bayes

Naïve Bayes Model
• C – class; F_1, ..., F_n – features.
  P(C, F_1, F_2, ..., F_n) = P(C) \prod_i P(F_i | C)
• We only specify the parameters:
  – P(C): prior over class labels
  – P(F_i | C): how each feature depends on the class

Example: spam filtering
• P(C): percentage of documents in the training set labeled as spam/ham.
• P(W_i | spam): in the documents labeled as spam, the occurrence percentage of each word (e.g. # times "the" occurred / # total words).
• P(W_i | ham): in the documents labeled as ham, the occurrence percentage of each word (e.g. # times "the" occurred / # total words).
Slides from Dan Klein

Classification
• Choose the class that maximizes:
  P(C, W_1, ..., W_n) = P(C) \prod_i P(W_i | C)
  c = argmax_{c \in C} P(c) \prod_i P(W_i | c)

Classification – in practice
• Multiplying lots of small probabilities can result in floating point underflow.
• Since log(xy) = log(x) + log(y), we can sum log probabilities instead of multiplying probabilities.
• Since log is a monotonic function, the class with the highest score does not change.
• So, what we usually compute in practice is:
  c_MAP = argmax_{c \in C} [ log P(c) + \sum_i log P(W_i | c) ]

Naïve Bayes on images

Naïve Bayes Parameters
• Problem: categorize images as one of k object classes using a Naïve Bayes classifier:
  – Classes: object categories (face, car, bicycle, etc.).
  – Features: images represented as histograms of visual words; the F_i are visual words.
  – P(C) is treated as uniform.
  – P(F_i | C) is learned from training data (images labeled with their category): the probability of a visual word given an image category.

Multi-class Classifier – Naïve Bayes (1)
• Let V = {v_i}, i = 1, ..., N, be a visual vocabulary, in which each v_i represents a visual word (cluster center) from the feature space.
• A set of labeled images I = {I_i}.
• Denote the classes by C_j, j = 1, ..., M.
• N(t, i) = number of times visual word v_t occurs in image I_i.
• Compute P(C_j | I_i).
Slide credit: Yun-hsueh Liu

Multi-class Classifier – Naïve Bayes (2)
• Goal: find the maximum-probability class C_j.
• To avoid zero probabilities, use Laplace smoothing.
Slide credit: Yun-hsueh Liu
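To make the log-probability trick and Laplace smoothing concrete, here is a minimal Naïve Bayes sketch over visual-word count histograms, with a uniform class prior as described above. The smoothing constant, data shapes, and variable names are illustrative assumptions, not from the paper.

```python
import numpy as np

def train_naive_bayes(histograms, labels, num_classes, alpha=1.0):
    """Estimate P(word | class) with Laplace smoothing; histograms[i, t] = N(t, i)."""
    num_words = histograms.shape[1]
    word_probs = np.zeros((num_classes, num_words))
    for c in range(num_classes):
        counts = histograms[labels == c].sum(axis=0)          # total count of each word in class c
        word_probs[c] = (counts + alpha) / (counts.sum() + alpha * num_words)
    return word_probs

def classify(histogram, word_probs):
    """argmax over classes of sum_t N(t) * log P(v_t | c); a uniform prior only adds a constant."""
    log_scores = (histogram * np.log(word_probs)).sum(axis=1)
    return int(np.argmax(log_scores))

# Usage with random stand-in data: 40 images, 50 visual words, 4 classes
rng = np.random.default_rng(0)
X = rng.integers(0, 10, size=(40, 50))
y = rng.integers(0, 4, size=40)
probs = train_naive_bayes(X, y, num_classes=4)
print(classify(X[0], probs))
```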
Results
• Classification results on the 7-class dataset and on Dataset 2 (tables and figures from the paper).

Thoughts?
• Pros?
• Cons?

Related BoF models
• pLSA, LDA, …

pLSA
• Generative model relating documents, topics, and words: each document is a mixture of topics, and each topic is a distribution over words.

pLSA on images

Discovering objects and their location in images
Josef Sivic, Bryan C. Russell, Alexei A. Efros, Andrew Zisserman, William T. Freeman
• Documents – images
• Words – visual words (vector-quantized SIFT descriptors)
• Topics – object categories
• Images are modeled as a mixture of topics (objects).

Goals
They investigate three areas:
– (i) topic discovery, where categories are discovered by pLSA clustering on all available images;
– (ii) classification of unseen images, where topics corresponding to object categories are learnt on one set of images and then used to determine the object categories present in another set;
– (iii) object detection, where the goal is to determine the location and approximate segmentation of the object(s) in each image.

(i) Topic Discovery
• Most likely words for 4 learnt topics (face, motorbike, airplane, car).

(ii) Image Classification
• Confusion table for unseen test images against pLSA trained on images containing four object categories, but no background images.
• Confusion table for unseen test images against pLSA trained on images containing the four object categories and background images. Performance is not quite as good.

(iii) Topic Segmentation
• Visual words are assigned to a topic (object) when P(z_k | w_i, d_j) > 0.8, giving an approximate segmentation.
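The pLSA decomposition used here can be fit with a simple EM loop over an image-by-visual-word count matrix, and the posterior P(z | d, w) then supports the > 0.8 segmentation threshold above. The sketch below is a minimal, illustrative implementation; the number of topics, iteration count, and variable names are assumptions, not details from the paper.

```python
import numpy as np

def plsa(counts, num_topics, num_iters=50, seed=0):
    """Fit P(w|z) and P(z|d) by EM on a (num_docs, num_words) count matrix."""
    rng = np.random.default_rng(seed)
    num_docs, num_words = counts.shape
    p_w_given_z = rng.random((num_topics, num_words))
    p_w_given_z /= p_w_given_z.sum(axis=1, keepdims=True)
    p_z_given_d = rng.random((num_docs, num_topics))
    p_z_given_d /= p_z_given_d.sum(axis=1, keepdims=True)

    for _ in range(num_iters):
        # E-step: P(z | d, w) for every (d, w, z); shape (docs, words, topics)
        joint = p_z_given_d[:, None, :] * p_w_given_z.T[None, :, :]
        p_z_given_dw = joint / (joint.sum(axis=2, keepdims=True) + 1e-12)

        # M-step: re-estimate P(w|z) and P(z|d) from expected counts n(d,w) * P(z|d,w)
        expected = counts[:, :, None] * p_z_given_dw
        p_w_given_z = expected.sum(axis=0).T                    # (topics, words)
        p_w_given_z /= p_w_given_z.sum(axis=1, keepdims=True) + 1e-12
        p_z_given_d = expected.sum(axis=1)                      # (docs, topics)
        p_z_given_d /= p_z_given_d.sum(axis=1, keepdims=True) + 1e-12

    # Final posterior P(z | d, w) under the fitted parameters
    joint = p_z_given_d[:, None, :] * p_w_given_z.T[None, :, :]
    p_z_given_dw = joint / (joint.sum(axis=2, keepdims=True) + 1e-12)
    return p_w_given_z, p_z_given_d, p_z_given_dw

# Usage: keep visual words whose topic posterior exceeds 0.8, as in the segmentation above
rng = np.random.default_rng(1)
counts = rng.integers(0, 5, size=(30, 100))                     # 30 images, 100 visual words
p_w_given_z, p_z_given_d, p_z_given_dw = plsa(counts, num_topics=4)
confident = p_z_given_dw.max(axis=2) > 0.8                      # (image, word) pairs confidently assigned
print(confident.mean())
```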