Bag-of-features for category recognition
Cordelia Schmid

Bag-of-features for image classification
• Origin: texture recognition
• Texture is characterized by the repetition of basic elements or textons
Julesz, 1981; Cula & Dana, 2001; Leung & Malik, 2001; Mori, Belongie & Malik, 2001; Schmid, 2001; Varma & Zisserman, 2002, 2003; Lazebnik, Schmid & Ponce, 2003

Texture recognition
[Figure: texton histogram computed over a universal texton dictionary]

Bag-of-features for image classification
• Origin: bag-of-words
• Orderless document representation: frequencies of words from a dictionary
• Classification to determine document categories

Bag-of-features for image classification
[Pipeline: extract regions → compute descriptors → find clusters and frequencies → compute distance matrix → classification (SVM)]
• Step 1: extract regions and compute descriptors
• Step 2: find clusters and frequencies
• Step 3: compute distance matrix and classify (SVM)
[Nowak, Jurie & Triggs, ECCV'06], [Zhang, Marszalek, Lazebnik & Schmid, IJCV'07]

Bag-of-features for image classification
• Excellent results in the presence of background clutter
[Example images: bikes, books, buildings, cars, people, phones, trees]

Examples of misclassified images
• Books: misclassified as faces, faces, buildings
• Buildings: misclassified as faces, trees, trees
• Cars: misclassified as buildings, phones, phones

Step 1: feature extraction
• Scale-invariant image regions + SIFT
– Selection of characteristic points: Harris-Laplace, Laplacian detectors
– Robust description of the extracted image regions
• SIFT [Lowe'99]
– 3D histogram over image patch location (x, y) and gradient orientation
– 8 orientations of the gradient
– 4x4 spatial grid

Step 1: feature extraction
• Scale-invariant image regions + SIFT
– Selection of characteristic points
– Robust description of these characteristic points
– Affine-invariant regions give "too much" invariance
– Rotation invariance is in many cases "too much" invariance
• Dense descriptors
– Improve results in the context of categories (for most categories)
– Interest points do not necessarily capture "all" features

Dense features
– Multi-scale dense grid: extraction of small overlapping patches at multiple scales
– Computation of the SIFT descriptor for each grid cell

Step 1: feature extraction
• Scale-invariant image regions + SIFT (see lecture 2)
• Dense descriptors
• Color-based descriptors
• Shape-based descriptors

Step 2: Quantization
[Figure: descriptors from all training images are clustered; the cluster centers form the visual vocabulary]

Examples of visual words
[Figure: example visual words from airplanes, motorbikes, faces, wild cats, leaves, people, bikes]

Step 2: Quantization
• Cluster descriptors
– K-means
– Gaussian mixture model
• Assign each descriptor to a cluster (visual word)
– Hard or soft assignment
• Build the frequency histogram (see the sketch below)
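The following is a minimal NumPy sketch of this quantization step, assuming descriptors have already been extracted (random arrays stand in for 128-D SIFT descriptors). The function names (kmeans, bag_of_features, pairwise_sq_dists) and all sizes are illustrative, not from the lecture.

```python
import numpy as np

def pairwise_sq_dists(X, C):
    """Squared Euclidean distances between rows of X (N, D) and rows of C (K, D)."""
    return (X ** 2).sum(1)[:, None] - 2.0 * X @ C.T + (C ** 2).sum(1)[None, :]

def kmeans(descriptors, k, n_iter=20, seed=0):
    """Plain k-means: minimize the sum of squared distances to the nearest center."""
    rng = np.random.default_rng(seed)
    centers = descriptors[rng.choice(len(descriptors), size=k, replace=False)].copy()
    for _ in range(n_iter):
        labels = pairwise_sq_dists(descriptors, centers).argmin(axis=1)  # assignment step
        for j in range(k):                                               # update step
            members = descriptors[labels == j]
            if len(members):
                centers[j] = members.mean(axis=0)
    return centers

def bag_of_features(image_descriptors, codebook):
    """Hard assignment: count how many descriptors fall on each visual word."""
    words = pairwise_sq_dists(image_descriptors, codebook).argmin(axis=1)
    hist = np.bincount(words, minlength=len(codebook)).astype(float)
    return hist / max(hist.sum(), 1.0)   # normalize counts to frequencies

# Usage with random stand-ins for 128-D SIFT descriptors:
train_desc = np.random.rand(5000, 128)      # descriptors pooled over training images
codebook   = kmeans(train_desc, k=100)      # visual vocabulary of 100 words
img_hist   = bag_of_features(np.random.rand(300, 128), codebook)
```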
K-means clustering
• Goal: minimize the sum of squared Euclidean distances between points x_i and their nearest cluster centers
• Algorithm:
– Randomly initialize K cluster centers
– Iterate until convergence: assign each data point to the nearest center, then recompute each cluster center as the mean of all points assigned to it

K-means clustering
• Converges to a local minimum; the solution depends on the initialization
• Initialization is important: run several times and select the best (minimum-cost) solution

From clustering to vector quantization
• Clustering is a common method for learning a visual vocabulary or codebook
– Unsupervised learning process
– Each cluster center produced by k-means becomes a codevector
– Provided the training set is sufficiently representative, the codebook will be "universal"
• The codebook is used for quantizing features
– A vector quantizer takes a feature vector and maps it to the index of the nearest codevector in the codebook
– Codebook = visual vocabulary
– Codevector = visual word

Visual vocabularies: issues
• How to choose the vocabulary size?
– Too small: visual words are not representative of all patches
– Too large: quantization artifacts, overfitting
• Computational efficiency
– Vocabulary trees (Nister & Stewenius, 2006)
• Soft quantization: Gaussian mixture model instead of k-means

Hard or soft assignment
• K-means: hard assignment
– Assign each descriptor to the closest cluster center
– Count the number of descriptors assigned to each center
• Gaussian mixture model: soft assignment
– Estimate the distance to all centers
– Sum the soft assignments over all descriptors

Image representation
• Frequency histogram over the codewords (visual words)

Bag-of-features for image classification
[Pipeline overview, Step 3: compute distance matrix and classify]

Step 3: Classification
• Learn a decision rule (classifier) assigning bag-of-features representations of images to different classes (e.g., zebra vs. non-zebra)

Classification
• Assign an input vector to one of two or more classes
• Any decision rule divides the input space into decision regions separated by decision boundaries

Nearest neighbor classifier
• Assign the label of the nearest training data point to each test data point
[Figure from Duda et al.: Voronoi partitioning of feature space for 2-category 2-D and 3-D data; source: D. Lowe]

K-nearest neighbors
• For a new point, find the k closest points in the training data
• The labels of the k points "vote" to classify (e.g., k = 5)
• Works well provided there is a lot of data and the distance function is good
[Source: D. Lowe]

Linear classifiers
• Find a linear function (hyperplane) to separate positive and negative examples
– x_i positive: x_i \cdot w + b \geq 0
– x_i negative: x_i \cdot w + b < 0
• Which hyperplane is best? SVM (support vector machine)

Functions for comparing histograms
• L1 distance: D(h_1, h_2) = \sum_{i=1}^{N} |h_1(i) - h_2(i)|
• Chi-square distance: D(h_1, h_2) = \sum_{i=1}^{N} \frac{(h_1(i) - h_2(i))^2}{h_1(i) + h_2(i)}
• Quadratic distance (cross-bin): D(h_1, h_2) = \sum_{i,j} A_{ij} (h_1(i) - h_2(j))^2

Kernels for bags of features
• Histogram intersection kernel: I(h_1, h_2) = \sum_{i=1}^{N} \min(h_1(i), h_2(i))
• Generalized Gaussian kernel: K(h_1, h_2) = \exp\!\left(-\frac{1}{A} D(h_1, h_2)^2\right)
• D can be the Euclidean distance, the chi-square distance, the Earth Mover's Distance, etc. (a small sketch follows)
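A small NumPy sketch of the histogram comparisons and kernels above. The helper names and the toy histograms are illustrative; following the slides, A is set to the mean chi-square distance over the "training" histograms, and the kernel shown uses the chi-square distance directly (as in the multi-channel kernel on the next slide).

```python
import numpy as np

def l1_distance(h1, h2):
    """L1 distance: sum of absolute per-bin differences."""
    return np.abs(h1 - h2).sum()

def chi2_distance(h1, h2, eps=1e-10):
    """Chi-square distance: sum over bins of (h1 - h2)^2 / (h1 + h2)."""
    return (((h1 - h2) ** 2) / (h1 + h2 + eps)).sum()   # eps guards empty bins

def intersection_kernel(h1, h2):
    """Histogram intersection kernel: sum of per-bin minima."""
    return np.minimum(h1, h2).sum()

def chi2_kernel(h1, h2, A):
    """Gaussian-style kernel on the chi-square distance: exp(-(1/A) * D_chi2)."""
    return np.exp(-chi2_distance(h1, h2) / A)

# Usage on toy bag-of-features histograms (rows = images, columns = visual words):
H = np.random.rand(4, 100)
H /= H.sum(axis=1, keepdims=True)                       # normalize to frequencies
dists = [chi2_distance(H[i], H[j]) for i in range(4) for j in range(i + 1, 4)]
A = np.mean(dists)                                      # mean pairwise training distance
gram = np.array([[chi2_kernel(hi, hj, A) for hj in H] for hi in H])
print(gram)   # kernel (Gram) matrix that could be fed to an SVM
```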
Chi-square kernel
• Multi-channel chi-square kernel: K(H_i, H_j) = \exp\!\left(-\sum_c \frac{1}{A_c} D_c(H_i, H_j)\right)
– Channel c is a combination of detector and descriptor
– D_c(H_i, H_j) is the chi-square distance between histograms:
  D_c(H_1, H_2) = \frac{1}{2} \sum_{i=1}^{m} \frac{(h_{1i} - h_{2i})^2}{h_{1i} + h_{2i}}
– A_c is the mean value of the distances between all training samples
• Extension: learning of the channel weights, for example with multiple kernel learning (MKL)

Pyramid match kernel [Grauman & Darrell'05]
• Weighted sum of histogram intersections at multiple resolutions (linear in the number of features instead of cubic)
• Approximates the optimal partial matching between sets of features

Pyramid match: histogram intersection
• Matches at this level = matches at the previous level + newly matched pairs
• The difference in histogram intersections across levels counts the number of newly matched pairs

Pyramid match kernel
• Kernel over histogram pyramids: sum over levels of (number of newly matched pairs at level i) weighted by the difficulty of a match at level i
• Weights are inversely proportional to bin size
• Normalize kernel values to avoid favoring large sets
[Example figures: pyramid match at levels 0, 1, 2; pyramid match vs. optimal match]

Summary: pyramid match kernel
• Approximates the optimal partial matching between sets of features
• Weights reflect the difficulty of a match at level i; the kernel counts the number of new matches at level i

Spatial pyramid matching
• Add spatial information to the bag-of-features
• Perform matching in 2D image space
[Lazebnik, Schmid & Ponce, CVPR 2006]

Related work
• Similar approaches: subblock description [Szummer & Picard, 1997], SIFT [Lowe, 1999, 2004], GIST [Torralba et al., 2003]

Spatial pyramid representation
• Locally orderless representation at several levels of spatial resolution (level 0, level 1, level 2)

Spatial pyramid matching
• Combination of spatial levels with the pyramid match kernel [Grauman & Darrell'05]

Scene classification
L          Single-level    Pyramid
0 (1x1)    72.2 ± 0.6
1 (2x2)    77.9 ± 0.6      79.0 ± 0.5
2 (4x4)    79.4 ± 0.3      81.1 ± 0.3
3 (8x8)    77.2 ± 0.4      80.7 ± 0.3

[Retrieval examples]

Category classification – Caltech101
L          Single-level    Pyramid
0 (1x1)    41.2 ± 1.2
1 (2x2)    55.9 ± 0.9      57.0 ± 0.8
2 (4x4)    63.6 ± 0.9      64.6 ± 0.8
3 (8x8)    60.3 ± 0.9      64.6 ± 0.7
Bag-of-features approach by Zhang et al.'07: 54%

Caltech101: easiest and hardest classes
• Sources of difficulty:
– Lack of texture
– Camouflage
– Thin, articulated limbs
– Highly deformable shape

Discussion
• Summary
– Spatial pyramid representation: appearance of local image patches + coarse global position information
– Substantial improvement over bag of features
– Depends on the similarity of image layout
• Extensions
– Integrating different types of features, learning weights, use of different grids [Zhang'07, Bosch & Zisserman'07, Varma et al.'07, Marszalek et al.'07]
– Flexible, object-centered grid

Evaluation of image classification
• Image classification task: PASCAL VOC 2007–2009
• Precision-recall curves for evaluation
• Mean average precision (see the sketch below)
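Since the last slide only names the evaluation measures, here is a minimal sketch of how a precision-recall curve and the average precision (AP) for one class could be computed from per-image classifier scores. It uses the plain "mean precision at the ranks of the positives" form of AP, not the interpolated variant used in some PASCAL VOC years; all names and the toy data are illustrative.

```python
import numpy as np

def precision_recall_curve(scores, labels):
    """Precision and recall after each rank, for one class; labels are 1/0."""
    order = np.argsort(-np.asarray(scores))          # rank images by decreasing score
    labels = np.asarray(labels)[order]
    tp = np.cumsum(labels)                           # true positives retrieved so far
    precision = tp / np.arange(1, len(labels) + 1)
    recall = tp / max(labels.sum(), 1)
    return precision, recall

def average_precision(scores, labels):
    """AP: mean of the precision values attained at the rank of each positive."""
    order = np.argsort(-np.asarray(scores))
    labels = np.asarray(labels)[order]
    precision = np.cumsum(labels) / np.arange(1, len(labels) + 1)
    pos = labels == 1
    return float(precision[pos].mean()) if pos.any() else 0.0

# Toy usage: one class, six test images ranked by classifier score.
scores = np.array([0.9, 0.8, 0.7, 0.6, 0.5, 0.4])
labels = np.array([1,   0,   1,   1,   0,   0])
print(average_precision(scores, labels))             # ~0.81 for this toy ranking
# Mean average precision (mAP) = mean of the per-class AP values.
```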