Measures of Proximity
Mei-Chen Yeh
03/20/2012

Last Week
• SIFT features

Today
• Matching two images: f(X, Y) = ?

Strategy 1
1. Convert the feature set into a fixed-length feature vector
   – The bag-of-words representation
2. Apply global proximity measurements: f(x_i, x_j)

Bag of 'words': analogy to documents (Slide credit: Prof. Fei-Fei Li)
• The slide shows two example documents with their characteristic words highlighted:
  – A passage on visual perception and the discoveries of Hubel and Wiesel, whose salient words are: sensory, brain, visual, perception, retinal, cerebral cortex, eye, cell, optical, nerve, image, Hubel, Wiesel.
  – A news story on China's trade surplus (forecast at $90bn-$100bn, a threefold increase on 2004's $32bn, with exports predicted to jump 30% to $750bn against an 18% rise in imports to $660bn), whose salient words are: China, trade, surplus, commerce, exports, imports, US, yuan, bank, domestic, foreign, increase, value.
• A document can thus be summarized by which words occur and how often, regardless of their order.
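The document analogy can be made concrete in a few lines: a text is reduced to a histogram of word counts over a fixed vocabulary, just as an image will later be reduced to a histogram of visual words. A minimal sketch (the vocabulary and sample sentence below are illustrative, not from the slides):

```python
from collections import Counter

def bag_of_words(text, vocabulary):
    # Discard word order; keep only occurrence counts over a fixed vocabulary.
    counts = Counter(text.lower().split())
    return [counts[w] for w in vocabulary]

# Hypothetical vocabulary echoing the perception passage on the slide
vocab = ["brain", "visual", "retina", "yuan", "trade"]
doc = "the visual image on the retina is analyzed by the brain visual cortex"
hist = bag_of_words(doc, vocab)   # -> [1, 2, 1, 0, 0]
```

The histogram says nothing about where the words appeared, which is exactly the property (and limitation) the bag-of-words image representation inherits.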
(The example document continues: the US has long argued that China's exports are helped by a deliberately undervalued yuan; Beijing raised the yuan's value against the dollar by 2.1% in July and lets it trade in a narrow band, but will allow it to rise further only gradually.)

Bags of visual words
• Summarize the entire image by its distribution (histogram) of visual-word occurrences.
• Analogous to the bag-of-words representation commonly used for documents.

Pipeline
• Offline: feature detection & representation → codewords dictionary
• Online: compute the image representation → query the database → return relevant images

Bag-of-words representation
1. Feature detection and representation
2. Codewords dictionary formation
3. Image representation

1. Feature detection and representation
• Regular grid
  – Vogel et al. 2003
  – Fei-Fei et al. 2005
• Interest point detector
  – Csurka et al. 2004
  – Fei-Fei et al. 2005
  – Sivic et al. 2005
• Other methods
  – Random sampling (Ullman et al. 2002)
  – Segmentation-based patches (Barnard et al. 2003)
• In practice: detect patches [Mikolajczyk and Schmid '02] [Matas et al. '02] [Sivic et al. '03], then compute a SIFT descriptor for each [Lowe '99]. (Slide credit: Josef Sivic)
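The simplest detection strategy above, a regular grid, can be sketched directly. For illustration each patch is kept as its raw pixel vector; a real pipeline would replace that with a 128-d SIFT descriptor per patch:

```python
import numpy as np

def grid_patches(image, patch_size=8, stride=8):
    # Sample patches on a regular grid; each patch is flattened into a vector.
    # A real system would compute a SIFT descriptor instead of raw pixels.
    H, W = image.shape
    patches = []
    for r in range(0, H - patch_size + 1, stride):
        for c in range(0, W - patch_size + 1, stride):
            patches.append(image[r:r + patch_size, c:c + patch_size].ravel())
    return np.array(patches)

img = np.random.rand(32, 32)   # stand-in for a grayscale image
P = grid_patches(img)          # 4x4 grid -> 16 patches of 64 pixels each
```

Interest-point detectors replace the fixed grid with locations chosen by the image content, but the output has the same shape: one descriptor vector per patch.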
2. Codewords dictionary formation (Slide credit: Josef Sivic)
• Map the 128-d SIFT descriptors to codewords by vector quantization / clustering in the 128-d feature space.

Clustering and vector quantization (Slide credit: Prof. Lazebnik)
• Clustering is a common method for learning a visual vocabulary or codebook.
  – An unsupervised learning process.
  – Each cluster center produced by the clustering becomes a codevector.
  – The codebook can be learned on a separate training set; provided the training set is sufficiently representative, the codebook will be "universal."
• The codebook is used for quantizing features.
  – A vector quantizer takes a feature vector and maps it to the index of the nearest codevector in the codebook.
  – Codebook = visual vocabulary; codevector = visual word.

Dictionary formation
• Input: the number of codewords.
• Example: 2-D samples quantized with 2, 4, and 6 codewords. [Figure: scatter plots of the samples with the resulting codewords.]
• Example: categorizing 2-D data into 3 clusters. [Figure: two scatter plots over x in [-2, 2], contrasting a good clustering with a sub-optimal one.]
• The Linde-Buzo-Gray (LBG) algorithm
  – also known as the generalized Lloyd algorithm
  – also known as the k-means algorithm

k-means clustering (each point is a SIFT feature in the database)
• Find the k reference vectors (codewords) m_j, j = 1, ..., k, that best represent the data.
• Assign each sample x^t to its nearest reference vector:
  ‖x^t − m_i‖ = min_j ‖x^t − m_j‖
  b_i^t = 1 if ‖x^t − m_i‖ = min_j ‖x^t − m_j‖, and b_i^t = 0 otherwise.
• Compute the reconstruction error:
  E({m_i}_{i=1}^k | X) = Σ_t Σ_i b_i^t ‖x^t − m_i‖²
(Lecture notes for E. Alpaydın, Introduction to Machine Learning, 2e, © The MIT Press, 2010)
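The assignment and update rules above can be sketched in a few lines of NumPy. The initialization used here (global mean plus small random vectors) is one common choice among several; the toy two-cluster data is illustrative:

```python
import numpy as np

def kmeans(X, k, iters=20, seed=0):
    # Initialization: global mean plus small random vectors.
    rng = np.random.default_rng(seed)
    m = X.mean(axis=0) + rng.normal(scale=0.01, size=(k, X.shape[1]))
    for _ in range(iters):
        # b_i^t: assign each sample to its nearest reference vector
        d = np.linalg.norm(X[:, None, :] - m[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # Update each reference vector to the mean of its assigned samples
        for i in range(k):
            if np.any(labels == i):
                m[i] = X[labels == i].mean(axis=0)
    # Reconstruction error E = sum_t sum_i b_i^t ||x^t - m_i||^2
    E = np.sum((X - m[labels]) ** 2)
    return m, labels, E

# Two well-separated toy clusters
X = np.vstack([np.zeros((10, 2)), np.full((10, 2), 5.0)])
codebook, labels, E = kmeans(X, k=2)
```

Because the updates only reduce E locally, different initializations can converge to different codebooks, which is exactly the disadvantage discussed next.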
k-means clustering
• Disadvantage:
  – A local search procedure: the final m_i depend highly on the initial m_i.
• Methods to initialize the m_i:
  – Randomly select k instances.
  – Compute the mean of all data and add small random vectors.
  – Compute the principal component, partition the data into k groups along it, then take the means of these groups.
(Lecture notes for E. Alpaydın, Introduction to Machine Learning, 2e, © The MIT Press, 2010)

Applying k-means to SIFT descriptors yields an appearance codebook. (Source: B. Leibe)
Image patch examples of codewords: Sivic et al. 2005.

3. Image representation
• A histogram of codeword frequencies over the image.

Bags-of-words for content-based image retrieval: Video Google (Sivic & Zisserman, ICCV 2003; slide from Andrew Zisserman)
1. Collect all words within the query region.
2. Use an inverted file index to find relevant frames.
3. Compare word counts.
4. Spatial verification.
• Demo online at http://www.robots.ox.ac.uk/~vgg/research/vgoogle/index.html

Summary of bag-of-words
• Converts a set of local features to a fixed-length feature vector.
• Procedure:
  1. Detect local features.
  2. Build a visual vocabulary from a collection of images.
  3. Generate the bag-of-words representation.

Bags of words: pros and cons (Slide credit: Prof. Grauman)
+ Flexible to geometry / deformations / viewpoint.
+ Compact summary of image content.
+ Provides a vector representation for sets.
+ Very good results in practice.
− The basic model ignores geometry: must verify afterwards, or encode it via features.
− Background and foreground are mixed when the bag covers the whole image.
− Optimal vocabulary formation remains unclear.

Visual vocabularies: issues
• How to choose the vocabulary size?
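The three steps of the summary come together when an image's descriptors are quantized against the learned codebook. A minimal sketch, with toy 2-d "descriptors" standing in for 128-d SIFT vectors:

```python
import numpy as np

def bow_histogram(descriptors, codebook):
    # Vector quantization: map each descriptor to the index of its
    # nearest codevector, then count occurrences per visual word.
    d = np.linalg.norm(descriptors[:, None, :] - codebook[None, :, :], axis=2)
    words = d.argmin(axis=1)
    hist = np.bincount(words, minlength=len(codebook)).astype(float)
    return hist / hist.sum()   # normalize: images yield different feature counts

# Toy codebook of 3 "visual words" and 4 detected descriptors
codebook = np.array([[0.0, 0.0], [10.0, 10.0], [0.0, 10.0]])
desc = np.array([[0.1, -0.2], [9.5, 10.2], [9.9, 9.9], [0.2, 9.8]])
h = bow_histogram(desc, codebook)   # -> [0.25, 0.5, 0.25]
```

The fixed-length vector h is what the global proximity measures of the next section operate on.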
  – Too small: visual words are not representative of all patches.
  – Too large: quantization artifacts, overfitting.
• Computational efficiency
  – Vocabulary trees (Nister & Stewenius, CVPR 2006)

Resources
• http://people.csail.mit.edu/fergus/iccv2005/bagwords.html (MATLAB code)
• OpenCV: BOWKMeansTrainer, BOWImgDescriptorExtractor

Strategy 1 (recap)
1. Convert the feature set into a fixed-length feature vector
   – The bag-of-words representation
2. Apply global proximity measurements: f(x_i, x_j)

Reference: Pang-Ning Tan, Michael Steinbach, and Vipin Kumar, Introduction to Data Mining, Addison Wesley, 2005 (Chapter 2).

Measurement
• "Proximity" refers to either similarity or dissimilarity.
• Which measure fits depends on the representation:
  – Sparse or dense
  – Dimensionality (number of attributes)
  – Distribution, data range, and more
• Examples: bag-of-words vectors, pixel intensities, color histograms.

Definitions
• Similarity: the degree to which two images are alike.
• Dissimilarity: the degree to which two images are different.
  – "Distance" refers to a special class of dissimilarities.
  – Non-negative, and often falls in [0, 1].

Transformations
• Similarities ↔ dissimilarities. For d = 0, 1, 10, 100:
  – s = 1 / (d + 1) gives s = 1, 0.5, 0.09, 0.01
  – s = e^(−d) gives s = 1, 0.37, 0, 0 (approximately)
  – s = 1 − (d − min_d) / (max_d − min_d) gives s = 1, 0.99, 0.9, 0

Popular metrics
• The Minkowski distance:
  d(x, y) = (Σ_{k=1}^{n} |x_k − y_k|^r)^{1/r}
  – r = 1: city block (Manhattan, taxicab, L1 norm) distance.
  – r = 2: Euclidean distance (L2 norm).
  – r = ∞: supremum (L_max or L∞ norm) distance.
• The chi-squared statistic:
  χ²(x, y) = (1/2) Σ_{i=1}^{n} (x_i − y_i)² / (x_i + y_i)

Metric properties
• Positivity
  – d(x, y) ≥ 0 for all x and y,
  – d(x, y) = 0 only if x = y.
• Symmetry
  – d(x, y) = d(y, x) for all x and y.
• Triangle inequality
  – d(x, z) ≤ d(x, y) + d(y, z) for all x, y, and z.
Measures that satisfy all three properties are known as metrics. What about non-metric measures?

Non-metric dissimilarities (1)
• Example: set differences. Given two sets A and B, define d(A, B) = size(A − B).
• How can it be modified to satisfy the metric properties?
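The transformations and distances above translate directly into code; the example values match the d = 0, 1, 10, 100 worked on the slide:

```python
import numpy as np

# Dissimilarity -> similarity transforms for d = 0, 1, 10, 100
d = np.array([0.0, 1.0, 10.0, 100.0])
s_recip = 1.0 / (d + 1.0)                              # 1, 0.5, ~0.09, ~0.01
s_exp = np.exp(-d)                                     # 1, 0.37, ~0, ~0
s_minmax = 1.0 - (d - d.min()) / (d.max() - d.min())   # 1, 0.99, 0.9, 0

def minkowski(x, y, r):
    # r=1: city block; r=2: Euclidean; r -> infinity approaches max|x_k - y_k|
    return np.sum(np.abs(x - y) ** r) ** (1.0 / r)

def chi_squared(x, y, eps=1e-12):
    # For histograms; eps guards bins that are empty in both inputs
    return 0.5 * np.sum((x - y) ** 2 / (x + y + eps))
```

For x = (0, 0) and y = (3, 4), the city-block distance is 7 and the Euclidean distance is 5, illustrating how the choice of r changes the measure.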
• A = {1, 2, 3, 4}, B = {2, 3, 4}:
  – d(A, B) = 1 but d(B, A) = 0, so symmetry (and the triangle inequality) fail.
• Fix: d(A, B) = size(A − B) + size(B − A), which restores symmetry and the triangle inequality.

Non-metric dissimilarities (2)
• Example: time of day.
  d(t1, t2) = t2 − t1 if t1 ≤ t2, and 24 + (t2 − t1) if t1 > t2.
  – d(1pm, 2pm) = 1 hour, but d(2pm, 1pm) = 23 hours.

Similarities (1)
• Properties:
  – Positivity
  – Symmetry
  – The triangle inequality typically does not hold.

Similarities (2)
• Cosine similarity:
  cos(x, y) = x·y / (‖x‖ ‖y‖) = Σ_{i=1}^{n} x_i y_i / (√(Σ_{i=1}^{n} x_i²) √(Σ_{i=1}^{n} y_i²))
  i.e., the dot product of x/‖x‖ and y/‖y‖.
• The cosine similarity does not take the magnitudes of the two data objects into account when computing similarity.

Similarities (3)
• The Tanimoto (extended Jaccard) coefficient, which handles asymmetric attributes:
  EJ(x, y) = x·y / (‖x‖² + ‖y‖² − x·y)
• The histogram intersection:
  K_int(x, y) = Σ_{i=1}^{n} min(x_i, y_i)

Similarities (4)
• Correlation:
  corr(x, y) = cov(x, y) / (std(x) · std(y)) = s_xy / (s_x s_y)
  where
  s_xy = (1/(n−1)) Σ_{k=1}^{n} (x_k − x̄)(y_k − ȳ)
  s_x² = (1/(n−1)) Σ_{k=1}^{n} (x_k − x̄)²
  s_y² = (1/(n−1)) Σ_{k=1}^{n} (y_k − ȳ)²

Issues in proximity calculation
• Attributes have different scales.
• Attributes are correlated.
• Different types of attributes exist (e.g., quantitative and qualitative).
• Attributes have different weights.

A generalization
• The Mahalanobis distance:
  mahalanobis(x, y) = (x − y) Σ⁻¹ (x − y)^T
  where Σ⁻¹ is the inverse of the covariance matrix of the data.

Similarities of heterogeneous objects
1. For the kth attribute, compute s_k(x, y) in the range [0, 1].
2. Define an indicator variable δ_k: δ_k = 0 if one of the objects has a missing value for the kth attribute, and δ_k = 1 otherwise.
3. Compute the overall similarity:
   similarity(x, y) = Σ_{k=1}^{n} δ_k w_k s_k(x, y) / Σ_{k=1}^{n} δ_k

Weighted distances
• The extended Minkowski distance:
  d(x, y) = (Σ_{k=1}^{n} w_k |x_k − y_k|^r)^{1/r}

Summary
• The proximity measure should fit the data representation.
  – Example: use the Euclidean distance for dense, continuous data.
  – Example: ignore 0-0 matches for sparse data.
• Data normalization is important for obtaining a proper proximity measure.

Strategy 2
1.
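The similarity measures above, and the symmetrized set difference, can be sketched as follows (the Mahalanobis function keeps the slide's definition, without a square root):

```python
import numpy as np

def cosine(x, y):
    # Magnitude-invariant: cos(x, y) == cos(x, 2*y)
    return x @ y / (np.linalg.norm(x) * np.linalg.norm(y))

def tanimoto(x, y):
    # Extended Jaccard coefficient, for asymmetric attributes
    return x @ y / (x @ x + y @ y - x @ y)

def hist_intersection(x, y):
    return np.sum(np.minimum(x, y))

def correlation(x, y):
    # s_xy / (s_x * s_y), all with the n-1 (sample) normalization
    return np.cov(x, y, ddof=1)[0, 1] / (np.std(x, ddof=1) * np.std(y, ddof=1))

def mahalanobis(x, y, cov_inv):
    # As on the slide: (x - y) Sigma^-1 (x - y)^T, no square root
    diff = x - y
    return diff @ cov_inv @ diff

def set_distance(A, B):
    # Symmetrized set difference: size(A - B) + size(B - A)
    return len(A - B) + len(B - A)
```

With the identity matrix as Σ⁻¹, the Mahalanobis distance reduces to the squared Euclidean distance; a full covariance matrix additionally decorrelates and rescales the attributes, addressing two of the issues listed above.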
Build feature-point correspondences.
2. Compute a score from the correspondences.

K. Grauman and T. Darrell, "The Pyramid Match Kernel: Discriminative Classification with Sets of Image Features," IEEE ICCV, 2005. (Slide credits: Prof. Grauman)

Problem
• How to measure the proximity between two sets of features?
  – Each instance is an unordered set of vectors.
  – The number of vectors varies per instance.

Existing method (1)
• Fit a (parametric) model to each set (e.g., GMM1 and GMM2) and compare with a distance over the models.
  – Restrictive assumptions! High complexity!

Existing method (2)
• Compute pairwise similarities between all vectors in the two sets (m × n comparisons).
  – Ignores set statistics! High complexity!

Partial matching for sets of features
• Compare sets by computing a partial matching between their features.
• Robust to clutter, occlusion, ...

Formulation
• Node ~ local feature; edge ~ patch similarity.
• Image matching → bipartite graph matching between feature sets X and Y.
• Optimal match: find the maximum-weight matching in the bipartite graph. Effective but slow!

Pyramid match
• Optimal match: O(m³). Pyramid match: O(mL), approximating the optimal partial matching.

Pyramid match overview
• The pyramid match kernel measures the similarity of a partial matching between two sets:
  – Place a multi-dimensional, multi-resolution grid over the point sets.
  – Consider points matched at the finest resolution where they fall into the same grid cell.
• No explicit search for matches!

Pyramid match kernel
• Approximate partial-match similarity: K = Σ_i w_i N_i, where N_i is the number of newly matched pairs at level i and w_i measures the difficulty of a match at level i.
• Feature extraction: build a histogram pyramid; level i has bins of size 2^i.
• Counting matches: the histogram intersection counts matches at a given level.
• Counting new matches: the difference in histogram intersections across levels, N_i = I_i − I_{i−1}, counts the number of new pairs matched at level i.
• Weights are inversely proportional to bin size (e.g., w_i = 1/2^i): matches found in larger bins are easier and count for less.
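The counting scheme above can be sketched for 1-D point sets; the actual kernel places multi-dimensional grids (typically with random shifts) over the feature space. Points are assumed to lie in [0, span):

```python
import numpy as np

def level_histogram(points, level, span):
    # Level i has bins of size 2**i over [0, span)
    width = 2 ** level
    h = np.zeros(int(np.ceil(span / width)))
    for p in points:
        h[int(p // width)] += 1
    return h

def pyramid_match(X, Y, L, span):
    # K = sum_i w_i * N_i, with N_i = I_i - I_{i-1} new matches at level i
    # and weights w_i = 1 / 2**i, inversely proportional to bin size.
    K, prev = 0.0, 0.0
    for i in range(L + 1):
        inter = np.minimum(level_histogram(X, i, span),
                           level_histogram(Y, i, span)).sum()
        K += (inter - prev) / 2 ** i
        prev = inter
    return K
```

Two identical sets match entirely at level 0, so K equals the set size; points that only share a coarser bin contribute with a smaller weight, approximating the cost of the optimal partial matching without any explicit correspondence search.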
Efficiency
• For sets with m features of dimension d and pyramids with L levels, the pyramid match kernel is linear in m, whereas existing set-kernel approaches scale quadratically or cubically in m.

Example pyramid match
• Level 0 (fine level) → Level 1 (coarser) → Level 2 (coarser): new matches appear as the bins grow.
• The pyramid match (fast) approximates the optimal match (slow).

Object recognition results (Eichhorn and Chapelle 2004)
• ETH-80 database: 8 object classes.
• Features: Harris detector; PCA-SIFT descriptor, d = 10.

  Kernel                                      Recognition rate
  Match [Wallraven et al.]                    84%
  Bhattacharyya affinity [Kondor & Jebara]    85%
  Pyramid match                               84%

• The pyramid match attains comparable accuracy at much lower complexity.

Summary: pyramid match kernel
• An approximate optimal partial matching between sets of features, built from the number of new matches at each level i, weighted by the difficulty of a match at that level.
• A similarity measure based on implicit correspondences:
  – linear time complexity
  – model-free
  – insensitive to clutter
  – fast and effective for object retrieval and recognition

Disadvantages?
• It places a grid to quantize the feature space: not effective for high-dimensional spaces.
• The spatial arrangement of features is still ignored: it may produce geometrically unfaithful matchings.

S. Lazebnik, C. Schmid, and J. Ponce, "Beyond Bags of Features: Spatial Pyramid Matching for Recognizing Natural Scene Categories," IEEE CVPR, 2006.

BoW issue: no spatial layout is preserved! Slide credits: Prof. S.
Lazebnik.

Spatial pyramid match
• An extension of bag-of-words: build a pyramid of bag-of-words histograms.
• A locally orderless representation at several levels of resolution.

Spatial pyramid representation (Lazebnik, Schmid & Ponce, CVPR 2006)
• Level 0: one BoW histogram over the whole image; level 1: a 2×2 grid of histograms; level 2: a 4×4 grid.

Spatial pyramid match
• Based on pyramid match kernels:
  – PM builds the pyramid in feature space and discards spatial information.
  – SPM builds the pyramid in image space: it sums PMKs computed in image coordinate space.
• Example with 200 visual words: d = 200 at level 0, d = 800 at level 1, d = 3200 at level 2, for 4200 features in total.
• Can capture scene categories well: texture-like patterns with some variability in the positions of the local pieces (easy categories).
• Sensitive to global shifts of the view (difficult categories).

Resources
• A Pyramid Match Toolkit: http://people.csail.mit.edu/jjl/libpmk/
• Spatial Pyramid Match: http://www.cs.unc.edu/~lazebnik/research/SpatialPyramid.zip (updated 2/29/2012)
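The level-0/1/2 construction above can be sketched by concatenating per-cell bag-of-words histograms; this toy version omits the per-level weighting and histogram-intersection comparison that the full spatial pyramid kernel applies:

```python
import numpy as np

def spatial_pyramid(points, words, n_words, levels, span=1.0):
    # Pyramid in IMAGE space: at level l, split the image into a
    # 2**l x 2**l grid and build one BoW histogram per cell.
    feats = []
    for l in range(levels + 1):
        cells = 2 ** l
        h = np.zeros((cells, cells, n_words))
        for (x, y), w in zip(points, words):
            cx = min(int(x / span * cells), cells - 1)
            cy = min(int(y / span * cells), cells - 1)
            h[cy, cx, w] += 1
        feats.append(h.ravel())
    return np.concatenate(feats)

# 200 visual words, levels 0..2: 200 + 800 + 3200 = 4200 dimensions
pts = [(0.1, 0.1), (0.9, 0.9)]   # (x, y) feature locations in [0, 1)
wds = [0, 1]                     # the visual word assigned to each feature
feat = spatial_pyramid(pts, wds, n_words=200, levels=2)
```

Because each feature is binned by its image position, a global shift of the view moves features across cells and changes the vector, which is the sensitivity noted above.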