Semi-Supervised Learning in Gigantic Image Collections
Rob Fergus (New York University), Yair Weiss (Hebrew University), Antonio Torralba (MIT)

What does the image world look like?
• Gigantic collections of images
• High-level object recognition
• Image statistics for large-scale image search

Spectrum of Label Information
• Human annotations → noisy labels → unlabeled data

Semi-Supervised Learning
• Given a few labeled examples and a large pool of unlabeled data, the classification function should be smooth with respect to the data density
[Figure: toy data, supervised vs. semi-supervised decision boundaries]

Semi-Supervised Learning using Graph Laplacian [Zhu03, Zhou04]
• Graph G = (V, E): V = data points (n in total), E = n x n affinity matrix W
• W_ij = exp(-||x_i - x_j||^2 / 2ε^2)
• D_ii = Σ_j W_ij
• Graph Laplacian:
  L = D^(-1/2) (D - W) D^(-1/2) = I - D^(-1/2) W D^(-1/2)

SSL using Graph Laplacian
• Want to find the label function f that minimizes:
  J(f) = f^T L f + (f - y)^T Λ (f - y)
  (first term: smoothness; second term: agreement with the labels)
• y = labels
• If point i is labeled, Λ_ii = λ; otherwise Λ_ii = 0
• Solution: an n x n linear system (n = # points)

Eigenvectors of Laplacian
• Smooth vectors will be linear combinations of the eigenvectors U of L with small eigenvalues:
  f = Uα, U = [φ_1, ..., φ_k]
  [Belkin & Niyogi 06, Schoelkopf & Smola 02, Zhu et al. 03, 08]

Rewrite System
• Let f = Uα
• U = smallest k eigenvectors of L, α = coefficients
• k is a user parameter (typically ~100)
• The cost becomes J(α) = α^T Σ α + (Uα - y)^T Λ (Uα - y), where Σ is the diagonal matrix of the k smallest eigenvalues
• Optimal α is now the solution to a k x k system (a code sketch follows at the end of this part):
  (Σ + U^T Λ U) α = U^T Λ y

Computational Bottleneck
• Consider a dataset of 80 million images
• Inverting L means inverting an 80 million x 80 million matrix
• Finding eigenvectors of L means diagonalizing an 80 million x 80 million matrix

Large-Scale SSL: Related Work
• Nyström method: pick a small set of landmark points [see Zhu '08 survey]
  – Compute exact eigenvectors on these
  – Interpolate the solution to the rest of the data
• Other approaches include: mixture models (Zhu and Lafferty '05), sparse grids (Garcke and Griebel '05), sparse graphs (Tsang and Kwok '06)

Overview of Our Approach
• Nyström reduces n by working on a small set of landmarks: cost is polynomial in the number of landmarks
• We instead compute approximate eigenvectors by taking the limit as n → ∞ and working with the data density: cost is linear in the number of data points

Consider the Limit as n → ∞
• Consider x to be drawn from a 2D distribution p(x)
• Let L_p(F) be a smoothness operator on p(x), for a function F(x):
  L_p(F) = 1/2 ∫∫ (F(x_1) - F(x_2))^2 W(x_1, x_2) p(x_1) p(x_2) dx_1 dx_2,
  where W(x_1, x_2) = exp(-||x_1 - x_2||^2 / 2ε^2)
• The smoothness operator penalizes functions that vary in areas of high density
• Analyze the eigenfunctions of L_p(F)

Eigenvectors & Eigenfunctions
[Figure: eigenvectors of the graph Laplacian vs. the corresponding eigenfunctions of the density]

Key Assumption: Separability of Input Data
• Claim: if p is separable, p(x_1, x_2) = p(x_1) p(x_2), then the eigenfunctions of the marginals are also eigenfunctions of the joint density, with the same eigenvalues [Nadler et al. 06, Weiss et al. 08]

Numerical Approximations to Eigenfunctions in 1D
• 300,000 points drawn from a distribution p(x); consider the marginal p(x_1) via its data histogram h(x_1)
• Solve for the values g of the eigenfunction at a set of discrete locations (the histogram bin centers), and the associated eigenvalues σ, from the generalized eigensystem
  (D̃ - P W̃ P) g = σ P D̂ g
  – P = diag(h(x_1))
  – W̃ = affinity between the discrete locations
  – D̃, D̂ = diagonal matrices whose entries are the sums of P W̃ P and P W̃ respectively
• This is a B x B system (B = # histogram bins, e.g. 50)
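To make the reduced system concrete, here is a minimal numpy/scipy sketch of the k x k solve from the Rewrite System slide, on a small toy dataset. It is an illustrative reading of the slides, not the authors' code: the function name, the bandwidth eps, and the label weight lam are assumed values.

```python
import numpy as np
from scipy.linalg import eigh

def ssl_reduced_system(X, y, labeled_mask, k=10, eps=1.0, lam=100.0):
    """Toy solve of (Σ + UᵀΛU)α = UᵀΛy, with label function f = Uα.
    y holds the labels (0 for unlabeled points);
    labeled_mask is a boolean array of length n."""
    n = X.shape[0]
    # n x n affinity: W_ij = exp(-||x_i - x_j||^2 / 2 eps^2)
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    W = np.exp(-sq / (2 * eps ** 2))
    # Normalized graph Laplacian L = I - D^(-1/2) W D^(-1/2)
    d_inv_sqrt = 1.0 / np.sqrt(W.sum(1))
    L = np.eye(n) - d_inv_sqrt[:, None] * W * d_inv_sqrt[None, :]
    # U = smallest k eigenvectors of L, Σ = diag of their eigenvalues
    evals, U = eigh(L, subset_by_index=[0, k - 1])
    Sigma = np.diag(evals)
    # Λ_ii = λ for labeled points, 0 otherwise
    Lam = np.diag(np.where(labeled_mask, lam, 0.0))
    # Optimal coefficients from the k x k system, then f = Uα
    alpha = np.linalg.solve(Sigma + U.T @ Lam @ U, U.T @ Lam @ y)
    return U @ alpha
```

The payoff of the rewrite is visible in the last two lines: only a k x k system is solved. But the dense W and the exact eigendecomposition above still cost at least O(n^2), which is exactly the bottleneck the eigenfunction approach below removes.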
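Here is a corresponding sketch of the 1-D numerical eigenfunction system just described, again as an illustrative reading of the slides rather than the authors' implementation. The bandwidth, the use of column sums for D̃ and D̂, and the empty-bin floor are my assumptions.

```python
import numpy as np
from scipy.linalg import eigh

def approx_eigenfunctions_1d(x, B=50, eps=0.2, num_funcs=3):
    """Approximate eigenfunctions of the density of a 1-D variable x,
    evaluated at the B histogram bin centers."""
    h, edges = np.histogram(x, bins=B, density=True)
    h = np.maximum(h, 1e-10)  # guard: keep P D̂ positive definite when bins are empty
    centers = 0.5 * (edges[:-1] + edges[1:])
    # W~ : affinity between the B discrete bin centers
    Wt = np.exp(-(centers[:, None] - centers[None, :]) ** 2 / (2 * eps ** 2))
    P = np.diag(h)                    # P = diag(h(x1))
    PWP = P @ Wt @ P
    D_tilde = np.diag(PWP.sum(0))     # sums of P W~ P (symmetric, so rows = columns)
    D_hat = np.diag((P @ Wt).sum(0))  # column sums of P W~ (an assumption)
    # Generalized B x B eigensystem: (D~ - P W~ P) g = σ P D^ g
    sigma, g = eigh(D_tilde - PWP, P @ D_hat)
    return centers, g[:, :num_funcs], sigma[:num_funcs]
```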
1D Approximate Eigenfunctions
[Figure: 1st, 2nd, and 3rd eigenfunctions of h(x_1)]

Separability over Dimensions
• Build a histogram over dimension 2: h(x_2)
• Now solve for the eigenfunctions of h(x_2)
[Figure: 1st, 2nd, and 3rd eigenfunctions of h(x_2)]

From Eigenfunctions to Approximate Eigenvectors
• Take each data point
• Do a 1-D interpolation of its coordinate into each eigenfunction (over the histogram bins) to read off the eigenfunction value
• Very fast operation (a code sketch follows at the end of this part)

Preprocessing
• Need to make the data separable
• Rotate using PCA: not separable → separable

Overall Algorithm
1. Rotate the data to maximize separability (currently using PCA)
2. For each of the d input dimensions:
   – Construct a 1D histogram
   – Solve numerically for eigenfunctions/eigenvalues
3. Order the eigenfunctions from all dimensions by increasing eigenvalue and take the first k
4. Interpolate the data into the k eigenfunctions, yielding approximate eigenvectors of the Laplacian
5. Solve the k x k least-squares system to give the label function

Experiments on Toy Data

Nyström Comparison
• With Nyström, too few landmark points result in highly unstable eigenvectors
• Eigenfunctions fail when the data has significant dependencies between dimensions

Experiments on Real Data

Experiments
• Images from 126 classes (e.g. "dump truck", "emu") downloaded from Internet search engines; 63,000 images in total
• Labels (correct/incorrect) provided by Alex Krizhevsky, Vinod Nair & Geoff Hinton (CIFAR & U. Toronto)

Input Image Representation
• Pixels are not a convenient representation
• Use the Gist descriptor (Oliva & Torralba, 2001): apply oriented Gabor filters over different scales and average the filter energy in each bin
• L2 distance between Gist vectors is a rough substitute for human perceptual distance

Are Dimensions Independent?
• Joint histograms for pairs of dimensions: raw 384-dimensional Gist vs. after PCA to 64 dimensions
• MI is the mutual information score; 0 = independent
[Figure: joint histograms before and after PCA]

Real 1-D Eigenfunctions of PCA'd Gist Descriptors
[Figure: eigenfunction values across input dimensions 1-64; panels for eigenfunction 1 and eigenfunction 256]

Protocol
• Task is to re-rank the images of each class (class/non-class)
• Use eigenfunctions computed on all 63,000 images
• Vary the number of labeled examples
• Measure precision at 15% recall
[Figures: precision curves; x-axis = total number of images (4,800 to 8,000)]

Running on 80 Million Images
• PCA to 32 dimensions, k = 48 eigenfunctions
• For each class, labels propagate through all 80 million images
• Precompute the approximate eigenvectors (~20 GB)
• Label propagation is fast: < 0.1 seconds per keyword
[Figure: "Japanese spaniel" query with 3 positive and 3 negative labels from the CIFAR set; further examples: airbus, ostrich, auto]

Summary
• A semi-supervised scheme that can scale to really large problems: linear in the number of points
• Rather than sub-sampling the data, we take the limit of infinite unlabeled data
• Assumes the input data distribution is separable
• Can propagate labels in a graph with 80 million nodes in fractions of a second
• Related paper in this NIPS by Nadler, Srebro & Zhou (see the spotlights on Wednesday)

Future Work
• Can potentially use 2D or 3D histograms instead of 1D (requires more data)
• Consider diagonal eigenfunctions
• Sharing of labels between classes

Comparison of Approaches
[Figure: exact eigenvectors vs. approximate eigenvectors from eigenfunctions]
• Eigenvalues (exact : approximate):
  0.0531 : 0.0535
  0.1920 : 0.1928
  0.2049 : 0.2068
  0.2480 : 0.5512
  0.3580 : 0.7979
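To illustrate step 4 of the overall algorithm (and the From Eigenfunctions to Approximate Eigenvectors slide), here is a small sketch of turning stored 1-D eigenfunctions into approximate eigenvectors by per-dimension linear interpolation. The data layout, a list of (dimension, bin_centers, values) triples, is my assumption; np.interp is the only library call.

```python
import numpy as np

def interpolate_eigenvectors(X_rot, eigenfuncs):
    """X_rot: n x d data after the PCA rotation.
    eigenfuncs: k triples (dim, bin_centers, g), ordered by increasing eigenvalue.
    Returns an n x k matrix of approximate eigenvectors of the Laplacian."""
    cols = []
    for dim, centers, g in eigenfuncs:
        # Each point needs only a 1-D lookup in the dimension its
        # eigenfunction came from, so this step is linear in n and very fast.
        cols.append(np.interp(X_rot[:, dim], centers, g))
    return np.stack(cols, axis=1)
```

Feeding this matrix in place of U into the same k x k system from earlier gives the label function; nothing downstream changes.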
Are Dimensions Independent? (after ICA)
• Joint histograms for pairs of dimensions: raw 384-dimensional Gist vs. after ICA
• MI is the mutual information score; 0 = independent
[Figure: joint histograms before and after ICA]

Varying # Eigenfunctions
[Figure slide]

Leveraging Noisy Labels
• Images in the dataset have noisy labels: the keyword used in the Internet search engine
• These can easily be incorporated into the SSL scheme (a code sketch follows at the end of this section)
• Give them 1/10th the weight of a hand-labeled example
[Figure slides: leveraging noisy labels; effect of noisy labels]

Complexity Comparison
Key: n = # data points (big, > 10^6); l = # labeled points (small, < 100); m = # landmark points; d = # input dims (~100); k = # eigenvectors (~100); b = # histogram bins (~50)

Nyström:
• Select m landmark points
• Get the smallest k eigenvectors of an m x m system
• Interpolate the n points into the k eigenvectors
• Solve a k x k linear system
→ Polynomial in the number of landmarks

Eigenfunction:
• Rotate the n points
• Form d 1-D histograms
• Solve d linear systems, each b x b
• Do k 1-D interpolations of the n points
• Solve a k x k linear system
→ Linear in the number of data points
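As a final sketch, here is one plausible reading of the weighting from the Leveraging Noisy Labels slide. The 1/10 ratio is from the talk; lam and the variable names are assumptions, and the solve reuses the reduced system from earlier.

```python
import numpy as np

def solve_with_noisy_labels(U, Sigma, y, hand_mask, noisy_mask, lam=100.0):
    """Same k x k system as before, with down-weighted search-engine labels."""
    w = np.zeros(len(y))
    w[noisy_mask] = lam / 10.0  # keyword-derived labels: 1/10th weight
    w[hand_mask] = lam          # hand-verified labels: full weight
    Lam = np.diag(w)
    alpha = np.linalg.solve(Sigma + U.T @ Lam @ U, U.T @ Lam @ y)
    return U @ alpha  # label function over all points
```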