Large Graph Construction for Scalable Semi-Supervised Learning
Wei Liu, Junfeng He, Shih-Fu Chang
Department of Electrical Engineering, Columbia University, New York, NY, USA
International Conference on Machine Learning, 2010

Outline
1 Motivation
2 Related Work: Mid-Scale SSL, Large-Scale SSL
3 Anchor-Based Label Prediction: Anchors, Anchor-to-Data Nonparametric Regression
4 AnchorGraph: Coupling Label Prediction and Graph Construction, AnchorGraph Regularization
5 Experiments
6 Conclusions

Why Semi-Supervised Learning (SSL)?
- In practice, labeling data is expensive.
- By using both labeled and unlabeled data, SSL generalizes and improves on supervised learning.
- Most SSL methods cannot handle large amounts of unlabeled data because they scale poorly with the data size.
- Large-scale SSL is critical nowadays, since massive unlabeled datasets (up to hundreds of millions of samples) are easily collected from the Internet (Facebook, Flickr).

SSL on Large Social Networks
http://www.cmu.edu/joss/content/articles/volume1/Freeman.html

Mid-Scale SSL: Manifold and Graph
- Manifold assumption: samples that are close under some similarity measure have similar class labels. Graph-based SSL methods such as LGC, GFHF, and LapSVM adopt this assumption.
- Neighborhood graph: a sparse graph can reveal the data manifold, with adjacent graph nodes representing nearby data points.
- The most commonly used graph is the kNN graph (linking each data point to its k nearest neighbors), whose construction takes O(kn^2) time. Inefficient at large scale! (A brute-force construction sketch appears after this section's slides.)

Mid-Scale SSL: Manifold and Graph (cont.)
- Dense graphs hardly reveal manifolds; sparse graphs are preferred.
Figure: an example data manifold (two submanifolds) and its 6NN graph.

Large-Scale SSL: Existing Methods
- Nonparametric Function Induction (Delalleau et al., 2005): uses a truncated graph on a sample subset plus its connections to the remaining samples.
- Harmonic Mixtures (Zhu & Lafferty, 2005): needs a large sparse graph, but does not explain how to construct one.
- Prototype Vector Machine (PVM) (Zhang et al., 2009): uses the Nyström approximation of the fully connected graph adjacency matrix; the graph Laplacian computed from this approximation is not guaranteed to be positive semidefinite.
- Eigenfunction (Fergus et al., 2010): relies on a dimension-independent data density assumption, which does not always hold.

Our Aim
- Develop a scalable SSL approach that is easy to implement and has closed-form solutions.
- Address the critical issue of how to build an entire sparse graph over a large dataset.
- Low space and time complexity with respect to the data size. Our goal: achieve linear complexity!
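To make the kNN complaint concrete, here is a minimal numpy sketch (my own illustration, not material from the talk; knn_graph is a hypothetical helper name) of brute-force kNN graph construction. The full pairwise distance matrix is exactly the quadratic step that rules this out at large n.

```python
import numpy as np

def knn_graph(X, k):
    """Brute-force kNN graph over X (n x d). The full pairwise distance
    matrix is the O(n^2 d) bottleneck at large n."""
    n = X.shape[0]
    sq = (X ** 2).sum(axis=1)
    D = sq[:, None] + sq[None, :] - 2.0 * X @ X.T   # all n^2 squared distances
    np.fill_diagonal(D, np.inf)                     # exclude self-loops
    idx = np.argpartition(D, k, axis=1)[:, :k]      # k nearest neighbors per row
    W = np.zeros((n, n))
    W[np.repeat(np.arange(n), k), idx.ravel()] = 1.0
    return np.maximum(W, W.T)                       # symmetrized adjacency
```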
Anchors: How to Scale Up?
- Graph-based SSL methods usually incur cubic time complexity O(n^3) in the data size n, because inferring the full-size label prediction requires inverting an n x n matrix.
- At large scale, full-size label prediction is infeasible, which motivates the idea of landmark samples, i.e., anchors.
- If one instead infers the labels on a much smaller landmark set of size m (<< n), the time complexity drops to O(m^3).

Anchor Points Example
Figure: anchor points u_1, ..., u_4 obtained as k-means cluster centers of the data points.

Anchor-to-Data Relationship
- Input: data points X = {x_1, ..., x_n} ⊂ R^d and introduced anchor points U = {u_1, ..., u_m}.
- Suppose a soft label prediction function f: R^d -> R. Predict the label of each data point as a weighted average of the labels on the anchors:

f(x_i) = \sum_{k=1}^{m} Z_{ik} f(u_k),  i = 1, ..., n,    (1)

where the Z_{ik} denote the combination weights. In matrix form,

f = Z a,  Z ∈ R^{n×m},  m << n,    (2)

where a = [f(u_1), ..., f(u_m)]^T is the anchor label vector that we will solve for.

How to Obtain Z?
- We impose the nonnegative normalization constraints \sum_{k=1}^{m} Z_{ik} = 1 and Z_{ik} >= 0 in order to maintain a unified range of values for all soft labels predicted via anchors.
- The manifold assumption implies that nearby points should have similar labels, while distant points are very unlikely to take similar labels. We therefore further impose Z_{ik} = 0 whenever anchor u_k is far away from x_i: Z_{ik} ≠ 0 only for the s closest anchors of x_i, whose indexes form the set <i> ⊂ [1:m].

Anchor-to-Data Nonparametric Regression
- The weights Z_{ik} are sample-adaptive, so anchor-based label prediction falls into nonparametric regression: the label of x_i is a locally weighted average of the labels on the anchors.
- The regression matrix Z ∈ R^{n×m} is nonnegative and sparse.
Figure: with s = 2, x_1 links only to its two closest anchors u_1 and u_2, with weights Z_{11} and Z_{12}.

Predefined Kernel-Based Weights
Nadaraya-Watson kernel regression:

Z_{ik} = K_h(x_i, u_k) / \sum_{k' ∈ <i>} K_h(x_i, u_{k'}),  ∀k ∈ <i>,    (3)

typically with the Gaussian kernel K_h(x_i, u_k) = exp(-||x_i - u_k||^2 / 2h^2).
- These predefined weights are sensitive to the hyperparameter h and lack a meaningful interpretation.

Optimized Weights: Local Anchor Embedding (LAE)

min_{z_i ∈ R^s}  g(z_i) = (1/2) ||x_i - U_{<i>} z_i||^2
s.t.  1^T z_i = 1,  z_i >= 0,    (4)

where U = [u_1, ..., u_m] and U_{<i>} ∈ R^{d×s} is the sub-matrix composed of the s nearest anchors of x_i.
- Similar to Locally Linear Embedding (LLE), LAE reconstructs each x_i as a convex combination of its nearest anchors. The combination coefficients are preserved as the weights Z_{ik} used in nonparametric regression:

Z_{i,<i>} = z_i^T,  Z_{i,<i>^c} = 0,  |<i>| = s.    (5)

- We solve problem (4) with a Nesterov-accelerated projected gradient algorithm, a first-order optimization procedure (see the sketch below).
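Below is a minimal numpy sketch of designing Z (my own illustration, not the authors' code; design_Z, lae_weights, and project_simplex are hypothetical names). It fills each row of Z on the s closest anchors using either the kernel weights of Eq. (3) or LAE as in Eqs. (4)-(5); for brevity it runs plain projected gradient with a fixed step 1/L rather than the Nesterov-accelerated variant used in the talk.

```python
import numpy as np

def project_simplex(v):
    """Euclidean projection of v onto the simplex {z : z >= 0, 1'z = 1}."""
    u = np.sort(v)[::-1]                         # sort descending
    css = np.cumsum(u)
    rho = np.nonzero(u - (css - 1.0) / np.arange(1, v.size + 1) > 0)[0][-1]
    return np.maximum(v - (css[rho] - 1.0) / (rho + 1.0), 0.0)

def lae_weights(x, U_s, T=100, tol=1e-8):
    """LAE (Eq. 4) by projected gradient: reconstruct x as a convex
    combination of its s nearest anchors (the rows of U_s)."""
    s = U_s.shape[0]
    G, b = U_s @ U_s.T, U_s @ x                  # gradient of g is G z - b
    step = 1.0 / max(np.linalg.eigvalsh(G)[-1], 1e-12)  # 1 / Lipschitz constant
    z = np.full(s, 1.0 / s)                      # start at the simplex barycenter
    for _ in range(T):
        z_new = project_simplex(z - step * (G @ z - b))
        if np.linalg.norm(z_new - z) < tol:
            return z_new
        z = z_new
    return z

def design_Z(X, U, s=3, h=1.0, use_lae=False):
    """n x m regression matrix Z: each row has s nonzeros, on the s closest
    anchors, weighted by Eq. (3) kernel weights or by LAE."""
    n, m = X.shape[0], U.shape[0]
    Z = np.zeros((n, m))
    for i in range(n):
        d2 = ((U - X[i]) ** 2).sum(axis=1)       # squared distances to anchors
        nn = np.argpartition(d2, s)[:s]          # index set <i>
        if use_lae:
            Z[i, nn] = lae_weights(X[i], U[nn])
        else:
            k = np.exp(-d2[nn] / (2.0 * h ** 2))  # Gaussian kernel, Eq. (3)
            Z[i, nn] = k / k.sum()
    return Z
```

Given anchor labels a, prediction over the whole dataset is then just f = Z @ a, matching Eq. (2).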
Anchor-to-Data Nonparametric Regression (cont.)
- In contrast to the predefined kernel-based weights, LAE is more advantageous: it provides optimized regression weights that are also sparser.
Figure: x_i is projected onto the convex hull of its closest anchors U_{<i>}. With s = 2, for example, Z_{42} = 0 and Z_{43} = 1 for x_4, while Z_{73} = 0.5 and Z_{74} = 0.5 for x_7.

Large Graph Construction
- Construct a large graph G(X, W) over the dataset X, where W ∈ R^{n×n} denotes the adjacency matrix.
- Exact kNN graph construction (quadratic time complexity) is infeasible at large scale, and it is unrealistic to keep a full matrix W in memory.
- Our idea is low-rank matrix factorization: representing W as a Gram matrix.

Principles for Designing W
- W >= 0, which is sufficient to make the resulting graph Laplacian L = D - W (with D = diag(W 1)) positive semidefinite. Keeping graph Laplacians p.s.d. is important for guaranteeing a global optimum in graph-based SSL, since it ensures the graph regularizer f^T L f is convex.
- Sparse W. Sparse graphs have far fewer spurious connections between dissimilar points and tend to exhibit high quality.

Nyström Approximation
- Using the Nyström method to design the graph adjacency matrix, W = \dot{K}, produces an improper dense graph.
- \dot{K} = K_{.U} K_{UU}^{-1} K_{U.} approximates a predefined kernel matrix K ∈ R^{n×n}.
- There is no guarantee that \dot{K} >= 0, and \dot{K} is dense.

Design W Using Z
- Anchor-based label prediction has already given us the nonnegative, sparse Z ∈ R^{n×m}. Don't waste it.
- In depth, each row Z_{i.} of Z is a new representation of the raw sample x_i. The map x_i -> Z_{i.}^T is reminiscent of sparse coding with basis U, since x_i ≈ U_{<i>} z_i = U Z_{i.}^T.
- Guess the Gram matrix W = Z Z^T?

Design W Using Z: Low-Rank Matrix Factorization

W = Z Λ^{-1} Z^T,    (6)

in which the diagonal matrix Λ ∈ R^{m×m} contains the inbound degrees of the anchors, Λ_{kk} = \sum_{i=1}^{n} Z_{ik}.
- The factor Λ^{-1} balances the popularity of the anchors. W is nonnegative and empirically sparse.
- Theoretically, Eq. (6) can be derived by probabilistic means, e.g., via the bipartite graph on the next slide. (A short code sketch of Eq. (6) follows that slide.)

Bipartite Data-Anchor Graph
Figure: a bipartite graph representation B(X, U, Z) of the data X and anchors U. Z is defined as the cross adjacency matrix between X and U (Z 1 = 1).
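A one-function numpy sketch of Eq. (6) (again my own illustration, with a hypothetical name; at scale W would never be materialized, since only Z needs to be stored):

```python
import numpy as np

def anchor_graph_adjacency(Z):
    """AnchorGraph adjacency (Eq. 6): W = Z Lambda^{-1} Z^T, where
    Lambda_kk = sum_i Z_ik is the inbound degree of anchor u_k.
    Densified here only for illustration."""
    Lam = np.maximum(Z.sum(axis=0), 1e-12)   # anchor degrees; guard unused anchors
    return (Z / Lam) @ Z.T                   # scales column k of Z by 1/Lambda_kk
```

On a toy Z one can verify the claimed properties: W is nonnegative and symmetric, and since Z 1 = 1 every row of W sums to 1, so D = diag(W 1) = I and the graph Laplacian is simply L = I - W.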
One-Step Transitions in Markov Random Walks
On the bipartite graph B, the one-step data-to-anchor transition probability is

p^{(1)}(u_k | x_i) = Z_{ik} / \sum_{k'=1}^{m} Z_{ik'} = Z_{ik},

since each row of Z sums to 1, and the one-step anchor-to-data transition probability is

p^{(1)}(x_i | u_k) = Z_{ik} / \sum_{j=1}^{n} Z_{jk} = Z_{ik} / Λ_{kk}.

Two-Step Transitions in Markov Random Walks

p^{(2)}(x_j | x_i) = p^{(2)}(x_i | x_j) = \sum_{k=1}^{m} p^{(1)}(x_j | u_k) p^{(1)}(u_k | x_i) = \sum_{k=1}^{m} Z_{ik} Z_{jk} / Λ_{kk}.

AnchorGraph
- Interpret W as the two-step transition probabilities in the bipartite data-anchor graph B:

W_{ij} = p^{(2)}(x_j | x_i) = p^{(2)}(x_i | x_j)  ⇒  W = Z Λ^{-1} Z^T.    (7)

- We term the large graph G described by such a W an AnchorGraph.
- There is no need to compute W or the graph Laplacian L = D - W explicitly; we only need to keep Z (O(sn) memory).

AnchorGraph Regularization
With f = Z a, the objective function of SSL is

Q(a) = ||Z_l a - y_l||^2 + γ a^T Z^T L Z a,    (8)

where the first term is label fitting and the second is graph regularization, and we compute a "reduced" Laplacian matrix

L̃ = Z^T L Z = Z^T (D - Z Λ^{-1} Z^T) Z = Z^T Z - (Z^T Z) Λ^{-1} (Z^T Z),

since D = diag(W 1) = diag(Z Λ^{-1} Z^T 1) = diag(Z Λ^{-1} Λ 1) = diag(Z 1) = I. Computing L̃ is tractable, taking O(m^3 + m^2 n) time.

AnchorGraph Regularization (cont.)
- We can easily obtain the globally optimal solution in closed form:

a* = (Z_l^T Z_l + γ L̃)^{-1} Z_l^T y_l.    (9)

As such, we yield a closed-form solution for large-scale SSL.
- Final prediction over multiple classes (Y = {1, ..., c}):

ŷ_i = arg max_{j ∈ Y} Z_{i.} a*_j / λ_j,  i = l+1, ..., n,    (10)

where a*_j is solved separately for each class, and the class normalization factor λ_j = 1^T Z a*_j balances skewed class distributions. (A solver sketch follows the complexity table below.)

Complexity Analysis
- Space complexity: O(n), since we keep Z and L̃ ∈ R^{m×m} in memory.
- Time complexity: O(m^2 n), since n >> m.

Table: time complexity of AnchorGraphReg. n is the data size, m is the number of anchor points, s is the number of nearest anchors used in designing Z, and T is the number of LAE iterations (n >> m >> s).

Stage                    Time complexity
find anchors             O(mn)
design Z                 O(smn), or O(smn + s^2 Tn) with LAE
graph regularization     O(m^3 + m^2 n)
total                    O(m^2 n)
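Here is a minimal numpy sketch of the full AnchorGraphReg solve, Eqs. (8)-(10) (my own illustration, not the authors' code; anchor_graph_reg is a hypothetical name, and Y_l is assumed to be an l x c one-hot label matrix so that each column of A is one a*_j):

```python
import numpy as np

def anchor_graph_reg(Z, labeled_idx, Y_l, gamma=0.01):
    """AnchorGraphReg: closed-form anchor labels (Eq. 9), then multi-class
    prediction with class normalization (Eq. 10)."""
    ZtZ = Z.T @ Z
    Lam = np.maximum(Z.sum(axis=0), 1e-12)       # anchor degrees Lambda_kk
    L_red = ZtZ - (ZtZ / Lam) @ ZtZ              # reduced Laplacian L~
    Z_l = Z[labeled_idx]                         # rows of Z for labeled points
    A = np.linalg.solve(Z_l.T @ Z_l + gamma * L_red,
                        Z_l.T @ Y_l)             # Eq. (9): column j is a*_j
    lam = Lam @ A                                # lambda_j = 1^T Z a*_j
    return ((Z @ A) / lam).argmax(axis=1)        # Eq. (10), scored for all points
```

Combined with the earlier design_Z sketch, the whole pipeline would read y_hat = anchor_graph_reg(design_Z(X, U, s=3, use_lae=True), labeled_idx, Y_l).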
Experiments: Compared Methods
- Baselines: LGC, GFHF, PVM, Eigenfunction, etc.
- random AGR^0: random anchors, predefined Z
- random AGR: random anchors, LAE-optimized Z
- AGR^0: k-means anchors, predefined Z
- AGR: k-means anchors, LAE-optimized Z

Mid-Sized Dataset
Table: classification error rates (%) on USPS-Train (7,291 samples) with 100 labeled samples. All four versions of AGR use 1,000 anchors; the k-means clustering takes 7.65 seconds.

Method                  Error Rate (%)    Running Time (seconds)
1NN                     20.15 ± 1.80      0.12
LGC with 6NN graph      8.79 ± 2.27       403.02
GFHF with 6NN graph     5.19 ± 0.43       413.28
random AGR^0            11.15 ± 0.77      2.55
random AGR              10.30 ± 0.75      8.85
AGR^0                   7.40 ± 0.59       10.20
AGR                     6.56 ± 0.55       16.57

- Error rate: GFHF < AGR < AGR^0 < LGC.
- AGR is much faster than GFHF.

Mid-Sized Dataset (cont.)
Figure: classification error rate (%) vs. number of anchor points (200 to 1,000) on USPS-Train with 100 labeled samples, comparing LGC, GFHF, random AGR^0, random AGR, AGR^0, and AGR. Once the anchor set reaches 600 points, both AGR^0 and AGR begin to outperform LGC.

Large Dataset
Table: classification error rates (%) on Extended MNIST (630,000 samples) with 100 labeled samples. Both versions of AGR use 500 anchors; the k-means clustering takes 195.16 seconds. AGR gives the best classification accuracy, roughly halving the error rate of the 1NN baseline.

Method              Error Rate (%)    Running Time (seconds)
1NN                 39.65 ± 1.86      5.46
Eigenfunction       36.94 ± 2.67      44.08
PVM (square loss)   29.37 ± 2.53      266.89
AGR^0               24.71 ± 1.92      232.37
AGR                 19.75 ± 1.83      331.72

Conclusions
- We develop a scalable SSL approach built on the proposed large graph construction, the AnchorGraph, which is guaranteed to yield positive semidefinite graph Laplacians.
- Both the time and the memory needed by AGR grow only linearly with the data size, enabling SSL on even larger datasets with millions of samples.
- The AnchorGraph has wider utility, e.g., large-scale nonlinear dimensionality reduction, spectral clustering, and regression.

Thanks!
Questions? Email wliu@ee.columbia.edu.