Spectral Methods for Learning Latent Variable Models: Unsupervised and Supervised Settings
Anima Anandkumar, U.C. Irvine

Learning with Big Data

Data vs. Information
Messy data: missing observations, gross corruptions, outliers.
High-dimensional regime: as the data grows, so does the number of variables!
Useful information: low-dimensional structures.
Learning with big data is an ill-posed problem: learning is finding a needle in a haystack.
Learning with big data is also computationally challenging!
Principled approaches for finding low-dimensional structures?

How to model information structures?
Latent variable models: incorporate hidden or latent variables.
Information structures: relationships between the latent variables and the observed data.
Basic approach: mixtures/clusters, where the hidden variable is categorical.
Advanced: probabilistic models, where the hidden variables have more general distributions and can model mixed-membership/hierarchical groups.
(Figure: a graphical model with hidden variables h1, h2, h3 connected to observations x1, ..., x5.)

Latent Variable Models (LVMs)
Document modeling: observed words; hidden topics.
Social network modeling: observed social interactions; hidden communities and relationships.
Recommendation systems: observed recommendations (e.g., reviews); hidden user and business attributes.
Unsupervised learning: learn the LVM without labeled examples.

LVM for Feature Engineering
Learn good features/representations for classification tasks, e.g., in computer vision and NLP.
Sparse coding/dictionary learning: sparse representations, low-dimensional hidden structures. A few dictionary elements make complicated shapes.

Associative Latent Variable Models
Supervised learning: given labeled examples {(x_i, y_i)}, learn a classifier ŷ = f(x).
Associative/conditional models: p(y|x).
Example, logistic regression: E[y|x] = σ(⟨u, x⟩).
Mixture of logistic regressions: E[y|x, h] = g(⟨U h, x⟩ + ⟨b, h⟩).
Multi-layer/deep network: E[y|x] = σ_d(A_d σ_{d−1}(A_{d−1} σ_{d−2}(··· A_2 σ_1(A_1 x)))).
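As a concrete illustration of the mixture of logistic regressions above, the following minimal sketch draws samples from E[y|x, h] = σ(⟨U h, x⟩ + ⟨b, h⟩) with a categorical hidden variable h. It is illustrative only, not code from the talk; the dimensions, mixing weights, and variable names are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, n = 10, 3, 5                      # input dimension, mixture components, samples to draw
U = rng.normal(size=(d, k))             # one regression vector per hidden state (columns of U)
b = rng.normal(size=k)                  # per-component bias
pi = np.array([0.5, 0.3, 0.2])          # mixing weights P(h = j)

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

for _ in range(n):
    h = rng.choice(k, p=pi)             # hidden component (categorical, never observed)
    x = rng.normal(size=d)              # observed covariates
    p = sigmoid(x @ U[:, h] + b[h])     # E[y | x, h] = sigma(<U h, x> + <b, h>) with h one-hot
    y = rng.binomial(1, p)              # observed label
    print(h, y)
```

Only the (x, y) pairs are observed; the learning problem is to recover U, b, and the mixing weights without ever seeing h.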
Challenges in Learning LVMs
Computational challenges: maximum likelihood is NP-hard in most scenarios.
In practice, local search approaches such as back-propagation, EM, and variational Bayes have no consistency guarantees.
Sample complexity: for many learning methods, the sample complexity is exponential in the hidden-variable dimension.
Goal: guaranteed and efficient learning through spectral methods.

Outline
1 Introduction
2 Spectral Methods: Classical Matrix Methods; Beyond Matrices: Tensors
3 Moment Tensors for Latent Variable Models: Topic Models; Network Community Models; Experimental Results
4 Moment Tensors in Supervised Setting
5 Conclusion

Classical Spectral Methods: Matrix PCA and CCA
Unsupervised setting (PCA): for centered samples {x_i}, find the projection P with Rank(P) = k that solves
min_P (1/n) Σ_{i∈[n]} ‖x_i − P x_i‖².
Result: eigen-decomposition of S = Cov(X).
Supervised setting (CCA): for centered samples {(x_i, y_i)}, find
max_{a,b} a⊤Ê[x y⊤] b / √(a⊤Ê[x x⊤] a · b⊤Ê[y y⊤] b).
Result: generalized eigen-decomposition.

Shortcomings of Matrix Methods
Learning through spectral clustering: dimension reduction through PCA (on the data matrix), followed by clustering of the projected vectors (e.g., k-means).
The basic method works only for single memberships, and it fails to cluster under small separation.
Efficient learning without separation constraints?

Beyond SVD: Spectral Methods on Tensors
How to learn mixture models without separation constraints?
◮ PCA uses the covariance matrix of the data. Are higher-order moments helpful?
Unified framework?
◮ Moment-based estimation of probabilistic latent variable models?
SVD gives the spectral decomposition of matrices.
◮ What are the analogues for tensors?

Moment Matrices and Tensors
Multivariate moments in the unsupervised setting:
M1 := E[x], M2 := E[x ⊗ x], M3 := E[x ⊗ x ⊗ x].
The matrix E[x ⊗ x] ∈ R^{d×d} is a second-order tensor, with E[x ⊗ x]_{i1,i2} = E[x_{i1} x_{i2}]; for matrices, E[x ⊗ x] = E[x x⊤].
The tensor E[x ⊗ x ⊗ x] ∈ R^{d×d×d} is a third-order tensor, with E[x ⊗ x ⊗ x]_{i1,i2,i3} = E[x_{i1} x_{i2} x_{i3}].
Multivariate moments in the supervised setting:
M1 := E[x], E[y];  M2 := E[x ⊗ y];  M3 := E[x ⊗ x ⊗ y].

Spectral Decomposition of Tensors
Matrix: M2 = Σ_i λ_i u_i ⊗ v_i = λ_1 u_1 ⊗ v_1 + λ_2 u_2 ⊗ v_2 + ···
Tensor: M3 = Σ_i λ_i u_i ⊗ v_i ⊗ w_i = λ_1 u_1 ⊗ v_1 ⊗ w_1 + λ_2 u_2 ⊗ v_2 ⊗ w_2 + ···
u ⊗ v ⊗ w is a rank-1 tensor, since its (i1, i2, i3)-th entry is u_{i1} v_{i2} w_{i3}.
How to solve this non-convex problem?
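A minimal numpy sketch of these objects (illustrative only; the sample size, dimension, and variable names are assumptions): empirical estimates of M2 and M3 are averages of outer products, and a rank-1 third-order tensor is exactly an outer product u ⊗ v ⊗ w.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 1000, 5
X = rng.normal(size=(n, d))                    # n samples of x in R^d

# Empirical moments: M2 = E[x ⊗ x] (d x d), M3 = E[x ⊗ x ⊗ x] (d x d x d)
M2 = np.einsum('ni,nj->ij', X, X) / n
M3 = np.einsum('ni,nj,nk->ijk', X, X, X) / n

# A rank-1 third-order tensor: (u ⊗ v ⊗ w)[i1, i2, i3] = u[i1] * v[i2] * w[i3]
u, v, w = rng.normal(size=d), rng.normal(size=d), rng.normal(size=d)
T = np.einsum('i,j,k->ijk', u, v, w)
assert np.isclose(T[1, 2, 3], u[1] * v[2] * w[3])
```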
Decomposition of Orthogonal Tensors
M3 = Σ_i w_i a_i ⊗ a_i ⊗ a_i. Suppose A has orthogonal columns.
Then M3(I, a_1, a_1) = Σ_i w_i ⟨a_i, a_1⟩² a_i = w_1 a_1.
The a_i are eigenvectors of the tensor M3, analogous to matrix eigenvectors: M v = M(I, v) = λ v.
Two problems remain:
How do we find the eigenvectors of a tensor?
A is not orthogonal in general.

Orthogonal Tensor Power Method
Symmetric orthogonal tensor T ∈ R^{d×d×d}: T = Σ_{i∈[k]} λ_i v_i ⊗ v_i ⊗ v_i.
Recall the matrix power method: v ↦ M(I, v) / ‖M(I, v)‖.
Algorithm (tensor power method): v ↦ T(I, v, v) / ‖T(I, v, v)‖.
How do we avoid spurious solutions (not part of the decomposition)?
• The {v_i} are the only robust fixed points.
• All other eigenvectors are saddle points.
For an orthogonal tensor, there are no spurious local optima!

Whitening: Conversion to an Orthogonal Tensor
M3 = Σ_i w_i a_i ⊗ a_i ⊗ a_i, M2 = Σ_i w_i a_i ⊗ a_i.
Find a whitening matrix W such that W⊤A = V is an orthogonal matrix; when A ∈ R^{d×k} has full column rank, this is an invertible transformation.
Use the pairwise moments M2 to find W: an SVD of M2 is all that is needed.
(Figure: W maps the components a1, a2, a3 to orthonormal directions v1, v2, v3.)

Putting It Together
Non-orthogonal tensor M3 = Σ_i w_i a_i ⊗ a_i ⊗ a_i, with M2 = Σ_i w_i a_i ⊗ a_i.
Compute the whitening matrix W from M2, then apply the multilinear transform T = M3(W, W, W).
Tensor Decomposition: Guaranteed Non-Convex Optimization!
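The full pipeline, whiten with M2, run the tensor power method with deflation on T = M3(W, W, W), then un-whiten, fits in a short numpy sketch. This is an illustrative implementation under simplifying assumptions (exact population moments, positive weights w_i, A with full column rank), not the authors' released code.

```python
import numpy as np

def whiten(M2, k):
    """Whitening matrix W with W.T @ M2 @ W = I_k (M2 symmetric PSD, rank >= k)."""
    U, s, _ = np.linalg.svd(M2)
    return U[:, :k] / np.sqrt(s[:k])

def tensor_power_method(T, n_restarts=10, n_iters=100, rng=None):
    """One (eigenvalue, eigenvector) pair of a symmetric orthogonally decomposable tensor T."""
    rng = rng or np.random.default_rng(0)
    best = None
    for _ in range(n_restarts):
        v = rng.normal(size=T.shape[0])
        v /= np.linalg.norm(v)
        for _ in range(n_iters):
            v = np.einsum('ijk,j,k->i', T, v, v)      # v <- T(I, v, v)
            v /= np.linalg.norm(v)                     # normalize
        lam = np.einsum('ijk,i,j,k->', T, v, v, v)     # lambda = T(v, v, v)
        if best is None or lam > best[0]:
            best = (lam, v)
    return best

# Synthetic moments M2 = sum_i w_i a_i ⊗ a_i and M3 = sum_i w_i a_i ⊗ a_i ⊗ a_i
rng = np.random.default_rng(1)
d, k = 8, 3
A = rng.normal(size=(d, k))                            # unknown components (columns)
w = np.array([0.5, 0.3, 0.2])                          # unknown positive weights
M2 = np.einsum('r,ir,jr->ij', w, A, A)
M3 = np.einsum('r,ir,jr,kr->ijk', w, A, A, A)

W = whiten(M2, k)                                      # from pairwise moments only
T = np.einsum('ai,bj,ck,abc->ijk', W, W, W, M3)        # T = M3(W, W, W): orthogonal tensor

for _ in range(k):                                     # power method + deflation
    lam, v = tensor_power_method(T, rng=rng)
    T = T - lam * np.einsum('i,j,k->ijk', v, v, v)     # deflate the recovered component
    a_hat = lam * (np.linalg.pinv(W.T) @ v)            # un-whiten: a_i = lambda_i (W^T)^+ v_i
    print('weight', 1.0 / lam**2, 'component', a_hat)  # w_i = 1 / lambda_i^2
```

Up to ordering and iteration error, the printed (weight, component) pairs match (w_i, a_i); with empirical rather than exact moments the same pipeline applies.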
For what latent variable models can we obtain such M2 and M3 forms?

Types of Latent Variable Models
What is the form of the hidden variable h?
Basic approach: mixtures/clusters, where h is categorical.
Advanced: probabilistic models, where h has a more general distribution and can model mixed memberships, e.g., a Dirichlet distribution.
(Figure: a graphical model with hidden variables h1, h2, h3 connected to observations x1, ..., x5.)

Topic Modeling

Geometric Picture for Topic Models
A document is generated from a topic proportions vector h; in the single-topic case, h picks one topic.
Words x1, x2, x3, ... are then generated from that topic through the topic-word matrix A.
Linear model: E[x_i | h] = A h.

Moments for Single Topic Models
E[x_i | h] = A h, and w := E[h]. Goal: learn the topic-word matrix A and the vector w.
Pairwise co-occurrence matrix:
M2 := E[x1 ⊗ x2] = E[ E[x1 ⊗ x2 | h] ] = Σ_{i=1}^{k} w_i a_i ⊗ a_i.
Triples tensor:
M3 := E[x1 ⊗ x2 ⊗ x3] = E[ E[x1 ⊗ x2 ⊗ x3 | h] ] = Σ_{i=1}^{k} w_i a_i ⊗ a_i ⊗ a_i.

Moments under LDA
M2 := E[x1 ⊗ x2] − α0/(α0 + 1) · E[x1] ⊗ E[x1]
M3 := E[x1 ⊗ x2 ⊗ x3] − α0/(α0 + 2) · E[x1 ⊗ x2 ⊗ E[x1]] − more stuff...
Then M2 = Σ w̃_i a_i ⊗ a_i and M3 = Σ w̃_i a_i ⊗ a_i ⊗ a_i.
Three words per document suffice for learning LDA.
Similar forms hold for HMMs, ICA, sparse coding, etc.
"Tensor Decompositions for Learning Latent Variable Models" by A. Anandkumar, R. Ge, D. Hsu, S. M. Kakade, and M. Telgarsky. JMLR 2014.

Network Community Models
(Figures: example networks with overlapping communities; edges appear with community-dependent probabilities such as 0.1, 0.3, 0.4, 0.7, 0.8, 0.9.)

Subgraph Counts as Graph Moments
3-Star Count Tensor
M̃3(a, b, c) = (1/|X|) · (# of common neighbors of a, b, c in X) = (1/|X|) Σ_{x∈X} G(x, a) G(x, b) G(x, c),
i.e., M̃3 = (1/|X|) Σ_{x∈X} G⊤_{x,A} ⊗ G⊤_{x,B} ⊗ G⊤_{x,C}.
"A Tensor Spectral Approach to Learning Mixed Membership Community Models" by A. Anandkumar, R. Ge, D. Hsu, and S. M. Kakade. COLT 2013.
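A minimal sketch of the 3-star count tensor computed from an adjacency matrix; the random graph, the choice of the partition X, A, B, C, and all names are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 40
G = (rng.random((n, n)) < 0.2).astype(float)   # adjacency matrix of an illustrative random graph

# Partition the nodes into disjoint sets X (star centers) and A, B, C (leaf parts)
X, A, B, C = np.array_split(rng.permutation(n), 4)

# 3-star count tensor: M3[a, b, c] = (1/|X|) * sum_{x in X} G[x, a] * G[x, b] * G[x, c]
M3 = np.einsum('xa,xb,xc->abc',
               G[np.ix_(X, A)], G[np.ix_(X, B)], G[np.ix_(X, C)]) / len(X)
print(M3.shape)   # (|A|, |B|, |C|): normalized counts of common neighbors in X
```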
Computational Complexity (k ≪ n)
n = # of nodes, k = # of communities, N = # of iterations, c = # of cores.

Step       Space    Time
Whiten     O(nk)    O(nsk/c + k³)
STGD       O(k²)    O(N k³/c)
Unwhiten   O(nk)    O(nsk/c)

Whiten: matrix/vector products and SVD.
STGD: stochastic tensor gradient descent.
Unwhiten: matrix/vector products.
Our approach: O(nsk/c + k³) overall. Embarrassingly parallel and fast!

Tensor Decomposition on GPUs
(Figure: running time in seconds vs. number of communities k, comparing MATLAB Tensor Toolbox (CPU), CULA Standard Interface (GPU), CULA Device Interface (GPU), and Eigen Sparse (CPU).)

Summary of Results
Datasets: Facebook friend network (n ≈ 20k), Yelp user reviews of businesses (n ≈ 40k), DBLP co-authorship (n ≈ 1 million; subsampled to ≈ 100k).

Error (E) and recovery ratio (R):
Dataset              k̂     Method        Running Time   E        R
Facebook (k=360)     500    ours          468            0.0175   100%
Facebook (k=360)     500    variational   86,808         0.0308   100%
Yelp (k=159)         100    ours          287            0.046    86%
Yelp (k=159)         100    variational   N.A.           —        —
DBLP sub (k=250)     500    ours          10,157         0.139    89%
DBLP sub (k=250)     500    variational   558,723        16.38    99%
DBLP (k=6000)        100    ours          5,407          0.105    95%

Thanks to Prem Gopalan and David Mimno for providing the variational code.

Experimental Results on Yelp
Lowest-error business categories and largest-weight businesses:
Rank   Category         Business                    Stars   Review Counts
1      Latin American   Salvadoreno Restaurant      4.0     36
2      Gluten Free      P.F. Chang's China Bistro   3.5     55
3      Hobby Shops      Make Meaning                4.5     14
4      Mass Media       KJZZ 91.5FM                 4.0     13
5      Yoga             Sutra Midtown               4.5     31

Bridgeness: distance from the vector [1/k̂, ..., 1/k̂]⊤.
Top-5 bridging nodes (businesses):
Business               Categories
Four Peaks Brewing     Restaurants, Bars, American, Nightlife, Food, Pubs, Tempe
Pizzeria Bianco        Restaurants, Pizza, Phoenix
FEZ                    Restaurants, Bars, American, Nightlife, Mediterranean, Lounges, Phoenix
Matt's Big Breakfast   Restaurants, Phoenix, Breakfast & Brunch
Cornish Pasty Co       Restaurants, Bars, Nightlife, Pubs, Tempe

Moment Tensors for Associative Models
Multivariate moments: many possibilities... E[x ⊗ y], E[x ⊗ x ⊗ y], E[ψ(x) ⊗ y], ...
Feature transformations of the input: x ↦ ψ(x). How to exploit them? Are moments E[ψ(x) ⊗ y] useful?
If ψ(x) is a matrix/tensor, we have matrix/tensor moments, and we can carry out a spectral decomposition of the moments.

Score Function Features
Higher-order score function: S_m(x) := (−1)^m ∇^(m) p(x) / p(x).
(It can be a matrix or a tensor instead of a vector; the derivative is w.r.t. the parameter or the input.)
Form the cross-moments E[y · S_m(x)].
Extension of Stein's lemma: when E[y|x] := G(x),
E[y · S_m(x)] = E[∇^(m) G(x)].
Spectral decomposition: E[∇^(m) G(x)] = Σ_{j∈[k]} u_j^⊗m.
This can be applied to learning associative latent variable models.
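For intuition, here is a small numerical check of the Stein-type identity above in an assumed simple case: standard Gaussian input x ~ N(0, I), for which the third-order score function has the closed form S3(x)_{ijk} = x_i x_j x_k − x_i δ_{jk} − x_j δ_{ik} − x_k δ_{ij}, and a single-index model G(x) = σ(⟨u, x⟩), so that E[∇³G(x)] is rank one along u. The setup and all names are illustrative assumptions, not the talk's code.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 200_000, 6
X = rng.normal(size=(n, d))                     # x ~ N(0, I_d): score functions are Hermite forms
u = rng.normal(size=d); u /= np.linalg.norm(u)  # hidden direction of the single-index model
y = 1.0 / (1.0 + np.exp(-(X @ u)))              # y = G(x) = sigmoid(<u, x>) (noiseless for clarity)

# Cross-moment E[y * S3(x)] with S3(x)[i,j,k] = x_i x_j x_k - x_i d_jk - x_j d_ik - x_k d_ij
I = np.eye(d)
Eyxxx = np.einsum('n,ni,nj,nk->ijk', y, X, X, X) / n
Eyx = (y[:, None] * X).mean(axis=0)
M3 = (Eyxxx
      - np.einsum('i,jk->ijk', Eyx, I)
      - np.einsum('j,ik->ijk', Eyx, I)
      - np.einsum('k,ij->ijk', Eyx, I))

# Stein: M3 ≈ E[∇^3 G(x)] = E[σ'''(<u, x>)] * u ⊗ u ⊗ u, i.e. (nearly) rank one along u
lead = np.linalg.svd(M3.reshape(d, d * d))[0][:, 0]
print(abs(lead @ u))                            # ≈ 1 up to sign and sampling error
```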
Learning Deep Neural Networks
Realizable setting: E[y|x] = σ_d(A_d σ_{d−1}(A_{d−1} σ_{d−2}(··· A_2 σ_1(A_1 x)))).
M3 = E[y · S3(x)] = Σ_{i∈[r]} λ_i · u_i^⊗3,
where u_i⊤ = e_i⊤ A_1, i.e., the u_i are the rows of A_1.
Guaranteed learning of the weights, layer by layer, via tensor decomposition.
Similar guarantees hold for learning mixtures of classifiers.
Automated extraction of discriminative features.

Conclusion: Guaranteed Non-Convex Optimization
Tensor decomposition: efficient sample and computational complexities; better performance compared to EM, variational Bayes, etc.
In practice: scalable and embarrassingly parallel, so it handles large datasets; efficient performance under perplexity or ground-truth validation.

Related Topics
Overcomplete tensor decomposition: neural network, sparse coding, and ICA models tend to be overcomplete (more neurons than input dimensions).
Provable non-convex iterative methods: robust PCA, dictionary learning, etc.

My Research Group and Resources
Furong Huang, Niranjan UN, Majid Janzamin, Hanie Sedghi, Forough Arabshahi.
ML summer school lectures available at http://newport.eecs.uci.edu/anandkumar/MLSS.html