Spectral Methods for Learning Latent Variable Models: Unsupervised and Supervised Settings
Anima Anandkumar
U.C. Irvine
Learning with Big Data
Data vs. Information

Messy Data
Missing observations, gross corruptions, outliers.
High-dimensional regime: as data grows, the number of variables grows too!
Useful information: low-dimensional structures.
Learning with big data: an ill-posed problem.
Learning is finding a needle in a haystack.
Learning with big data: computationally challenging!

Principled approaches for finding low-dimensional structures?
How to model information structures?

Latent variable models
Incorporate hidden or latent variables.
Information structures: relationships between latent variables and observed data.

Basic approach: mixtures/clusters
Hidden variable is categorical.

Advanced: probabilistic models
Hidden variables have more general distributions.
Can model mixed-membership/hierarchical groups.
[Figure: graphical model with hidden variables h1, h2, h3 and observations x1, ..., x5.]
Latent Variable Models (LVMs)
Document modeling
Observed: words.
Hidden: topics.
Social Network Modeling
Observed: social interactions.
Hidden: communities, relationships.
Recommendation Systems
Observed: recommendations (e.g., reviews).
Hidden: user and business attributes.
Unsupervised Learning: Learn LVM without labeled examples.
LVM for Feature Engineering
Learn good features/representations for classification tasks, e.g.,
computer vision and NLP.
Sparse Coding/Dictionary Learning
Sparse representations, low dimensional hidden structures.
A few dictionary elements make complicated shapes.
Associative Latent Variable Models

Supervised Learning
Given labeled examples {(x_i, y_i)}, learn a classifier ŷ = f(x).
Associative/conditional models: p(y|x).
Example: logistic regression: E[y|x] = σ(⟨u, x⟩).

Mixture of Logistic Regressions
E[y|x, h] = g(⟨U h, x⟩ + ⟨b, h⟩)

Multi-layer/Deep Network
E[y|x] = σ_d(A_d σ_{d−1}(A_{d−1} σ_{d−2}(··· A_2 σ_1(A_1 x))))
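To make the associative model concrete, here is a minimal NumPy sketch (not from the slides) of the conditional-mean map E[y|x] for a two-layer instance; the weights, dimensions, and the tanh/sigmoid nonlinearities are illustrative assumptions.

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def deep_conditional_mean(x, weights, activations):
        """Compute E[y|x] = sigma_d(A_d ... sigma_1(A_1 x)) for the given layer weights."""
        h = x
        for A, sigma in zip(weights, activations):
            h = sigma(A @ h)
        return h

    # Toy example: two layers, input dimension 4, hidden width 3, scalar output.
    rng = np.random.default_rng(0)
    A1 = rng.normal(size=(3, 4))
    A2 = rng.normal(size=(1, 3))
    x = rng.normal(size=4)
    print(deep_conditional_mean(x, [A1, A2], [np.tanh, sigmoid]))  # E[y|x], a value in (0, 1)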
Challenges in Learning LVMs
Computational Challenges
Maximum likelihood estimation is NP-hard in most scenarios.
In practice: local search approaches such as back-propagation, EM, and variational Bayes have no consistency guarantees.
Sample Complexity
Sample complexity is exponential (in the hidden-variable dimension) for many learning methods.
Guaranteed and efficient learning through spectral methods
Outline
1. Introduction
2. Spectral Methods
   Classical Matrix Methods
   Beyond Matrices: Tensors
3. Moment Tensors for Latent Variable Models
   Topic Models
   Network Community Models
   Experimental Results
4. Moment Tensors in Supervised Setting
5. Conclusion
Classical Spectral Methods: Matrix PCA and CCA
Unsupervised Setting: PCA
For centered samples {x_i}, find a projection P with Rank(P) = k s.t.
    min_P (1/n) Σ_{i∈[n]} ‖x_i − P x_i‖².
Result: eigen-decomposition of S = Cov(X).

Supervised Setting: CCA
For centered samples {x_i, y_i}, find
    max_{a,b}  a⊤ Ê[x y⊤] b / √( a⊤ Ê[x x⊤] a · b⊤ Ê[y y⊤] b ).
Result: generalized eigen-decomposition, giving the projections ⟨a, x⟩ and ⟨b, y⟩.
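A minimal NumPy/SciPy sketch of both computations, assuming centered data matrices X (n×d) and Y (n×p); the regularization term and variable names are illustrative choices, not part of the slides.

    import numpy as np
    from scipy.linalg import eigh

    def pca_projection(X, k):
        """Rank-k PCA: eigen-decomposition of the sample covariance S = Cov(X)."""
        S = X.T @ X / X.shape[0]             # X is assumed centered
        eigvals, eigvecs = np.linalg.eigh(S)
        U = eigvecs[:, -k:]                  # top-k eigenvectors
        return U @ U.T                       # the projection matrix P

    def cca_directions(X, Y, reg=1e-8):
        """Top canonical pair (a, b) via a generalized eigenvalue problem."""
        n = X.shape[0]
        Cxx = X.T @ X / n + reg * np.eye(X.shape[1])
        Cyy = Y.T @ Y / n + reg * np.eye(Y.shape[1])
        Cxy = X.T @ Y / n
        # Solve Cxy Cyy^{-1} Cyx a = rho^2 Cxx a (generalized symmetric eigen-problem).
        M = Cxy @ np.linalg.solve(Cyy, Cxy.T)
        _, Va = eigh(M, Cxx)
        a = Va[:, -1]                        # x-direction with the largest correlation
        b = np.linalg.solve(Cyy, Cxy.T @ a)  # matching y-direction
        return a, b / np.linalg.norm(b)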
Shortcomings of Matrix Methods

Learning through Spectral Clustering
Dimension reduction through PCA (on the data matrix).
Clustering of the projected vectors (e.g., k-means).

The basic method works only for single memberships.
It fails to cluster under small separation.

Efficient Learning Without Separation Constraints?
Beyond SVD: Spectral Methods on Tensors
How to learn mixture models without separation constraints?
  - PCA uses the covariance matrix of the data. Are higher-order moments helpful?
A unified framework?
  - Moment-based estimation of probabilistic latent variable models?
SVD gives the spectral decomposition of matrices.
  - What are the analogues for tensors?
Moment Matrices and Tensors

Multivariate Moments in the Unsupervised Setting
M1 := E[x],   M2 := E[x ⊗ x],   M3 := E[x ⊗ x ⊗ x].

Matrix
E[x ⊗ x] ∈ R^{d×d} is a second-order tensor.
E[x ⊗ x]_{i1,i2} = E[x_{i1} x_{i2}].
For matrices: E[x ⊗ x] = E[x x⊤].

Tensor
E[x ⊗ x ⊗ x] ∈ R^{d×d×d} is a third-order tensor.
E[x ⊗ x ⊗ x]_{i1,i2,i3} = E[x_{i1} x_{i2} x_{i3}].

Multivariate Moments in the Supervised Setting
M1 := E[x], E[y],   M2 := E[x ⊗ y],   M3 := E[x ⊗ x ⊗ y].
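In practice these population moments are replaced by empirical averages over samples; a small NumPy sketch (the synthetic data matrix X and its dimensions are placeholders):

    import numpy as np

    def empirical_moments(X):
        """Empirical M1, M2, M3 from samples x_1, ..., x_n stored as rows of X."""
        n = X.shape[0]
        M1 = X.mean(axis=0)                            # E[x]
        M2 = np.einsum('ni,nj->ij', X, X) / n          # E[x ⊗ x]
        M3 = np.einsum('ni,nj,nk->ijk', X, X, X) / n   # E[x ⊗ x ⊗ x]
        return M1, M2, M3

    X = np.random.default_rng(0).normal(size=(1000, 5))
    M1, M2, M3 = empirical_moments(X)
    print(M2.shape, M3.shape)   # (5, 5) and (5, 5, 5)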
Spectral Decomposition of Tensors

Matrix:  M2 = Σ_i λ_i u_i ⊗ v_i = λ_1 u_1 ⊗ v_1 + λ_2 u_2 ⊗ v_2 + ···

Tensor:  M3 = Σ_i λ_i u_i ⊗ v_i ⊗ w_i = λ_1 u_1 ⊗ v_1 ⊗ w_1 + λ_2 u_2 ⊗ v_2 ⊗ w_2 + ···

u ⊗ v ⊗ w is a rank-1 tensor, since its (i1, i2, i3)-th entry is u_{i1} v_{i2} w_{i3}.

How to solve this non-convex problem?
Decomposition of Orthogonal Tensors

M3 = Σ_i w_i a_i ⊗ a_i ⊗ a_i.   Suppose A has orthogonal columns.

M3(I, a_1, a_1) = Σ_i w_i ⟨a_i, a_1⟩² a_i = w_1 a_1.

The a_i are eigenvectors of the tensor M3.
Analogous to matrix eigenvectors: M v = M(I, v) = λ v.

Two Problems
How to find eigenvectors of a tensor?
A is not orthogonal in general.
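The multilinear map M3(I, v, v) used above is just a contraction along two modes; the sketch below builds a toy orthogonal tensor (with standard basis vectors as components, chosen only for illustration) and checks the eigenvector property numerically.

    import numpy as np

    def contract(T, v):
        """T(I, v, v): contract a third-order tensor along its last two modes."""
        return np.einsum('ijk,j,k->i', T, v, v)

    d, k = 4, 3
    w = np.array([2.0, 1.5, 0.5])
    A = np.eye(d)[:, :k]                             # orthonormal columns a_1, ..., a_k
    M3 = np.einsum('i,ji,ki,li->jkl', w, A, A, A)    # sum_i w_i a_i ⊗ a_i ⊗ a_i

    a1 = A[:, 0]
    print(contract(M3, a1))                          # equals w_1 a_1 = [2, 0, 0, 0]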
Orthogonal Tensor Power Method

Symmetric orthogonal tensor T ∈ R^{d×d×d}:   T = Σ_{i∈[k]} λ_i v_i ⊗ v_i ⊗ v_i.

Recall the matrix power method:  v ↦ M(I, v) / ‖M(I, v)‖.

Algorithm: tensor power method:  v ↦ T(I, v, v) / ‖T(I, v, v)‖.

How do we avoid spurious solutions (not part of the decomposition)?
  - The {v_i} are the only robust fixed points.
  - All other eigenvectors are saddle points.

For an orthogonal tensor, no spurious local optima!
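A minimal sketch of the tensor power method with random restarts and deflation, for a symmetric orthogonally decomposable tensor; the iteration and restart counts are arbitrary choices for the example, not prescribed by the slides.

    import numpy as np

    def power_iteration(T, n_iters=100, n_restarts=10, rng=None):
        """One (eigenvalue, eigenvector) pair of a symmetric orthogonal tensor T."""
        rng = rng if rng is not None else np.random.default_rng()
        d = T.shape[0]
        best_lam, best_v = -np.inf, None
        for _ in range(n_restarts):                       # restarts guard against saddle points
            v = rng.normal(size=d)
            v /= np.linalg.norm(v)
            for _ in range(n_iters):
                v = np.einsum('ijk,j,k->i', T, v, v)      # v <- T(I, v, v)
                v /= np.linalg.norm(v)
            lam = np.einsum('ijk,i,j,k->', T, v, v, v)    # eigenvalue T(v, v, v)
            if lam > best_lam:
                best_lam, best_v = lam, v
        return best_lam, best_v

    def decompose(T, k):
        """Recover k components of T by power iteration plus deflation."""
        lams, vecs = [], []
        for _ in range(k):
            lam, v = power_iteration(T)
            lams.append(lam)
            vecs.append(v)
            T = T - lam * np.einsum('i,j,k->ijk', v, v, v)   # deflate the recovered component
        return np.array(lams), np.array(vecs)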
Whitening: Conversion to Orthogonal Tensor

M3 = Σ_i w_i a_i ⊗ a_i ⊗ a_i,   M2 = Σ_i w_i a_i ⊗ a_i.

Find a whitening matrix W s.t. W⊤A = V is an orthogonal matrix.
When A ∈ R^{d×k} has full column rank, this is an invertible transformation.
Use the pairwise moments M2 to find W; an SVD of M2 is needed.
[Figure: W maps the non-orthogonal components a_1, a_2, a_3 to orthonormal directions v_1, v_2, v_3.]
Putting it together

Non-orthogonal tensor M3 = Σ_i w_i a_i ⊗ a_i ⊗ a_i,   M2 = Σ_i w_i a_i ⊗ a_i.

Whitening matrix W computed from M2.
Multilinear transform: T = M3(W, W, W).
[Figure: W maps the tensor M3 with components a_1, a_2, a_3 to an orthogonal tensor T with components v_1, v_2, v_3.]

Tensor Decomposition: Guaranteed Non-Convex Optimization!

For what latent variable models can we obtain M2 and M3 of this form?
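Putting the whitening step and the power method together, here is a sketch of the whiten → decompose → un-whiten pipeline under the stated assumptions (A has full column rank, M2 and M3 share the same components); it reuses the decompose routine from the power-method sketch above and recovers each a_i up to estimation error.

    import numpy as np

    def whiten_and_decompose(M2, M3, k):
        """Estimate (w_i, a_i) from M2 = sum_i w_i a_i ⊗ a_i and M3 = sum_i w_i a_i ⊗ a_i ⊗ a_i."""
        # Whitening matrix W from the rank-k eigen-decomposition of M2, so that W^T M2 W = I_k.
        eigvals, eigvecs = np.linalg.eigh(M2)
        U, s = eigvecs[:, -k:], eigvals[-k:]
        W = U / np.sqrt(s)
        # Multilinear transform: T = M3(W, W, W) has an orthogonal decomposition.
        T = np.einsum('abc,ai,bj,ck->ijk', M3, W, W, W)
        lams, V = decompose(T, k)                 # tensor power method from the sketch above
        # Un-whiten: a_i = lambda_i * pinv(W^T) v_i and w_i = 1 / lambda_i^2.
        B = np.linalg.pinv(W.T)
        A_hat = (B @ V.T) * lams                  # columns are the estimated a_i
        w_hat = 1.0 / lams**2
        return w_hat, A_hat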
Types of Latent Variable Models

What is the form of the hidden variable h?

Basic approach: mixtures/clusters
Hidden variable h is categorical.

Advanced: probabilistic models
Hidden variable h has more general distributions.
Can model mixed memberships, e.g., a Dirichlet distribution.
[Figure: graphical model with hidden variables h1, h2, h3 and observations x1, ..., x5.]
Topic Modeling
Geometric Picture for Topic Models

[Figure: topic proportions vector h for a document; the single-topic case.]

Word generation (x1, x2, . . .): each word is generated through the topic-word matrix A.
Linear model: E[x_i | h] = A h.
Moments for Single Topic Models

E[x_i | h] = A h,   w := E[h].
Learn the topic-word matrix A and the vector w.
[Figure: graphical model with topic h and words x1, . . . , x5, each generated through A.]

Pairwise Co-occurrence Matrix M2
M2 := E[x1 ⊗ x2] = E[E[x1 ⊗ x2 | h]] = Σ_{i=1}^{k} w_i a_i ⊗ a_i

Triples Tensor M3
M3 := E[x1 ⊗ x2 ⊗ x3] = E[E[x1 ⊗ x2 ⊗ x3 | h]] = Σ_{i=1}^{k} w_i a_i ⊗ a_i ⊗ a_i
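A sketch of how the empirical versions of M2 and M3 can be accumulated from the first three words of each document, treating each word as a one-hot vector; the toy documents and vocabulary are placeholders, and a practical implementation would also average over word positions.

    import numpy as np

    def single_topic_moments(docs, vocab_size):
        """Empirical M2 = E[x1 ⊗ x2] and M3 = E[x1 ⊗ x2 ⊗ x3] from word triples.

        docs is a list of word-id sequences; each word x_t is a one-hot vector,
        so the expectations reduce to co-occurrence counts."""
        M2 = np.zeros((vocab_size, vocab_size))
        M3 = np.zeros((vocab_size, vocab_size, vocab_size))
        n = 0
        for doc in docs:
            if len(doc) < 3:
                continue                      # need at least three words per document
            w1, w2, w3 = doc[0], doc[1], doc[2]
            M2[w1, w2] += 1
            M3[w1, w2, w3] += 1
            n += 1
        return M2 / n, M3 / n

    docs = [[0, 2, 1], [2, 2, 0], [1, 0, 2], [0, 1, 1]]   # toy documents over a 3-word vocabulary
    M2, M3 = single_topic_moments(docs, vocab_size=3)
    print(M2)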
Moments under LDA

M2 := E[x1 ⊗ x2] − (α0 / (α0 + 1)) E[x1] ⊗ E[x1]
M3 := E[x1 ⊗ x2 ⊗ x3] − (α0 / (α0 + 2)) E[x1 ⊗ x2 ⊗ E[x1]] − more stuff...

Then
M2 = Σ_i w̃_i a_i ⊗ a_i,   M3 = Σ_i w̃_i a_i ⊗ a_i ⊗ a_i.

Three words per document suffice for learning LDA.
Similar forms hold for HMMs, ICA, sparse coding, etc.

"Tensor Decompositions for Learning Latent Variable Models" by A. Anandkumar, R. Ge, D. Hsu, S. M. Kakade and M. Telgarsky. JMLR 2014.
Network Community Models

[Figure: example network with mixed-membership communities; the numeric labels in the original figure are probabilities associated with the nodes and edges.]
Subgraph Counts as Graph Moments

3-Star Count Tensor
M̃3(a, b, c) = (1/|X|) · (# of common neighbors of a, b, c within X)
            = (1/|X|) Σ_{x∈X} G(x, a) G(x, b) G(x, c).

M̃3 = (1/|X|) Σ_{x∈X} [G⊤_{x,A} ⊗ G⊤_{x,B} ⊗ G⊤_{x,C}]

[Figure: a 3-star centered at a node x ∈ X with leaves a ∈ A, b ∈ B, c ∈ C.]

"A Tensor Spectral Approach to Learning Mixed Membership Community Models" by A. Anandkumar, R. Ge, D. Hsu, and S. M. Kakade. COLT 2013.
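A sketch of the 3-star count tensor computed from an adjacency matrix G, with the nodes split into a center set X and leaf sets A, B, C; the random graph and the particular partition are illustrative only.

    import numpy as np

    def three_star_tensor(G, X, A, B, C):
        """M3[a, b, c] = (1/|X|) * sum_{x in X} G[x, a] G[x, b] G[x, c]."""
        GxA = G[np.ix_(X, A)]                 # |X| x |A| sub-adjacency matrix; similarly for B, C
        GxB = G[np.ix_(X, B)]
        GxC = G[np.ix_(X, C)]
        return np.einsum('xa,xb,xc->abc', GxA, GxB, GxC) / len(X)

    rng = np.random.default_rng(0)
    n = 12
    G = (rng.random((n, n)) < 0.3).astype(float)          # toy random adjacency matrix
    nodes = np.arange(n)
    X, A, B, C = nodes[:3], nodes[3:6], nodes[6:9], nodes[9:]
    print(three_star_tensor(G, X, A, B, C).shape)         # (3, 3, 3)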
Computational Complexity (k ≪ n)

n = # of nodes, k = # of communities, N = # of iterations, c = # of cores.

            Space     Time
Whiten      O(nk)     O(nsk/c + k³)
STGD        O(k²)     O(N k³/c)
Unwhiten    O(nk)     O(nsk/c)

Whiten: matrix/vector products and SVD.
STGD: Stochastic Tensor Gradient Descent.
Unwhiten: matrix/vector products.

Our approach: O(nsk/c + k³). Embarrassingly parallel and fast!
Tensor Decomposition on GPUs

[Figure: running time (seconds, log scale) versus number of communities k, comparing MATLAB Tensor Toolbox (CPU), CULA Standard Interface (GPU), CULA Device Interface (GPU), and Eigen Sparse (CPU).]
Summary of Results

Datasets
Facebook (friendship links among users): n ∼ 20k
Yelp (reviews linking users and businesses): n ∼ 40k
DBLP (co-authorship links among authors): n ∼ 1 million (subsampled ∼ 100k)

Error (E) and Recovery ratio (R)

Dataset            k̂     Method        Running Time   E        R
Facebook (k=360)   500    ours          468            0.0175   100%
Facebook (k=360)   500    variational   86,808         0.0308   100%
Yelp (k=159)       100    ours          287            0.046    86%
Yelp (k=159)       100    variational   N.A.
DBLP sub (k=250)   500    ours          10,157         0.139    89%
DBLP sub (k=250)   500    variational   558,723        16.38    99%
DBLP (k=6000)      100    ours          5,407          0.105    95%

Thanks to Prem Gopalan and David Mimno for providing the variational code.
Experimental Results on Yelp

Lowest-error business categories & largest-weight businesses

Rank   Category          Business                      Stars   Review Counts
1      Latin American    Salvadoreno Restaurant        4.0     36
2      Gluten Free       P.F. Chang's China Bistro     3.5     55
3      Hobby Shops       Make Meaning                  4.5     14
4      Mass Media        KJZZ 91.5FM                   4.0     13
5      Yoga              Sutra Midtown                 4.5     31

Bridgeness: distance from the vector [1/k̂, . . . , 1/k̂]⊤

Top-5 bridging nodes (businesses)

Business               Categories
Four Peaks Brewing     Restaurants, Bars, American, Nightlife, Food, Pubs, Tempe
Pizzeria Bianco        Restaurants, Pizza, Phoenix
FEZ                    Restaurants, Bars, American, Nightlife, Mediterranean, Lounges, Phoenix
Matt's Big Breakfast   Restaurants, Phoenix, Breakfast & Brunch
Cornish Pasty Co       Restaurants, Bars, Nightlife, Pubs, Tempe
Moment Tensors for Associative Models

Multivariate Moments: many possibilities...
E[x ⊗ y], E[x ⊗ x ⊗ y], E[ψ(x) ⊗ y], . . .

Feature Transformations of the Input: x ↦ ψ(x)
How to exploit them? Are the moments E[ψ(x) ⊗ y] useful?
If ψ(x) is a matrix/tensor, we have matrix/tensor moments.
We can then carry out a spectral decomposition of these moments.
Score Function Features

Higher-order score function:  S_m(x) := (−1)^m ∇^(m) p(x) / p(x)
  ∗ Can be a matrix or a tensor instead of a vector.
  ∗ Derivative w.r.t. the parameter or the input.

Form the cross-moments: E[y · S_m(x)].

Extension of Stein's lemma:  E[y · S_m(x)] = E[∇^(m) G(x)]  when E[y|x] := G(x).

Spectral decomposition:  E[∇^(m) G(x)] = Σ_{j∈[k]} u_j^⊗m

Can be applied for learning of associative latent variable models.
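As a concrete special case (an assumption not stated on this slide): when x is standard Gaussian, the score functions are the Hermite tensors, e.g. S1(x) = x, S2(x) = x x⊤ − I, and S3(x) as below; the sketch forms the empirical cross-moment E[y · S3(x)] under that assumption.

    import numpy as np

    def score3_gaussian(x):
        """Third-order score function S3(x) for x ~ N(0, I): the third Hermite tensor."""
        d = x.shape[0]
        I = np.eye(d)
        xxx = np.einsum('i,j,k->ijk', x, x, x)
        sym = (np.einsum('i,jk->ijk', x, I)
               + np.einsum('j,ik->ijk', x, I)
               + np.einsum('k,ij->ijk', x, I))
        return xxx - sym

    def cross_moment(X, y):
        """Empirical E[y · S3(x)] from samples: rows of X and entries of y."""
        d = X.shape[1]
        M3 = np.zeros((d, d, d))
        for xi, yi in zip(X, y):
            M3 += yi * score3_gaussian(xi)
        return M3 / len(y)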
Learning Deep Neural Networks

Realizable setting: E[y|x] = σ_d(A_d σ_{d−1}(A_{d−1} σ_{d−2}(··· A_2 σ_1(A_1 x))))

M3 = E[y · S3(x)] = Σ_{i∈[r]} λ_i · u_i^⊗3,   where u_i = e_i⊤ A_1 are the rows of A_1.

Guaranteed learning of the weights (layer by layer) via tensor decomposition.
Similar guarantees for learning mixtures of classifiers.

Automated Extraction of Discriminative Features
Conclusion: Guaranteed Non-Convex Optimization
Tensor Decomposition
Efficient sample and computational complexities
Better performance compared to EM, Variational Bayes, etc.
In practice
Scalable and embarrassingly parallel: handles large datasets.
Good performance, validated via perplexity or against ground truth.
Related Topics
Overcomplete Tensor Decomposition: Neural networks, sparse
coding and ICA models tend to be overcomplete (more neurons than
input dimensions).
Provable Non-Convex Iterative Methods: Robust PCA, Dictionary
learning etc.
My Research Group and Resources
Furong Huang
Niranjan UN
Majid Janzamin
Hanie Sedghi
Forough Arabshahi
ML summer school lectures available at
http://newport.eecs.uci.edu/anandkumar/MLSS.html