Matrix Factorization
Models, Algorithms and Applications
Shuang Hong Yang
http://www.cc.gatech.edu/~syang46/
shyang@gatech.edu
May 2010
Outline
• Problem Definition
• Overview
– Taxonomy by targeted tasks
– Taxonomy by models
– Taxonomy by algorithms
• Representative work
• Summary and Discussion
2
Outline
• Problem Definition
• Overview
– Taxonomy by targeted tasks
– Taxonomy by models
– Taxonomy by algorithms
• Representative work
• Summary and Discussion
3
Problem Definition
• Matrix Factorization: for a given matrix M, find a compact (low-rank)
approximation
[Diagram: each entry M(a,b) is approximated by u(a)ᵀ D v(b), where u(a) is the latent factor of row entity a, v(b) that of column entity b, and D a core matrix]
– M may be partially observed (i.e., some entries are missing)
– In the simplest form:
(U, V) = arg min_{U,V} ||M − UᵀV||_F²
 an identity function f(x) = x is used as the link function
 U, V and D interact in a multiplicative fashion
 D is assumed to be the identity matrix
 Euclidean distance is used as the measure of goodness
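As a concrete illustration (my own sketch, not from the original slides), the simplest form above can be fit by gradient descent on the observed entries; a small L2 term is added only for numerical stability:

import numpy as np

def factorize(M, mask, rank=10, lr=0.01, reg=0.1, iters=200, seed=0):
    # Fit M ≈ UᵀV on the observed entries (mask == True) by gradient descent.
    rng = np.random.default_rng(seed)
    n_rows, n_cols = M.shape
    U = 0.1 * rng.standard_normal((rank, n_rows))   # column a of U is u(a)
    V = 0.1 * rng.standard_normal((rank, n_cols))   # column b of V is v(b)
    for _ in range(iters):
        E = np.where(mask, U.T @ V - M, 0.0)        # residual on observed entries only
        U -= lr * (V @ E.T + reg * U)               # gradient step for U
        V -= lr * (U @ E + reg * V)                 # gradient step for V
    return U, V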
4
Problem Definition (cont)
• Matrix Co-Factorization
For a given set of related matrices {M}, find a coupled set of compact (low-rank)
approximations
– Each M represents observed interactions between two types of entities
– Multi-View MF:
– Joint MF:
[Diagram: three entity types a, b and c linked by observed interaction matrices]
5
Outline
• Problem Definition
• Overview
– Taxonomy by targeted tasks
– Taxonomy by models
– Taxonomy by algorithms
• Representative work
• Summary and Discussion
6
Overview: MF taxonomy
• Targeted tasks:
– Dimensionality reduction
 PCA and other spectral algorithms
 LSI and other SVD algorithms
 NMF and other (convex) optimization algorithms
 PPCA and other statistical models
– Clustering
 k-means, mean-shift, min-cut, normalized cut, NMF, etc.
 Gaussian mixture model, Bi-Gaussian model, etc.
– Factor analysis (e.g., profiling, decomposition)
 ICA, CCA, NMF, etc.
 SVD, MMMF, etc.
– Codebook learning
 Sparse coding, k-means, NMF, LDA, etc.
– Topic modeling
 LSI, LDA, PLSI, etc.
– Graph mining
 Random walk, PageRank, HITS, etc.
– Prediction
 Classification, regression
 Link prediction, matrix completion, community detection
 Collaborative filtering, recommendation, learning to rank
 Domain adaptation, multi-task learning
7
Overview: MF taxonomy
• Models:
– Computational: by optimization; models differ in objective and regularizer design
 Objective:
 L2 error minimization (least squares; Frobenius norm in matrix form)
 L1 error minimization (least absolute deviation)
 Hinge, logistic, log, cosine loss
 Huber loss, ε-insensitive loss, etc.
 Information-theoretic loss: entropy, mutual information, KL-divergence
 Exponential-family loss and Bregman divergence: logistic, log, etc.
 Graph Laplacian and smoothness
 Joint loss of fitting error and prediction accuracy
 Regularizer:
 L2 norm, L1 norm, Ky-Fan norm (e.g., nuclear norm)
 Graph Laplacian and smoothness
 Lower & upper bound constraints (e.g., positivity constraints)
 Other constraints: linear constraints (e.g., probabilistic validity), quadratic constraints (e.g., covariance), orthogonality constraints
– Statistical: by inference; models differ in factorization, prior and conditional design
 Factorization:
 Symmetric: p(a,b) = Σ_z p(z) p(a|z) p(b|z)
 Asymmetric: p(a,b) = p(a) Σ_z p(z|a) p(b|z)
 Conditional: usually exponential family
 Gaussian, Laplacian, Multinomial, Bernoulli, Poisson
 Prior:
 Conjugate prior
 Popular choices: Gaussian, Laplacian (or exponential), Dirichlet
 Non-informative priors: max-entropy prior, etc.
 Nonparametric: ARD, Chinese restaurant process, Indian buffet process, etc.
8
Overview: MF taxonomy
• Models:
– Connection between the two lines [Collins et al, NIPS 2002; Long et al, KDD 2007; Singh et al, ECML 2008]
 Loss function (computational) ↔ conditional distribution (statistical):
L2 ↔ Gaussian; L1 ↔ Laplacian; logistic ↔ Bernoulli; Bregman/KL ↔ exponential family; …
 Regularization (computational) ↔ prior (statistical):
L2 ↔ Gaussian; L1 ↔ Laplacian/exponential; Laplacian smoothness ↔ Gaussian random field; …
9
Overview: MF taxonomy
• Algorithm:
– Deterministic:
• Spectral analysis
• Matrix decomposition: SVD, QR, LU
• Solving linear systems
• Optimization: LP, gradient descent, conjugate gradient, quasi-Newton, SDP, etc.
• Alternating coordinate descent
• LARS, IRLS
• EM
• Mean field, Variational Bayes, Expectation Propagation, collapsed VB
• …
– Stochastic:
 Stochastic gradient descent (back propagation, message passing)
 Monte Carlo: MCMC, Gibbs sampling, collapsed MCMC
 Random walk
 Simulated annealing, annealing EM
 Randomized projection
 …
10
Outline
• Problem Definition
• Overview
– Taxonomy by targeted tasks
– Taxonomy by models
– Taxonomy by algorithms
• Representative work
• Summary and Discussion
11
Representative Work
• Spectral dimensionality reduction / clustering:
– PCA:
• L2 loss + orthogonality constraint
min ||M − UᵀV||_F², subject to: VᵀV = I
• Solve by spectral analysis of MᵀM (see the sketch below)
• Analogous to factor-based collaborative filtering
– Laplacian eigenmap:
 Laplacian smoothness + orthogonality constraint
min Σ_ij w_ij ||u_i V − u_j V||², subject to: VᵀV = I
 Graph encodes neighborhood information, e.g., heat kernel, kNN
 Analogous to neighbor-based collaborative filtering
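For the PCA formulation above, a minimal sketch (my own illustration) computes V from the top eigenvectors of MᵀM; here V holds the eigenvectors as columns, so M ≈ U Vᵀ (a transposed convention relative to the slide):

import numpy as np

def pca_factorize(M, k):
    # Rank-k PCA via the eigendecomposition of the Gram matrix MᵀM
    # (optionally center M's columns first).
    evals, evecs = np.linalg.eigh(M.T @ M)
    V = evecs[:, np.argsort(evals)[::-1][:k]]   # top-k eigenvectors as columns
    U = M @ V                                   # scores / projections
    return U, V                                 # M ≈ U @ V.T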
12
Representative Work
• Plain CF:
– Factor based (Ky-Fan):
• L2 loss + L2 regularization
min ||M − UᵀV||_F² + C1||U||_F² + C2||V||_F²
• Solve by SVD
• Analogous to LSI, PCA, etc.
– Neighbor based (item or query oriented):
 Does not explicitly perform factorization, but can be viewed equivalently as
min Σ_ij w_ij ||u_i M_i − u_j M_i||², subject to: Σ_v u_iv = 1, u_iv ≥ 0
 Graph encodes neighborhood information, e.g., heat kernel, kNN
 Analogous to k-means, Laplacian eigenmap
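A minimal item-based kNN sketch (my own illustration; it uses cosine similarity and a weighted average rather than the exact constrained formulation above):

import numpy as np

def predict_item_knn(M, mask, k=10):
    # M: users × items rating matrix; mask: True where a rating is observed.
    maskf = mask.astype(float)
    R = np.where(mask, M, 0.0)
    norms = np.linalg.norm(R, axis=0) + 1e-12
    S = (R.T @ R) / np.outer(norms, norms)       # item-item cosine similarity
    np.fill_diagonal(S, 0.0)
    pred = np.zeros_like(R)
    for j in range(R.shape[1]):
        nbrs = np.argsort(S[:, j])[::-1][:k]     # k most similar items to item j
        w = S[nbrs, j]
        pred[:, j] = (R[:, nbrs] @ w) / (maskf[:, nbrs] @ w + 1e-12)
    return pred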
13
Representative Work
• Joint factor-neighbor based CF:
[Koren: Factorization Meets the Neighborhood: A Multifaceted Collaborative Filtering Model, KDD’08]
– L2 loss + L2 regularization
– Neighbor graph constructed by Pearson correlation
– Solve by stochastic gradient descent
– Analogous to locality- (Laplacian-) regularized PCA
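A minimal SGD sketch for the factor part of such a model (my own illustration; user/item biases and the neighborhood term of Koren's full model are omitted):

import numpy as np

def sgd_mf(ratings, n_users, n_items, rank=20, lr=0.01, reg=0.05, epochs=20, seed=0):
    # ratings: list of (user, item, value) triples for the observed entries.
    rng = np.random.default_rng(seed)
    P = 0.1 * rng.standard_normal((n_users, rank))
    Q = 0.1 * rng.standard_normal((n_items, rank))
    for _ in range(epochs):
        for idx in rng.permutation(len(ratings)):
            u, i, r = ratings[idx]
            pu = P[u].copy()
            err = r - pu @ Q[i]                   # prediction error for this rating
            P[u] += lr * (err * Q[i] - reg * pu)
            Q[i] += lr * (err * pu - reg * Q[i])
    return P, Q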
14
Representative Work
• Max-margin matrix factorization:
– Max-margin dimensionality reduction: [a lot of work here]
• Hinge loss + L2 regularization
min Σ_ij h(y_ij − u_iᵀD v_j) + C1||D||_F² + C2||U||_F² + C3||V||_F²
• Solve by SDP, cutting plane, etc.
– Max-Margin Matrix Factorization: [Srebro et al, NIPS 2005, ALT 2005]
 Hinge loss + Ky-Fan norm
min Σ_ij h(m_ij − u_iᵀv_j) + C1||U||_F² + C2||V||_F²
 Note: no constraint on the rank of U or V
 Solve by SDP
– CoFi-Rank: [Weimer et al, NIPS 2009]
 NDCG + Ky-Fan norm
min Σ_ij n(m_ij − u_iᵀv_j) + C1||U||_F² + C2||V||_F²
 Note: no constraint on the rank of U or V
 Solve by SDP, bundle methods
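A rough stochastic subgradient sketch for hinge-loss matrix factorization (my own illustration; it uses the standard binary hinge max(0, 1 − y·u_iᵀv_j) and plain L2 regularization, not the SDP or bundle-method solvers cited above):

import numpy as np

def hinge_mf(obs, n_rows, n_cols, rank=10, lr=0.01, reg=0.05, epochs=50, seed=0):
    # obs: list of (i, j, y) with labels y in {-1, +1}.
    rng = np.random.default_rng(seed)
    U = 0.1 * rng.standard_normal((n_rows, rank))
    V = 0.1 * rng.standard_normal((n_cols, rank))
    for _ in range(epochs):
        for idx in rng.permutation(len(obs)):
            i, j, y = obs[idx]
            ui, vj = U[i].copy(), V[j].copy()
            margin = y * (ui @ vj)
            # Subgradient of max(0, 1 - margin) plus the L2 terms.
            gu = (-y * vj if margin < 1 else 0.0) + reg * ui
            gv = (-y * ui if margin < 1 else 0.0) + reg * vj
            U[i] -= lr * gu
            V[j] -= lr * gv
    return U, V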
15
Representative Work
• Sparse coding:
[Lee et al, NIPS 2007] [Lee et al, IJCAI 2009]
– L2 sparse coding:
• L2 loss + L1 regularization
min ||M − UᵀV||_F² + C1||U||_1 + C2||V||_F²
• Solve by LARS, IRLS, gradient descent with feature-sign search (see the sketch below)
– Exponential family sparse coding:
 Bregman divergence + L1 regularization
min Σ_ab D(M_ab || g(u_aᵀv_b)) + C1||U||_1 + C2||V||_F²
 Solve by gradient descent with feature-sign search
– Sparse is good --- my guess:
• More compact usually implies more predictive
• Sparsity imposes a stronger prior, making local optima more distinguishable
• Shorter description length (the principle of MDL)
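For the L2 sparse coding objective above, a minimal ISTA (iterative soft-thresholding) sketch is given here; this is my own illustration rather than the LARS/feature-sign solvers cited, and it only solves for the codes U with the dictionary V held fixed:

import numpy as np

def soft_threshold(x, t):
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

def sparse_codes(M, V, lam=0.1, iters=200):
    # min_U ||M − UᵀV||_F² + lam·||U||_1 with the dictionary V (k × n_b) fixed.
    U = np.zeros((V.shape[0], M.shape[0]))
    L = 2.0 * np.linalg.norm(V @ V.T, 2) + 1e-12    # Lipschitz constant of the gradient
    for _ in range(iters):
        grad = 2.0 * V @ (U.T @ V - M).T            # gradient of the quadratic term w.r.t. U
        U = soft_threshold(U - grad / L, lam / L)   # ISTA proximal step
    return U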
16
Representative Work
• NMF, LDA, and Exponential PCA
– NMF: [Lee et al, NIPS 2001]
• L2 loss + nonnegativity constraint
min ||M − UᵀV||_F², subject to: U ≥ 0, V ≥ 0
• Solve by SDP, projected gradient descent, interior point (see the sketch below)
– LDA: [Blei et al, NIPS 2002]
 Asymmetric + Multinomial conditional + conjugate (Dirichlet) prior
u_a ~ Dir(α), z_ab ~ Disc(u_a), M_ab ~ Mult(V, z_ab)
 Variational Bayesian, EP, Gibbs sampling, collapsed VB/GS
– Exponential PCA: [Collins et al, NIPS 2002]
 Bregman divergence + orthogonality constraint
min Σ_ab D(M_ab || g(u_aᵀv_b)), subject to: VᵀV = I
 Solved by gradient descent
– Essentially, these are equivalent to each other
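For the NMF objective above, a minimal sketch of the multiplicative updates popularized by the cited Lee et al. paper (written in the standard W·H form; in the deck's notation U = Wᵀ and V = H):

import numpy as np

def nmf(M, k, iters=200, eps=1e-9, seed=0):
    # Multiplicative updates for min ||M − W·H||_F² with W, H ≥ 0.
    rng = np.random.default_rng(seed)
    m, n = M.shape
    W = rng.random((m, k)) + eps
    H = rng.random((k, n)) + eps
    for _ in range(iters):
        H *= (W.T @ M) / (W.T @ W @ H + eps)   # update H with W fixed
        W *= (M @ H.T) / (W @ H @ H.T + eps)   # update W with H fixed
    return W, H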
17
Representative Work
• Link analysis:
– Factor based / bi-clustering:
[a lot of papers in co-clustering and social network analysis]
• L2 loss + L2 regularization
min ||M − UᵀDV||_F² + C1||U||_F² + C2||V||_F² + C3||D||_F²
• To further simplify, assume D is diagonal or even the identity
• Modern models use logistic regression
– Bayesian Co-clustering [Shan et al ICDM 2008]
Or Mixed membership stochastic block model [Airoldi et al, NIPS 2008]
 Symmetric + Bernoulli conditional + Dirichlet prior
u_i ~ Dir(α), z_i ~ Disc(u_i), M_ij ~ Bernoulli(sigmoid(z_iᵀD z_j))
– Nonparametric feature model: [Miller et al, NIPS 2010]
 Symmetric + Bernoulli conditional + Nonparametric prior
z_i ~ IBP(α), M_ij ~ Bernoulli(sigmoid(z_iᵀD z_j))
– In essence, equivalent
18
Representative Work
• Joint Link & content analysis:
– Collective factorization:
• L2 loss + L2 regularization [Long et al, ICML 2006, AAAI 2008; Zhou et al, WWW 2008]
• Or Laplacian smoothness loss + orthogonal [Zhou et al ICML 2007]
• Shared representation matrix (see the sketch at the end of this slide)
min ||M − UᵀDU||_F² + ||F − UᵀB||_F² + C1||U||_F² + C2||B||_F² + C3||D||_F²
– Relational topic model:
[Chang et al, AISTATS 2009, KDD 2009]
 For M: Symmetric + Bernoulli conditional + Dirichlet Prior
 For F: Asymmetric + Multinomial conditional + Dirichlet Prior
 Shared representation matrix
u_i ~ Dir(α), z_if ~ Disc(u_i), F_if ~ Mult(B, z_if), M_ij ~ Bernoulli(sigmoid(z_iᵀD z_j))
– Regression based latent factor model:
[Agarwal et al, KDD 2009]
 For M: Symmetric + Gaussian conditional + Gaussian Prior
 For F: Linear regression (Gaussian)
z_i ~ Gaussian(B·F_i, σI), M_ij ~ Gaussian(z_iᵀz_j)
– fLDA model:
[Agarwal et al, WSDM 2009]
 LDA content factorization + Gaussian factorization model
– In essence, equivalent
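As a rough illustration of the collective factorization objective above (my own sketch, with D fixed to the identity for simplicity), both the link matrix M and the content matrix F are fit with a shared U by joint gradient descent:

import numpy as np

def collective_mf(M, F, rank=10, lr=1e-3, c1=0.1, c2=0.1, iters=500, seed=0):
    # Joint gradient descent for min ||M − UᵀU||_F² + ||F − UᵀB||_F²
    #                                + c1||U||_F² + c2||B||_F²
    rng = np.random.default_rng(seed)
    U = 0.1 * rng.standard_normal((rank, M.shape[0]))
    B = 0.1 * rng.standard_normal((rank, F.shape[1]))
    for _ in range(iters):
        E1 = U.T @ U - M                         # link reconstruction error
        E2 = U.T @ B - F                         # content reconstruction error
        gU = 2 * U @ (E1 + E1.T) + 2 * B @ E2.T + 2 * c1 * U
        gB = 2 * U @ E2 + 2 * c2 * B
        U -= lr * gU
        B -= lr * gB
    return U, B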
19
Representative Work
• Tensor factorization/hypergraph mining and personalized CF:
– Two-way model: [Rendle et al, WSDM 2010, WWW 2010]
min ||M_ijk − u_iᵀD v_j − u_iᵀD w_k − v_jᵀD w_k||_F² + C(||U||_F² + ||V||_F² + ||W||_F² + ||D||_F²)
[Diagram: entry M_ijk modeled by the sum of pairwise interactions among the factors u_i, v_j and w_k]
– Full factorization: [Symeonidis et al, RecSys 2008; Rendle et al, KDD 2009]
min ||M_ijk − ⟨u_i, v_j, w_k⟩||_F² + C(||U||_F² + ||V||_F² + ||W||_F²)
[Diagram: entry M_ijk modeled by the three-way inner product ⟨u_i, v_j, w_k⟩]
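A minimal sketch of the full (CP-style) factorization fit by stochastic gradient descent on observed entries (my own illustration, not the solvers used in the cited papers); here ⟨u_i, v_j, w_k⟩ = Σ_f u_if·v_jf·w_kf:

import numpy as np

def cp_factorize(obs, dims, rank=8, lr=0.01, reg=0.05, epochs=100, seed=0):
    # obs: list of (i, j, k, value) for observed tensor entries; dims = (n_i, n_j, n_k).
    rng = np.random.default_rng(seed)
    U = 0.1 * rng.standard_normal((dims[0], rank))
    V = 0.1 * rng.standard_normal((dims[1], rank))
    W = 0.1 * rng.standard_normal((dims[2], rank))
    for _ in range(epochs):
        for idx in rng.permutation(len(obs)):
            i, j, k, m = obs[idx]
            ui, vj, wk = U[i].copy(), V[j].copy(), W[k].copy()
            err = m - np.sum(ui * vj * wk)        # m − <u_i, v_j, w_k>
            U[i] += lr * (err * vj * wk - reg * ui)
            V[j] += lr * (err * ui * wk - reg * vj)
            W[k] += lr * (err * ui * vj - reg * wk)
    return U, V, W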
20
Outline
• Problem Definition
• Overview
– Taxonomy by targeted tasks
– Taxonomy by models
– Taxonomy by algorithms
• Representative work
• Summary and Discussion
21
Summary and discussion
• Recipe for designing an MF model:
– Step 1: Understand your task / data:
 What is the goal of my task?
 What is the underlying mechanism in the task?
 Knowledge, patterns, heuristics, clues…
 What data are available to support my task?
 Are all the available data sources reliable and useful to achieve the goal? Any
preprocessing/aggregation needed?
 What are the basic characteristics of my data?
 Symmetric, directional
 positive, fractional, centralized, bounded
 positive definite, triangle inequality
 Which distribution is appropriate to interpret my data?
 Any special concerns for the task?
 Task requirement: is there a need for online operation?
 Resource constraints: computational cost, labeled data, …
22
Summary and discussion
• Recipe for designing an MF model:
– Step 2: Choose an appropriate model:
 Computational or statistical?
 Computational models are generally efficient, easy to implement, off-the-shelf black boxes (no fancy skills needed)…
 Statistical models are usually interpretable, robust to overfitting, prior-knowledge-friendly, and promising if properly designed…
 If computational:
 Which loss function?
 L2: most popular, most efficient, generally promising
 Evidently heavy noise: L1, Huber, ε-insensitive
 Dominant locality: Laplacian smoothness
 Specific distribution: Bregman divergence (also use a link function)
 Measurable prediction quality: wrap in the prediction objective
 Readily translated knowledge, heuristics, clues
 What regularization?
 L2: most popular, most efficient
 Any constraints to retain?
 Sparsity: L1
 Dominant locality: Laplacian smoothness
 Readily translated knowledge, heuristics, clues
23
Summary and discussion
• Recipe for designing an MF model:
– Step 2: Choose an appropriate model (cont):
 Computational or statistical?
 Computational models are generally efficient, easy to implement, off-the-shelf black boxes (no fancy skills needed)…
 Statistical models are usually interpretable, robust to overfitting, prior-knowledge-friendly, and promising if properly designed…
 If statistical:
 How to decompose the joint pdf?
 To reflect the underlying mechanism
 To efficiently parameterize
 What’s the appropriate model for each pdf factor?
 To encode prior knowledge/underlying mechanism
 To reflect the data distribution
 What’s the appropriate prior for Bayesian treatment?




Conjugate:
Sparsity: Laplacian, exponential
Nonparametric prior
No idea? Choose none or noninformative
24
Summary and discussion
• Recipe for designing an MF model:
– Step 3: Choose or derive an algorithm:
 To meet task requirement and/or resource constraints
 To ease implementation
 To achieve the best performance
– Deterministic:
• Spectral analysis
• Matrix decomposition: SVD, QR, LU
• Solving linear systems
• Optimization: LP, gradient descent, conjugate gradient, quasi-Newton, etc.
• Alternating coordinate descent
• LARS, IRLS
• EM
• Mean field, Variational Bayes, Expectation Propagation, collapsed VB
• …
– Stochastic:
 Stochastic gradient descent (back propagation, message passing)
 Monte Carlo: MCMC, Gibbs sampling, collapsed MCMC
 Random walk
 Simulated annealing, annealing EM
 Randomized projection
 …
25
Summary and discussion
• Other thoughts
– Link propagation:
 Friendship / correlation
[Diagram: a friendship/correlation matrix with unobserved links marked "?"]
 Preprocessing:
 Propagate S (self-propagation or based on an auxiliary similarity matrix)
 S is required to be row-stochastic (nonnegative entries, each row sums to 1)
 Postprocessing:
 Propagate P (using S or an auxiliary similarity matrix)
 Both S and P are required to be row-stochastic matrices
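A minimal sketch of one such propagation step (my own illustration): S is first row-normalized so that it is stochastic, then used to smooth P:

import numpy as np

def row_normalize(S):
    # Make S row-stochastic: nonnegative entries, each row summing to 1.
    S = np.maximum(S, 0.0)
    return S / (S.sum(axis=1, keepdims=True) + 1e-12)

def propagate(P, S, alpha=0.5, steps=10):
    # Repeatedly mix P with its neighborhood average under the stochastic matrix S.
    S = row_normalize(S)
    for _ in range(steps):
        P = alpha * (S @ P) + (1 - alpha) * P
    return P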
26
Summary and discussion
• Other thoughts
– Smoothness:
 Friendship / neighborhood; correlation, same-category: more parameters, but could be parameter-free
 Spectral smoothness (applying low-pass filtering): single parameter
27
Thanks!
Any comments would be appreciated!
28