Matrix Factorization Models, Algorithms and Applications
Shuang Hong Yang
http://www.cc.gatech.edu/~syang46/
shyang@gatech.edu
May 2010

Outline
• Problem Definition
• Overview
  – Taxonomy by targeted tasks
  – Taxonomy by models
  – Taxonomy by algorithms
• Representative work
• Summary and Discussion

Problem Definition
• Matrix Factorization: for a given matrix M, find a compact (low-rank) approximation
  [Diagram: each entry M(a,b) is approximated by the factor interaction u(a)ᵀ D v(b)]
  – M may be partially observed (i.e., entries may be missing)
  – In the simplest form: (U, V) = arg min ||M − UᵀV||_F²  (a NumPy sketch of this objective follows the model taxonomy below)
    • The identity function f(x) = x is used as the link function
    • U, V and D interact in a multiplicative fashion
    • D is assumed to be the identity matrix
    • Euclidean distance is used as the measure of goodness of fit

Problem Definition (cont.)
• Matrix Co-Factorization: for a given set of related matrices {M}, find a coupled set of compact (low-rank) approximations
  – Each M represents observed interactions between two entity types
  – Multi-View MF
  – Joint MF
  [Diagram: coupled factorizations linking entity types a, b and c]

Overview: MF taxonomy
• Targeted tasks:
  – Dimensionality reduction
    PCA and other spectral algorithms
    LSI and other SVD algorithms
    NMF and other (convex) optimization algorithms
    PPCA and other statistical models
  – Clustering
    K-means, mean-shift, min-cut, normalized cut, NMF, etc.
    Gaussian mixture model, Bi-Gaussian model, etc.
  – Factor analysis (e.g., profiling, decomposition)
    ICA, CCA, NMF, etc.
    SVD, MMMF, etc.
  – Codebook learning
    Sparse coding, K-means, NMF, LDA, etc.
  – Topic modeling
    LSI, LDA, PLSI, etc.
  – Graph mining
    Random walk, PageRank, HITS, etc.
  – Prediction
    Classification, regression
    Link prediction, matrix completion, community detection
    Collaborative filtering, recommendation, learning to rank
    Domain adaptation, multi-task learning

Overview: MF taxonomy
• Models:
  – Computational (by optimization): models differ in objective and regularizer design
    Objective:
      L2 error minimization (least squares; Frobenius norm in matrix form)
      L1 error minimization (least absolute deviation)
      Hinge, logistic, log, cosine loss
      Huber loss, ε-insensitive loss, etc.
      Information-theoretic loss: entropy, mutual information, KL divergence
      Exponential-family loss and Bregman divergence: logistic, log, etc.
      Graph Laplacian and smoothness
      Joint loss of fitting error and prediction accuracy
    Regularizer:
      L2 norm, L1 norm, Ky Fan norm (e.g., the nuclear norm)
      Graph Laplacian and smoothness
      Lower & upper bound constraints (e.g., nonnegativity)
      Other constraints: linear constraints (e.g., probability normalization), quadratic constraints (e.g., covariance), orthogonality constraints
  – Statistical (by inference): models differ in factorization, prior and conditional design
    Factorization:
      Symmetric: p(a,b) = Σ_z p(z) p(a|z) p(b|z)
      Asymmetric: p(a,b) = p(a) Σ_z p(z|a) p(b|z)
    Conditional: usually exponential family
      Gaussian, Laplacian, multinomial, Bernoulli, Poisson
    Prior:
      Conjugate priors; popular choices: Gaussian, Laplacian (or exponential), Dirichlet
      Non-informative priors: maximum-entropy prior, etc.
      Nonparametric priors: ARD, Chinese restaurant process, Indian buffet process, etc.

Overview: MF taxonomy
• Models:
  – Connection between the two lines [Collins et al., NIPS 2002; Long et al., KDD 2007; Singh et al., ECML 2008]
    Loss function ↔ conditional distribution: L2 ↔ Gaussian; L1 ↔ Laplacian; logistic ↔ Bernoulli; Bregman/KL ↔ exponential family; …
    Regularization ↔ prior: L2 ↔ Gaussian; L1 ↔ Laplacian/exponential; Laplacian smoothness ↔ Gaussian random field; …
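Before turning to the algorithm taxonomy, a quick numerical illustration of the basic objective from the Problem Definition slide, min ||M − UᵀV||_F²: when M is fully observed, the optimal rank-k factors are given by the truncated SVD (Eckart-Young). A minimal NumPy sketch; the matrix, rank and variable names are illustrative, not from the slides:

```python
import numpy as np

def low_rank_approx(M, k):
    """Best rank-k approximation of a fully observed M under the Frobenius norm,
    obtained from the truncated SVD (Eckart-Young)."""
    P, s, Qt = np.linalg.svd(M, full_matrices=False)   # M = P @ diag(s) @ Qt
    # Split the top-k singular triplets into two factors so that M ~ U.T @ V.
    U = np.sqrt(s[:k])[:, None] * P[:, :k].T           # k x (rows of M)
    V = np.sqrt(s[:k])[:, None] * Qt[:k, :]            # k x (cols of M)
    return U, V

# Toy usage: rank-2 approximation of a random 6 x 5 matrix.
M = np.random.rand(6, 5)
U, V = low_rank_approx(M, k=2)
print(np.linalg.norm(M - U.T @ V, 'fro') ** 2)         # squared Frobenius error
```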
Overview: MF taxonomy
• Algorithms:
  – Deterministic:
    Spectral analysis
    Matrix decomposition: SVD, QR, LU
    Solving linear systems
    Optimization: LP, gradient descent, conjugate gradient, quasi-Newton, SDP, etc.
    Alternating coordinate descent
    LARS, IRLS
    EM
    Mean field, variational Bayes, expectation propagation, collapsed VB
    …
  – Stochastic:
    Stochastic gradient descent (back-propagation, message passing)
    Monte Carlo: MCMC, Gibbs sampling, collapsed MCMC
    Random walk
    Simulated annealing, annealed EM
    Randomized projection
    …

Representative Work
• Spectral dimensionality reduction / clustering:
  – PCA:
    • L2 loss + orthogonality constraint
    • min ||M − UᵀV||_F², subject to: VᵀV = I
    • Solve by spectral analysis of MᵀM
    • Analogous to factor-based collaborative filtering
  – Laplacian eigenmap:
    • Laplacian smoothness + orthogonality constraint
    • min Σ_ij w_ij ||u_i V − u_j V||², subject to: VᵀV = I
    • The graph encodes neighborhood information, e.g., heat kernel, kNN
    • Analogous to neighbor-based collaborative filtering

Representative Work
• Plain CF:
  – Factor based (Ky Fan):
    • L2 loss + L2 regularization
    • min ||M − UᵀV||_F² + C1||U||_F² + C2||V||_F²
    • Solve by SVD
    • Analogous to LSI, PCA, etc.
  – Neighbor based (item or query oriented):
    • Does not explicitly perform factorization, but can be viewed equivalently as
      min Σ_ij w_ij ||u_i M_i − u_j M_i||², subject to: Σ_v u_iv = 1, u_iv ≥ 0
    • The graph encodes neighborhood information, e.g., heat kernel, kNN
    • Analogous to K-means, Laplacian eigenmap

Representative Work
• Joint factor-neighbor based CF: [Koren: Factorization Meets the Neighborhood: a Multifaceted Collaborative Filtering Model, KDD 2008]
  – L2 loss + L2 regularization
  – Neighbor graph constructed by Pearson correlation
  – Solve by stochastic gradient descent
  – Analogous to locality- (Laplacian-) regularized PCA
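As a concrete companion to the factor-based CF objective above (L2 loss + L2 regularization) and to the "solve by stochastic gradient descent" step in Koren's model, here is a minimal SGD sketch over the observed entries only. It omits the neighborhood and bias terms of the full model; the rank, learning rate and regularization weight are illustrative:

```python
import numpy as np

def sgd_mf(entries, n_rows, n_cols, k=10, lr=0.01, reg=0.1, epochs=50, seed=0):
    """SGD for min sum_(i,j) (m_ij - u_i^T v_j)^2 + reg * (||U||_F^2 + ||V||_F^2),
    summing over observed entries only; `entries` is a list of (i, j, m_ij)."""
    rng = np.random.default_rng(seed)
    U = 0.1 * rng.standard_normal((n_rows, k))
    V = 0.1 * rng.standard_normal((n_cols, k))
    for _ in range(epochs):
        for idx in rng.permutation(len(entries)):       # visit entries in random order
            i, j, m = entries[idx]
            u_old = U[i].copy()
            err = m - u_old @ V[j]                      # residual on this entry
            U[i] += lr * (err * V[j] - reg * u_old)     # gradient step on u_i
            V[j] += lr * (err * u_old - reg * V[j])     # gradient step on v_j
    return U, V

# Toy usage: 3 users, 4 items, 5 observed ratings.
obs = [(0, 0, 5.0), (0, 2, 3.0), (1, 1, 4.0), (2, 3, 1.0), (2, 0, 2.0)]
U, V = sgd_mf(obs, n_rows=3, n_cols=4, k=2)
print(U @ V.T)                                          # reconstructed rating matrix
```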
Representative Work
• Max-margin matrix factorization:
  – Max-margin dimensionality reduction: [a lot of work here]
    • Hinge loss + L2 regularization
    • min Σ_ij h(y_ij − u_iᵀ D v_j) + C1||D||_F² + C2||U||_F² + C3||V||_F²
    • Solve by SDP, cutting plane, etc.
  – Max-Margin Matrix Factorization: [Srebro et al., NIPS 2005, ALT 2005]
    • Hinge loss + Ky Fan (nuclear) norm
    • min Σ_ij h(m_ij − u_iᵀ v_j) + C1||U||_F² + C2||V||_F²
    • Note: no constraint on the rank of U or V
    • Solve by SDP
  – CoFi-Rank: [Weimer et al., NIPS 2009]
    • NDCG loss + Ky Fan norm
    • min Σ_ij n(m_ij − u_iᵀ v_j) + C1||U||_F² + C2||V||_F²
    • Note: no constraint on the rank of U or V
    • Solve by SDP, bundle methods

Representative Work
• Sparse coding: [Lee et al., NIPS 2007; Lee et al., IJCAI 2009]
  – L2 sparse coding:
    • L2 loss + L1 regularization
    • min ||M − UᵀV||_F² + C1||U||₁ + C2||V||_F²
    • Solve by LARS, IRLS, or gradient descent with sign search
  – Exponential-family sparse coding:
    • Bregman divergence + L1 regularization
    • min Σ_ab D(M_ab || g(u_aᵀ v_b)) + C1||U||₁ + C2||V||_F²
    • Solve by gradient descent with sign search
  – Sparsity is good (my guess):
    • A more compact representation usually implies better predictive power
    • Sparsity poses a stronger prior, making local optima more distinguishable
    • Shorter description length (the MDL principle)

Representative Work
• NMF, LDA, and Exponential-family PCA
  – NMF: [Lee et al., NIPS 2001]
    • L2 loss + nonnegativity constraints
    • min ||M − UᵀV||_F², subject to: U ≥ 0, V ≥ 0
    • Solve by SDP, projected gradient descent, interior point methods
  – LDA: [Blei et al., NIPS 2002]
    • Asymmetric + multinomial conditional + conjugate (Dirichlet) prior
    • u_a ~ Dir(α), z_ab ~ Disc(u_a), M_ab ~ Mult(V, z_ab)
    • Solve by variational Bayes, EP, Gibbs sampling, collapsed VB/Gibbs
  – Exponential-family PCA: [Collins et al., NIPS 2002]
    • Bregman divergence + orthogonality constraint
    • min Σ_ab D(M_ab || g(u_aᵀ v_b)), subject to: VᵀV = I
    • Solve by gradient descent
  – Essentially, these are equivalent to each other

Representative Work
• Link analysis:
  – Factor based / bi-clustering: [a lot of papers in co-clustering and social network analysis]
    • L2 loss + L2 regularization
    • min ||M − UᵀDV||_F² + C1||U||_F² + C2||V||_F² + C3||D||_F²
    • To simplify further, assume D is diagonal or even the identity
    • Modern models use logistic regression
  – Bayesian co-clustering [Shan et al., ICDM 2008] or mixed-membership stochastic block model [Airoldi et al., NIPS 2008]
    • Symmetric + Bernoulli conditional + Dirichlet prior
    • u_i ~ Dir(α), z_i ~ Disc(u_i), M_ij ~ sigmoid(z_iᵀ D z_j)
  – Nonparametric feature model: [Miller et al., NIPS 2010]
    • Symmetric + Bernoulli conditional + nonparametric prior
    • z_i ~ IBP(α), M_ij ~ sigmoid(z_iᵀ D z_j)
  – In essence, equivalent

Representative Work
• Joint link & content analysis:
  – Collective factorization:
    • L2 loss + L2 regularization [Long et al., ICML 2006, AAAI 2008; Zhou et al., WWW 2008]
    • Or Laplacian smoothness loss + orthogonality [Zhou et al., ICML 2007]
    • Shared representation matrix
    • min ||M − UᵀDU||_F² + ||F − UᵀB||_F² + C1||U||_F² + C2||B||_F² + C3||D||_F²
  – Relational topic model: [Chang et al., AISTATS 2009, KDD 2009]
    • For M: symmetric + Bernoulli conditional + Dirichlet prior
    • For F: asymmetric + multinomial conditional + Dirichlet prior
    • Shared representation matrix
    • u_i ~ Dir(α), z_if ~ Disc(u_i), F_if ~ Mult(B, z_if), M_ij ~ sigmoid(z_iᵀ D z_j)
  – Regression-based latent factor model: [Agarwal et al., KDD 2009]
    • For M: symmetric + Gaussian conditional + Gaussian prior
    • For F: linear regression (Gaussian)
    • z_i ~ Gaussian(B·F_i, σI), M_ij ~ Gaussian(z_iᵀ z_j)
  – fLDA model: [Agarwal et al., WSDM 2009]
    • LDA content factorization + Gaussian factorization model
  – In essence, equivalent

Representative Work
• Tensor factorization / hypergraph mining and personalized CF:
  – Two-way (pairwise interaction) model: [Rendle et al., WSDM 2010, WWW 2010]
    • min Σ_ijk (M_ijk − u_iᵀ D v_j − u_iᵀ D w_k − v_jᵀ D w_k)² + C (||U||_F² + ||V||_F² + ||W||_F² + ||D||_F²)
    [Diagram: M_ijk modeled by the sum of pairwise factor interactions among u_i, v_j and w_k]
  – Full factorization: [Symeonidis et al., RecSys 2008; Rendle et al., KDD 2009]
    • min Σ_ijk (M_ijk − <u_i, v_j, w_k>)² + C (||U||_F² + ||V||_F² + ||W||_F²)
    [Diagram: M_ijk modeled by the three-way factor product <u_i, v_j, w_k>]
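The full three-way factorization above can be fit in the same spirit as the two-dimensional case, e.g., by stochastic gradient descent over the observed tensor cells. A minimal sketch, assuming <u_i, v_j, w_k> denotes the triple inner product Σ_d u_id v_jd w_kd; the solver choice and all hyperparameters are illustrative, not the specific algorithms used in the cited papers:

```python
import numpy as np

def sgd_cp3(entries, dims, k=8, lr=0.02, reg=0.05, epochs=50, seed=0):
    """SGD for min sum (M_ijl - <u_i, v_j, w_l>)^2 + reg * (||U||^2 + ||V||^2 + ||W||^2),
    with <u, v, w> = sum_d u_d * v_d * w_d; `entries` holds (i, j, l, value) tuples
    for the observed cells and `dims` = (n_i, n_j, n_l)."""
    rng = np.random.default_rng(seed)
    U = 0.1 * rng.standard_normal((dims[0], k))
    V = 0.1 * rng.standard_normal((dims[1], k))
    W = 0.1 * rng.standard_normal((dims[2], k))
    for _ in range(epochs):
        for idx in rng.permutation(len(entries)):
            i, j, l, m = entries[idx]
            u, v, w = U[i].copy(), V[j].copy(), W[l].copy()
            err = m - np.sum(u * v * w)            # residual on this cell
            U[i] += lr * (err * v * w - reg * u)   # gradient steps on the three factors
            V[j] += lr * (err * u * w - reg * v)
            W[l] += lr * (err * u * v - reg * w)
    return U, V, W

# Toy usage: a 3 x 4 x 2 tensor (e.g., user x item x context) with a few observed cells.
obs = [(0, 1, 0, 4.0), (1, 2, 1, 2.0), (2, 0, 0, 5.0), (0, 3, 1, 1.0)]
U, V, W = sgd_cp3(obs, dims=(3, 4, 2), k=3)
print(np.sum(U[0] * V[1] * W[0]))                  # predicted value for cell (0, 1, 0)
```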
Summary and Discussion
• Recipe for designing an MF model:
  – Step 1: Understand your task / data:
    What is the goal of my task?
    What is the underlying mechanism of the task? Knowledge, patterns, heuristics, clues…
    What data are available to support my task? Are all the available data sources reliable and useful for achieving the goal? Is any preprocessing/aggregation needed?
    What are the basic characteristics of my data? Symmetric, directional, positive, fractional, centered, bounded, positive definite, triangle inequality…
    Which distribution is appropriate for interpreting my data?
    Any special concerns for the task?
      Task requirements: is there a need for online operation?
      Resource constraints: computational cost, labeled data, …

Summary and Discussion
• Recipe for designing an MF model:
  – Step 2: Choose an appropriate model:
    Computational or statistical?
      Computational models are generally efficient, easy to implement, and available as off-the-shelf black boxes (no need for fancy skills)…
      Statistical models are usually interpretable, robust to overfitting, friendly to prior knowledge, and promising if properly designed…
    If computational:
      Which loss function?
        L2: most popular, most efficient, generally promising
        Evidently heavy noise: L1, Huber, ε-insensitive loss
        Dominant locality: Laplacian smoothness
        Specific data distribution: Bregman divergence (also use a link function)
        Measurable prediction quality: wrap the prediction objective into the loss
        Readily translated knowledge, heuristics, clues
      Which regularization?
        L2: most popular, most efficient
        Any constraints to retain?
        Sparsity: L1
        Dominant locality: Laplacian smoothness
        Readily translated knowledge, heuristics, clues
    If statistical:
      How to decompose the joint pdf?
        To reflect the underlying mechanism
        To parameterize efficiently
      What is the appropriate model for each pdf factor?
        To encode prior knowledge and the underlying mechanism
        To reflect the data distribution
      What is the appropriate prior for a Bayesian treatment?
        Conjugate priors
        Sparsity: Laplacian, exponential
        Nonparametric priors
        No idea? Choose none, or a noninformative prior

Summary and Discussion
• Recipe for designing an MF model:
  – Step 3: Choose or derive an algorithm:
    To meet task requirements and/or resource constraints
    To ease implementation
    To achieve the best performance
    Deterministic:
      Spectral analysis
      Matrix decomposition: SVD, QR, LU
      Solving linear systems
      Optimization: LP, gradient descent, conjugate gradient, quasi-Newton, etc.
      Alternating coordinate descent
      LARS, IRLS
      EM
      Mean field, variational Bayes, expectation propagation, collapsed VB
      …
    Stochastic:
      Stochastic gradient descent (back-propagation, message passing)
      Monte Carlo: MCMC, Gibbs sampling, collapsed MCMC
      Random walk
      Simulated annealing, annealed EM
      Randomized projection
      …

Summary and Discussion
• Other thoughts
  – Link propagation: friendship / correlation
    [Diagram: unobserved links marked "?" to be filled in by propagation]
    Preprocessing: propagate S (self-propagation, or propagation based on an auxiliary similarity matrix); S is required to be a stochastic matrix (positive entries, row sums equal to 1)
    Postprocessing: propagate P (using S or an auxiliary similarity matrix); both S and P are required to be stochastic matrices
  – Smoothness:
    Friendship / neighborhood
    Correlation, same category
    More parameters, but could be parameter free
    Applying low-pass filtering: a single parameter (see the sketch below)
    Spectral smoothness
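One standard way to realize "low-pass filtering with a single parameter" on a friendship or correlation graph is Tikhonov smoothing with the graph Laplacian: solve (I + λL) x_s = x, where larger λ filters out more of the non-smooth (high-frequency) components of the signal. A minimal sketch; the affinity matrix and λ are illustrative:

```python
import numpy as np

def laplacian_smooth(A, x, lam=1.0):
    """Low-pass filter a node signal x on a graph with affinity matrix A by solving
    (I + lam * L) x_s = x, where L = D - A is the graph Laplacian.
    lam is the single smoothing parameter; lam = 0 returns x unchanged."""
    L = np.diag(A.sum(axis=1)) - A                 # unnormalized graph Laplacian
    return np.linalg.solve(np.eye(len(x)) + lam * L, x)

# Toy usage: a 4-node chain graph and a noisy node signal.
A = np.array([[0., 1., 0., 0.],
              [1., 0., 1., 0.],
              [0., 1., 0., 1.],
              [0., 0., 1., 0.]])
x = np.array([1.0, 0.2, 0.9, 0.1])
print(laplacian_smooth(A, x, lam=2.0))             # neighboring values pulled together
```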
Thanks! Any comments would be appreciated!