Graphical Models for Collaborative Filtering
Le Song
Machine Learning II: Advanced Topics
CSE 8803ML, Spring 2012

Sequence modeling
HMM, Kalman filter, etc.:
- Similarity: the same graphical model topology; learning and inference follow the same principles
- Difference: the actual transition and observation models chosen
- Learning parameters: EM
- Computing the posterior of the hidden states: message passing
- Decoding: message passing with max-product
[Figure: chain-structured graphical model with hidden states $Y_1, Y_2, \dots, Y_t, Y_{t+1}$ and observations $X_1, X_2, \dots, X_t, X_{t+1}$]

MRF vs. CRF
- MRF: models the joint of $X$ and $Y$, $P(X, Y) = \frac{1}{Z} \prod_i \Psi(X_i, Y_i)$; $Z$ does not depend on $X$ or $Y$. Some complicated relations within $X$ may be hard to model. Learning: each gradient step needs one inference.
- CRF: models the conditional of $Y$ given $X$, $P(Y \mid X) = \frac{1}{Z(X)} \prod_i \Psi(X_i, Y_i)$; $Z(X)$ is a function of $X$. Avoids modeling complicated relations within $X$. Learning: each gradient step needs one inference per training point.

Parameter Learning for Conditional Random Fields
$$P(X_1, \dots, X_n \mid Y, \theta) = \frac{1}{Z(Y, \theta)} \exp\Big( \sum_{ij} \theta_{ij} X_i X_j Y + \sum_i \theta_i X_i Y \Big) = \frac{1}{Z(Y, \theta)} \prod_{ij} \exp(\theta_{ij} X_i X_j Y) \prod_i \exp(\theta_i X_i Y)$$
$$Z(Y, \theta) = \sum_x \prod_{ij} \exp(\theta_{ij} x_i x_j Y) \prod_i \exp(\theta_i x_i Y)$$
(The products $X_i X_j Y$ and $X_i Y$ can be replaced by other feature functions $f(X_i, X_j, Y)$.)
[Figure: graphical model over $X_1, X_2, X_3$ and $Y$]

Maximize the log conditional likelihood:
$$\ell(\theta, D) = \frac{1}{N} \log \prod_{n=1}^{N} P(x^n \mid y^n, \theta) = \frac{1}{N} \sum_{n=1}^{N} \Big( \sum_{ij} \theta_{ij} x_i^n x_j^n y^n + \sum_i \theta_i x_i^n y^n - \log Z(y^n, \theta) \Big)$$
The last term $\log Z(y^n, \theta)$ does not decompose!

Derivatives of the log likelihood
$$\frac{\partial \ell(\theta, D)}{\partial \theta_{ij}} = \frac{1}{N} \sum_{n} \Big( x_i^n x_j^n y^n - \frac{\partial \log Z(y^n, \theta)}{\partial \theta_{ij}} \Big) = \frac{1}{N} \sum_{n} \Big( x_i^n x_j^n y^n - \frac{1}{Z(y^n, \theta)} \sum_x \prod_{kl} \exp(\theta_{kl} x_k x_l y^n) \prod_k \exp(\theta_k x_k y^n)\, x_i x_j y^n \Big)$$
This is a convex problem, so the global optimum can be found, but the second term requires inference, and the inference has to be performed for each $y^n$!

Moment matching condition
$$\frac{\partial \ell(\theta, D)}{\partial \theta_{ij}} = \underbrace{\frac{1}{N} \sum_n x_i^n x_j^n y^n}_{\text{empirical covariance matrix}} - \underbrace{\frac{1}{N} \sum_n \mathbb{E}_{P(X \mid y^n, \theta)}[X_i X_j]\, y^n}_{\text{covariance matrix from the model}}$$
Define the empirical distributions
$$\tilde P(X_i, X_j, Y) = \frac{1}{N} \sum_{n=1}^{N} \delta(X_i, x_i^n)\, \delta(X_j, x_j^n)\, \delta(Y, y^n), \qquad \tilde P(Y) = \frac{1}{N} \sum_{n=1}^{N} \delta(Y, y^n).$$
Moment matching:
$$\frac{\partial \ell(\theta, D)}{\partial \theta_{ij}} = \mathbb{E}_{\tilde P(X_i, X_j, Y)}[X_i X_j Y] - \mathbb{E}_{P(X \mid Y, \theta)\, \tilde P(Y)}[X_i X_j Y]$$
At the optimum the gradient vanishes, so the model moments match the empirical moments.

Optimize MLE for undirected models
$\max_\theta \ell(\theta, D)$ is a convex optimization problem. It can be solved by many methods, such as gradient descent or conjugate gradient:
- Initialize the model parameters $\theta$
- Loop until convergence:
  - Compute $\frac{\partial \ell(\theta, D)}{\partial \theta_{ij}} = \mathbb{E}_{\tilde P(X_i, X_j, Y)}[X_i X_j Y] - \mathbb{E}_{P(X \mid Y, \theta)\, \tilde P(Y)}[X_i X_j Y]$
  - Update $\theta_{ij} \leftarrow \theta_{ij} + \eta\, \frac{\partial \ell(\theta, D)}{\partial \theta_{ij}}$ (gradient ascent on $\ell$)
A small code sketch of this loop follows.
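As a concrete illustration of the loop above, here is a minimal sketch of gradient-ascent MLE with moment matching for a tiny instance of the pairwise conditional model $P(x \mid y, \theta) \propto \exp(\sum_{i<j} \theta_{ij} x_i x_j y + \sum_i \theta_i x_i y)$, with $x_i \in \{-1, +1\}$, a scalar label $y$, and edges restricted to $i < j$ to avoid double counting. Inference is done by brute-force enumeration, which is only feasible for a handful of variables; the function names, learning rate, and toy data are illustrative assumptions, not code from the lecture.

```python
import itertools
import numpy as np

def model_moments(theta_pair, theta_single, y, n):
    """Brute-force inference: E_{P(x|y,theta)}[x_i x_j] and E[x_i] by enumerating all 2^n states."""
    states = np.array(list(itertools.product([-1.0, 1.0], repeat=n)))   # (2^n, n)
    # unnormalized log-probability of every joint state x
    pair_term = np.einsum('si,ij,sj->s', states, np.triu(theta_pair, 1), states) * y
    single_term = states @ theta_single * y
    logp = pair_term + single_term
    p = np.exp(logp - logp.max())
    p /= p.sum()                                          # normalizing plays the role of 1/Z(y, theta)
    E_pair = np.einsum('s,si,sj->ij', p, states, states)  # model moments E[x_i x_j]
    E_single = p @ states                                 # model moments E[x_i]
    return E_pair, E_single

def fit_crf(X, Y, n_iters=200, lr=0.1):
    """Gradient ascent on the conditional log-likelihood; one inference call per training point."""
    N, n = X.shape
    theta_pair = np.zeros((n, n))                         # only the upper triangle (i < j) is used
    theta_single = np.zeros(n)
    for _ in range(n_iters):
        g_pair = np.zeros((n, n))
        g_single = np.zeros(n)
        for x, y in zip(X, Y):
            E_pair, E_single = model_moments(theta_pair, theta_single, y, n)
            g_pair += (np.outer(x, x) - E_pair) * y       # empirical moment minus model moment
            g_single += (x - E_single) * y
        theta_pair += lr * np.triu(g_pair, 1) / N         # ascent step on the edge parameters
        theta_single += lr * g_single / N
    return theta_pair, theta_single

# toy usage: 4 binary variables, labels in {-1, +1}
rng = np.random.default_rng(0)
X = rng.choice([-1.0, 1.0], size=(50, 4))
Y = np.sign(X[:, 0] * X[:, 1])                            # label driven by one pairwise interaction
theta_pair, theta_single = fit_crf(X, Y)
```

Each gradient evaluation calls `model_moments` once per training point, which is exactly the "one inference per training point" cost noted on the MRF vs. CRF slide; for larger models the enumeration would be replaced by message passing or approximate inference.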
Collaborative Filtering
$R$: rating matrix; $U$: user factors; $V$: movie factors
$$\min_{U,V} f(U, V) = \| R - U V^\top \|_F^2 \quad \text{s.t.}\; U \ge 0,\; V \ge 0,\; k \ll m, n$$
Approaches:
- Low-rank matrix approximation
- Probabilistic matrix factorization (PMF)
- Bayesian probabilistic matrix factorization

Parameter Estimation and Prediction
- The Bayesian approach treats the unknown parameter as a random variable:
$$P(\theta \mid D) = \frac{P(D \mid \theta)\, P(\theta)}{P(D)} = \frac{P(D \mid \theta)\, P(\theta)}{\int P(D \mid \theta)\, P(\theta)\, d\theta}$$
- Posterior mean estimation: $\theta_{Bayes} = \int \theta\, P(\theta \mid D)\, d\theta$
- Maximum likelihood and MAP point estimates: $\theta_{ML} = \arg\max_\theta P(D \mid \theta)$, $\;\theta_{MAP} = \arg\max_\theta P(\theta \mid D)$
- Bayesian prediction takes into account all possible values of $\theta$:
$$P(x_{new} \mid D) = \int P(x_{new}, \theta \mid D)\, d\theta = \int P(x_{new} \mid \theta)\, P(\theta \mid D)\, d\theta$$
- A frequentist prediction uses a "plug-in" estimator: $P(x_{new} \mid D) = P(x_{new} \mid \theta_{ML})$ or $P(x_{new} \mid D) = P(x_{new} \mid \theta_{MAP})$

Matrix Factorization Approach
- The unconstrained problem with the Frobenius norm can be solved using the SVD:
$$\min_{U,V} \| R - U V^\top \|_F^2, \quad V^\top V = I$$
Global optimum.
- With sparse ratings, the Frobenius norm is computed only over the observed entries, and the SVD can no longer be used.
- E.g., use nonnegative matrix factorization:
$$\min_{U,V} \| R - U V^\top \|_F^2, \quad U \ge 0,\; V \ge 0$$
Local optimum; overfitting problem.

Probabilistic matrix factorization (PMF)
Model components (over the observed entries $(i,j)$ only):
$$p(R \mid U, V, \sigma^2) = \prod_{(i,j)\,\text{observed}} \mathcal{N}(R_{ij} \mid u_i^\top v_j, \sigma^2), \qquad p(U \mid \sigma_U^2) = \prod_i \mathcal{N}(u_i \mid 0, \sigma_U^2 I), \qquad p(V \mid \sigma_V^2) = \prod_j \mathcal{N}(v_j \mid 0, \sigma_V^2 I)$$
[Figure: graphical model for PMF]

PMF: Parameter Estimation
Parameter estimation: MAP estimate
$$\theta_{MAP} = \arg\max_\theta P(\theta \mid D, \alpha) = \arg\max_\theta P(\theta, D \mid \alpha) = \arg\max_\theta P(D \mid \theta)\, P(\theta \mid \alpha)$$
In the paper: [the log-posterior of $U$ and $V$ given the observed ratings]

PMF: Interpret the prior as regularization
Maximizing the posterior distribution with respect to the parameters $U$ and $V$ is equivalent to minimizing a sum-of-squares error function with quadratic regularization terms (plug in the Gaussians and take the log):
$$E = \frac{1}{2} \sum_{(i,j)\,\text{observed}} (R_{ij} - u_i^\top v_j)^2 + \frac{\lambda_U}{2} \sum_i \| u_i \|^2 + \frac{\lambda_V}{2} \sum_j \| v_j \|^2, \qquad \lambda_U = \sigma^2 / \sigma_U^2,\; \lambda_V = \sigma^2 / \sigma_V^2$$

PMF: optimization
- Optimization alternates between $U$ and $V$:
  - Fix $V$: the objective is convex in $U$
  - Fix $U$: the objective is convex in $V$
- Finds a local minimum
- Need to choose the regularization parameters $\lambda_U$ and $\lambda_V$
(A small code sketch of this alternating scheme is given at the end of these notes.)

Bayesian PMF: generative model
- A more flexible prior over the $U$ and $V$ factors
- Hyperparameters: $\Theta_U = \{ \mu_U, \Lambda_U \}$, $\Theta_V = \{ \mu_V, \Lambda_V \}$

Bayesian PMF: prior over the prior
- Add a prior over the hyperparameters
- Hyper-hyperparameters: $\Theta_0 = \{ \mu_0, \nu_0, W_0 \}$
- $\mathcal{W}$ is the Wishart distribution with $\nu_0$ degrees of freedom and a $D \times D$ scale matrix $W_0$

Bayesian PMF: predicting new ratings
Bayesian prediction takes into account all possible values of $\theta$:
$$P(x_{new} \mid D) = \int P(x_{new}, \theta \mid D)\, d\theta = \int P(x_{new} \mid \theta)\, P(\theta \mid D)\, d\theta$$
In the paper, all parameters and hyperparameters are integrated out.

Bayesian PMF: sampling for inference
Use a sampling technique to compute the predictive distribution.
Key idea: approximate an expectation by a sample average,
$$\mathbb{E}[f] \approx \frac{1}{K} \sum_{k=1}^{K} f(x^{(k)}), \qquad \mathbb{E}_{U, V, \Theta_U, \Theta_V \mid R}\big[ P(R_{ij}^* \mid u_i, v_j) \big] \approx \frac{1}{K} \sum_{k=1}^{K} P\big( R_{ij}^* \mid u_i^{(k)}, v_j^{(k)} \big)$$

Gibbs Sampling
Initialize $x^0$.
For $t = 1$ to $N$ (one full sweep per round):
$$x_1^t \sim P(X_1 \mid x_2^{t-1}, \dots, x_K^{t-1})$$
$$x_2^t \sim P(X_2 \mid x_1^t, x_3^{t-1}, \dots, x_K^{t-1})$$
$$\vdots$$
$$x_K^t \sim P(X_K \mid x_1^t, \dots, x_{K-1}^t)$$
Only need to condition on the variables in the Markov blanket.
[Figure: graphical model over $X_1, \dots, X_5$]
Variants: randomly pick the variable to sample; sample block by block.

Bayesian PMF: Gibbs sampling
- We have a directed graphical model: moralize first
- Markov blankets:
  - $U_i$: $R$, $V$, $\Theta_U$
  - $V_j$: $R$, $U$, $\Theta_V$
  - $\Theta_U$: $U$, $\Theta_0$
  - $\Theta_V$: $V$, $\Theta_0$

Bayesian PMF: Gibbs sampling equation
Gibbs sampling equation for $U_i$: [the Gaussian conditional from the paper]

Bayesian PMF: overall algorithm
[Algorithm from the paper.] The factor rows can be sampled in parallel. (A simplified code sketch with fixed hyperparameters is given at the end of these notes.)

Experiments
- RMSE
- Runtime
- Posterior distributions
- Users with different rating histories
- Effect of training-set size
[Results figures from the paper]

Issue: diagnosing convergence
Gibbs sampling: take samples only after a burn-in period.
[Figure: trace plot of sampled value vs. iteration number]
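To make the alternating optimization on the "PMF: optimization" slide concrete, here is a minimal sketch of the regularized least-squares view: with $V$ fixed, each user vector $u_i$ is the closed-form solution of a small ridge regression over the movies that user rated, and symmetrically for each movie vector $v_j$. The names (`pmf_map`, `lam_u`, `lam_v`), the boolean `mask` encoding of observed entries, and the synthetic data are assumptions for illustration, not the paper's reference implementation.

```python
import numpy as np

def pmf_map(R, mask, k=5, lam_u=0.1, lam_v=0.1, n_iters=50, seed=0):
    """R: (m, n) rating matrix; mask: (m, n) boolean array, True where a rating is observed."""
    rng = np.random.default_rng(seed)
    m, n = R.shape
    U = np.zeros((m, k))                                   # filled in on the first sweep
    V = 0.1 * rng.standard_normal((n, k))                  # random initialization of the movie factors
    I = np.eye(k)
    for _ in range(n_iters):
        # Fix V: each u_i is a ridge-regression solution over the movies user i rated.
        for i in range(m):
            Vi = V[mask[i]]
            U[i] = np.linalg.solve(Vi.T @ Vi + lam_u * I, Vi.T @ R[i, mask[i]])
        # Fix U: the symmetric update for each movie vector v_j.
        for j in range(n):
            Uj = U[mask[:, j]]
            V[j] = np.linalg.solve(Uj.T @ Uj + lam_v * I, Uj.T @ R[mask[:, j], j])
    return U, V

# toy usage: 20 users, 15 movies, ~40% of the entries observed
rng = np.random.default_rng(1)
true_U, true_V = rng.standard_normal((20, 3)), rng.standard_normal((15, 3))
R = true_U @ true_V.T + 0.1 * rng.standard_normal((20, 15))
mask = rng.random((20, 15)) < 0.4
U, V = pmf_map(R, mask)
train_rmse = np.sqrt(np.mean((R[mask] - (U @ V.T)[mask]) ** 2))
```

Because each subproblem is solved exactly and the objective is convex in each block, the regularized error is non-increasing across sweeps, but it converges only to a local minimum of the joint problem, as the slide notes.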
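Finally, a simplified sketch of the Bayesian PMF sampling idea: Gibbs-sweep over the factor rows, sample each row from its Gaussian conditional given its Markov blanket, discard a burn-in, and average the sampled predictions. To keep it short, the hyperparameters are held fixed (a rating precision `alpha` and a prior precision `lam` for zero-mean Gaussian priors on the factors) rather than being sampled from the Gaussian-Wishart hyperprior used in the paper, and all names and the synthetic data are assumptions.

```python
import numpy as np

def sample_factors(R, mask, other, alpha, lam, rng):
    """Sample each row of one factor matrix from its Gaussian conditional.
    The Markov blanket of row i is the observed ratings in row i of R plus the other factor matrix."""
    m, k = R.shape[0], other.shape[1]
    out = np.empty((m, k))
    for i in range(m):
        Oi = other[mask[i]]                                 # factors of the items rated in row i
        prec = lam * np.eye(k) + alpha * Oi.T @ Oi          # conditional precision matrix
        cov = np.linalg.inv(prec)
        mean = alpha * cov @ (Oi.T @ R[i, mask[i]])         # conditional mean
        out[i] = rng.multivariate_normal(mean, cov)
    return out

def bpmf_gibbs_predict(R, mask, k=5, alpha=2.0, lam=2.0, n_samples=200, burn_in=50, seed=0):
    """Approximate E[u_i^T v_j | R] by averaging predictions over Gibbs samples of U and V."""
    rng = np.random.default_rng(seed)
    m, n = R.shape
    V = 0.1 * rng.standard_normal((n, k))
    pred, kept = np.zeros((m, n)), 0
    for t in range(n_samples):
        U = sample_factors(R, mask, V, alpha, lam, rng)     # sample from p(U | R, V)
        V = sample_factors(R.T, mask.T, U, alpha, lam, rng) # sample from p(V | R, U)
        if t >= burn_in:                                    # keep only post-burn-in samples
            pred += U @ V.T
            kept += 1
    return pred / kept

# toy usage: same synthetic setup as the PMF sketch
rng = np.random.default_rng(1)
true_U, true_V = rng.standard_normal((20, 3)), rng.standard_normal((15, 3))
R = true_U @ true_V.T + 0.1 * rng.standard_normal((20, 15))
mask = rng.random((20, 15)) < 0.4
pred = bpmf_gibbs_predict(R, mask)
test_rmse = np.sqrt(np.mean((R[~mask] - pred[~mask]) ** 2))  # RMSE on the held-out entries
```

The burn-in length and number of samples are tuning choices; in practice they are judged from trace plots like the one sketched on the convergence slide.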