Graphical Models for
Collaborative Filtering
Le Song
Machine Learning II: Advanced Topics
CSE 8803ML, Spring 2012
Sequence modeling
HMM, Kalman filter, etc.:
Similarity: the same graphical model topology; learning and inference follow the same principles
Difference: the actual transition and observation models chosen
Learning parameters: EM
Computing the posterior of hidden states: message passing
Decoding: message passing with max-product
π‘Œ1
π‘Œ2
π‘Œπ‘‘
π‘Œπ‘‘+1
𝑋1
𝑋2
𝑋𝑑
𝑋𝑑+1
2
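As a concrete illustration of the decoding bullet above, here is a minimal sketch of max-product (Viterbi) decoding for a small discrete HMM; the transition, emission, and initial distributions below are hypothetical, not from the slides.

```python
# A minimal sketch of max-product decoding (Viterbi) for a small discrete HMM.
# The transition matrix A, emission matrix B, and initial distribution pi are
# hypothetical toy values.
import numpy as np

def viterbi(obs, pi, A, B):
    """Return the most likely hidden state sequence for a discrete HMM."""
    T, S = len(obs), len(pi)
    delta = np.zeros((T, S))             # max-product messages (in log space)
    back = np.zeros((T, S), dtype=int)   # backpointers
    delta[0] = np.log(pi) + np.log(B[:, obs[0]])
    for t in range(1, T):
        scores = delta[t - 1][:, None] + np.log(A)   # S x S candidate scores
        back[t] = scores.argmax(axis=0)
        delta[t] = scores.max(axis=0) + np.log(B[:, obs[t]])
    # backtrack from the best final state
    path = [delta[-1].argmax()]
    for t in range(T - 1, 0, -1):
        path.append(back[t, path[-1]])
    return path[::-1]

# toy usage with 2 hidden states and 3 observation symbols
pi = np.array([0.6, 0.4])
A = np.array([[0.7, 0.3], [0.4, 0.6]])
B = np.array([[0.5, 0.4, 0.1], [0.1, 0.3, 0.6]])
print(viterbi([0, 1, 2, 2], pi, A, B))
```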
MRF vs. CRF
MRF models the joint of $X$ and $Y$:
$P(X, Y) = \frac{1}{Z} \prod_c \Psi(X_c, Y_c)$
$Z$ does not depend on $X$ or $Y$
Some complicated relations in $Y$ may be hard to model
Learning: each gradient needs one inference

CRF models the conditional of $X$ given $Y$:
$P(X \mid Y) = \frac{1}{Z(Y)} \prod_c \Psi(X_c, Y_c)$
$Z(Y)$ is a function of $Y$
Avoids modeling complicated relations in $Y$
Learning: in each gradient, one inference per training point
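To make the difference in normalization concrete, here is a small sketch with hypothetical potentials over binary variables: it computes the single MRF constant $Z$ by summing over both $X$ and $Y$, versus one CRF constant $Z(y)$ per conditioning value.

```python
# Toy illustration (not from the slides): compare the MRF normalizer Z (a single
# constant) with the CRF normalizer Z(y) (one constant per conditioning value y).
# The pairwise potential psi is hypothetical.
import itertools
import math

def psi(x1, x2, y):
    # arbitrary pairwise potential over (x1, x2), modulated by y
    return math.exp(0.5 * x1 * x2 + 0.3 * x1 * y + 0.2 * x2 * y)

vals = [-1, +1]

# MRF: Z sums over all configurations of BOTH X and Y
Z_mrf = sum(psi(x1, x2, y) for x1, x2, y in itertools.product(vals, repeat=3))

# CRF: Z(y) sums over X only, so there is one normalizer per value of y
Z_crf = {y: sum(psi(x1, x2, y) for x1, x2 in itertools.product(vals, repeat=2))
         for y in vals}

print("MRF Z:", Z_mrf)
print("CRF Z(y):", Z_crf)
```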
Parameter Learning for Conditional Random Fields
$P(X_1, \ldots, X_k \mid Y, \theta) = \frac{1}{Z(Y, \theta)} \exp\!\left( \sum_{ij} \theta_{ij} X_i X_j Y + \sum_i \theta_i X_i Y \right) = \frac{1}{Z(Y, \theta)} \prod_{ij} \exp(\theta_{ij} X_i X_j Y) \prod_i \exp(\theta_i X_i Y)$

$Z(Y, \theta) = \sum_x \prod_{ij} \exp(\theta_{ij} X_i X_j Y) \prod_i \exp(\theta_i X_i Y)$

[Figure: CRF with variables $X_1, X_2, X_3$ conditioned on $Y$]

Maximize the log conditional likelihood:

$cl(\theta, D) = \frac{1}{N} \log\!\left( \prod_{l=1}^{N} \frac{1}{Z(y^l, \theta)} \prod_{ij} \exp(\theta_{ij} x_i^l x_j^l y^l) \prod_i \exp(\theta_i x_i^l y^l) \right)$
$= \frac{1}{N} \sum_l \left( \sum_{ij} \log \exp(\theta_{ij} x_i^l x_j^l y^l) + \sum_i \log \exp(\theta_i x_i^l y^l) - \log Z(y^l, \theta) \right)$
$= \frac{1}{N} \sum_l \left( \sum_{ij} \theta_{ij} x_i^l x_j^l y^l + \sum_i \theta_i x_i^l y^l - \log Z(y^l, \theta) \right)$

The terms $\theta_{ij} x_i^l x_j^l y^l$ can be other feature functions $f(x_i, x_j, y)$.

The term $\log Z(y^l, \theta)$ does not decompose!
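As an illustration of why $\log Z(y^l, \theta)$ is the expensive part, here is a minimal sketch that evaluates the conditional log-likelihood of a tiny pairwise CRF over binary $\pm 1$ variables by brute-force enumeration; the parameters and data below are hypothetical.

```python
# A minimal sketch (hypothetical parameters and data, binary +/-1 variables):
# evaluate the conditional log-likelihood of a tiny pairwise CRF by brute-force
# enumeration of Z(y, theta). This is only feasible for a handful of variables,
# which is exactly why log Z(y^l, theta) not decomposing is a problem.
import itertools
import math

K = 3  # number of X variables
vals = [-1, +1]

def score(x, y, theta_pair, theta_single):
    s = sum(theta_pair[(i, j)] * x[i] * x[j] * y for (i, j) in theta_pair)
    s += sum(theta_single[i] * x[i] * y for i in range(K))
    return s

def log_Z(y, theta_pair, theta_single):
    return math.log(sum(math.exp(score(x, y, theta_pair, theta_single))
                        for x in itertools.product(vals, repeat=K)))

def cond_log_likelihood(data, theta_pair, theta_single):
    # data: list of (x^l, y^l) pairs
    total = 0.0
    for x, y in data:
        total += score(x, y, theta_pair, theta_single) - log_Z(y, theta_pair, theta_single)
    return total / len(data)

# toy usage
theta_pair = {(0, 1): 0.5, (1, 2): -0.3, (0, 2): 0.1}
theta_single = [0.2, -0.1, 0.4]
data = [((+1, -1, +1), +1), ((-1, -1, +1), -1)]
print(cond_log_likelihood(data, theta_pair, theta_single))
```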
Derivatives of log likelihood
𝑐𝑙 πœƒ, 𝐷 =
πœ•π‘π‘™ πœƒ,𝐷
πœ•πœƒπ‘–π‘—
=
1
𝑁
=
1
𝑁
=
1
𝑁
𝑁
𝑙
1
𝑁
𝑙π‘₯ 𝑙 𝑦 𝑙 +
πœƒ
π‘₯
𝑖𝑗
𝑖
𝑖𝑗
𝑗
𝑁 𝑙 𝑙 𝑙
𝑙 π‘₯𝑖 π‘₯𝑗 𝑦
𝑁 𝑙 𝑙 𝑙
𝑙 π‘₯𝑖 π‘₯𝑗 𝑦
𝑁 𝑙 𝑙 𝑙
𝑙 π‘₯𝑖 π‘₯𝑗 𝑦
(
−
−
−
πœ• log 𝑍 𝑦 𝑙 ,πœƒ
πœ•πœƒπ‘–π‘—
𝑙 𝑙
𝑙, πœƒ )
πœƒ
π‘₯
𝑦
−
log
𝑍
𝑦
𝑖
𝑖
𝑖
𝐴 π‘π‘œπ‘›π‘£π‘’π‘₯ π‘π‘Ÿπ‘œπ‘π‘™π‘’π‘š
Can find global optimum
πœ• 𝑍 𝑦 𝑙 ,πœƒ
1
Z(𝑦 𝑙 ,πœƒ)
πœ•πœƒπ‘–π‘—
1
Z(𝑦 𝑙 ,πœƒ)
π‘₯ ∏𝑖𝑗 exp(πœƒπ‘–π‘— 𝑋𝑖 𝑋𝑗
𝑦 𝑙 ) ∏𝑖 exp(πœƒπ‘– 𝑋𝑖 𝑦 𝑙 ) 𝑋𝑖 𝑋𝑗 𝑦 𝑙
𝑛𝑒𝑒𝑑 π‘‘π‘œ π‘‘π‘œ π‘–π‘›π‘“π‘’π‘Ÿπ‘’π‘›π‘π‘’ π‘“π‘œπ‘Ÿ π‘’π‘Žπ‘β„Ž 𝑦 𝑙 β€Ό!
5
Moment matching condition
πœ•π‘π‘™ πœƒ,𝐷
=
πœ•πœƒπ‘–π‘—
1 𝑁 𝑙 𝑙 𝑙
π‘₯ π‘₯ 𝑦
𝑁 𝑙 𝑖 𝑗
−
1
Z 𝑦 𝑙 ,πœƒ
π‘π‘œπ‘£π‘Žπ‘Ÿπ‘–π‘Žπ‘›π‘π‘’ π‘šπ‘Žπ‘‘π‘Ÿπ‘–π‘₯
from model
π‘’π‘šπ‘π‘–π‘Ÿπ‘–π‘π‘Žπ‘™
π‘π‘œπ‘£π‘Žπ‘Ÿπ‘–π‘Žπ‘›π‘π‘’ π‘šπ‘Žπ‘‘π‘Ÿπ‘–π‘₯
P 𝑋𝑖 , 𝑋𝑗 , π‘Œ =
P π‘Œ =
1
𝑁
𝑙 ∏ exp πœƒ 𝑋 𝑦 𝑙 𝑋 𝑋 𝑦 𝑙
∏
exp
πœƒ
𝑋
𝑋
𝑦
π‘₯ 𝑖𝑗
𝑖𝑗 𝑖 𝑗
𝑖
𝑖 𝑖
𝑖 𝑗
1
𝑁
𝑙
𝑙
𝑁
𝑙=1 𝛿(𝑋𝑖 , π‘₯𝑖 ) 𝛿(𝑋𝑗 , π‘₯𝑗 )
𝛿(π‘Œ, 𝑦 𝑙 )
𝑁
𝑙
𝑙=1 𝛿(π‘Œ, 𝑦 )
Moment matching:
πœ•π‘π‘™ πœƒ,𝐷
πœ•πœƒπ‘–π‘—
= 𝐸P
𝑋𝑖 ,𝑋𝑗 ,π‘Œ
𝑋𝑖 𝑋𝑗 π‘Œ − 𝐸𝑃(𝑋|π‘Œ,πœƒ)P
π‘Œ
[𝑋𝑖 𝑋𝑗 π‘Œ]
6
Optimize MLE for undirected models
$\max_\theta cl(\theta, D)$ is a convex optimization problem.
It can be solved by many methods, such as gradient descent or conjugate gradient:

Initialize model parameters $\theta$
Loop until convergence
  Compute $\frac{\partial cl(\theta, D)}{\partial \theta_{ij}} = E_{\hat{P}(X_i, X_j, Y)}[X_i X_j Y] - E_{P(X \mid Y, \theta)\, \hat{P}(Y)}[X_i X_j Y]$
  Update $\theta_{ij} \leftarrow \theta_{ij} + \eta\, \frac{\partial cl(\theta, D)}{\partial \theta_{ij}}$ (gradient ascent, since we are maximizing $cl$)

A sketch of this loop for a toy model follows below.
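A minimal sketch of this training loop for a toy pairwise CRF over binary $\pm 1$ variables; the model expectation is computed by brute-force enumeration rather than message passing, and the toy data, pair set, and step size are assumptions.

```python
# A minimal sketch (toy data, binary +/-1 variables): gradient ascent on the
# conditional log-likelihood of a small pairwise CRF. The model expectation
# E_{P(X|y,theta)}[X_i X_j y] is computed by brute-force enumeration, so this
# only illustrates the moment-matching update, not a scalable implementation.
import itertools
import math

K = 3
vals = [-1, +1]
pairs = [(0, 1), (1, 2), (0, 2)]

def unnorm(x, y, th_pair, th_single):
    s = sum(th_pair[p] * x[p[0]] * x[p[1]] * y for p in pairs)
    s += sum(th_single[i] * x[i] * y for i in range(K))
    return math.exp(s)

def model_expectations(y, th_pair, th_single):
    configs = list(itertools.product(vals, repeat=K))
    weights = [unnorm(x, y, th_pair, th_single) for x in configs]
    Z = sum(weights)
    return {p: sum(w * x[p[0]] * x[p[1]] * y for x, w in zip(configs, weights)) / Z
            for p in pairs}

def train(data, eta=0.1, iters=200):
    th_pair = {p: 0.0 for p in pairs}
    th_single = [0.0] * K
    N = len(data)
    for _ in range(iters):
        grad = {p: 0.0 for p in pairs}
        for x, y in data:
            exp_model = model_expectations(y, th_pair, th_single)
            for p in pairs:
                # empirical moment minus model moment
                grad[p] += (x[p[0]] * x[p[1]] * y - exp_model[p]) / N
        for p in pairs:
            th_pair[p] += eta * grad[p]   # gradient ascent step
    return th_pair

data = [((+1, +1, -1), +1), ((-1, -1, +1), +1), ((+1, -1, +1), -1)]
print(train(data))
```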
Collaborative Filtering
$R$: rating matrix; $U$: user factors; $V$: movie factors

$\min_{U,V} \; f(U, V) = \left\| R - U V^\top \right\|_F^2 \quad \text{s.t.} \quad U \geq 0,\; V \geq 0,\; k \ll m, n$

Approaches:
Low-rank matrix approximation
Probabilistic matrix factorization
Bayesian probabilistic matrix factorization
Parameter Estimation and Prediction
The Bayesian approach treats the unknown parameters as random variables:
$P(\theta \mid D) = \frac{P(D \mid \theta)\, P(\theta)}{P(D)} = \frac{P(D \mid \theta)\, P(\theta)}{\int P(D \mid \theta)\, P(\theta)\, d\theta}$
Posterior mean estimation: $\theta_{Bayes} = \int \theta\, P(\theta \mid D)\, d\theta$

[Figure: plate diagram with parameter $\theta$, observations $X_1, \ldots, X_N$, and a new point $X_{new}$]

Maximum likelihood and MAP approaches:
$\theta_{ML} = \arg\max_\theta P(D \mid \theta), \quad \theta_{MAP} = \arg\max_\theta P(\theta \mid D)$

Bayesian prediction takes into account all possible values of $\theta$:
$P(x_{new} \mid D) = \int P(x_{new}, \theta \mid D)\, d\theta = \int P(x_{new} \mid \theta)\, P(\theta \mid D)\, d\theta$

A frequentist prediction uses a "plug-in" estimator:
$P(x_{new} \mid D) = P(x_{new} \mid \theta_{ML}) \quad \text{or} \quad P(x_{new} \mid D) = P(x_{new} \mid \theta_{MAP})$
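To make the plug-in vs. Bayesian distinction concrete, here is a small sketch, not from the slides, using a Beta-Bernoulli model where the posterior and the predictive integral are available in closed form; the prior and data are hypothetical.

```python
# A small illustration (not from the slides) of Bayesian prediction vs. plug-in
# prediction, using a Beta-Bernoulli model where everything is closed form.
# Data: coin flips; theta = probability of heads; prior Beta(a, b) is assumed.
data = [1, 1, 1, 0]      # three heads, one tail
a, b = 1.0, 1.0          # uniform Beta(1, 1) prior (an assumption for this example)

heads, tails = sum(data), len(data) - sum(data)

# Maximum likelihood plug-in: P(x_new = 1 | D) = theta_ML
theta_ml = heads / len(data)

# Bayesian predictive: integrate over the Beta(a + heads, b + tails) posterior,
# which gives the posterior mean of theta (Laplace's rule of succession).
p_bayes = (a + heads) / (a + b + heads + tails)

print("plug-in P(heads):", theta_ml)   # 0.75
print("Bayesian P(heads):", p_bayes)   # 0.666...
```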
Matrix Factorization Approach
The unconstrained problem with the Frobenius norm can be solved using the SVD:
$\arg\min_{U, V} \left\| R - U V^\top \right\|_F^2$
Global optimum.

[Figure: $R = U V^\top$ decomposition]

When we have sparse entries, the Frobenius norm is computed only over the observed entries, and we can no longer use the SVD.
E.g., use nonnegative matrix factorization:
$\arg\min_{U, V} \left\| R - U V^\top \right\|_F^2, \quad U V^\top \geq 0$
Local optimum; overfitting problem.
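A minimal sketch of the SVD route for a fully observed matrix: by the Eckart-Young theorem, truncating the SVD at rank $k$ attains the global optimum of the unconstrained Frobenius-norm problem. The toy matrix $R$ below is hypothetical.

```python
# A minimal sketch: rank-k approximation of a fully observed rating matrix via the
# truncated SVD, the global optimum of the unconstrained Frobenius-norm problem.
import numpy as np

def low_rank_approx(R, k):
    """Best rank-k approximation of R in Frobenius norm via truncated SVD."""
    U, s, Vt = np.linalg.svd(R, full_matrices=False)
    # keep the top-k singular values/vectors; split singular values between factors
    U_k = U[:, :k] * np.sqrt(s[:k])
    V_k = Vt[:k, :].T * np.sqrt(s[:k])
    return U_k, V_k

R = np.array([[5., 4., 1.],
              [4., 5., 2.],
              [1., 2., 5.]])
U_k, V_k = low_rank_approx(R, k=2)
print(np.round(U_k @ V_k.T, 2))   # rank-2 reconstruction of R
```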
Probabilistic matrix factorization (PMF)
Model components (as in the PMF paper): a Gaussian likelihood over observed ratings, $R_{ij} \sim \mathcal{N}(U_i^\top V_j, \sigma^2)$, with zero-mean spherical Gaussian priors on the user and movie factors, $U_i \sim \mathcal{N}(0, \sigma_U^2 I)$ and $V_j \sim \mathcal{N}(0, \sigma_V^2 I)$.
PMF: Parameter Estimation
Parameter estimation: MAP estimate
πœƒπ‘€π΄π‘ƒ = π‘Žπ‘Ÿπ‘”π‘šπ‘Žπ‘₯πœƒ 𝑃 πœƒ 𝐷, 𝛼 = π‘Žπ‘Ÿπ‘”π‘šπ‘Žπ‘₯πœƒ 𝑃 πœƒ, 𝐷 𝛼
= π‘Žπ‘Ÿπ‘”π‘šπ‘Žπ‘₯πœƒ 𝑃 𝐷 πœƒ 𝑃(πœƒ|𝛼)
In the paper:
13
PMF: Interpret prior as regularization
Maximize the posterior distribution with respect to the parameters $U$ and $V$.
Equivalent to minimizing a sum-of-squares error function with quadratic regularization terms (plug in the Gaussians and take the log):
$E = \frac{1}{2} \sum_{i,j} I_{ij} \left( R_{ij} - U_i^\top V_j \right)^2 + \frac{\lambda_U}{2} \sum_i \| U_i \|^2 + \frac{\lambda_V}{2} \sum_j \| V_j \|^2$
where $I_{ij}$ indicates whether rating $R_{ij}$ is observed.
PMF: optimization
Optimization: alternate between $U$ and $V$
Fixing $U$, the objective is convex in $V$
Fixing $V$, the objective is convex in $U$
This finds a local minimum
Need to choose the regularization parameters $\lambda_U$ and $\lambda_V$
A minimal sketch of the alternating scheme follows below.
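A minimal sketch of the alternating scheme, not the paper's implementation: with one factor fixed, each row of the other factor solves a small ridge-regression problem over the ratings it participates in. The toy matrix, mask, and hyperparameter values below are assumptions.

```python
# A minimal sketch (not the paper's implementation) of alternating minimization for
# the regularized PMF objective: with one factor fixed, each row of the other factor
# is a ridge-regression solve over its observed ratings. Toy values throughout.
import numpy as np

def als_pmf(R, mask, k=2, lam_u=0.1, lam_v=0.1, iters=20):
    m, n = R.shape
    rng = np.random.default_rng(0)
    U = 0.1 * rng.standard_normal((m, k))
    V = 0.1 * rng.standard_normal((n, k))
    for _ in range(iters):
        # update each user factor U_i with V fixed (convex ridge subproblem)
        for i in range(m):
            obs = mask[i] > 0
            A = V[obs].T @ V[obs] + lam_u * np.eye(k)
            b = V[obs].T @ R[i, obs]
            U[i] = np.linalg.solve(A, b)
        # update each movie factor V_j with U fixed
        for j in range(n):
            obs = mask[:, j] > 0
            A = U[obs].T @ U[obs] + lam_v * np.eye(k)
            b = U[obs].T @ R[obs, j]
            V[j] = np.linalg.solve(A, b)
    return U, V

R = np.array([[5., 4., 0.], [4., 0., 2.], [0., 2., 5.]])
mask = (R > 0).astype(float)   # treat zeros as unobserved in this toy example
U, V = als_pmf(R, mask)
print(np.round(U @ V.T, 2))
```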
Bayesian PMF: generative model
A more flexible prior over the $U$ and $V$ factors.
Hyperparameters: $\Theta_U = \{\mu_U, \Lambda_U\}$, $\Theta_V = \{\mu_V, \Lambda_V\}$
Bayesian PMF: prior over prior
Add a prior over the hyperparameters.
Hyper-hyperparameters: $\Theta_0 = \{\mu_0, \nu_0, W_0\}$
$\mathcal{W}$ is the Wishart distribution with $\nu_0$ degrees of freedom and a $D \times D$ scale matrix $W_0$.
Bayesian PMF: predicting new ratings
Bayesian prediction takes into account all possible values of $\theta$:
$P(x_{new} \mid D) = \int P(x_{new}, \theta \mid D)\, d\theta = \int P(x_{new} \mid \theta)\, P(\theta \mid D)\, d\theta$
In the paper, all parameters and hyperparameters are integrated out.
Bayesian PMF: sampling for inference
Use a sampling technique to compute the prediction.
Key idea: approximate an expectation by a sample average
$E[f] \approx \frac{1}{N} \sum_{i=1}^{N} f(x^i)$
$E_{U, V, \Theta_U, \Theta_V \mid R}\!\left[ p(R^*_{ij} \mid U_i, V_j) \right] \approx \frac{1}{N} \sum_{k}^{N} p(R^*_{ij} \mid U_i^k, V_j^k)$
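A minimal sketch of this sample-average approximation: given posterior samples of the factors (for example from the Gibbs sampler described on the following slides), averaging the mean prediction $U_i^k \cdot V_j^k$ over samples gives the posterior predictive mean; the density values $p(R^*_{ij} \mid U_i^k, V_j^k)$ can be averaged the same way. The sample arrays below are random placeholders.

```python
# A minimal sketch of the sample-average approximation above. The "posterior
# samples" here are random placeholders standing in for Gibbs sampler output.
import numpy as np

rng = np.random.default_rng(0)
N_samples, D = 100, 5
U_samples = rng.standard_normal((N_samples, 10, D))  # placeholder samples of U
V_samples = rng.standard_normal((N_samples, 8, D))   # placeholder samples of V

def predict(i, j):
    # Monte Carlo estimate of the predictive mean of R*_ij by averaging U_i . V_j
    return np.mean([U_samples[k, i] @ V_samples[k, j] for k in range(N_samples)])

print(predict(i=2, j=3))
```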
Gibbs Sampling
Gibbs sampling:
$X = x^0$
For $t = 1$ to $N$
  $x_1^t \sim P(X_1 \mid x_2^{t-1}, \ldots, x_K^{t-1})$
  $x_2^t \sim P(X_2 \mid x_1^t, x_3^{t-1}, \ldots, x_K^{t-1})$
  $\ldots$
  $x_K^t \sim P(X_K \mid x_1^t, \ldots, x_{K-1}^t)$

For graphical models, we only need to condition on the variables in the Markov blanket.

[Figure: undirected graph over $X_1, \ldots, X_5$ illustrating a Markov blanket]

Variants: randomly pick a variable to sample; sample block by block. (A minimal sketch follows below.)
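A minimal sketch of the loop above, not from the slides: Gibbs sampling from a bivariate Gaussian with correlation $\rho$, where each full conditional is a univariate Gaussian, including a burn-in period as discussed on the convergence slide.

```python
# A minimal sketch (not from the slides): Gibbs sampling from a standard bivariate
# Gaussian with correlation rho, where each full conditional is univariate Gaussian.
# This just illustrates the "sample each variable given the rest" loop above.
import numpy as np

rng = np.random.default_rng(0)
rho = 0.8
N, burn_in = 5000, 500

x1, x2 = 0.0, 0.0           # initial state x^0
samples = []
for t in range(N):
    # full conditionals: X1 | X2 = x2 ~ N(rho * x2, 1 - rho^2), and symmetrically
    x1 = rng.normal(rho * x2, np.sqrt(1 - rho**2))
    x2 = rng.normal(rho * x1, np.sqrt(1 - rho**2))
    if t >= burn_in:        # discard the burn-in period
        samples.append((x1, x2))

samples = np.array(samples)
print("sample correlation:", np.corrcoef(samples.T)[0, 1])  # close to rho
```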
Bayesian PMF: Gibbs sampling
We have a directed graphical model, so moralize first.
Markov blankets:
$U_i$: $R$, $V$, $\Theta_U$
$V_j$: $R$, $U$, $\Theta_V$
$\Theta_U$: $U$, $\Theta_0$
$\Theta_V$: $V$, $\Theta_0$
Bayesian PMF: Gibbs sampling equation
Gibbs sampling equation for $U_i$ (a Gaussian conditional; see the sketch below).
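The slide's equation is shown as an image; as a reference, the conditional for $U_i$ in the BPMF paper (Salakhutdinov and Mnih, 2008) is Gaussian, with precision $\Lambda_U + \alpha \sum_j I_{ij} V_j V_j^\top$ and a mean combining the prior mean with the observed ratings, where $\alpha$ is the rating precision. The sketch below follows that form; the variable names and toy data are mine.

```python
# A sketch of one Gibbs update for a single user factor U_i, following the Gaussian
# conditional given in the BPMF paper (Salakhutdinov & Mnih, 2008). Variable names
# and the toy data are assumptions for illustration.
import numpy as np

def sample_user_factor(i, R, mask, V, mu_U, Lambda_U, alpha, rng):
    rated = mask[i] > 0                        # movies rated by user i
    V_i = V[rated]                             # (n_i, D) factors of those movies
    precision = Lambda_U + alpha * V_i.T @ V_i
    cov = np.linalg.inv(precision)
    mean = cov @ (Lambda_U @ mu_U + alpha * V_i.T @ R[i, rated])
    return rng.multivariate_normal(mean, cov)

# toy usage with random placeholder values
rng = np.random.default_rng(0)
D, n_users, n_movies = 3, 4, 6
R = rng.integers(1, 6, size=(n_users, n_movies)).astype(float)
mask = np.ones_like(R)
V = rng.standard_normal((n_movies, D))
U_i = sample_user_factor(0, R, mask, V,
                         mu_U=np.zeros(D), Lambda_U=np.eye(D), alpha=2.0, rng=rng)
print(U_i)
```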
Bayesian PMF: overall algorithm
πΆπ‘Žπ‘› 𝑏𝑒 π‘ π‘Žπ‘šπ‘π‘™π‘’π‘‘
𝑖𝑛 π‘π‘Žπ‘Ÿπ‘Žπ‘™π‘™π‘’π‘™
23
Experiments: RMSE
Experiments: Runtime
Experiments: posterior distribution
Experiments: users with different history
Experiments: effect of training size
Issue: diagnose convergence
Gibbs sampling: take samples after the burn-in period.

[Figure: trace plot of sampled values vs. iteration number]