
Sequence Learning
Le Song
Machine Learning II: Advanced Topics
CSE 8803ML, Spring 2012
Topic Modeling in Document Analysis
Document Collections
Topic Discovery: Discover Sets of Frequently Co-occurring Words
Topic Attribution: Attribute Words to a Particular Topic
[Blei et al. 03]
Basic Model: Latent Dirichlet Allocation
A generative model for text documents
For each document, choose a topic distribution Θ
For each word w in the document
Choose a specific topic z for the word
Choose the word w from the topic (a distribution over vocabulary)
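A minimal sketch of this generative process in Python (assuming numpy; the prior `alpha`, topic-word matrix `beta`, and all numbers below are hypothetical toy values, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(0)

def generate_document(alpha, beta, n_words):
    """Sample one document from the LDA generative process.

    alpha: Dirichlet prior over the k topics, shape (k,)
    beta:  topic-word distributions, shape (k, vocab_size), rows sum to 1
    """
    theta = rng.dirichlet(alpha)                  # per-document topic distribution
    words, topics = [], []
    for _ in range(n_words):
        z = rng.choice(len(alpha), p=theta)       # topic assignment for this word
        w = rng.choice(beta.shape[1], p=beta[z])  # word drawn from that topic
        topics.append(z)
        words.append(w)
    return words, topics

# hypothetical toy setup: 2 topics, vocabulary of 5 word ids
alpha = np.array([1.0, 1.0])
beta = np.array([[0.40, 0.40, 0.10, 0.05, 0.05],
                 [0.05, 0.05, 0.10, 0.40, 0.40]])
print(generate_document(alpha, beta, n_words=10))
```

Each document draws its own theta, so different documents mix the same shared topics in different proportions.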
[Figure: plate diagrams for LDA and pLSI]
A simplified version without the prior distribution $P(\theta \mid \alpha)$ corresponds to probabilistic latent semantic indexing (pLSI)
[Blei et al. 03]
Learning Problem in Topic Modeling
Automatically discover the topics from document collections
Given: a collection of M documents
Basic learning problems: discover the following latent
structures
k topics
Per-document topic distribution
Per-word topic assignment
Applications of topic modeling
Visualization and interpretation
Use topic distribution as document features for classification or
clustering
Predict the occurrence of words in new documents (collaborative
filtering)
Gibbs Sampling for Learning Topic Models
Variational Inference for Learning Topic Models
Approximate the posterior distribution P of the latent variables by a simpler distribution $Q(\Phi^*)$ from a parametric family
$\Phi^* = \arg\min_\Phi \mathrm{KL}(Q(\Phi)\,\|\,P)$, where KL is the Kullback-Leibler divergence
The approximated posterior is then used in the E-step of the
EM algorithm
Only finds a local optimum
[Figure: variational approximation of a mixture of 2 Gaussians by a single Gaussian; each panel shows a different local optimum]
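As a toy illustration of this local-optimum behavior (a sketch, not part of the original slides): fit a single Gaussian $Q(\mu, \sigma)$ to a two-component Gaussian mixture $P$ by numerically minimizing $\mathrm{KL}(Q \| P)$; different initializations converge to different local optima. All constants are assumed toy values.

```python
import numpy as np

# Target P: mixture of two Gaussians. Approximation Q: a single Gaussian N(mu, sigma^2).
xs = np.linspace(-10, 10, 4001)
dx = xs[1] - xs[0]

def normal_pdf(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

def p(x):  # assumed toy target: equal mixture of N(-3, 1) and N(+3, 1)
    return 0.5 * normal_pdf(x, -3.0, 1.0) + 0.5 * normal_pdf(x, 3.0, 1.0)

def kl_q_p(mu, sigma):
    # KL(Q || P) approximated by quadrature on the grid
    q = normal_pdf(xs, mu, sigma)
    return np.sum(q * (np.log(q + 1e-300) - np.log(p(xs) + 1e-300))) * dx

def fit(mu, sigma, lr=0.05, iters=2000, eps=1e-4):
    # crude gradient descent on (mu, sigma) with finite-difference gradients
    for _ in range(iters):
        g_mu = (kl_q_p(mu + eps, sigma) - kl_q_p(mu - eps, sigma)) / (2 * eps)
        g_sg = (kl_q_p(mu, sigma + eps) - kl_q_p(mu, sigma - eps)) / (2 * eps)
        mu, sigma = mu - lr * g_mu, max(sigma - lr * g_sg, 1e-2)
    return mu, sigma, kl_q_p(mu, sigma)

print(fit(-2.0, 1.0))  # converges near the left mixture component
print(fit(+2.0, 1.0))  # converges near the right mixture component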
Sequence Learning
Hidden Markov Models
Speech recognition
Genome sequence modeling
…
Kalman Filter (Gaussian noise models and linear models)
Tracking
Weather forecasting
…
Conditional random fields (discriminative model)
Natural language processing
Image processing
…
More complex sequence models
Hidden Markov Models
Hidden states $Y_t$
Transition model $P(Y_{t+1} \mid Y_t)$
Observation model $P(X_t \mid Y_t)$
Observations $X_t$
Three Problems in Hidden Markov Models
Given an observation sequence and a model, how to compute
the probability of the observed sequence?
Given an observation sequence and a model, how do we
choose the hidden states that best explain the observations?
How do we learn the model given the observation sequence?
π‘Œ1
π‘Œ2
π‘Œπ‘‘
π‘Œπ‘‘+1
𝑋1
𝑋2
𝑋𝑑
𝑋𝑑+1
Hidden Markov Model
Directed graphical model
The joint distribution factorizes according to parent child
relation
$P(X_1, \dots, X_n, Y_1, \dots, Y_n) = P(Y_1) \prod_{t=1}^{n} P(X_t \mid Y_t) \prod_{t=2}^{n} P(Y_t \mid Y_{t-1})$
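A minimal sketch of this factorization for a discrete HMM (Python with numpy; the initial distribution `pi`, transition matrix `A`, and emission matrix `B` below are hypothetical toy values):

```python
import numpy as np

def hmm_joint_prob(pi, A, B, states, obs):
    """P(X_1..X_n, Y_1..Y_n) for a discrete HMM, directly from the factorization.

    pi[i]   = P(Y_1 = i)                  initial distribution
    A[i, j] = P(Y_{t+1} = j | Y_t = i)    transition model
    B[i, v] = P(X_t = v | Y_t = i)        observation model
    """
    prob = pi[states[0]] * B[states[0], obs[0]]
    for t in range(1, len(obs)):
        prob *= A[states[t - 1], states[t]] * B[states[t], obs[t]]
    return prob

# hypothetical 2-state, 2-symbol HMM
pi = np.array([0.6, 0.4])
A = np.array([[0.7, 0.3], [0.2, 0.8]])
B = np.array([[0.9, 0.1], [0.3, 0.7]])
print(hmm_joint_prob(pi, A, B, states=[0, 0, 1], obs=[0, 1, 1]))
```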
π‘Œ1
π‘Œ2
π‘Œπ‘‘
π‘Œπ‘‘+1
𝑋1
𝑋2
𝑋𝑑
𝑋𝑑+1
11
HMM Problem I: sequence probability
Given an observation sequence and a model, how to compute
the probability of the observed sequence?
Inference problem in graphical models: compute the marginal
distribution of observation by summing out the hidden
variables
$P(X_1, \dots, X_n) = \sum_{Y_1, \dots, Y_n} P(X_1, \dots, X_n, Y_1, \dots, Y_n) = \sum_{Y_1, \dots, Y_n} P(Y_1) \prod_{t=1}^{n} P(X_t \mid Y_t) \prod_{t=2}^{n} P(Y_t \mid Y_{t-1})$
HMM Problem I: sequence probability
[Figure: chain $A - B - C - D - E$ with messages $m_{AB}(b)$, $m_{BC}(c)$, $m_{CD}(d)$, $m_{DE}(e)$]
Nice localization in computation:
$Z = \sum_{e} \sum_{d} \sum_{c} \sum_{b} \sum_{a} f(a)\, f(b \mid a)\, f(c \mid b)\, f(d \mid c)\, f(e \mid d)$
$Z = \sum_{e} \sum_{d} f(e \mid d) \sum_{c} f(d \mid c) \sum_{b} f(c \mid b) \underbrace{\sum_{a} f(b \mid a)\, f(a)}_{m_{AB}(b)}$
Summing out one variable at a time produces the messages $m_{AB}(b), m_{BC}(c), m_{CD}(d), m_{DE}(e)$ and finally $Z$.
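A sketch of the same message-passing idea specialized to HMM Problem I: the forward recursion computes the observation-sequence probability in $O(nK^2)$ time instead of summing over all $K^n$ hidden sequences. The toy parameters reuse the hypothetical HMM from the earlier sketch.

```python
import numpy as np

def sequence_probability(pi, A, B, obs):
    """P(X_1..X_n) by the forward recursion (summing out the hidden states).

    alpha[i] = P(x_1..x_t, Y_t = i); each update passes one message along the chain.
    """
    alpha = pi * B[:, obs[0]]
    for x in obs[1:]:
        alpha = (alpha @ A) * B[:, x]   # message to the next time step
    return alpha.sum()

# reusing the hypothetical 2-state, 2-symbol HMM from above
pi = np.array([0.6, 0.4])
A = np.array([[0.7, 0.3], [0.2, 0.8]])
B = np.array([[0.9, 0.1], [0.3, 0.7]])
print(sequence_probability(pi, A, B, obs=[0, 1, 1]))
```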
HMM Problem II: best hidden states
Given an observation sequence and a model, how do we compute the hidden state sequence with the highest probability (decoding)?
Inference problem in graphical models: compute the maximum
a posterior assignment (MAP) for hidden states given the
observations
π‘Žπ‘Ÿπ‘”π‘šπ‘Žπ‘₯π‘Œ1,…,π‘Œπ‘› 𝑃 𝑋1 , … , 𝑋𝑛 , π‘Œ1 , … , π‘Œπ‘›
= π‘Žπ‘Ÿπ‘”π‘šπ‘Žπ‘₯π‘Œ1,…,π‘Œπ‘› 𝑃(π‘Œ1 )
𝑃 𝑋𝑑 π‘Œπ‘‘ )
𝑑=1…𝑛
𝑃 π‘Œπ‘‘ π‘Œπ‘‘−1 )
𝑑=2…𝑛
14
HMM Problem II: best hidden states
The general problem solved by the junction tree algorithm is the sum-product problem:
$P(X_i) \propto \sum_{V \setminus X_i} f(X_1, \dots, X_n) = \sum_{V \setminus X_i} \prod_{j} \Psi(D_j)$
The property used by the junction tree algorithm is the distributivity of $\times$ over $+$.
For the decoding problem, it is the distributivity of max over product; or, if you take the log of the probability, the distributivity of max over sum.
You just need to keep track of the hidden sequence that achieves the maximum score
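A sketch of this max-product recursion (the Viterbi algorithm) for a discrete HMM, assuming the same `(pi, A, B)` parameterization as the earlier sketches; the backpointer array is the bookkeeping that recovers the hidden sequence achieving the maximum score.

```python
import numpy as np

def viterbi(pi, A, B, obs):
    """MAP hidden-state sequence via max-product (max-sum in log space)."""
    n, K = len(obs), len(pi)
    logA, logB = np.log(A), np.log(B)
    delta = np.log(pi) + logB[:, obs[0]]   # best log-score ending in each state
    back = np.zeros((n, K), dtype=int)     # backpointers (argmax bookkeeping)
    for t in range(1, n):
        scores = delta[:, None] + logA     # scores[i, j]: come from i, move to j
        back[t] = scores.argmax(axis=0)
        delta = scores.max(axis=0) + logB[:, obs[t]]
    # trace back the sequence that achieved the maximum score
    path = [int(delta.argmax())]
    for t in range(n - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return path[::-1]

# e.g. viterbi(pi, A, B, obs=[0, 1, 1]) with the toy (pi, A, B) defined above
```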
HMM Problem III: find model parameters
Given just an observation sequence, find the model
parameters that maximize the likelihood of the data
Suppose the set of parameters is $\Theta$:
$\arg\max_\Theta \log P(X_1, \dots, X_n) = \arg\max_\Theta \log \sum_{Y_1, \dots, Y_n} P(Y_1) \prod_{t=1}^{n} P(X_t \mid Y_t) \prod_{t=2}^{n} P(Y_t \mid Y_{t-1})$
EM algorithm
EM: Expectation-maximization for finding πœƒ
𝑙 πœƒ; 𝐷 = log
𝑧𝑝
π‘₯, 𝑧 πœƒ = log
𝑧 𝑝(π‘₯𝑖 | 𝑧, πœƒ2 )𝑃
𝑧|πœƒ1
Iterate between E-step and M-step until convergence
Expectation step (E-step)
𝑓 πœƒ = πΈπ‘ž
𝑧
log 𝑝 π‘₯, 𝑧 πœƒ , π‘€β„Žπ‘’π‘Ÿπ‘’ π‘ž 𝑧 = 𝑃(𝑧|π‘₯, πœƒ 𝑑 )
Maximization step (M-step)
πœƒ 𝑑+1 = π‘Žπ‘Ÿπ‘”π‘šπ‘Žπ‘₯πœƒ 𝑓 πœƒ
For HMMs, this EM procedure is also called the Baum-Welch algorithm
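A compact, unscaled sketch of one Baum-Welch (EM) iteration for a discrete HMM, assuming the same `(pi, A, B)` parameterization as above. The E-step uses forward and backward messages to compute the posteriors $q(z)$, and the M-step re-estimates the parameters from expected counts; a numerically stable implementation would work in log space or rescale the messages.

```python
import numpy as np

def forward(pi, A, B, obs):
    # E-step helper: alpha[t, i] = P(x_1..x_t, Y_t = i)
    alpha = np.zeros((len(obs), len(pi)))
    alpha[0] = pi * B[:, obs[0]]
    for t in range(1, len(obs)):
        alpha[t] = (alpha[t - 1] @ A) * B[:, obs[t]]
    return alpha

def backward(A, B, obs):
    # E-step helper: beta[t, i] = P(x_{t+1}..x_n | Y_t = i)
    beta = np.ones((len(obs), A.shape[0]))
    for t in range(len(obs) - 2, -1, -1):
        beta[t] = A @ (B[:, obs[t + 1]] * beta[t + 1])
    return beta

def baum_welch_step(pi, A, B, obs):
    obs = np.asarray(obs)
    alpha, beta = forward(pi, A, B, obs), backward(A, B, obs)
    Z = alpha[-1].sum()                 # sequence likelihood P(x_1..x_n)
    gamma = alpha * beta / Z            # P(Y_t = i | x)
    # xi[t, i, j] = P(Y_t = i, Y_{t+1} = j | x)
    xi = alpha[:-1, :, None] * A[None] * (B[:, obs[1:]].T * beta[1:])[:, None, :] / Z
    # M-step: re-estimate the parameters from expected counts
    new_pi = gamma[0]
    new_A = xi.sum(axis=0) / gamma[:-1].sum(axis=0)[:, None]
    new_B = np.zeros_like(B)
    for v in range(B.shape[1]):
        new_B[:, v] = gamma[obs == v].sum(axis=0)
    new_B /= gamma.sum(axis=0)[:, None]
    return new_pi, new_A, new_B, Z
```

Iterating `baum_welch_step` until the likelihood `Z` stops increasing gives a local maximum of the likelihood, as with EM in general.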
More complex sequence models
Kalman Filter
Also called state space models, or linear dynamical system
𝑋𝑑 = π΄π‘Œπ‘‘ + π‘Šπ‘‘
π‘Œπ‘‘ = πΆπ‘Œπ‘‘−1 + 𝑉𝑑
π‘Šπ‘‘ ∼ 𝑁 0; 𝑄 , 𝑉𝑑 ∼ 𝑁 0; 𝑅
π‘₯0 ∼ 𝑁(0; Σ)
𝐼𝑛 π‘”π‘’π‘›π‘’π‘Ÿπ‘Žπ‘™ π‘π‘Žπ‘› 𝑏𝑒 π‘›π‘œπ‘›π‘™π‘–π‘›π‘’π‘Žπ‘Ÿ
π‘ π‘¦π‘ π‘‘π‘’π‘š 𝐹 π‘Œπ‘‘ and 𝐺 π‘Œπ‘‘−1
HMM with special transition and observation models
𝑃 π‘Œπ‘‘ π‘Œπ‘‘−1 ∼ 𝑁(π‘Œπ‘‘ − πΆπ‘Œπ‘‘−1 ; 𝑅)
𝑃 𝑋𝑑 π‘Œπ‘‘ ∼ 𝑁 𝑋𝑑 − π΄π‘Œπ‘‘ ; 𝑄
𝑃 𝑋0 ∼ 𝑁(0; Σ)
π‘Œ1
π‘Œ2
π‘Œπ‘‘
π‘Œπ‘‘+1
𝑋1
𝑋2
𝑋𝑑
𝑋𝑑+1
19
Inference problem in Kalman filter
Filtering: given an observation sequence $x_1 \dots x_t$, estimate $y_t$
Inference problem in graphical models: estimate the marginal probability $P(y_t \mid x_1 \dots x_t)$ by summing out all values of $y_1, \dots, y_{t-1}$
Forward message passing (a sketch of the predict/update recursion follows)
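A minimal sketch of this forward message passing for the linear-Gaussian model above (Python with numpy, using the slide's notation: transition noise covariance $R$, observation noise covariance $Q$, prior covariance $\Sigma$); this is the standard Kalman predict/update recursion.

```python
import numpy as np

def kalman_filter(A, C, Q, R, Sigma0, observations):
    """Filtering P(y_t | x_1..x_t) for the linear-Gaussian state space model.

    Transition:  y_t = C y_{t-1} + v_t,  v_t ~ N(0, R)
    Observation: x_t = A y_t     + w_t,  w_t ~ N(0, Q)
    Returns the filtered means and covariances of y_t (each observation is a 1-D array).
    """
    d = C.shape[0]
    mu, P = np.zeros(d), Sigma0          # prior belief over the initial hidden state
    means, covs = [], []
    for x in observations:
        # predict: push the previous belief through the transition model
        mu_pred = C @ mu
        P_pred = C @ P @ C.T + R
        # update: correct with the new observation via the Kalman gain
        S = A @ P_pred @ A.T + Q
        K = P_pred @ A.T @ np.linalg.inv(S)
        mu = mu_pred + K @ (x - A @ mu_pred)
        P = (np.eye(d) - K @ A) @ P_pred
        means.append(mu)
        covs.append(P)
    return means, covs
```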
π‘Œ1
π‘Œ2
π‘Œπ‘‘
π‘Œπ‘‘+1
𝑋1
𝑋2
𝑋𝑑
𝑋𝑑+1
20
Inference problem in Kalman filter
Smoothing: given an observation sequence $x_1 \dots x_n$, estimate $y_1, \dots, y_n$
Inference problem in graphical models: estimate the marginal probability $P(y_t \mid x_1 \dots x_n)$ for all $t$, by summing out all hidden variables other than $y_t$
Forward and backward message passing
π‘Œ1
π‘Œ2
π‘Œπ‘‘
π‘Œπ‘‘+1
𝑋1
𝑋2
𝑋𝑑
𝑋𝑑+1
21
Learning Kalman Filters
HMM with special transition and observation models
𝑃 π‘Œπ‘‘ π‘Œπ‘‘−1 ∼ 𝑁(π‘Œπ‘‘ − πΆπ‘Œπ‘‘−1 ; 𝑅)
𝑃 𝑋𝑑 π‘Œπ‘‘ ∼ 𝑁 𝑋𝑑 − π΄π‘Œπ‘‘ ; 𝑄
𝑃 𝑋0 ∼ 𝑁(0; Σ)
Usually the dynamics $A$ and $C$ are chosen by hand, and only the noise models are estimated, e.g.:
New position = old position + Δ × velocity + noise
Observation = true position + noise
If no true positions are provided, use the EM algorithm for learning
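As a sketch of this choice for a 1-D tracking problem (all numbers are hypothetical): the hidden state is [position, velocity], the transition matrix $C$ encodes "new position = old position + Δ × velocity", and only a noisy position is observed. These matrices plug directly into the `kalman_filter` sketch above.

```python
import numpy as np

dt = 1.0                              # assumed sampling interval
# hidden state Y_t = [position, velocity]; Y_t = C Y_{t-1} + V_t
C = np.array([[1.0, dt],
              [0.0, 1.0]])            # position += dt * velocity
R = np.diag([1e-4, 1e-2])             # transition noise covariance (assumed)
# observation X_t = A Y_t + W_t: a noisy measurement of the position only
A = np.array([[1.0, 0.0]])
Q = np.array([[0.5]])                 # observation noise covariance (assumed)
Sigma0 = np.eye(2)                    # prior covariance over the initial state
# usage: means, covs = kalman_filter(A, C, Q, R, Sigma0, observations)
#        with each observation a length-1 array containing the measured position
```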
Tracking and Smoothing with Kalman Filters
Parameter Learning for Markov Random Fields
$P(X_1, \dots, X_k \mid \theta) = \frac{1}{Z(\theta)} \exp\Big(\sum_{ij} \theta_{ij} X_i X_j + \sum_i \theta_i X_i\Big) = \frac{1}{Z(\theta)} \prod_{ij} \exp(\theta_{ij} X_i X_j) \prod_i \exp(\theta_i X_i)$
$Z(\theta) = \sum_x \prod_{ij} \exp(\theta_{ij} x_i x_j) \prod_i \exp(\theta_i x_i)$
(The products $X_i X_j$ and $X_i$ can be replaced by other feature functions $f(x_i, x_j)$ and $f(x_i)$)
[Figure: pairwise Markov random field over $X_1, \dots, X_6$]
Log likelihood of the data $D = \{x^l\}_{l=1}^{N}$:
$\ell(\theta, D) = \frac{1}{N} \log \prod_{l=1}^{N} \frac{1}{Z(\theta)} \prod_{ij} \exp(\theta_{ij} x_i^l x_j^l) \prod_i \exp(\theta_i x_i^l)$
$= \frac{1}{N} \sum_{l=1}^{N} \Big( \sum_{ij} \theta_{ij} x_i^l x_j^l + \sum_i \theta_i x_i^l - \log Z(\theta) \Big)$
The term $\log Z(\theta)$ does not decompose!
Derivatives of log likelihood
𝑙 πœƒ, 𝐷 =
1
𝑁
πœ•π‘™ πœƒ,𝐷
πœ•πœƒπ‘–π‘—
1
𝑁
πœ•π‘™ πœƒ,𝐷
πœ•πœƒπ‘–π‘—
=
1
𝑁
=
=
1
𝑁
𝑁 𝑙 𝑙
𝑙 π‘₯𝑖 π‘₯𝑗
𝑁
𝑙
𝑁
𝑙
𝑁
𝑙
−
(
𝑙π‘₯ 𝑙 +
πœƒ
π‘₯
𝑖𝑗
𝑖
𝑖𝑗
𝑗
𝑙π‘₯ 𝑙
π‘₯
𝑖𝑗 𝑖 𝑗
𝑙π‘₯ 𝑙
π‘₯
𝑖𝑗 𝑖 𝑗
1
Z(πœƒ)
π‘₯
−
−
𝑙
πœƒ
π‘₯
𝑖
𝑖
𝑖 − log 𝑍 πœƒ )
πœ• log 𝑍 πœƒ
πœ•πœƒπ‘–π‘—
𝐴 π‘π‘œπ‘›π‘£π‘’π‘₯ π‘π‘Ÿπ‘œπ‘π‘™π‘’π‘š
Can find global optimum
1 πœ•π‘ πœƒ
Z(πœƒ) πœ•πœƒπ‘–π‘—
𝑖𝑗 exp(πœƒπ‘–π‘— 𝑋𝑖 𝑋𝑗 )
𝑖 exp(πœƒπ‘– 𝑋𝑖 )
𝑋𝑖 𝑋𝑗
𝑛𝑒𝑒𝑑 π‘‘π‘œ π‘‘π‘œ π‘–π‘›π‘“π‘’π‘Ÿπ‘’π‘›π‘π‘’
25
Moment matching condition
πœ•π‘™ πœƒ,𝐷
=
πœ•πœƒπ‘–π‘—
1 𝑁 𝑙 𝑙
π‘₯ π‘₯
𝑁 𝑙 𝑖 𝑗
−
1
Z πœƒ
π‘₯
𝑖𝑗 exp
Moment
1
𝑁
𝑖 exp
πœƒπ‘– 𝑋𝑖
𝑋𝑖 𝑋𝑗
π‘π‘œπ‘£π‘Žπ‘Ÿπ‘–π‘Žπ‘›π‘π‘’ π‘šπ‘Žπ‘‘π‘Ÿπ‘–π‘₯
from model
π‘’π‘šπ‘π‘–π‘Ÿπ‘–π‘π‘Žπ‘™
π‘π‘œπ‘£π‘Žπ‘Ÿπ‘–π‘Žπ‘›π‘π‘’ π‘šπ‘Žπ‘‘π‘Ÿπ‘–π‘₯
P 𝑋𝑖 , 𝑋𝑗 =
πœƒπ‘–π‘— 𝑋𝑖 𝑋𝑗
𝑙
𝑙
𝑁
𝛿(𝑋
,
π‘₯
)
𝛿(𝑋
,
π‘₯
𝑖
𝑗
𝑙=1
𝑖
𝑗)
πœ•π‘™ πœƒ,𝐷
matching:
πœ•πœƒπ‘–π‘—
= 𝐸P
𝑋𝑖 ,𝑋𝑗
𝑋𝑖 𝑋𝑗 − 𝐸𝑃(𝑋|πœƒ) [𝑋𝑖 𝑋𝑗 ]
26
Optimize MLE for undirected models
$\max_\theta \ell(\theta, D)$ is a convex optimization problem. It can be solved by many methods, such as gradient descent or conjugate gradient:
Initialize the model parameters $\theta$
Loop until convergence:
Compute $\frac{\partial \ell(\theta, D)}{\partial \theta_{ij}} = E_{\tilde{P}(X_i, X_j)}[X_i X_j] - E_{P(X \mid \theta)}[X_i X_j]$
Update $\theta_{ij} \leftarrow \theta_{ij} + \eta\, \frac{\partial \ell(\theta, D)}{\partial \theta_{ij}}$ (gradient ascent on $\ell$)
Or use the gradient equation as a fixed-point iteration: iterative proportional fitting (IPF)
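A sketch of this gradient-ascent loop for a tiny pairwise binary MRF (Python with numpy). The model moments are computed by exact enumeration of all configurations, which stands in for the inference step and is only feasible for a handful of nodes; the function name, learning rate, and iteration count are assumptions.

```python
import numpy as np
from itertools import product

def mrf_gradient_ascent(data, edges, n_nodes, lr=0.1, iters=200):
    """MLE for a pairwise binary MRF by gradient ascent (moment matching).

    data: (N, n_nodes) array with entries in {-1, +1}
    edges: list of (i, j) pairs defining the pairwise potentials
    """
    theta_pair = {e: 0.0 for e in edges}    # theta_ij
    theta_node = np.zeros(n_nodes)          # theta_i
    states = np.array(list(product([-1, 1], repeat=n_nodes)))  # all configurations

    # empirical moments E_{P~}[X_i X_j] and E_{P~}[X_i]
    emp_pair = {(i, j): np.mean(data[:, i] * data[:, j]) for (i, j) in edges}
    emp_node = data.mean(axis=0)

    for _ in range(iters):
        # unnormalized log-probability of every configuration
        score = states @ theta_node
        for (i, j), w in theta_pair.items():
            score = score + w * states[:, i] * states[:, j]
        p = np.exp(score - score.max())
        p /= p.sum()                         # exact P(x | theta), i.e. divide by Z(theta)

        # model moments E_{P(X|theta)}[X_i X_j] and E_{P(X|theta)}[X_i]
        mod_node = p @ states
        for (i, j) in edges:
            mod_pair = np.sum(p * states[:, i] * states[:, j])
            theta_pair[(i, j)] += lr * (emp_pair[(i, j)] - mod_pair)
        theta_node += lr * (emp_node - mod_node)
    return theta_node, theta_pair
```

Because the objective is concave in $\theta$, this simple update converges to the global optimum when the moments can be computed exactly.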
Conditional Random Fields
Focus on the conditional distribution $P(X_1, \dots, X_n \mid Y_1, \dots, Y_n, Y)$
Do not explicitly model the dependence among $Y_1, \dots, Y_n, Y$
Only model the relations between the $X$'s and between $X$ and $Y$
$P(X_1, X_2, X_3, X_4 \mid Y_1, Y_2, Y_3, Y_4, Y) = \frac{1}{Z(Y_1, Y_2, Y_3, Y_4, Y)} \Psi(Y_1, Y_2, X_1, X_2, Y)\, \Psi(Y_2, Y_3, X_2, X_3, Y) \cdots$
$Z(Y_1, Y_2, Y_3, Y_4, Y) = \sum_{x_1, \dots, x_4} \Psi(Y_1, Y_2, x_1, x_2, Y)\, \Psi(Y_2, Y_3, x_2, x_3, Y) \cdots$
[Figure: chain CRF over $X_1, X_2, X_3$ with inputs $Y_1, Y_2, Y_3$]
Parameter Learning for Conditional Random Fields
$P(X_1, \dots, X_k \mid Y, \theta) = \frac{1}{Z(Y, \theta)} \exp\Big(\sum_{ij} \theta_{ij} X_i X_j Y + \sum_i \theta_i X_i Y\Big) = \frac{1}{Z(Y, \theta)} \prod_{ij} \exp(\theta_{ij} X_i X_j Y) \prod_i \exp(\theta_i X_i Y)$
$Z(Y, \theta) = \sum_x \prod_{ij} \exp(\theta_{ij} x_i x_j Y) \prod_i \exp(\theta_i x_i Y)$
(The products $X_i X_j Y$ and $X_i Y$ can be replaced by other feature functions)
[Figure: CRF over $X_1, X_2, X_3$ conditioned on $Y$]
Maximize the log conditional likelihood:
$c\ell(\theta, D) = \frac{1}{N} \log \prod_{l=1}^{N} \frac{1}{Z(y^l, \theta)} \prod_{ij} \exp(\theta_{ij} x_i^l x_j^l y^l) \prod_i \exp(\theta_i x_i^l y^l)$
$= \frac{1}{N} \sum_{l=1}^{N} \Big( \sum_{ij} \theta_{ij} x_i^l x_j^l y^l + \sum_i \theta_i x_i^l y^l - \log Z(y^l, \theta) \Big)$
The term $\log Z(y^l, \theta)$ does not decompose!
Derivatives of log likelihood
𝑐𝑙 πœƒ, 𝐷 =
πœ•π‘π‘™ πœƒ,𝐷
πœ•πœƒπ‘–π‘—
=
1
𝑁
=
1
𝑁
𝑁
𝑙
=
1
𝑁
𝑁
𝑙
1
𝑁
𝑁
𝑙
𝑙π‘₯ 𝑙
π‘₯
𝑖𝑗 𝑖 𝑗
𝑁 𝑙 𝑙 𝑙
𝑙 π‘₯𝑖 π‘₯𝑗 𝑦
−
𝑙π‘₯ 𝑙 𝑦 𝑙 +
πœƒ
π‘₯
𝑖𝑗
𝑖
𝑖𝑗
𝑗
(
𝑙π‘₯ 𝑙
π‘₯
𝑖
𝑖𝑗
𝑗
𝑦𝑙
−
1
Z(𝑦 𝑙 ,πœƒ)
𝑦𝑙
−
𝑙 𝑙
𝑙, πœƒ )
πœƒ
π‘₯
𝑦
−
log
𝑍
𝑦
𝑖
𝑖
𝑖
πœ• log 𝑍 𝑦 𝑙 ,πœƒ
πœ•πœƒπ‘–π‘—
𝐴 π‘π‘œπ‘›π‘£π‘’π‘₯ π‘π‘Ÿπ‘œπ‘π‘™π‘’π‘š
Can find global optimum
πœ• 𝑍 𝑦 𝑙 ,πœƒ
1
Z(𝑦 𝑙 ,πœƒ)
πœ•πœƒπ‘–π‘—
π‘₯
𝑖𝑗 exp(πœƒπ‘–π‘— 𝑋𝑖 𝑋𝑗
𝑦𝑙 )
𝑖 exp(πœƒπ‘– 𝑋𝑖
𝑦𝑙 )
𝑋𝑖 𝑋𝑗 𝑦 𝑙
𝑛𝑒𝑒𝑑 π‘‘π‘œ π‘‘π‘œ π‘–π‘›π‘“π‘’π‘Ÿπ‘’π‘›π‘π‘’ π‘“π‘œπ‘Ÿ π‘’π‘Žπ‘β„Ž 𝑦 𝑙 β€Ό!
30
Moment matching condition
πœ•π‘π‘™ πœƒ,𝐷
=
πœ•πœƒπ‘–π‘—
1 𝑁 𝑙 𝑙 𝑙
π‘₯ π‘₯ 𝑦
𝑁 𝑙 𝑖 𝑗
−
1
Z 𝑦 𝑙 ,πœƒ
P π‘Œ =
1
𝑁
𝑙
𝑙
exp
πœƒ
𝑋
𝑦
𝑋
𝑋
𝑦
𝑖
𝑖 𝑖
𝑖 𝑗
π‘π‘œπ‘£π‘Žπ‘Ÿπ‘–π‘Žπ‘›π‘π‘’ π‘šπ‘Žπ‘‘π‘Ÿπ‘–π‘₯
from model
π‘’π‘šπ‘π‘–π‘Ÿπ‘–π‘π‘Žπ‘™
π‘π‘œπ‘£π‘Žπ‘Ÿπ‘–π‘Žπ‘›π‘π‘’ π‘šπ‘Žπ‘‘π‘Ÿπ‘–π‘₯
P 𝑋𝑖 , 𝑋𝑗 , π‘Œ =
π‘₯
𝑙
exp
πœƒ
𝑋
𝑋
𝑦
𝑖𝑗 𝑖 𝑗
𝑖𝑗
1
𝑁
𝑙
𝑙
𝑁
𝑙=1 𝛿(𝑋𝑖 , π‘₯𝑖 ) 𝛿(𝑋𝑗 , π‘₯𝑗 )
𝛿(π‘Œ, 𝑦 𝑙 )
𝑁
𝑙
𝑙=1 𝛿(π‘Œ, 𝑦 )
Moment matching:
πœ•π‘π‘™ πœƒ,𝐷
πœ•πœƒπ‘–π‘—
= 𝐸P
𝑋𝑖 ,𝑋𝑗 ,π‘Œ
𝑋𝑖 𝑋𝑗 π‘Œ − 𝐸𝑃(𝑋|π‘Œ,πœƒ)P
π‘Œ
[𝑋𝑖 𝑋𝑗 π‘Œ]
31
Optimize MLE for undirected models
$\max_\theta c\ell(\theta, D)$ is a convex optimization problem. It can be solved by many methods, such as gradient descent or conjugate gradient:
Initialize the model parameters $\theta$
Loop until convergence:
Compute $\frac{\partial c\ell(\theta, D)}{\partial \theta_{ij}} = E_{\tilde{P}(X_i, X_j, Y)}[X_i X_j Y] - E_{P(X \mid Y, \theta)\tilde{P}(Y)}[X_i X_j Y]$
Update $\theta_{ij} \leftarrow \theta_{ij} + \eta\, \frac{\partial c\ell(\theta, D)}{\partial \theta_{ij}}$ (gradient ascent on $c\ell$)