Parameter Learning and EM
Le Song
Machine Learning II: Advanced Topics
CSE 8803ML, Spring 2012
Parameter Estimation and Prediction
The Bayesian approach treats the unknown parameters as a random variable:

P(θ|D) = P(D|θ) P(θ) / P(D) = P(D|θ) P(θ) / ∫ P(D|θ) P(θ) dθ

Posterior mean estimation: θ_Bayes = ∫ θ P(θ|D) dθ

Maximum likelihood approach:
θ_ML = argmax_θ P(D|θ),   θ_MAP = argmax_θ P(θ|D)

Bayesian prediction takes into account all possible values of θ:
P(x_new|D) = ∫ P(x_new, θ|D) dθ = ∫ P(x_new|θ) P(θ|D) dθ

A frequentist prediction uses a "plug-in" estimator:
P(x_new|D) = P(x_new|θ_ML)   or   P(x_new|D) = P(x_new|θ_MAP)

[Figure: plate model in which the parameter θ generates the N observed samples X and a new sample X_new.]
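As a concrete illustration of the three estimators and the two prediction styles (my addition, not from the slides), here is a minimal sketch for a Bernoulli parameter with a conjugate Beta(a, b) prior; the data and hyperparameters are made up.

# Sketch: ML, MAP, posterior-mean and Bayesian prediction for a Bernoulli
# parameter theta with a conjugate Beta(a, b) prior. Data and hyperparameters
# below are illustrative only.
import numpy as np

D = np.array([1, 0, 1, 1, 0, 1, 1, 1])   # observed coin flips (toy data)
a, b = 2.0, 2.0                          # Beta prior hyperparameters (assumed)
n1, n0 = D.sum(), len(D) - D.sum()

theta_ML    = n1 / (n1 + n0)                        # argmax_theta P(D|theta)
theta_MAP   = (n1 + a - 1) / (n1 + n0 + a + b - 2)  # argmax_theta P(theta|D)
theta_Bayes = (n1 + a) / (n1 + n0 + a + b)          # posterior mean, int theta P(theta|D) dtheta

# Bayesian prediction integrates over theta; for the Beta-Bernoulli model
# P(x_new = 1 | D) equals the posterior mean, while a plug-in prediction
# just substitutes a point estimate.
p_new_bayes  = theta_Bayes
p_new_plugin = theta_ML

print(theta_ML, theta_MAP, theta_Bayes, p_new_bayes, p_new_plugin)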
MLE for directed model
𝑙 πœƒ; 𝐷 = log 𝑃 𝐷 πœƒ =
𝑖 log 𝑃
π‘Žπ‘– |πœƒπ‘Ž +
𝑖 log 𝑃
𝑓 𝑖 |πœƒπ‘“ +
𝑖 π‘™π‘œπ‘”π‘ƒ
𝑠 𝑖 π‘Žπ‘– , 𝑓 𝑖 , πœƒπ‘  +
𝑖 π‘™π‘œπ‘”π‘ƒ(β„Ž
𝑖
|𝑠 𝑖 , πœƒβ„Ž )
One term for each CPT; break up MLE problem into independent subproblems
Because the factorization of the distribution, we can estimate each CPT
separately.
π΄π‘™π‘™π‘’π‘Ÿπ‘”π‘¦
π΄π‘™π‘™π‘’π‘Ÿπ‘”π‘¦
π΄π‘™π‘™π‘’π‘Ÿπ‘”π‘¦
𝐹𝑙𝑒
𝐹𝑙𝑒
𝑆𝑖𝑛𝑒𝑠
Learn separately
𝐹𝑙𝑒
𝑆𝑖𝑛𝑒𝑠
𝑆𝑖𝑛𝑒𝑠
π»π‘’π‘Žπ‘‘π‘Žπ‘β„Žπ‘’
π»π‘’π‘Žπ‘‘π‘Žπ‘β„Žπ‘’
3
MLE for BNs with tabular CPTs
Assume each CPT is represented as a table (multinomial):
θ_ijk = P(X_i = j | X_πi = k)
Note that in the case of multiple parents, X_πi has a composite state, and the CPT is a high-dimensional table.
The sufficient statistics are the counts of family configurations:
n_ijk = #{X_i = j and X_πi = k}
The log-likelihood is
L(θ; D) = log Π_ijk θ_ijk^{n_ijk} = Σ_ijk n_ijk log θ_ijk
Using a Lagrange multiplier to enforce Σ_j θ_ijk = 1, we get
θ_ijk^ML = n_ijk / Σ_j' n_ij'k
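A minimal sketch (mine, not from the slides) of this counting estimate for a single CPT P(X | parent); the toy samples, state counts, and variable names are assumptions for the example.

# Sketch: MLE for one tabular CPT P(X | parent) by counting family
# configurations; the toy data are illustrative only.
import numpy as np

# each row is one sample: (parent state k, child state j)
data = np.array([(0, 0), (0, 1), (0, 1), (1, 0), (1, 0), (1, 1), (1, 0)])
n_parent_states, n_child_states = 2, 2

counts = np.zeros((n_parent_states, n_child_states))
for k, j in data:
    counts[k, j] += 1                                   # n_jk = #{X = j and parent = k}

cpt = counts / counts.sum(axis=1, keepdims=True)        # theta_jk = n_jk / sum_j' n_j'k
print(cpt)   # row k is the estimated distribution P(X | parent = k)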
Bayesian estimator for directed models
Factorization: P(X = x) = Π_i P(x_i | pa_{X_i}, θ_i)
Local CPT: multinomial distribution P(X_i = k | Pa_{X_i} = j) = θ_kj
Factorized prior over parameters: P(θ_a) P(θ_f) P(θ_s) P(θ_h)
πœƒπ‘
πœƒπ‘Ž
π΄π‘™π‘™π‘’π‘Ÿπ‘”π‘¦
πœƒπ‘ 
𝐹𝑙𝑒
𝑆𝑖𝑛𝑒𝑠
πœƒβ„Ž
π»π‘’π‘Žπ‘‘π‘Žπ‘β„Žπ‘’
5
Parameter independence
Provided all variables are observed, we can perform the Bayesian update for each parameter independently (global parameter independence and local parameter independence).
Discrete DAG models: X_i | Pa_{X_i} ~ Multi(θ)

Dirichlet prior: P(θ) = ( Γ(Σ_k α_k) / Π_k Γ(α_k) ) Π_k θ_k^{α_k − 1}

[Figure: variables X_1, …, X_4 with separate parameter nodes θ_1, θ_21, θ_22, illustrating parameter independence.]
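Because of parameter independence and conjugacy, the Bayesian update for a tabular CPT reduces to a per-row Dirichlet update; below is a minimal sketch with made-up counts and hyperparameters.

# Sketch: Dirichlet-multinomial update for one CPT row, exploiting
# parameter independence; counts and hyperparameters are made up.
import numpy as np

alpha = np.array([1.0, 1.0, 1.0])     # Dirichlet prior for P(X_i | Pa = k)
n_jk  = np.array([5, 2, 0])           # observed counts n_ijk for this parent state

alpha_post = alpha + n_jk                         # posterior is Dirichlet(alpha + counts)
theta_post_mean = alpha_post / alpha_post.sum()   # Bayesian (posterior-mean) estimate
theta_ml = n_jk / n_jk.sum()                      # compare: MLE from the previous slide
print(theta_post_mean, theta_ml)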
MLE for undirected models
𝑃 𝑋1 , … , π‘‹π‘˜ |πœƒ =
=
1
𝑍 πœƒ
1
𝑍 πœƒ
exp
𝑖𝑗 exp(πœƒπ‘–π‘— 𝑋𝑖 𝑋𝑗 )
𝑍 πœƒ =
𝑙 πœƒ, 𝐷 = log(
π‘₯
𝑖𝑗 πœƒπ‘–π‘— 𝑋𝑖 𝑋𝑗
𝑁
𝑙
(
𝑙 𝑙
πœƒ
π‘₯
π‘₯𝑗 +
𝑖𝑗
𝑖
𝑖𝑗
π‘π‘Žπ‘› 𝑏𝑒 π‘œπ‘‘β„Žπ‘’π‘Ÿ π‘“π‘’π‘Žπ‘‘π‘’π‘Ÿπ‘’
π‘“π‘’π‘›π‘π‘‘π‘–π‘œπ‘› 𝑓 π‘₯𝑖
𝑖 exp(πœƒπ‘– 𝑋𝑖 )
𝑙 𝑙
exp
(πœƒ
π‘₯
𝑖𝑗 𝑖 π‘₯𝑗 )
𝑖𝑗
𝑙 π‘₯ 𝑙 )) +
= 𝑁
(
log
(exp
(πœƒ
π‘₯
𝑖𝑗 𝑖 𝑗
𝑙
𝑖𝑗
log 𝑍 πœƒ )
=
𝑖 πœƒπ‘– 𝑋𝑖
𝑖 exp(πœƒπ‘– 𝑋𝑖 )
𝑖𝑗 exp(πœƒπ‘–π‘— 𝑋𝑖 𝑋𝑗 )
1
𝑁
𝑙=1 𝑍 πœƒ
+
𝑋6
𝑋5
𝑋8
𝑋2
𝑋1
𝑋4
𝑋7
𝑋3
𝑋9
𝑙
exp
(πœƒ
π‘₯
𝑖 𝑖 ))
𝑖
𝑙
log(exp
(πœƒ
π‘₯
𝑖 𝑖 )) −
𝑖
𝑙
πœƒ
π‘₯
𝑖
𝑖
𝑖 − log 𝑍 πœƒ )
π‘‡π‘’π‘Ÿπ‘š π‘™π‘œπ‘”π‘ πœƒ π‘‘π‘œπ‘’π‘  π‘›π‘œπ‘‘
π‘‘π‘’π‘π‘œπ‘šπ‘π‘œπ‘ π‘’!
7
Derivatives of log likelihood
𝑙 πœƒ, 𝐷 =
1
𝑁
πœ•π‘™ πœƒ,𝐷
πœ•πœƒπ‘–π‘—
1
𝑁
πœ•π‘™ πœƒ,𝐷
πœ•πœƒπ‘–π‘—
=
1
𝑁
=
=
1
𝑁
𝑁 𝑙 𝑙
𝑙 π‘₯𝑖 π‘₯𝑗
𝑁
𝑙
𝑁
𝑙
𝑁
𝑙
−
(
𝑙π‘₯ 𝑙 +
πœƒ
π‘₯
𝑖𝑗
𝑖
𝑖𝑗
𝑗
𝑙π‘₯ 𝑙
π‘₯
𝑖𝑗 𝑖 𝑗
𝑙π‘₯ 𝑙
π‘₯
𝑖𝑗 𝑖 𝑗
1
Z(πœƒ)
π‘₯
−
−
𝑙
πœƒ
π‘₯
𝑖
𝑖
𝑖 − log 𝑍 πœƒ )
πœ• log 𝑍 πœƒ
πœ•πœƒπ‘–π‘—
𝐴 π‘π‘œπ‘›π‘£π‘’π‘₯ π‘π‘Ÿπ‘œπ‘π‘™π‘’π‘š
Can find global optimum
1 πœ•π‘ πœƒ
Z(πœƒ) πœ•πœƒπ‘–π‘—
𝑖𝑗 exp(πœƒπ‘–π‘— 𝑋𝑖 𝑋𝑗 )
𝑖 exp(πœƒπ‘– 𝑋𝑖 )
𝑋𝑖 𝑋𝑗
𝑛𝑒𝑒𝑑 π‘‘π‘œ π‘‘π‘œ π‘–π‘›π‘“π‘’π‘Ÿπ‘’π‘›π‘π‘’
8
Moment matching condition
πœ•π‘™ πœƒ,𝐷
=
πœ•πœƒπ‘–π‘—
1 𝑁 𝑙 𝑙
π‘₯ π‘₯
𝑁 𝑙 𝑖 𝑗
−
1
Z πœƒ
π‘₯
𝑖𝑗 exp
Moment
1
𝑁
𝑖 exp
πœƒπ‘– 𝑋𝑖
𝑋𝑖 𝑋𝑗
π‘π‘œπ‘£π‘Žπ‘Ÿπ‘–π‘Žπ‘›π‘π‘’ π‘šπ‘Žπ‘‘π‘Ÿπ‘–π‘₯
from model
π‘’π‘šπ‘π‘–π‘Ÿπ‘–π‘π‘Žπ‘™
π‘π‘œπ‘£π‘Žπ‘Ÿπ‘–π‘Žπ‘›π‘π‘’ π‘šπ‘Žπ‘‘π‘Ÿπ‘–π‘₯
P 𝑋𝑖 , 𝑋𝑗 =
πœƒπ‘–π‘— 𝑋𝑖 𝑋𝑗
𝑙
𝑙
𝑁
𝛿(𝑋
,
π‘₯
)
𝛿(𝑋
,
π‘₯
𝑖
𝑗
𝑙=1
𝑖
𝑗)
πœ•π‘™ πœƒ,𝐷
matching:
πœ•πœƒπ‘–π‘—
= 𝐸P
𝑋𝑖 ,𝑋𝑗
𝑋𝑖 𝑋𝑗 − 𝐸𝑃(𝑋|πœƒ) [𝑋𝑖 𝑋𝑗 ]
9
Optimize MLE for undirected models
max 𝑙 πœƒ, 𝐷 is a convex optimization problem.
πœƒ
Can be solve by many methods, such as gradient descent,
conjugate gradient.
Initialize model parameters πœƒ
Loop until convergence
Compute
πœ•π‘™ πœƒ,𝐷
πœ•πœƒπ‘–π‘—
= 𝐸P
Update πœƒπ‘–π‘— ← πœƒπ‘–π‘— − πœ‚
𝑋𝑖 ,𝑋𝑗
𝑋𝑖 𝑋𝑗 − 𝐸𝑃
π‘‹πœƒ
𝑋𝑖 𝑋𝑗
πœ•π‘™ πœƒ,𝐷
πœ•πœƒπ‘–π‘—
Or use the gradient equation for fixed point iteration: iterative
proportional fitting
10
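To make the "need to do inference" point concrete, here is a sketch (not from the slides) of this gradient loop for a tiny pairwise binary MRF, with Z(θ) and the model moments computed by brute-force enumeration; the chain structure, toy data, and step size η are all assumptions, and larger models need approximate inference because enumeration is exponential in the number of variables.

# Sketch: MLE by gradient ascent for a tiny pairwise binary MRF with
# exact inference by enumeration. Graph, data, and step size are made up.
import itertools
import numpy as np

edges = [(0, 1), (1, 2)]                      # tiny chain over 3 binary variables
data = np.array([[1, 1, 0], [1, 1, 1], [0, 0, 0], [1, 0, 0]])  # toy samples
N, d = data.shape

theta_node = np.zeros(d)
theta_edge = np.zeros(len(edges))
eta = 0.5

def model_moments(theta_node, theta_edge):
    # Exact E_{P(X|theta)}[X_i] and E[X_i X_j] by enumerating all 2^d states.
    states = np.array(list(itertools.product([0, 1], repeat=d)))
    scores = states @ theta_node
    for e, (i, j) in enumerate(edges):
        scores += theta_edge[e] * states[:, i] * states[:, j]
    p = np.exp(scores - scores.max())
    p /= p.sum()                              # normalization plays the role of Z(theta)
    node_mom = p @ states
    edge_mom = np.array([np.sum(p * states[:, i] * states[:, j]) for i, j in edges])
    return node_mom, edge_mom

emp_node = data.mean(axis=0)                                      # empirical moments
emp_edge = np.array([np.mean(data[:, i] * data[:, j]) for i, j in edges])

for _ in range(200):                          # gradient ascent on (1/N) log-likelihood
    mod_node, mod_edge = model_moments(theta_node, theta_edge)
    theta_node += eta * (emp_node - mod_node)
    theta_edge += eta * (emp_edge - mod_edge)

print(theta_node, theta_edge)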
Exponential Family
Random variable X:  P(X|θ) = (1/Z(θ)) h(X) exp(θ^T T(X)),   Z(θ) = ∫ h(X) exp(θ^T T(X)) dX

Equivalently, P(X|θ) = h(X) exp(θ^T T(X) − A(θ)), where
  h(X): base measure
  θ: canonical parameter
  T(X): sufficient statistics
  A(θ) = log Z(θ): log-partition function

Examples: Bernoulli, multinomial, Gaussian, Poisson, Gamma, …
Multivariate Gaussian
𝑃 𝑋 πœƒ = β„Ž 𝑋 exp(πœƒ ⊀ 𝑇 𝑋 − 𝐴 πœƒ )
Random variable 𝑋 ∈ π‘…π‘˜
1
𝑃 𝑋 πœ‡, Σ =
2πœ‹
=
1
π‘˜
2
1 exp −
Σ2
1
2
𝑋−πœ‡
1
2
⊀ −1
Σ
1
2
(𝑋 − πœ‡)
1
2
−1
𝑋𝑋 ⊀ ) + πœ‡βŠ€ Σ−1 𝑋 − πœ‡βŠ€ Σ −1 πœ‡ − log |Σ|
π‘˜ exp − tr(Σ
2πœ‹ 2
Exponential family representation
1
2
πœƒ = (Σ −1 πœ‡; − 𝑣𝑒𝑐 Σ −1 )
𝑇 𝑋 = π‘₯; 𝑣𝑒𝑐 𝑋𝑋 ⊀
1
2
1
2
𝐴 πœƒ = πœ‡βŠ€ Σ −1 πœ‡ + log |Σ|
β„Ž 𝑋 = 2πœ‹
π‘˜
2
12
Multinomial distribution
𝑃 𝑋 πœƒ = β„Ž 𝑋 exp(πœƒ ⊀ 𝑇 𝑋 − 𝐴 πœƒ )
multinomial distribution with k values
π‘œπ‘›π‘™π‘¦ π‘œπ‘›π‘’ π‘’π‘›π‘‘π‘Ÿπ‘¦
𝑖𝑠 π‘›π‘œπ‘› − π‘§π‘’π‘Ÿπ‘œπ‘ 
Binary vector 𝑋 ∈ 0,1 𝐾 , 𝑋 ∼ π‘šπ‘’π‘™π‘‘π‘–(𝑋|πœƒ)
π‘˜ π‘‹π‘˜ = 1, π‘˜ πœƒπ‘˜ = 1
𝑃 𝑋 πœƒ = πœƒ1 𝑋1 πœƒ2 𝑋2 … πœƒπΎ 𝑋𝐾 = exp
log πœƒπ‘˜ + 𝑋𝐾 log πœƒπΎ
= exp
𝐾−1
π‘˜=1 π‘‹π‘˜
𝐾−1
π‘˜=1 π‘‹π‘˜
= exp
𝐾−1
π‘˜=1 π‘‹π‘˜
log
= exp
πœƒ = log
πœƒπ‘˜
;0
πœƒπΎ
log πœƒπ‘˜ + (1 −
πœƒπ‘˜
1−
𝐾−1 πœƒ
π‘˜=1 π‘˜
𝐾
π‘˜=1 π‘‹π‘˜
ln πœƒπ‘˜
𝐾−1
π‘˜=1 π‘‹π‘˜ ) log(1
+ log(1 −
, 𝑇 𝑋 = 𝑋, 𝐴 πœƒ = −log(1 −
−
𝐾−1
π‘˜=1 πœƒπ‘˜ )
𝐾−1
π‘˜=1 πœƒπ‘˜ )
𝐾−1
π‘˜=1 πœƒπ‘˜ ) , β„Ž
𝑋 =1
13
Why exponential family?
Moment generating property: we can easily compute
moments of any exponential family distribution by taking the
derivatives of the log normalizer
Mean:
dA/dθ = d log Z(θ)/dθ = (1/Z(θ)) dZ(θ)/dθ
      = (1/Z(θ)) ∫ h(X) T(X) exp(θ^T T(X)) dX
      = ∫ T(X) ( h(X) exp(θ^T T(X)) / Z(θ) ) dX
      = E_{P(X|θ)}[T(X)]

Variance:
d^2A/dθ^2 = E_{P(X|θ)}[T(X)^2] − E_{P(X|θ)}[T(X)]^2 = Var[T(X)]
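A quick numerical sanity check (my own, not from the slides) for the Bernoulli case, where T(X) = X, the natural parameter is the log-odds, and A(θ) = log(1 + e^θ):

# Sketch: check dA/dtheta = E[T(X)] and d^2A/dtheta^2 = Var[T(X)] for a
# Bernoulli in natural form: T(X) = X, A(theta) = log(1 + exp(theta)).
import numpy as np

theta = 0.7                          # an arbitrary natural parameter
A = lambda t: np.log1p(np.exp(t))

eps = 1e-5
dA  = (A(theta + eps) - A(theta - eps)) / (2 * eps)                 # numerical dA/dtheta
d2A = (A(theta + eps) - 2 * A(theta) + A(theta - eps)) / eps**2     # numerical d^2A/dtheta^2

p = 1.0 / (1.0 + np.exp(-theta))     # mean E[X] implied by theta
print(dA, p)                         # dA/dtheta matches E[T(X)] = p
print(d2A, p * (1 - p))              # d^2A/dtheta^2 matches Var[T(X)] = p(1 - p)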
MLE for exponential family
For iid data, the log-likelihood is
ℓ(θ; D) = log Π_{l=1}^N h(x^l) exp(θ^T T(x^l) − A(θ))
        = Σ_l log h(x^l) + θ^T Σ_l T(x^l) − N A(θ)

Take the derivative and set it to zero:
∂ℓ(θ; D)/∂θ = Σ_l T(x^l) − N ∂A(θ)/∂θ = 0
⇒ ∂A(θ)/∂θ = (1/N) Σ_l T(x^l)

This is the moment matching condition for the exponential family.
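For example (not on the slide), for a Poisson variable T(X) = X and ∂A/∂θ = e^θ = λ, so moment matching says the MLE of λ is just the sample mean; a minimal check:

# Sketch: moment matching for a Poisson, where T(X) = X, theta = log(lambda),
# A(theta) = exp(theta), so dA/dtheta = lambda must equal the sample mean.
import numpy as np

x = np.array([3, 1, 4, 2, 2, 5, 0, 3])    # toy counts
lam_ml = x.mean()                          # (1/N) sum_l T(x^l)
theta_ml = np.log(lam_ml)                  # natural parameter at the MLE
print(lam_ml, np.exp(theta_ml))            # dA/dtheta at theta_ml equals the sample mean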
Partially observed graphical models
Speech recognition
Partially observed graphical models
Biological Evolution
Partially observed graphical models
Mixture Models
𝑁(πœ‡1 , Σ1 )
𝑍𝑖
𝑋𝑖
𝑁
𝑁(πœ‡2 , Σ2 )
18
Unobserved variables
A variable can be unobserved (latent, hidden, or missing) because:
It is an imaginary quantity meant to provide a simplified and abstract view of the data generation process
  E.g., mixture models, topic modeling, image context
It is a real-world object and/or phenomenon, but it is difficult or impossible to measure
  E.g., causes of disease, evolutionary ancestors
It is a real-world object and/or phenomenon, but it sometimes wasn't measured, e.g., because of faulty sensors

Discrete latent variables can be used to partition/cluster data into subgroups
Continuous latent variables (factors) can be used for dimensionality reduction (factor analysis, etc.)
Gaussian Mixture model
A density model p(X) may be multi-modal: model it as a mixture of uni-modal distributions (e.g., Gaussians).

Consider a mixture of K Gaussians:
p(X) = sum_k π_k N(X | μ_k, Σ_k)
where π_k are the mixture proportions and N(X | μ_k, Σ_k) are the mixture components.

Learn π_k, μ_k, Σ_k. This corresponds to
p(X) = sum_z p(X | z; μ_z, Σ_z) P(z; π)

Can be used for unsupervised clustering.

[Figure: plate model Z → X repeated N times, with two Gaussian components N(μ_1, Σ_1) and N(μ_2, Σ_2).]
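A minimal sketch (all parameter values made up) of evaluating and sampling from such a two-component mixture:

# Sketch: evaluate and sample a 2-component Gaussian mixture in 2D.
# All parameter values below are illustrative.
import numpy as np
from scipy.stats import multivariate_normal

pis  = np.array([0.4, 0.6])
mus  = [np.array([0.0, 0.0]), np.array([3.0, 3.0])]
Sigs = [np.eye(2), np.array([[1.0, 0.5], [0.5, 1.0]])]

def gmm_density(x):
    # p(x) = sum_k pi_k N(x | mu_k, Sigma_k)
    return sum(p * multivariate_normal.pdf(x, mean=m, cov=S)
               for p, m, S in zip(pis, mus, Sigs))

rng = np.random.default_rng(0)
z = rng.choice(len(pis), size=500, p=pis)                 # sample component Z ~ P(z; pi)
X = np.array([rng.multivariate_normal(mus[k], Sigs[k]) for k in z])   # sample X | z
print(gmm_density(np.array([1.0, 1.0])), X.shape)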
Why is learning hard?
In fully observed iid settings, the log-likelihood decomposes
into a sum of local terms
𝑙 πœƒ; 𝐷 = log 𝑝 π‘₯, 𝑧 πœƒ = log 𝑃 𝑧 πœƒ1 + log 𝑝(π‘₯|𝑧, πœƒ2 )
With latent variables, all the parameters become coupled
together via marginalization
𝑙 πœƒ; 𝐷 = log
𝑧𝑝
π‘₯, 𝑧 πœƒ = log
𝑧 𝑝(π‘₯| 𝑧, πœƒ2 )𝑃
𝑍
𝑍
𝑋
𝑋
𝑁
𝑧|πœƒ1
𝑁
21
Key questions in EM algorithm
EM: Expectation-Maximization for finding θ that maximizes
ℓ(θ; D) = log Σ_z p(x, z | θ) = log Σ_z p(x | z, θ_2) P(z | θ_1)

Expectation step (E-step)
  What distribution do we take the expectation with? q(z) = P(z | x, θ)
  What do we take the expectation over? f(θ) = E_{q(z)}[log p(x, z | θ)]

Maximization step (M-step)
  What do we maximize? f(θ)
  What do we maximize with respect to? θ
Example: Gaussian mixture model
A mixture of K Gaussians:

Z is the latent class indicator vector:
P(Z | θ) = θ_1^{Z_1} θ_2^{Z_2} … θ_K^{Z_K}

X is a conditional Gaussian variable with class-specific mean and covariance:
P(X | z_k = 1, μ, Σ) = (1 / ((2π)^{d/2} |Σ_k|^{1/2})) exp( −(1/2) (X − μ_k)^T Σ_k^{-1} (X − μ_k) )

The likelihood of a sample:
P(x_i | θ, μ, Σ) = sum_k P(z_k = 1 | θ) P(x_i | z_k = 1, μ, Σ) = sum_k θ_k N(x_i | μ_k, Σ_k)

The expected complete log-likelihood:
<ℓ_c({x, z}; θ, μ, Σ)>_{P(Z|{x})} = sum_i <log P(z^i | θ)>_{P(Z|{x})} + sum_i <log p(x_i | z^i, μ, Σ)>_{P(Z|{x})}
  = sum_i sum_k <z_k^i>_{P(Z|{x})} log θ_k − (1/2) sum_i sum_k <z_k^i>_{P(Z|{x})} ( (x_i − μ_k)^T Σ_k^{-1} (x_i − μ_k) + log|Σ_k| + C )

[Figure: plate model Z → X repeated N times.]
E-step
We maximize <ℓ_c({x, z}; θ, μ, Σ)>_{P(Z|{x})} iteratively using the following procedure:

Expectation step: compute the expected value of the sufficient statistics of the hidden variables (z) given the current estimate of the parameters (θ, μ, Σ):

τ_k^i = <z_k^i>_{P(Z|{x})} = P(z_k^i = 1 | x, μ, Σ) = θ_k N(x_i | μ_k, Σ_k) / sum_k' θ_k' N(x_i | μ_k', Σ_k')

We are essentially doing inference.
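A minimal sketch of this E-step (function and variable names are mine): tau is an N × K matrix whose entry tau[i, k] is the responsibility τ_k^i.

# Sketch: E-step for a Gaussian mixture. Responsibilities
# tau[i, k] = theta_k N(x_i | mu_k, Sigma_k) / sum_k' theta_k' N(x_i | mu_k', Sigma_k')
import numpy as np
from scipy.stats import multivariate_normal

def e_step(X, thetas, mus, Sigmas):
    N, K = X.shape[0], len(thetas)
    tau = np.zeros((N, K))
    for k in range(K):
        tau[:, k] = thetas[k] * multivariate_normal.pdf(X, mean=mus[k], cov=Sigmas[k])
    tau /= tau.sum(axis=1, keepdims=True)     # normalize over components
    return tau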
M-step
We maximize <ℓ_c({x, z}; θ, μ, Σ)>_{P(Z|{x})} iteratively using the following procedure:

Maximization step: compute the parameters that maximize the current expected complete log-likelihood
<ℓ_c({x, z}; θ, μ, Σ)>_{P(Z|{x})} = sum_i sum_k τ_k^i log θ_k − (1/2) sum_i sum_k τ_k^i ( (x_i − μ_k)^T Σ_k^{-1} (x_i − μ_k) + log|Σ_k| + C )

θ_k = argmax_θ_k <ℓ_c>_{P(Z|{x})}, s.t. sum_k θ_k = 1  ⇒  θ_k = (1/N) sum_i τ_k^i
μ_k = argmax_μ_k <ℓ_c>_{P(Z|{x})}  ⇒  μ_k = sum_i τ_k^i x_i / sum_i τ_k^i
Σ_k = argmax_Σ_k <ℓ_c>_{P(Z|{x})}  ⇒  Σ_k = sum_i τ_k^i (x_i − μ_k)(x_i − μ_k)^T / sum_i τ_k^i

Facts used in the derivation:
∂ log|A^{-1}| / ∂A^{-1} = A^T,   ∂ (x^T A x) / ∂A = x x^T
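A matching sketch of the M-step updates, plus a simple EM loop that alternates it with the hypothetical e_step above; both are illustrations under made-up names, not a reference implementation.

# Sketch: M-step for a Gaussian mixture using the responsibilities tau from
# the E-step sketch above, followed by a simple EM loop.
import numpy as np

def m_step(X, tau):
    N, d = X.shape
    Nk = tau.sum(axis=0)                                  # sum_i tau_k^i
    thetas = Nk / N                                       # theta_k = (1/N) sum_i tau_k^i
    mus = (tau.T @ X) / Nk[:, None]                       # mu_k = sum_i tau_k^i x_i / sum_i tau_k^i
    Sigmas = []
    for k in range(tau.shape[1]):
        diff = X - mus[k]
        Sigmas.append((tau[:, k, None] * diff).T @ diff / Nk[k])   # weighted scatter matrix
    return thetas, mus, np.array(Sigmas)

def em_gmm(X, thetas, mus, Sigmas, iters=50):
    for _ in range(iters):
        tau = e_step(X, thetas, mus, Sigmas)              # E-step (see sketch above)
        thetas, mus, Sigmas = m_step(X, tau)              # M-step
    return thetas, mus, Sigmas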
Expectation-Maximization Iterations
K-means vs EM for Gaussian mixture
The EM algorithm for a mixture of Gaussians is like a soft clustering algorithm.

K-means:
"E-step": we do a hard assignment:
z^i = argmin_k (x_i − μ_k)^T Σ_k^{-1} (x_i − μ_k)
"M-step": we update the means and covariances of the clusters using the maximum likelihood estimate:
μ_k = sum_i δ(z^i, k) x_i / sum_i δ(z^i, k)
Σ_k = sum_i δ(z^i, k) (x_i − μ_k)(x_i − μ_k)^T / sum_i δ(z^i, k)
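For comparison, a sketch of the corresponding hard-assignment step; plain K-means uses Euclidean distance, i.e., it effectively fixes Σ_k = I (names and structure are mine).

# Sketch: K-means as "hard EM", with hard assignments instead of responsibilities.
# Plain K-means uses Euclidean distance, i.e., Sigma_k = I; assumes every
# cluster keeps at least one point.
import numpy as np

def kmeans_step(X, mus):
    d2 = ((X[:, None, :] - mus[None, :, :]) ** 2).sum(axis=2)   # squared distances (N, K)
    z = d2.argmin(axis=1)                                       # hard "E-step": z_i = argmin_k
    new_mus = np.array([X[z == k].mean(axis=0) for k in range(len(mus))])  # "M-step"
    return z, new_mus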
Theory underlying EM
What are we doing?
Recall that according to MLE, we intend to learn the model parameters that maximize the likelihood of the data.
But we are iterating these two steps:

Expectation step (E-step):
f(θ) = E_{q(z)}[log p(x, z | θ)], where q(z) = P(z | x, θ^t)

Maximization step (M-step):
θ^{t+1} = argmax_θ f(θ)