
Parameter Learning
Le Song
Machine Learning II: Advanced Topics
CSE 8803ML, Spring 2012
Sampling Methods
Direct Sampling
Works only for easy distributions (multinomial, Gaussian etc.)
Rejection Sampling
Create samples like direct sampling
Only count samples consistent with given evidence
Importance Sampling
Create samples like direct sampling
Assign weights to samples
Gibbs Sampling
Often used for high-dimensional problems
Sample each variable conditioned on its Markov blanket
Gibbs Sampling in formula
Gibbs sampling:
Initialize $X = x^0$
For $t = 1$ to $N$:
    $x_1^t \sim P(X_1 \mid x_2^{t-1}, \ldots, x_K^{t-1})$
    $x_2^t \sim P(X_2 \mid x_1^t, x_3^{t-1}, \ldots, x_K^{t-1})$
    ...
    $x_K^t \sim P(X_K \mid x_1^t, \ldots, x_{K-1}^t)$
For graphical models, we only need to condition on the variables in the Markov blanket.
[Figure: example graphical model with nodes $X_1, \ldots, X_5$]
Variants:
Randomly pick the variable to sample
Sample block by block, e.g. $x_1^t, x_2^t \sim P(X_1, X_2 \mid x_3^{t-1}, \ldots, x_K^{t-1})$
Gibbs Sampling: Image Segmentation
Pairwise Markov random field: $P(X) = \frac{1}{Z} \prod_i \Psi(X_i) \prod_{ij} \Psi(X_i, X_j)$
[Figure: 3x3 grid MRF over pixels $X_1, \ldots, X_9$]
Need the conditional $P(x_i \mid x_1, \ldots, x_{i-1}, x_{i+1}, \ldots, x_K)$:
$P(x_i \mid x_{-i}) = \frac{P(x_1, \ldots, x_K)}{P(x_1, \ldots, x_{i-1}, x_{i+1}, \ldots, x_K)} = \frac{\frac{1}{Z} \prod_l \Psi(x_l) \prod_{lj} \Psi(x_l, x_j)}{\frac{1}{Z} \sum_{x_i} \prod_l \Psi(x_l) \prod_{lj} \Psi(x_l, x_j)}$
Terms without $x_i$ cancel out; $x_i$ is summed out in the denominator:
$P(x_i \mid x_{-i}) \propto \Psi(x_i) \prod_{j \in N(i)} \Psi(x_i, x_j)$
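A small sketch of how this conditional can be used to resample one pixel of a binary pairwise MRF; the potentials psi_node and psi_edge and the neighbors structure are illustrative assumptions, not part of the slides.

```python
import numpy as np

def sample_pixel(i, x, neighbors, psi_node, psi_edge, rng):
    """Draw x_i from P(x_i | x_{N(i)}) ∝ Ψ(x_i) ∏_{j in N(i)} Ψ(x_i, x_j)."""
    states = [0, 1]                      # binary labels
    weights = np.array([
        psi_node(s) * np.prod([psi_edge(s, x[j]) for j in neighbors[i]])
        for s in states
    ])
    probs = weights / weights.sum()      # normalize over the two states
    return rng.choice(states, p=probs)
```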
Convergence of Gibbs Sampling
Not all samples $x^0, \ldots, x^T$ are independent.
Consider a particular marginal $P(x_i \mid e_i)$:
[Figure: the empirical estimate of $P(x_i \mid e_i)$ plotted against iteration $t$ approaches the true value; samples are taken only after an initial burn-in period.]
Parameter Learning Example
Estimate the probability $\theta$ of landing heads for a biased coin
Given a sequence of $N$ independently and identically distributed (iid) flips
E.g., $D = \{x_1, x_2, \ldots, x_N\} = \{1, 0, 1, \ldots, 0\}$, $x_i \in \{0, 1\}$
Model: 𝑃 π‘₯|πœƒ = πœƒ π‘₯ 1 − πœƒ
𝑃(π‘₯|πœƒ ) =
1−π‘₯
1 − πœƒ, π‘“π‘œπ‘Ÿ π‘₯ = 0
πœƒ,
π‘“π‘œπ‘Ÿ π‘₯ = 1
Likelihood of a single observation π‘₯𝑖 ?
𝑃 π‘₯𝑖 |πœƒ = πœƒ
π‘₯𝑖
1−πœƒ
1−π‘₯𝑖
πœƒ
𝑋
𝑁
Bayesian Parameter Estimation
The Bayesian approach treats the unknown parameters as a random variable whose distribution can be inferred using Bayes' rule:
𝑃(πœƒ|𝐷) =
𝑃 𝐷 πœƒ 𝑃(πœƒ)
𝑃(𝐷)
=
𝑃 𝐷 πœƒ 𝑃(πœƒ)
∫ 𝑃 𝐷 πœƒ 𝑃 πœƒ π‘‘πœƒ
Prior over πœƒ, Beta distribution
𝑃 πœƒ; 𝛼, 𝛽 ∝ πœƒ 𝛼−1 1 − πœƒ
𝛽−1
Posterior distribution πœƒ
𝑃 π‘₯1 ,…,π‘₯𝑁 πœƒ 𝑃 πœƒ
𝑃 π‘₯1 ,…,π‘₯𝑁
𝑛𝑑 πœƒ 𝛼−1 1 − πœƒ 𝛽−1 =
𝑃 πœƒ|π‘₯1 , … , π‘₯𝑁 =
∝
πœƒ π‘›β„Ž 1 − πœƒ
πœƒ π‘›β„Ž +𝛼−1 1 − πœƒ
𝑛𝑑 +𝛽−1
Posterior mean estimation:
πœƒπ‘π‘Žπ‘¦π‘’π‘  = ∫ πœƒ 𝑃 πœƒ 𝐷 π‘‘πœƒ = 𝐢∫ πœƒ × πœƒ π‘›β„Ž +𝛼−1 1 − πœƒ
𝑛𝑑 +𝛽−1 π‘‘πœƒ
=
(π‘›β„Ž +𝛼)
𝑁+𝛼+𝛽
MLE for Biased Coin
πœƒπ‘€πΏ = π‘Žπ‘Ÿπ‘”π‘šπ‘Žπ‘₯πœƒ 𝑃(𝐷|πœƒ)
Objective function, log likelihood
𝑙 πœƒ; 𝐷 = log 𝑃 𝐷 πœƒ = log πœƒ π‘›β„Ž 1 − πœƒ
𝑁 − π‘›β„Ž log 1 − πœƒ
𝑛𝑑
= π‘›β„Ž log πœƒ +
We need to maximize this w.r.t. πœƒ
Take derivatives w.r.t. πœƒ
πœ•π‘™
πœ•πœƒ
=
π‘›β„Ž
πœƒ
−
𝑁−π‘›β„Ž
1−πœƒ
= 0 ⇒ πœƒπ‘€πΏπΈ =
π‘›β„Ž
𝑁
or πœƒπ‘€πΏπΈ =
1
𝑁
𝑖 π‘₯𝑖
How should estimators be used?
$\theta_{MAP} = \arg\max_\theta P(\theta \mid D)$ is not Bayesian (even though it uses a prior) since it is a point estimate.
Consider predicting the future. A sensible way is to combine predictions based on all possible values of $\theta$, weighted by their posterior probability; this is called Bayesian prediction:
$P(x_{new} \mid D) = \int P(x_{new}, \theta \mid D)\, d\theta = \int P(x_{new} \mid \theta, D)\, P(\theta \mid D)\, d\theta = \int P(x_{new} \mid \theta)\, P(\theta \mid D)\, d\theta$
[Figure: plate model with parameter $\theta$, observations $X$ replicated $N$ times, and a new observation $X_{new}$]
A frequentist prediction will typically use a "plug-in" estimator such as ML/MAP:
$P(x_{new} \mid D) = P(x_{new} \mid \theta_{ML})$ or $P(x_{new} \mid D) = P(x_{new} \mid \theta_{MAP})$
Learning Graphical Models
The goal: given a set of independent samples (assignments of the random variables), find the best (most likely) graphical model, both the structure and the parameters.
[Figure: fully observed samples over $(A, F, S, N, H)$ are used for structure learning (recovering the graph) and for parameter learning (filling in the CPTs)]
(A,F,S,N,H) = (T,F,F,T,F)
(A,F,S,N,H) = (T,F,T,T,F)
…
(A,F,S,N,H) = (F,T,T,T,T)
Example CPT for $P(S \mid F, A)$:
F,A:     T,T    T,F    F,T    F,F
S = t:   0.9    0.7    0.8    0.2
S = f:   0.1    0.3    0.2    0.8
MLE for General Bayesian Networks
If we assume that the parameters for each CPT are globally independent, and all nodes are fully observed, then the log-likelihood function decomposes into a sum of local terms, one per node:
𝑙 πœƒ; 𝐷 = log 𝑃 𝐷 πœƒ
= log
𝑖
𝑖
𝑖
𝑃
π‘₯
π‘π‘Ž
𝑋𝑗 , πœƒπ‘— ) =
𝑗
𝑗
𝑖
𝑖
𝑖
log
𝑃
π‘₯
π‘π‘Ž
𝑋𝑗 , πœƒπ‘— )
𝑗
𝑗
π΄π‘™π‘™π‘’π‘Ÿπ‘”π‘¦
For each variable 𝑋𝑖 :
𝑃𝑀𝐿𝐸 𝑋𝑖 = π‘₯𝑖 π‘ƒπ‘Žπ‘‹π‘– = 𝑒 =
Why?
𝐹𝑙𝑒
#(𝑋𝑖 =π‘₯ ,π‘ƒπ‘Žπ‘‹π‘– =𝑒)
#(π‘ƒπ‘Žπ‘‹π‘– =𝑒)
𝑆𝑖𝑛𝑒𝑠
π‘π‘œπ‘ π‘’
π»π‘’π‘Žπ‘‘π‘Žπ‘β„Žπ‘’
Decomposable likelihood of directed model
𝑙 πœƒ; 𝐷 = log 𝑃 𝐷 πœƒ =
𝑖 log 𝑃
π‘Žπ‘– |πœƒπ‘Ž +
𝑖 log 𝑃
𝑓 𝑖 |πœƒπ‘“ +
𝑖 π‘™π‘œπ‘”π‘ƒ
𝑠 𝑖 π‘Žπ‘– , 𝑓 𝑖 , πœƒπ‘  +
𝑖 π‘™π‘œπ‘”π‘ƒ(β„Ž
𝑖
|𝑠 𝑖 , πœƒβ„Ž )
One term for each CPT; break up MLE problem into independent subproblems
Because the factorization of the distribution, we can estimate each CPT
separately.
π΄π‘™π‘™π‘’π‘Ÿπ‘”π‘¦
π΄π‘™π‘™π‘’π‘Ÿπ‘”π‘¦
π΄π‘™π‘™π‘’π‘Ÿπ‘”π‘¦
𝐹𝑙𝑒
𝐹𝑙𝑒
𝑆𝑖𝑛𝑒𝑠
Learn separately
𝐹𝑙𝑒
𝑆𝑖𝑛𝑒𝑠
𝑆𝑖𝑛𝑒𝑠
π»π‘’π‘Žπ‘‘π‘Žπ‘β„Žπ‘’
π»π‘’π‘Žπ‘‘π‘Žπ‘β„Žπ‘’
MLE for BNs with tabular CPTs
Assume each CPT is represented as a table (multinomial):
$\theta_{ijk} = P(X_i = j \mid X_{\pi_i} = k)$
Note that in the case of multiple parents, $X_{\pi_i}$ will have a composite state, and the CPT will be a high-dimensional table.
The sufficient statistics are counts of family configurations:
$n_{ijk} = \#(X_i = j, X_{\pi_i} = k)$
The log-likelihood is
$L(\theta; D) = \log \prod_{ijk} \theta_{ijk}^{n_{ijk}} = \sum_{ijk} n_{ijk} \log \theta_{ijk}$
Using a Lagrange multiplier to enforce $\sum_j \theta_{ijk} = 1$, we get
$\theta_{ijk}^{ML} = \frac{n_{ijk}}{\sum_{j'} n_{ij'k}}$
MLE and Kullback-Leibler divergence
KL divergence:
$D(Q(X) \,\|\, P(X)) = \sum_x Q(x) \log \frac{Q(x)}{P(x)}$
Empirical distribution:
$\tilde{P}(X) = \frac{1}{N} \sum_{i=1}^N \delta(X, x_i)$
where $\delta(X, x_i)$ is a Kronecker delta function.
Maximizing the likelihood is equivalent to minimizing the KL divergence to the empirical distribution:
$D(\tilde{P}(X) \,\|\, P(X \mid \theta)) = \sum_x \tilde{P}(x) \log \frac{\tilde{P}(x)}{P(x \mid \theta)} = \sum_x \tilde{P}(x) \log \tilde{P}(x) - \sum_x \tilde{P}(x) \log P(x \mid \theta)$
$= \text{const.} - \frac{1}{N} \sum_{i=1}^N \log P(x_i \mid \theta) = \text{const.} - \frac{1}{N}\, l(\theta; D)$
Bayesian estimator for tabular CPTs
Factorization 𝑃 𝑋 = π‘₯ =
𝑖𝑃
π‘₯𝑖 π‘π‘Žπ‘‹π‘– , πœƒπ‘– )
Local CPT: multinomial distribution 𝑃 𝑋𝑖 = π‘˜ π‘ƒπ‘Žπ‘‹π‘– = 𝑗 = πœƒπ‘˜π‘—
Put prior distribution over parameters 𝑃(πœƒπ‘Ž , πœƒπ‘ , πœƒπ‘  , πœƒβ„Ž )
πœƒπ‘
πœƒπ‘Ž
π΄π‘™π‘™π‘’π‘Ÿπ‘”π‘¦
πœƒπ‘ 
𝐹𝑙𝑒
𝑆𝑖𝑛𝑒𝑠
πœƒβ„Ž
π»π‘’π‘Žπ‘‘π‘Žπ‘β„Žπ‘’
Global and Local Parameter Independence
Global parameter independence
Parameters for all nodes in a GM factorize:
$P(\theta_a, \theta_f, \theta_s, \theta_h) = P(\theta_a) P(\theta_f) P(\theta_s) P(\theta_h)$
Local parameter independence
Parameters within each node factorize:
$P(X_i = k \mid Pa_{X_i} = j) = \theta_{kj}, \qquad P(\theta_a) = \prod_{kj} P(\theta_{kj})$
[Figure: the Allergy/Flu/Sinus/Headache network with parameter nodes $\theta_a, \theta_f, \theta_s, \theta_h$]
Parameter Independence Example
Provided all variables are observed, we can perform the Bayesian update for each parameter independently.
[Figure: two models over $X_1, X_2$ illustrating global parameter independence (a separate parameter $\theta_1$ for $X_1$) and local parameter independence (separate parameters $\theta_{21}, \theta_{22}$ for $X_2$, one per configuration of its parent $X_1$)]
Which distributions satisfy the independence assumptions?
Discrete DAG models:
$X_i \mid Pa_{X_i} \sim \text{Multi}(\theta)$
Dirichlet prior: $P(\theta) = \frac{\Gamma(\sum_k \alpha_k)}{\prod_k \Gamma(\alpha_k)} \prod_k \theta_k^{\alpha_k - 1}$
Gaussian DAG models:
$X_i \mid Pa_{X_i} \sim \text{Normal}(\mu, \Sigma)$
Normal prior: $P(\mu \mid \nu, \Psi) \propto \exp\left( -\tfrac{1}{2} (\mu - \nu)' \Psi^{-1} (\mu - \nu) \right)$
Parameter sharing
Consider a time-invariant (stationary) first-order Markov model
Initial state probability vector: $\pi_j = P(X_1 = j)$
State transition probability matrix: $A_{ij} = P(X_t = j \mid X_{t-1} = i)$
[Figure: chain $X_1 \to X_2 \to X_3 \to \cdots \to X_T$ with initial distribution $\pi$ and shared transition matrix $A$]
The likelihood of one sequence:
$P(x_{1:T} \mid \theta) = P(x_1 \mid \pi) \prod_{t=2}^T P(x_t \mid x_{t-1})$
The log-likelihood of $N$ sequences:
$l(\theta; D) = \sum_{l=1}^N \log P(x_1^l \mid \pi) + \sum_{l=1}^N \sum_{t=2}^T \log P(x_t^l \mid x_{t-1}^l)$
Again, we can optimize each parameter separately.
$\pi$ can simply be estimated by the frequency of initial states.
How about $A$?
Learning a Markov chain transition matrix
$A$ is a stochastic matrix: $\sum_j A_{ij} = 1$
Each row of $A$ is a multinomial distribution.
The MLE of $A_{ij}$ is the fraction of transitions from $i$ to $j$:
$A_{ij}^{ML} = \frac{\#(i \to j)}{\#(i \to \cdot)}$
Application:
If the states of $X_t$ represent words, this is called a bigram language model.
Small sample size problem:
If $i \to j$ did not occur in the data, we will have $A_{ij}^{ML} = 0$.
A standard hack: backoff smoothing $A_{i\cdot}^{S} = \lambda \eta + (1-\lambda) A_{i\cdot}^{ML}$
Bayesian model for Markov chain
Global and local parameter independence
[Figure: chain $X_1 \to X_2 \to X_3 \to \cdots \to X_T$ with initial distribution $\pi$ and separate parameter nodes $A_{i\to\cdot}$, $A_{i'\to\cdot}$ for the rows of the transition matrix]
The posterior of $A_{i\to\cdot}$ and $A_{i'\to\cdot}$ is factorized despite the v-structure.
Assign a Dirichlet prior $\beta_i$ to each row of the transition matrix:
$A_{ij}^{Bayes} = \frac{\#(i \to j) + \beta_{ij}}{\#(i \to \cdot) + \sum_j \beta_{ij}} = \lambda \beta'_{ij} + (1-\lambda) A_{ij}^{ML}$, where $\beta'_{ij} = \frac{\beta_{ij}}{\sum_{j'} \beta_{ij'}}$ and $\lambda = \frac{\sum_j \beta_{ij}}{\#(i \to \cdot) + \sum_j \beta_{ij}}$
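A sketch covering both the ML estimate of the previous slide and the Dirichlet posterior-mean estimate above; the state sequences and pseudo-counts are illustrative.

```python
import numpy as np

def transition_counts(sequences, num_states):
    """Count transitions #(i -> j) over a list of state sequences."""
    counts = np.zeros((num_states, num_states))
    for seq in sequences:
        for prev, nxt in zip(seq[:-1], seq[1:]):
            counts[prev, nxt] += 1
    return counts

def mle_transitions(counts):
    """A^ML_ij = #(i -> j) / #(i -> .); rows with no outgoing counts are left uniform."""
    row = counts.sum(axis=1, keepdims=True)
    return np.where(row > 0, counts / np.maximum(row, 1), 1.0 / counts.shape[1])

def bayes_transitions(counts, beta):
    """Posterior mean A^Bayes_ij = (#(i -> j) + β_ij) / (#(i -> .) + Σ_j β_ij)."""
    return (counts + beta) / (counts + beta).sum(axis=1, keepdims=True)

seqs = [[0, 1, 1, 2, 0], [1, 2, 2, 0]]
C = transition_counts(seqs, num_states=3)
print(mle_transitions(C))
print(bayes_transitions(C, beta=np.ones((3, 3))))   # uniform Dirichlet (add-one) smoothing
```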
MLE for Exponential Family
𝑃 𝑋1 , … , π‘‹π‘˜ |πœƒ =
1
=π‘πœƒ
1
𝑍 πœƒ
exp
𝑖𝑗 exp(πœƒπ‘–π‘— 𝑋𝑖 𝑋𝑗 )
𝑍 πœƒ =
𝑙 πœƒ, 𝐷 = log(
π‘₯
𝑖𝑗 πœƒπ‘–π‘— 𝑋𝑖 𝑋𝑗
𝑁
𝑙
(
𝑙 𝑙
πœƒ
π‘₯
π‘₯𝑗 +
𝑖𝑗
𝑖
𝑖𝑗
𝑖 exp(πœƒπ‘– 𝑋𝑖 )
𝑙 𝑙
exp
(πœƒ
π‘₯
𝑖𝑗 𝑖 π‘₯𝑗 )
𝑖𝑗
𝑙 π‘₯ 𝑙 )) +
= 𝑁
(
log
(exp
(πœƒ
π‘₯
𝑖𝑗 𝑖 𝑗
𝑙
𝑖𝑗
log 𝑍 πœƒ )
=
𝑖 πœƒπ‘– 𝑋𝑖
𝑖 exp(πœƒπ‘– 𝑋𝑖 )
𝑖𝑗 exp(πœƒπ‘–π‘— 𝑋𝑖 𝑋𝑗 )
1
𝑁
𝑙=1 𝑍 πœƒ
+
𝑙
exp
(πœƒ
π‘₯
𝑖 𝑖 ))
𝑖
𝑙
log(exp
(πœƒ
π‘₯
𝑖 𝑖 )) −
𝑖
𝑙
πœƒ
π‘₯
𝑖
𝑖
𝑖 − log 𝑍 πœƒ )
π‘π‘Žπ‘› 𝑏𝑒 π‘œπ‘‘β„Žπ‘’π‘Ÿ π‘“π‘’π‘Žπ‘‘π‘’π‘Ÿπ‘’ π‘“π‘’π‘›π‘π‘‘π‘–π‘œπ‘› 𝑓 π‘₯𝑖
MLE for Exponential Family
𝑙 πœƒ, 𝐷 =
1
𝑁
πœ•π‘™ πœƒ,𝐷
πœ•πœƒπ‘–π‘—
1
𝑁
πœ•π‘™ πœƒ,𝐷
πœ•πœƒπ‘–π‘—
=
1
𝑁
=
=
1
𝑁
𝑁 𝑙 𝑙
𝑙 π‘₯𝑖 π‘₯𝑗
𝑁
𝑙
𝑁
𝑙
𝑁
𝑙
−
(
𝑙π‘₯ 𝑙 +
πœƒ
π‘₯
𝑖𝑗
𝑖
𝑖𝑗
𝑗
𝑙π‘₯ 𝑙
π‘₯
𝑖𝑗 𝑖 𝑗
𝑙π‘₯ 𝑙
π‘₯
𝑖𝑗 𝑖 𝑗
1
Z(πœƒ)
π‘₯
−
−
𝑙
πœƒ
π‘₯
𝑖
𝑖
𝑖 − log 𝑍 πœƒ )
πœ• log 𝑍 πœƒ
πœ•πœƒπ‘–π‘—
1 πœ•π‘ πœƒ
Z(πœƒ) πœ•πœƒπ‘–π‘—
𝑖𝑗 exp(πœƒπ‘–π‘— 𝑋𝑖 𝑋𝑗 )
𝑖 exp(πœƒπ‘– 𝑋𝑖 )
𝑋𝑖 𝑋𝑗
𝑛𝑒𝑒𝑑 π‘‘π‘œ π‘‘π‘œ π‘–π‘›π‘“π‘’π‘Ÿπ‘’π‘›π‘π‘’
Moment Matching condition
πœ•π‘™ πœƒ,𝐷
=
πœ•πœƒπ‘–π‘—
1 𝑁 𝑙 𝑙
π‘₯ π‘₯
𝑁 𝑙 𝑖 𝑗
−
1
Z πœƒ
πœ•π‘™ πœƒ,𝐷
πœ•πœƒπ‘–π‘—
1
𝑁
= 𝐸P
𝑖𝑗 exp
πœƒπ‘–π‘— 𝑋𝑖 𝑋𝑗
𝑖 exp
πœƒπ‘– 𝑋𝑖
𝑋𝑖 𝑋𝑗
π‘π‘œπ‘£π‘Žπ‘Ÿπ‘–π‘Žπ‘›π‘π‘’ π‘šπ‘Žπ‘‘π‘Ÿπ‘–π‘₯
from model
π‘’π‘šπ‘π‘–π‘Ÿπ‘–π‘π‘Žπ‘™
π‘π‘œπ‘£π‘Žπ‘Ÿπ‘–π‘Žπ‘›π‘π‘’ π‘šπ‘Žπ‘‘π‘Ÿπ‘–π‘₯
P 𝑋𝑖 , 𝑋𝑗 =
π‘₯
𝑙
𝑙
𝑁
𝛿(𝑋
,
π‘₯
)
𝛿(𝑋
,
π‘₯
𝑖
𝑗
𝑙=1
𝑖
𝑗)
𝑋𝑖 ,𝑋𝑗
𝑋𝑖 𝑋𝑗 − 𝐸𝑃(𝑋|πœƒ) [𝑋𝑖 𝑋𝑗 ]
MLE Learning Algorithm for Exponential models
maxπœƒ 𝑙 πœƒ, 𝐷 is a convex optimization problem.
Can be solve by many methods, such as gradient descent,
conjugate gradient.
Initialize model parameters πœƒ
Loop until convergence
Compute
πœ•π‘™ πœƒ,𝐷
πœ•πœƒπ‘–π‘—
= 𝐸P
Update πœƒπ‘–π‘— ← πœƒπ‘–π‘— − πœ‚
𝑋𝑖 ,𝑋𝑗
𝑋𝑖 𝑋𝑗 − 𝐸𝑃
π‘‹πœƒ
𝑋𝑖 𝑋𝑗
πœ•π‘™ πœƒ,𝐷
πœ•πœƒπ‘–π‘—