
Sequence Learning
Le Song
Machine Learning II: Advanced Topics
CSE 8803ML, Spring 2012
Topic Modeling in Document Analysis
Document Collections
Topic Discovery: Discover Sets of Frequently Co-occurring Words
Topic Attribution: Attribute Words to a Particular Topic
[Blei et al. 03]
Basic Model: Latent Dirichlet Allocation
A generative model for text documents
For each document, choose a topic distribution Θ
For each word w in the document
Choose a specific topic z for the word
Choose the word w from the topic (a distribution over vocabulary)
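A minimal sketch of this generative process in Python (assuming numpy; the prior `alpha`, topic-word matrix `beta`, and all numbers below are hypothetical toy values, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(0)

def generate_document(alpha, beta, n_words):
    """Sample one document from the LDA generative process.

    alpha: Dirichlet prior over the k topics, shape (k,)
    beta:  topic-word distributions, shape (k, vocab_size), rows sum to 1
    """
    theta = rng.dirichlet(alpha)                  # per-document topic distribution
    words, topics = [], []
    for _ in range(n_words):
        z = rng.choice(len(alpha), p=theta)       # topic assignment for this word
        w = rng.choice(beta.shape[1], p=beta[z])  # word drawn from that topic
        topics.append(z)
        words.append(w)
    return words, topics

# hypothetical toy setup: 2 topics, vocabulary of 5 word ids
alpha = np.array([1.0, 1.0])
beta = np.array([[0.40, 0.40, 0.10, 0.05, 0.05],
                 [0.05, 0.05, 0.10, 0.40, 0.40]])
print(generate_document(alpha, beta, n_words=10))
```

Each document draws its own theta, so different documents mix the same shared topics in different proportions.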
[Figure: plate diagrams for LDA and pLSI]
A simplified version without the prior distribution $P(\theta \mid \alpha)$ corresponds to probabilistic latent semantic indexing (pLSI)
[Blei et al. 03]
Learning Problem in Topic Modeling
Automatically discover the topics from document collections
Given: a collection of M documents
Basic learning problems: discover the following latent
structures
k topics
Per-document topic distribution
Per-word topic assignment
Applications of topic modeling
Visualization and interpretation
Use topic distribution as document features for classification or
clustering
Predict the occurrence of words in new documents (collaborative
filtering)
Gibbs Sampling for Learning Topic Models
Variational Inference for Learning Topic Models
Approximate the posterior distribution P of the latent variables by a simpler distribution $Q(\Phi^*)$ from a parametric family
$\Phi^* = \arg\min_\Phi \mathrm{KL}(Q(\Phi)\,\|\,P)$, where KL is the Kullback-Leibler divergence
The approximated posterior is then used in the E-step of the
EM algorithm
Only finds a local optimum
[Figure: variational approximation of a mixture of 2 Gaussians by a single Gaussian; each panel shows a different local optimum]
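As a toy illustration of this local-optimum behavior (a sketch, not part of the original slides): fit a single Gaussian $Q(\mu, \sigma)$ to a two-component Gaussian mixture $P$ by numerically minimizing $\mathrm{KL}(Q \| P)$; different initializations converge to different local optima. All constants are assumed toy values.

```python
import numpy as np

# Target P: mixture of two Gaussians. Approximation Q: a single Gaussian N(mu, sigma^2).
xs = np.linspace(-10, 10, 4001)
dx = xs[1] - xs[0]

def normal_pdf(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

def p(x):  # assumed toy target: equal mixture of N(-3, 1) and N(+3, 1)
    return 0.5 * normal_pdf(x, -3.0, 1.0) + 0.5 * normal_pdf(x, 3.0, 1.0)

def kl_q_p(mu, sigma):
    # KL(Q || P) approximated by quadrature on the grid
    q = normal_pdf(xs, mu, sigma)
    return np.sum(q * (np.log(q + 1e-300) - np.log(p(xs) + 1e-300))) * dx

def fit(mu, sigma, lr=0.05, iters=2000, eps=1e-4):
    # crude gradient descent on (mu, sigma) with finite-difference gradients
    for _ in range(iters):
        g_mu = (kl_q_p(mu + eps, sigma) - kl_q_p(mu - eps, sigma)) / (2 * eps)
        g_sg = (kl_q_p(mu, sigma + eps) - kl_q_p(mu, sigma - eps)) / (2 * eps)
        mu, sigma = mu - lr * g_mu, max(sigma - lr * g_sg, 1e-2)
    return mu, sigma, kl_q_p(mu, sigma)

print(fit(-2.0, 1.0))  # converges near the left mixture component
print(fit(+2.0, 1.0))  # converges near the right mixture component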
Sequence Learning
Hidden Markov Models
Speech recognition
Genome sequence modeling
…
Kalman Filter (Gaussian noise models and linear models)
Tracking
Weather forecasting
…
Conditional random fields (discriminative model)
Natural language processing
Image processing
…
More complex sequence models
Hidden Markov Models
Hidden states $Y_t$
Transition model $P(Y_{t+1} \mid Y_t)$
Observation model $P(X_t \mid Y_t)$
Observations $X_t$
Three Problems in Hidden Markov Models
Given an observation sequence and a model, how to compute
the probability of the observed sequence?
Given an observation sequence and a model, how do we
choose the hidden states that best explain the observations?
How do we learn the model given the observation sequence?
π‘Œ1
π‘Œ2
π‘Œπ‘‘
π‘Œπ‘‘+1
𝑋1
𝑋2
𝑋𝑑
𝑋𝑑+1
Hidden Markov Model
Directed graphical model
The joint distribution factorizes according to parent child
relation
$P(X_1, \dots, X_n, Y_1, \dots, Y_n) = P(Y_1) \prod_{t=1}^{n} P(X_t \mid Y_t) \prod_{t=2}^{n} P(Y_t \mid Y_{t-1})$
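A minimal sketch of this factorization for a discrete HMM (Python with numpy; the initial distribution `pi`, transition matrix `A`, and emission matrix `B` below are hypothetical toy values):

```python
import numpy as np

def hmm_joint_prob(pi, A, B, states, obs):
    """P(X_1..X_n, Y_1..Y_n) for a discrete HMM, directly from the factorization.

    pi[i]   = P(Y_1 = i)                  initial distribution
    A[i, j] = P(Y_{t+1} = j | Y_t = i)    transition model
    B[i, v] = P(X_t = v | Y_t = i)        observation model
    """
    prob = pi[states[0]] * B[states[0], obs[0]]
    for t in range(1, len(obs)):
        prob *= A[states[t - 1], states[t]] * B[states[t], obs[t]]
    return prob

# hypothetical 2-state, 2-symbol HMM
pi = np.array([0.6, 0.4])
A = np.array([[0.7, 0.3], [0.2, 0.8]])
B = np.array([[0.9, 0.1], [0.3, 0.7]])
print(hmm_joint_prob(pi, A, B, states=[0, 0, 1], obs=[0, 1, 1]))
```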
π‘Œ1
π‘Œ2
π‘Œπ‘‘
π‘Œπ‘‘+1
𝑋1
𝑋2
𝑋𝑑
𝑋𝑑+1
11
HMM Problem I: sequence probability
Given an observation sequence and a model, how to compute
the probability of the observed sequence?
Inference problem in graphical models: compute the marginal
distribution of observation by summing out the hidden
variables
$P(X_1, \dots, X_n) = \sum_{Y_1, \dots, Y_n} P(X_1, \dots, X_n, Y_1, \dots, Y_n) = \sum_{Y_1, \dots, Y_n} P(Y_1) \prod_{t=1}^{n} P(X_t \mid Y_t) \prod_{t=2}^{n} P(Y_t \mid Y_{t-1})$
HMM Problem I: sequence probability
[Figure: chain $A - B - C - D - E$ with messages $m_{AB}(b)$, $m_{BC}(c)$, $m_{CD}(d)$, $m_{DE}(e)$]
Nice localization in computation:
$Z = \sum_{e} \sum_{d} \sum_{c} \sum_{b} \sum_{a} f(a)\, f(b \mid a)\, f(c \mid b)\, f(d \mid c)\, f(e \mid d)$
$Z = \sum_{e} \sum_{d} f(e \mid d) \sum_{c} f(d \mid c) \sum_{b} f(c \mid b) \underbrace{\sum_{a} f(b \mid a)\, f(a)}_{m_{AB}(b)}$
Summing out one variable at a time produces the messages $m_{AB}(b), m_{BC}(c), m_{CD}(d), m_{DE}(e)$ and finally $Z$.
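A sketch of the same message-passing idea specialized to HMM Problem I: the forward recursion computes the observation-sequence probability in $O(nK^2)$ time instead of summing over all $K^n$ hidden sequences. The toy parameters reuse the hypothetical HMM from the earlier sketch.

```python
import numpy as np

def sequence_probability(pi, A, B, obs):
    """P(X_1..X_n) by the forward recursion (summing out the hidden states).

    alpha[i] = P(x_1..x_t, Y_t = i); each update passes one message along the chain.
    """
    alpha = pi * B[:, obs[0]]
    for x in obs[1:]:
        alpha = (alpha @ A) * B[:, x]   # message to the next time step
    return alpha.sum()

# reusing the hypothetical 2-state, 2-symbol HMM from above
pi = np.array([0.6, 0.4])
A = np.array([[0.7, 0.3], [0.2, 0.8]])
B = np.array([[0.9, 0.1], [0.3, 0.7]])
print(sequence_probability(pi, A, B, obs=[0, 1, 1]))
```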
HMM Problem II: best hidden states
Given an observation sequence and a model, how do we compute the hidden state sequence with the highest probability (decoding)?
Inference problem in graphical models: compute the maximum
a posterior assignment (MAP) for hidden states given the
observations
π‘Žπ‘Ÿπ‘”π‘šπ‘Žπ‘₯π‘Œ1,…,π‘Œπ‘› 𝑃 𝑋1 , … , 𝑋𝑛 , π‘Œ1 , … , π‘Œπ‘›
= π‘Žπ‘Ÿπ‘”π‘šπ‘Žπ‘₯π‘Œ1,…,π‘Œπ‘› 𝑃(π‘Œ1 )
𝑃 𝑋𝑑 π‘Œπ‘‘ )
𝑑=1…𝑛
𝑃 π‘Œπ‘‘ π‘Œπ‘‘−1 )
𝑑=2…𝑛
14
HMM Problem II: best hidden states
The general problem solved by the junction tree algorithm is the sum-product problem:
$P(X_i) \propto \sum_{V \setminus X_i} f(X_1, \dots, X_n) = \sum_{V \setminus X_i} \prod_{j} \Psi(D_j)$
The property used by the junction tree algorithm is the distributivity of $\times$ over $+$.
For the decoding problem, it is the distributivity of max over product; or, if you take the log of the probability, the distributivity of max over sum.
You just need to keep track of the hidden sequence that achieves the maximum score
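A sketch of this max-product recursion (the Viterbi algorithm) for a discrete HMM, assuming the same `(pi, A, B)` parameterization as the earlier sketches; the backpointer array is the bookkeeping that recovers the hidden sequence achieving the maximum score.

```python
import numpy as np

def viterbi(pi, A, B, obs):
    """MAP hidden-state sequence via max-product (max-sum in log space)."""
    n, K = len(obs), len(pi)
    logA, logB = np.log(A), np.log(B)
    delta = np.log(pi) + logB[:, obs[0]]   # best log-score ending in each state
    back = np.zeros((n, K), dtype=int)     # backpointers (argmax bookkeeping)
    for t in range(1, n):
        scores = delta[:, None] + logA     # scores[i, j]: come from i, move to j
        back[t] = scores.argmax(axis=0)
        delta = scores.max(axis=0) + logB[:, obs[t]]
    # trace back the sequence that achieved the maximum score
    path = [int(delta.argmax())]
    for t in range(n - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return path[::-1]

# e.g. viterbi(pi, A, B, obs=[0, 1, 1]) with the toy (pi, A, B) defined above
```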
HMM Problem III: find model parameters
Given just an observation sequence, find the model
parameters that maximize the likelihood of the data
Suppose the set of parameters is $\Theta$:
$\arg\max_\Theta \log P(X_1, \dots, X_n) = \arg\max_\Theta \log \sum_{Y_1, \dots, Y_n} P(Y_1) \prod_{t=1}^{n} P(X_t \mid Y_t) \prod_{t=2}^{n} P(Y_t \mid Y_{t-1})$
EM algorithm
EM: Expectation-maximization for finding πœƒ
𝑙 πœƒ; 𝐷 = log
𝑧𝑝
π‘₯, 𝑧 πœƒ = log
𝑧 𝑝(π‘₯𝑖 | 𝑧, πœƒ2 )𝑃
𝑧|πœƒ1
Iterate between E-step and M-step until convergence
Expectation step (E-step)
𝑓 πœƒ = πΈπ‘ž
𝑧
log 𝑝 π‘₯, 𝑧 πœƒ , π‘€β„Žπ‘’π‘Ÿπ‘’ π‘ž 𝑧 = 𝑃(𝑧|π‘₯, πœƒ 𝑑 )
Maximization step (M-step)
πœƒ 𝑑+1 = π‘Žπ‘Ÿπ‘”π‘šπ‘Žπ‘₯πœƒ 𝑓 πœƒ
For HMMs, this EM procedure is also called the Baum-Welch algorithm
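A compact, unscaled sketch of one Baum-Welch (EM) iteration for a discrete HMM, assuming the same `(pi, A, B)` parameterization as above. The E-step uses forward and backward messages to compute the posteriors $q(z)$, and the M-step re-estimates the parameters from expected counts; a numerically stable implementation would work in log space or rescale the messages.

```python
import numpy as np

def forward(pi, A, B, obs):
    # E-step helper: alpha[t, i] = P(x_1..x_t, Y_t = i)
    alpha = np.zeros((len(obs), len(pi)))
    alpha[0] = pi * B[:, obs[0]]
    for t in range(1, len(obs)):
        alpha[t] = (alpha[t - 1] @ A) * B[:, obs[t]]
    return alpha

def backward(A, B, obs):
    # E-step helper: beta[t, i] = P(x_{t+1}..x_n | Y_t = i)
    beta = np.ones((len(obs), A.shape[0]))
    for t in range(len(obs) - 2, -1, -1):
        beta[t] = A @ (B[:, obs[t + 1]] * beta[t + 1])
    return beta

def baum_welch_step(pi, A, B, obs):
    obs = np.asarray(obs)
    alpha, beta = forward(pi, A, B, obs), backward(A, B, obs)
    Z = alpha[-1].sum()                 # sequence likelihood P(x_1..x_n)
    gamma = alpha * beta / Z            # P(Y_t = i | x)
    # xi[t, i, j] = P(Y_t = i, Y_{t+1} = j | x)
    xi = alpha[:-1, :, None] * A[None] * (B[:, obs[1:]].T * beta[1:])[:, None, :] / Z
    # M-step: re-estimate the parameters from expected counts
    new_pi = gamma[0]
    new_A = xi.sum(axis=0) / gamma[:-1].sum(axis=0)[:, None]
    new_B = np.zeros_like(B)
    for v in range(B.shape[1]):
        new_B[:, v] = gamma[obs == v].sum(axis=0)
    new_B /= gamma.sum(axis=0)[:, None]
    return new_pi, new_A, new_B, Z
```

Iterating `baum_welch_step` until the likelihood `Z` stops increasing gives a local maximum of the likelihood, as with EM in general.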
More complex sequence models
Kalman Filter
Also called state space models, or linear dynamical system
𝑋𝑑 = π΄π‘Œπ‘‘ + π‘Šπ‘‘
π‘Œπ‘‘ = πΆπ‘Œπ‘‘−1 + 𝑉𝑑
π‘Šπ‘‘ ∼ 𝑁 0; 𝑄 , 𝑉𝑑 ∼ 𝑁 0; 𝑅
π‘₯0 ∼ 𝑁(0; Σ)
𝐼𝑛 π‘”π‘’π‘›π‘’π‘Ÿπ‘Žπ‘™ π‘π‘Žπ‘› 𝑏𝑒 π‘›π‘œπ‘›π‘™π‘–π‘›π‘’π‘Žπ‘Ÿ
π‘ π‘¦π‘ π‘‘π‘’π‘š 𝐹 π‘Œπ‘‘ and 𝐺 π‘Œπ‘‘−1
HMM with special transition and observation models
𝑃 π‘Œπ‘‘ π‘Œπ‘‘−1 ∼ 𝑁(π‘Œπ‘‘ − πΆπ‘Œπ‘‘−1 ; 𝑅)
𝑃 𝑋𝑑 π‘Œπ‘‘ ∼ 𝑁 𝑋𝑑 − π΄π‘Œπ‘‘ ; 𝑄
𝑃 𝑋0 ∼ 𝑁(0; Σ)
π‘Œ1
π‘Œ2
π‘Œπ‘‘
π‘Œπ‘‘+1
𝑋1
𝑋2
𝑋𝑑
𝑋𝑑+1
19
Inference problem in Kalman filter
Filtering: given an observation sequence $x_1 \dots x_t$, estimate $y_t$
Inference problem in graphical models: estimate the marginal probability $P(y_t \mid x_1 \dots x_t)$ by summing out all values of $y_1, \dots, y_{t-1}$
Forward message passing (a sketch of the predict/update recursion follows)
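A minimal sketch of this forward message passing for the linear-Gaussian model above (Python with numpy, using the slide's notation: transition noise covariance $R$, observation noise covariance $Q$, prior covariance $\Sigma$); this is the standard Kalman predict/update recursion.

```python
import numpy as np

def kalman_filter(A, C, Q, R, Sigma0, observations):
    """Filtering P(y_t | x_1..x_t) for the linear-Gaussian state space model.

    Transition:  y_t = C y_{t-1} + v_t,  v_t ~ N(0, R)
    Observation: x_t = A y_t     + w_t,  w_t ~ N(0, Q)
    Returns the filtered means and covariances of y_t (each observation is a 1-D array).
    """
    d = C.shape[0]
    mu, P = np.zeros(d), Sigma0          # prior belief over the initial hidden state
    means, covs = [], []
    for x in observations:
        # predict: push the previous belief through the transition model
        mu_pred = C @ mu
        P_pred = C @ P @ C.T + R
        # update: correct with the new observation via the Kalman gain
        S = A @ P_pred @ A.T + Q
        K = P_pred @ A.T @ np.linalg.inv(S)
        mu = mu_pred + K @ (x - A @ mu_pred)
        P = (np.eye(d) - K @ A) @ P_pred
        means.append(mu)
        covs.append(P)
    return means, covs
```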
π‘Œ1
π‘Œ2
π‘Œπ‘‘
π‘Œπ‘‘+1
𝑋1
𝑋2
𝑋𝑑
𝑋𝑑+1
20
Inference problem in Kalman filter
Smoothing: given an observation sequence $x_1 \dots x_n$, estimate $y_1, \dots, y_n$
Inference problem in graphical models: estimate the marginal probability $P(y_t \mid x_1 \dots x_n)$ for all $t$, by summing out all hidden variables other than $y_t$
Forward and backward message passing
π‘Œ1
π‘Œ2
π‘Œπ‘‘
π‘Œπ‘‘+1
𝑋1
𝑋2
𝑋𝑑
𝑋𝑑+1
21
Learning Kalman Filters
HMM with special transition and observation models
𝑃 π‘Œπ‘‘ π‘Œπ‘‘−1 ∼ 𝑁(π‘Œπ‘‘ − πΆπ‘Œπ‘‘−1 ; 𝑅)
𝑃 𝑋𝑑 π‘Œπ‘‘ ∼ 𝑁 𝑋𝑑 − π΄π‘Œπ‘‘ ; 𝑄
𝑃 𝑋0 ∼ 𝑁(0; Σ)
Usually the dynamics $A$ and $C$ are chosen by hand, and only the noise models are estimated, e.g.:
New position = old position + Δ × velocity + noise
Observation = true position + noise
If no true positions are provided, use the EM algorithm for learning
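As a sketch of this choice for a 1-D tracking problem (all numbers are hypothetical): the hidden state is [position, velocity], the transition matrix $C$ encodes "new position = old position + Δ × velocity", and only a noisy position is observed. These matrices plug directly into the `kalman_filter` sketch above.

```python
import numpy as np

dt = 1.0                              # assumed sampling interval
# hidden state Y_t = [position, velocity]; Y_t = C Y_{t-1} + V_t
C = np.array([[1.0, dt],
              [0.0, 1.0]])            # position += dt * velocity
R = np.diag([1e-4, 1e-2])             # transition noise covariance (assumed)
# observation X_t = A Y_t + W_t: a noisy measurement of the position only
A = np.array([[1.0, 0.0]])
Q = np.array([[0.5]])                 # observation noise covariance (assumed)
Sigma0 = np.eye(2)                    # prior covariance over the initial state
# usage: means, covs = kalman_filter(A, C, Q, R, Sigma0, observations)
#        with each observation a length-1 array containing the measured position
```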
Tracking and Smoothing with Kalman Filters
Parameter Learning for Markov Random Fields
$P(X_1, \dots, X_k \mid \theta) = \frac{1}{Z(\theta)} \exp\Big(\sum_{ij} \theta_{ij} X_i X_j + \sum_i \theta_i X_i\Big) = \frac{1}{Z(\theta)} \prod_{ij} \exp(\theta_{ij} X_i X_j) \prod_i \exp(\theta_i X_i)$
$Z(\theta) = \sum_x \prod_{ij} \exp(\theta_{ij} x_i x_j) \prod_i \exp(\theta_i x_i)$
(The products $X_i X_j$ and $X_i$ can be replaced by other feature functions $f(x_i, x_j)$ and $f(x_i)$)
[Figure: pairwise Markov random field over $X_1, \dots, X_6$]
Log likelihood of the data $D = \{x^l\}_{l=1}^{N}$:
$\ell(\theta, D) = \frac{1}{N} \log \prod_{l=1}^{N} \frac{1}{Z(\theta)} \prod_{ij} \exp(\theta_{ij} x_i^l x_j^l) \prod_i \exp(\theta_i x_i^l)$
$= \frac{1}{N} \sum_{l=1}^{N} \Big( \sum_{ij} \theta_{ij} x_i^l x_j^l + \sum_i \theta_i x_i^l - \log Z(\theta) \Big)$
The term $\log Z(\theta)$ does not decompose!
Derivatives of log likelihood
𝑙 πœƒ, 𝐷 =
1
𝑁
πœ•π‘™ πœƒ,𝐷
πœ•πœƒπ‘–π‘—
1
𝑁
πœ•π‘™ πœƒ,𝐷
πœ•πœƒπ‘–π‘—
=
1
𝑁
=
=
1
𝑁
𝑁 𝑙 𝑙
𝑙 π‘₯𝑖 π‘₯𝑗
𝑁
𝑙
𝑁
𝑙
𝑁
𝑙
−
(
𝑙π‘₯ 𝑙 +
πœƒ
π‘₯
𝑖𝑗
𝑖
𝑖𝑗
𝑗
𝑙π‘₯ 𝑙
π‘₯
𝑖𝑗 𝑖 𝑗
𝑙π‘₯ 𝑙
π‘₯
𝑖𝑗 𝑖 𝑗
1
Z(πœƒ)
π‘₯
−
−
𝑙
πœƒ
π‘₯
𝑖
𝑖
𝑖 − log 𝑍 πœƒ )
πœ• log 𝑍 πœƒ
πœ•πœƒπ‘–π‘—
𝐴 π‘π‘œπ‘›π‘£π‘’π‘₯ π‘π‘Ÿπ‘œπ‘π‘™π‘’π‘š
Can find global optimum
1 πœ•π‘ πœƒ
Z(πœƒ) πœ•πœƒπ‘–π‘—
𝑖𝑗 exp(πœƒπ‘–π‘— 𝑋𝑖 𝑋𝑗 )
𝑖 exp(πœƒπ‘– 𝑋𝑖 )
𝑋𝑖 𝑋𝑗
𝑛𝑒𝑒𝑑 π‘‘π‘œ π‘‘π‘œ π‘–π‘›π‘“π‘’π‘Ÿπ‘’π‘›π‘π‘’
25
Moment matching condition
πœ•π‘™ πœƒ,𝐷
=
πœ•πœƒπ‘–π‘—
1 𝑁 𝑙 𝑙
π‘₯ π‘₯
𝑁 𝑙 𝑖 𝑗
−
1
Z πœƒ
π‘₯
𝑖𝑗 exp
Moment
1
𝑁
𝑖 exp
πœƒπ‘– 𝑋𝑖
𝑋𝑖 𝑋𝑗
π‘π‘œπ‘£π‘Žπ‘Ÿπ‘–π‘Žπ‘›π‘π‘’ π‘šπ‘Žπ‘‘π‘Ÿπ‘–π‘₯
from model
π‘’π‘šπ‘π‘–π‘Ÿπ‘–π‘π‘Žπ‘™
π‘π‘œπ‘£π‘Žπ‘Ÿπ‘–π‘Žπ‘›π‘π‘’ π‘šπ‘Žπ‘‘π‘Ÿπ‘–π‘₯
P 𝑋𝑖 , 𝑋𝑗 =
πœƒπ‘–π‘— 𝑋𝑖 𝑋𝑗
𝑙
𝑙
𝑁
𝛿(𝑋
,
π‘₯
)
𝛿(𝑋
,
π‘₯
𝑖
𝑗
𝑙=1
𝑖
𝑗)
πœ•π‘™ πœƒ,𝐷
matching:
πœ•πœƒπ‘–π‘—
= 𝐸P
𝑋𝑖 ,𝑋𝑗
𝑋𝑖 𝑋𝑗 − 𝐸𝑃(𝑋|πœƒ) [𝑋𝑖 𝑋𝑗 ]
26
Optimize MLE for undirected models
$\max_\theta \ell(\theta, D)$ is a convex optimization problem. It can be solved by many methods, such as gradient descent or conjugate gradient:
Initialize the model parameters $\theta$
Loop until convergence:
Compute $\frac{\partial \ell(\theta, D)}{\partial \theta_{ij}} = E_{\tilde{P}(X_i, X_j)}[X_i X_j] - E_{P(X \mid \theta)}[X_i X_j]$
Update $\theta_{ij} \leftarrow \theta_{ij} + \eta\, \frac{\partial \ell(\theta, D)}{\partial \theta_{ij}}$ (gradient ascent on $\ell$)
Or use the gradient equation as a fixed-point iteration: iterative proportional fitting (IPF)
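A sketch of this gradient-ascent loop for a tiny pairwise binary MRF (Python with numpy). The model moments are computed by exact enumeration of all configurations, which stands in for the inference step and is only feasible for a handful of nodes; the function name, learning rate, and iteration count are assumptions.

```python
import numpy as np
from itertools import product

def mrf_gradient_ascent(data, edges, n_nodes, lr=0.1, iters=200):
    """MLE for a pairwise binary MRF by gradient ascent (moment matching).

    data: (N, n_nodes) array with entries in {-1, +1}
    edges: list of (i, j) pairs defining the pairwise potentials
    """
    theta_pair = {e: 0.0 for e in edges}    # theta_ij
    theta_node = np.zeros(n_nodes)          # theta_i
    states = np.array(list(product([-1, 1], repeat=n_nodes)))  # all configurations

    # empirical moments E_{P~}[X_i X_j] and E_{P~}[X_i]
    emp_pair = {(i, j): np.mean(data[:, i] * data[:, j]) for (i, j) in edges}
    emp_node = data.mean(axis=0)

    for _ in range(iters):
        # unnormalized log-probability of every configuration
        score = states @ theta_node
        for (i, j), w in theta_pair.items():
            score = score + w * states[:, i] * states[:, j]
        p = np.exp(score - score.max())
        p /= p.sum()                         # exact P(x | theta), i.e. divide by Z(theta)

        # model moments E_{P(X|theta)}[X_i X_j] and E_{P(X|theta)}[X_i]
        mod_node = p @ states
        for (i, j) in edges:
            mod_pair = np.sum(p * states[:, i] * states[:, j])
            theta_pair[(i, j)] += lr * (emp_pair[(i, j)] - mod_pair)
        theta_node += lr * (emp_node - mod_node)
    return theta_node, theta_pair
```

Because the objective is concave in $\theta$, this simple update converges to the global optimum when the moments can be computed exactly.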
Conditional Random Fields
Focus on the conditional distribution $P(X_1, \dots, X_n \mid Y_1, \dots, Y_n, Y)$
Do not explicitly model the dependence among $Y_1, \dots, Y_n, Y$
Only model the relations between the $X$'s and between $X$ and $Y$
$P(X_1, X_2, X_3, X_4 \mid Y_1, Y_2, Y_3, Y_4, Y) = \frac{1}{Z(Y_1, Y_2, Y_3, Y_4, Y)} \Psi(Y_1, Y_2, X_1, X_2, Y)\, \Psi(Y_2, Y_3, X_2, X_3, Y) \cdots$
$Z(Y_1, Y_2, Y_3, Y_4, Y) = \sum_{x_1, \dots, x_4} \Psi(Y_1, Y_2, x_1, x_2, Y)\, \Psi(Y_2, Y_3, x_2, x_3, Y) \cdots$
[Figure: chain CRF over $X_1, X_2, X_3$ with inputs $Y_1, Y_2, Y_3$]
Parameter Learning for Conditional Random Fields
$P(X_1, \dots, X_k \mid Y, \theta) = \frac{1}{Z(Y, \theta)} \exp\Big(\sum_{ij} \theta_{ij} X_i X_j Y + \sum_i \theta_i X_i Y\Big) = \frac{1}{Z(Y, \theta)} \prod_{ij} \exp(\theta_{ij} X_i X_j Y) \prod_i \exp(\theta_i X_i Y)$
$Z(Y, \theta) = \sum_x \prod_{ij} \exp(\theta_{ij} x_i x_j Y) \prod_i \exp(\theta_i x_i Y)$
(The products $X_i X_j Y$ and $X_i Y$ can be replaced by other feature functions)
[Figure: CRF over $X_1, X_2, X_3$ conditioned on $Y$]
Maximize the log conditional likelihood:
$c\ell(\theta, D) = \frac{1}{N} \log \prod_{l=1}^{N} \frac{1}{Z(y^l, \theta)} \prod_{ij} \exp(\theta_{ij} x_i^l x_j^l y^l) \prod_i \exp(\theta_i x_i^l y^l)$
$= \frac{1}{N} \sum_{l=1}^{N} \Big( \sum_{ij} \theta_{ij} x_i^l x_j^l y^l + \sum_i \theta_i x_i^l y^l - \log Z(y^l, \theta) \Big)$
The term $\log Z(y^l, \theta)$ does not decompose!
Derivatives of log likelihood
𝑐𝑙 πœƒ, 𝐷 =
πœ•π‘π‘™ πœƒ,𝐷
πœ•πœƒπ‘–π‘—
=
1
𝑁
=
1
𝑁
𝑁
𝑙
=
1
𝑁
𝑁
𝑙
1
𝑁
𝑁
𝑙
𝑙π‘₯ 𝑙
π‘₯
𝑖𝑗 𝑖 𝑗
𝑁 𝑙 𝑙 𝑙
𝑙 π‘₯𝑖 π‘₯𝑗 𝑦
−
𝑙π‘₯ 𝑙 𝑦 𝑙 +
πœƒ
π‘₯
𝑖𝑗
𝑖
𝑖𝑗
𝑗
(
𝑙π‘₯ 𝑙
π‘₯
𝑖
𝑖𝑗
𝑗
𝑦𝑙
−
1
Z(𝑦 𝑙 ,πœƒ)
𝑦𝑙
−
𝑙 𝑙
𝑙, πœƒ )
πœƒ
π‘₯
𝑦
−
log
𝑍
𝑦
𝑖
𝑖
𝑖
πœ• log 𝑍 𝑦 𝑙 ,πœƒ
πœ•πœƒπ‘–π‘—
𝐴 π‘π‘œπ‘›π‘£π‘’π‘₯ π‘π‘Ÿπ‘œπ‘π‘™π‘’π‘š
Can find global optimum
πœ• 𝑍 𝑦 𝑙 ,πœƒ
1
Z(𝑦 𝑙 ,πœƒ)
πœ•πœƒπ‘–π‘—
π‘₯
𝑖𝑗 exp(πœƒπ‘–π‘— 𝑋𝑖 𝑋𝑗
𝑦𝑙 )
𝑖 exp(πœƒπ‘– 𝑋𝑖
𝑦𝑙 )
𝑋𝑖 𝑋𝑗 𝑦 𝑙
𝑛𝑒𝑒𝑑 π‘‘π‘œ π‘‘π‘œ π‘–π‘›π‘“π‘’π‘Ÿπ‘’π‘›π‘π‘’ π‘“π‘œπ‘Ÿ π‘’π‘Žπ‘β„Ž 𝑦 𝑙 β€Ό!
30
Moment matching condition
πœ•π‘π‘™ πœƒ,𝐷
=
πœ•πœƒπ‘–π‘—
1 𝑁 𝑙 𝑙 𝑙
π‘₯ π‘₯ 𝑦
𝑁 𝑙 𝑖 𝑗
−
1
Z 𝑦 𝑙 ,πœƒ
P π‘Œ =
1
𝑁
𝑙
𝑙
exp
πœƒ
𝑋
𝑦
𝑋
𝑋
𝑦
𝑖
𝑖 𝑖
𝑖 𝑗
π‘π‘œπ‘£π‘Žπ‘Ÿπ‘–π‘Žπ‘›π‘π‘’ π‘šπ‘Žπ‘‘π‘Ÿπ‘–π‘₯
from model
π‘’π‘šπ‘π‘–π‘Ÿπ‘–π‘π‘Žπ‘™
π‘π‘œπ‘£π‘Žπ‘Ÿπ‘–π‘Žπ‘›π‘π‘’ π‘šπ‘Žπ‘‘π‘Ÿπ‘–π‘₯
P 𝑋𝑖 , 𝑋𝑗 , π‘Œ =
π‘₯
𝑙
exp
πœƒ
𝑋
𝑋
𝑦
𝑖𝑗 𝑖 𝑗
𝑖𝑗
1
𝑁
𝑙
𝑙
𝑁
𝑙=1 𝛿(𝑋𝑖 , π‘₯𝑖 ) 𝛿(𝑋𝑗 , π‘₯𝑗 )
𝛿(π‘Œ, 𝑦 𝑙 )
𝑁
𝑙
𝑙=1 𝛿(π‘Œ, 𝑦 )
Moment matching:
πœ•π‘π‘™ πœƒ,𝐷
πœ•πœƒπ‘–π‘—
= 𝐸P
𝑋𝑖 ,𝑋𝑗 ,π‘Œ
𝑋𝑖 𝑋𝑗 π‘Œ − 𝐸𝑃(𝑋|π‘Œ,πœƒ)P
π‘Œ
[𝑋𝑖 𝑋𝑗 π‘Œ]
31
Optimize MLE for undirected models
$\max_\theta c\ell(\theta, D)$ is a convex optimization problem. It can be solved by many methods, such as gradient descent or conjugate gradient:
Initialize the model parameters $\theta$
Loop until convergence:
Compute $\frac{\partial c\ell(\theta, D)}{\partial \theta_{ij}} = E_{\tilde{P}(X_i, X_j, Y)}[X_i X_j Y] - E_{P(X \mid Y, \theta)\tilde{P}(Y)}[X_i X_j Y]$
Update $\theta_{ij} \leftarrow \theta_{ij} + \eta\, \frac{\partial c\ell(\theta, D)}{\partial \theta_{ij}}$ (gradient ascent on $c\ell$)