Sequence Learning
Le Song
Machine Learning II: Advanced Topics
CSE 8803ML, Spring 2012

Topic Modeling in Document Analysis
- Document collections
- Topic discovery: discover sets of frequently co-occurring words (topics)
- Topic attribution: attribute each word to a particular topic
[Blei et al. 03]

Basic Model: Latent Dirichlet Allocation
- A generative model for text documents
- For each document, choose a topic distribution Θ
- For each word w in the document:
  - Choose a specific topic z for the word
  - Choose the word w from that topic (a distribution over the vocabulary)
- The simplified version without the prior distribution P(Θ | α) corresponds to probabilistic latent semantic indexing (pLSI)
[Blei et al. 03]

Learning Problem in Topic Modeling
- Automatically discover the topics from document collections
- Given: a collection of M documents
- Basic learning problems: discover the following latent structures
  - k topics
  - Per-document topic distribution
  - Per-word topic assignment
- Applications of topic modeling
  - Visualization and interpretation
  - Use the topic distribution as document features for classification or clustering
  - Predict the occurrence of words in new documents (collaborative filtering)

Gibbs Sampling for Learning Topic Models

Variational Inference for Learning Topic Models
- Approximate the posterior distribution P of the latent variables by a simpler distribution Q(Φ*) from a parametric family:
  Φ* = argmin_Φ KL(Q(Φ) || P), where KL is the Kullback-Leibler divergence
- The approximate posterior is then used in the E-step of the EM algorithm
- Only finds a local minimum
[Figure: variational approximation of a mixture of 2 Gaussians by a single Gaussian; each panel shows a different local optimum]

Sequence Learning
- Hidden Markov models
  - Speech recognition
  - Genome sequence modeling
  - ...
- Kalman filter (linear models with Gaussian noise)
  - Tracking
  - Weather forecasting
  - ...
- Conditional random fields (discriminative model)
  - Natural language processing
  - Image processing
  - ...

More complex sequence models

Hidden Markov Models
- Hidden states Y_t, observations X_t
- Transition model P(Y_{t+1} | Y_t)
- Observation model P(X_t | Y_t)

Three Problems in Hidden Markov Models
- Given an observation sequence and a model, how do we compute the probability of the observed sequence?
- Given an observation sequence and a model, how do we choose the hidden states that best explain the observations?
- How do we learn the model given the observation sequence?
[Figure: HMM chain Y_1 -> Y_2 -> ... -> Y_t -> Y_{t+1} with emissions X_1, X_2, ..., X_t, X_{t+1}]

Hidden Markov Model
- Directed graphical model
- The joint distribution factorizes according to the parent-child relations:
  $$P(X_1, \dots, X_T, Y_1, \dots, Y_T) = P(Y_1) \prod_{t=1}^{T} P(X_t \mid Y_t) \prod_{t=2}^{T} P(Y_t \mid Y_{t-1})$$

HMM Problem I: sequence probability
- Given an observation sequence and a model, how do we compute the probability of the observed sequence?
- Inference problem in graphical models: compute the marginal distribution of the observations by summing out the hidden variables:
  $$P(X_1, \dots, X_T) = \sum_{Y_1, \dots, Y_T} P(X_1, \dots, X_T, Y_1, \dots, Y_T) = \sum_{Y_1, \dots, Y_T} P(Y_1) \prod_{t=1}^{T} P(X_t \mid Y_t) \prod_{t=2}^{T} P(Y_t \mid Y_{t-1})$$

HMM Problem I: sequence probability
- Example: a chain A -> B -> C -> D -> E
- Nice localization in computation: push each summation inward and store the partial results as messages m_{AB}(B), m_{BC}(C), m_{CD}(D), m_{DE}(E):
  $$P(E) = \sum_{D} \sum_{C} \sum_{B} \sum_{A} P(A)\, P(B \mid A)\, P(C \mid B)\, P(D \mid C)\, P(E \mid D) = \sum_{D} P(E \mid D) \sum_{C} P(D \mid C) \sum_{B} P(C \mid B) \underbrace{\sum_{A} P(A)\, P(B \mid A)}_{m_{AB}(B)}$$
- Each partial sum involves only two neighboring variables, so the total cost is linear in the length of the chain (the forward recursion sketched below applies the same idea to the HMM)
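The forward recursion can be sketched in a few lines. This is a minimal illustration and not code from the lecture: the array names `init`, `trans`, `emit`, the discrete observation alphabet, and the toy numbers are all assumptions.

```python
import numpy as np

def hmm_sequence_probability(init, trans, emit, observations):
    """Forward algorithm: compute P(x_1, ..., x_T) for a discrete HMM.

    init[i]     = P(Y_1 = i)
    trans[i, j] = P(Y_{t+1} = j | Y_t = i)
    emit[i, k]  = P(X_t = k | Y_t = i)
    observations: list of observed symbol indices x_1, ..., x_T
    """
    # alpha[i] = P(x_1, ..., x_t, Y_t = i), updated left to right
    alpha = init * emit[:, observations[0]]
    for x_t in observations[1:]:
        # Push the sum over Y_{t-1} inward:
        # alpha_new[j] = (sum_i alpha[i] * trans[i, j]) * P(x_t | Y_t = j)
        alpha = (alpha @ trans) * emit[:, x_t]
    return alpha.sum()  # sum out the last hidden state

# Tiny 2-state, 2-symbol example (hypothetical numbers)
init = np.array([0.6, 0.4])
trans = np.array([[0.7, 0.3],
                  [0.2, 0.8]])
emit = np.array([[0.9, 0.1],
                 [0.3, 0.7]])
print(hmm_sequence_probability(init, trans, emit, [0, 1, 1]))
```

The cost is O(T k^2) for k hidden states, instead of the O(k^T) required by summing over all hidden sequences directly.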
HMM Problem II: best hidden states
- Given an observation sequence and a model, how do we compute the hidden state sequence with the highest probability? (decoding)
- Inference problem in graphical models: compute the maximum a posteriori (MAP) assignment of the hidden states given the observations:
  $$\arg\max_{Y_1, \dots, Y_T} P(X_1, \dots, X_T, Y_1, \dots, Y_T) = \arg\max_{Y_1, \dots, Y_T} P(Y_1) \prod_{t=1}^{T} P(X_t \mid Y_t) \prod_{t=2}^{T} P(Y_t \mid Y_{t-1})$$

HMM Problem II: best hidden states
- The general problem solved by the junction tree algorithm is the sum-product problem:
  $$P(X_i) \propto \sum_{X \setminus X_i} P(X_1, \dots, X_T) = \sum_{X \setminus X_i} \prod_{C} \Psi(D_C)$$
- The property used by the junction tree algorithm is the distributivity of × over +
- For HMM decoding, it is the distributivity of max over product; if you take the log of the probability, it is the distributivity of max over sum
- You just need to keep track of the hidden sequence that achieves the maximum score (this is the Viterbi algorithm)

HMM Problem III: find model parameters
- Given just an observation sequence, find the model parameters that maximize the likelihood of the data
- Suppose the set of parameters is Θ:
  $$\arg\max_{\Theta} \log P(X_1, \dots, X_T) = \arg\max_{\Theta} \log \sum_{Y_1, \dots, Y_T} P(Y_1) \prod_{t=1}^{T} P(X_t \mid Y_t) \prod_{t=2}^{T} P(Y_t \mid Y_{t-1})$$

EM algorithm
- EM: Expectation-Maximization for maximizing a likelihood with latent variables:
  $$\ell(\theta; D) = \log \sum_{z} p(x, z \mid \theta) = \log \sum_{z} p(x \mid z, \theta_2)\, p(z \mid \theta_1)$$
- Iterate between the E-step and the M-step until convergence
- Expectation step (E-step):
  $$f(\theta) = E_{q(z)}[\log p(x, z \mid \theta)], \quad \text{where } q(z) = p(z \mid x, \theta^t)$$
- Maximization step (M-step):
  $$\theta^{t+1} = \arg\max_{\theta} f(\theta)$$
- For HMMs this is also called the Baum-Welch algorithm

More complex sequence models

Kalman Filter
- Also called a state space model, or linear dynamical system:
  $$Y_t = C Y_{t-1} + w_t, \qquad X_t = A Y_t + v_t, \qquad w_t \sim N(0, Q), \quad v_t \sim N(0, R), \quad Y_0 \sim N(0, \Sigma)$$
- In a general (nonlinear) dynamical system, the linear maps are replaced by nonlinear functions F(Y_t) and G(Y_{t-1})
- An HMM with special transition and observation models:
  $$P(Y_t \mid Y_{t-1}) = N(Y_t;\, C Y_{t-1}, Q), \qquad P(X_t \mid Y_t) = N(X_t;\, A Y_t, R), \qquad P(Y_0) = N(0, \Sigma)$$

Inference problem in Kalman filter
- Filtering: given an observation sequence x_1, ..., x_t, estimate y_t
- Inference problem in graphical models: estimate the marginal probability P(y_t | x_1, ..., x_t) by integrating out all values of y_1, ..., y_{t-1}
- Forward message passing

Inference problem in Kalman filter
- Smoothing: given an observation sequence x_1, ..., x_T, estimate y_1, ..., y_T
- Inference problem in graphical models: for each t, estimate the marginal probability P(y_t | x_1, ..., x_T) by integrating out all hidden variables other than y_t
- Forward and backward message passing

Learning Kalman Filters
- An HMM with special transition and observation models:
  $$P(Y_t \mid Y_{t-1}) = N(Y_t;\, C Y_{t-1}, Q), \qquad P(X_t \mid Y_t) = N(X_t;\, A Y_t, R), \qquad P(Y_0) = N(0, \Sigma)$$
- Usually the dynamics A and C are chosen by hand, and only the noise models are estimated, e.g.
  - New position = old position + Δ × velocity + noise
  - Observation = true position + noise
- If the true positions are not provided, use the EM algorithm for learning

Tracking and Smoothing with Kalman Filters

Parameter Learning for Markov Random Fields
- Pairwise Markov random field with parameters θ:
  $$P(X_1, \dots, X_n \mid \theta) = \frac{1}{Z(\theta)} \prod_{(i,j)} \exp(\theta_{ij} X_i X_j) \prod_{i} \exp(\theta_i X_i), \qquad Z(\theta) = \sum_{x} \prod_{(i,j)} \exp(\theta_{ij} x_i x_j) \prod_{i} \exp(\theta_i x_i)$$
  (or other feature functions f(x_i), f(x_i, x_j) in place of x_i and x_i x_j)
- Log-likelihood of a data set D of m samples x^1, ..., x^m:
  $$\ell(\theta; D) = \frac{1}{m} \sum_{l=1}^{m} \log P(x^l \mid \theta) = \frac{1}{m} \sum_{l=1}^{m} \Big( \sum_{(i,j)} \theta_{ij} x_i^l x_j^l + \sum_{i} \theta_i x_i^l - \log Z(\theta) \Big)$$
- The first terms decompose over nodes and edges, but log Z(θ) does not decompose! (a brute-force illustration follows below)
[Figure: example pairwise MRF over X_1, ..., X_6]
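To see concretely how log Z(θ) couples all the parameters, here is a small brute-force sketch. It is not from the lecture: the function names, the ±1 encoding of the variables, and the toy chain and data are assumptions, and the full enumeration of configurations is only feasible because the model has three nodes.

```python
import itertools
import numpy as np

def score(x, theta_edge, theta_node, edges):
    """Unnormalized log-probability: sum_ij theta_ij x_i x_j + sum_i theta_i x_i."""
    s = sum(theta_edge[e] * x[i] * x[j] for e, (i, j) in enumerate(edges))
    s += sum(theta_node[i] * x[i] for i in range(len(x)))
    return s

def log_partition(theta_edge, theta_node, edges, n, values=(-1, 1)):
    """Brute-force log Z(theta): log-sum-exp over all assignments of the n variables."""
    scores = [score(x, theta_edge, theta_node, edges)
              for x in itertools.product(values, repeat=n)]
    return np.logaddexp.reduce(scores)

def avg_log_likelihood(data, theta_edge, theta_node, edges):
    """(1/m) sum_l [ score(x^l) - log Z(theta) ]."""
    log_z = log_partition(theta_edge, theta_node, edges, n=len(data[0]))
    return np.mean([score(x, theta_edge, theta_node, edges) for x in data]) - log_z

# Tiny example: a 3-node chain X1 - X2 - X3 with made-up parameters and data
edges = [(0, 1), (1, 2)]
theta_edge = np.array([0.5, -0.3])
theta_node = np.array([0.1, 0.0, -0.2])
data = [(1, 1, -1), (1, -1, -1), (-1, -1, -1)]
print(avg_log_likelihood(data, theta_edge, theta_node, edges))
```

Changing any single θ_ij changes every term inside the log-sum-exp, which is exactly why the log-likelihood does not decompose over nodes and edges the way the first terms do.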
Derivatives of log likelihood
- Starting from
  $$\ell(\theta; D) = \frac{1}{m} \sum_{l=1}^{m} \Big( \sum_{(i,j)} \theta_{ij} x_i^l x_j^l + \sum_{i} \theta_i x_i^l - \log Z(\theta) \Big)$$
  the derivative with respect to an edge parameter is
  $$\frac{\partial \ell(\theta; D)}{\partial \theta_{ij}} = \frac{1}{m} \sum_{l=1}^{m} x_i^l x_j^l - \frac{\partial \log Z(\theta)}{\partial \theta_{ij}}, \qquad \frac{\partial \log Z(\theta)}{\partial \theta_{ij}} = \frac{1}{Z(\theta)} \sum_{x} x_i x_j \prod_{(i,j)} \exp(\theta_{ij} x_i x_j) \prod_{i} \exp(\theta_i x_i)$$
- A convex problem: can find the global optimum
- The derivative of log Z(θ) is an expectation of X_i X_j under the model, so it needs to be computed by inference

Moment matching condition
  $$\frac{\partial \ell(\theta; D)}{\partial \theta_{ij}} = \frac{1}{m} \sum_{l=1}^{m} x_i^l x_j^l - \frac{1}{Z(\theta)} \sum_{x} x_i x_j \prod_{(i,j)} \exp(\theta_{ij} x_i x_j) \prod_{i} \exp(\theta_i x_i)$$
- The first term is the empirical covariance matrix, the second the covariance matrix from the model
- Empirical distribution:
  $$\widehat{P}(X_i, X_j) = \frac{1}{m} \sum_{l=1}^{m} \delta(X_i, x_i^l)\, \delta(X_j, x_j^l)$$
- Moment matching:
  $$\frac{\partial \ell(\theta; D)}{\partial \theta_{ij}} = E_{\widehat{P}(X_i, X_j)}[X_i X_j] - E_{P(X \mid \theta)}[X_i X_j]$$

Optimize MLE for undirected models
- max_θ ℓ(θ; D) is a convex optimization problem
- It can be solved by many methods, such as gradient ascent/descent or conjugate gradient:
  - Initialize the model parameters θ
  - Loop until convergence:
    - Compute $\frac{\partial \ell(\theta; D)}{\partial \theta_{ij}} = E_{\widehat{P}(X_i, X_j)}[X_i X_j] - E_{P(X \mid \theta)}[X_i X_j]$
    - Update $\theta_{ij} \leftarrow \theta_{ij} + \eta\, \frac{\partial \ell(\theta; D)}{\partial \theta_{ij}}$ (gradient ascent on the log-likelihood)
- Or use the gradient equation as a fixed-point iteration: iterative proportional fitting

Conditional Random Fields
- Focus on the conditional distribution P(Y_1, ..., Y_n | X_1, ..., X_n, X)
- Do not explicitly model the dependence among X_1, ..., X_n, X
- Only model the relations between Y-Y pairs and X-Y pairs
- Example with three labels:
  $$P(Y_1, Y_2, Y_3 \mid X_1, X_2, X_3, X) = \frac{1}{Z(X_1, X_2, X_3, X)}\, \Psi(Y_1, Y_2, X_1, X_2, X)\, \Psi(Y_2, Y_3, X_2, X_3, X)$$
  $$Z(X_1, X_2, X_3, X) = \sum_{y_1, y_2, y_3} \Psi(y_1, y_2, X_1, X_2, X)\, \Psi(y_2, y_3, X_2, X_3, X)$$
[Figure: CRF with labels Y_1, Y_2, Y_3, local inputs X_1, X_2, X_3, and a global input X]

Parameter Learning for Conditional Random Fields
- Pairwise CRF with parameters θ, conditioning the label field on the input X:
  $$P(Y_1, \dots, Y_n \mid X, \theta) = \frac{1}{Z(X, \theta)} \prod_{(i,j)} \exp(\theta_{ij} Y_i Y_j X) \prod_{i} \exp(\theta_i Y_i X), \qquad Z(X, \theta) = \sum_{y} \exp\Big( \sum_{(i,j)} \theta_{ij} y_i y_j X + \sum_{i} \theta_i y_i X \Big)$$
- Maximize the log conditional likelihood over m training pairs (x^l, y^l):
  $$\ell_c(\theta; D) = \frac{1}{m} \sum_{l=1}^{m} \log P(y^l \mid x^l, \theta) = \frac{1}{m} \sum_{l=1}^{m} \Big( \sum_{(i,j)} \theta_{ij} y_i^l y_j^l x^l + \sum_{i} \theta_i y_i^l x^l - \log Z(x^l, \theta) \Big)$$
- Again, log Z(x^l, θ) does not decompose!

Derivatives of log likelihood
  $$\frac{\partial \ell_c(\theta; D)}{\partial \theta_{ij}} = \frac{1}{m} \sum_{l=1}^{m} \Big( y_i^l y_j^l x^l - \frac{1}{Z(x^l, \theta)} \sum_{y} y_i y_j x^l \prod_{(i,j)} \exp(\theta_{ij} y_i y_j x^l) \prod_{i} \exp(\theta_i y_i x^l) \Big)$$
- Still a convex problem: can find the global optimum
- But now the model expectation of Y_i Y_j needs to be computed by inference for each training input x^l!

Moment matching condition
- Empirical distributions:
  $$\widehat{P}(Y_i, Y_j, X) = \frac{1}{m} \sum_{l=1}^{m} \delta(Y_i, y_i^l)\, \delta(Y_j, y_j^l)\, \delta(X, x^l), \qquad \widehat{P}(X) = \frac{1}{m} \sum_{l=1}^{m} \delta(X, x^l)$$
- Moment matching (empirical covariance vs. covariance from the model, averaged over the empirical input distribution):
  $$\frac{\partial \ell_c(\theta; D)}{\partial \theta_{ij}} = E_{\widehat{P}(Y_i, Y_j, X)}[Y_i Y_j X] - E_{P(Y \mid X, \theta)\, \widehat{P}(X)}[Y_i Y_j X]$$

Optimize MLE for undirected models
- max_θ ℓ_c(θ; D) is a convex optimization problem
- It can be solved by many methods, such as gradient ascent/descent or conjugate gradient (a small sketch follows below):
  - Initialize the model parameters θ
  - Loop until convergence:
    - Compute $\frac{\partial \ell_c(\theta; D)}{\partial \theta_{ij}} = E_{\widehat{P}(Y_i, Y_j, X)}[Y_i Y_j X] - E_{P(Y \mid X, \theta)\, \widehat{P}(X)}[Y_i Y_j X]$
    - Update $\theta_{ij} \leftarrow \theta_{ij} + \eta\, \frac{\partial \ell_c(\theta; D)}{\partial \theta_{ij}}$ (gradient ascent on the log conditional likelihood)
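The update loop on the two "Optimize MLE for undirected models" slides can be written out for the same tiny MRF used in the earlier sketch. This is a minimal illustration under the same assumptions (made-up names and data, ±1 variables, brute-force enumeration standing in for a real inference routine), so it only scales to very small models; the CRF version is the same loop with the model expectation recomputed for every training input x^l.

```python
import itertools
import numpy as np

def mrf_gradient_ascent(data, edges, n, lr=0.1, iters=200, values=(-1, 1)):
    """MLE for a pairwise MRF over n binary variables by moment matching.

    Gradient for theta_ij: E_data[X_i X_j] - E_model[X_i X_j]
    (and likewise E_data[X_i] - E_model[X_i] for theta_i), with the model
    expectation computed exactly by enumerating all configurations.
    """
    data = np.asarray(data, dtype=float)
    theta_edge = np.zeros(len(edges))
    theta_node = np.zeros(n)

    # Empirical moments from the data
    emp_edge = np.array([np.mean(data[:, i] * data[:, j]) for i, j in edges])
    emp_node = data.mean(axis=0)

    configs = np.array(list(itertools.product(values, repeat=n)), dtype=float)
    for _ in range(iters):
        # Unnormalized log-probabilities of all configurations under current theta
        scores = configs @ theta_node
        for e, (i, j) in enumerate(edges):
            scores = scores + theta_edge[e] * configs[:, i] * configs[:, j]
        probs = np.exp(scores - np.logaddexp.reduce(scores))  # P(x | theta)

        # Model moments E_{P(X|theta)}[X_i X_j] and E_{P(X|theta)}[X_i]
        model_edge = np.array([probs @ (configs[:, i] * configs[:, j]) for i, j in edges])
        model_node = probs @ configs

        # Gradient ascent on the log-likelihood (moment matching)
        theta_edge += lr * (emp_edge - model_edge)
        theta_node += lr * (emp_node - model_node)
    return theta_edge, theta_node

# Same hypothetical 3-node chain and data as before
edges = [(0, 1), (1, 2)]
data = [(1, 1, -1), (1, -1, -1), (-1, -1, -1)]
print(mrf_gradient_ascent(data, edges, n=3))
```

At convergence the gradient vanishes, so the model moments equal the empirical moments, which is exactly the moment matching condition above.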