Hidden Markov Models

Outline:
- Markov models
- Hidden Markov models
- Forward/Backward algorithm
- Viterbi algorithm
- Baum-Welch estimation algorithm

Markov Models

Observable Markov Models
- Observable states: $S_1, S_2, \ldots, S_N$. Below we designate them simply as $1, 2, \ldots, N$.
- The actual state at time $t$ is $q_t$, giving a state sequence $q_1, q_2, \ldots, q_t, \ldots, q_T$.
- First-order Markov assumption:
  $P(q_t = j \mid q_{t-1} = i, q_{t-2} = k, \ldots) = P(q_t = j \mid q_{t-1} = i)$
- Stationarity:
  $P(q_t = j \mid q_{t-1} = i) = P(q_{t+l} = j \mid q_{t+l-1} = i)$

Markov Models
State transition matrix $A$:
$$A = \begin{pmatrix}
a_{11} & a_{12} & \cdots & a_{1j} & \cdots & a_{1N} \\
a_{21} & a_{22} & \cdots & a_{2j} & \cdots & a_{2N} \\
\vdots &        &        & \vdots &        & \vdots \\
a_{i1} & a_{i2} & \cdots & a_{ij} & \cdots & a_{iN} \\
\vdots &        &        & \vdots &        & \vdots \\
a_{N1} & a_{N2} & \cdots & a_{Nj} & \cdots & a_{NN}
\end{pmatrix}$$
where $a_{ij} = P(q_t = j \mid q_{t-1} = i)$, $1 \le i, j \le N$.
Constraints on $a_{ij}$: $a_{ij} \ge 0$ for all $i, j$, and $\sum_{j=1}^{N} a_{ij} = 1$ for all $i$.

Markov Models: Example
States: 1. Rainy (R), 2. Cloudy (C), 3. Sunny (S).
State transition probability matrix:
$$A = \begin{pmatrix}
0.4 & 0.3 & 0.3 \\
0.2 & 0.6 & 0.2 \\
0.1 & 0.1 & 0.8
\end{pmatrix}$$
Compute the probability of observing SSRRSCS given that today is S.

Markov Models: Example
Basic conditional probability rule: $P(A, B) = P(A \mid B)\,P(B)$.
The Markov chain rule:
$$\begin{aligned}
P(q_1, q_2, \ldots, q_T) &= P(q_T \mid q_1, q_2, \ldots, q_{T-1})\,P(q_1, q_2, \ldots, q_{T-1}) \\
&= P(q_T \mid q_{T-1})\,P(q_1, q_2, \ldots, q_{T-1}) \\
&= P(q_T \mid q_{T-1})\,P(q_{T-1} \mid q_{T-2})\,P(q_1, q_2, \ldots, q_{T-2}) \\
&= P(q_T \mid q_{T-1})\,P(q_{T-1} \mid q_{T-2}) \cdots P(q_2 \mid q_1)\,P(q_1)
\end{aligned}$$

Markov Models: Example
Observation sequence $O$: $O = (S, S, S, R, R, S, C, S)$.
Using the chain rule we get:
$$\begin{aligned}
P(O \mid \text{model}) &= P(S, S, S, R, R, S, C, S \mid \text{model}) \\
&= P(S)\,P(S \mid S)\,P(S \mid S)\,P(R \mid S)\,P(R \mid R)\,P(S \mid R)\,P(C \mid S)\,P(S \mid C) \\
&= \pi_3\, a_{33}\,a_{33}\,a_{31}\,a_{11}\,a_{13}\,a_{32}\,a_{23} \\
&= (1)(0.8)^2 (0.1)(0.4)(0.3)(0.1)(0.2) = 1.536 \times 10^{-4}
\end{aligned}$$

Markov Models: Example
What is the probability that the sequence remains in state $i$ for exactly $d$ time units?
$$p_i(d) = P(q_2 = i, \ldots, q_d = i, q_{d+1} \ne i \mid q_1 = i) = (a_{ii})^{d-1}(1 - a_{ii})$$
This is the exponential (geometric) Markov chain duration density.
What is the expected value of the duration $d$ in state $i$?
$$\bar{d}_i = \sum_{d=1}^{\infty} d\,p_i(d) = \sum_{d=1}^{\infty} d\,(a_{ii})^{d-1}(1 - a_{ii})$$

Markov Models: Example
$$\bar{d}_i = (1 - a_{ii}) \sum_{d=1}^{\infty} d\,(a_{ii})^{d-1}
= (1 - a_{ii}) \frac{\partial}{\partial a_{ii}} \sum_{d=1}^{\infty} (a_{ii})^{d}
= (1 - a_{ii}) \frac{\partial}{\partial a_{ii}} \left( \frac{a_{ii}}{1 - a_{ii}} \right)
= \frac{1}{1 - a_{ii}}$$

Markov Models: Example
- Avg. number of consecutive sunny days = $\frac{1}{1 - a_{33}} = \frac{1}{1 - 0.8} = 5$
- Avg. number of consecutive cloudy days = $\frac{1}{1 - 0.6} = 2.5$
- Avg. number of consecutive rainy days = $\frac{1}{1 - 0.4} \approx 1.67$
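To make the weather example concrete, here is a minimal Python sketch (not part of the original slides; the function name `chain_probability` and the R/C/S index ordering are illustrative assumptions) that evaluates the chain-rule product and the expected state duration:

```python
import numpy as np

# Transition matrix of the weather example: rows/columns ordered (Rainy, Cloudy, Sunny).
A = np.array([[0.4, 0.3, 0.3],
              [0.2, 0.6, 0.2],
              [0.1, 0.1, 0.8]])

def chain_probability(states, A, p_first=1.0):
    """P(q_1, ..., q_T) via the Markov chain rule, with P(q_1) = p_first."""
    p = p_first
    for prev, cur in zip(states[:-1], states[1:]):
        p *= A[prev, cur]
    return p

R, C, S = 0, 1, 2
O = [S, S, S, R, R, S, C, S]          # today is S (given), followed by S S R R S C S
print(chain_probability(O, A))        # ~1.536e-04, as computed on the slide

# Expected duration in state i is 1 / (1 - a_ii): about 5 consecutive sunny days.
print(1.0 / (1.0 - A[S, S]))
```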
Hidden Markov Models
- States are not observable.
- Observations are probabilistic functions of the state.
- State transitions are still probabilistic.

Urn and Ball Model
- N urns containing colored balls
- M distinct colors of balls
- Each urn has a (possibly) different distribution of colors

Sequence generation algorithm:
1. Pick the initial urn according to some random process.
2. Randomly pick a ball from the urn and then replace it.
3. Select another urn according to a random selection process associated with the current urn.
4. Repeat steps 2 and 3.

The Trellis
(figure: the trellis — hidden states $1, \ldots, N$ on the vertical axis, time steps $1, \ldots, T$ on the horizontal axis, with the observations $o_1, \ldots, o_T$ along the time axis)

Elements of Hidden Markov Models
- $N$ – the number of hidden states
- $Q$ – the set of states, $Q = \{1, 2, \ldots, N\}$
- $M$ – the number of symbols
- $V$ – the set of symbols, $V = \{1, 2, \ldots, M\}$
- $A$ – the state-transition probability matrix: $a_{ij} = P(q_{t+1} = j \mid q_t = i)$, $1 \le i, j \le N$
- $B$ – the observation probability distribution: $b_j(k) = P(o_t = v_k \mid q_t = j)$, $1 \le j \le N$, $1 \le k \le M$
- $\pi$ – the initial state distribution: $\pi_i = P(q_1 = i)$, $1 \le i \le N$
- $\lambda$ – the entire model, $\lambda = (A, B, \pi)$

Three Basic Problems
1. EVALUATION – given an observation sequence $O = (o_1, o_2, \ldots, o_T)$ and a model $\lambda = (A, B, \pi)$, efficiently compute $P(O \mid \lambda)$.
   - The hidden states complicate the evaluation.
   - Given two models $\lambda_1$ and $\lambda_2$, this can be used to choose the better one.
2. DECODING – given an observation sequence $O = (o_1, o_2, \ldots, o_T)$ and a model $\lambda$, find the optimal state sequence $q = (q_1, q_2, \ldots, q_T)$.
   - An optimality criterion has to be decided (e.g. maximum likelihood).
   - This gives an "explanation" of the data.
3. LEARNING – given $O = (o_1, o_2, \ldots, o_T)$, estimate the model parameters $\lambda = (A, B, \pi)$ that maximize $P(O \mid \lambda)$.

Solution to Problem 1
Problem: compute $P(o_1, o_2, \ldots, o_T \mid \lambda)$.
Algorithm:
- Let $q = (q_1, q_2, \ldots, q_T)$ be a state sequence.
- Assume the observations are independent given the states:
$$P(O \mid q, \lambda) = \prod_{t=1}^{T} P(o_t \mid q_t, \lambda) = b_{q_1}(o_1)\,b_{q_2}(o_2) \cdots b_{q_T}(o_T)$$
- The probability of a particular state sequence is:
$$P(q \mid \lambda) = \pi_{q_1} a_{q_1 q_2} a_{q_2 q_3} \cdots a_{q_{T-1} q_T}$$
- Also, $P(O, q \mid \lambda) = P(O \mid q, \lambda)\,P(q \mid \lambda)$.

Solution to Problem 1
- Enumerate all paths and sum their probabilities:
$$P(O \mid \lambda) = \sum_{q} P(O \mid q, \lambda)\,P(q \mid \lambda)$$
- There are $N^T$ state sequences, each requiring $O(T)$ calculations, so direct evaluation requires $O(T N^T)$ calculations.

Forward Procedure: Intuition
(figure: at time $t+1$, state $k$ collects probability mass from every state at time $t$ through the transitions $a_{1k}, \ldots, a_{Nk}$)

Forward Algorithm
Define the forward variable $\alpha_t(i)$ as:
$$\alpha_t(i) = P(o_1, o_2, \ldots, o_t, q_t = i \mid \lambda)$$
$\alpha_t(i)$ is the probability of observing the partial sequence $(o_1, o_2, \ldots, o_t)$ and being in state $i$ at time $t$.
Induction:
1. Initialization: $\alpha_1(i) = \pi_i b_i(o_1)$
2. Induction: $\alpha_{t+1}(j) = \left[ \sum_{i=1}^{N} \alpha_t(i)\,a_{ij} \right] b_j(o_{t+1})$
3. Termination: $P(O \mid \lambda) = \sum_{i=1}^{N} \alpha_T(i)$
4. Complexity: $O(N^2 T)$

Example
Consider the following coin-tossing experiment:

          State 1   State 2   State 3
  P(H)      0.5       0.75      0.25
  P(T)      0.5       0.25      0.75

- All state-transition probabilities are equal to 1/3.
- All initial state probabilities are equal to 1/3.

Example
You observe $O = (H, H, H, H, T, H, T, T, T, T)$.
1. What state sequence $q$ is most likely? What is the joint probability $P(O, q \mid \lambda)$ of the observation sequence and that state sequence?
2. What is the probability that the observation sequence came entirely from state 1?
3. Consider the observation sequence $O' = (H, T, T, H, T, H, H, T, T, H)$. How would your answers to parts 1 and 2 change?
4. If the state-transition probabilities were
$$A' = \begin{pmatrix} 0.9 & 0.05 & 0.05 \\ 0.45 & 0.1 & 0.45 \\ 0.45 & 0.45 & 0.1 \end{pmatrix}$$
how would the new model $\lambda'$ change your answers to parts 1–3?
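The forward recursion above takes only a few lines of NumPy. The sketch below is a minimal illustration of the initialization, induction, and termination steps, applied to the coin-tossing model of the example; the function name `forward` and the H = 0, T = 1 symbol encoding are assumptions made here, not part of the slides.

```python
import numpy as np

def forward(pi, A, B, obs):
    """Forward algorithm: returns P(O | lambda) in O(N^2 T) time.

    pi  : (N,)  initial state distribution
    A   : (N,N) transition matrix, A[i, j] = P(q_{t+1} = j | q_t = i)
    B   : (N,M) emission matrix,   B[j, k] = P(o_t = v_k | q_t = j)
    obs : list of observation indices (o_1, ..., o_T)
    """
    alpha = pi * B[:, obs[0]]              # initialization: alpha_1(i) = pi_i b_i(o_1)
    for o in obs[1:]:                      # induction over t = 2, ..., T
        alpha = (alpha @ A) * B[:, o]      # alpha_{t+1}(j) = [sum_i alpha_t(i) a_ij] b_j(o_{t+1})
    return alpha.sum()                     # termination: P(O | lambda) = sum_i alpha_T(i)

# Coin-tossing example: three states, symbols encoded as H = 0, T = 1.
pi = np.full(3, 1/3)
A  = np.full((3, 3), 1/3)
B  = np.array([[0.5, 0.5],
               [0.75, 0.25],
               [0.25, 0.75]])
O  = [0, 0, 0, 0, 1, 0, 1, 1, 1, 1]        # H H H H T H T T T T
print(forward(pi, A, B, O))                # P(O | lambda)
```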
Backward Algorithm
(figure: the trellis again — states $1, \ldots, N$ versus time steps $1, \ldots, T$, with observations $o_1, \ldots, o_T$ along the time axis)

Backward Algorithm
Define the backward variable $\beta_t(i)$ as:
$$\beta_t(i) = P(o_{t+1}, o_{t+2}, \ldots, o_T \mid q_t = i, \lambda)$$
$\beta_t(i)$ is the probability of observing the partial sequence $(o_{t+1}, o_{t+2}, \ldots, o_T)$ given that the state at time $t$ is $i$.
Induction:
1. Initialization: $\beta_T(i) = 1$
2. Induction: $\beta_t(i) = \sum_{j=1}^{N} a_{ij}\,b_j(o_{t+1})\,\beta_{t+1}(j)$, for $1 \le i \le N$ and $t = T-1, \ldots, 1$

Solution to Problem 2
Choose the most likely path: find the path $(q_1, q_2, \ldots, q_T)$ that maximizes the likelihood $P(q_1, q_2, \ldots, q_T \mid O, \lambda)$.
Solution by dynamic programming. Define
$$\delta_t(i) = \max_{q_1, q_2, \ldots, q_{t-1}} P(q_1, q_2, \ldots, q_t = i, o_1, o_2, \ldots, o_t \mid \lambda)$$
$\delta_t(i)$ is the probability of the highest-probability path ending in state $i$ at time $t$. By induction we have:
$$\delta_{t+1}(j) = \left[ \max_i \delta_t(i)\,a_{ij} \right] b_j(o_{t+1})$$

Viterbi Algorithm
(figure: the trellis with the maximizing transition $a_{ik}$ into state $k$ at time $t+1$ highlighted)

Viterbi Algorithm
Initialization: $\delta_1(i) = \pi_i b_i(o_1)$, $\psi_1(i) = 0$, for $1 \le i \le N$.
Recursion, for $2 \le t \le T$ and $1 \le j \le N$:
$$\delta_t(j) = \max_{1 \le i \le N} [\delta_{t-1}(i)\,a_{ij}]\,b_j(o_t), \qquad \psi_t(j) = \arg\max_{1 \le i \le N} [\delta_{t-1}(i)\,a_{ij}]$$
Termination:
$$P_T^* = \max_{1 \le i \le N} \delta_T(i), \qquad q_T^* = \arg\max_{1 \le i \le N} \delta_T(i)$$
Path (state sequence) backtracking:
$$q_t^* = \psi_{t+1}(q_{t+1}^*), \qquad t = T-1, T-2, \ldots, 1$$
(A short Python sketch of this recursion is given after the references.)

Solution to Problem 3
Estimate $\lambda = (A, B, \pi)$ so as to maximize $P(O \mid \lambda)$.
There is no analytic solution because of the complexity of the problem, so an iterative solution is used.
Baum-Welch algorithm (actually an EM algorithm):
1. Let the initial model be $\lambda_0$.
2. Compute a new model $\lambda$ based on $\lambda_0$ and the observation $O$.
3. If $\log P(O \mid \lambda) - \log P(O \mid \lambda_0) < \text{DELTA}$, stop.
4. Else set $\lambda_0 \leftarrow \lambda$ and go to step 2.

Baum-Welch: Preliminaries
Define $\xi_t(i, j)$ as the probability of being in state $i$ at time $t$ and in state $j$ at time $t+1$, given the model and the observation sequence:
$$\xi_t(i, j) = P(q_t = i, q_{t+1} = j \mid O, \lambda)
= \frac{\alpha_t(i)\,a_{ij}\,b_j(o_{t+1})\,\beta_{t+1}(j)}{P(O \mid \lambda)}
= \frac{\alpha_t(i)\,a_{ij}\,b_j(o_{t+1})\,\beta_{t+1}(j)}{\sum_{i=1}^{N} \sum_{j=1}^{N} \alpha_t(i)\,a_{ij}\,b_j(o_{t+1})\,\beta_{t+1}(j)}$$
Define $\gamma_t(i)$ as the probability of being in state $i$ at time $t$, given the observation sequence:
$$\gamma_t(i) = \sum_{j=1}^{N} \xi_t(i, j)$$

Baum-Welch: Preliminaries
$\sum_{t=1}^{T-1} \gamma_t(i)$ is the expected number of times state $i$ is visited (equivalently, the expected number of transitions made from state $i$).
$\sum_{t=1}^{T-1} \xi_t(i, j)$ is the expected number of transitions from state $i$ to state $j$.

Baum-Welch: Update Rules
$\bar{\pi}_i$ = expected frequency of being in state $i$ at time $t = 1$:
$$\bar{\pi}_i = \gamma_1(i)$$
$\bar{a}_{ij}$ = (expected number of transitions from state $i$ to state $j$) / (expected number of transitions from state $i$):
$$\bar{a}_{ij} = \frac{\sum_{t=1}^{T-1} \xi_t(i, j)}{\sum_{t=1}^{T-1} \gamma_t(i)}$$
$\bar{b}_j(k)$ = (expected number of times in state $j$ observing symbol $v_k$) / (expected number of times in state $j$):
$$\bar{b}_j(k) = \frac{\sum_{t:\,o_t = v_k} \gamma_t(j)}{\sum_{t} \gamma_t(j)}$$

References
- L. Rabiner and B. Juang. An Introduction to Hidden Markov Models. IEEE ASSP Magazine, pp. 4-16, January 1986.
- T. Starner, J. Weaver, and A. Pentland. Machine Vision Recognition of American Sign Language Using Hidden Markov Model.
- T. Starner and A. Pentland. Visual Recognition of American Sign Language Using Hidden Markov Models. In International Workshop on Automatic Face and Gesture Recognition, pages 189-194, 1995. http://citeseer.nj.nec.com/starner95visual.html
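As referenced in the Viterbi Algorithm section above, here is a minimal NumPy sketch of the Viterbi recursion and path backtracking, run on the coin-tossing example with its uniform transition matrix. The function name `viterbi`, the H = 0, T = 1 encoding, and the 0-indexed states are illustrative assumptions, not part of the slides.

```python
import numpy as np

def viterbi(pi, A, B, obs):
    """Viterbi decoding: most likely state path q* and its joint probability P(O, q* | lambda)."""
    N, T = A.shape[0], len(obs)
    delta = np.zeros((T, N))               # delta_t(j): best path probability ending in state j at time t
    psi = np.zeros((T, N), dtype=int)      # psi_t(j): best predecessor of state j at time t
    delta[0] = pi * B[:, obs[0]]           # initialization: delta_1(i) = pi_i b_i(o_1)
    for t in range(1, T):                  # recursion
        scores = delta[t - 1][:, None] * A # scores[i, j] = delta_{t-1}(i) * a_ij
        psi[t] = scores.argmax(axis=0)
        delta[t] = scores.max(axis=0) * B[:, obs[t]]
    path = [int(delta[-1].argmax())]       # termination: q_T* = argmax_i delta_T(i)
    for t in range(T - 1, 0, -1):          # backtracking: q_t* = psi_{t+1}(q_{t+1}*)
        path.append(int(psi[t, path[-1]]))
    return path[::-1], float(delta[-1].max())

# Coin-tossing example (uniform transitions and initial probabilities), H = 0, T = 1.
pi = np.full(3, 1/3)
A  = np.full((3, 3), 1/3)
B  = np.array([[0.5, 0.5],
               [0.75, 0.25],
               [0.25, 0.75]])
O  = [0, 0, 0, 0, 1, 0, 1, 1, 1, 1]        # H H H H T H T T T T
best_path, joint_prob = viterbi(pi, A, B, O)
print(best_path, joint_prob)               # states are 0-indexed here (state 1 of the slides = index 0)
```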