Hidden Markov Models (HMM): Rabiner's Paper
Markoviana Reading Group
Computer Eng. & Science Dept., Arizona State University
Fatih Gelgi, Feb 2005

Stationary and Non-stationary
- Stationary process: its statistical properties do not vary with time.
- Non-stationary process: its signal properties vary over time.

HMM Example: Casino Coin
Two hidden states, Fair and Unfair, each tossing its own coin, so the model has two probability tables: state transition probabilities and symbol emission probabilities.

State transition probabilities:
           Fair   Unfair
  Fair     0.9    0.1
  Unfair   0.2    0.8

Symbol emission probabilities:
           H      T
  Fair     0.5    0.5
  Unfair   0.7    0.3

Observation symbols: H, T
Observation sequence: HTHHTTHHHTHTHTHHTHHHHHHTHTHH
State sequence:       FFFFFFUUUFFFFFFUUUUUUUFFFFFF

Motivation: given a sequence of Hs and Ts, can you tell at what times the casino cheated?

Properties of an HMM
- First-order Markov process: $q_t$ depends only on $q_{t-1}$.
- Time is discrete.

Elements of an HMM
- $N$, the number of states $S_1, S_2, \ldots, S_N$.
- $M$, the number of observation symbols $O_1, O_2, \ldots, O_M$.
- $\lambda$, the probability distributions $(A, B, \pi)$:
  - $A$: the $N \times N$ state transition matrix with entries $a_{ij}$;
  - $B$: the $N \times M$ symbol emission matrix with entries $b_j(k)$;
  - $\pi$: the $N \times 1$ initial state distribution.

HMM Basic Problems
1. Given an observation sequence $O = O_1 O_2 O_3 \ldots O_T$ and $\lambda$, find $P(O \mid \lambda)$: forward algorithm / backward algorithm.
2. Given $O = O_1 O_2 O_3 \ldots O_T$ and $\lambda$, find the most likely state sequence $Q = q_1 q_2 \ldots q_T$: Viterbi algorithm.
3. Given $O = O_1 O_2 O_3 \ldots O_T$, re-estimate $\lambda$ so that $P(O \mid \lambda)$ is higher than it is now: Baum-Welch re-estimation.

Forward Algorithm Illustration
$\alpha_t(i)$ is the probability of observing the partial sequence $O_1 O_2 O_3 \ldots O_t$ and ending in state $S_i$ at time $t$. In the trellis, the states $S_1, \ldots, S_N$ run along the vertical axis and the observations $O_1, \ldots, O_T$ along the horizontal axis; cell $(j, t)$ holds $\alpha_t(j)$. The first column is $\alpha_1(j) = \pi_j b_j(O_1)$, the second is $\alpha_2(j) = \big[\sum_i \alpha_1(i)\, a_{ij}\big] b_j(O_2)$, and so on; the total of the last column gives the solution.

Forward Algorithm
- Definition: $\alpha_t(i) = P(O_1 O_2 \ldots O_t,\ q_t = S_i \mid \lambda)$
- Initialization: $\alpha_1(i) = \pi_i\, b_i(O_1)$, for $1 \le i \le N$
- Induction: $\alpha_{t+1}(j) = \big[\sum_{i=1}^{N} \alpha_t(i)\, a_{ij}\big]\, b_j(O_{t+1})$, for $1 \le t \le T-1$, $1 \le j \le N$
- Problem 1 answer: $P(O \mid \lambda) = \sum_{i=1}^{N} \alpha_T(i)$
- Complexity: $O(N^2 T)$

Backward Algorithm
- Definition: $\beta_t(i) = P(O_{t+1} O_{t+2} \ldots O_T \mid q_t = S_i, \lambda)$, the probability of observing the partial sequence $O_{t+1} O_{t+2} \ldots O_T$ given state $S_i$ at time $t$.
- Initialization: $\beta_T(i) = 1$, for $1 \le i \le N$
- Induction: $\beta_t(i) = \sum_{j=1}^{N} a_{ij}\, b_j(O_{t+1})\, \beta_{t+1}(j)$, for $t = T-1, \ldots, 1$ and $1 \le i \le N$
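The two recursions above translate almost line for line into code. The following is a minimal NumPy sketch, not part of the original slides; the variable names `init`, `trans`, and `emit` are illustrative, and the uniform initial distribution for the casino model is an assumption, since the slides do not give $\pi$:

```python
import numpy as np

def forward(init, trans, emit, obs):
    """alpha[t, i] = P(O_1 .. O_t, q_t = S_i | lambda)."""
    T, N = len(obs), len(init)
    alpha = np.zeros((T, N))
    alpha[0] = init * emit[:, obs[0]]                 # alpha_1(i) = pi_i b_i(O_1)
    for t in range(1, T):
        # alpha_{t+1}(j) = [sum_i alpha_t(i) a_ij] * b_j(O_{t+1})
        alpha[t] = (alpha[t - 1] @ trans) * emit[:, obs[t]]
    return alpha

def backward(trans, emit, obs):
    """beta[t, i] = P(O_{t+1} .. O_T | q_t = S_i, lambda)."""
    T, N = len(obs), trans.shape[0]
    beta = np.zeros((T, N))
    beta[-1] = 1.0                                    # beta_T(i) = 1
    for t in range(T - 2, -1, -1):
        # beta_t(i) = sum_j a_ij b_j(O_{t+1}) beta_{t+1}(j)
        beta[t] = trans @ (emit[:, obs[t + 1]] * beta[t + 1])
    return beta

# Casino coin model from the example slide: states 0 = Fair, 1 = Unfair;
# symbols 0 = H, 1 = T. The uniform initial distribution is an assumption.
init  = np.array([0.5, 0.5])
trans = np.array([[0.9, 0.1],
                  [0.2, 0.8]])
emit  = np.array([[0.5, 0.5],
                  [0.7, 0.3]])
obs = [0, 1, 0, 0, 1, 1, 0, 0, 0]                     # H T H H T T H H H

alpha = forward(init, trans, emit, obs)
beta = backward(trans, emit, obs)
p_fwd = alpha[-1].sum()                               # Problem 1 answer: sum of last column
p_bwd = (init * emit[:, obs[0]] * beta[0]).sum()      # same quantity via beta
assert np.isclose(p_fwd, p_bwd)
```

Agreement between the two directions is a quick sanity check for an implementation; both evaluate the same $P(O \mid \lambda)$ in $O(N^2 T)$ time.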
Q2: Optimality Criterion 1
- Maximize the expected number of correct individual states.
- Definition: $\gamma_t(i)$ is the probability of being in state $S_i$ at time $t$ given the observation sequence $O$ and the model $\lambda$:
  $\gamma_t(i) = P(q_t = S_i \mid O, \lambda) = \alpha_t(i)\, \beta_t(i) \,/\, P(O \mid \lambda)$
- Problem 2 answer: choose the individually most likely state at each time, $q_t^* = \arg\max_{1 \le i \le N} \gamma_t(i)$ for $1 \le t \le T$.
- Problem: if some $a_{ij} = 0$, the optimal state sequence may not even be a valid state sequence.

Q2: Optimality Criterion 2
- Find the single best state sequence (path), i.e., maximize $P(Q \mid O, \lambda)$.
- Definition: $\delta_t(i)$ is the highest probability of a state path for the partial observation sequence $O_1 O_2 O_3 \ldots O_t$ ending in state $S_i$:
  $\delta_t(i) = \max_{q_1 \ldots q_{t-1}} P(q_1 q_2 \ldots q_{t-1},\ q_t = S_i,\ O_1 O_2 \ldots O_t \mid \lambda)$

Viterbi Algorithm
- The major difference from the forward algorithm: maximization instead of summation.
- Initialization: $\delta_1(i) = \pi_i\, b_i(O_1)$
- Recursion: $\delta_{t+1}(j) = \big[\max_i \delta_t(i)\, a_{ij}\big]\, b_j(O_{t+1})$, recording the maximizing predecessor $\psi_{t+1}(j) = \arg\max_i \delta_t(i)\, a_{ij}$ for the traceback.
- Termination: $P^* = \max_i \delta_T(i)$; the state sequence is recovered by backtracking through $\psi$.

Viterbi Algorithm Illustration
The trellis is the same as for the forward algorithm, with each sum replaced by a max: the first column is $\delta_1(j) = \pi_j\, b_j(O_1)$, the second is $\delta_2(j) = \big[\max_i \delta_1(i)\, a_{ij}\big]\, b_j(O_2)$, and so on. The max of the last column indicates where the traceback starts.

Relations with DBN
Each recursion is a local computation on the dynamic Bayesian network underlying the HMM:
- Forward function: $\alpha_{t+1}(j)$ combines $\alpha_t(i)$, $a_{ij}$, and $b_j(O_{t+1})$.
- Backward function: $\beta_t(i)$ combines $\beta_{t+1}(j)$, $a_{ij}$, and $b_j(O_{t+1})$, with $\beta_T(i) = 1$.
- Viterbi algorithm: $\delta_{t+1}(j)$ combines $\delta_t(i)$, $a_{ij}$, and $b_j(O_{t+1})$, with max in place of sum.

Some More Definitions
- $\gamma_t(i)$ is the probability of being in state $S_i$ at time $t$:
  $\gamma_t(i) = \alpha_t(i)\, \beta_t(i) \,/\, \sum_{j=1}^{N} \alpha_t(j)\, \beta_t(j)$
- $\xi_t(i,j)$ is the probability of being in state $S_i$ at time $t$ and in $S_j$ at time $t+1$:
  $\xi_t(i,j) = \alpha_t(i)\, a_{ij}\, b_j(O_{t+1})\, \beta_{t+1}(j) \,/\, P(O \mid \lambda)$

Baum-Welch Re-estimation
- An Expectation-Maximization (EM) algorithm.
- Expectation: compute $\gamma_t(i)$ and $\xi_t(i,j)$ under the current model $\lambda$.

Baum-Welch Re-estimation (cont'd)
- Maximization:
  - $\bar{\pi}_i = \gamma_1(i)$, the expected frequency of state $S_i$ at time $t = 1$.
  - $\bar{a}_{ij} = \sum_{t=1}^{T-1} \xi_t(i,j) \,/\, \sum_{t=1}^{T-1} \gamma_t(i)$, the expected number of transitions from $S_i$ to $S_j$ over the expected number of transitions out of $S_i$.
  - $\bar{b}_j(k) = \sum_{t:\, O_t = v_k} \gamma_t(j) \,/\, \sum_{t=1}^{T} \gamma_t(j)$, the expected number of times in $S_j$ observing symbol $v_k$ over the expected number of times in $S_j$.

Notes on the Re-estimation
- If the model does not change, it has reached a local maximum.
- Depending on the model, many local maxima can exist.
- The re-estimated probabilities sum to 1.

Implementation Issues
- Scaling
- Multiple observation sequences
- Initial parameter estimation
- Missing data
- Choice of model size and type

Scaling
- The $\alpha_t(i)$ values head exponentially to zero as $t$ grows, so the trellis is rescaled column by column.
- Recursion with scaling: compute each column from the scaled previous column, then normalize it with the coefficient $c_t = 1 \,/\, \sum_{i=1}^{N} \alpha_t(i)$, giving $\hat{\alpha}_t(i) = c_t\, \alpha_t(i)$.

Scaling (cont'd)
- Desired condition: $\sum_{i=1}^{N} \hat{\alpha}_t(i) = 1$ for every $t$.
- The backward variables are scaled step by step with the same coefficients, $\hat{\beta}_t(i) = c_t\, \beta_t(i)$. Note that $\sum_{i=1}^{N} \hat{\beta}_t(i) = 1$ is not true!
- The scale factors cancel in the $\gamma$ and $\xi$ quotients, so the re-estimation formulas are unchanged.
- Since $\big[\prod_{t=1}^{T} c_t\big]\, P(O \mid \lambda) = 1$, the likelihood can be computed without underflow as $\log P(O \mid \lambda) = -\sum_{t=1}^{T} \log c_t$.

Maximum Log-Likelihood
The Viterbi recursion can instead be run entirely in the log domain, which avoids scaling altogether:
- Initialization: $\phi_1(i) = \log \pi_i + \log b_i(O_1)$
- Recursion: $\phi_{t+1}(j) = \max_i \big[\phi_t(i) + \log a_{ij}\big] + \log b_j(O_{t+1})$
- Termination: $\log P^* = \max_i \phi_T(i)$

Multiple Observation Sequences
- Problem with re-estimation: the formulas above assume a single observation sequence, which is rarely enough to estimate all parameters.
- With $K$ independent sequences, accumulate the numerators and denominators over all sequences before dividing, e.g. $\bar{a}_{ij} = \sum_{k=1}^{K} \sum_{t} \xi_t^{(k)}(i,j) \,/\, \sum_{k=1}^{K} \sum_{t} \gamma_t^{(k)}(i)$.

Initial Estimates of Parameters
- For $\pi$ and $A$, random or uniform initialization is sufficient.
- For $B$ (discrete symbol probabilities), a good initial estimate is needed.

Insufficient Training Data
Solutions:
- Increase the size of the training data.
- Reduce the size of the model.
- Interpolate parameters using another model.
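To make the log-domain recursion concrete, here is a sketch of Viterbi decoding carried out entirely with log probabilities. It is illustrative code in the same conventions as the earlier sketch, not code from the slides:

```python
import numpy as np

def viterbi_log(init, trans, emit, obs):
    """Most likely state path, arg max_Q P(Q, O | lambda), in the log domain."""
    T, N = len(obs), len(init)
    log_a = np.log(trans)
    phi = np.zeros((T, N))                  # phi_t(i) = log delta_t(i)
    psi = np.zeros((T, N), dtype=int)       # traceback pointers
    phi[0] = np.log(init) + np.log(emit[:, obs[0]])
    for t in range(1, T):
        # phi_t(j) = max_i [phi_{t-1}(i) + log a_ij] + log b_j(O_t)
        scores = phi[t - 1][:, None] + log_a        # scores[i, j]
        psi[t] = scores.argmax(axis=0)
        phi[t] = scores.max(axis=0) + np.log(emit[:, obs[t]])
    # Termination: the best final state is where the traceback starts.
    path = [int(phi[-1].argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(psi[t, path[-1]]))
    return path[::-1], phi[-1].max()        # state sequence, log P*
```

On the casino sequence from the example slide, the decoded path labels each toss Fair (0) or Unfair (1), answering the motivating question of when the casino cheated.

And here is a sketch of one full Baum-Welch iteration using the scaled forward/backward variables described above; again this is illustrative code under the same assumed conventions, not the slides' own implementation. The scale factors cancel in the $\gamma$ and $\xi$ quotients exactly as noted on the scaling slides:

```python
import numpy as np

def baum_welch_step(init, trans, emit, obs):
    """One EM iteration; returns the re-estimated (pi, A, B)."""
    obs = np.asarray(obs)
    T, N = len(obs), len(init)
    # Scaled forward pass: normalize each trellis column, remembering c_t.
    alpha = np.zeros((T, N))
    c = np.zeros(T)
    alpha[0] = init * emit[:, obs[0]]
    c[0] = 1.0 / alpha[0].sum()
    alpha[0] *= c[0]
    for t in range(1, T):
        alpha[t] = (alpha[t - 1] @ trans) * emit[:, obs[t]]
        c[t] = 1.0 / alpha[t].sum()
        alpha[t] *= c[t]
    # Scaled backward pass, reusing the same coefficients c_t.
    beta = np.zeros((T, N))
    beta[-1] = c[-1]
    for t in range(T - 2, -1, -1):
        beta[t] = c[t] * (trans @ (emit[:, obs[t + 1]] * beta[t + 1]))
    # E-step: with scaled variables the factors cancel, up to one c_t in gamma.
    gamma = alpha * beta / c[:, None]                     # gamma_t(i)
    xi = (alpha[:-1, :, None] * trans[None, :, :] *
          (emit[:, obs[1:]].T * beta[1:])[:, None, :])    # xi_t(i,j)
    # M-step: the re-estimation formulas from the slides.
    new_init = gamma[0]
    new_trans = xi.sum(axis=0) / gamma[:-1].sum(axis=0)[:, None]
    new_emit = np.array([gamma[obs == k].sum(axis=0)
                         for k in range(emit.shape[1])]).T
    new_emit /= gamma.sum(axis=0)[:, None]
    # Convergence can be monitored via log P(O | lambda) = -np.log(c).sum().
    return new_init, new_trans, new_emit
```

Iterating this step until $-\sum_t \log c_t$ stops improving reaches the local maximum described in the re-estimation notes.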
References
- L. Rabiner. "A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition." Proceedings of the IEEE, 1989.
- S. Russell and P. Norvig. "Probabilistic Reasoning over Time." Artificial Intelligence: A Modern Approach, Ch. 15, 2002 (draft).
- V. Borkar, K. Deshmukh, and S. Sarawagi. "Automatic Segmentation of Text into Structured Records." ACM SIGMOD, 2001.
- T. Scheffer, C. Decomain, and S. Wrobel. "Active Hidden Markov Models for Information Extraction." Proceedings of the International Symposium on Intelligent Data Analysis, 2001.
- S. Ray and M. Craven. "Representing Sentence Structure in Hidden Markov Models for Information Extraction." Proceedings of the 17th International Joint Conference on Artificial Intelligence, 2001.