Note to other teachers and users of these slides. Andrew would be delighted if you found this source material useful in giving your own lectures. Feel free to use these slides verbatim, or to modify them to fit your own needs. PowerPoint originals are available. If you make use of a significant portion of these slides in your own lecture, please include this message, or the following link to the source repository of Andrew's tutorials: http://www.cs.cmu.edu/~awm/tutorials . Comments and corrections gratefully received.

Hidden Markov Models
Andrew W. Moore
Professor, School of Computer Science, Carnegie Mellon University
www.cs.cmu.edu/~awm   awm@cs.cmu.edu   412-268-7599
Copyright © Andrew W. Moore

A Markov System

Has N states, called s1, s2 .. sN. There are discrete timesteps, t=0, t=1, … (the running example has N=3, starting at t=0).

On the t'th timestep the system is in exactly one of the available states. Call it qt. Note: qt ∈ {s1, s2 .. sN}. In the example, the current state at t=0 is q0 = s3; after one step, at t=1, it is q1 = s2.

Between each timestep, the next state is chosen randomly, and the current state determines the probability distribution for the next state. For the example:

P(qt+1=s1 | qt=s1) = 0     P(qt+1=s2 | qt=s1) = 0     P(qt+1=s3 | qt=s1) = 1
P(qt+1=s1 | qt=s2) = 1/2   P(qt+1=s2 | qt=s2) = 1/2   P(qt+1=s3 | qt=s2) = 0
P(qt+1=s1 | qt=s3) = 1/3   P(qt+1=s2 | qt=s3) = 2/3   P(qt+1=s3 | qt=s3) = 0

This is often notated with arcs between states, each arc labelled with its transition probability: s1 → s3 with probability 1; s2 → s1 and s2 → s2 each with probability 1/2; s3 → s1 with probability 1/3 and s3 → s2 with probability 2/3.
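To make this concrete, here is a small simulation sketch added for these notes (it is not in the original slides). It assumes Python with numpy; the array name A and the function run_chain are my own choices, and A holds the example transition probabilities above.

    import numpy as np

    # A[i, j] = P(q_{t+1} = s_{j+1} | q_t = s_{i+1}) for the 3-state example
    A = np.array([[0.0, 0.0, 1.0],      # from s1: always go to s3
                  [0.5, 0.5, 0.0],      # from s2: s1 or s2, each with probability 1/2
                  [1/3, 2/3, 0.0]])     # from s3: s1 with prob 1/3, s2 with prob 2/3

    rng = np.random.default_rng(0)

    def run_chain(start_state, n_steps):
        """Sample q_0 .. q_{n_steps}; states are 0-indexed here, so 0 means s1."""
        q = start_state
        path = [q]
        for _ in range(n_steps):
            q = rng.choice(3, p=A[q])   # next state drawn from the current state's row
            path.append(q)
        return path

    print(run_chain(start_state=2, n_steps=10))   # start in s3, as on the t=0 slide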
Markov Property

qt+1 is conditionally independent of {qt-1, qt-2, … q1, q0} given qt. In other words:

P(qt+1 = sj | qt = si) = P(qt+1 = sj | qt = si, any earlier history)

Question: what would be the best Bayes Net structure to represent the Joint Distribution of (q0, q1, q2, q3, q4)?

Answer: a simple chain, q0 → q1 → q2 → q3 → q4.

Each of the probability tables attached to q1, q2, q3, q4 is identical. The table has one row per current state i and one column per next state j:

              j = 1    j = 2    …    j = N
   i = 1       a11      a12     …     a1N
   i = 2       a21      a22     …     a2N
     :          :        :             :
   i = N       aN1      aN2     …     aNN

Notation:  aij = P(qt+1 = sj | qt = si)

A Blind Robot

A human and a robot wander around randomly on a grid. The STATE q is the pair (Location of Robot, Location of Human).

Dynamics of System

Starting from some initial configuration q0, each timestep the human moves randomly to an adjacent cell, and the Robot also moves randomly to an adjacent cell. Typical Questions:
• "What's the expected time until the human is crushed like a bug?"
• "What's the probability that the robot will hit the left wall before it hits the human?"
• "What's the probability Robot crushes human on next time step?"

Example Question

"It's currently time t, and human remains uncrushed. What's the probability of crushing occurring at time t + 1?"
• If robot is blind: we can compute this in advance. (We'll do this first.)
• If robot is omniscient (i.e. if robot knows the state at time t): it can compute the answer directly. (Too easy. We won't do this.)
• If robot has some sensors, but incomplete state information: Hidden Markov Models are applicable! (Main body of lecture.)

What is P(qt = s)? Slow, stupid answer

Step 1: Work out how to compute P(Q) for any path Q = q1 q2 q3 .. qt. Given we know the start state q1 (i.e. P(q1) = 1):

P(q1 q2 .. qt) = P(q1 q2 .. qt-1) P(qt | q1 q2 .. qt-1)      (WHY?)
              = P(q1 q2 .. qt-1) P(qt | qt-1)
              = P(q2|q1) P(q3|q2) … P(qt|qt-1)

Step 2: Use this knowledge to get P(qt = s):

P(qt = s) = Σ P(Q),   the sum taken over all paths Q of length t that end in s.

What is P(qt = s)? Clever answer

• For each state si, define pt(i) = Prob. state is si at time t = P(qt = si).
• Easy to do inductive definition:

p0(i) = 1 if si is the start state, 0 otherwise.

pt+1(j) = P(qt+1 = sj)
        = Σi=1..N P(qt+1 = sj ∧ qt = si)
        = Σi=1..N P(qt+1 = sj | qt = si) P(qt = si)
        = Σi=1..N aij pt(i)          (remember, aij = P(qt+1 = sj | qt = si))

• Computation is simple. Just fill in a table with one row per timestep t = 0, 1, .., tfinal and one column per state, pt(1), pt(2), … pt(N), working in order of increasing t. The t = 0 row is all zeros except for a 1 in the start state's column.
• Cost of computing pt(i) for all states si is now O(t N^2). The stupid way was O(N^t).
• This was a simple example. It was meant to warm you up to this trick, called Dynamic Programming, because HMMs do many tricks like this. A small code sketch of the table-filling step follows below.
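Here is that table-filling computation as an added code sketch (again Python with numpy assumed; the function name state_marginals is mine, and A is the example chain's transition matrix, so this is an illustration rather than code from the slides).

    import numpy as np

    A = np.array([[0.0, 0.0, 1.0],
                  [0.5, 0.5, 0.0],
                  [1/3, 2/3, 0.0]])     # A[i, j] = a_ij = P(q_{t+1}=s_j | q_t=s_i)

    def state_marginals(A, start_state, t_final):
        """Return p[t, i] = P(q_t = s_i), filling the table row by row in O(t_final * N^2)."""
        N = A.shape[0]
        p = np.zeros((t_final + 1, N))
        p[0, start_state] = 1.0             # p_0(i) = 1 for the start state, else 0
        for t in range(t_final):
            p[t + 1] = A.T @ p[t]           # p_{t+1}(j) = sum_i a_ij * p_t(i)
        return p

    p = state_marginals(A, start_state=2, t_final=5)
    print(p[5])      # P(q_5 = s_i) for each state i

Each step is one matrix-vector product, which is where the O(t N^2) cost comes from.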
Hidden State

Recall the example question: the case we actually care about is the one where the robot has some sensors, but incomplete state information. That is the Hidden Markov Model case.

• The previous example tried to estimate P(qt = si) unconditionally (using no observed evidence).
• Suppose we can observe something that's affected by the true state.
• Example: Proximity sensors, which tell us the contents of the 8 adjacent squares (W denotes "WALL"). The true state is qt; what the robot sees is the Observation Ot.

Noisy Hidden State

• Example: Noisy proximity sensors, which unreliably tell us the contents of the 8 adjacent squares. Now the uncorrupted observation and what the robot actually sees, Ot, may differ: a wall reading can be dropped, or a spurious reading added.
• Ot is noisily determined depending on the current state. Assume that Ot is conditionally independent of {qt-1, qt-2, … q1, q0, Ot-1, Ot-2, … O1, O0} given qt. In other words:

P(Ot = X | qt = si) = P(Ot = X | qt = si, any earlier history)
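As a toy illustration of what "unreliably" might mean for such a sensor (this sketch is added here and is not part of the slides; the flip probability, the symbol set and the function name are all made up for the example), each of the 8 neighbouring-cell readings could be corrupted independently with some small probability, so that Ot depends only on the current state qt:

    import numpy as np

    rng = np.random.default_rng(0)

    def read_sensors(true_neighbourhood, flip_prob=0.1):
        """true_neighbourhood: 8 symbols from {'W', 'H', ' '} (wall, human, empty),
        one per adjacent cell.  Each reading is independently replaced by a wrong
        symbol with probability flip_prob."""
        symbols = ['W', 'H', ' ']
        noisy = []
        for cell in true_neighbourhood:
            if rng.random() < flip_prob:
                wrong = [s for s in symbols if s != cell]
                noisy.append(rng.choice(wrong))
            else:
                noisy.append(cell)
        return noisy

    print(read_sensors(['W', 'W', 'W', ' ', ' ', ' ', 'H', ' ']))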
Question: what'd be the best Bayes Net structure to represent the Joint Distribution of (q0, q1, q2, q3, q4, O0, O1, O2, O3, O4)?

Answer: the same chain over the hidden states, q0 → q1 → q2 → q3 → q4, with each state qt having its own observation child: q0 → O0, q1 → O1, q2 → O2, q3 → O3, q4 → O4.

Notation:  bi(k) = P(Ot = k | qt = si)

Each observation node has the same probability table, with one row per state and one column per observation symbol:

              k = 1      k = 2      …      k = M
   i = 1      b1(1)      b1(2)      …      b1(M)
   i = 2      b2(1)      b2(2)      …      b2(M)
     :          :          :                 :
   i = N      bN(1)      bN(2)      …      bN(M)

where entry (i, k) is P(Ot = k | qt = si).

Hidden Markov Models

Our robot with noisy sensors is a good example of an HMM.

• Question 1: State Estimation. What is P(qT = Si | O1 O2 … OT)? It will turn out that a new cute D.P. trick will get this for us.
• Question 2: Most Probable Path. Given O1 O2 … OT, what is the most probable path that I took? And what is that probability? Yet another famous D.P. trick, the VITERBI algorithm, gets this.
• Question 3: Learning HMMs. Given O1 O2 … OT, what is the maximum likelihood HMM that could have produced this string of observations? Very very useful. Uses the E.M. Algorithm.

Are H.M.M.s Useful?

You bet!!
• Robot planning + sensing when there's uncertainty (e.g. Reid Simmons / Sebastian Thrun / Sven Koenig)
• Speech Recognition/Understanding: Phones → Words, Signal → phones
• Human Genome Project: complicated stuff your lecturer knows nothing about
• Consumer decision modeling
• Economics & Finance
Plus at least 5 other things I haven't thought of.
Some Famous HMM Tasks

• Question 1: State Estimation. What is P(qT = Si | O1 O2 … OT)?
• Question 2: Most Probable Path. Given O1 O2 … OT, what is the most probable path that I took? (For example, the observations might be "Woke up at 8.35, Got on Bus at 9.46, Sat in lecture 10.05-11.22…", and we ask which sequence of hidden activities best explains them.)
• Question 3: Learning HMMs. Given O1 O2 … OT, what is the maximum likelihood HMM that could have produced this string of observations?

[The accompanying picture shows a fragment of such an HMM: hidden states such as Bus, Eat, walk connected by transition probabilities aAA, aAB, aBA, aBB, aBC, aCB, aCC, …, each state emitting the observation at its timestep with probabilities bA(Ot-1), bB(Ot), bC(Ot+1).]

Basic Operations in HMMs

For an observation sequence O = O1 … OT, the three basic HMM operations are:

  Problem                                               Algorithm           Complexity
  Evaluation:  calculating P(qt = Si | O1 O2 … Ot)      Forward-Backward    O(T N^2)
  Inference:   computing Q* = argmaxQ P(Q | O)          Viterbi Decoding    O(T N^2)
  Learning:    computing λ* = argmaxλ P(O | λ)          Baum-Welch (EM)     O(T N^2)

(T = # timesteps, N = # states.)

HMM Notation (from Rabiner's Survey*)

The states are labeled S1 S2 .. SN. For a particular trial:
• Let T be the number of observations. T is also the number of states passed through.
• O = O1 O2 .. OT is the sequence of observations.
• Q = q1 q2 .. qT is the notation for a path of states.
• λ = ⟨N, M, {πi}, {aij}, {bi(k)}⟩ is the specification of an HMM.

*L. R. Rabiner, "A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition," Proc. of the IEEE, Vol. 77, No. 2, pp. 257-286, 1989. Available from http://ieeexplore.ieee.org/iel5/5/698/00018626.pdf?arnumber=18626

HMM Formal Definition

An HMM, λ, is a 5-tuple consisting of:
• N, the number of states.
• M, the number of possible observations.
• {π1, π2, .. πN}, the starting state probabilities: P(q0 = Si) = πi. (This is new. In our previous example, the start state was deterministic.)
• The state transition probabilities P(qt+1 = Sj | qt = Si) = aij, i.e. the N×N table

      a11  a12  …  a1N
      a21  a22  …  a2N
       :    :        :
      aN1  aN2  …  aNN

• The observation probabilities P(Ot = k | qt = Si) = bi(k), i.e. the N×M table

      b1(1)  b1(2)  …  b1(M)
      b2(1)  b2(2)  …  b2(M)
        :      :          :
      bN(1)  bN(2)  …  bN(M)

Here's an HMM

Start randomly in state 1 or 2. Choose one of the output symbols in each state at random: S1 can emit X or Y, S2 can emit Z or Y, S3 can emit Z or X. The arcs S1→S2, S2→S1, S3→S1, S3→S2 and S3→S3 each carry probability 1/3; S1→S3 and S2→S3 each carry probability 2/3.

N = 3, M = 3
π1 = 1/2   π2 = 1/2   π3 = 0

a11 = 0     a12 = 1/3   a13 = 2/3
a21 = 1/3   a22 = 0     a23 = 2/3
a31 = 1/3   a32 = 1/3   a33 = 1/3

b1(X) = 1/2   b1(Y) = 1/2   b1(Z) = 0
b2(X) = 0     b2(Y) = 1/2   b2(Z) = 1/2
b3(X) = 1/2   b3(Y) = 0     b3(Z) = 1/2
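As an aside added here (not in the original slides), the λ above can be written down directly as arrays. This sketch assumes Python with numpy and encodes states S1, S2, S3 as indices 0, 1, 2 and symbols X, Y, Z as 0, 1, 2; the variable names are my own.

    import numpy as np

    N, M = 3, 3
    pi = np.array([0.5, 0.5, 0.0])          # pi_i = P(q_1 = S_i)

    A = np.array([[0.0, 1/3, 2/3],          # A[i, j] = a_ij = P(q_{t+1} = S_j | q_t = S_i)
                  [1/3, 0.0, 2/3],
                  [1/3, 1/3, 1/3]])

    B = np.array([[0.5, 0.5, 0.0],          # B[i, k] = b_i(k) = P(O_t = k | q_t = S_i)
                  [0.0, 0.5, 0.5],
                  [0.5, 0.0, 0.5]])

    # sanity check: every row of A and B is a probability distribution
    assert np.allclose(A.sum(axis=1), 1) and np.allclose(B.sum(axis=1), 1)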
Let's generate a sequence of observations. These are the random choices the HMM makes, step by step:

• t = 0: 50-50 choice between S1 and S2 for the start state.            → q0 = S1
• In S1, 50-50 choice between emitting X and Y.                         → O0 = X
• From S1, go to S3 with probability 2/3 or S2 with probability 1/3.   → q1 = S3
• In S3, 50-50 choice between emitting Z and X.                         → O1 = X
• From S3, each of the three next states is equally likely.            → q2 = S3
• In S3, 50-50 choice between emitting Z and X.                         → O2 = Z

So one possible trial is:   q0 = S1, O0 = X;   q1 = S3, O1 = X;   q2 = S3, O2 = Z.
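A small added sketch of this generation procedure (same hypothetical encoding and arrays as the earlier sketch; Python with numpy assumed, and the function name sample_hmm is mine):

    import numpy as np

    pi = np.array([0.5, 0.5, 0.0])
    A = np.array([[0.0, 1/3, 2/3], [1/3, 0.0, 2/3], [1/3, 1/3, 1/3]])
    B = np.array([[0.5, 0.5, 0.0], [0.0, 0.5, 0.5], [0.5, 0.0, 0.5]])
    symbols = ['X', 'Y', 'Z']
    rng = np.random.default_rng(1)

    def sample_hmm(T):
        """Sample a hidden path and observation sequence of length T from the HMM."""
        q = rng.choice(3, p=pi)                    # start state drawn from pi
        states, obs = [], []
        for _ in range(T):
            states.append(q)
            obs.append(symbols[rng.choice(3, p=B[q])])   # emit a symbol from the current state
            q = rng.choice(3, p=A[q])              # then move to the next state
        return states, obs

    print(sample_hmm(3))   # e.g. ([0, 2, 2], ['X', 'X', 'Z']) is one possible outcome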
State Estimation

This is what the observer has to work with: O0 = X, O1 = X, O2 = Z, but the states are hidden: q0 = ?, q1 = ?, q2 = ?.

Prob. of a series of observations

(Switching to the 1-based indexing we'll use from here on.) What is P(O) = P(O1 O2 O3) = P(O1 = X ∧ O2 = X ∧ O3 = Z)?

Slow, stupid way:

P(O) = Σ P(O ∧ Q) = Σ P(O | Q) P(Q),   the sums taken over all paths Q of length 3.

How do we compute P(Q) for an arbitrary path Q? How do we compute P(O|Q) for an arbitrary path Q?

P(Q) = P(q1, q2, q3)
     = P(q1) P(q2, q3 | q1)              (chain rule)
     = P(q1) P(q2 | q1) P(q3 | q2, q1)   (chain rule again)
     = P(q1) P(q2 | q1) P(q3 | q2)       (why?)

Example in the case Q = S1 S3 S3:   P(Q) = 1/2 × 2/3 × 1/3 = 1/9.

P(O | Q) = P(O1 O2 O3 | q1 q2 q3)
         = P(O1 | q1) P(O2 | q2) P(O3 | q3)   (why?)

Example in the case Q = S1 S3 S3:   P(O|Q) = P(X|S1) P(X|S3) P(Z|S3) = 1/2 × 1/2 × 1/2 = 1/8.

That sum, though, ranges over all N^3 = 27 paths of length 3, and in general over N^T paths. So let's be smarter…

The Prob. of a given series of observations, non-exponential-cost-style

Given observations O1 O2 … OT, define

αt(i) = P(O1 O2 … Ot ∧ qt = Si | λ)    where 1 ≤ t ≤ T.

αt(i) = Probability that, in a random trial,
• we'd have seen the first t observations, and
• we'd have ended up in Si as the t'th state visited.

In our example, what is α2(3)?

αt(i): easy to define recursively

(αt(i) could also be defined stupidly by considering all paths of length t. How?)

α1(i) = P(O1 ∧ q1 = Si) = P(q1 = Si) P(O1 | q1 = Si) = πi bi(O1)

αt+1(j) = P(O1 O2 … Ot Ot+1 ∧ qt+1 = Sj)
        = Σi=1..N P(O1 O2 … Ot ∧ qt = Si ∧ Ot+1 ∧ qt+1 = Sj)
        = Σi=1..N P(Ot+1, qt+1 = Sj | O1 O2 … Ot ∧ qt = Si) P(O1 O2 … Ot ∧ qt = Si)
        = Σi=1..N P(Ot+1, qt+1 = Sj | qt = Si) αt(i)
        = Σi=1..N P(qt+1 = Sj | qt = Si) P(Ot+1 | qt+1 = Sj) αt(i)
        = Σi=1..N aij bj(Ot+1) αt(i)

In our example

αt(i) = P(O1 O2 .. Ot ∧ qt = Si),   α1(i) = bi(O1) πi,   αt+1(j) = Σi aij bj(Ot+1) αt(i)

WE SAW O1 O2 O3 = X X Z, so:

α1(1) = 1/4    α1(2) = 0      α1(3) = 0
α2(1) = 0      α2(2) = 0      α2(3) = 1/12
α3(1) = 0      α3(2) = 1/72   α3(3) = 1/72
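Here is the α recursion in code, added as a sketch for these notes (Python/numpy, same encoding and arrays as before; the function name forward is mine). Run on O = X, X, Z it reproduces the table above, and summing the final row gives P(O1 O2 O3), which here comes out to 1/36.

    import numpy as np

    pi = np.array([0.5, 0.5, 0.0])
    A = np.array([[0.0, 1/3, 2/3], [1/3, 0.0, 2/3], [1/3, 1/3, 1/3]])
    B = np.array([[0.5, 0.5, 0.0], [0.0, 0.5, 0.5], [0.5, 0.0, 0.5]])

    def forward(obs):
        """alpha[t - 1, i] = alpha_t(i) = P(O_1 .. O_t, q_t = S_i) in the slides' 1-based notation."""
        T, N = len(obs), len(pi)
        alpha = np.zeros((T, N))
        alpha[0] = pi * B[:, obs[0]]                      # alpha_1(i) = pi_i * b_i(O_1)
        for t in range(1, T):
            alpha[t] = (alpha[t - 1] @ A) * B[:, obs[t]]  # alpha_{t+1}(j) = sum_i alpha_t(i) a_ij * b_j(O_{t+1})
        return alpha

    O = [0, 0, 2]              # X, X, Z
    alpha = forward(O)
    print(alpha)               # rows: [1/4, 0, 0], [0, 0, 1/12], [0, 1/72, 1/72]
    print(alpha[-1].sum())     # P(O_1 O_2 O_3) = 1/36

Each step is a single vector-matrix product, which is where the O(T N^2) entry in the earlier table comes from.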
Easy Question

We can cheaply compute αt(i) = P(O1 O2 … Ot ∧ qt = Si).

(How) can we cheaply compute P(O1 O2 … Ot)?                Answer:  Σi=1..N αt(i)
(How) can we cheaply compute P(qt = Si | O1 O2 … Ot)?      Answer:  αt(i) / Σj=1..N αt(j)

Most probable path given observations

What's the most probable path given O1 O2 … OT? That is, what is argmaxQ P(Q | O1 O2 … OT)?

Slow, stupid answer:

argmaxQ P(Q | O1 O2 … OT) = argmaxQ P(O1 O2 … OT | Q) P(Q) / P(O1 O2 … OT)
                          = argmaxQ P(O1 O2 … OT | Q) P(Q)

Efficient MPP computation

We're going to compute the following variables:

δt(i) = max over q1 q2 .. qt-1 of P(q1 q2 .. qt-1 ∧ qt = Si ∧ O1 .. Ot)

= the probability of the path of length t-1 with the maximum chance of doing all these things:
  …OCCURRING, and
  …ENDING UP IN STATE Si, and
  …PRODUCING OUTPUT O1 … Ot

DEFINE: mppt(i) = that path. So: δt(i) = Prob(mppt(i)).

The Viterbi Algorithm

δt(i)   = max over q1 q2 … qt-1 of P(q1 q2 … qt-1 ∧ qt = Si ∧ O1 O2 .. Ot)
mppt(i) = argmax over q1 q2 … qt-1 of P(q1 q2 … qt-1 ∧ qt = Si ∧ O1 O2 .. Ot)

δ1(i) = P(q1 = Si ∧ O1) = P(q1 = Si) P(O1 | q1 = Si) = πi bi(O1)     (only one choice at t = 1)

Now, suppose we have all the δt(i)'s and mppt(i)'s for all i. How do we get δt+1(j) and mppt+1(j)?

The most probable path whose last two states are Si Sj is the most probable path to Si, followed by the transition Si → Sj. What is the prob of that path?

δt(i) × P(Si → Sj ∧ Ot+1 | λ) = δt(i) aij bj(Ot+1)

SO the most probable path to Sj has Si* as its penultimate state, where i* = argmaxi δt(i) aij bj(Ot+1).

Summary:   δt+1(j) = δt(i*) ai*j bj(Ot+1)   and   mppt+1(j) = mppt(i*) followed by Si*,   with i* defined above.

What's Viterbi used for?

Classic Example: Speech recognition. Signal → words. The HMM observable is the signal; the hidden state is part of word formation. What is the most probable word given this signal? UTTERLY GROSS SIMPLIFICATION: in practice there are many levels of inference, not one big jump.
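A compact added sketch of the recursion just described (Python/numpy, same λ encoding as before; the function name viterbi is mine, and the back-pointer array plays the role of mppt(i)). It is an illustration under those assumptions, not code from the slides.

    import numpy as np

    pi = np.array([0.5, 0.5, 0.0])
    A = np.array([[0.0, 1/3, 2/3], [1/3, 0.0, 2/3], [1/3, 1/3, 1/3]])
    B = np.array([[0.5, 0.5, 0.0], [0.0, 0.5, 0.5], [0.5, 0.0, 0.5]])

    def viterbi(obs):
        """Return a most probable state path for obs and its joint probability with obs."""
        T, N = len(obs), len(pi)
        delta = np.zeros((T, N))                  # delta[t, j] = best score of a path ending in S_j
        back = np.zeros((T, N), dtype=int)        # back[t, j] = i*, the best predecessor of S_j
        delta[0] = pi * B[:, obs[0]]              # delta_1(i) = pi_i * b_i(O_1)
        for t in range(1, T):
            scores = delta[t - 1, :, None] * A            # scores[i, j] = delta_t(i) * a_ij
            back[t] = np.argmax(scores, axis=0)           # i* for each destination j
            delta[t] = scores[back[t], range(N)] * B[:, obs[t]]
        # follow the back-pointers from the best final state
        path = [int(np.argmax(delta[-1]))]
        for t in range(T - 1, 0, -1):
            path.append(int(back[t, path[-1]]))
        path.reverse()
        return path, delta[-1].max()

    print(viterbi([0, 0, 2]))   # a most probable path for X, X, Z (states are 0-indexed)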
HMMs are used and useful. But how do you design an HMM?

Occasionally (e.g. in our robot example) it is reasonable to deduce the HMM from first principles. But usually, especially in Speech or Genetics, it is better to infer it from large amounts of data: O1 O2 .. OT with a big "T".

Inferring an HMM

Remember, we've been doing things like P(O1 O2 .. OT | λ). That "λ" is the notation for our HMM parameters. Now we have some observations and we want to estimate λ from them.

AS USUAL, we could use:
(i) MAX LIKELIHOOD:  λ = argmaxλ P(O1 .. OT | λ)
(ii) BAYES: work out P(λ | O1 .. OT), and then take E[λ] or maxλ P(λ | O1 .. OT)

Max likelihood HMM estimation

Define
γt(i)   = P(qt = Si | O1 O2 … OT, λ)
εt(i,j) = P(qt = Si ∧ qt+1 = Sj | O1 O2 … OT, λ)

γt(i) and εt(i,j) can be computed efficiently for all i, j, t (details in the Rabiner paper). Then

Σt=1..T-1 γt(i)    = expected number of transitions out of state i during the path
Σt=1..T-1 εt(i,j)  = expected number of transitions from state i to state j during the path

Notice that

  Σt=1..T-1 εt(i,j) / Σt=1..T-1 γt(i)  =  (expected frequency i → j) / (expected frequency i → anything)
                                       =  estimate of Prob(next state Sj | this state Si)

So we can re-estimate   aij ← Σt εt(i,j) / Σt γt(i).   We can also re-estimate bj(Ok) in the same spirit (see Rabiner).

Where do the expected counts come from? We want

aij_new = new estimate of P(qt+1 = sj | qt = si)
        = (Expected # transitions i → j | λold, O1 O2 … OT) / Σk=1..N (Expected # transitions i → k | λold, O1 O2 … OT)
        = Σt P(qt+1 = sj, qt = si | λold, O1 O2 … OT)  /  Σk=1..N Σt P(qt+1 = sk, qt = si | λold, O1 O2 … OT)
        = Sij / Σk=1..N Sik

where   Sij = Σt P(qt+1 = sj, qt = si, O1 … OT | λold)

(the common factor 1/P(O1 … OT | λold) cancels between numerator and denominator), and this joint probability factors into quantities we can already compute:

Sij = Σt αt(i) aij bj(Ot+1) βt+1(j)

where βt+1(j) = P(Ot+2 … OT | qt+1 = Sj, λold) is the "backward" variable, computed by a recursion symmetric to the α recursion (see Rabiner).

EM for HMMs

If we knew λ we could estimate EXPECTATIONS of quantities such as
• expected number of times in state i
• expected number of transitions i → j

If we knew the quantities such as
• expected number of times in state i
• expected number of transitions i → j
we could compute the MAX LIKELIHOOD estimate of λ = ⟨{aij}, {bi(k)}, {πi}⟩.

Roll on the EM Algorithm…

EM 4 HMMs

1. Get your observations O1 … OT.
2. Guess your first λ estimate λ(0); set k = 0.
3. k = k + 1.
4. Given O1 … OT and λ(k), compute γt(i) and εt(i,j) for 1 ≤ t ≤ T, 1 ≤ i ≤ N, 1 ≤ j ≤ N.
5. Compute the expected frequency of state i, and the expected frequency i → j.
6. Compute new estimates of aij, bj(k), πi accordingly. Call them λ(k+1).
7. Go to 3, unless converged.

• This is also known (for the HMM case) as the BAUM-WELCH algorithm. A sketch of one iteration in code follows below.
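A minimal sketch of one Baum-Welch iteration (steps 4-6 above), added here for illustration and not taken from the slides. It assumes Python with numpy, a single observation sequence, and no numerical scaling or guards against states that receive zero posterior mass; all names are my own.

    import numpy as np

    def forward_backward(obs, pi, A, B):
        T, N = len(obs), len(pi)
        alpha = np.zeros((T, N)); beta = np.zeros((T, N))
        alpha[0] = pi * B[:, obs[0]]
        for t in range(1, T):
            alpha[t] = (alpha[t - 1] @ A) * B[:, obs[t]]
        beta[-1] = 1.0
        for t in range(T - 2, -1, -1):
            beta[t] = A @ (B[:, obs[t + 1]] * beta[t + 1])
        return alpha, beta

    def baum_welch_step(obs, pi, A, B):
        """One EM update of (pi, A, B) from a single observation sequence."""
        T, N = len(obs), len(pi)
        M = B.shape[1]
        alpha, beta = forward_backward(obs, pi, A, B)
        prob_O = alpha[-1].sum()                              # P(O | lambda_old)
        gamma = alpha * beta / prob_O                         # gamma[t, i] = P(q_t = S_i | O, lambda_old)
        xi = np.zeros((T - 1, N, N))                          # xi[t, i, j] = P(q_t = S_i, q_{t+1} = S_j | O, lambda_old)
        for t in range(T - 1):
            xi[t] = alpha[t][:, None] * A * (B[:, obs[t + 1]] * beta[t + 1])[None, :] / prob_O
        # Re-estimation: the ratios of expected counts from the slides
        new_pi = gamma[0]
        new_A = xi.sum(axis=0) / gamma[:-1].sum(axis=0)[:, None]
        new_B = np.zeros_like(B)
        for k in range(M):
            new_B[:, k] = gamma[np.array(obs) == k].sum(axis=0) / gamma.sum(axis=0)
        return new_pi, new_A, new_B

Iterating this step until the likelihood stops improving is the Baum-Welch loop; practical implementations also rescale α and β at each timestep to avoid underflow and smooth away zero counts, as discussed in the Rabiner paper.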
Bad News
• There are lots of local minima.

Good News
• The local minima are usually adequate models of the data.

Notice
• EM does not estimate the number of states. That must be given. There is a trade-off between too few states (inadequately modeling the structure in the data) and too many (fitting the noise), so #states acts as a regularization parameter: blah blah blah… bias variance tradeoff… blah blah… cross-validation… blah blah… AIC, BIC… blah blah (same ol' same ol').
• Often, HMMs are forced to have some links with zero probability. This is done by setting aij = 0 in the initial estimate λ(0).
• Easy extension of everything seen today: HMMs with real-valued outputs.

What You Should Know
• What is an HMM?
• Computing (and defining) αt(i).
• The Viterbi algorithm.
• Outline of the EM algorithm.
• To be very happy with the kind of maths and analysis needed for HMMs.
• Fairly thorough reading of Rabiner* up to page 266 [up to but not including "IV. Types of HMMs"]. (DON'T PANIC: the paper starts on p. 257.)

*L. R. Rabiner, "A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition," Proc. of the IEEE, Vol. 77, No. 2, pp. 257-286, 1989. http://ieeexplore.ieee.org/iel5/5/698/00018626.pdf?arnumber=18626