Markov Chains

Hidden Markov Models

Review
- A Markov chain can solve the CpG island finding problem: a positive model and a negative model.
- What about the length of an island? Solution: use a combined model.

Hidden Markov Models
The essential difference between a Markov chain and a hidden Markov model is that in a hidden Markov model there is no one-to-one correspondence between the states and the symbols (hence "hidden"). It is no longer possible to tell what state the model was in when x_i was generated just by looking at x_i. In the previous example, there is no way to tell by looking at a single symbol C in isolation whether it was emitted by state C+ or C-. Several states can emit the same letter, and a single state can emit several letters. We now have to distinguish the sequence of states from the sequence of symbols.

Hidden Markov Models
- States are decoupled from the observable symbols (A, C, G, T).
- Observed sequence: X = x1, x2, ..., xn
- Path of states: π = π1, π2, ..., πn
- Transition probabilities: a_kl = P(π_i = l | π_{i-1} = k)
- Emission probabilities: e_k(b) = P(x_i = b | π_i = k)

Hidden Markov Models
We can think of an HMM as a generative model, one that generates or emits sequences. First a state π1 is selected (either randomly or according to some prior probabilities), then symbol x1 is emitted at state π1 with probability e_{π1}(x1). The model then transits to state π2 with probability a_{π1π2}, emits x2, and so on.

Hidden Markov Models
X:  G C A T A G C G G C T A G C T G A A T A G G A ...
π:  G+ C+ A+ T+ A+ G+ C+ G+ G+ C+ T+ A+ G+ C+ T- G- A- A- T- A- G- G- A- ...
1. Now it is the path of hidden states that we want to find out.
2. Many paths can be used to generate X; we want to find the most likely one. There are several ways to do this (a brute-force method, dynamic programming); we will talk about them later.

The occasionally dishonest casino
A casino uses a fair die most of the time, but occasionally switches to a loaded one.
- Fair die: Prob(1) = Prob(2) = ... = Prob(6) = 1/6
- Loaded die: Prob(1) = Prob(2) = ... = Prob(5) = 1/10, Prob(6) = 1/2
These are the emission probabilities at the two states, fair and loaded.
Transition probabilities:
- Prob(Fair -> Loaded) = 0.01
- Prob(Loaded -> Fair) = 0.2
Transitions between states obey a Markov process.

A HMM for the occasionally dishonest casino
[Figure: state diagram of the two-state casino HMM]

The occasionally dishonest casino
The casino won't tell you when they use the fair or the loaded die.
Known (observable): the series of die tosses, e.g. 3415256664666153..., the structure of the model, and the transition probabilities.
Hidden: what the casino did, e.g. FFFFFLLLLLLLFFFF...
What we must infer: when was the fair die used, and when was the loaded die used? The answer is a sequence of states, e.g. FFFFFFFLLLLLLFFF...
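As a concrete illustration of the generative view, here is a minimal Python sketch (not part of the original slides) that samples rolls and states from the casino HMM. The dictionary and function names are illustrative; the 0.5/0.5 start probabilities a_0F = a_0L = 0.5 are taken from the worked example later in these notes.

import random

# Casino HMM from the slides: fair (F) and loaded (L) dice.
# Start probabilities a_0F = a_0L = 0.5, transitions P(F->L) = 0.01 and
# P(L->F) = 0.2, and the stated emission probabilities.
START = {"F": 0.5, "L": 0.5}
TRANS = {"F": {"F": 0.99, "L": 0.01},
         "L": {"F": 0.20, "L": 0.80}}
EMIT  = {"F": {r: 1 / 6 for r in range(1, 7)},
         "L": {**{r: 0.1 for r in range(1, 6)}, 6: 0.5}}

def generate(length, seed=None):
    """Sample (rolls, states): pick pi_1 from the start distribution, emit x_i
    from e_{pi_i}, then move to the next state with probability a_{pi_i, pi_{i+1}}."""
    rng = random.Random(seed)
    rolls, states = [], []
    state = rng.choices(list(START), weights=list(START.values()))[0]
    for _ in range(length):
        states.append(state)
        rolls.append(rng.choices(list(EMIT[state]), weights=list(EMIT[state].values()))[0])
        state = rng.choices(list(TRANS[state]), weights=list(TRANS[state].values()))[0]
    return rolls, states

rolls, states = generate(30, seed=1)
print("rolls :", "".join(str(r) for r in rolls))
print("states:", "".join(states))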
Making the inference
The model assigns a probability to each explanation of the observation:
P(326 | FFL) = P(3|F) · P(F->F) · P(2|F) · P(F->L) · P(6|L) = 1/6 · 0.99 · 1/6 · 0.01 · 1/2

Notation
- x is the sequence of symbols emitted by the model; x_i is the symbol emitted at time i.
- A path, π, is a sequence of states; the i-th state in π is π_i.
- a_kr is the probability of making a transition from state k to state r: a_kr = Pr(π_i = r | π_{i-1} = k)
- e_k(b) is the probability that symbol b is emitted when in state k: e_k(b) = Pr(x_i = b | π_i = k)

A path of a sequence
[Figure: trellis of states 0, 1, 2, ..., K against positions x1, x2, x3, ..., xL]
Pr(x, π) = a_{0,π1} · Π_{i=1..L} e_{πi}(x_i) · a_{πi,πi+1}, where π_{L+1} = 0 denotes the end state.

The occasionally dishonest casino
x = x1, x2, x3 = 6, 2, 6
π(1) = FFF:  Pr(x, π(1)) = a_0F e_F(6) a_FF e_F(2) a_FF e_F(6) = 0.5 · 1/6 · 0.99 · 1/6 · 0.99 · 1/6 ≈ 0.00227
π(2) = LLL:  Pr(x, π(2)) = a_0L e_L(6) a_LL e_L(2) a_LL e_L(6) = 0.5 · 0.5 · 0.8 · 0.1 · 0.8 · 0.5 = 0.008
π(3) = LFL:  Pr(x, π(3)) = a_0L e_L(6) a_LF e_F(2) a_FL e_L(6) = 0.5 · 0.5 · 0.2 · 1/6 · 0.01 · 0.5 ≈ 0.0000417

The most probable path
The most likely path π* satisfies π* = argmax_π Pr(x, π).
To find π*, consider all possible ways the last symbol of x could have been emitted. Let
v_k(i) = probability of the most likely path π1, ..., πi emitting x1, ..., xi such that πi = k.
Then v_k(i) = e_k(x_i) · max_r { v_r(i-1) a_rk }.

The Viterbi Algorithm
The Viterbi algorithm is a dynamic programming algorithm for finding the most likely sequence of hidden states, called the Viterbi path, that results in a sequence of observed symbols. Assumptions:
1. Both the observed symbols and the hidden states form sequences.
2. The two sequences are aligned: each observed symbol corresponds to exactly one hidden state.
3. Computing the most likely sequence of hidden states (path) up to a point t depends only on the observed symbol at point t and the most likely path up to point t - 1.
These assumptions are all satisfied in a first-order hidden Markov model.

The Viterbi Algorithm
Initialization (i = 0): v_0(0) = 1, v_k(0) = 0 for k > 0
Recursion (i = 1, ..., L): for each state l,
  v_l(i) = e_l(x_i) · max_k { v_k(i-1) a_kl }
  ptr_i(l) = argmax_k { v_k(i-1) a_kl }
Termination:
  P(x, π*) = max_k { v_k(L) a_k0 }
  π*_L = argmax_k { v_k(L) a_k0 }
To find π*, use the traceback pointers (i = L, ..., 1), as in dynamic programming.

Viterbi: Example (x = 6, 2, 6)
Start:          v_B(0) = 1,  v_F(0) = 0,  v_L(0) = 0
i = 1 (x1 = 6): v_F(1) = (1/2)(1/6) = 1/12;   v_L(1) = (1/2)(1/2) = 1/4
i = 2 (x2 = 2): v_F(2) = (1/6) · max{(1/12)(0.99), (1/4)(0.2)} = 0.01375;   v_L(2) = (1/10) · max{(1/12)(0.01), (1/4)(0.8)} = 0.02
i = 3 (x3 = 6): v_F(3) = (1/6) · max{(0.01375)(0.99), (0.02)(0.2)} = 0.00226875;   v_L(3) = (1/2) · max{(0.01375)(0.01), (0.02)(0.8)} = 0.008
(using v_k(i) = e_k(x_i) · max_r { v_r(i-1) a_rk }; B is the begin state, with a_BF = a_BL = 1/2)

Viterbi gets it right more often than not
[Figure: Viterbi-decoded states compared with the die actually used]

Total probability
Many different paths can result in observation x. The probability that our model will emit x is the total probability:
Pr(x) = Σ_π Pr(x, π)
If the HMM models a family of objects, we want the total probability to peak at members of the family (training).

Total probability
Pr(x) can be computed in the same way as the probability of the most likely path. Let
f_k(i) = probability of observing x1, ..., xi such that πi = k.
Then f_k(i) = e_k(x_i) Σ_r f_r(i-1) a_rk, and Pr(x) = Σ_k f_k(L) a_k0.

The Forward Algorithm
Initialization (i = 0): f_0(0) = 1, f_k(0) = 0 for k > 0
Recursion (i = 1, ..., L): for each state k,
  f_k(i) = e_k(x_i) Σ_r f_r(i-1) a_rk
Termination:
  Pr(x) = Σ_k f_k(L) a_k0
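The following Python sketch (not from the slides) implements the Viterbi recursion and traceback above for the casino model. Variable names are illustrative, and the end transitions a_k0 are taken as 1, since the worked example does not apply an explicit end state.

# Casino model parameters (as above).
START = {"F": 0.5, "L": 0.5}
TRANS = {"F": {"F": 0.99, "L": 0.01}, "L": {"F": 0.2, "L": 0.8}}
EMIT  = {"F": {r: 1 / 6 for r in range(1, 7)},
         "L": {**{r: 0.1 for r in range(1, 6)}, 6: 0.5}}

def viterbi(x, states, start, trans, emit, end=None):
    """Most probable state path via v_l(i) = e_l(x_i) * max_k v_k(i-1) a_kl,
    with traceback pointers ptr_i(l) = argmax_k v_k(i-1) a_kl.
    end[k] plays the role of a_k0; it defaults to 1 (no modelled end state)."""
    end = end or {k: 1.0 for k in states}
    v = [{k: start[k] * emit[k][x[0]] for k in states}]   # v_k(1)
    ptr = []
    for i in range(1, len(x)):
        col, back = {}, {}
        for l in states:
            best = max(states, key=lambda k: v[i - 1][k] * trans[k][l])
            back[l] = best
            col[l] = emit[l][x[i]] * v[i - 1][best] * trans[best][l]
        v.append(col)
        ptr.append(back)
    # Termination: P(x, pi*) = max_k v_k(L) a_k0, then trace back i = L, ..., 1.
    last = max(states, key=lambda k: v[-1][k] * end[k])
    path = [last]
    for back in reversed(ptr):
        path.append(back[path[-1]])
    return list(reversed(path)), v[-1][last] * end[last]

# Reproduces the example table for x = 6, 2, 6: best path L L L with probability 0.008.
print(viterbi([6, 2, 6], ["F", "L"], START, TRANS, EMIT))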
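A similar sketch of the forward algorithm for the casino model (again, not from the slides): the recursion has the same shape as the Viterbi recursion, with the maximum over predecessor states replaced by a sum, and a_k0 is again treated as 1.

# Casino model parameters (as above).
START = {"F": 0.5, "L": 0.5}
TRANS = {"F": {"F": 0.99, "L": 0.01}, "L": {"F": 0.2, "L": 0.8}}
EMIT  = {"F": {r: 1 / 6 for r in range(1, 7)},
         "L": {**{r: 0.1 for r in range(1, 6)}, 6: 0.5}}

def forward(x, states, start, trans, emit, end=None):
    """Total probability Pr(x) = sum over all paths of Pr(x, pi), computed by
    f_k(i) = e_k(x_i) * sum_r f_r(i-1) a_rk; end[k] stands in for a_k0 (default 1)."""
    end = end or {k: 1.0 for k in states}
    f = {k: start[k] * emit[k][x[0]] for k in states}      # f_k(1)
    for xi in x[1:]:
        f = {k: emit[k][xi] * sum(f[r] * trans[r][k] for r in states)
             for k in states}
    return sum(f[k] * end[k] for k in states)              # Pr(x) = sum_k f_k(L) a_k0

# Sums the contributions of all 2^3 = 8 state paths for x = 6, 2, 6,
# including the three paths enumerated in the earlier example.
print(forward([6, 2, 6], ["F", "L"], START, TRANS, EMIT))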
Hidden Markov Models
Decoding:
- Viterbi (maximum likelihood): determine which explanation is most likely; find the path most likely to have produced the observed sequence.
- Forward (total probability): determine the probability that the observed sequence was produced by the HMM; consider all paths that could have produced the observed sequence.
- Forward and backward: the probability that x_i came from state k given the observed sequence, i.e. P(π_i = k | x).

The Backward Algorithm
Pr(x) can also be computed in the same way, working backwards from the end of the sequence. Let
b_k(i) = probability of observing x_{i+1}, ..., x_L given that πi = k.
Then, for i = L-1, ..., 1,
  b_k(i) = Σ_l a_kl e_l(x_{i+1}) b_l(i+1)
and Pr(x) = Σ_l a_0l e_l(x1) b_l(1).

The Backward Algorithm
Initialization (i = L): b_k(L) = a_k0 for all k
Recursion (i = L-1, ..., 1): for each state k,
  b_k(i) = Σ_l a_kl e_l(x_{i+1}) b_l(i+1)
Termination:
  Pr(x) = Σ_l a_0l e_l(x1) b_l(1)

Posterior state probabilities
The probability that x_i came from state k given the observed sequence, i.e. P(π_i = k | x):
P(x, π_i = k) = P(x1 ... xi, π_i = k) P(x_{i+1} ... x_L | x1 ... xi, π_i = k)
            = P(x1 ... xi, π_i = k) P(x_{i+1} ... x_L | π_i = k)
            = f_k(i) b_k(i)
so P(π_i = k | x) = f_k(i) b_k(i) / P(x).
Posterior decoding: assign x_i the state k that maximizes P(π_i = k | x) = f_k(i) b_k(i) / P(x).

Estimating the probabilities ("training")
Baum-Welch algorithm:
- Start with an initial guess at the transition probabilities.
- Refine the guess to improve the total probability of the training data in each step.
- May get stuck at a local optimum.
- A special case of the expectation-maximization (EM) algorithm.
Viterbi training:
- Derive probable paths for the training data using the Viterbi algorithm.
- Re-estimate transition probabilities based on the Viterbi paths.
- Iterate until the paths stop changing.
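To make the backward recursion and posterior decoding above concrete, here is a Python sketch for the casino model (not from the slides). It takes a_k0 = 1 in the backward initialization b_k(L) = a_k0, i.e. no explicit end state, and the helper names are illustrative.

# Casino model parameters (as above).
START = {"F": 0.5, "L": 0.5}
TRANS = {"F": {"F": 0.99, "L": 0.01}, "L": {"F": 0.2, "L": 0.8}}
EMIT  = {"F": {r: 1 / 6 for r in range(1, 7)},
         "L": {**{r: 0.1 for r in range(1, 6)}, 6: 0.5}}

def forward_cols(x, states, start, trans, emit):
    """All forward columns f_k(i), i = 1..L (stored 0-based)."""
    cols = [{k: start[k] * emit[k][x[0]] for k in states}]
    for xi in x[1:]:
        prev = cols[-1]
        cols.append({k: emit[k][xi] * sum(prev[r] * trans[r][k] for r in states)
                     for k in states})
    return cols

def backward_cols(x, states, trans, emit, end=None):
    """All backward columns b_k(i); b_k(L) = a_k0, taken as 1 here."""
    end = end or {k: 1.0 for k in states}
    cols = [dict(end)]
    for xi in reversed(x[1:]):                 # xi is x_{i+1} for i = L-1, ..., 1
        nxt = cols[0]
        cols.insert(0, {k: sum(trans[k][l] * emit[l][xi] * nxt[l] for l in states)
                        for k in states})
    return cols

def posterior_decode(x, states, start, trans, emit):
    """Assign each x_i the state k maximizing P(pi_i = k | x) = f_k(i) b_k(i) / P(x)."""
    f = forward_cols(x, states, start, trans, emit)
    b = backward_cols(x, states, trans, emit)
    px = sum(f[-1][k] * b[-1][k] for k in states)   # Pr(x), since b_k(L) = a_k0
    return [max(states, key=lambda k: f[i][k] * b[i][k] / px) for i in range(len(x))]

print(posterior_decode([6, 2, 6], ["F", "L"], START, TRANS, EMIT))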