Alpaydın slides with several modifications and additions by Christoph Eick.

Introduction
Modeling dependencies in the input; observations are no longer iid; e.g., the order of observations in a dataset matters.
Temporal sequences: in speech, phonemes in a word (dictionary), words in a sentence (syntax, semantics of the language); the stock market (stock values over time).
Spatial sequences: base pairs in DNA sequences.
Lecture Notes for E Alpaydın 2010 Introduction to Machine Learning 2e © The MIT Press (V1.0)

Discrete Markov Process
N states: S1, S2, ..., SN
First-order Markov: the state at "time" t is qt = Si, and
P(qt+1=Sj | qt=Si, qt-1=Sk, ...) = P(qt+1=Sj | qt=Si)
Transition probabilities: aij ≡ P(qt+1=Sj | qt=Si), with aij ≥ 0 and Σj=1..N aij = 1
Initial probabilities: πi ≡ P(q1=Si), with Σi=1..N πi = 1

Stochastic Automaton / Markov Chain
P(O=Q | A, Π) = P(q1) Πt=2..T P(qt | qt-1) = πq1 aq1q2 ··· aqT-1qT

Example: Balls and Urns
Three urns, each full of balls of one color: S1 = blue, S2 = red, S3 = green
Π = [0.5, 0.2, 0.3]T
A = | 0.4 0.3 0.3 |
    | 0.2 0.6 0.2 |
    | 0.1 0.1 0.8 |
O = {S1, S1, S3, S3}
P(O | A, Π) = P(S1) P(S1|S1) P(S3|S1) P(S3|S3) = π1 a11 a13 a33 = 0.5 · 0.4 · 0.3 · 0.8 = 0.048

Balls and Urns: Learning
Given K example sequences of length T:
π̂i = #{sequences starting with Si} / K
âij = #{transitions from Si to Sj} / #{transitions from Si}
    = Σk Σt=1..T-1 1(qt^(k)=Si and qt+1^(k)=Sj) / Σk Σt=1..T-1 1(qt^(k)=Si)
Remark: extract the probabilities from the observed sequences, e.g.
s1-s2-s1-s3
s2-s1-s1-s2
s2-s3-s2-s1
π1 = 1/3, π2 = 2/3, a11 = 1/4, a12 = 1/2, a13 = 1/4, a21 = 3/4, ...

http://en.wikipedia.org/wiki/Hidden_Markov_model
Hidden Markov Models
States are not observable.
Discrete observations {v1, v2, ..., vM} are recorded; each is a probabilistic function of the state.
Emission probabilities: bj(m) ≡ P(Ot=vm | qt=Sj)
Example: in each urn there are balls of different colors, but with different probabilities.
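The Markov-chain calculations above (the probability of an observed state sequence, and estimating Π and A by counting starts and transitions) can be sketched in Python. This is a minimal sketch assuming NumPy and 0-based state indices; the numbers come from the balls-and-urns example.

```python
import numpy as np

# Markov chain from the balls-and-urns example: S1 = blue, S2 = red, S3 = green
pi = np.array([0.5, 0.2, 0.3])
A = np.array([[0.4, 0.3, 0.3],
              [0.2, 0.6, 0.2],
              [0.1, 0.1, 0.8]])

def sequence_probability(states, A, pi):
    """P(O | A, pi) = pi[q1] * prod over t of a[q_t, q_t+1]."""
    p = pi[states[0]]
    for s, s_next in zip(states, states[1:]):
        p *= A[s, s_next]
    return p

# O = {S1, S1, S3, S3} -> indices 0, 0, 2, 2; pi_1 * a11 * a13 * a33 = 0.048
p = sequence_probability([0, 0, 2, 2], A, pi)

def estimate_parameters(sequences, N):
    """ML estimates of pi and A by counting (assumes every state occurs
    at least once as a transition source, so no row of counts is all zero)."""
    pi_hat = np.zeros(N)
    counts = np.zeros((N, N))
    for seq in sequences:
        pi_hat[seq[0]] += 1          # count starting states
        for s, s_next in zip(seq, seq[1:]):
            counts[s, s_next] += 1   # count transitions
    pi_hat /= len(sequences)
    A_hat = counts / counts.sum(axis=1, keepdims=True)
    return pi_hat, A_hat

# The three training sequences from the learning slide (0-based indices)
seqs = [[0, 1, 0, 2], [1, 0, 0, 1], [1, 2, 1, 0]]
pi_hat, A_hat = estimate_parameters(seqs, 3)
```

Counting by hand confirms the estimates: four transitions leave s1 (one to s1, two to s2, one to s3), giving a11 = 1/4, a12 = 1/2, a13 = 1/4.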
For each observation sequence, there are multiple possible state sequences.
http://a-little-book-of-r-for-bioinformatics.readthedocs.org/en/latest/src/chapter10.html

HMM Unfolded in Time

Now a More Complicated Problem
(Figure: three urns, labeled 1-3, and a sequence of drawn balls.)
1. Markov chains: we observe the urn sequence itself, e.g. 1-1-2-2 (somewhat trivial, as the states are observable!).
2. Hidden Markov models: which urn sequence created the observations? E.g. (1 or 2)-(1 or 2)-(2 or 3)-(2 or 3), and the potential sequences have different probabilities; e.g., drawing a blue ball from urn 1 is more likely than from urn 2!

Another Motivating Example

Elements of an HMM
N: number of states
M: number of observation symbols
A = [aij]: N×N state transition probability matrix
B = [bj(m)]: N×M observation (emission) probability matrix
Π = [πi]: N×1 initial state probability vector
λ = (A, B, Π): parameter set of the HMM

Three Basic Problems of HMMs
1. Evaluation: given λ and a sequence O, calculate P(O | λ).
2. Most likely state sequence: given λ and a sequence O, find the state sequence Q* such that P(Q* | O, λ) = maxQ P(Q | O, λ).
3. Learning: given a set of sequences O = {O1, ..., OK}, find the λ* that is the most likely explanation of the sequences in O.
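The elements listed above can be written down concretely for the urn HMM. This is an illustrative sketch assuming NumPy; the emission matrix B below is an assumed example, not a value from the slides.

```python
import numpy as np

# lambda = (A, B, Pi) for a hypothetical 3-urn HMM: hidden states are urns,
# observed symbols are ball colors (blue, red, green). Values are illustrative.
N, M = 3, 3                       # number of states, number of observation symbols
pi = np.array([0.5, 0.2, 0.3])    # initial state probabilities (N x 1)
A = np.array([[0.4, 0.3, 0.3],    # transition matrix (N x N), rows sum to 1
              [0.2, 0.6, 0.2],
              [0.1, 0.1, 0.8]])
B = np.array([[0.8, 0.1, 0.1],    # emission matrix (N x M): B[j, m] = b_j(m)
              [0.1, 0.8, 0.1],
              [0.2, 0.2, 0.6]])

def sample(A, B, pi, T, rng):
    """Generate hidden states and observations: observations are a
    probabilistic function of the (unobserved) state sequence."""
    q = rng.choice(N, p=pi)
    states, obs = [], []
    for _ in range(T):
        states.append(int(q))
        obs.append(int(rng.choice(M, p=B[q])))
        q = rng.choice(N, p=A[q])
    return states, obs
```

Sampling a few sequences from this model makes the hidden/observed distinction tangible: only the color list would be recorded, the urn list stays hidden.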
P(O | λ*) = maxλ Πk P(Ok | λ)
(Rabiner, 1989)

Evaluation
Forward variable: αt(i) ≡ P(O1···Ot, qt=Si | λ), the probability of observing O1-...-Ot and additionally being in state i at step t.
Initialization: α1(i) = πi bi(O1)
Recursion: αt+1(j) = [Σi=1..N αt(i) aij] bj(Ot+1)
Using α, the probability of the observed sequence can be computed as:
P(O | λ) = Σi=1..N αT(i)
Complexity: O(N2·T)

Backward variable: βt(i) ≡ P(Ot+1···OT | qt=Si, λ), the probability of observing Ot+1-...-OT given that we are in state i at step t.
Initialization: βT(i) = 1
Recursion: βt(i) = Σj=1..N aij bj(Ot+1) βt+1(j)

Finding the Most Likely State Sequence
γt(i) ≡ P(qt=Si | O, λ) = αt(i) βt(i) / Σj=1..N αt(j) βt(j)
γt(i) is the probability of being in state i at step t, given the whole observation sequence O1...Ot Ot+1...OT.
For each time step, choose the state that has the highest probability: qt* = arg maxi γt(i)

Only briefly discussed in 2014!
Viterbi's Algorithm
δt(i) ≡ maxq1q2···qt-1 p(q1q2···qt-1, qt=Si, O1···Ot | λ)
Initialization: δ1(i) = πi bi(O1), ψ1(i) = 0
Recursion: δt(j) = [maxi δt-1(i) aij] bj(Ot), ψt(j) = argmaxi δt-1(i) aij
Termination: p* = maxi δT(i), qT* = argmaxi δT(i)
Path backtracking: qt* = ψt+1(qt+1*), t = T-1, T-2, ..., 1
Idea: combines path probability computations with backtracking over competing paths.

Baum-Welch Algorithm
Learning a model from sequences: the observed symbol sequences O = {O1, ..., OK} are the input, the output is a model λ = (A, B, Π); the state sequences remain hidden. An EM-style algorithm is used!
E-step: ξt(i,j) is a hidden (latent) variable, measuring the probability of going from state i at step t to state j at step t+1 while observing Ot+1, given a model λ and an observed sequence Ok.
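Before turning to the learning formulas, the evaluation (forward), backward, and Viterbi recursions above can be sketched in code. This is a minimal NumPy sketch; the model numbers are illustrative assumptions, with B chosen so each urn favors one color.

```python
import numpy as np

pi = np.array([0.5, 0.2, 0.3])
A = np.array([[0.4, 0.3, 0.3],
              [0.2, 0.6, 0.2],
              [0.1, 0.1, 0.8]])
B = np.array([[0.8, 0.1, 0.1],      # illustrative emission probabilities
              [0.1, 0.8, 0.1],
              [0.2, 0.2, 0.6]])

def forward(obs, A, B, pi):
    """alpha[t, i] = P(O_1..O_t, q_t = S_i | lambda); cost O(N^2 * T)."""
    T, N = len(obs), len(pi)
    alpha = np.zeros((T, N))
    alpha[0] = pi * B[:, obs[0]]
    for t in range(1, T):
        alpha[t] = (alpha[t - 1] @ A) * B[:, obs[t]]
    return alpha

def backward(obs, A, B):
    """beta[t, i] = P(O_t+1..O_T | q_t = S_i, lambda); beta[T-1] = 1."""
    T, N = len(obs), A.shape[0]
    beta = np.ones((T, N))
    for t in range(T - 2, -1, -1):
        beta[t] = A @ (B[:, obs[t + 1]] * beta[t + 1])
    return beta

def viterbi(obs, A, B, pi):
    """Most likely state sequence: max-product recursion plus backtracking."""
    T, N = len(obs), len(pi)
    delta = np.zeros((T, N))
    psi = np.zeros((T, N), dtype=int)
    delta[0] = pi * B[:, obs[0]]
    for t in range(1, T):
        scores = delta[t - 1][:, None] * A       # scores[i, j] = delta_{t-1}(i) a_ij
        psi[t] = scores.argmax(axis=0)           # best predecessor for each j
        delta[t] = scores.max(axis=0) * B[:, obs[t]]
    path = [int(delta[-1].argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(psi[t, path[-1]]))
    return path[::-1]

obs = [0, 0, 2, 2]                  # observed colors: blue, blue, green, green
alpha = forward(obs, A, B, pi)
beta = backward(obs, A, B)
p_obs = alpha[-1].sum()             # P(O | lambda) = sum_i alpha_T(i)
gamma = alpha * beta / p_obs        # gamma[t, i] = P(q_t = S_i | O, lambda)
```

A useful sanity check is that the forward and backward passes agree on P(O | λ), and that each row of γ sums to 1.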
ξt(i,j) ≡ P(qt=Si, qt+1=Sj | O, λ) = αt(i) aij bj(Ot+1) βt+1(j) / Σk Σl αt(k) akl bl(Ot+1) βt+1(l)
γt(i) = Σj=1..N ξt(i,j)
γt(i) is a hidden (latent) variable, measuring the probability of being in state i at step t, given a model λ and an observed sequence Ok.

Baum-Welch Algorithm: M-Step
π̂i = Σk=1..K γ1^(k)(i) / K
âij = Σk Σt=1..T-1 ξt^(k)(i,j) / Σk Σt=1..T-1 γt^(k)(i)   (probability of going from i to j / probability of being in i)
b̂j(m) = Σk Σt=1..T γt^(k)(j) 1(Ot^(k)=vm) / Σk Σt=1..T γt^(k)(j)
Remark: k iterates over the observed sequences O1, ..., OK; for each individual sequence Ok, γ^(k) and ξ^(k) are computed in the E-step; then the actual model is computed in the M-step by averaging the estimates of πi, aij, bj(m) (based on γ^(k) and ξ^(k)) over the K observed sequences.

Baum-Welch Algorithm: Summary
Estimate an initial model λ = (A, B, Π)
REPEAT
  E-step: estimate γt(i) and ξt(i,j) based on the model λ = (A, B, Π) and O
  M-step: re-estimate λ = (A, B, Π) based on γt(i) and ξt(i,j)
UNTIL CONVERGENCE
For more discussion see: http://www.robots.ox.ac.uk/~vgg/rg/slides/hmm.pdf
See also: http://www.digplanet.com/wiki/Baum%E2%80%93Welch_algorithm

Generalization of HMM: Continuous Observations
The observations generated at each time step are vectors of k numbers; a k-dimensional multivariate Gaussian is associated with each state j, defining the probability that the k-dimensional vector v is generated when being in state j:
P(Ot | qt=Sj, λ) ~ N(μj, Σj)
λ = (A, {(μj, Σj)}j=1..N, Π); the observed vector sequence is generated by the hidden state sequence.

Generalization: HMM with Inputs
Input-dependent observations: P(Ot | qt=Sj, xt, λ) ~ N(gj(xt | θj), σj2)
Input-dependent transitions (Meila and Jordan, 1996; Bengio and Frasconi, 1996): P(qt+1=Sj | qt=Si, xt)
Time-delay input: xt = f(Ot-τ, ..., Ot-1)
Lecture Notes for E Alpaydın 2010 Introduction to Machine Learning 2e © The MIT Press (V1.0)
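The E-step/M-step loop for the discrete case can be sketched as a single Baum-Welch iteration over K sequences. This is a minimal NumPy sketch under assumed model values; in practice one would iterate until convergence and use scaling or log-space arithmetic for long sequences.

```python
import numpy as np

pi = np.array([0.5, 0.2, 0.3])
A = np.array([[0.4, 0.3, 0.3], [0.2, 0.6, 0.2], [0.1, 0.1, 0.8]])
B = np.array([[0.8, 0.1, 0.1], [0.1, 0.8, 0.1], [0.2, 0.2, 0.6]])  # illustrative

def forward_backward(obs, A, B, pi):
    """Compute alpha and beta for one observation sequence."""
    T, N = len(obs), len(pi)
    alpha = np.zeros((T, N)); beta = np.ones((T, N))
    alpha[0] = pi * B[:, obs[0]]
    for t in range(1, T):
        alpha[t] = (alpha[t - 1] @ A) * B[:, obs[t]]
    for t in range(T - 2, -1, -1):
        beta[t] = A @ (B[:, obs[t + 1]] * beta[t + 1])
    return alpha, beta

def baum_welch_step(sequences, A, B, pi):
    """One EM iteration: accumulate gamma and xi over all K sequences (E-step),
    then re-estimate pi, A, B by the ratio formulas (M-step)."""
    N, M = B.shape
    pi_num = np.zeros(N)
    a_num = np.zeros((N, N)); a_den = np.zeros(N)
    b_num = np.zeros((N, M)); b_den = np.zeros(N)
    for obs in sequences:
        alpha, beta = forward_backward(obs, A, B, pi)
        p_obs = alpha[-1].sum()                    # P(O^k | lambda)
        gamma = alpha * beta / p_obs               # gamma[t, i]
        for t in range(len(obs) - 1):
            # xi_t[i, j] = alpha_t(i) a_ij b_j(O_t+1) beta_t+1(j) / P(O | lambda)
            xi_t = alpha[t][:, None] * A * B[:, obs[t + 1]] * beta[t + 1] / p_obs
            a_num += xi_t                          # expected i -> j transitions
            a_den += gamma[t]                      # expected visits to i (t < T)
        pi_num += gamma[0]                         # expected starts in i
        for t, o in enumerate(obs):
            b_num[:, o] += gamma[t]                # expected emissions of symbol o
        b_den += gamma.sum(axis=0)                 # expected visits to j (all t)
    return a_num / a_den[:, None], b_num / b_den[:, None], pi_num / len(sequences)

seqs = [[0, 0, 2, 2], [1, 1, 0, 2]]                # two short example sequences
A_new, B_new, pi_new = baum_welch_step(seqs, A, B, pi)
```

After each iteration the re-estimated rows of A and B, and Π itself, remain valid probability distributions, which is a quick correctness check.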