Parameter estimation for HMMs, Baum-Welch algorithm, model topology, numerical stability
Chapter 3.3–3.7
Elze de Groot

Overview
• Parameter estimation for HMMs
  – Baum-Welch algorithm
• HMM model structure
• More complex Markov chains
• Numerical stability of HMM algorithms

Specifying an HMM model
• The most difficult problem in using HMMs is specifying the model
  – Design of the structure
  – Assignment of the parameter values

Parameter estimation for HMMs
• Estimate the transition and emission probabilities a_kl and e_k(b)
• Two ways of learning:
  – Estimation when the state sequence is known
  – Estimation when the paths are unknown
• Assume that we have a set of example sequences (training sequences x^1, ..., x^n)

Parameter estimation for HMMs
• Assume that x^1, ..., x^n are independent, so the joint probability is
  P(x^1, ..., x^n | θ) = ∏_{j=1}^{n} P(x^j | θ)
• In log space this product becomes a sum, since log(ab) = log a + log b:
  log P(x^1, ..., x^n | θ) = Σ_{j=1}^{n} log P(x^j | θ)

Estimation when the state sequence is known
• Easier than estimation when the paths are unknown
• Maximum likelihood estimators (a counting sketch follows this section):
  a_kl = A_kl / Σ_{l'} A_{kl'}
  e_k(b) = E_k(b) / Σ_{b'} E_k(b')
• A_kl = number of k → l transitions in the training data, plus a pseudocount r_kl
• E_k(b) = number of emissions of b from state k in the training data, plus a pseudocount r_k(b)

Estimation when the paths are unknown
• More complex than when the paths are known
• Maximum likelihood estimators can no longer be used directly
• Instead, an iterative algorithm is used: Baum-Welch

The Baum-Welch algorithm
• We do not know the real values of A_kl and E_k(b), so:
  1. Estimate A_kl and E_k(b)
  2. Update a_kl and e_k(b)
  3. Repeat with the new model parameters a_kl and e_k(b)

The Baum-Welch algorithm
• The expected counts combine the forward value f_k(i) with the backward value b_l(i+1):
  A_kl = Σ_j (1 / P(x^j)) Σ_i f_k^j(i) a_kl e_l(x^j_{i+1}) b_l^j(i+1)
  E_k(b) = Σ_j (1 / P(x^j)) Σ_{i : x_i^j = b} f_k^j(i) b_k^j(i)

The Baum-Welch algorithm
• Now that we have estimated A_kl and E_k(b), use the maximum likelihood estimators above to compute the new a_kl and e_k(b)
• These values are used to estimate A_kl and E_k(b) in the next iteration
• Continue iterating until the change is very small or a maximum number of iterations is exceeded (a sketch of one iteration follows this section)

Example
• The model estimated from 300 rolls and from 30,000 rolls

Drawbacks
• Maximum likelihood estimators
  – vulnerable to overfitting if there is not enough data
  – estimates are undefined for transitions or emissions never used in the training set (hence the pseudocounts)
• Baum-Welch
  – can converge to one of many local maxima instead of the global maximum, depending on the starting values of the parameters
  – this problem gets worse for large HMMs

Viterbi training
• The most probable path is derived using the Viterbi algorithm
• Iterate until none of the paths change
• Finds the value of θ that maximises the contribution of the most probable paths to the likelihood, rather than the likelihood itself
• Usually performs less well than Baum-Welch
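As a concrete companion to the "estimation when the state sequence is known" slide, here is a minimal Python sketch of the counting estimators. The function name, the dict-of-dicts representation, and the uniform pseudocounts are illustrative choices, not part of the slides.

```python
def estimate_parameters(sequences, paths, states, alphabet, r_trans=1.0, r_emit=1.0):
    """ML estimation of a_kl and e_k(b) when the state sequences are known.

    A_kl and E_k(b) are transition/emission counts plus pseudocounts
    (r_trans, r_emit), as on the slide; the estimators then normalise
    each row of counts into probabilities.
    """
    A = {k: {l: r_trans for l in states} for k in states}
    E = {k: {b: r_emit for b in alphabet} for k in states}

    for x, pi in zip(sequences, paths):
        for i, b in enumerate(x):
            E[pi[i]][b] += 1              # emission of b from state pi[i]
            if i + 1 < len(x):
                A[pi[i]][pi[i + 1]] += 1  # transition pi[i] -> pi[i+1]

    # a_kl = A_kl / sum_l' A_kl',  e_k(b) = E_k(b) / sum_b' E_k(b')
    a = {k: {l: A[k][l] / sum(A[k].values()) for l in states} for k in states}
    e = {k: {b: E[k][b] / sum(E[k].values()) for b in alphabet} for k in states}
    return a, e
```

For a dice-rolling model like the one in the example slide, `states` could be `{'F', 'L'}` and `alphabet` the symbols `'123456'`; these names are illustrative.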
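And for the "paths unknown" case, a sketch of one Baum-Welch iteration under stated assumptions: NumPy arrays, integer-coded sequences, an explicit initial distribution `pi0`, and no pseudocounts (the slides recommend adding them in practice).

```python
import numpy as np

def baum_welch_step(xs, a, e, pi0):
    """One Baum-Welch iteration for a discrete HMM.

    a[k, l]: transition probs; e[k, b]: emission probs;
    pi0[k]: initial distribution; xs: list of integer-coded sequences.
    Returns updated (a, e) computed from the expected counts A_kl, E_k(b).
    """
    K, B = e.shape
    A = np.zeros((K, K))              # expected transition counts A_kl
    E = np.zeros((K, B))              # expected emission counts E_k(b)

    for x in xs:
        L = len(x)
        f = np.zeros((L, K))          # forward values f_k(i)
        b = np.zeros((L, K))          # backward values b_k(i)
        f[0] = pi0 * e[:, x[0]]
        for i in range(1, L):
            f[i] = e[:, x[i]] * (f[i - 1] @ a)
        b[L - 1] = 1.0
        for i in range(L - 2, -1, -1):
            b[i] = a @ (e[:, x[i + 1]] * b[i + 1])
        px = f[L - 1].sum()           # P(x | current model)

        # A_kl += (1/P(x)) sum_i f_k(i) a_kl e_l(x_{i+1}) b_l(i+1)
        for i in range(L - 1):
            A += np.outer(f[i], e[:, x[i + 1]] * b[i + 1]) * a / px
        # E_k(b) += (1/P(x)) sum_{i: x_i = b} f_k(i) b_k(i)
        for i in range(L):
            E[:, x[i]] += f[i] * b[i] / px

    # ML re-estimation from the expected counts
    return A / A.sum(axis=1, keepdims=True), E / E.sum(axis=1, keepdims=True)
```

Calling this repeatedly until the change in the parameters (or in the log likelihood) falls below a threshold gives the iteration described on the slides; a production version would work in scaled or log space, as covered in the numerical-stability section.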
Modelling of labelled sequences
• Only the transitions consistent with the labels (the '--' and '++' ones) are calculated
• Better than using plain maximum likelihood estimators when many different classes are present

Specifying an HMM model
• The most difficult problem in using HMMs is specifying the model
  – Design of the structure
  – Assignment of the parameter values

Design of the structure
• Design: how to connect the states by transitions
• A good HMM is based on knowledge about the problem under investigation
• Local maxima are the biggest disadvantage of fully connected models
• Baum-Welch still works after deleting a transition from the model: simply set its transition probability to zero

Example 1
• A state with self-transition probability p (and exit probability 1-p) models a geometric length distribution:
  P(l residues) = (1-p) p^{l-1}

Example 2
• Model a distribution of lengths between 2 and 10

Example 3
• An array of n states, each with self-transition probability p, models a negative binomial distribution:
  P(l) = C(l-1, n-1) p^{l-n} (1-p)^n
• p = 0.99
• n ≤ 5

Silent states
• States that do not emit symbols, such as the begin state B
• Silent states can also be used in other places in an HMM

Example: silent states
• [figure]

Silent states
• Advantage:
  – fewer transition probabilities need to be estimated
• Drawback:
  – limits the possibilities of defining a model

Silent states
• Changes in the forward algorithm (a sketch follows this section):
  – for 'real' (emitting) states the recursion stays the same
  – for each silent state l, first set f_l(i+1) = Σ_k f_k(i+1) a_kl, summing over the emitting states k
  – then, starting from the lowest-numbered silent state l, add Σ_k f_k(i+1) a_kl to f_l(i+1) for all silent states k < l
• This assumes the silent states are numbered so that transitions between them run from lower to higher numbers

More complex Markov chains
• So far, we assumed that the probability of a symbol in a sequence depends only on the previous symbol
• More complex:
  – high order Markov chains
  – inhomogeneous Markov chains

High order Markov chains
• An nth order Markov process:
  P(x_i | x_{i-1}, x_{i-2}, ..., x_1) = P(x_i | x_{i-1}, x_{i-2}, ..., x_{i-n})
• The probability of a symbol in a sequence depends on the previous n symbols
• An nth order Markov chain over some alphabet A is equivalent to a first order Markov chain over the alphabet A^n of n-tuples, because P(AB | B) = P(A | B): the probability of the next tuple given the current one reduces to the probability of the new symbol given the current tuple

Example
• A second order Markov chain with two different symbols {A, B} can be translated into a first order Markov chain over the 2-tuples {AA, AB, BA, BB} (see the sketch after this section)
• Sometimes the framework of the high order model is more convenient
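To make the silent-state modification of the forward algorithm concrete, here is a small Python sketch of one forward step. The function name and the list-based representation are assumptions; states are indexed 0..K-1, and, as on the slide, transitions between silent states are assumed to run from lower to higher indices only.

```python
def forward_step_with_silent(f_prev, x_next, a, e, emitting, silent):
    """One forward step for an HMM with silent states.

    f_prev[k] = f_k(i); x_next is the index of symbol x_{i+1};
    a[k][l] and e[l][b] are transition/emission probabilities;
    emitting/silent are lists of state indices.
    Returns the column f_k(i+1) for all states k.
    """
    K = len(a)
    f = [0.0] * K
    # 'Real' (emitting) states: the usual recursion over the previous column
    for l in emitting:
        f[l] = e[l][x_next] * sum(f_prev[k] * a[k][l] for k in range(K))
    # Silent states, pass 1: contributions arriving from emitting states
    for l in silent:
        f[l] = sum(f[k] * a[k][l] for k in emitting)
    # Pass 2: from the lowest-numbered silent state upwards, add
    # contributions from lower-numbered silent states k < l
    for l in sorted(silent):
        f[l] += sum(f[k] * a[k][l] for k in sorted(silent) if k < l)
    return f
```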
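The tuple construction from the high-order-chain slides can also be written out directly. Below is a hedged sketch: `a2` is a hypothetical table of second-order probabilities, and the function builds the equivalent first-order transition table over 2-tuples.

```python
from itertools import product

def to_first_order(a2, alphabet):
    """Translate a second-order chain into a first-order chain over 2-tuples.

    a2[(u, v)][w] = P(x_i = w | x_{i-2} = u, x_{i-1} = v).
    The tuple chain moves from (u, v) to (v, w) with that same
    probability; every other tuple transition gets probability 0.
    This is the P(AB | B) = P(A | B) identity from the slide.
    """
    tuples = list(product(alphabet, repeat=2))
    a1 = {s: {t: 0.0 for t in tuples} for s in tuples}
    for (u, v) in tuples:
        for w in alphabet:
            a1[(u, v)][(v, w)] = a2[(u, v)][w]
    return a1
```

With `alphabet = "AB"` this produces exactly the four states {AA, AB, BA, BB} of the slide's example.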
Finding prokaryotic genes
• Gene candidates in DNA: a sequence of nucleotide triplets, i.e. a start codon, a number of non-stop codons, and a stop codon: an open reading frame (ORF)
• An ORF can be either a gene or a non-coding ORF (NORF)

Finding prokaryotic genes
• Experiment:
  – DNA from the bacterium E. coli
  – the dataset contains 1100 genes (900 used for training, 200 for testing)
• Two models:
  – a normal model with first order Markov chains over nucleotides
  – also first order Markov chains, but with codons instead of nucleotides as symbols

Finding prokaryotic genes
• Outcomes: [figure]

Inhomogeneous Markov chains
• Use the position information within the codon: three models a^1, a^2, a^3 for codon positions 1, 2 and 3
• For a sequence x_1 x_2 x_3 x_4 x_5 x_6 ... starting at codon position 1:
  P(x) = P(x_1) a^2_{x_1 x_2} a^3_{x_2 x_3} a^1_{x_3 x_4} a^2_{x_4 x_5} a^3_{x_5 x_6} ...
• Example: CATGCA, codon positions 123123
  – homogeneous: P(C) a_{CA} a_{AT} a_{TG} a_{GC} a_{CA}
  – inhomogeneous: P(C) a^2_{CA} a^3_{AT} a^1_{TG} a^2_{GC} a^3_{CA}

Numerical stability of HMM algorithms
• Multiplying many probabilities can cause numerical problems:
  – underflow errors
  – wrong numbers are calculated
• Solutions:
  – log transformation
  – scaling of probabilities

The log transformation
• Compute log probabilities
  – log 10^{-100000} = -100000, so the underflow problem is essentially solved
• A sum operation is often faster than a product operation
• In the Viterbi algorithm (a sketch appears after the summary):
  V_l(i+1) = log e_l(x_{i+1}) + max_k (V_k(i) + log a_kl)

Scaling of probabilities
• Scale the f and b variables
• Forward variable (a sketch appears after the summary):
  – for each i a scaling variable s_i is defined
  – new f variables are defined: f̃_l(i) = f_l(i) / ∏_{j=1}^{i} s_j
  – new forward recursion: f̃_l(i+1) = (1 / s_{i+1}) e_l(x_{i+1}) Σ_k f̃_k(i) a_kl

Scaling of probabilities
• Backward variable
  – the scaling has to use the same numbers s_i as the forward variable
  – new backward recursion: b̃_k(i) = (1 / s_i) Σ_l a_kl b̃_l(i+1) e_l(x_{i+1})
• This normally works well; however, underflow errors can still occur in models with many silent states (chapter 5)

Summary
• Hidden Markov models
• Parameter estimation
  – state sequence known
  – state sequence unknown
• Model structure
  – silent states
• More complex Markov chains
• Numerical stability
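Returning to the log transformation: a minimal Python sketch of one log-space Viterbi step, assuming the transition and emission tables have already been converted with `safe_log` (both function names are illustrative).

```python
import math

def safe_log(p):
    """log p, with log 0 = -infinity so impossible transitions stay impossible."""
    return math.log(p) if p > 0.0 else float("-inf")

def viterbi_log_step(V_prev, x_next, log_a, log_e):
    """One log-space Viterbi step from the slide:
    V_l(i+1) = log e_l(x_{i+1}) + max_k (V_k(i) + log a_kl).
    Products of probabilities become sums of logs, so no underflow.
    """
    K = len(log_a)
    return [log_e[l][x_next] + max(V_prev[k] + log_a[k][l] for k in range(K))
            for l in range(K)]
```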
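And a sketch of the scaled forward algorithm. The slides define the scaling variables s_i but not how to choose them; a common choice, assumed here, is to pick s_i so that each scaled column sums to one, after which log P(x) = Σ_i log s_i. NumPy, integer-coded sequences, and the initial distribution `pi0` are also assumptions.

```python
import numpy as np

def forward_scaled(x, a, e, pi0):
    """Scaled forward algorithm: f~_l(i) = f_l(i) / prod_{j<=i} s_j.

    Each s_i is set to the column sum so that the scaled variables
    sum to one; log P(x) is recovered as sum_i log s_i.
    """
    L, K = len(x), len(pi0)
    f = np.zeros((L, K))
    s = np.zeros(L)                   # scaling variables s_i
    f[0] = pi0 * e[:, x[0]]
    s[0] = f[0].sum()
    f[0] /= s[0]
    for i in range(1, L):
        # f~_l(i+1) = (1 / s_{i+1}) e_l(x_{i+1}) sum_k f~_k(i) a_kl
        f[i] = e[:, x[i]] * (f[i - 1] @ a)
        s[i] = f[i].sum()
        f[i] /= s[i]
    return f, s, np.log(s).sum()      # scaled f, scales, log P(x)
```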