Probability Theory and Basic Alignment of String Sequences

Parameter estimation for HMMs, Baum-Welch algorithm, model topology, numerical stability
Chapter 3.3–3.7
Elze de Groot
Overview
• Parameter estimation for HMMs
– Baum-Welch algorithm
• HMM model structure
• More complex Markov chains
• Numerical stability of HMM algorithms
Specifying a HMM model
• Most difficult problem using HMMs is
specifying the model
– Design of the structure
– Assignment of parameter values
Parameter estimation for HMMs
• Estimate transition and emission
probabilities akl and ek(b)
• Two ways of learning:
– Estimation when state sequence is known
– Estimation when paths are unknown
• Assume that we have a set of example
sequences (training sequences x1, …xn)
Parameter estimation for HMMs
• Assume that the training sequences x^1, …, x^n are independent
• Joint probability:
  $P(x^1, \ldots, x^n \mid \theta) = \prod_{j=1}^{n} P(x^j \mid \theta)$
• In log space, since log(ab) = log a + log b:
  $\log P(x^1, \ldots, x^n \mid \theta) = \sum_{j=1}^{n} \log P(x^j \mid \theta)$
Estimation when state sequence is known
• Easier than estimation when the paths are unknown
• Maximum likelihood estimators (a counting sketch follows below):
  $a_{kl} = \frac{A_{kl}}{\sum_{l'} A_{kl'}}$        $e_k(b) = \frac{E_k(b)}{\sum_{b'} E_k(b')}$
• Akl = number of transitions from k to l in the training data + pseudocount rkl
• Ek(b) = number of emissions of b from state k in the training data + pseudocount rk(b)
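When the state paths are known, estimation is literally counting and normalising. A minimal sketch, assuming each training example is a (symbol sequence, state path) pair; all names here are illustrative, not from the slides:

# Sketch: ML estimation of akl and ek(b) when the state paths are known.
# 'pseudo' plays the role of the pseudocounts rkl and rk(b) from the slide.
def estimate_known_paths(training, states, symbols, pseudo=1.0):
    A = {k: {l: pseudo for l in states} for k in states}   # transition counts Akl
    E = {k: {b: pseudo for b in symbols} for k in states}  # emission counts Ek(b)
    for x, path in training:                 # x and path have the same length
        for i, (k, b) in enumerate(zip(path, x)):
            E[k][b] += 1
            if i + 1 < len(path):
                A[k][path[i + 1]] += 1
    # normalise the counts into probabilities
    a = {k: {l: A[k][l] / sum(A[k].values()) for l in states} for k in states}
    e = {k: {b: E[k][b] / sum(E[k].values()) for b in symbols} for k in states}
    return a, e

For example, estimate_known_paths([("AAB", "FFL")], "FL", "AB") would estimate a two-state model from a single labelled sequence.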
Estimation when paths are unknown
• More complex than when the paths are known
• The maximum likelihood estimators can no longer be computed directly by counting
• Instead, an iterative algorithm is used:
  – Baum-Welch
The Baum-Welch algorithm
• We don’t know the real values of Akl and Ek(b)
1. Estimate Akl and Ek(b)
2. Update akl and ek(b)
3. Repeat with the new model parameters akl and ek(b)
Baum-Welch algorithm
[Figure: the expected counts Akl are computed by combining the forward value fk(i) with the backward value bl(i+1)]
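For reference, these are the standard expected-count formulas that the figure illustrates (as given in Durbin et al., chapter 3), where f and b are the forward and backward variables computed for each training sequence x^j:

$A_{kl} = \sum_j \frac{1}{P(x^j)} \sum_i f_k^j(i)\, a_{kl}\, e_l(x_{i+1}^j)\, b_l^j(i+1)$

$E_k(b) = \sum_j \frac{1}{P(x^j)} \sum_{\{i \,:\, x_i^j = b\}} f_k^j(i)\, b_k^j(i)$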
Baum-Welch algorithm
• Now that we have estimated Akl and Ek(b), use the maximum likelihood estimators to compute akl and ek(b)
• We use these new values to estimate Akl and Ek(b) in the next iteration
• Continue iterating until the change is very small or the maximum number of iterations is exceeded (a sketch of one iteration follows below)
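A compact sketch of one estimate-and-update iteration for a discrete-emission HMM, assuming NumPy; the names (a, e, pi, seqs, pseudo) are illustrative and this is only an outline of the loop, not the book's exact pseudocode:

import numpy as np

def forward(x, a, e, pi):
    f = np.zeros((len(x), a.shape[0]))
    f[0] = pi * e[:, x[0]]
    for i in range(1, len(x)):
        f[i] = e[:, x[i]] * (f[i - 1] @ a)     # f_l(i) = e_l(x_i) * sum_k f_k(i-1) a_kl
    return f, f[-1].sum()                      # forward table and P(x)

def backward(x, a, e):
    b = np.zeros((len(x), a.shape[0]))
    b[-1] = 1.0
    for i in range(len(x) - 2, -1, -1):
        b[i] = a @ (e[:, x[i + 1]] * b[i + 1])
    return b

def baum_welch_step(seqs, a, e, pi, pseudo=1e-3):
    A = np.full_like(a, pseudo)                # expected transition counts Akl
    E = np.full_like(e, pseudo)                # expected emission counts Ek(b)
    for x in seqs:                             # x: sequence of integer symbols
        f, px = forward(x, a, e, pi)
        b = backward(x, a, e)
        for i in range(len(x) - 1):            # accumulate expected counts
            A += np.outer(f[i], e[:, x[i + 1]] * b[i + 1]) * a / px
        for i in range(len(x)):
            E[:, x[i]] += f[i] * b[i] / px
    # ML update of the parameters (pi is not re-estimated in this sketch)
    return A / A.sum(axis=1, keepdims=True), E / E.sum(axis=1, keepdims=True)

A training run would call baum_welch_step repeatedly with the new a and e until the change in the log likelihood (or in the parameters) becomes very small or the maximum number of iterations is reached.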
Example
• Model estimated from 300 rolls and from 30,000 rolls
Drawbacks
• ML estimators:
  – Vulnerable to overfitting if there is not enough data
  – Estimates can be undefined if a state or transition is never used in the training set (hence the use of pseudocounts)
• Baum-Welch:
  – Can converge to one of many local maxima instead of the global maximum, depending on the starting values of the parameters
  – This problem gets worse for large HMMs
Viterbi Training
• The most probable path for each training sequence is derived using the Viterbi algorithm (see the sketch below)
• Continue iterating until none of the paths change
• Finds the value of θ that maximises the contribution of the most probable paths to the likelihood
• Usually performs less well than Baum-Welch
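A sketch of that loop, reusing the known-path estimator from the earlier sketch and assuming a viterbi(x, a, e) routine that returns the most probable state path (both names are illustrative):

def viterbi_training(seqs, a, e, viterbi, estimate_known_paths,
                     states, symbols, max_iter=100):
    paths = None
    for _ in range(max_iter):
        new_paths = [viterbi(x, a, e) for x in seqs]   # most probable path per sequence
        if new_paths == paths:                         # stop when no path changes
            break
        paths = new_paths
        # re-estimate as if the Viterbi paths were the true state sequences
        a, e = estimate_known_paths(list(zip(seqs, paths)), states, symbols)
    return a, e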
Modelling of labelled sequences
• Only -- and ++ are calculated
• Better than using ML estimators, when
many different classes are present
Specifying a HMM model
• Most difficult problem using HMMs is
specifying the model
– Design of the structure
– Assignment of parameter values
Design of the structure
• Design: how to connect the states by transitions
• A good HMM is based on knowledge about the problem under investigation
• Local maxima are the biggest disadvantage of fully connected models
• A transition can be deleted from the model by setting its probability to zero: Baum-Welch will then keep it at zero, so the algorithm still works
Example 1
• Geometric distribution: a single state with self-transition probability p and exit probability 1 − p, so that
  $P(\ell \text{ residues}) = p^{\ell-1} (1 - p)$
Example 2
• Model distribution of length between 2 and
10
Example 3
• Negative binomial distribution:
  $P(\ell) = \binom{\ell-1}{n-1}\, p^{\ell-n}\, (1-p)^{n}$
• p = 0.99
• n ≤ 5
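This distribution presumably arises, as described in Durbin et al., from an array of n states that each have a self-transition probability p: the n exit transitions contribute $(1-p)^n$, the remaining $\ell-n$ steps are self-transitions contributing $p^{\ell-n}$, and the binomial coefficient counts the ways of distributing those self-transitions over the n states.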
Silent states
• States that do not emit symbols, for example the begin state B
• Silent states can also be used in other places in an HMM
Example
Silent states
Silent states
• Advantage:
  – Fewer transition probabilities need to be estimated
• Drawback:
  – Limits the possibilities for defining a model
Silent states
• Change in the forward algorithm
• For ‘real’ (emitting) states the recursion stays the same
• For silent states, set $f_l(i+1) = \sum_k f_k(i+1)\, a_{kl}$, summing over the emitting states k
• Then, starting from the lowest numbered silent state l, add $\sum_k f_k(i+1)\, a_{kl}$ to $f_l(i+1)$ for all silent states k < l (see the sketch below)
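One way to implement the modified column update, assuming NumPy, no loops of silent states, and silent states numbered so that silent-to-silent transitions only go from lower to higher indices; all names are illustrative:

import numpy as np

# One column of the forward algorithm in a model with silent states.
# a[k, l]: transition probabilities, e[k, s]: emission probabilities,
# emitting / silent: index arrays of the emitting and silent states
# (silent must be sorted in increasing order).
def forward_column(f_prev, symbol, a, e, emitting, silent):
    f = np.zeros(a.shape[0])
    # emitting states: the usual recursion f_l(i+1) = e_l(x_{i+1}) * sum_k f_k(i) a_kl
    f[emitting] = e[emitting, symbol] * (f_prev @ a[:, emitting])
    # silent states, step 1: contributions from the emitting states at i+1
    f[silent] = f[emitting] @ a[np.ix_(emitting, silent)]
    # silent states, step 2: add contributions from lower-numbered silent states
    for l in silent:
        for k in silent:
            if k < l:
                f[l] += f[k] * a[k, l]
    return f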
More complex Markov chains
• So far, we assumed that the probability of a symbol in a sequence depends only on the previous symbol
• More complex variants:
  – High order Markov chains
  – Inhomogeneous Markov chains
High order Markov chains
• An nth order Markov process:
  $P(x_i \mid x_{i-1}, x_{i-2}, \ldots, x_1) = P(x_i \mid x_{i-1}, x_{i-2}, \ldots, x_{i-n})$
• The probability of a symbol in a sequence depends on the previous n symbols
• An nth order Markov chain over some alphabet A is equivalent to a first order Markov chain over the alphabet A^n of n-tuples, because
  P(AB | B) = P(A | B)
Example
• A second order Markov chain with two different symbols {A, B}
• This can be translated into a first order Markov chain over the 2-tuples {AA, AB, BA, BB}
• Sometimes the framework of a high order model is more convenient (a conversion sketch follows below)
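A small sketch of that translation, assuming the second-order model is given as a dictionary p2[(a, b, c)] = P(c | a, b); this representation is illustrative, not from the slides:

from itertools import product

def to_first_order(p2, alphabet="AB"):
    pairs = ["".join(t) for t in product(alphabet, repeat=2)]   # AA, AB, BA, BB
    a = {}                                   # first order transitions over pairs
    for s in pairs:                          # current pair (x_{i-1}, x_i)
        for c in alphabet:                   # next symbol x_{i+1}
            t = s[1] + c                     # next pair (x_i, x_{i+1})
            a[(s, t)] = p2[(s[0], s[1], c)]  # P(pair -> pair) = P(c | x_{i-1}, x_i)
    return a                                 # all other pair transitions have probability 0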
Finding prokaryotic genes
• Gene candidates in DNA:
  – a sequence of nucleotide triplets: a start codon, a number of non-stop codons, and a stop codon
  – such a stretch is an open reading frame (ORF)
• An ORF can be either a gene or a non-coding ORF (NORF)
Finding prokaryotic genes
• Experiment:
  – DNA from the bacterium E. coli
  – The dataset contains 1100 genes (900 used for training, 200 for testing)
• Two models:
  – A normal model: a first order Markov chain over nucleotides
  – Also a first order Markov chain, but with codons instead of nucleotides as symbols
Finding prokaryotic genes
• Outcomes:
Inhomogeneous Markov chains
• Use the position information within the codon
  – Three separate models for codon positions 1, 2 and 3
• Example: the sequence CATGCA (codons CAT and GCA), with codon positions 1, 2, 3, 1, 2, 3:
  – Homogeneous:   $P(C)\, a_{CA}\, a_{AT}\, a_{TG}\, a_{GC}\, a_{CA}$
  – Inhomogeneous: $P(C)\, a^2_{CA}\, a^3_{AT}\, a^1_{TG}\, a^2_{GC}\, a^3_{CA}$
  where $a^m$ denotes the transition probabilities of the model for codon position m (the position of the second symbol of the transition)
Numerical stability of HMM algorithms
• Multiplying many probabilities can cause numerical problems:
  – Underflow errors
  – Incorrect numbers being calculated
• Solutions:
– Log transformation
– Scaling of probabilities
The log transformation
• Compute log probabilities
  – e.g. $\log_{10} 10^{-100000} = -100000$
  – the underflow problem is essentially solved
• The sum operation is often faster than the product operation
• In the Viterbi algorithm (sketched below) the recursion becomes
  $V_l(i+1) = \log e_l(x_{i+1}) + \max_k \left( V_k(i) + \log a_{kl} \right)$
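A minimal sketch of the Viterbi recursion in log space, assuming NumPy; log_a, log_e and log_pi are the element-wise logarithms of the model parameters (illustrative names):

import numpy as np

def viterbi_log(x, log_a, log_e, log_pi):
    L, n_states = len(x), log_a.shape[0]
    V = np.full((L, n_states), -np.inf)
    ptr = np.zeros((L, n_states), dtype=int)        # back pointers
    V[0] = log_pi + log_e[:, x[0]]
    for i in range(1, L):
        scores = V[i - 1][:, None] + log_a          # scores[k, l] = V_k(i-1) + log a_kl
        ptr[i] = scores.argmax(axis=0)
        V[i] = log_e[:, x[i]] + scores.max(axis=0)  # V_l(i) = log e_l(x_i) + max_k(...)
    path = [int(V[-1].argmax())]                    # trace back the best path
    for i in range(L - 1, 0, -1):
        path.append(int(ptr[i, path[-1]]))
    return V[-1].max(), path[::-1]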
Scaling of probabilities
• Scale the f and b variables
• Forward variable:
  – For each position i a scaling variable si is defined
  – New f variables are defined:
    $\tilde{f}_l(i) = \frac{f_l(i)}{\prod_{j=1}^{i} s_j}$
  – New forward recursion:
    $\tilde{f}_l(i+1) = \frac{1}{s_{i+1}}\, e_l(x_{i+1}) \sum_k \tilde{f}_k(i)\, a_{kl}$
Scaling of probabilities
• Backward variable:
  – The scaling has to use the same numbers si as for the forward variable
  – New backward recursion:
    $\tilde{b}_k(i) = \frac{1}{s_i} \sum_l a_{kl}\, \tilde{b}_l(i+1)\, e_l(x_{i+1})$
• This normally works well; however, underflow errors can still occur in models with many silent states (chapter 5)
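A sketch of the scaled forward recursion, assuming NumPy and the common choice of each s_i as the value that makes the scaled forward values at position i sum to one (names are illustrative):

import numpy as np

def forward_scaled(x, a, e, pi):
    L, n_states = len(x), a.shape[0]
    f = np.zeros((L, n_states))                 # scaled forward values
    s = np.zeros(L)                             # scaling variables s_1 .. s_L
    col = pi * e[:, x[0]]
    s[0] = col.sum()
    f[0] = col / s[0]
    for i in range(1, L):
        col = e[:, x[i]] * (f[i - 1] @ a)       # e_l(x_{i+1}) * sum_k f~_k(i) a_kl
        s[i] = col.sum()
        f[i] = col / s[i]                       # divide by the scaling variable
    log_px = np.log(s).sum()                    # log P(x) = sum_i log s_i
    return f, s, log_px

The same s_i values would then be reused when computing the scaled backward variables.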
Summary
• Hidden Markov Models
• Parameter estimation
– State sequence known
– State sequence unknown
• Model structure
– Silent states
• More complex Markov chains
• Numerical stability