4 Pairwise alignment

Pairwise alignment
- Score-based alignments, dynamic programming
- Problems with score-based alignments
- Markov chains, hidden Markov models, examples
- The Viterbi algorithm
- Pair HMMs
- Forward, Backward, and Posterior probabilities
- Posterior Decoding, comparison with score-based alignments
Sequence alignment
To study how nucleotide (or amino acid) sequences evolve, a precise
matching between homologous nucleotides in two (homologous)
nucleotide sequences must first be established.
“Homology” is an old concept in biology, and means “having a
similar form”. In evolutionary biology, it means more
specifically, having a similar form because of shared ancestry.
For example, the skeletons of mammals are homologous in
this sense, but eyes have evolved independently a number of
times, so they’re not homologous (even though they may be
very similar in form and function).
The word is also used for nucleotide sequences. Here it means
sequence similarity because of shared ancestry (although it is
often used as a synonym of “sequence similarity”). Ancestry is
tricky to determine, especially when sequences are short,
highly divergent, or contain many repetitive patterns
(because of the possibility that these repeats have arisen
independently). So in practice, sequence similarity is often
used as a proxy for homology. Other characteristics, such as
the function or 3D structure of the protein encoded by the
DNA sequence are used as well.
For example, Hexokinase I (a glucose metabolism enzyme) is found in
mammals, plants, and yeast.
Needleman-Wunsch algorithm
Two sequences are considered homologous if they’re likely to have evolved from a common ancestor, because
of their sequence similarity.
To quantify this, it is first assumed that during evolution the sequences have been changed by two processes:
(1) single nucleotide substitutions, and (2) short sequence insertions and deletions (indels). Other
mutations do occur (e.g. multiple-nucleotide substitutions, duplications, inversions, ...), but they’re usually
ignored because of their relatively low frequency and/or the technical difficulty of dealing with them.
Considering only nucleotide substitutions and indels, the homology between two sequences can be summarized
by an alignment, which implies a minimal set of substitutions and indels accounting for that homology.
CGACATTAA--ATAGGCATAGCAGGACCAGATACCAGATCAAAGGCTTCAGGCGCA
CGACGTTAACGATTGGC---GCAGTATCAGATACCCGATCAAAG----CAGACGCA
Historically, the first algorithm for calculating alignments is the Needleman-Wunsch algorithm (1970). It finds
the alignment that optimizes a score, which penalizes both nucleotide mis-matches at aligned positions,
and gaps. This approach is a form of weighted parsimony.
The score function has the following form:
S = n,m aligned M(n,m) + gaps h(length),
where M(m,n) is the score matrix, and h(k) is the gap penalty function. The latter is almost always taken
to be a linear affine function,
h(k) = a + b * (k-1)
where a is the gap opening penalty, and b is the gap extension penalty.
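
To make this score function concrete, here is a minimal Python sketch (not from the original slides) that scores a given gapped alignment under a simple match/mismatch scheme and the affine gap penalty h(k) = a + b*(k-1); the parameter values mirror those used in the worked example further on (match +10, mismatch -5, gap opening -10, gap extension -5), and all function names are illustrative.

def h(k, a=-10, b=-5):
    # Affine gap penalty h(k) = a + b*(k-1) for a gap of length k
    return a + b * (k - 1)

def M(x, y, match=10, mismatch=-5):
    # Score matrix reduced to a simple match/mismatch scheme
    return match if x == y else mismatch

def alignment_score(row1, row2):
    # Sum of pair scores over aligned columns, plus h(length) for every gap
    assert len(row1) == len(row2)
    score, gap_len1, gap_len2 = 0, 0, 0
    for x, y in zip(row1, row2):
        if x == "-":                      # gap character in the first sequence
            gap_len1 += 1
        elif gap_len1:                    # a gap in the first sequence just ended
            score += h(gap_len1)
            gap_len1 = 0
        if y == "-":                      # gap character in the second sequence
            gap_len2 += 1
        elif gap_len2:                    # a gap in the second sequence just ended
            score += h(gap_len2)
            gap_len2 = 0
        if x != "-" and y != "-":
            score += M(x, y)              # aligned pair of nucleotides
    if gap_len1:
        score += h(gap_len1)
    if gap_len2:
        score += h(gap_len2)
    return score

print(alignment_score("CGACATTAA--ATA", "CGACGTTAACGATT"))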
Needleman-Wunsch algorithm
CGACATTAA--ATA
CGACGTTAACGATT
An alignment can be represented as
a path through a table, with the
sequences along the edges of
the table.
Needleman-Wunsch algorithm
CGACATTAA--ATA
CGACGTTAACGATT
The problem with determining the top-scoring alignment is that there are
very many paths.
If we ignore the diagonal steps (i.e. only
gaps...), the number of paths can be
computed recursively, by adding the
entries above and to the left and
writing the result in the box.
The numbers you get are called “Pascal’s
triangle”, or the binomial coefficients.
The number of alignments of
sequences of length N and M is
therefore at least the binomial coefficient
(N+M choose N).
Simply going through all possible
alignments is therefore impossible.
[Figure: the table filled with the number of gap-only paths to each cell; the entries form Pascal’s triangle, i.e. the binomial coefficients.]
Needleman-Wunsch algorithm
Needleman-Wunsch is, yet again, an
application of dynamic programming.
Here, the intermediate problem to solve
is the computation of the maximum score
S(n,m) of the first n nucleotides of
sequence u, and the first m nucleotides of
sequence v.
The recursive formula for this score is
S(n,m) = max{ S(n-1,m-1) + M( u(n), v(m) ),
              max_{k≥1} S(n-k,m) + h(k),
              max_{k≥1} S(n,m-k) + h(k) }
(To make the formulas look clean, boundary conditions such as S(0,0)=0, and the
allowed range of k, are not written explicitly.)
This algorithm takes O(n^3) time. It can be improved to O(n^2) by using 3 tables which
collect the scores that (1) end in an aligned pair, (2) end in a gap in u, and (3)
end in a gap in v. The linear affine form of h allows simple updating of the
gap-ending scores using neighbouring values.
CGACATTAA--ATA
CGACGTTAACGATT
[Figure: the score table for these example sequences, partially filled in. Scoring parameters: match = +10, mismatch = -5, gap opening = -10, gap extension = -5.]
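
The recursion above translates almost directly into code. Below is a minimal Python sketch (with the same illustrative scoring parameters as the figure) of the straightforward O(n^3) version that scans over all gap lengths k; the O(n^2) three-table refinement and the traceback are omitted for brevity, and all names are illustrative.

def h(k, a=-10, b=-5):
    # Affine gap penalty h(k) = a + b*(k-1)
    return a + b * (k - 1)

def M(x, y, match=10, mismatch=-5):
    # Substitution score, reduced to a match/mismatch scheme
    return match if x == y else mismatch

def needleman_wunsch_score(u, v):
    n, m = len(u), len(v)
    # S[i][j] = maximum score of aligning the first i letters of u with the first j letters of v
    S = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        S[i][0] = h(i)            # boundary: a prefix of u aligned against the empty sequence
    for j in range(1, m + 1):
        S[0][j] = h(j)            # boundary: a prefix of v aligned against the empty sequence
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            best = S[i - 1][j - 1] + M(u[i - 1], v[j - 1])                      # aligned pair
            best = max(best, max(S[i - k][j] + h(k) for k in range(1, i + 1)))  # gap in v
            best = max(best, max(S[i][j - k] + h(k) for k in range(1, j + 1)))  # gap in u
            S[i][j] = best
    return S[n][m]

print(needleman_wunsch_score("CGACATTAAATAGGCATAG", "CGACGTTAACGATTGGCGC"))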
Needleman-Wunsch algorithm
With the table filled out, the bottom-right
entry gives the score of the best-scoring
alignment.
By starting in this entry, and re-computing the
score, you can figure out which term of
the max{ } formula gave rise to this score.
Go to the corresponding entry and repeat
until you arrive at the top-left entry. The
path now corresponds to the alignment.
This algorithm computes the best-scoring
global alignment: all nucleotides
participate, and gaps at the beginning or
end are penalized as much as gaps in the
middle.
[Figure: the completed score table; the traceback path through it corresponds to the best-scoring global alignment CGACATTAA--ATA / CGACGTTAACGATT.]
The Smith-Waterman algorithm is a variation
of Needleman-Wunsch in which the alignment
may start and end anywhere in the two
sequences, so that unaligned ends are not
penalized. It computes the best-scoring local
alignment.
Alignments are often wrong
[Figures: sensitivity for true homology; incorrect bases in alignment; biases in alignments (Holmes & Durbin, JCB 5, 1998): A, gap wander; B and C, gap attraction; D, gap annihilation.]
Problems with score-based alignments
The Needleman-Wunsch algorithm (and its elaborations) optimize a score to arrive at a single preferred
alignment. This approach is unsatisfactory for several reasons:
1. It is unclear how to determine the score matrix and gap penalties given only unaligned sequences.
2. Even if somehow the underlying evolutionary process is known (e.g. as substitution and indel rates), it is
unclear how the scores and penalties relate to this.
3. It is impossible to recover the “true” alignment with certainty, particularly if the sequences are quite
divergent. But how to quantify the alignment’s reliability?
4. The N-W alignments are biased in certain ways, and provide no obvious ways to address the problem.
These observations suggest that a probabilistic (or statistical) approach to alignments might be helpful.
Markov chains
A Markov chain is a discrete-time stochastic process X(t) that has the Markov property.
“Discrete time” means that t ∈ T and T is the set of integers {...,-2,-1,0,1,2,...}, a discrete set.
The state space may be either discrete or continuous. We’ll only consider discrete state spaces here (but
continuous state spaces are also useful in biology, e.g. to model the evolution of gene expression levels).
The “Markov property” is the property that the future dynamics of X(t), given the present state, is independent
of the past states. Symbolically,
P( X(t+1)=y_{t+1} | X(t)=y_t, X(t-1)=y_{t-1}, X(t-2)=y_{t-2}, ... ) =
P( X(t+1)=y_{t+1} | X(t)=y_t )
This means that the behaviour of X(t) is completely determined by the matrices
A_pq(t) = P( X(t+1)=q | X(t)=p )
(Strictly speaking, they are matrices only if the state space is finite.) These matrices are called the
transition matrices of X(t). They may well depend on t. If they do not, X(t) is called a time-homogeneous
Markov chain.
Markov chains
Suppose you know the probabilities that the Markov chain is in a particular state for time t; let’s say you know
the vector u where
u_a = P( X(t) = a )
Then the probability that X(t+1)=b is
P( X(t+1)=b ) = Σ_a P( X(t)=a ) * P( X(t+1)=b | X(t)=a ) = Σ_a u_a A_ab = [ u A ]_b      (*)
because the summation (*) is just the definition of the product of a vector and a matrix. Applying the
same formula with “u” = u A some number of times, you get
P( X(t+n)=b ) = [ u A^n ]_b
Conditioning on X(t)=a means setting u = (0,...,1,...,0), with the 1 at the coordinate corresponding to a:
P( X(t+n)=b | X(t)=a ) = [ A^n ]_ab
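
This matrix-power formula is easy to check numerically; the small sketch below uses an arbitrary two-state transition matrix (the numbers are illustrative, not taken from the slides).

import numpy as np

A = np.array([[0.8, 0.2],                 # A_pq = P( X(t+1)=q | X(t)=p )
              [0.4, 0.6]])
u = np.array([1.0, 0.0])                  # conditioning on X(t) = state 0

n = 5
dist = u @ np.linalg.matrix_power(A, n)   # [ u A^n ]
print(dist)                               # P( X(t+n)=b | X(t)=0 ) for b = 0, 1
print(np.linalg.matrix_power(A, n)[0])    # the same numbers, read off as [ A^n ]_0b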
Markov chains
Simple example: Two states, and transition probabilities between them.
This system models the fact that yesterday’s weather is a reasonable prediction for
today’s weather.
Here is a typical realization:
Hidden Markov model (HMM)
Elaboration of the simple weather model, now as an HMM. The underlying states,
which are considered unobservable, are H and L representing periods of high and
low atmospheric pressure. The observable is the weather (sunny or rainy). Here is
a typical realization of this HMM:
H H H H H H H H H L L L H H H H H L L L L H H
Hidden Markov models
A hidden Markov model (HMM) is a Markov chain X(t), and a stochastic variable Y(t) representing “emissions”
or “observations”.
X(t) has the usual Markov property. Y(t) is only dependent on X(t) (more precisely, Y(t) is conditionally
independent of all other Y(s) given X(t).) These dependencies are summarized in this picture:
(wikipedia “Hidden Markov model”)
The state X(t) is considered unobservable; the only observables are the Y(t).
(Note: The Y(t) may also emit the “empty” symbol, denoted ε. If these are used, they are simply left out of
the emitted symbol sequence. Only this symbol sequence is considered to be observable; this then also
makes the time t unobservable, since it is unknown how many empty symbols have been output).
Usually the Markov chain starts at t=0, and in a particular state, which is called the ‘start’ state. Often one of
the states in the Markov chain is labeled ‘end’ state, and the process stops when it reaches this state. In
this way HMMs can be made to produce finite symbol sequences.
Hidden Markov models
The unobserved Markov chain is defined as usual: a number of states {1,2,...,N}, and a transition matrix A.
The Y(t) are a stochastic variable whose discrete state space is called the emission alphabet. A probability
distribution P( Y=y | X=x ) determines the emissions conditional on the state of X(t).
HMMs are generative models: they specify how a sequence of observations is generated randomly. This can be
useful for simulations. Their power however derives from the existence of a number of efficient algorithms that
compute the answer to important questions relating to HMMs.
Questions you may want to ask include:
(i) What is the most likely state path X(t) generating a given observed sequence Y?
(ii) What is the likelihood of a given model generating a sequence Y?
(iii) What is the posterior probability that a symbol Yi from the sequence Y was generated by some state X?
(iv) For a given sequence Y, what matrix A and emission probabilities P( Y=y | X=x ) maximize the likelihood of Y?
These questions are solved by (i) the Viterbi algorithm, (ii) the Forward or Backward algorithms, (iii) the Forward and
Backward algorithms combined, and (iv) the Baum-Welch algorithm.
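
Before turning to these algorithms, here is a minimal sketch of the generative view described above: it samples a state path and an emitted symbol sequence from a small HMM with explicit ‘start’ and ‘end’ states. All state names, symbols and probabilities are illustrative toy values modelled on the weather example.

import random

transitions = {                      # A_sx = P( next state = x | current state = s )
    "start": {"H": 0.5, "L": 0.5},
    "H":     {"H": 0.6, "L": 0.3, "end": 0.1},
    "L":     {"H": 0.3, "L": 0.6, "end": 0.1},
}
emissions = {                        # P( Y = y | X = x )
    "H": {"sunny": 0.8, "rainy": 0.2},
    "L": {"sunny": 0.3, "rainy": 0.7},
}

def sample_sequence():
    state, states, symbols = "start", [], []
    while True:
        nxt = random.choices(list(transitions[state]),
                             weights=list(transitions[state].values()))[0]
        if nxt == "end":
            return states, symbols
        # Convention used in these slides: a state emits as soon as it is entered.
        sym = random.choices(list(emissions[nxt]),
                             weights=list(emissions[nxt].values()))[0]
        states.append(nxt)
        symbols.append(sym)
        state = nxt

print(sample_sequence())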
Profile HMMs
Profile HMMs are used to model DNA sequence motifs in longer DNA sequences. Examples of sequence
motifs are DNA binding motifs for transcription factors, or protein domains (in which case amino acids
can also be used as alphabet).
Profile HMMs
Adding the “x” states (which emit nucleotides according to the background distribution) allows the motif to
be anywhere inside a longer DNA sequence. The black dots are silent states (states that do not emit
symbols).
Profile HMMs
Adding more silent states allows letters in the motif to be skipped (with a certain probability), to model e.g.
protein domains with slightly variable lengths.
Profile HMMs
Similarly, “insert” states allow the insertion of, in principle, any length of sequence somewhere within the
motif.
Profile HMMs
This is the profile HMM
architecture used in the
HMMER and SAM packages.
An extensive list of HMMs for
protein domains can be found
in the Pfam database.
(Sean Eddy, HMMER, Washington U.; David Haussler, SAM, UCSC)
A loopback transition allows multiple motifs to occur in a single sequence (but note that in this model, the
sequence always includes at least one motif. Using this model to find motifs will always get you one).
Andrew Viterbi (1935-), Italian-American engineer
Viterbi algorithm
This algorithm determines the most likely state path conditional on producing the
observed sequence Y. Straightforward computation would involve examining all
possible paths, which is virtually impossible. The Viterbi algorithm solves the
problem efficiently, using dynamic programming, in a way that is very similar to
the Needleman-Wunsch algorithm (except that this one is in 1 dimension – but see later)
The intermediate result that can be computed recursively is
v(i,x) := the probability of the most probable path that begins in the ‘start’ state,
emits the first i symbols of Y, and ends up in state x.
Suppose for convenience that all states except the start state emit a symbol. Then
v(0,start)=1, v(0,x)=0 for other states x.
For i>0, v can be computed recursively:
v(i,x) = max_s { v(i-1,s) * A_sx * P( Y=Yi | X=x ) }      (*)
The factor v(i-1,s) has already been computed. After all v(i,x) have been computed, a traceback procedure recovers
the most probable path.
(Note 1: The convention has been taken that a state emits its corresponding symbol as soon as it is entered. One can also choose to
emit once a state is left. Further, emissions may be assigned to transitions rather than states; this is called a “Mealy model” or
machine; the conventional choice is called a “Moore model”.)
(Note 2: If some states x do not emit a symbol, the factor v(i-1,s) must be replaced by v(i,s) in (*). This makes the formula self-referential.
For the Viterbi algorithm, the computation (*) can simply be carried out a number of times, until none of the terms
changes any further. Depending on the structure of the Markov chain, computing the v(i,x) in a suitable order may decrease the
number of iterations required.)
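
As a concrete single-sequence illustration of recursion (*), here is a minimal Viterbi sketch for a two-state weather-like HMM; the transition and emission probabilities are illustrative toy values, and instead of an explicit ‘end’ state the best final state is simply taken.

import numpy as np

states = ["H", "L"]                      # hidden states: high / low pressure
A = np.array([[0.7, 0.3],                # A[s, x] = P( X(t+1)=x | X(t)=s )
              [0.4, 0.6]])
E = np.array([[0.8, 0.2],                # E[x, y] = P( Y=y | X=x ), y: 0 = sunny, 1 = rainy
              [0.3, 0.7]])
start = np.array([0.5, 0.5])             # transition probabilities out of 'start'

def viterbi(y):
    n, N = len(y), len(states)
    v = np.zeros((n, N))                 # v[i, x]: probability of the best path emitting y[:i+1], ending in x
    ptr = np.zeros((n, N), dtype=int)    # best predecessor state, used by the traceback
    v[0] = start * E[:, y[0]]
    for i in range(1, n):
        for x in range(N):
            cand = v[i - 1] * A[:, x] * E[x, y[i]]   # v(i-1,s) * A_sx * P(Y=y_i | X=x) for all s
            ptr[i, x] = np.argmax(cand)
            v[i, x] = cand[ptr[i, x]]
    path = [int(np.argmax(v[-1]))]       # traceback from the best final state
    for i in range(n - 1, 0, -1):
        path.append(int(ptr[i, path[-1]]))
    return [states[x] for x in reversed(path)]

print(viterbi([0, 0, 1, 1, 0]))          # e.g. sunny, sunny, rainy, rainy, sunny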
Pair HMMs
Up until now, the HMMs emitted a single
sequence of symbols. A straightforward
generalization allows the emission of two
symbol sequences at the same time. Such
HMMs are called pair HMMs.
Each state emits one symbol to each sequence.
As before, either or both may be the empty symbol (written — instead of ε).
Pair HMMs can be used to model sequence
alignments. Simultaneously emitted
nucleotides correspond to homologous pairs,
and are correlated because of their shared
ancestry. States that emit nucleotides to
only one sequence model gaps in the other
sequence.
Viterbi for pair HMMs
To show what changes going from standard to pair HMMs, here is the Viterbi algorithm for pair HMMs. The
observable sequences are denoted Y and Z.
The probabilities to be computed are v(i,j,x), and represent the probability of the most probable path that
begins in the ‘start’ state, emits the first i symbols from Y and the first j symbols from Z, and ends up in
state x.
The recursion is:
v(i,j,x) = max_s { v(i-p, j-q, s) * A_sx * P(Y=Yi, Z=Zj | X=x) }
Here, p=1 if x emits into Y, 0 otherwise; similarly q=1 if x emits into Z, 0 otherwise. If x does not output to both
sequences, the emission probability should change accordingly (e.g. P(Y=Yi|X=x) for emission into Y only).
The maximum path probability can be read off from v(length(Y),length(Z),’end’); the actual path can be
obtained by backtracing.
For simplicity I assume that none of the states (except for ‘start’) is silent, i.e. they all emit into at least one
sequence. If not, the recursion again becomes self-referential, and the same remarks as with the 1-dimensional Viterbi algorithm apply.
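
For concreteness, here is a minimal sketch of this pair-HMM Viterbi recursion for a three-state pair HMM (a match state M emitting an aligned pair, and states IY and IZ emitting into only one sequence), working directly in probability space with illustrative toy parameters; only the maximum path probability is computed, and a traceback (as in the 1-dimensional case) would recover the alignment itself.

import numpy as np

START, M, IY, IZ, END = range(5)
A = np.array([                      # A[s, x] = P( next state = x | current state = s )
    [0.0, 0.8, 0.10, 0.10, 0.0],    # from start
    [0.0, 0.8, 0.05, 0.05, 0.1],    # from M
    [0.0, 0.7, 0.20, 0.00, 0.1],    # from IY
    [0.0, 0.7, 0.00, 0.20, 0.1],    # from IZ
    [0.0, 0.0, 0.00, 0.00, 0.0],    # end is absorbing
])

def p_match(a, b):                  # joint emission probability of state M (toy values, sums to 1)
    return 0.2 if a == b else 0.05 / 3

def q(a):                           # single-symbol emission probability of IY / IZ
    return 0.25

def pair_viterbi(Y, Z):
    n, m = len(Y), len(Z)
    v = np.zeros((n + 1, m + 1, 5))          # v[i, j, x]: best path emitting Y[:i], Z[:j], ending in x
    v[0, 0, START] = 1.0
    for i in range(n + 1):
        for j in range(m + 1):
            if i > 0 and j > 0:              # M emits (Y_i, Z_j)
                v[i, j, M] = p_match(Y[i-1], Z[j-1]) * np.max(v[i-1, j-1] * A[:, M])
            if i > 0:                        # IY emits Y_i only (gap in Z)
                v[i, j, IY] = q(Y[i-1]) * np.max(v[i-1, j] * A[:, IY])
            if j > 0:                        # IZ emits Z_j only (gap in Y)
                v[i, j, IZ] = q(Z[j-1]) * np.max(v[i, j-1] * A[:, IZ])
    return np.max(v[n, m] * A[:, END])       # best path probability, ending in the 'end' state

print(pair_viterbi("CGACATTAA", "CGACGTTAACG"))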
Viterbi for pair HMMs
The Viterbi algorithm computes the single path {X(0),X(1),...,X(t)} through the Markov
chain that has the highest probability of simultaneously emitting the two observed
sequences, denoted Y and Z. The path encodes which nucleotides from Y and Z
were emitted simultaneously, i.e. it encodes the alignment.
Symbolically, the Viterbi algorithm first computes
max_X P( X, Y, Z )
and then the traceback part computes
arg max_X P( X, Y, Z ) = arg max_X P( X | Y, Z )
The last equation holds because P(X|Y,Z) = P(X,Y,Z) / P(Y,Z), and because the
denominator does not depend on X, the full and conditional probabilities reach
their maxima for the same path X.
Alignments: Best scoring / most likely path
[Figure: the single best-scoring / most likely path (Needleman-Wunsch, Smith-Waterman, Viterbi) through the alignment matrix of the example sequences ATCATCTGCAGT and ACCGTTCACAATGGAT.]
Likelihood: the Forward algorithm
The Viterbi algorithm is very similar to the Needleman-Wunsch algorithm (the states
playing the role of index to the three tables mentioned briefly); it has the same
time complexity, and indeed for the three-state HMM of the example, you can give
a score matrix and gap penalties that make them compute identical alignments.
The HMM approach has the important advantage over score-based approaches such
as N-W that it is fairly straightforward to determine the model parameters if you
know how the sequences evolve, i.e. if you know the evolutionary model
(drawback 2 of score-based aligners).
Drawback 1 concerned determining the model parameters (or the evolutionary
model) from sequences alone. The HMM approach also has a method to do this,
by maximum likelihood parameter estimators. For this, it is necessary to be able
to calculate the likelihood. This involves considering (summing over) all possible
paths, instead of just the most likely one. The Forward algorithm does this.
The Forward algorithm (for pair HMMs)
The algorithm is very similar to the Viterbi algorithm (without traceback). The probabilities to be computed are
slightly different: w(i,j,x) now represent the sum of probabilities of paths through the Markov chain that
emit the first i symbols from Y and the first j symbols from Z, and end up in state x. The final result which
appears in the table entry w(length(Y),length(Z),’end’) is the likelihood of the sequences Y and Z.
Symbolically, the Forward algorithm computes
P(Y, Z) = X P( X, Y, Z )
The recursion is:
w(i,j,x) = Σ_s { w(i-p, j-q, s) * A_sx * P(Y=Yi, Z=Zj | X=x) }
Here, p and q have the same definitions as before, and the same caveats-for-the-sake-of-simplicity apply.
Note: Dealing with silent states is more involved than with Viterbi. As before, silent states (where neither Y nor
Z emits a symbol) create self-references in the recursion. In this case they are linear equations, which in
general must be solved by a matrix inversion. In practice, most situations involve 1x1 matrices, which of
course can be dealt with by a simple division.
Note: the numbers w(i,j,x) are not conditional probabilities; they are not even true probabilities, as in some
cases they can be >1.
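
For comparison, here is a minimal sketch of the Forward recursion for an ordinary single-sequence HMM, with the same illustrative toy parameters as the earlier Viterbi sketch; the pair-HMM version is the same computation with the table indexed by (i, j, state) instead of (i, state).

import numpy as np

A = np.array([[0.7, 0.3],            # A[s, x] = P( X(t+1)=x | X(t)=s )
              [0.4, 0.6]])
E = np.array([[0.8, 0.2],            # E[x, y] = P( Y=y | X=x )
              [0.3, 0.7]])
start = np.array([0.5, 0.5])         # transition probabilities out of 'start'

def forward_likelihood(y):
    # Returns P(Y) = sum over all state paths of P(path, Y)
    w = start * E[:, y[0]]           # w[x]: sum of path probabilities emitting y[:1], ending in x
    for obs in y[1:]:
        w = (w @ A) * E[:, obs]      # sum_s w[s] * A[s, x] * P(obs | x)
    return w.sum()                   # summing over the final state plays the role of the 'end' state

print(forward_likelihood([0, 0, 1, 1, 0]))   # symbols encoded as 0 = sunny, 1 = rainy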
The Backward algorithm (for pair HMMs)
The Backward algorithm computes the same thing as the Forward algorithm: the likelihood that the HMM
emits the sequences Y and Z; symbolically P(Y, Z) = Σ_X P( X, Y, Z ). So why bother? The answer is that the
table of intermediate results are different, and the Forward and Backward tables can be used together to
compute other interesting quantities, such as posterior probabilities.
The algorithm is very similar to the Forward algorithm, but starts the recursion at the bottom-right end of the
table, working its way (backwards) to the top-left.
The quantities u now represent the conditional probability of the model emitting the symbols i+1,i+2,... of Y
and j+1,j+2,... of Z, using paths that start in x and end up in the ‘end’ state. The emission associated with x
is not included in this likelihood.
Initialization is u(length(Y),length(Z),’end’)=1, and the recursion is:
u(i,j,x) = Σ_s { u(i+p_s, j+q_s, s) * A_xs * P(Y=Yi+1, Z=Zj+1 | X=s) }
Here, p_s and q_s have the same definitions as before, but now refer to the state s rather than x.
Note: Dealing with silent states is more involved than with Viterbi. As before, silent states (where neither Y nor
Z emits a symbol) create self-references in the recursion. In this case they are linear equations, which in
general must be solved by a matrix inversion. In practice, most situations involve 1x1 matrices, which can
be dealt with by a simple division.
The Backward algorithm &
Sampling from the posterior
Recall that the quantities u computed by the Backward algorithm represent the conditional probability of the
model emitting the symbols i+1,i+2,... of Y and j+1,j+2,... of Z, using paths that start in x and end up in the
‘end’ state.
In other words, u(i,j,x) gives the probability that the model will produce the remainder of Y and Z, conditional
on starting in state x. These probabilities can be used to sample from the posterior distribution of paths
that generate Y and Z:
x ← ‘start’, i ← 0, j ← 0
While x is not ‘end’:
    Compute r_s = u(i+p_s, j+q_s, s) * A_xs * P(Y=Yi+1, Z=Zj+1 | X=s) for all states s
    Normalize the r_s so that they sum to 1, and draw a state s from this distribution
    x ← s, i ← i+p_s, j ← j+q_s
Alignments: Posterior probabilities
[Figure: posterior alignment probabilities for the example sequences ATCATCTGCAGT and ACCGTTCACAATGGAT (Durbin, Eddy, Krogh, Mitchison 1998).]
Posterior probabilities
Using the table of intermediate results w(i,j,x) and u(i,j,x) computed by the Forward and Backward algorithms,
it is possible to compute certain posterior probabilities.
For example, let’s denote by [i j x] the event that the HMM emitted the symbols Yi and Zj from state x
(assuming that x emits into both sequences; an alternative interpretation can be formulated when x does
not.) The probability of this event, and the HMM producing the sequences Y and Z, is
P(Y, Z, [i j x]) = w(i,j,x) * u(i,j,x)
Since the likelihood of emitting Y and Z is
P(Y, Z) = w(length(Y),length(Z),’end’) = u(0,0,’start’),
the probability of the event [Yi Zj x] conditional on observing Y and Z is
P([i j x] | Y, Z ) = w(i,j,x) * u(i,j,x) / u(0,0,’start’)
If the state x represents the “aligned” state in an alignment HMM, this would be the posterior probability
that Yi and Zj are aligned.
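
In the single-sequence setting the same combination of Forward and Backward tables is easy to write down explicitly; the sketch below (illustrative toy parameters again) fills the Forward table w and the Backward table u and forms their product divided by the likelihood, in direct analogy with w(i,j,x) * u(i,j,x) / P(Y, Z) above.

import numpy as np

A = np.array([[0.7, 0.3], [0.4, 0.6]])       # transition matrix
E = np.array([[0.8, 0.2], [0.3, 0.7]])       # emission probabilities
start = np.array([0.5, 0.5])                 # transition probabilities out of 'start'

def posteriors(y):
    n, N = len(y), A.shape[0]
    w = np.zeros((n, N))                     # Forward table
    u = np.zeros((n, N))                     # Backward table
    w[0] = start * E[:, y[0]]
    for i in range(1, n):
        w[i] = (w[i - 1] @ A) * E[:, y[i]]
    u[-1] = 1.0                              # the emission at position i is not included in u
    for i in range(n - 2, -1, -1):
        u[i] = A @ (E[:, y[i + 1]] * u[i + 1])
    likelihood = w[-1].sum()                 # P(Y), the analogue of u(0,0,'start')
    return w * u / likelihood                # posterior P( state x at position i | Y )

post = posteriors([0, 0, 1, 1, 0])
print(post)                                  # each row sums to 1: a distribution over states per position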
Posterior probabilities
[Figure: two example pairwise DNA alignments, with the posterior probability (from 0 to 1) of each alignment column plotted above them.]
Posterior probabilities (*)
As another example, let’s consider the event that the HMM transitioned from state x to s just after emitting the
symbols Yi and Zj (from state x, assuming it emits into both sequences; again, this is not essential.) Denote
this event by [i j x s]. The probability of this event, and the HMM producing the sequences Y and Z, is
P(Y, Z, [i j x s]) = w(i,j,x) * A_xs * P(Y=Yi+1, Z=Zj+1 | X=s) * u(i+p,j+q,s)
where again p and q refer to the emission of state s. Dividing by the total likelihood of observing the
sequences Y and Z results in the conditional probability of [i j x s]:
P( [i j x s] | Y, Z ) = w(i,j,x) * A_xs * P(Y=Yi+1, Z=Zj+1 | X=s) * u(i+p,j+q,s) / u(0,0,’start’)
Conditional on the sequences Y and Z, and the position within these sequences, this is the expected
number of times that the transition x->s was taken by the HMM. Now we can marginalize over the
position variables i and j:
E( number of times that the transition x→s was taken | emitting Y, Z ) =
Σ_{i,j} w(i,j,x) * A_xs * P(Y=Yi+1, Z=Zj+1 | X=s) * u(i+p,j+q,s) / u(0,0,’start’)
The conditional expected number of times that particular symbols were emitted from particular states can be
computed in a very similar way.
Posteriors: Good predictors of accuracy
Posterior Decoding
The good correspondence between posterior probabilities and the actual frequencies with which particular states were
visited suggests that posteriors may be interpreted as the “reliability” of any particular prediction. The
idea of Posterior Decoding is to use these posteriors to construct alignments. PD is not a single algorithm,
but a set of closely related procedures.
Generally, P.D. maximizes some overall measure of the posterior probability over a set of paths.
- The paths can consist of HMM states, or of alignment columns.
- The overall measure can be a (weighted) sum of posteriors, or the product of posteriors.
As before, this is accomplished by a dynamic programming algorithm.
For alignments, the product-of-posteriors measure turned out to work best.
PD is a heuristic procedure; in contrast to e.g. the Viterbi algorithm, there is no direct (probabilistic)
interpretation of the alignment. For instance, when HMM states are used to build paths, the resulting PD
path may not be a valid path through the HMM.
Posterior decoding
[Figure: a pair HMM with states labelled a, b, c, d, and e.]
Paths: alignment columns.
Posteriors: sum of posteriors for all states contributing to a column.
E.g. the posterior for Yi and Zj to be aligned is
P([i j e] | Y, Z)
The posterior for Yi to be involved in a gap is
Σ_j P([i j a] or [i j b] or [i j c] or [i j d] | Y, Z)
To emphasize the marginalization of the gap position j within the second
sequence, the resulting algorithm is called Marginalized Posterior Decoding.
Posterior Decoding is better than Viterbi
Posterior decoding is better than score-based alignment
Summary
• HMMs are a powerful framework to model sequential data with randomness
• Modelling alignments using HMMs provides a connection between alignment models and models of sequence evolution
• Provides a practical and conceptual framework to deal with uncertainties
• Posterior decoding is more accurate and less biased than either score-based approaches or Viterbi
• The Baum-Welch algorithm (not described) allows efficient parameter estimation from unaligned sequences