Pairwise alignment
- Score-based alignments, dynamic programming
- Problems with score-based alignments
- Markov chains, hidden Markov models, examples
- The Viterbi algorithm
- Pair HMMs
- Forward, Backward, and Posterior probabilities
- Posterior Decoding, comparison with score-based alignments

Sequence alignment

To study how nucleotide (or amino acid) sequences evolve, a precise matching between homologous nucleotides in two (homologous) nucleotide sequences first has to be made. "Homology" is an old concept in biology, and means "having a similar form". In evolutionary biology it means, more specifically, having a similar form because of shared ancestry. For example, the skeletons of mammals are homologous in this sense, but eyes have evolved independently a number of times, so they are not homologous (even though they may be very similar in form and function). The word is also used for nucleotide sequences. Here it means sequence similarity because of shared ancestry (although it is often used as a synonym of "sequence similarity"). Ancestry is tricky to determine, especially when sequences are short, highly divergent, or contain many repetitive patterns (because such repeats may have arisen independently). So in practice, sequence similarity is often used as a proxy for homology. Other characteristics, such as the function or 3D structure of the protein encoded by the DNA sequence, are used as well. Hexokinase I (a glucose metabolism enzyme), for instance, is found in mammals, plants, and yeast.

Needleman-Wunsch algorithm

Two sequences are considered homologous if, because of their sequence similarity, they are likely to have evolved from a common ancestor. To quantify this, it is first assumed that during evolution the sequences have been changed by two processes: (1) single-nucleotide substitutions, and (2) short sequence insertions and deletions (indels). Other mutations do occur (e.g. multiple-nucleotide substitutions, duplications, inversions, ...), but they are usually ignored because of their relatively low frequency and/or the technical difficulty of dealing with them. Considering only nucleotide substitutions and indels, the homology between two sequences can be summarized by an alignment. It implies a minimum set of substitutions and indels for that homology.

CGACATTAA--ATAGGCATAGCAGGACCAGATACCAGATCAAAGGCTTCAGGCGCA
CGACGTTAACGATTGGC---GCAGTATCAGATACCCGATCAAAG----CAGACGCA

Historically the first algorithm for calculating alignments is the Needleman-Wunsch algorithm (1970). It finds the alignment that optimizes a score, which penalizes both nucleotide mismatches at aligned positions and gaps. This approach is a form of weighted parsimony. The score function has the following form:

  S = Σ_(aligned pairs n,m) M(u_n, v_m) + Σ_(gaps) h(length)

where M is the score matrix (indexed by nucleotide pairs) and h(k) is the gap penalty for a gap of length k. The latter is almost always taken to be an affine function, h(k) = a + b*(k-1), where a is the gap opening penalty and b is the gap extension penalty.
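To make the scoring scheme concrete, here is a minimal Python sketch that scores an already given alignment under a match/mismatch scheme and the affine gap penalty h(k) = a + b*(k-1). The parameter values are illustrative only (they match the worked example further below, but are not prescribed by the method), and adjacent gap columns in opposite sequences are treated as a single gap for simplicity.

```python
# Minimal sketch: score a given pairwise alignment under a match/mismatch
# scheme and an affine gap penalty h(k) = a + b*(k-1).
# Parameter values are illustrative only.

def gap_penalty(k, a=-10, b=-5):
    """Affine gap penalty for a gap of length k."""
    return a + b * (k - 1)

def alignment_score(row1, row2, match=10, mismatch=-5):
    """Score two equal-length alignment rows containing '-' for gaps."""
    assert len(row1) == len(row2)
    score, gap_len = 0, 0
    for x, y in zip(row1, row2):
        if x == '-' or y == '-':          # inside a gap
            gap_len += 1
        else:
            if gap_len:                   # a gap just ended: add its penalty
                score += gap_penalty(gap_len)
                gap_len = 0
            score += match if x == y else mismatch
    if gap_len:                           # gap running to the end of the alignment
        score += gap_penalty(gap_len)
    return score

print(alignment_score("CGACATTAA--ATA", "CGACGTTAACGATT"))
```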
An alignment can be represented as a path through a table, with the two sequences written along the edges of the table. (figure: alignment path for CGACATTAA--ATA / CGACGTTAACGATT through the table spanned by the two ungapped sequences.)

The problem with determining the top-scoring alignment is that there are very many paths. If we ignore the diagonal steps (i.e. allow only gaps), the number of paths can be computed recursively, by adding the entries above and to the left of a box and writing the result in the box. The numbers you get are Pascal's triangle, i.e. the binomial coefficients. The number of alignments of sequences of length N and M is therefore at least the binomial coefficient (N+M choose N). Simply going through all possible alignments is therefore impossible. (figure: table of path counts for the example sequences, forming Pascal's triangle.)

Needleman-Wunsch is, yet again, an application of dynamic programming. Here, the intermediate problem to solve is the computation of the maximum score S(n,m) of an alignment of the first n nucleotides of sequence u and the first m nucleotides of sequence v. The recursive formula for this score is

  S(n,m) = max{ S(n-1,m-1) + M(u_n, v_m),
                max_(k≥1) [ S(n-k,m) + h(k) ],
                max_(k≥1) [ S(n,m-k) + h(k) ] }

(To keep the formulas clean, the boundary conditions along the first row and column, and the allowed range of k, are not written explicitly.) This algorithm takes O(n^3) time. It can be improved to O(n^2) by using three tables, which collect the scores of alignments that (1) end in an aligned pair, (2) end in a gap in u, and (3) end in a gap in v. The affine form of h allows simple updating of the gap-ending scores using neighbouring values. (figure: partially filled score table for CGACATTAA--ATA / CGACGTTAACGATT, with gap opening = -10, gap extension = -5, match = +10, mismatch = -5.)

With the table filled out, the bottom-right entry gives the score of the best-scoring alignment. By starting in this entry and re-computing the score, you can figure out which term of the max{ } formula gave rise to this score. Go to the corresponding entry and repeat until you arrive at the top-left entry. The path traced out in this way corresponds to the alignment. This algorithm computes the best-scoring global alignment: all nucleotides participate, and gaps at the beginning or end are penalized as much as gaps in the middle. The Smith-Waterman algorithm is a variation of Needleman-Wunsch which does not penalize gaps at the beginning or end; it computes the best-scoring local alignment. (figure: traceback path through the filled-out score table; same parameters as above.)
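As an illustration of the three-table O(n*m) version described above, here is a Python sketch. It computes only the optimal global score (traceback omitted), uses the parameter values from the example figure, and for simplicity does not allow a gap in one sequence to be followed immediately by a gap in the other; treat it as a sketch under these assumptions, not a reference implementation.

```python
# Sketch of the O(n*m) Needleman-Wunsch variant with an affine gap penalty
# h(k) = a + b*(k-1), using the three tables described above:
#   M[i][j] : best score of an alignment of u[:i], v[:j] ending in an aligned pair
#   X[i][j] : best score ending in a gap in u (v[j-1] aligned to a gap)
#   Y[i][j] : best score ending in a gap in v (u[i-1] aligned to a gap)
# Traceback is omitted; parameter values follow the example figure.

NEG = float("-inf")

def nw_affine_score(u, v, match=10, mismatch=-5, gap_open=-10, gap_ext=-5):
    n, m = len(u), len(v)
    M = [[NEG] * (m + 1) for _ in range(n + 1)]
    X = [[NEG] * (m + 1) for _ in range(n + 1)]
    Y = [[NEG] * (m + 1) for _ in range(n + 1)]
    M[0][0] = 0
    # Boundary: alignments that start with a single gap of length i or j.
    for i in range(1, n + 1):
        Y[i][0] = gap_open + gap_ext * (i - 1)
    for j in range(1, m + 1):
        X[0][j] = gap_open + gap_ext * (j - 1)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if u[i - 1] == v[j - 1] else mismatch
            M[i][j] = s + max(M[i - 1][j - 1], X[i - 1][j - 1], Y[i - 1][j - 1])
            # Opening a new gap costs gap_open; extending an existing one costs gap_ext.
            X[i][j] = max(M[i][j - 1] + gap_open, X[i][j - 1] + gap_ext)
            Y[i][j] = max(M[i - 1][j] + gap_open, Y[i - 1][j] + gap_ext)
    return max(M[n][m], X[n][m], Y[n][m])

# The two example sequences from the slides, with gaps removed.
print(nw_affine_score("CGACATTAAATA", "CGACGTTAACGATT"))
```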
Alignments are often wrong

(figures: sensitivity for true homology; fraction of incorrect bases in alignments; biases in alignments: A, gap wander (Holmes & Durbin, J. Comput. Biol. 5, 1998); B and C, gap attraction; D, gap annihilation.)

Problems with score-based alignments

The Needleman-Wunsch algorithm (and its elaborations) optimize a score to arrive at a single preferred alignment. This approach is unsatisfactory for several reasons:
1. It is unclear how to determine the score matrix and gap penalties given only unaligned sequences.
2. Even if the underlying evolutionary process is somehow known (e.g. as substitution and indel rates), it is unclear how the scores and penalties relate to it.
3. It is impossible to recover the "true" alignment with certainty, particularly if the sequences are quite divergent. But how should the alignment's reliability be quantified?
4. The N-W alignments are biased in certain ways, and the approach provides no obvious way to address the problem.
These observations suggest that a probabilistic (or statistical) approach to alignments might be helpful.

Markov chains

A Markov chain is a discrete-time stochastic process X(t) that has the Markov property. "Discrete time" means that t ∈ T, where T is the set of integers {...,-2,-1,0,1,2,...}, a discrete set. The state space may be either discrete or continuous. We'll only consider discrete state spaces here (but continuous state spaces are also useful in biology, e.g. to model the evolution of gene expression levels). The "Markov property" is the property that the future dynamics of X(t), given the present state, is independent of the past states. Symbolically,

  P( X(t+1)=y_(t+1) | X(t)=y_t, X(t-1)=y_(t-1), X(t-2)=y_(t-2), ... ) = P( X(t+1)=y_(t+1) | X(t)=y_t )

This means that the behaviour of X(t) is completely determined by the matrices

  A_pq(t) = P( X(t+1)=q | X(t)=p )

(Strictly speaking, they are matrices only if the state space is finite.) These matrices are called the transition matrices of X(t). They may well depend on t. If they do not, X(t) is called a time-homogeneous Markov chain.

Suppose you know the probabilities that the Markov chain is in a particular state at time t; say you know the vector u with u_a = P( X(t)=a ). Then the probability that X(t+1)=b is

  P( X(t+1)=b ) = Σ_a P( X(t)=a ) * P( X(t+1)=b | X(t)=a ) = Σ_a u_a A_ab = [ u A ]_b    (*)

because the summation in (*) is just the definition of the product of a vector and a matrix. Applying the same formula, with u replaced by uA, some number of times, you get

  P( X(t+n)=b ) = [ u A^n ]_b

Conditioning on X(t)=a means setting u = (0,...,1,...,0), with the 1 at the coordinate corresponding to a:

  P( X(t+n)=b | X(t)=a ) = [ A^n ]_ab

A simple example: two states, and transition probabilities between them. This system models the fact that yesterday's weather is a reasonable prediction for today's weather. (figure: two-state sunny/rainy Markov chain and a typical realization.)

Hidden Markov model (HMM)

An elaboration of the simple weather model, now as an HMM. The underlying states, which are considered unobservable, are H and L, representing periods of high and low atmospheric pressure. The observable is the weather (sunny or rainy). Here is a typical realization of the hidden states of this HMM (the corresponding sunny/rainy observations are shown in the original figure):

  H H H H H H H H H L L L H H H H H L L L L H H

Hidden Markov models

A hidden Markov model (HMM) is a Markov chain X(t) together with a stochastic variable Y(t) representing "emissions" or "observations". X(t) has the usual Markov property. Y(t) depends only on X(t) (more precisely, Y(t) is conditionally independent of all other Y(s) given X(t)). These dependencies are summarized in the usual graphical-model picture (see Wikipedia, "Hidden Markov model"). The state X(t) is considered unobservable; the only observables are the Y(t). (Note: the Y(t) may also emit an "empty" symbol. If empty symbols are used, they are simply left out of the emitted symbol sequence. Only this symbol sequence is considered to be observable; this then also makes the time t unobservable, since it is unknown how many empty symbols have been output.) Usually the Markov chain starts at t=0, and in a particular state, which is called the 'start' state. Often one of the states in the Markov chain is labeled the 'end' state, and the process stops when it reaches this state. In this way HMMs can be made to produce finite symbol sequences.

The unobserved Markov chain is defined as usual: a number of states {1,2,...,N} and a transition matrix A. The Y(t) are stochastic variables whose discrete state space is called the emission alphabet. A probability distribution P( Y=y | X=x ) determines the emissions conditional on the state of X(t). HMMs are generative models: they specify how a sequence of observations is generated randomly. This can be useful for simulations.
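To make the generative view concrete, here is a small Python sketch that simulates the high/low-pressure weather HMM described above. The transition and emission probabilities are made-up illustrative values; the actual numbers on the slide are not reproduced in the text.

```python
import random

# Illustrative two-state weather HMM: hidden states H (high pressure) and
# L (low pressure), observations 'sunny' and 'rainy'.
# All probabilities below are assumed for illustration only.
TRANSITIONS = {'H': {'H': 0.8, 'L': 0.2},
               'L': {'H': 0.3, 'L': 0.7}}
EMISSIONS   = {'H': {'sunny': 0.9, 'rainy': 0.1},
               'L': {'sunny': 0.2, 'rainy': 0.8}}

def draw(dist):
    """Draw a key from a {outcome: probability} dictionary."""
    r, acc = random.random(), 0.0
    for outcome, p in dist.items():
        acc += p
        if r < acc:
            return outcome
    return outcome  # guard against floating-point round-off

def simulate(n_steps, start='H'):
    """Generate n_steps of (hidden state, observation) pairs."""
    states, observations = [], []
    x = start
    for _ in range(n_steps):
        states.append(x)
        observations.append(draw(EMISSIONS[x]))
        x = draw(TRANSITIONS[x])
    return states, observations

if __name__ == "__main__":
    states, obs = simulate(23)
    print(" ".join(states))
    print(" ".join(obs))
```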
The power of HMMs, however, derives from the existence of a number of efficient algorithms that compute the answers to important questions relating to HMMs. Questions you may want to ask include:

(i) What is the most likely state path X(t) generating a given observed sequence Y?
(ii) What is the likelihood of a given model generating a sequence Y?
(iii) What is the posterior probability that a symbol Y_i from the sequence Y was generated by some state X?
(iv) For a given sequence Y, what transition matrix A and emission probabilities P( Y=y | X=x ) maximize the likelihood of Y?

These questions are solved by (i) the Viterbi algorithm, (ii) the Forward or Backward algorithm, (iii) the Forward and Backward algorithms combined, and (iv) the Baum-Welch algorithm.

Profile HMMs

Profile HMMs are used to model DNA sequence motifs in longer DNA sequences. Examples of sequence motifs are DNA binding motifs for transcription factors, or protein domains (in which case amino acids can also be used as the alphabet). Adding "x" states, which emit nucleotides according to the background distribution, allows the motif to be anywhere inside a longer DNA sequence; the black dots in the figures are silent states (states that do not emit symbols). Adding more silent states allows letters in the motif to be skipped (with a certain probability), so as to model, e.g., protein domains with slightly variable lengths. Similarly, "insert" states allow the insertion of, in principle, any length of sequence somewhere within the motif. (figures: successive elaborations of the profile HMM architecture.)

This is the profile HMM architecture used in the HMMER package (Sean Eddy, Washington U.) and the SAM package (David Haussler, UCSC). An extensive list of HMMs for protein domains can be found in the Pfam database. A loopback transition allows multiple motifs to occur in a single sequence (but note that in this model the sequence always includes at least one motif; using this model to find motifs will therefore always get you one).

Viterbi algorithm

(Andrew Viterbi, born 1935, Italian-American engineer.) This algorithm determines the most likely state path conditional on producing the observed sequence Y. Straightforward computation would involve examining all possible paths, which is virtually impossible. The Viterbi algorithm solves the problem efficiently, using dynamic programming, in a way that is very similar to the Needleman-Wunsch algorithm (except that this one is in one dimension, but see later). The intermediate result that can be computed recursively is

  v(i,x) := the probability of the most probable path that begins in the 'start' state, emits the first i symbols of Y, and ends up in state x.

Suppose for convenience that all states except the start state emit a symbol. Then v(0,start)=1 and v(0,x)=0 for all other states x. For i>0, v can be computed recursively:

  v(i,x) = max_s { v(i-1,s) * A_sx * P( Y=Y_i | X=x ) }    (*)

The factor v(i-1,s) has already been computed at this point. After all v(i,x) have been computed, a traceback procedure recovers the most probable path.

(Note 1: The convention has been taken that a state emits its corresponding symbol as soon as it is entered. One can also choose to emit once a state is left. Further, emissions may be assigned to transitions rather than states; this is called a "Mealy model" or machine; the conventional choice is called a "Moore model".)

(Note 2: If some states x do not emit a symbol, the factor v(i-1,s) must be replaced by v(i,s) in (*) for those states. This makes the formula self-referential. For the Viterbi algorithm, the computation (*) can simply be carried out a number of times, until none of the terms changes any further. Depending on the structure of the Markov chain, computing the v(i,x) in a suitable order may decrease the number of iterations required.)
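Here is a minimal Viterbi sketch for an ordinary (single-sequence) HMM without silent states, following recursion (*). It reuses the illustrative weather-HMM parameters from the earlier simulation sketch and assumes an initial distribution rather than a dedicated 'start' state; probabilities are multiplied directly rather than in log space, which is fine for short sequences but would underflow for long ones.

```python
# Viterbi for a simple HMM without silent states: v[i][x] is the probability of
# the most probable path emitting the first i+1 symbols and ending in state x.
# Parameters are the illustrative weather HMM used earlier.
TRANSITIONS = {'H': {'H': 0.8, 'L': 0.2},
               'L': {'H': 0.3, 'L': 0.7}}
EMISSIONS   = {'H': {'sunny': 0.9, 'rainy': 0.1},
               'L': {'sunny': 0.2, 'rainy': 0.8}}
START       = {'H': 0.5, 'L': 0.5}   # assumed initial distribution

def viterbi(observations):
    states = list(TRANSITIONS)
    # v[i][x] = best path probability; ptr[i][x] = best predecessor (for traceback)
    v = [{x: START[x] * EMISSIONS[x][observations[0]] for x in states}]
    ptr = [{}]
    for i in range(1, len(observations)):
        v.append({})
        ptr.append({})
        for x in states:
            best_s = max(states, key=lambda s: v[i - 1][s] * TRANSITIONS[s][x])
            v[i][x] = v[i - 1][best_s] * TRANSITIONS[best_s][x] * EMISSIONS[x][observations[i]]
            ptr[i][x] = best_s
    # Traceback from the most probable final state.
    x = max(states, key=lambda s: v[-1][s])
    path = [x]
    for i in range(len(observations) - 1, 0, -1):
        x = ptr[i][x]
        path.append(x)
    return list(reversed(path)), max(v[-1].values())

print(viterbi(['sunny', 'sunny', 'rainy', 'rainy', 'sunny']))
```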
Pair HMMs

Up until now, the HMMs emitted a single sequence of symbols. A straightforward generalization allows the emission of two symbol sequences at the same time. Such HMMs are called pair HMMs. Each state emits one symbol to each sequence; as before, either or both may be the empty symbol (written "—"). Pair HMMs can be used to model sequence alignments. Simultaneously emitted nucleotides correspond to homologous pairs, and are correlated because of their shared ancestry. States that emit nucleotides to only one sequence model gaps in the other sequence.

Viterbi for pair HMMs

To show what changes in going from standard to pair HMMs, here is the Viterbi algorithm for pair HMMs. The observable sequences are denoted Y and Z. The probabilities to be computed are v(i,j,x); they represent the probability of the most probable path that begins in the 'start' state, emits the first i symbols from Y and the first j symbols from Z, and ends up in state x. The recursion is:

  v(i,j,x) = max_s { v(i-p, j-q, s) * A_sx * P( Y=Y_i, Z=Z_j | X=x ) }

Here, p=1 if x emits into Y and 0 otherwise; similarly, q=1 if x emits into Z and 0 otherwise. If x does not emit into both sequences, the emission probability changes accordingly (e.g. P( Y=Y_i | X=x ) for emission into Y only). The maximum path probability can be read off from v(length(Y), length(Z), 'end'); the actual path can be obtained by backtracing. For simplicity I assume that none of the states (except 'start') is silent, i.e. they all emit into at least one sequence. If not, the recursion again becomes self-referential, and the same remarks as for the one-dimensional Viterbi algorithm apply.

The Viterbi algorithm computes the single path {X(0),X(1),...,X(t)} through the Markov chain that has the highest probability of simultaneously emitting the two observed sequences Y and Z. The path encodes which nucleotides from Y and Z were emitted simultaneously, i.e. it encodes the alignment. Symbolically, the Viterbi algorithm first computes

  max_X P( X, Y, Z )

and then the traceback part computes

  argmax_X P( X, Y, Z ) = argmax_X P( X | Y, Z )

The last equation holds because P( X | Y, Z ) = P( X, Y, Z ) / P( Y, Z ), and because the denominator does not depend on X, the joint and conditional probabilities reach their maxima for the same path X.

Alignments: best-scoring / most likely path

(figure: a single alignment path through the dynamic-programming table for two example sequences; this is what Needleman-Wunsch, Smith-Waterman, and Viterbi produce.)

Likelihood: the Forward algorithm

The Viterbi algorithm is very similar to the Needleman-Wunsch algorithm (the states playing the role of the index to the three tables mentioned briefly above); it has the same time complexity, and indeed for the three-state HMM of the example you can give a score matrix and gap penalties that make the two compute identical alignments. The HMM approach has an important advantage over score-based approaches such as N-W: it is fairly straightforward to determine the model parameters if you know how the sequences evolve, i.e. if you know the evolutionary model (the second drawback of score-based aligners listed above). The first drawback concerned determining the model parameters (or the evolutionary model) from the sequences alone. The HMM approach also has a method for this, namely maximum likelihood parameter estimation. For this, it is necessary to be able to calculate the likelihood, which involves considering (summing over) all possible paths instead of just the most likely one. The Forward algorithm does this.
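Before the pair-HMM version, here is the Forward algorithm for the ordinary single-sequence case, again using the illustrative weather-HMM parameters and assumed initial distribution from the earlier sketches. Replacing the max in the Viterbi sketch by a sum is the only essential change; probabilities are again multiplied directly rather than handled in log space, which only works for short sequences.

```python
# Forward algorithm for a simple HMM without silent states:
# f[x] after step i is P(first i+1 observations, X(i)=x); summing the final
# values gives the likelihood of the whole observation sequence.
TRANSITIONS = {'H': {'H': 0.8, 'L': 0.2},
               'L': {'H': 0.3, 'L': 0.7}}
EMISSIONS   = {'H': {'sunny': 0.9, 'rainy': 0.1},
               'L': {'sunny': 0.2, 'rainy': 0.8}}
START       = {'H': 0.5, 'L': 0.5}   # assumed initial distribution

def forward_likelihood(observations):
    states = list(TRANSITIONS)
    f = {x: START[x] * EMISSIONS[x][observations[0]] for x in states}
    for symbol in observations[1:]:
        f = {x: sum(f[s] * TRANSITIONS[s][x] for s in states) * EMISSIONS[x][symbol]
             for x in states}
    return sum(f.values())   # P(observations), summed over all paths

print(forward_likelihood(['sunny', 'sunny', 'rainy', 'rainy', 'sunny']))
```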
The Forward algorithm (for pair HMMs)

The algorithm is very similar to the Viterbi algorithm (without the traceback). The probabilities to be computed are slightly different: w(i,j,x) now represents the sum of the probabilities of all paths through the Markov chain that emit the first i symbols from Y and the first j symbols from Z, and end up in state x. The final result, which appears in the table entry w(length(Y), length(Z), 'end'), is the likelihood of the sequences Y and Z. Symbolically, the Forward algorithm computes

  P( Y, Z ) = Σ_X P( X, Y, Z )

The recursion is:

  w(i,j,x) = Σ_s { w(i-p, j-q, s) * A_sx * P( Y=Y_i, Z=Z_j | X=x ) }

Here, p and q have the same definitions as before, and the same caveats-for-the-sake-of-simplicity apply.

Note: Dealing with silent states is more involved than with Viterbi. As before, silent states (where neither Y nor Z emits a symbol) create self-references in the recursion. In this case they are linear equations, which in general must be solved by a matrix inversion. In practice, most situations involve 1x1 matrices, which of course can be dealt with by a simple division.

Note: the numbers w(i,j,x) are not conditional probabilities; they are not even true probabilities, as in some cases they can be greater than 1.

The Backward algorithm (for pair HMMs)

The Backward algorithm computes the same thing as the Forward algorithm: the likelihood that the HMM emits the sequences Y and Z, symbolically P( Y, Z ) = Σ_X P( X, Y, Z ). So why bother? The answer is that the tables of intermediate results are different, and the Forward and Backward tables can be used together to compute other interesting quantities, such as posterior probabilities. The algorithm is very similar to the Forward algorithm, but starts the recursion at the bottom-right end of the table, working its way (backwards) to the top-left. The quantities u(i,j,x) now represent the conditional probability of the model emitting the symbols i+1, i+2, ... of Y and j+1, j+2, ... of Z, using paths that start in x and end up in the 'end' state. The emission associated with x itself is not included in this likelihood. The initialization is u(length(Y), length(Z), 'end') = 1, and the recursion is:

  u(i,j,x) = Σ_s { u(i+p_s, j+q_s, s) * A_xs * P( Y=Y_(i+1), Z=Z_(j+1) | X=s ) }

Here, p and q have the same definitions as before, but now refer to the state s rather than x (hence the subscripts).

Note: Dealing with silent states is again more involved than with Viterbi. As before, silent states create self-references in the recursion; these are linear equations, which in general must be solved by a matrix inversion, and which in practice usually involve 1x1 matrices that can be dealt with by a simple division.
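For completeness, here is the Backward recursion in the same single-sequence setting as the Forward sketch above; it computes the same likelihood starting from the other end of the table. The parameters are again the illustrative weather HMM, and the same short-sequence caveat about underflow applies.

```python
# Backward algorithm for a simple HMM without silent states:
# b[i][x] = P(observations after position i | X(i)=x).
# Combining b[0] with the start distribution and the first emission gives the
# same likelihood as the Forward algorithm.
TRANSITIONS = {'H': {'H': 0.8, 'L': 0.2},
               'L': {'H': 0.3, 'L': 0.7}}
EMISSIONS   = {'H': {'sunny': 0.9, 'rainy': 0.1},
               'L': {'sunny': 0.2, 'rainy': 0.8}}
START       = {'H': 0.5, 'L': 0.5}   # assumed initial distribution

def backward_tables(observations):
    states = list(TRANSITIONS)
    n = len(observations)
    b = [dict() for _ in range(n)]
    b[n - 1] = {x: 1.0 for x in states}          # nothing left to emit
    for i in range(n - 2, -1, -1):
        b[i] = {x: sum(TRANSITIONS[x][s] * EMISSIONS[s][observations[i + 1]] * b[i + 1][s]
                       for s in states)
                for x in states}
    return b

obs = ['sunny', 'sunny', 'rainy', 'rainy', 'sunny']
b = backward_tables(obs)
# Likelihood computed backwards; should match forward_likelihood(obs).
print(sum(START[x] * EMISSIONS[x][obs[0]] * b[0][x] for x in TRANSITIONS))
```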
The Backward algorithm & sampling from the posterior

Recall that the quantities u(i,j,x) computed by the Backward algorithm represent the conditional probability of the model emitting the symbols i+1, i+2, ... of Y and j+1, j+2, ... of Z, using paths that start in x and end up in the 'end' state. In other words, u(i,j,x) gives the probability that the model will produce the remainder of Y and Z, conditional on starting in state x. These probabilities can be used to sample from the posterior distribution of paths that generate Y and Z:

  x ← 'start', i ← 0, j ← 0
  While x is not 'end':
    Compute r_s = u(i+p_s, j+q_s, s) * A_xs * P( Y=Y_(i+1), Z=Z_(j+1) | X=s ) for all states s
    Normalize the r_s so that they sum to 1, and draw a state s from this distribution
    x ← s, i ← i+p_s, j ← j+q_s

Alignments: posterior probabilities

(figure, after Durbin, Eddy, Krogh & Mitchison 1998: posterior alignment probabilities displayed over the dynamic-programming table for two example sequences.)

Posterior probabilities

Using the tables of intermediate results w(i,j,x) and u(i,j,x) computed by the Forward and Backward algorithms, it is possible to compute certain posterior probabilities. For example, let [i j x] denote the event that the HMM emitted the symbols Y_i and Z_j from state x (assuming that x emits into both sequences; an alternative interpretation can be formulated when it does not). The probability of this event, jointly with the HMM producing the sequences Y and Z, is

  P( Y, Z, [i j x] ) = w(i,j,x) * u(i,j,x)

Since the likelihood of emitting Y and Z is P( Y, Z ) = w(length(Y), length(Z), 'end') = u(0, 0, 'start'), the probability of the event [i j x] conditional on observing Y and Z is

  P( [i j x] | Y, Z ) = w(i,j,x) * u(i,j,x) / u(0, 0, 'start')

If the state x represents the "aligned" state in an alignment HMM, this is the posterior probability of Y_i and Z_j being aligned.

(figure: posterior probabilities, plotted between 0 and 1, along the columns of two example alignments.)

As another example, consider the event that the HMM transitioned from state x to s just after emitting the symbols Y_i and Z_j (from state x, again assuming that it emits into both sequences; this is not essential). Denote this event by [i j x s]. The probability of this event, jointly with the HMM producing the sequences Y and Z, is

  P( Y, Z, [i j x s] ) = w(i,j,x) * A_xs * P( Y=Y_(i+1), Z=Z_(j+1) | X=s ) * u(i+p, j+q, s)

where again p and q refer to the emission of state s. Dividing by the total likelihood of observing the sequences Y and Z gives the conditional probability of [i j x s]:

  P( [i j x s] | Y, Z ) = w(i,j,x) * A_xs * P( Y=Y_(i+1), Z=Z_(j+1) | X=s ) * u(i+p, j+q, s) / u(0, 0, 'start')

Conditional on the sequences Y and Z, this is the posterior probability that the transition x→s was taken just after position (i,j). Now we can marginalize over the position variables i and j:

  E( number of times the transition x→s was taken | Y, Z )
    = Σ_(i,j) w(i,j,x) * A_xs * P( Y=Y_(i+1), Z=Z_(j+1) | X=s ) * u(i+p, j+q, s) / u(0, 0, 'start')

The conditional expected number of times that particular symbols were emitted from particular states can be computed in a very similar way.
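The same forward/backward combination works in the single-sequence setting, where the posterior probability of being in state x at position i is f(i,x) * b(i,x) / P(Y). The self-contained Python sketch below uses the same illustrative weather-HMM parameters and assumed initial distribution as the earlier sketches; for pair HMMs the only essential change is the extra sequence index.

```python
# Posterior state probabilities for the simple weather HMM:
# P(X(i)=x | observations) = f(i,x) * b(i,x) / P(observations),
# combining the Forward and Backward tables described above.
TRANSITIONS = {'H': {'H': 0.8, 'L': 0.2},
               'L': {'H': 0.3, 'L': 0.7}}
EMISSIONS   = {'H': {'sunny': 0.9, 'rainy': 0.1},
               'L': {'sunny': 0.2, 'rainy': 0.8}}
START       = {'H': 0.5, 'L': 0.5}   # assumed initial distribution

def posterior_state_probabilities(obs):
    states, n = list(TRANSITIONS), len(obs)
    # Forward table: f[i][x] = P(obs[:i+1], X(i)=x)
    f = [dict() for _ in range(n)]
    f[0] = {x: START[x] * EMISSIONS[x][obs[0]] for x in states}
    for i in range(1, n):
        f[i] = {x: sum(f[i - 1][s] * TRANSITIONS[s][x] for s in states) * EMISSIONS[x][obs[i]]
                for x in states}
    # Backward table: b[i][x] = P(obs[i+1:] | X(i)=x)
    b = [dict() for _ in range(n)]
    b[n - 1] = {x: 1.0 for x in states}
    for i in range(n - 2, -1, -1):
        b[i] = {x: sum(TRANSITIONS[x][s] * EMISSIONS[s][obs[i + 1]] * b[i + 1][s] for s in states)
                for x in states}
    likelihood = sum(f[n - 1].values())
    return [{x: f[i][x] * b[i][x] / likelihood for x in states} for i in range(n)]

for i, post in enumerate(posterior_state_probabilities(['sunny', 'sunny', 'rainy', 'rainy', 'sunny'])):
    print(i, {x: round(p, 3) for x, p in post.items()})
```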
Posteriors: good predictors of accuracy

(figure: posterior probabilities compared with the actual frequencies with which predictions are correct.)

Posterior Decoding

The good correspondence between posterior probabilities and the actual frequencies with which particular states were visited suggests that posteriors may be interpreted as the "reliability" of any particular prediction. The idea of Posterior Decoding is to use these posteriors to construct alignments. PD is not a single algorithm, but a set of closely related procedures. Generally, PD maximizes some overall measure of the posterior probability over a set of paths.
- The paths can consist of HMM states, or of alignment columns.
- The overall measure can be a (weighted) sum of posteriors, or the product of posteriors.
As before, this is accomplished by a dynamic programming algorithm. For alignments, the product-of-posteriors measure turned out to work best. PD is a heuristic procedure; in contrast to e.g. the Viterbi algorithm, there is no direct (probabilistic) interpretation of the alignment. For instance, when HMM states are used to build paths, the resulting PD path may not be a valid path through the HMM.

Posterior decoding example (pair HMM with states a, b, c, d, e, where e is the "aligned" state):
- Paths: alignment columns.
- Posteriors: the sum of the posteriors of all states contributing to a column. E.g. the posterior for Y_i and Z_j to be aligned is P( [i j e] | Y, Z ); the posterior for Y_i to be involved in a gap is Σ_j P( [i j a] or [i j b] or [i j c] or [i j d] | Y, Z ).
To emphasize the marginalization over the gap position j within the second sequence, the resulting algorithm is called Marginalized Posterior Decoding.

Alignments: posterior probabilities

(figures, after Durbin, Eddy, Krogh & Mitchison 1998: posterior probabilities over the dynamic-programming table, and the alignments obtained from them.)

Posterior Decoding is better than Viterbi; posterior decoding is also better than score-based alignment. (figures: benchmark comparisons.)

Summary

• HMMs are a powerful framework for modelling sequential data with randomness.
• Modelling alignments using HMMs provides a connection between alignment models and models of sequence evolution.
• HMMs provide a practical and conceptual framework for dealing with uncertainties.
• Posterior decoding is more accurate and less biased than either score-based approaches or Viterbi.
• The Baum-Welch algorithm (not described here) allows efficient parameter estimation from unaligned sequences.