Expected accuracy sequence alignment
Usman Roshan

Expected accuracy alignment
• The dynamic programming formulation allows us to find the optimal alignment defined by a scoring matrix and gap penalties. This alignment is not necessarily the most “accurate” or biologically informative one.
• We now look at a different formulation of alignment that allows us to compute the most accurate alignment instead of the optimal one.

Posterior probability of x_i aligned to y_j
• Let A be the set of all alignments of sequences x and y, and define P(a | x, y) to be the probability that alignment a (of x and y) is the true alignment a*.
• We define the posterior probability of the ith residue of x (x_i) aligning to the jth residue of y (y_j) in the true alignment a* of x and y as

$$P(x_i \sim y_j \in a^* \mid x, y) = \sum_{a \in A} P(a \mid x, y)\, \mathbf{1}\{x_i \sim y_j \in a\}$$

(Do et al., Genome Research, 2005)

Expected accuracy of alignment
• We can define the expected accuracy of an alignment a as the sum of the posterior probabilities of its aligned pairs, normalized by the length of the shorter sequence:

$$\mathrm{E}_{a^*}\left[\mathrm{accuracy}(a, a^*)\right] = \frac{1}{\min(|x|, |y|)} \sum_{x_i \sim y_j \in a} P(x_i \sim y_j \in a^* \mid x, y)$$

(Do et al., Genome Research, 2005)
• The maximum expected accuracy alignment can be obtained by the same dynamic programming algorithm, with posterior probabilities in place of substitution scores:

$$V(i, j) = \max \begin{cases} V(i-1, j-1) + P(x_i \sim y_j) \\ V(i-1, j) \\ V(i, j-1) \end{cases}$$

Example for expected accuracy
• True alignment:
  AC_CG
  ACCCA
  Expected accuracy = (1 + 1 + 0 + 1 + 1)/4 = 1
• Estimated alignment:
  ACC_G
  ACCCA
  Expected accuracy = (1 + 1 + 0.1 + 0 + 1)/4 ≈ 0.78

Estimating posterior probabilities
• If the correct posterior probabilities can be computed, then we can compute the correct alignment. It remains to estimate these probabilities from the data.
• PROBCONS (Do et al., Genome Research 2005): estimate probabilities from pairwise HMMs using forward and backward recursions (as defined in Durbin et al. 1998).
• Probalign (Roshan and Livesay, Bioinformatics 2006): estimate probabilities using partition function dynamic programming matrices.

Posterior probabilities from HMM
• We need to sum the probabilities of all alignments where x_i is aligned to y_j.
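The sum described above can be made concrete by brute force for very short sequences. The sketch below enumerates every alignment, weights each one, and adds up the normalized weights of the alignments that contain a given pair. It is only an illustration: the weighting P(a | x, y) ∝ exp(S(a)/T) with toy match, mismatch, and gap scores is an assumption made here, and the function names `enumerate_alignments` and `brute_force_posteriors` are hypothetical. The number of alignments grows exponentially, which is why the slides turn to dynamic programming.

```python
import math


def enumerate_alignments(n, m):
    """Yield every alignment of sequences of lengths n and m as a list of columns.

    A column is (i, j) for x_i aligned to y_j, (i, None) for x_i against a gap,
    and (None, j) for a gap against y_j. Indices are 0-based.
    """
    def extend(i, j, cols):
        if i == n and j == m:
            yield list(cols)
            return
        if i < n and j < m:
            yield from extend(i + 1, j + 1, cols + [(i, j)])
        if i < n:
            yield from extend(i + 1, j, cols + [(i, None)])
        if j < m:
            yield from extend(i, j + 1, cols + [(None, j)])

    yield from extend(0, 0, [])


def brute_force_posteriors(x, y, match=1.0, mismatch=-1.0, gap=-1.0, T=1.0):
    """Return post[i][j] = sum of P(a | x, y) over alignments a containing x_i ~ y_j.

    Assumption: P(a | x, y) is proportional to exp(S(a) / T), where S(a) is a
    toy score built from the match, mismatch, and gap parameters.
    """
    weighted = []
    for a in enumerate_alignments(len(x), len(y)):
        score = 0.0
        for i, j in a:
            if i is None or j is None:
                score += gap
            else:
                score += match if x[i] == y[j] else mismatch
        weighted.append((a, math.exp(score / T)))

    total = sum(w for _, w in weighted)  # normalizing constant over all alignments
    post = [[0.0] * len(y) for _ in x]
    for a, w in weighted:
        for i, j in a:
            if i is not None and j is not None:
                post[i][j] += w / total  # the indicator 1{x_i ~ y_j in a} is true here
    return post
```

Calling `brute_force_posteriors("AC", "ACC")` returns a 2 × 3 matrix whose rows each sum to at most 1, since every residue of x is aligned to at most one residue of y in any alignment.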
Formally, we want:

$$\Pr(\text{all alignments of } x \text{ and } y \text{ such that } x_i \text{ is aligned to } y_j) = \Pr(x_i \text{ aligned to } y_j \mid \text{alignments of } x \text{ and } y) = \frac{\Pr(\text{alignments of } x \text{ and } y \text{ and } x_i \text{ aligned to } y_j)}{\Pr(\text{alignments of } x \text{ and } y)}$$

Forward and backward probabilities
• Define the forward probability f_k(i) as the probability of emitting x_1 x_2 … x_i and being in hidden state k at position i.
• Similarly, define the backward probability b_k(i) as the probability of emitting x_{i+1} x_{i+2} … x_n given that the ith hidden state is k.
• Both f_k(i) and b_k(i) can be computed quickly by dynamic programming (see HMM lecture notes pages 9 to 11).
• Once the forward and backward probabilities are computed, we can calculate

$$\Pr(x_i \sim y_j) = \frac{\Pr(\text{alignments of } x \text{ and } y \text{ and } x_i \text{ aligned to } y_j)}{\Pr(\text{alignments of } x \text{ and } y)} = \frac{f_M(i, j)\, b_M(i, j)}{f(|x|, |y|)}$$

Partition function posterior probabilities
• Standard alignment score:

$$S(a) = T \sum_{(i,j) \in a} \ln\frac{M_{ij}}{f_i f_j} + (\text{gap penalties})$$

• Probability of alignment (Miyazawa, Prot. Eng. 1995):

$$P(a) \propto e^{S(a)/T}$$

• If we knew the alignment partition function Z(T), then

$$P(a, T) = \frac{e^{S(a)/T}}{Z(T)}$$

• Alignment partition function (Miyazawa, Prot. Eng. 1995):

$$Z(T) = \sum_{a \in A} e^{S(a)/T}$$

• Subsequently

$$Z^M_{i,j} = \sum_{a \in A_{ij}} e^{S_{ij}(a)/T} = \left( \sum_{a \in A_{i-1,j-1}} e^{S_{i-1,j-1}(a)/T} \right) e^{s(x_i, y_j)/T}$$

• More generally, the forward partition function matrices are calculated as

$$\begin{aligned}
Z^M_{i,j} &= \left( Z^M_{i-1,j-1} + Z^E_{i-1,j-1} + Z^F_{i-1,j-1} \right) e^{s(x_i, y_j)/T} \\
Z^E_{i,j} &= Z^M_{i,j-1}\, e^{g/T} + Z^E_{i,j-1}\, e^{ext/T} \\
Z^F_{i,j} &= Z^M_{i-1,j}\, e^{g/T} + Z^F_{i-1,j}\, e^{ext/T} \\
Z_{i,j} &= Z^M_{i,j} + Z^E_{i,j} + Z^F_{i,j}
\end{aligned}$$

Partition function matrices vs. standard affine recursions
• The partition function recursions above parallel the standard affine gap recursions, with each max replaced by a sum and each added score s replaced by a factor e^{s/T}. The standard recursions are:

$$\begin{aligned}
V(i, j) &= \max\{E(i, j),\ F(i, j),\ M(i, j)\} \\
M(i, j) &= V(i-1, j-1) + s(x_i, y_j) \\
E(i, j) &= \max\{E(i, j-1) + ext,\ V(i, j-1) + g\} \\
F(i, j) &= \max\{F(i-1, j) + ext,\ V(i-1, j) + g\}
\end{aligned}$$

Posterior probability calculation
• If we define Z' as the “backward” partition function matrices, then

$$P(x_i \sim y_j) = \frac{Z^M_{i-1,j-1}\, {Z'}^{M}_{i+1,j+1}}{Z}\, e^{s(x_i, y_j)/T}$$

Posterior probabilities using alignment ensembles
• By generating an ensemble A(n, x, y) of n alignments of x and y, we can estimate P(x_i ~ y_j) by counting the number of times x_i is aligned to y_j. Note that this means we are assigning equal weights to all alignments in the ensemble.

$$P(x_i \sim y_j \in a^* \mid x, y) = \sum_{a \in A} P(a \mid x, y)\, \mathbf{1}\{x_i \sim y_j \in a\}$$

Generating ensemble of alignments
• We can use stochastic backtracking (Muckstein et al., Bioinformatics, 2002) to generate a given number of optimal and suboptimal alignments.
• At every step in the traceback we assign a probability to each of the three possible positions.
• This allows us to “sample” alignments from their partition function probability distribution.
• Posterior probabilities estimated from the ensemble turn out to be the same as those calculated using the forward and backward partition function matrices.

Probalign
1. For each pair of sequences (x, y) in the input set:
 – a. Compute the partition function matrices Z(T).
 – b. Estimate the posterior probability matrix P(x_i ~ y_j) for (x, y) by

$$P(x_i \sim y_j) = \frac{Z^M_{i-1,j-1}\, {Z'}^{M}_{i+1,j+1}}{Z}\, e^{s(x_i, y_j)/T}$$

2. Perform the probabilistic consistency transformation and compute a maximal expected accuracy multiple alignment: align sequence profiles along a guide tree and follow by iterative refinement (Do et al.).
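For a single pair of sequences, steps 1a, 1b and the expected accuracy dynamic program can be sketched as follows. This is a simplified illustration, not the Probalign implementation: it assumes one linear gap penalty (a negative score `gap`), so a single forward matrix and a single backward matrix stand in for the separate M, E and F matrices, and it omits the consistency transformation and the multiple alignment stages. The function names and the `score` callback are hypothetical.

```python
import math


def partition_posteriors(x, y, score, gap=-1.0, T=1.0):
    """Posterior match probabilities from forward and backward partition functions.

    Assumption: a linear gap penalty `gap` (negative), so one matrix per
    direction replaces the affine M/E/F matrices. `score(a, b)` returns the
    substitution score of residues a and b.
    """
    n, m = len(x), len(y)
    eg = math.exp(gap / T)

    # Forward: Z[i][j] sums exp(S/T) over all alignments of x[:i] and y[:j].
    Z = [[0.0] * (m + 1) for _ in range(n + 1)]
    Z[0][0] = 1.0
    for i in range(n + 1):
        for j in range(m + 1):
            if i > 0 and j > 0:
                Z[i][j] += Z[i - 1][j - 1] * math.exp(score(x[i - 1], y[j - 1]) / T)
            if i > 0:
                Z[i][j] += Z[i - 1][j] * eg
            if j > 0:
                Z[i][j] += Z[i][j - 1] * eg

    # Backward: B[i][j] sums exp(S/T) over all alignments of x[i:] and y[j:].
    B = [[0.0] * (m + 1) for _ in range(n + 1)]
    B[n][m] = 1.0
    for i in range(n, -1, -1):
        for j in range(m, -1, -1):
            if i < n and j < m:
                B[i][j] += B[i + 1][j + 1] * math.exp(score(x[i], y[j]) / T)
            if i < n:
                B[i][j] += B[i + 1][j] * eg
            if j < m:
                B[i][j] += B[i][j + 1] * eg

    total = Z[n][m]
    # P[i][j]: weight of alignments matching x[i] to y[j], divided by the total.
    return [[Z[i][j] * math.exp(score(x[i], y[j]) / T) * B[i + 1][j + 1] / total
             for j in range(m)] for i in range(n)]


def mea_alignment(P):
    """Maximum expected accuracy DP over a non-empty posterior matrix P.

    Returns the summed posterior of the best alignment and its matched
    (i, j) pairs, using 0-based indices.
    """
    n, m = len(P), len(P[0])
    V = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            V[i][j] = max(V[i - 1][j - 1] + P[i - 1][j - 1], V[i - 1][j], V[i][j - 1])

    # Trace back from the bottom-right corner to recover the matched pairs.
    pairs, i, j = [], n, m
    while i > 0 and j > 0:
        if V[i][j] == V[i - 1][j - 1] + P[i - 1][j - 1]:
            pairs.append((i - 1, j - 1))
            i, j = i - 1, j - 1
        elif V[i][j] == V[i - 1][j]:
            i -= 1
        else:
            j -= 1
    pairs.reverse()
    return V[n][m], pairs
```

A typical call would be `P = partition_posteriors(x, y, score)` followed by `value, pairs = mea_alignment(P)`; dividing `value` by the length of the shorter sequence gives the expected accuracy of the returned alignment. Raising the temperature `T` flattens the posterior matrix, which is why it is treated as a parameter to be tuned.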
The alignment step uses the maximum expected accuracy recursion:

$$V(i, j) = \max \begin{cases} V(i-1, j-1) + P(x_i \sim y_j) \\ V(i-1, j) \\ V(i, j-1) \end{cases}$$

Multiple protein alignment
• Protein sequence alignment: a hard problem for multiple distantly related proteins.
• Several standard protein alignment benchmarks are available: BAliBASE, HOMSTRAD, OXBENCH, PREFAB, and SABMARK.
• Benchmark alignments are based on manual and computational structural alignment of proteins with known structure.

Measure of accuracy
• Sum-of-pairs score: number of correctly aligned pairs divided by the number of pairs in the true alignment. Example:
  AACAGT
  AA__GT

  AACAGT
  AAGT__
  The two leading A-A pairs are correct and the remaining two pairs are incorrect. Acc: 2/4 = 50%
• Column score: number of correctly aligned columns.
• Statistical significance using the Friedman rank test.

Experimental design
• Methods compared:
 – Probalign
 – PROBCONS
 – MUSCLE
 – MAFFT
• Probalign temperature parameter trained on the RV11 subset of BAliBASE 3.0.
• Default (optimized) parameters for the remaining programs.

BAliBASE 3.0
Sum-of-pairs / column score accuracies

Data   Probalign     MAFFT         Probcons      MUSCLE
RV11   69.3 / 45.3   67.1 / 44.6   67.0 / 41.7   59.3 / 35.9
RV12   94.6 / 86.2   93.6 / 83.8   94.1 / 85.5   91.7 / 80.4
RV20   92.6 / 43.9   92.7 / 45.3   91.7 / 40.6   89.2 / 35.1
RV30   85.2 / 56.4   85.6 / 56.9   84.5 / 54.4   80.3 / 38.3
RV40   92.2 / 60.3   92.0 / 59.7   90.3 / 53.2   86.7 / 47.1
RV50   89.3 / 55.2   90.0 / 56.2   89.4 / 57.3   85.7 / 48.7
All    87.6 / 58.9   87.1 / 58.6   86.4 / 55.8   82.5 / 48.5

Friedman rank test P-values (NS: not significant)

Method     RV11      RV12      RV20    RV30      RV40      RV50   All
MAFFT      NS        < 0.005   NS      NS        < 0.005   NS     < 0.005
Probcons   0.049     0.0233    NS      NS        < 0.005   NS     < 0.005
MUSCLE     < 0.005   < 0.005   0.008   < 0.005   < 0.005   NS     < 0.005

Heterogeneous length data I
BAliBASE datasets with maximum length and minimum deviation

Max length / Standard dev.   Probalign     MAFFT         Probcons
500 / 100                    88.4 / 56.6   88.0 / 58.0   86.7 / 51.6
500 / 200                    88.5 / 54.6   87.0 / 51.9   87.2 / 48.9
1000 / 100                   91.4 / 58.1   90.4 / 55.7   89.7 / 51.6
1000 / 200                   90.7 / 55.0   89.3 / 51.4   89.2 / 48.7

BAliBASE datasets with long extensions
Max length / Standard dev.: RV40 1000 / 100 (25), 1000 / 200 (20)
Probalign: 92.7 / 59.3, 93.0 / 57.3
MUSCLE: 81.5 / 42.5, 81.9 / 42.4, 84.3 / 44.1, 83.2 / 42.5
MAFFT, Probcons: 91.0 / 54.8, 90.8 / 52.1, 89.9 / 48.2, 90.6 / 47.6

Heterogeneous length data II
BAliBASE 2.0 reference 6 datasets with maximum length and minimum deviation

Max length / Standard dev.   Probalign     MAFFT         Probcons
500 / 100 (40)               89.1 / 44.9   87.3 / 49.0   87.4 / 38.6
500 / 200 (21)               88.3 / 43.8   85.0 / 46.4   86.7 / 40.0
500 / 300 (9)                95.3 / 61.0   82.6 / 51.3   87.3 / 46.6
500 / 400 (5)                94.6 / 55.0   72.0 / 38.2   79.8 / 38.0
1000 / 100 (15)              90.2 / 43.3   82.4 / 36.9   85.4 / 27.6
1000 / 200 (12)              89.2 / 38.2   79.7 / 32.4   83.6 / 27.7
1000 / 300 (7)               94.5 / 52.8   78.3 / 42.4   83.9 / 34.6
1000 / 400 (5)               94.6 / 55.0   72.0 / 38.2   79.8 / 38.0
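The sum-of-pairs score defined under "Measure of accuracy" can be sketched in a few lines. The version below is a minimal illustration that treats each alignment as a list of equal-length gapped rows; the function names and the `_` gap character are assumptions made here, and the benchmark scoring programs may differ in details such as restricting the score to core regions.

```python
from itertools import combinations


def residue_pairs(rows, gap="_"):
    """Return the set of residue pairs that share a column in an alignment.

    Each pair is (seq_a, pos_a, seq_b, pos_b) with seq_a < seq_b, where pos
    is the 0-based index of the residue within its ungapped sequence.
    """
    if len({len(r) for r in rows}) != 1:
        raise ValueError("all alignment rows must have the same length")
    pairs = set()
    counters = [0] * len(rows)  # next residue index for each sequence
    for column in zip(*rows):
        present = []
        for s, ch in enumerate(column):
            if ch != gap:
                present.append((s, counters[s]))
                counters[s] += 1
        for (sa, pa), (sb, pb) in combinations(present, 2):
            pairs.add((sa, pa, sb, pb))
    return pairs


def sum_of_pairs_score(true_rows, est_rows, gap="_"):
    """Fraction of the true alignment's residue pairs recovered by the estimate."""
    true_pairs = residue_pairs(true_rows, gap)
    est_pairs = residue_pairs(est_rows, gap)
    return len(true_pairs & est_pairs) / len(true_pairs)
```

Taking the first alignment of the slide example as the true one, `sum_of_pairs_score(["AACAGT", "AA__GT"], ["AACAGT", "AAGT__"])` counts two shared pairs out of four, matching the 50% quoted there.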