Expected accuracy alignment

Expected accuracy sequence alignment
Usman Roshan
Optimal pairwise alignment
• Sum of pairs (SP) optimization: find the alignment of two
sequences that maximizes the similarity score given an arbitrary
cost matrix. We can find the optimal alignment in O(mn) time
and space using the Needleman-Wunsch algorithm.
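• Recursion:
$$M(i,j) = \max \begin{cases} M(i-1,j-1) + s(x_i, y_j) \\ M(i,j-1) + g \\ M(i-1,j) + g \end{cases}$$
where M(i,j) is the score of the optimal alignment of x1..i and y1..j, s is a substitution scoring matrix, and g is the gap penalty.
• Traceback: starting from M(m,n), follow the choices that attained each maximum back to M(0,0) to recover the alignment.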
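The recursion and traceback can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function name `needleman_wunsch` and the default scores (match=+1, mismatch=0, g=-2) are assumptions chosen for the example.

```python
def needleman_wunsch(x, y, match=1.0, mismatch=0.0, g=-2.0):
    """Global alignment with a linear gap penalty g (illustrative scores)."""
    m, n = len(x), len(y)

    def s(a, b):
        # Toy substitution score standing in for a real scoring matrix.
        return match if a == b else mismatch

    # M[i][j] = best score for aligning x[:i] with y[:j].
    M = [[0.0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        M[i][0] = M[i - 1][0] + g
    for j in range(1, n + 1):
        M[0][j] = M[0][j - 1] + g
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            M[i][j] = max(M[i - 1][j - 1] + s(x[i - 1], y[j - 1]),
                          M[i][j - 1] + g,
                          M[i - 1][j] + g)

    # Traceback: re-derive which of the three cases produced each cell.
    row_x, row_y = [], []
    i, j = m, n
    while i > 0 or j > 0:
        if i > 0 and j > 0 and M[i][j] == M[i - 1][j - 1] + s(x[i - 1], y[j - 1]):
            row_x.append(x[i - 1]); row_y.append(y[j - 1]); i -= 1; j -= 1
        elif i > 0 and (j == 0 or M[i][j] == M[i - 1][j] + g):
            row_x.append(x[i - 1]); row_y.append("-"); i -= 1
        else:
            row_x.append("-"); row_y.append(y[j - 1]); j -= 1
    return M[m][n], "".join(reversed(row_x)), "".join(reversed(row_y))


score, row_x, row_y = needleman_wunsch("ACGT", "AGT")
print(score)
print(row_x)
print(row_y)
```

Both the table and the traceback take O(mn) time; the table is what costs O(mn) space.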
Affine gap penalties
• The affine gap model allows for long insertions in distant proteins by charging a lower penalty for extending a gap than for opening one. We define g as the gap open penalty (first gap position) and ext as the gap extension penalty (each additional gap position).
• Trivial cost matrix: match=+1, mismatch=0, gapopen=-2, gapextension=-0.1
• Alignments scored with this cost matrix:
– ACACCCT
  AC-CT-T
  Score = 0
– ACACCCC
  AC--CTT
  Score = 0.9
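The two scores can be checked by scoring each alignment column by column. The sketch below assumes the trivial cost matrix from the slide; the helper name `affine_score` is hypothetical.

```python
def affine_score(row_x, row_y, match=1.0, mismatch=0.0,
                 gap_open=-2.0, gap_ext=-0.1):
    """Score a given gapped alignment under an affine gap model."""
    assert len(row_x) == len(row_y)
    total = 0.0
    prev_gap = None  # which row ('x' or 'y') had a gap in the previous column
    for a, b in zip(row_x, row_y):
        if a == "-" or b == "-":
            cur_gap = "x" if a == "-" else "y"
            # First position of a gap run pays gap_open, later ones gap_ext.
            total += gap_ext if cur_gap == prev_gap else gap_open
            prev_gap = cur_gap
        else:
            total += match if a == b else mismatch
            prev_gap = None
    return total


print(round(affine_score("ACACCCT", "AC-CT-T"), 2))  # two separate gaps
print(round(affine_score("ACACCCC", "AC--CTT"), 2))  # one gap of length two
```

The first alignment pays the gap open penalty twice (4 matches - 4 = 0), while the second pays it once plus one extension (3 matches - 2.1 = 0.9).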
Affine penalty recursion
M(i,j) denotes the best score over alignments of x1..i and y1..j ending with a match/mismatch. E(i,j) is the best score over alignments of x1..i and y1..j in which yj is paired with a gap, and F(i,j) is defined similarly with xi paired with a gap. The recursion takes O(mn) time, where m and n are the lengths of x and y respectively.
$$V(i,j) = \max\{E(i,j),\ F(i,j),\ M(i,j)\}$$
$$M(i,j) = V(i-1,j-1) + s(x_i, y_j)$$
$$E(i,j) = \max\{E(i,j-1) + \mathrm{ext},\ V(i,j-1) + g\}$$
$$F(i,j) = \max\{F(i-1,j) + \mathrm{ext},\ V(i-1,j) + g\}$$
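A minimal Python sketch of these four recursions follows. The boundary handling (V(0,0)=0, every other undefined cell at minus infinity) and the toy match/mismatch scores are assumptions made for the example, not something the slides specify.

```python
NEG_INF = float("-inf")


def affine_global_score(x, y, match=1.0, mismatch=0.0, g=-2.0, ext=-0.1):
    """Optimal global alignment score with gap open g and gap extension ext."""
    m, n = len(x), len(y)

    def table():
        return [[NEG_INF] * (n + 1) for _ in range(m + 1)]

    M, E, F, V = table(), table(), table(), table()
    V[0][0] = 0.0  # the empty alignment of two empty prefixes
    for i in range(m + 1):
        for j in range(n + 1):
            if i == 0 and j == 0:
                continue
            if i > 0 and j > 0:
                s = match if x[i - 1] == y[j - 1] else mismatch
                M[i][j] = V[i - 1][j - 1] + s
            if j > 0:  # y_j is paired with a gap
                E[i][j] = max(E[i][j - 1] + ext, V[i][j - 1] + g)
            if i > 0:  # x_i is paired with a gap
                F[i][j] = max(F[i - 1][j] + ext, V[i - 1][j] + g)
            V[i][j] = max(E[i][j], F[i][j], M[i][j])
    return V[m][n]


print(round(affine_global_score("ACACCCC", "ACCTT"), 2))
```

For ACACCCC and ACCTT at most three residues can match and at least one gap of length two is needed, so the optimum equals the 0.9 of the second alignment on the previous slide.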
Expected accuracy alignment
• The dynamic programming formulation
allows us to find the optimal alignment
defined by a scoring matrix and gap
penalties. This may not necessarily be the
most “accurate” or biologically informative.
• We now look at a different formulation of
alignment that allows us to compute the
most accurate one instead of the optimal
one.
Posterior probability of xi aligned to yj
• Let A be the set of all alignments of
sequences x and y, and define P(a|x,y) to be
the probability that alignment a (of x and y) is
the true alignment a*.
• We define the posterior probability of the ith
residue of x (xi) aligning to the jth residue of y
(yj) in the true alignment (a*) of x and y as
$$P(x_i \sim y_j \in a^* \mid x, y) = \sum_{a \in A} P(a \mid x, y)\, \mathbf{1}\{x_i \sim y_j \in a\}$$
(Do et al., Genome Research, 2005)
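Expected accuracy of alignment
• We can define the expected accuracy of an alignment a as the sum of the posterior probabilities of its aligned residue pairs, normalized by the length of the shorter sequence:
$$\frac{1}{\min(|x|,|y|)} \sum_{x_i \sim y_j \in a} P(x_i \sim y_j \in a^* \mid x, y)$$
(Do et al., Genome Research, 2005)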
• The maximum expected accuracy alignment can be obtained by the same dynamic programming algorithm:
$$V(i,j) = \max \begin{cases} V(i-1,j-1) + P(x_i \sim y_j) \\ V(i-1,j) \\ V(i,j-1) \end{cases}$$
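The maximum expected accuracy recursion can be sketched as below. The posterior matrix `P` is taken as given (0-based, `P[i][j]` for residue i of x and residue j of y), and the function name and toy values are assumptions for illustration only.

```python
def mea_alignment(P):
    """Maximum expected accuracy alignment from a posterior matrix P.

    Returns the summed posterior of the chosen pairs and the pairs
    themselves as 0-based (i, j) tuples. Gaps carry no score.
    """
    m = len(P)
    n = len(P[0]) if m else 0
    V = [[0.0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            V[i][j] = max(V[i - 1][j - 1] + P[i - 1][j - 1],
                          V[i - 1][j],
                          V[i][j - 1])

    # Traceback over the same three cases.
    pairs = []
    i, j = m, n
    while i > 0 and j > 0:
        if V[i][j] == V[i - 1][j - 1] + P[i - 1][j - 1]:
            pairs.append((i - 1, j - 1)); i -= 1; j -= 1
        elif V[i][j] == V[i - 1][j]:
            i -= 1
        else:
            j -= 1
    pairs.reverse()
    return V[m][n], pairs


# Toy posterior matrix for |x| = 2 and |y| = 3 (made-up values).
P = [[0.9, 0.05, 0.0],
     [0.1, 0.2, 0.7]]
total, pairs = mea_alignment(P)
print(pairs)
```

Dividing `total` by the length of the shorter sequence gives the expected accuracy of the returned alignment.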
Example for expected accuracy
• True alignment
AC_CG
ACCCA
Expected accuracy = (1+1+0+1+1)/4 = 1
• Estimated alignment
ACC_G
ACCCA
Expected accuracy = (1+1+0.1+0+1)/4 = 0.775
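The arithmetic in this example can be reproduced with a short script. The posterior values below are simply the illustrative numbers read off the slide (1 for the pairs of the true alignment, 0.1 for x3~y3); they are not a normalized distribution, and the helper name `expected_accuracy` is hypothetical.

```python
def expected_accuracy(row_x, row_y, posterior, gap="_"):
    """Sum posteriors of aligned pairs, divided by the shorter sequence length.

    posterior maps 1-based (i, j) residue pairs to probabilities;
    pairs that are absent count as 0.
    """
    i = j = 0  # residues of x and y consumed so far
    total = 0.0
    for a, b in zip(row_x, row_y):
        if a != gap and b != gap:
            total += posterior.get((i + 1, j + 1), 0.0)
        if a != gap:
            i += 1
        if b != gap:
            j += 1
    return total / min(i, j)


# Illustrative posteriors taken from the slide's example.
posterior = {(1, 1): 1.0, (2, 2): 1.0, (3, 4): 1.0, (4, 5): 1.0, (3, 3): 0.1}
print(round(expected_accuracy("AC_CG", "ACCCA", posterior), 3))  # true alignment
print(round(expected_accuracy("ACC_G", "ACCCA", posterior), 3))  # estimated alignment
```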
Estimating posterior probabilities
• If the correct posterior probabilities can be computed, then we can compute the correct alignment. It remains to estimate these probabilities from the data.
• PROBCONS (Do et al., Genome Research 2005): estimate probabilities from pairwise HMMs using forward and backward recursions (as defined in Durbin et al. 1998)
• Probalign (Roshan and Livesay, Bioinformatics 2006): estimate probabilities using partition function dynamic programming matrices
HMM posterior probabilities
• Consider the probability of all alignments of
sequences X and Y under a given HMM.
• Let M(i,j) be the sum of the probabilities of all
alignments of X1...i and Y1…j that end in match or
mismatch.
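• Then M(i,j) is given by
$$M(i,j) = p_m\ (\text{or } p_{mm}) \left[ (1-2\delta)\, M(i-1,j-1) + (1-\varepsilon)\, X(i-1,j-1) + (1-\varepsilon)\, Y(i-1,j-1) \right]$$
where p_m (or p_mm) is the emission probability of a match (or mismatch), and δ and ε are the gap-open and gap-extension transition probabilities of the pair HMM, as in Durbin et al. 1998.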
• We calculate X(i,j) and Y(i,j) in the same way.
• We call these forward probabilities:
– f(i,j) = M(i,j)+X(i,j)+Y(i,j)
HMM posterior probabilities
• Similarly we can calculate backward probabilities M’(i,j).
• Define M’(i,j) as the sum of probabilities of all alignments of Xi..m and Yj..n such that Xi and Yj are aligned to each other.
• The indices i and j start from m and n respectively and decrease:
$$M'(i,j) = p_m\ (\text{or } p_{mm}) \left[ (1-2\delta)\, M'(i+1,j+1) + (1-\varepsilon)\, X'(i+1,j+1) + (1-\varepsilon)\, Y'(i+1,j+1) \right]$$
• These are also called backward probabilities:
– b(i,j) = M’(i,j)+X’(i,j)+Y’(i,j)
HMM posterior probabilities
• The posterior probability of xi aligned to yj is given by
$$P(x_i \sim y_j) = \frac{f(i,j)\, b(i,j)}{P(x,y)}$$
Partition function posterior probabilities
• Standard alignment score:
$$S(a) = T \sum_{(i,j) \in a} \ln\!\left(\frac{M_{ij}}{f_i f_j}\right) + (\text{gap penalties})$$
• Probability of alignment (Miyazawa, Prot. Eng. 1995):
$$P(a) \propto e^{S(a)/T}$$
• If we knew the alignment partition function then
$$P(a, T) = \frac{e^{S(a)/T}}{Z(T)}$$
Partition function posterior probabilities
• Alignment partition function (Miyazawa, Prot. Eng. 1995):
$$Z(T) = \sum_{a \in A} e^{S(a)/T}$$
• Subsequently, summing over the set A_{ij} of alignments of x1..i and y1..j that end with xi paired to yj,
$$Z^M_{i,j} = \sum_{a \in A_{ij}} e^{S_{ij}(a)/T} = \left( \sum_{a \in A_{i-1,j-1}} e^{S_{i-1,j-1}(a)/T} \right) e^{s(x_i, y_j)/T}$$
Partition function posterior probabilities
• More generally the forward partition function matrices are calculated as
$$Z^M_{i,j} = \left( Z^M_{i-1,j-1} + Z^E_{i-1,j-1} + Z^F_{i-1,j-1} \right) e^{s(x_i, y_j)/T}$$
$$Z^E_{i,j} = Z^M_{i,j-1}\, e^{g/T} + Z^E_{i,j-1}\, e^{\mathrm{ext}/T}$$
$$Z^F_{i,j} = Z^M_{i-1,j}\, e^{g/T} + Z^F_{i-1,j}\, e^{\mathrm{ext}/T}$$
$$Z_{i,j} = Z^M_{i,j} + Z^E_{i,j} + Z^F_{i,j}$$
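The forward matrices translate almost line for line into code. The sketch below is only an illustration under stated assumptions: a toy match/mismatch score in place of a real substitution matrix, T=1, and the boundary condition Z^M(0,0)=1 with all other undefined cells set to 0, none of which are given on the slide.

```python
import math


def forward_partition(x, y, T=1.0, match=1.0, mismatch=0.0, g=-2.0, ext=-0.1):
    """Forward partition function matrices ZM, ZE, ZF (affine gaps)."""
    m, n = len(x), len(y)
    ZM = [[0.0] * (n + 1) for _ in range(m + 1)]
    ZE = [[0.0] * (n + 1) for _ in range(m + 1)]
    ZF = [[0.0] * (n + 1) for _ in range(m + 1)]
    ZM[0][0] = 1.0  # assumed boundary: weight of the empty alignment
    w_open, w_ext = math.exp(g / T), math.exp(ext / T)
    for i in range(m + 1):
        for j in range(n + 1):
            if i > 0 and j > 0:
                s = match if x[i - 1] == y[j - 1] else mismatch
                ZM[i][j] = (ZM[i - 1][j - 1] + ZE[i - 1][j - 1]
                            + ZF[i - 1][j - 1]) * math.exp(s / T)
            if j > 0:  # y_j paired with a gap
                ZE[i][j] = ZM[i][j - 1] * w_open + ZE[i][j - 1] * w_ext
            if i > 0:  # x_i paired with a gap
                ZF[i][j] = ZM[i - 1][j] * w_open + ZF[i - 1][j] * w_ext
    return ZM, ZE, ZF


def partition_function(x, y, **kwargs):
    """Total Z(T) = ZM + ZE + ZF at the last cell."""
    ZM, ZE, ZF = forward_partition(x, y, **kwargs)
    return ZM[-1][-1] + ZE[-1][-1] + ZF[-1][-1]


print(partition_function("AA", "A"))
```

Replacing each sum by a max and each product by a sum of exponents (that is, working with T times the logarithm) gives back the standard affine recursions, which is why the two sets of equations look so alike. For long sequences the products underflow, so a practical version would likely work in log space.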
Partition function matrices vs. standard affine recursions
Partition function matrices:
$$Z^M_{i,j} = \left( Z^M_{i-1,j-1} + Z^E_{i-1,j-1} + Z^F_{i-1,j-1} \right) e^{s(x_i, y_j)/T}$$
$$Z^E_{i,j} = Z^M_{i,j-1}\, e^{g/T} + Z^E_{i,j-1}\, e^{\mathrm{ext}/T}$$
$$Z^F_{i,j} = Z^M_{i-1,j}\, e^{g/T} + Z^F_{i-1,j}\, e^{\mathrm{ext}/T}$$
$$Z_{i,j} = Z^M_{i,j} + Z^E_{i,j} + Z^F_{i,j}$$
Standard affine recursions:
$$V(i,j) = \max\{E(i,j),\ F(i,j),\ M(i,j)\}$$
$$M(i,j) = V(i-1,j-1) + s(x_i, y_j)$$
$$E(i,j) = \max\{E(i,j-1) + \mathrm{ext},\ V(i,j-1) + g\}$$
$$F(i,j) = \max\{F(i-1,j) + \mathrm{ext},\ V(i-1,j) + g\}$$
Posterior probability calculation
• If we define Z’ as the “backward” partition function matrices then
$$P(x_i \sim y_j) = \frac{Z^M_{i-1,j-1}\; Z'^M_{i+1,j+1}}{Z}\; e^{s(x_i, y_j)/T}$$
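The forward/backward product can be illustrated with a simplified sketch. To keep it short it assumes a linear gap penalty (ext = g), so one forward and one backward matrix suffice instead of the three affine matrices; the toy scores, T=1, and the function name are likewise assumptions.

```python
import math


def posterior_matrix(x, y, T=1.0, match=1.0, mismatch=0.0, g=-2.0):
    """P[i][j] = posterior that x[i] pairs with y[j] (0-based), linear gaps."""
    m, n = len(x), len(y)
    w_gap = math.exp(g / T)

    def w(i, j):
        # Boltzmann weight of pairing x[i] with y[j].
        return math.exp((match if x[i] == y[j] else mismatch) / T)

    # Forward: Zf[i][j] sums over all alignments of x[:i] and y[:j].
    Zf = [[0.0] * (n + 1) for _ in range(m + 1)]
    Zf[0][0] = 1.0
    for i in range(m + 1):
        for j in range(n + 1):
            if i > 0 and j > 0:
                Zf[i][j] += Zf[i - 1][j - 1] * w(i - 1, j - 1)
            if i > 0:
                Zf[i][j] += Zf[i - 1][j] * w_gap
            if j > 0:
                Zf[i][j] += Zf[i][j - 1] * w_gap

    # Backward: Zb[i][j] sums over all alignments of x[i:] and y[j:].
    Zb = [[0.0] * (n + 1) for _ in range(m + 1)]
    Zb[m][n] = 1.0
    for i in range(m, -1, -1):
        for j in range(n, -1, -1):
            if i < m and j < n:
                Zb[i][j] += Zb[i + 1][j + 1] * w(i, j)
            if i < m:
                Zb[i][j] += Zb[i + 1][j] * w_gap
            if j < n:
                Zb[i][j] += Zb[i][j + 1] * w_gap

    Z = Zf[m][n]
    # prefix weight * pair weight * suffix weight, normalized by Z
    return [[Zf[i][j] * w(i, j) * Zb[i + 1][j + 1] / Z for j in range(n)]
            for i in range(m)]


for row in posterior_matrix("ACCG", "ACCCA"):
    print(["%.3f" % p for p in row])
```

Each row and each column of the result sums to at most 1, since a residue is paired with at most one residue of the other sequence in any alignment.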
Posterior probabilities using alignment ensembles
• By generating an ensemble A(n,x,y) of n alignments of x and y we can estimate P(xi~yj) by counting the number of times xi is aligned to yj. Note that this means we are assigning equal weights to all alignments in the ensemble.
$$P(x_i \sim y_j \in a^* \mid x, y) = \sum_{a \in A} P(a \mid x, y)\, \mathbf{1}\{x_i \sim y_j \in a\}$$
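Counting over an ensemble is straightforward to sketch. The code below assumes each alignment is given as a pair of gapped strings and gives every alignment weight 1/n, as described above; the function name and the four-alignment ensemble are made up for illustration.

```python
from collections import Counter


def ensemble_posteriors(ensemble, gap="-"):
    """Estimate P(x_i ~ y_j) as the fraction of alignments containing the pair.

    ensemble is a list of (row_x, row_y) gapped strings; the result maps
    1-based (i, j) residue pairs to estimated probabilities.
    """
    counts = Counter()
    for row_x, row_y in ensemble:
        i = j = 0  # residues of x and y consumed so far
        for a, b in zip(row_x, row_y):
            if a != gap and b != gap:
                counts[(i + 1, j + 1)] += 1
            if a != gap:
                i += 1
            if b != gap:
                j += 1
    n = len(ensemble)
    return {pair: c / n for pair, c in counts.items()}


# Hypothetical ensemble of four alignments of ACCG and ACCCA.
ensemble = [("AC-CG", "ACCCA"), ("AC-CG", "ACCCA"),
            ("AC-CG", "ACCCA"), ("ACC-G", "ACCCA")]
est = ensemble_posteriors(ensemble)
print(est[(3, 4)], est[(3, 3)])
```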
Generating ensemble of alignments
• We can use stochastic backtracking (Muckstein et al., Bioinformatics, 2002) to generate a given number of optimal and suboptimal alignments.
• At every step in the traceback we assign a probability to each of the three possible positions.
• This allows us to “sample” alignments from their partition function probability distribution.
• The resulting posterior probabilities turn out to be the same as those calculated using the forward and backward partition function matrices.
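Stochastic backtracking can be sketched as follows. This is a simplified illustration rather than the cited method's implementation: it assumes a linear gap penalty so that a single forward partition function matrix is enough, and the scores, T=1, and function name are assumptions.

```python
import math
import random


def sample_alignment(x, y, rng, T=1.0, match=1.0, mismatch=0.0, g=-2.0):
    """Draw one alignment with probability proportional to exp(S(a)/T)."""
    m, n = len(x), len(y)
    w_gap = math.exp(g / T)

    def w(i, j):
        return math.exp((match if x[i] == y[j] else mismatch) / T)

    # Forward partition function over prefixes (linear gaps).
    Z = [[0.0] * (n + 1) for _ in range(m + 1)]
    Z[0][0] = 1.0
    for i in range(m + 1):
        for j in range(n + 1):
            if i > 0 and j > 0:
                Z[i][j] += Z[i - 1][j - 1] * w(i - 1, j - 1)
            if i > 0:
                Z[i][j] += Z[i - 1][j] * w_gap
            if j > 0:
                Z[i][j] += Z[i][j - 1] * w_gap

    # Traceback: choose each move in proportion to its share of Z[i][j].
    row_x, row_y = [], []
    i, j = m, n
    while i > 0 or j > 0:
        moves, weights = [], []
        if i > 0 and j > 0:
            moves.append("diag"); weights.append(Z[i - 1][j - 1] * w(i - 1, j - 1))
        if i > 0:
            moves.append("up"); weights.append(Z[i - 1][j] * w_gap)
        if j > 0:
            moves.append("left"); weights.append(Z[i][j - 1] * w_gap)
        move = rng.choices(moves, weights=weights)[0]
        if move == "diag":
            row_x.append(x[i - 1]); row_y.append(y[j - 1]); i -= 1; j -= 1
        elif move == "up":
            row_x.append(x[i - 1]); row_y.append("-"); i -= 1
        else:
            row_x.append("-"); row_y.append(y[j - 1]); j -= 1
    return "".join(reversed(row_x)), "".join(reversed(row_y))


rng = random.Random(0)  # fixed seed so runs are repeatable
for _ in range(3):
    print(*sample_alignment("ACCG", "ACCCA", rng), sep="\n", end="\n\n")
```

Raising T flattens the distribution and produces more suboptimal alignments; lowering it concentrates the samples on the optimal ones.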
Probalign
1. For each pair of sequences (x,y) in the input set
– a. Compute partition function matrices Z(T)
– b. Estimate posterior probability matrix P(xi ~ yj) for (x,y) by
$$P(x_i \sim y_j) = \frac{Z^M_{i-1,j-1}\; Z'^M_{i+1,j+1}}{Z}\; e^{s(x_i, y_j)/T}$$
2. Perform the probabilistic consistency transformation and compute a maximal expected accuracy multiple alignment: align sequence profiles along a guide-tree and follow by iterative refinement (Do et al.).
$$V(i,j) = \max \begin{cases} V(i-1,j-1) + P(x_i \sim y_j) \\ V(i-1,j) \\ V(i,j-1) \end{cases}$$
Experimental results
• http://bioinformatics.oxfordjournals.org/content/26/16/1958