
Basic Local Alignment
Search Tool
Presented by Mei Liu
August 7, 2008
Introduction

BLAST
• Finds regions of local similarity between sequences
• Assesses which DNA or protein sequences in a large database have significant similarity to a given query sequence
• Helps infer functional and evolutionary relationships between sequences
• Helps identify members of gene families
• Two implementations of BLAST exist: one by NCBI and the other at Washington University (WU-BLAST)
Introduction

WU-BLAST printouts give the following values:
• Score or High Score
• Bit scores
• Expect values
• P-values
Outline

Comparison of two aligned sequences
• BLAST random walk
• Parameter calculations
• Choice of score
• Bounds and approximation for BLAST P-value
• Normalized and bit scores
• Number of high-scoring excursions
• Karlin-Altschul sum statistic
Outline





Comparison of two unaligned sequences
Comparison of a query sequence against a
database
Minimum significance lengths
Parametric or non-parametric test?
Gapped BLAST and PSI BLAST
1. Two Aligned Sequences




Given an ungapped global alignment of two protein sequences, both of length N
Null hypothesis: for each aligned pair of amino acids, the two amino acids are generated by independent mechanisms
Null hypothesis probability of the amino acid pair (j, k):
$p_j p'_k$   (10.1)
Alternative hypothesis probability of the amino acid pair (j, k):
$q(j, k)$   (10.2)
1.1 BLAST Random Walk



Number the positions from left to right as 1,
2, …, N
A score S(j, k) is allocated to each aligned
amino acid pair (j, k)
In applications of BLAST, the score is taken from a BLOSUM or PAM substitution matrix
1.1 BLAST Random Walk

PAM






Developed by Margaret Dayhoff in the 1970s
Calculated by observing the differences in closely related proteins
The PAM1 matrix estimates the rate of substitution that would be expected if 1% of the amino acids had changed
Derived matrices go as high as PAM250
Higher numbers in the PAM matrix naming scheme denote larger evolutionary distance
Does not work very well for aligning evolutionarily divergent sequences
1.1 BLAST Random Walk

BLOSUM





Henikoff and Henikoff constructed these matrices using
multiple alignments of evolutionarily divergent proteins
Probabilities used in the matrix calculation are
computed by looking at "blocks" of conserved
sequences found in multiple protein alignments
To reduce bias from closely related sequences,
segments in a block with a sequence identity above a
certain threshold were clustered
For the BLOSUM62, this threshold was set at 62%
Larger numbers in the BLOSUM matrix naming
scheme denote higher sequence similarity
1.1 BLAST Random Walk

The accumulated score at position i is the sum of the scores for the amino acid comparisons at positions 1, 2, …, i
Sequence 1: T Q L A A W C R M T C F E I E C K V
Sequence 2: R H L D S W R R A S D D A R I E E G
S(j, k): -1, 1, 5, -2, 1, 15, -4, 7, -1, 2, -4, etc.
Accumulated Score: -1, 0, 5, 3, 4, 19, 15, 22, etc.
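The running sum on this slide can be reproduced in a few lines. This is only an illustrative sketch using the first eight scores listed above; the variable names are not from the original slides.

```python
from itertools import accumulate

# Per-position scores S(j, k) for the first eight aligned pairs on the slide
scores = [-1, 1, 5, -2, 1, 15, -4, 7]

# The accumulated score at position i is the running sum of scores 1..i
accumulated = list(accumulate(scores))
print(accumulated)
```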
1.1 BLAST Random Walk






A ladder point is a position at which the walk reaches a new low, lower than at any earlier position
Let Y1, Y2, … be the maximum heights reached by the walk, measured relative to the height of a ladder point, after leaving that ladder point and before arriving at the next
Define Ymax as the maximum of these maxima
Ymax is the test statistic used in BLAST, so it is necessary to find its null hypothesis distribution
The random variables Yi have a geometric-like distribution
$P(Y \ge y) \sim C e^{-\lambda y}$
C and λ depend on the substitution matrix used and on the amino acid frequencies {p_j} and {p'_k}
The probability distribution of Ymax, apart from C and λ, also depends on the mean number of ladder points in the walk
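As a rough sketch of these definitions, the excursion heights Y1, Y2, … and Ymax can be read off a list of step scores as follows. The function name `excursion_heights` is illustrative, and treating the starting height 0 as the first ladder point is an assumption of this sketch.

```python
def excursion_heights(scores):
    """Return the maximum height reached after each ladder point."""
    height = 0   # current height of the walk
    ladder = 0   # height of the most recent ladder point (start counts as one)
    best = 0     # highest rise above that ladder point so far
    heights = []
    for s in scores:
        height += s
        if height < ladder:
            # New low: the current excursion ends and a new ladder point begins
            heights.append(best)
            ladder = height
            best = 0
        else:
            best = max(best, height - ladder)
    heights.append(best)  # excursion still in progress at the end of the walk
    return heights

ys = excursion_heights([-1, 1, 5, -2, 1, 15, -4, 7])
y_max = max(ys)
```

For the eight scores shown earlier, the walk drops to -1 at the first step and then climbs to 22, so the largest excursion height is 23.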
1.2 Parameter Calculations
$C = \dfrac{\bar{Q}\left(1 - \sum_{j=1}^{c} R_{-j} e^{-j\lambda}\right)}{\left(1 - e^{-\lambda}\right) \sum_{k=1}^{d} k Q_k e^{k\lambda}}$   (7.61)
Step size is identified with a score S(j, k)
The null hypothesis probability of taking a step of any given size is found from the two sets of frequencies {p_j} and {p'_k}
When the null hypothesis is true, λ can be calculated from
$\sum_{j,k} p_j p'_k e^{\lambda S(j,k)} = 1$   (10.3)
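Equation (10.3) has no closed-form solution in general, so λ is found numerically. The sketch below uses simple bisection on a toy DNA scoring scheme (match +1, mismatch -1, uniform base frequencies); the scheme, the function name `solve_lambda`, and the bisection approach are all assumptions chosen for illustration, not the method used inside BLAST.

```python
import math

def solve_lambda(p, p_prime, score, tol=1e-12):
    """Return the positive root of sum_{j,k} p_j p'_k exp(lam * S(j,k)) = 1."""
    def f(lam):
        return sum(p[j] * p_prime[k] * math.exp(lam * score[j][k])
                   for j in p for k in p_prime) - 1.0

    # f(0) = 0 and f is negative just above 0 when the mean score is negative,
    # so bracket the positive root between a small lo and a growing hi.
    lo, hi = 1e-6, 1.0
    while f(hi) < 0:
        hi *= 2.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if f(mid) < 0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

bases = "ACGT"
p = {b: 0.25 for b in bases}
score = {a: {b: (1 if a == b else -1) for b in bases} for a in bases}
lam = solve_lambda(p, p, score)
```

For this toy scheme the equation reduces to 0.25 e^λ + 0.75 e^(-λ) = 1, whose positive root is λ = log 3.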
1.2 Parameter Calculations


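Ymax depends on C, λ, and the mean number of ladder points in the BLAST walk
The mean number of ladder points in turn depends on the mean distance A between ladder points
$A = \dfrac{\sum_{j=1}^{c} j R_{-j}}{-\sum_{j=-c}^{d} j P_j}$   (7.41)
Here P_j is the null hypothesis probability of a step of size j, with step sizes ranging from -c to d
The calculation of A depends on the calculation of the R_{-j}
There are two alternative approaches to this calculation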
1.2 Parameter Calculations

Decomposition of paths




Ex. A walk with 2 possible steps, +1 and -2, with respective probabilities p and q = 1 - p
Any ladder point reached in the walk is at a distance 1 or 2 below the previous one
The respective probabilities of the two cases are R_{-1} and R_{-2} = 1 - R_{-1}
The probability that -2 is a ladder point is the sum of:
• the probability that the walk goes to -2 immediately, and
• the probability that it first goes to +1, then reaches 0, and then reaches -2
1.2 Parameter Calculations
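$R_{-2} = q + p(1 - R_{-2}) R_{-2}$   (10.4)
The first term is the probability of stepping directly to -2; the second is the probability of stepping to +1, returning to 0, and then reaching -2
$R_{-2} = \dfrac{-q + \sqrt{4pq + q^2}}{2p}$   (10.5)
$R_{-1} = 1 - R_{-2}$
The value of A then follows from Eq. (7.41)
$A = \dfrac{\sum_{j=1}^{c} j R_{-j}}{-\sum_{j=-c}^{d} j P_j}$   (7.41)
Since the two sequences compared are each of length N, and the mean distance between ladder points is A, the mean number of ladder points is N/A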
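For the two-step example (+1 with probability p, -2 with probability q), the quantities above can be evaluated directly. This is a sketch under the assumption that (7.41) is the ratio of the mean drop at a ladder point to the magnitude of the mean step; `ladder_parameters` and the value p = 0.3 are illustrative only.

```python
import math

def ladder_parameters(p):
    """Return (R_-1, R_-2, A) for a walk with steps +1 (prob p) and -2 (prob 1-p)."""
    q = 1.0 - p
    mean_step = p * 1 + q * (-2)
    if mean_step >= 0:
        raise ValueError("the mean step size must be negative")
    r2 = (-q + math.sqrt(4.0 * p * q + q * q)) / (2.0 * p)  # Eq. (10.5)
    r1 = 1.0 - r2
    a = (1 * r1 + 2 * r2) / (-mean_step)                     # Eq. (7.41)
    return r1, r2, a

p = 0.3
r1, r2, a = ladder_parameters(p)
mean_ladder_points = 1000 / a  # N / A for sequences of length N = 1000
```

With p = 0.3 this gives R_{-2} of about 0.755 and A of about 1.60 steps between ladder points.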
1.3 Choice of a Score


BLAST score is a log likelihood ratio
Why?

Similar to sequence analysis

If a random variable Y has a discrete probability distribution P(y; ξ), the "score" statistic is defined as the log likelihood ratio
$S_{1,0}(y) = \log \dfrac{P(y; \xi_1)}{P(y; \xi_0)}$
If the amino acid pair (j, k) is observed at any position, and p_j p'_k and q(j, k) are its null and alternative hypothesis probabilities, then
$S(j, k) = \log \dfrac{q(j, k)}{p_j p'_k}$   (10.6)
1.3 Choice of a Score

A second argument leads to the choice of a specific proportionality constant
Suppose some arbitrary substitution matrix is chosen with (j, k) element S(j, k), and let q(j, k) be defined implicitly by
$S(j, k) = \lambda^{-1} \log \dfrac{q(j, k)}{p_j p'_k}$   (10.7)
where λ is defined by equation (10.3)
$\sum_{j,k} p_j p'_k e^{\lambda S(j,k)} = 1$   (10.3)
Thus q(j, k) can be defined explicitly by
$q(j, k) = p_j p'_k e^{\lambda S(j,k)}$   (10.8)
and, by (10.3), these values sum to 1:
$\sum_{j,k} q(j, k) = 1$
1.3 Choice of a Score

Karlin and Altschul (1990) and Karlin (1994)
showed that


When null hypothesis is true, the frequency with which
the observation (j,k) arises in high-scoring excursions is
asymptotically equal to q(j,k)
Then argued that a score scheme is “optimal” if the
frequency of the observation (j,k) in high-scoring
excursions is asymptotically equal to the “target”
frequency q(j,k), the frequency arising if the alternative
hypothesis is true

i.e. frequency in the most biologically relevant alignments of
conserved regions
1.3 Choice of a Score

The argument for the use of S(j, k) as the score statistic leads to the following procedure:
$S(j, k) = \lambda^{-1} \log \dfrac{q(j, k)}{p_j p'_k}$   (10.7)
There are various possibilities for q(j, k)
One frequently adopted choice is derived from the evolutionary arguments that lead to the PAMn matrix construction in Section 6.5.3
$q(j, k) = p_j m_{jk}^{(n)}$   (10.9)
$S(j, k) = \lambda^{-1} \log \dfrac{m_{jk}^{(n)}}{p'_k}$   (10.10)
1.3 Choice of a Score

The choice of S(j, k) can also be related to relative entropy
$S(j, k) = \lambda^{-1} \log \dfrac{q(j, k)}{p_j p'_k}$   (10.7)
The score defined is proportional to the support given by the observation (j, k) in favor of the alternative hypothesis over the null hypothesis
Eq. (1.124) shows that when the alternative hypothesis is true, the mean support for the alternative over the null hypothesis is
$H = \sum_{j,k} q(j, k) \log \dfrac{q(j, k)}{p_j p'_k}$   (10.11)
$H = \sum_{j,k} q(j, k)\, \lambda S(j, k) = \lambda E(S(j, k))$   (10.12)
1.3 Choice of a Score
$S(j, k) = \lambda^{-1} \log \dfrac{q(j, k)}{p_j p'_k}$   (10.7)
The mean score in high-scoring segments is asymptotically
$\sum_{j,k} q(j, k) S(j, k) = \lambda^{-1} H$   (10.13)
where H is given by
$H = \sum_{j,k} q(j, k)\, \lambda S(j, k)$   (10.12)
1.3 Choice of a Score


Simulations show that the convergence to this asymptotic value is very slow
Direct computation of H is not possible
$H = \sum_{j,k} q(j, k)\, \lambda S(j, k)$   (10.12)
λ and S(j, k) are known, but q(j, k) is unknown
BLAST uses an indirect approach to calculate H, in which q(j, k) is first calculated by
$q(j, k) = p_j p'_k e^{\lambda S(j,k)}$   (10.8)
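The indirect route, first q(j, k) from (10.8) and then H from (10.12), might be sketched as follows. The toy DNA scoring scheme (match +1, mismatch -1, uniform frequencies, for which λ = log 3 solves (10.3)) and the function names are assumptions for illustration.

```python
import math

def target_frequencies(p, p_prime, score, lam):
    """q(j,k) = p_j p'_k exp(lam * S(j,k)), Eq. (10.8)."""
    return {(j, k): p[j] * p_prime[k] * math.exp(lam * score[j][k])
            for j in p for k in p_prime}

def relative_entropy(q, score, lam):
    """H = sum_{j,k} q(j,k) * lam * S(j,k), Eq. (10.12), in nats."""
    return sum(q_jk * lam * score[j][k] for (j, k), q_jk in q.items())

bases = "ACGT"
p = {b: 0.25 for b in bases}
score = {a: {b: (1 if a == b else -1) for b in bases} for a in bases}
lam = math.log(3.0)  # positive root of (10.3) for this toy scheme

q = target_frequencies(p, p, score, lam)
H = relative_entropy(q, score, lam)
```

In this example the target frequencies put total weight 0.75 on matches and 0.25 on mismatches, so H = 0.5 log 3 nats per aligned pair.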
1.4 Bounds and Approximation for
BLAST P-value

The test statistic used in BLAST is the maximum Ymax of n ≈ N/A random variables, each being a random upward excursion height following a ladder point in the BLAST random walk
In Section 7.6.4, it was shown that each upward excursion height has a geometric-like distribution
The aim is to obtain asymptotic bounds for the null hypothesis distribution of Ymax, and hence asymptotic bounds for a BLAST P-value
1.4 Bounds and Approximation for
BLAST P-value




There exists an asymptotic distribution for the maximum
of n iid continuous random variables whose density
function has support of the form (A, +∞)
However, Ymax is a discrete random variable
Use the continuous distribution results to find asymptotic
bounds for the distribution of Ymax
If Xmax is the maximum of n iid continuous random variables and Ymax = floor(Xmax), then Ymax is a discrete random variable with
$X_{\max} - 1 < Y_{\max} \le X_{\max}$
Thus, for any positive integer y
$P(X_{\max} \le y) \le P(Y_{\max} \le y) \le P(X_{\max} \le y + 1)$   (10.14)
1.4 Bounds and Approximation for
BLAST P-value

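Let Xmax be the maximum of n iid random variables, each having an exponential distribution with parameter λ, and let Ymax = floor(Xmax)
Ymax has the same distribution as the maximum of n iid random variables, each having a geometric distribution
Applying Eq. (2.130) and the bounds in (10.14), we have to a close approximation
$e^{-n e^{-\lambda y}} \le P(Y_{\max} \le y) \le e^{-n e^{-\lambda (y+1)}}$   (10.15)
$1 - e^{-n e^{-\lambda y}} \le P(Y_{\max} \ge y) \le 1 - e^{-n e^{-\lambda (y-1)}}$   (10.16)
For the geometric-like distribution, with its constant C, the corresponding bounds are
$1 - e^{-n C e^{-\lambda y}} \le P(Y_{\max} \ge y) \le 1 - e^{-n C e^{-\lambda (y-1)}}$   (10.17)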
1.4 Bounds and Approximation for
BLAST P-value

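If we replace n by N/A, the mean number of BLAST ladder points, and define a new parameter K by
$K = \dfrac{C}{A} e^{-\lambda}$   (10.18)
the inequality (10.17) becomes
$1 - e^{-N K e^{-\lambda (y-1)}} \le P(Y_{\max} \ge y) \le 1 - e^{-N K e^{-\lambda (y-2)}}$   (10.19)
Replacing y by $x + \lambda^{-1} \log N$ gives
$e^{-K e^{-\lambda (x-1)}} \le P(Y_{\max} - \lambda^{-1} \log N \le x) \le e^{-K e^{-\lambda x}}$   (10.20)
or equivalently
$1 - e^{-K e^{-\lambda x}} \le P(Y_{\max} - \lambda^{-1} \log N > x) \le 1 - e^{-K e^{-\lambda (x-1)}}$   (10.21)
so that the P-value corresponding to an observed value ymax of Ymax satisfies
$1 - e^{-K N e^{-\lambda y_{\max}}} \le P(Y_{\max} \ge y_{\max}) \le 1 - e^{-K N e^{-\lambda (y_{\max}-1)}}$   (10.22)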
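The two sides of (10.22) are easy to evaluate numerically. The sketch below is illustrative only: the function name and the parameter values (N = 1000, K = 0.1, λ = 0.3, ymax = 40) are hypothetical and not taken from any BLAST output.

```python
import math

def p_value_bounds(n, k, lam, y_max):
    """Lower and upper bounds of Eq. (10.22) for P(Ymax >= y_max)."""
    # -expm1(-x) computes 1 - exp(-x) accurately when x is small
    lower = -math.expm1(-k * n * math.exp(-lam * y_max))
    upper = -math.expm1(-k * n * math.exp(-lam * (y_max - 1)))
    return lower, upper

lower, upper = p_value_bounds(n=1000, k=0.1, lam=0.3, y_max=40)
```

The upper bound exceeds the lower bound by roughly a factor of e^λ when both are small.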
1.4 Bounds and Approximation for
BLAST P-value

These bounds for the BLAST P-value are not directly relevant in practice because
• a BLAST search involves comparison of a short query sequence with a large database containing many fragments
• there is no a priori alignment
Nevertheless, the P-value approximation derives ultimately from the lower P-value bound in Eq. (10.22)
It would be more appropriate to use the conservative upper bound in (10.22), which overestimates the true P-value, rather than the lower bound
1.5 Normalized and Bit Scores




Karlin and Altschul (1993) call the following expression a "normalized score"
$S' = \lambda Y_{\max} - \log(NK)$   (10.25)
In terms of this score, the inequalities (10.20) can be written as
$e^{-e^{\lambda} e^{-s}} \le P(S' \le s) \le e^{-e^{-s}}$   (10.26)
From the upper inequality
$P(S' \ge s) \approx 1 - e^{-e^{-s}}$   (10.27)
The P-value corresponding to an observed value s' of S' is
$P\text{-value} \approx 1 - e^{-e^{-s'}}$   (10.28)
1.5 Normalized and Bit Scores

BLAST records a score similar to the normalized score S', namely the "bit" score, defined by
$\text{bit score} = \dfrac{\lambda Y_{\max} - \log K}{\log 2}$
1.6 Number of High-Scoring Excursions




The quantity E' corresponds to the quantity "Expect" in BLAST
Under the null hypothesis, for each excursion, the maximum height Y has a geometric-like distribution
$P(Y_i \ge v) \sim C e^{-\lambda v}$
Number of excursions ≈ N/A
In BLAST, with $K = \dfrac{C}{A} e^{-\lambda}$ (10.18), the mean number of excursions reaching a height v or more is approximately
$N K e^{-\lambda v}$   (10.34)
1.6 Number of High-Scoring Excursions

The expected number of excursions reaching the observed maximal score ymax or more is
$E' = N K e^{-\lambda y_{\max}}$   (10.35)
$S' = -\log E'$   (10.36)
$P\text{-value} \approx 1 - e^{-E'}$   (10.37)
$E' = -\log(1 - P\text{-value})$
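The relations between E', the normalized score, the P-value, and the bit score can be checked numerically. The following is a sketch with hypothetical parameter values (N = 1000, K = 0.1, λ = 0.3, ymax = 40); `blast_summary` is an illustrative name, not a BLAST function.

```python
import math

def blast_summary(n, k, lam, y_max):
    """Return (E', normalized score S', P-value, bit score) for one comparison."""
    expect = n * k * math.exp(-lam * y_max)                 # Eq. (10.35)
    normalized = lam * y_max - math.log(n * k)              # Eq. (10.25)
    p_value = -math.expm1(-expect)                          # Eq. (10.37)
    bit_score = (lam * y_max - math.log(k)) / math.log(2)   # bit score
    return expect, normalized, p_value, bit_score

expect, normalized, p_value, bit_score = blast_summary(1000, 0.1, 0.3, 40)
```

One consequence of these definitions is that E' = N · 2^(-bit score), which is why bit scores from different scoring systems can be compared directly.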
1.7 Karlin-Altschul Sum Statistic


Focusing on Ymax alone loses the information provided by the heights of the 2nd, 3rd, etc. largest excursions in the random walk
Consider the r largest Yi values
$Y_1 (= Y_{\max}) \ge Y_2 \ge \cdots \ge Y_r$
Compute r normalized scores, where
$S'_i = \lambda Y_i - \log(NK)$   (10.38)
1.7 Karlin-Altschul Sum Statistic

Karlin and Altschul (1993) showed that, to a close approximation, the null hypothesis joint density function of S'_1, …, S'_r is
$f_S(s_1, \ldots, s_r) = \exp\left(-e^{-s_r} - \sum_{k=1}^{r} s_k\right)$   (10.39)
Any reasonable function of S'_1, S'_2, …, S'_r can be used as the test statistic
The transformation methods introduced in Chap. 2 give the distribution of this test statistic
This in turn allows computation of the P-value and the E (Expect) value corresponding to any observed value of the statistic
1.7 Karlin-Altschul Sum Statistic

The statistic suggested is the sum of the normalized scores, called the Karlin-Altschul sum statistic
$T_r = S'_1 + \cdots + S'_r$
Null hypothesis density function of T_r:
$f_{T_r}(t) = \dfrac{e^{-t}}{r!\,(r-2)!} \int_0^{\infty} y^{r-2} \exp\left(-e^{(y-t)/r}\right) dy$   (10.40)
When t is sufficiently large, this density function can be used to find the approximate expression
$P(T_r \ge t) \approx \dfrac{e^{-t}\, t^{r-1}}{r!\,(r-1)!}$   (10.41)
1.7 Karlin-Altschul Sum Statistic

The approximation (10.41) is sufficiently accurate when t > r(r+1), and BLAST uses it when this inequality holds
$P(T_r \ge t) \approx \dfrac{e^{-t}\, t^{r-1}}{r!\,(r-1)!}$   (10.41)
If t is the observed value of T_r, the right-hand side of (10.41) provides the approximate P-value corresponding to this observed value
This is used as a component of the eventual BLAST printout P-value
Ex. s1 = 4.4 and s2 = 2.5
r = 1: P-value for the highest normalized score 4.4 ≈ e^{-4.4} = 0.012
r = 2: P-value for the sum 6.9 ≈ (6.9/2) e^{-6.9} = 0.0035
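The worked example can be reproduced with a direct implementation of approximation (10.41). This is a minimal sketch; `sum_statistic_p_value` is an illustrative name, and the function does not check the accuracy condition t > r(r+1).

```python
import math

def sum_statistic_p_value(normalized_scores):
    """Approximate P(T_r >= t) by Eq. (10.41) for the given r normalized scores."""
    r = len(normalized_scores)
    t = sum(normalized_scores)
    return math.exp(-t) * t ** (r - 1) / (math.factorial(r) * math.factorial(r - 1))

p_single = sum_statistic_p_value([4.4])       # r = 1
p_pair = sum_statistic_p_value([4.4, 2.5])    # r = 2
print(round(p_single, 3), round(p_pair, 4))
```

In this example the pair of segments taken together is more significant than the best segment alone.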
2. Two Unaligned Sequences


Given two sequences of lengths N1 and N2, but no specific alignment
Need to find the significance of high-scoring segment pairs (HSPs) among all possible (ungapped) local alignments
2.1 Theoretical and Empirical
Background


BLAST considers all ungapped alignments determined by
all possible relative positions of two sequences
For each relative position, alignment is extended as far as
possible in either direction, giving a total of N1+N2-1
ungapped alignments
2.1 Theoretical and Empirical
Background



Each alignment yields a random walk
Total N1N2 comparisons between two
sequences taking all possible positions
relative to each other
Many conclusions from previous section
can be carried over to the present case


with N replaced by N1N2
or a more refined function allowing for edge
effects
2.1 Theoretical and Empirical
Background

Ymax is the maximum score achieved in the random walks comparing the sequences, over all possible ungapped local alignments
Mean number of ladder points:
$N_1 N_2 / A$   (10.42)
Assuming the null hypothesis is true, the inequalities in (10.21) are replaced by
$1 - e^{-K e^{-\lambda x}} \le P(Y_{\max} - \lambda^{-1} \log(N_1 N_2) > x) \le 1 - e^{-K e^{-\lambda (x-1)}}$   (10.44)
The expected number E' of excursions reaching a height ymax or more is
$E' = N_1 N_2 K e^{-\lambda y_{\max}}$   (10.43)
The normalized score S' is redefined as
$S' = \lambda Y_{\max} - \log(N_1 N_2 K)$   (10.45)
The null hypothesis mean of Ymax is approximately
$\lambda^{-1}\left(\log(N_1 N_2 K) + \gamma\right)$   (10.46)
where γ ≈ 0.5772 is Euler's constant
2.2 Edge Effects





A high-scoring random walk excursion might be
cut short at the end of a sequence match
So the height of high-scoring excursions and the
number of such excursions will be less than
predicted by theory
Edge effects are an important factor in the comparison of two comparatively short sequences
BLAST theory concerns two long sequences
In practice, BLAST considers databases containing a large number of short sequences
2.2 Edge Effects



BLAST calculations allow for edge effects by subtracting from both N1 and N2 a factor depending on the mean length of any high-scoring excursion
Eq. (10.13) showed that the mean step size in a high-scoring excursion asymptotically approaches $\lambda^{-1} H$
Given that the height achieved by a high-scoring excursion is y, the mean length E(L | y) of this excursion, conditional on y, is
$E(L \mid y) = \dfrac{\lambda y}{H}$   (10.47)
BLAST theory replaces N1 and N2 by
$N'_1 = N_1 - E(L), \quad N'_2 = N_2 - E(L)$
2.2 Edge Effects

Specifically, the normalized score is replaced by
$\lambda Y_{\max} - \log(N'_1 N'_2 K)$   (10.48)
where
$N'_1 = N_1 - \dfrac{\lambda Y_{\max}}{H}, \quad N'_2 = N_2 - \dfrac{\lambda Y_{\max}}{H}$   (10.49)
The expected number of excursions scoring v or higher is replaced by
$N'_1 N'_2 K e^{-\lambda v}, \quad N'_1 = N_1 - \dfrac{\lambda v}{H}, \quad N'_2 = N_2 - \dfrac{\lambda v}{H}$   (10.50)
E' is given by
$E' = N'_1 N'_2 K e^{-\lambda y_{\max}}$   (10.51)
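A short sketch of the edge-corrected E' in (10.51), using the effective lengths from (10.49). The function name and all numeric values (N1 = 250, N2 = 400, K = 0.1, λ = 0.3, H = 0.6, ymax = 40) are hypothetical and chosen only to show the size of the correction.

```python
import math

def edge_corrected_expect(n1, n2, k, lam, h, y_max):
    """E' of Eq. (10.51), with both lengths reduced by lam * y_max / H."""
    mean_length = lam * y_max / h   # Eq. (10.47): mean length of the excursion
    n1_eff = n1 - mean_length       # Eq. (10.49)
    n2_eff = n2 - mean_length
    return n1_eff * n2_eff * k * math.exp(-lam * y_max)

corrected = edge_corrected_expect(250, 400, 0.1, 0.3, 0.6, 40)
uncorrected = 250 * 400 * 0.1 * math.exp(-0.3 * 40)
```

Here the mean excursion length is 20 positions, so the effective search space shrinks from 250 × 400 to 230 × 380 and E' drops accordingly.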
2.2 Edge Effects


The use of the edge correction in (10.49) assumes that the asymptotic formula for the mean step size in a high-scoring excursion is appropriate
Values calculated from Eq. (10.47) are inaccurate for anything other than very large values of N
$E(L \mid y) = \dfrac{\lambda y}{H}$   (10.47)
The use of the edge correction in (10.49) might in practice lead to P-value estimates less than the correct values for anything other than very large N
2.2 Edge Effects

In BLAST, edge effect correction factor for the
Karlin-Altschul sum statistic Tr is calculated as
follows


The raw edge effect correction is calculated as
$\lambda (Y_1 + Y_2 + \cdots + Y_r) / H$
The edge correction value E(L) is defined by
E ( L) 



 r 1 
(Y1  Y2    Yr )1 
f   r 1
H
r


(10.52)
f is an “overlap adjustment factor” that can be chosen
by the user
Default f = 0.125 implies that overlaps between
segments of up to 12.5% are allowed
2.3 Multiple Testing







There is no obvious choice for the value of r
BLAST considers all r = 1, 2, 3, … and chooses the set of HSPs with the lowest sum statistic P-value as the most significant
However, this implies a sequence of tests, one for each r, so the issue of multiple testing arises
Ignoring the multiple testing issue can make the reported BLAST P-values much too small, so that significance is overstated
Unfortunately, no rigorous theory is available to deal with this issue
In practice, it is handled in an ad hoc manner
2.3 Multiple Testing

Ex. WU-BLAST



The P-value is adjusted by dividing by a factor
$(1 - \pi)\, \pi^{r-1}$
When r = 1, the factor becomes 1 - π, which implies that E' is divided by 1 - π
The BLAST default value π = 0.5 implies that E = 2E', so that
$E = 2 N'_1 N'_2 K e^{-\lambda y_{\max}}$   (10.56)
The P-value is then found as
$P\text{-value} \approx 1 - e^{-E}$   (10.57)
3. Query Sequence vs. Database



Compare the query sequence to each database sequence to obtain P-values for the individual comparisons
For r = 1, the probability that a comparison gives a match with score v or more is
$1 - e^{-E}$   (10.58)
Expect, the mean number of HSPs scoring v or more in the entire database, is given by
$\text{Expect} = \dfrac{(1 - e^{-E})\, D}{N_2}$   (10.59)
D = total length of the database (sum of the lengths of all database sequences)
N2 = length of the database sequence
$P\text{-value} \approx 1 - e^{-\text{Expect}}$   (10.60)
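The scaling from a single comparison to the whole database in (10.58) to (10.60) can be sketched as below. The function name and the numbers (pairwise E = 1e-6, D = 5,000,000, N2 = 500) are hypothetical.

```python
import math

def database_expect(pairwise_e, db_length, subject_length):
    """Return (Expect, P-value) for the whole database, Eqs. (10.58)-(10.60)."""
    pairwise_p = -math.expm1(-pairwise_e)               # Eq. (10.58)
    expect = pairwise_p * db_length / subject_length    # Eq. (10.59)
    p_value = -math.expm1(-expect)                      # Eq. (10.60)
    return expect, p_value

expect, p_value = database_expect(1e-6, 5_000_000, 500)
```

The ratio D / N2 acts as the effective number of database sequences of this length, so a pairwise E of 1e-6 becomes a database-wide Expect of about 0.01.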
3. Query Sequence vs. Database

For r > 1, from each P-value, a total database value of Expect is calculated by
$\text{Expect} = \dfrac{(P\text{-value})\, D}{N_2}$   (10.61)
$P\text{-value} \approx 1 - e^{-\text{Expect}}$   (10.60)
Finally, all single (r = 1) HSPs or summed (r > 1) HSPs with sufficiently low values of Expect are listed
4. Minimum Significance Lengths

Correct Choice of n





When sequences are distantly related, similarities
between them might be subtle
Cannot detect significant similarity unless a long
alignment is available
On the other hand, if sequences are very similar, then a
relatively short alignment is sufficient
If the similarity is subtle, each aligned pair will tell us
less than an aligned pair in more similar sequences (in
terms of information)
This leads to the concept of information content per position in an alignment
4. Minimum Significance Lengths





Using a PAMn matrix amounts to carrying out the following test:
Alternative hypothesis: n is the correct value to
use in the evolutionary process leading to the two
protein sequences
Null hypothesis: appropriate value of n is +∞
Here, assume that the alternative hypothesis is
correct (i.e. correct value of n is chosen)
Explore aspects of power of the testing procedure
by finding the mean length of protein sequence
needed before the alternative hypothesis is
accepted
4. Minimum Significance Lengths



Suppose that we decide to adopt a testing procedure with Type I error α (false positive rate)
The corresponding value s of the normalized score statistic S' is given by s = -log α (using P-value ≈ e^{-s} for small P-values)
The corresponding value ymax of Ymax is
$y_{\max} = \lambda^{-1} \log\left(\dfrac{NK}{\alpha}\right)$   (10.64)
When the alternative hypothesis is true, the mean score for the amino acid comparison at any position is
$\sum_{j,k} q(j, k) S(j, k) = \lambda^{-1} \sum_{j,k} q(j, k) \log \dfrac{q(j, k)}{p_j p'_k}$   (10.65)
4. Minimum Significance Lengths

In Chapter 7, it was shown that if
• the mean final position in a random walk is F
• the mean step size is G
then the mean number of steps needed to reach the final position is F/G
The mean sequence length needed in the maximally scoring local alignment in order to obtain significance with Type I error α is therefore
$\dfrac{\log(NK/\alpha)}{\sum_{j,k} q(j, k) \log \dfrac{q(j, k)}{p_j p'_k}}$   (10.66)
4. Minimum Significance Lengths

Since the various components can be interpreted in terms of bits of information, the ratio (10.66) can be written as
$\dfrac{\log_2(NK/\alpha)}{\sum_{j,k} q(j, k) \log_2 \dfrac{q(j, k)}{p_j p'_k}}$   (10.67)
Denominator = mean of the relative support, in terms of bits, provided by one observation for the alternative hypothesis against the null hypothesis, given that the alternative hypothesis is true
Numerator = mean total number of bits of information needed to claim that two sequences are similar
4. Minimum Significance Lengths

It is known that typically




K = 0.1, α = 0.05 or 0.01
Thus the numerator is largely determined by the length N, and is approximately log2 N
Ex. N = 1000: about 9.97 bits of information are needed to claim significant similarity between two sequences
The main interest is the minimum significant length
4. Minimum Significance Lengths

If n is large, q(j,k) is close to pjpk




Mean information per aligned pair given in the
denominator is small
Minimum significant length is large
If null and alternative hypotheses specify quite similar
probabilities for any aligned pair, many observations
will in general be needed to decide between two
hypotheses
If n is small


Mean relative support for the alternative hypothesis is
large
Minimum significant length is small
4. Minimum Significance Lengths

Limiting (n → 0) values
• q(j, j) = p_j
• q(j, k) = 0 for j ≠ k
• The denominator, the mean support from each position in favor of the alternative hypothesis, approaches $-\sum_j p_j \log_2 p_j$
If all amino acids are equally frequent, this mean support is log2 20 = 4.32
In practice, the actual frequencies of observed amino acids imply that a more appropriate value is about 4.17
Thus, the minimum significant length is (log2 N)/4.17
If N = 1000, this is about 2.39
4. Minimum Significance Lengths

When N = 1000 and n = 250



Corresponds to a PAM250 substitution matrix
The probabilities q(j, k) are such that each amino acid pair provides a mean of only 0.36 bits of information
A minimum significance length of log2(1000)/0.36 ≈ 28 is required on average to accept the alternative hypothesis
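The numbers quoted in this section follow from dividing the bits needed by the bits supplied per aligned position. The sketch below uses the slides' approximation that the numerator of (10.67) is about log2 N; the function name is illustrative.

```python
import math

def min_significant_length(n, bits_per_position):
    """Approximate minimum significant length: log2(N) / (bits per position)."""
    return math.log2(n) / bits_per_position

uniform_bits = math.log2(20)                            # all 20 amino acids equally frequent
identical_limit = min_significant_length(1000, 4.17)    # n -> 0 limit
pam250_length = min_significant_length(1000, 0.36)      # PAM250
print(round(uniform_bits, 2), round(identical_limit, 2), round(pam250_length))
```

The contrast between roughly 2.4 positions and roughly 28 positions shows how much longer an alignment must be when each aligned pair carries little information.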
4. Minimum Significance Lengths

Incorrect Choice of n




The above calculations all assume that the correct value of n is chosen, so that the correct alternative hypothesis probabilities q(j, k) are used
In practice, it is impossible to choose a unique correct
value for n when using a PAM matrix
Suppose there is a unique correct value m leading to a
PAMm matrix, but an incorrect value n was chosen and
PAMn matrix is used instead
What does this imply?
4. Minimum Significance Lengths


Suppose that with the correct choice m, the probability of the ordered pair (j, k) is r(j, k)
The mean score is then
$\lambda^{-1} \sum_{j,k} r(j, k) \log \dfrac{q(j, k)}{p_j p'_k}$   (10.68)
When n = m, r(j, k) = q(j, k) and the mean score is positive
More generally, the mean score is positive when n and m are close
But as m → +∞, r(j, k) → p_j p'_k and the mean score becomes negative
Thus for any choice of n there will be values of m sufficiently large compared to n so that the mean score is negative
4. Minimum Significance Lengths

When the mean score is positive, the minimal significance length is
$\dfrac{\log(NK/\alpha)}{\sum_{j,k} r(j, k) \log \dfrac{q(j, k)}{p_j p'_k}}$   (10.69)
The minimal length depends on q(j, k), that is, on the choice of n
The choice of n involves substantial extrinsic guesswork, so it is important to assess the implications of an incorrect choice
4. Minimum Significance Lengths

Negative means arise when m is sufficiently large
compared to n, that is




When two species being compared diverged a long time
in the past relative to the time assumed by the PAM
matrix used in analysis
The more negative this mean is, the more likely
that the null hypothesis will be accepted
In the limit m → +∞, when r(j, k) = p_j p'_k, the probability of rejecting the null hypothesis is equal to the chosen Type I error
Ex. If n = 100 is chosen, the mean score is
negative when m is 193 or more
4. Minimum Significance Lengths

In conclusion,




Correctly chosen small value of n leads to shorter
minimal significance lengths
Incorrect small choice may lead to the possibility that a
real similarity between the two sequences will not be
picked up
In practice, a variety of substitution matrices is sometimes used to overcome this problem
However, this approach must be viewed with some caution, especially in the light of the multiple testing problem
5. Parametric or Non-parametric




Parametric test: test statistic is found from
likelihood ratio arguments
Non-parametric test: test statistic is found on
reasonable but nevertheless arbitrary grounds
Many of the calculations and arguments used in the preceding sections rest on the derivation of the score S(j, k) in a substitution matrix from likelihood ratio arguments
In this sense, BLAST testing theory can be
thought of as a parametric procedure deriving
from the likelihood ratio theory
5. Parametric or Non-parametric


Assumptions made in the theory are, however, subject to
debate
Time homogeneity assumption implicit in calculations
cannot be sustained



Genetic code influenced substitutions earlier in time and various
chemical properties influenced substitutions more recently
Thus, comparisons of distantly related species can be problematic
Further, if data in a large database come from a collection
of species whose respective evolutionary divergence times
might differ widely, the concept of a uniformly correct
choice of n is not meaningful
5. Parametric or Non-parametric



Even if these claims are true, the statistical aspects
of the BLAST procedure are still valid
P-value calculations are still correct, so even if
these scores were chosen in any more or less
reasonable way, no problems arise with the
correctness of the calculations
In this sense, BLAST testing process can be
thought of as a non-parametric procedure
6.1 Gapped BLAST


Allows gaps in sequence alignments
In the comparison of two sequences, there will be some maximum score $Y_{\max}^{(\text{gapped})}$



Maximum score over all possible gapped alignments
Null hypothesis probability distribution is determined by
the substitution matrix used and gap penalty chosen
The distribution can be estimated through simulation



Randomly generate two sequences of lengths N1 and N2
From these sequences, find the observed maximum score denoted
by y1
Procedure is repeated n times yielding n observed highest scores
y1, y2, …, yn
6.1 Gapped BLAST

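The approximation is made that the distribution of Ymax in the gapped case is of the same form as in the ungapped case, with revised values of K and λ
$\lambda = \dfrac{\pi}{s\sqrt{6}}, \quad K = (N_1 N_2)^{-1} e^{\lambda \bar{y} - \gamma}$   (10.72)
where $\bar{y}$ and s are the mean and standard deviation of the simulated maximum scores y1, …, yn, and γ is Euler's constant
The approach described above depends on simulation results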
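Given a list of simulated maximum scores, the moment estimates in (10.72) could be computed as in the sketch below. It assumes s is the sample standard deviation and does not perform the gapped alignments themselves; `estimate_gapped_parameters` is an illustrative name.

```python
import math
import statistics

EULER_GAMMA = 0.5772156649015329

def estimate_gapped_parameters(max_scores, n1, n2):
    """Estimate (lambda, K) from simulated maximum scores using Eq. (10.72)."""
    y_bar = statistics.mean(max_scores)
    s = statistics.stdev(max_scores)
    lam = math.pi / (s * math.sqrt(6.0))
    k = math.exp(lam * y_bar - EULER_GAMMA) / (n1 * n2)
    return lam, k
```

The estimates match the sample mean and variance to those of the extreme value distribution implied by (10.46), so they are only as good as the number of simulated sequence pairs allows.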
If a penalty of δ is assigned to each gap in the
alignment of two sequences, then (10.45) is
replaced by
T 

 y
E '  N1 N 2 Ke
max
1   *

 e 1 
(10.73)
6.2 PSI BLAST




PSI (Position Specific Iterated) BLAST
In regular BLAST, a fixed substitution matrix is used to
score positions in alignments
It relies on one matrix to provide the most meaningful
scores for all positions in the query sequence
simultaneously
PSI-BLAST



Uses a standard substitution matrix in the first step
The sequences found are then used to derive a separate scoring scheme for each position in the query sequence, which is used for the second BLAST search
The procedure is iterated until no further iteration seems useful
6.2 PSI BLAST



The query sequence is first compared to the database sequences
All database sequence segments having a sufficiently close similarity to the query (e.g. Expect < 0.01) are reported
From this collection, at each site the observed frequency f_i of amino acid i is calculated and used to estimate the frequency Q_i of amino acid i at this site
$Q_i = \dfrac{\alpha f_i + \beta g_i}{\alpha + \beta}, \quad g_i = \sum_j f_j\, q(i, j) / p_j$   (10.75)
where α and β are the relative weights given to the observed frequencies f_i and to the pseudocount frequencies g_i
In PSI-BLAST, Σ_i g_i = 1 no longer holds
Schäffer et al. (2001) described a new implementation in which
$Q_i = \dfrac{\alpha f_i + \beta \sum_j f_j\, p(i, j) / p_j}{\alpha + \beta}$   (10.76)
where p_i is the background frequency of amino acid i and p(i, j) is the frequency with which amino acids i and j are aligned through evolutionary descent
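A minimal sketch of the position-specific frequency estimate in (10.75), written for a generic alphabet. The function name, the two-letter alphabet, and the weights α = 9, β = 1 are hypothetical and chosen only to make the arithmetic easy to follow.

```python
def position_frequencies(f, q, p, alpha, beta):
    """Q_i = (alpha*f_i + beta*g_i) / (alpha + beta), g_i = sum_j f_j q(i,j) / p_j."""
    g = {i: sum(f[j] * q[(i, j)] / p[j] for j in p) for i in p}
    return {i: (alpha * f[i] + beta * g[i]) / (alpha + beta) for i in p}

# Toy two-letter alphabet with independent target frequencies q(i,j) = p_i p_j,
# in which case g_i reduces to the background frequency p_i
p = {"A": 0.6, "B": 0.4}
q = {(i, j): p[i] * p[j] for i in p for j in p}
f = {"A": 0.9, "B": 0.1}   # observed frequencies at one site
Q = position_frequencies(f, q, p, alpha=9.0, beta=1.0)
```

With these inputs the estimate is pulled slightly from the observed frequencies toward the background, giving Q_A = 0.87 and Q_B = 0.13.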
Any questions?