02/13/16
Leung: Page 1 of 16
Chapter 6
Sequence Alignment and Database Search
Many biological problems such as the construction of phylogenetic trees or deducing
putative gene functions can be approached by a sequence alignment. An alignment refers
to a display of a collection of two or more sequences with one sequence written above
another showing the similarities among the different members of the collection. All
sequences in an alignment must be of the same type. They can be all DNA, all RNA, all
amino acids, or all in those derived alphabets as introduced in Chapter 3.
When two sequences in different living organisms show a fundamental functional
similarity because of their having descended from a common ancestor, they are said to be
homologous to each other. Although the rule is not absolute, it is true in many cases that
similar nucleotide sequences or protein sequences are homologous to each other.
Conversely, homologous genes and proteins often have similar sequences because they
descend from a common ancestor. That is why a good sequence alignment can lend
insight to important biological information about phylogeny, and the function of a gene or
a protein.
It is important, however, to emphasize here that sequence similarity and sequence
homology are different concepts and must not be taken to be equivalent. Two sequences
can either be homologous, or non-homologous. We cannot talk about a degree of
homology. On the other hand, we can say that one pair of sequences is 95% similar
while another pair is only 30% similar. Even though sequence homology and similarity
are often observed together, one should never, without careful verification, take it for
granted that one would imply the other.
Finding a good alignment between even a pair of relatively short sequences (say, length
50 each) is a formidable task for the human eyes. However, with the computing power
available to us at present, good sequence alignments between a pair of sequences can be
obtained so quickly that one can align a query sequence against every sequence in a huge
database holding millions of sequences within a very reasonable amount of time (say, a
few minutes). Such a process of database search is getting very popular among geneticists
and molecular biologists as they can derive useful information for a newly sequenced
stretch of DNA from other similar sequences that have been previously studied.
We shall devote the first section of this chapter to familiarizing the reader with how
sequence similarity is assessed. Section 2 explains the essence of the dynamic programming
algorithm used in many popular sequence alignment programs. Section 3 turns to database
search programs, which are perhaps the most widely used application of sequence alignment.
Section 4 will describe the statistics involved in evaluating the significance of the amount
of similarity between sequences described by an alignment. Finally, we discuss some
multiple alignment techniques in Section 5.
6.1 Sequence Similarity
Consider the pair of DNA fragments AGTAGTCAAGA and AGAAGCTCAAGA of
length 11 and 12 nucleotide bases respectively. One cannot help noticing that these
sequences "look alike", and hence one would describe them as "similar" to each
other. The similarity between the fragments is much more obvious to our eyes if we
display them as follows:
(Alignment A)
A G T A G _ T C A A G A
| | : | |   | | | | | |
A G A A G C T C A A G A
A display of this kind is called an alignment. In an alignment, one sequence is stacked on
top of the other. Since the two sequences may have different lengths, gaps are inserted at
various places as necessary. Sandwiched between the two sequences is a line of symbols
indicating whether the letters on the two sequences at corresponding positions are
matches (|) or mismatches (:).
The above display is only one of the numerous possible alignments of the given pair of
DNA sequences. For example, these two DNA fragments can also be displayed as
(Alignment B)
A G T A G _ _ _ _ T C A A G A
      | |         | | | | | |
_ _ _ A G A A G C T C A A G A
Indeed, if we allow ourselves to slide the first sequence on top of the second and
introduce gaps at arbitrary places as necessary, we can generate an enormous
number of different alignments. However, some of the alignments reveal the
similarity between the pair of sequences better than others. For our example, alignment A
obviously reveals the similarity between the sequence pair better than alignment B. The
goal of sequence alignment is to find the best alignment, the one that reveals the highest amount
of similarity between the two sequences. Sometimes there is actually more than one
such best alignment. This brings up the question of how to measure the similarity
between two sequences when we are given an alignment of them.
A simple way to measure the similarity expressed by an alignment is to assign a score to
each individual position of the alignment according to whether there is a match, a
mismatch, or a gap. The total of the scores at all the individual positions will give an
overall score of the entire alignment. If we are interested only in a particular portion of
the alignment, we can simply sum the scores in that portion. For example, if we assign a
score of 1 to a match, -1 to a mismatch, and a -2 to a position with the gap letter in one of
the sequences, we will get a score of 1+1-1+1+1-2+1+1+1+1+1+1 = 7 for Alignment A
and a score of -2-2-2+1+1-2-2-2-2+1+1+1+1+1+1 = -6 for Alignment B. Clearly
Alignment A expresses more similarity of the sequence pair than Alignment B.
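The column-by-column scoring just described can be sketched in a few lines of Python. This is only an illustration: the function name score_alignment and the use of "_" as the gap letter are our own choices, and the default scores are the +1, -1, -2 values of the example above.

```python
def score_alignment(top, bottom, match=1, mismatch=-1, gap=-2, gap_char="_"):
    """Sum the per-column scores of a pairwise alignment.

    `top` and `bottom` are the two rows of the alignment, written as
    equal-length strings with `gap_char` marking gap positions.
    """
    if len(top) != len(bottom):
        raise ValueError("the two rows of an alignment must have equal length")
    total = 0
    for a, b in zip(top, bottom):
        if a == gap_char or b == gap_char:
            total += gap        # a letter facing the gap letter
        elif a == b:
            total += match      # identical letters
        else:
            total += mismatch   # different letters
    return total


# The two alignments of the text, written without the spaces between letters.
print(score_alignment("AGTAG_TCAAGA", "AGAAGCTCAAGA"))        # Alignment A: 7
print(score_alignment("AGTAG____TCAAGA", "___AGAAGCTCAAGA"))  # Alignment B: -6
```

To score only a portion of an alignment, one can pass the corresponding slices of the two rows.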
Scoring functions of this kind, which depend only on the counts of matches, mismatches,
and gap letters, do not take into account the various degrees of similarity in biochemical
properties among the different pairs of residues. This is particularly important when we are
aligning amino acid sequences because some of the 20 different amino acids are more
similar to each other than others in their biochemical properties. Substitution of one
amino acid by a different one with similar biochemical properties will not alter the
structure and function of the protein molecule much. On the other hand, replacing one
amino acid by another which has entirely different biochemical properties can completely
destroy the function of the protein molecule.
Commonly, the amino acids are grouped into four families as displayed in Table 6.1.
Those of you who are interested in chemistry may want to look up a biochemistry or
molecular biology textbook (e.g., ...) to examine the chemical structure of these amino
acids. You will see that members within the same family resemble one another more than
members from different families. For example, glutamic acid is much more
similar to aspartic acid (both being acids) than to, say, cysteine. Leucine and isoleucine are
almost identical in structure and they can easily substitute for each other without altering
the chemical properties of the protein very much. To take into account the various degrees
of similarity and dissimilarity among amino acids, we make use of scoring matrices.
These are 20 by 20 matrices in which each entry indicates the similarity between the
amino acid on the row and that on the column. Because of symmetry, it is sufficient to
give the entries above and including the diagonal, or those below and including the
diagonal. The other entries can be inferred by symmetry.
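As a small illustration of how a triangular matrix is used, the sketch below stores one entry per unordered pair for five residues only (the values are copied from the BLOSUM 62 matrix of Figure 6.1) and recovers the other half by symmetry. The names TRIANGLE and similarity are hypothetical, introduced just for this example.

```python
# One entry per unordered pair, keyed by the two letters in sorted order.
# Values for C, D, E, I, L are taken from BLOSUM 62 (Figure 6.1).
TRIANGLE = {
    ("C", "C"): 9, ("C", "D"): -3, ("C", "E"): -4, ("C", "I"): -1, ("C", "L"): -1,
    ("D", "D"): 6, ("D", "E"): 2, ("D", "I"): -3, ("D", "L"): -4,
    ("E", "E"): 5, ("E", "I"): -3, ("E", "L"): -3,
    ("I", "I"): 4, ("I", "L"): 2,
    ("L", "L"): 4,
}


def similarity(a, b):
    """Look up the score of the pair (a, b); the order of a and b is irrelevant."""
    return TRIANGLE[tuple(sorted((a, b)))]


print(similarity("E", "D"), similarity("D", "E"))  # same entry, read twice
print(similarity("L", "I"), similarity("E", "C"))
```

Sorting the two letters before the lookup is one simple way to exploit the symmetry; a full 20 by 20 table would work equally well at the cost of storing every off-diagonal value twice.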
Table 6.1
Family            Members
Acidic            Aspartic acid, Glutamic acid
Basic             Lysine, Arginine, Histidine
Uncharged Polar   Asparagine, Glutamine, Serine, Threonine, Tyrosine
Nonpolar          Alanine, Glycine, Valine, Leucine, Isoleucine, Proline,
                  Phenylalanine, Methionine, Tryptophan, Cysteine
The two big classes of scoring matrices are the PAM (Dayhoff 1972) and BLOSUM
(Henikoff and Henikoff 1992) families of matrices. These families of matrices are
constructed from a statistical analysis of a carefully collected database together with
biologists' knowledge of the evolutionary relationships of the sequences in the collection.
We shall explain in detail how the BLOSUM matrices are constructed in Section 6.4.
Figure 6.1 shows the BLOSUM 62 matrix, a popularly used member of the BLOSUM
family.
When we are allowed to introduce gaps in an alignment, we need to assess how much the
similarity is affected by the insertion of a gap and by the length of the gap. It is believed
that extending an already opened gap does not have as devastating an effect
as opening a new gap. While there are no general rules dictating what gap opening
penalty and gap extension penalty to use, most sequence alignment programs use a
gap penalty function given by w(k) = a + bk for a gap of length k. Here a and b are
respectively the gap opening and gap extension penalties, which are free parameters for
the user to choose. In practice, trying different values for these
parameters and examining the alignments obtained generally gives a feeling for which
values produce alignments that best exhibit the similarities of the sequences under
comparison.
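To see why this form of penalty favors one long gap over several short ones, consider the following sketch. The values a = 4 and b = 1 are arbitrary choices made for illustration, and we assume here that the penalty is a positive amount to be subtracted from the alignment score.

```python
def gap_penalty(k, a, b):
    """Penalty w(k) = a + b*k for a single gap of length k >= 1."""
    if k < 1:
        raise ValueError("a gap has length at least 1")
    return a + b * k


a, b = 4, 1                          # illustrative opening and extension penalties
one_long_gap = gap_penalty(3, a, b)  # a single gap covering three positions
three_short = 3 * gap_penalty(1, a, b)  # three separate gaps of length one
print(one_long_gap, three_short)
```

With these values the single gap costs 7 while the three separate gaps cost 15, because the opening penalty a is charged once in the first case and three times in the second.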
A   4
R  -1   5
N  -2   0   6
D  -2  -2   1   6
C   0  -3  -3  -3   9
Q  -1   1   0   0  -3   5
E  -1   0   0   2  -4   2   5
G   0  -2   0  -1  -3  -2  -2   6
H  -2   0   1  -1  -3   0   0  -2   8
I  -1  -3  -3  -3  -1  -3  -3  -4  -3   4
L  -1  -2  -3  -4  -1  -2  -3  -4  -3   2   4
K  -1   2   0  -1  -3   1   1  -2  -1  -3  -2   5
M  -1  -1  -2  -3  -1   0  -2  -3  -2   1   2  -1   5
F  -2  -3  -3  -3  -2  -3  -3  -3  -1   0   0  -3   0   6
P  -1  -2  -2  -1  -3  -1  -1  -2  -2  -3  -3  -1  -2  -4   7
S   1  -1   1   0  -1   0   0   0  -1  -2  -2   0  -1  -2  -1   4
T   0  -1   0  -1  -1  -1  -1  -2  -2  -1  -1  -1  -1  -2  -1   1   5
W  -3  -3  -4  -4  -2  -2  -3  -2  -2  -3  -2  -3  -1   1  -4  -3  -2  11
Y  -2  -2  -2  -3  -2  -1  -2  -3   2  -1  -1  -2  -1   3  -3  -2  -2   2   7
V   0  -3  -3  -3  -1  -2  -2  -3  -3   3   1  -2   1  -1  -2  -2   0  -3  -1   4
    A   R   N   D   C   Q   E   G   H   I   L   K   M   F   P   S   T   W   Y   V
Figure 6.1 The BLOSUM 62 amino acid substitution matrix.
6.2 Sequence alignment algorithms
6.2.1 Dot-matrix analysis
The first computer-aided sequence comparison method is called "dot-matrix analysis" or simply
dot-plot. The first published account of this method is by Gibbs and McIntyre (1970, The
diagram, a method for comparing sequences. Eur. J. Biochem. 16: 1-11). Briefly, this
method involves constructing a matrix with one of the sequences to be compared running
horizontally across the bottom, and the other running vertically along the left-hand side.
Each entry of the matrix is a measure of similarity of the two residues at the corresponding
positions of the horizontal and vertical sequences. In their paper, Gibbs and McIntyre use
the simplest scoring system, which distinguishes only between identical (dots) and
non-identical (blank) residues. However, one can also use graded measures that give
chemically similar pairs of residues higher similarity scores, such as the BLOSUM and PAM
matrices, and enter a dot whenever the similarity exceeds a prescribed value.
Similar sequences tend to have many identical or chemically related residues along the
main diagonal; hence conspicuous diagonal runs of dots signal regions of similarity.
Simple as it is, dot matrix analysis is still a popular tool for researchers to visually inspect
the similarity between two sequences. It is often used as a first examination. From its
output, the researcher can pick out regions from the two sequences on which more
detailed alignment will be performed.
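A minimal text-mode sketch of the identity-based dot-plot is given below. It is not the Gibbs and McIntyre program; the function names are ours, and printing the first residue of the vertical sequence on the bottom row is simply one way of honoring the layout described above.

```python
def dot_matrix(horizontal, vertical):
    """One string per residue of `vertical`: '.' where the two residues are identical."""
    return ["".join("." if v == h else " " for h in horizontal) for v in vertical]


def render(horizontal, vertical):
    """Text picture with `horizontal` along the bottom and `vertical` read upward."""
    rows = dot_matrix(horizontal, vertical)
    lines = [f"{v} {row}" for v, row in zip(vertical, rows)]
    lines.reverse()                  # first residue of `vertical` goes on the bottom row
    lines.append("  " + horizontal)  # horizontal sequence across the bottom
    return "\n".join(lines)


print(render("AGTCA", "GCTC"))
```

Even on these two very short sequences, a run of dots along a diagonal (the T and C shared by both) stands out from the isolated dots.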
Maizel and Lenk (1981 "Enhanced Graphic Matrix Analysis of Nucleic Acid and Protein
Sequences", Proc. Natl. Acad. Sci. USA 78; 7665-7669) generalize the original ideas of
Gibbs and McIntyre. At every base of the two sequences, a window of fixed size is laid
down. A dot will be entered in the matrix if the total similarity score of the two windowed
fragments exceeds a prescribed threshold. Their algorithm is implemented in the GCG
program "compare". The output of compare can be fed into the "dot-plot" program to
draw the dot-matrix. Figure 6.2 is the dot-plot output of the amino acid sequences of the
human hemoglobin α and β chains.
Figure 6.2 The dot-plot output of the amino acid sequences of the human hemoglobin
alpha and beta chains.
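The windowed variant of Maizel and Lenk can be sketched as follows. This is not the GCG "compare" program: we assume, for simplicity, that each window is anchored at its first position, and the default scoring function (1 for identical letters, 0 otherwise) is our own choice; a lookup into a BLOSUM or PAM matrix could be passed in instead.

```python
def windowed_dots(s, t, window, threshold, score=lambda a, b: 1 if a == b else 0):
    """Return the set of 0-based (i, j) pairs at which a dot is entered.

    A dot is entered at (i, j) when the summed score of the windows
    s[i:i+window] and t[j:j+window] exceeds `threshold`.
    """
    dots = set()
    for i in range(len(s) - window + 1):
        for j in range(len(t) - window + 1):
            total = sum(score(s[i + k], t[j + k]) for k in range(window))
            if total > threshold:
                dots.add((i, j))
    return dots


# With a window of 2 and a threshold of 1, only a perfect two-letter match gives a dot.
print(windowed_dots("GCTC", "AGTCA", window=2, threshold=1))
```

Raising the window size and the threshold together suppresses the isolated chance matches that clutter a single-residue dot-plot.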
6.2.2 The dynamic programming algorithm
In 1970, Needleman and Wunsch (1970, A general method applicable to the search for
similarities in the amino acid sequence of two proteins. J. Mol. Biol., 48: 443 - 453)
introduced an elegant algorithm for comparing two protein sequences. This general
algorithm also works for aligning nucleic acid sequences. The algorithm actually
belongs to a very large class of algorithms for finding optimal solutions. The essence of
the algorithm is a technique known as dynamic programming.
For any letter sequence s, the segment consisting of the letters from the
beginning of the sequence up to the ith letter is called a prefix, and it is
denoted by s[i]. The dynamic programming technique finds the optimal
alignment by taking advantage of the optimal alignments already found for the prefixes of
the sequences. Suppose s and t are two sequences of lengths m and n respectively. There are
m+1 possible prefixes of s and n+1 prefixes of t, including the empty string. We arrange
our calculations in an (m+1) x (n+1) matrix where entry (i, j)
contains the similarity between the prefixes s[i] and t[j]. This entry will be denoted by
sim(i, j).
Let us illustrate the dynamic programming algorithm using an example. We shall try to
align the two DNA sequences s = GCTC and t = AGTCA with m = 4 and n = 5. Every
base match receives a similarity score of +1 and every mismatch -1. The gap penalty
function is chosen to be w(k) = -1-2k, where k is the length of the gap. In other words, a
penalty of -3 is given to a gap of length 1, and the penalty grows by a further 2
for each additional position as the gap lengthens.
Figure 6.3 Dynamic programming algorithm for global alignment
We place s on the left and t along the top margin of a rectangular array. A special
character "^" is introduced to indicate that the sequence will begin at the next position.
The 0th row and 0th column are initialized with the gap penalty function, with k being the
length of the "gap" that has to be inserted at the beginning of either sequence. For
instance, cell (0,3) has the value -7 because lining up the 0th character "^" of sequence s
with the 3rd character "T" of sequence t produces a gap of length 3 at the beginning
of the alignment, like this:
_ _ _ | sequence s begins here
A G T | the rest of sequence t here
The gap penalty, accordingly, is -1-2(3) = -7.
For the rest of the array, cell (i, j) will be filled with the amount of similarity between the
prefixes s[i] and t[j] computed recursively. Suppose we have already filled the entries at
(i-1, j), (i-1, j-1), (i, j-1). Then we can compute
$$
\mathrm{sim}(i, j) = \max \begin{cases}
\mathrm{sim}(i, j-k) + w(k), & k = 1, \ldots, j \\
\mathrm{sim}(i-1, j-1) + p(i, j) \\
\mathrm{sim}(i-k, j) + w(k), & k = 1, \ldots, i
\end{cases}
\qquad (6.1)
$$
where p(i,j) is the similarity score between the ith letter of sequence s and the jth letter of
sequence t. In the scheme of this example, p(i,j) can only be +1 or -1 depending on
whether the letters are identical or different.
The reasoning behind equation (6.1) is that there are just three possible ways of obtaining
an alignment between s[i] and t[j]:
(A) Align s[i] and t[j-k], and match a new gap of length k with the next k letters of t.
(B) Align s[i-1] and t[j-1], and match the ith letter of s with the jth letter of t.
(C) Align s[i-k] and t[j], and match a new gap of length k with the next k letters of s.
These possibilities are exhaustive because we cannot have two spaces paired in the last
column of the alignment. Scores of the best alignments between smaller prefixes are
already stored in the array if we choose an appropriate order in which to compute the
entries (e.g., fill the array row by row, left to right in each row; or fill the array column by
column, top to bottom on each column).
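The filling of the array by equation (6.1) can be sketched as below, using the scoring scheme of our example (+1, -1, and w(k) = -1-2k). The function names are ours, and the array is filled row by row; note that trying every gap length k in every cell makes this direct transcription slower than the refined versions used in production programs.

```python
def gap_penalty(k):
    """The gap penalty function w(k) = -1 - 2k of the example."""
    return -1 - 2 * k


def pair_score(a, b):
    """p(i, j): +1 for identical letters, -1 otherwise."""
    return 1 if a == b else -1


def fill_global(s, t, w=gap_penalty, p=pair_score):
    """Return the (m+1) x (n+1) array of sim(i, j) values for global alignment."""
    m, n = len(s), len(t)
    sim = [[0] * (n + 1) for _ in range(m + 1)]
    for j in range(1, n + 1):
        sim[0][j] = w(j)                 # leading gap of length j in s
    for i in range(1, m + 1):
        sim[i][0] = w(i)                 # leading gap of length i in t
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            candidates = [sim[i - 1][j - 1] + p(s[i - 1], t[j - 1])]          # (B) diagonal
            candidates += [sim[i][j - k] + w(k) for k in range(1, j + 1)]      # (A) horizontal
            candidates += [sim[i - k][j] + w(k) for k in range(1, i + 1)]      # (C) vertical
            sim[i][j] = max(candidates)
    return sim


for row in fill_global("GCTC", "AGTCA"):
    print(row)
```

Row i of the printed array corresponds to the prefix s[i] and column j to the prefix t[j], so the bottom right entry is the similarity of the two complete sequences.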
As we enter each entry in the array following equation (6.1), we draw an arrow to indicate
where the maximum value comes from. The options (A), (B), and (C) correspond to
getting the value for the current cell from the horizontal, diagonal, and vertical direction
respectively. For instance, the cell in row 1 and column 3 of the matrix will contain the
value of sim(1,3). This is obtained as the maximum of the following
numbers.
sim(1, 2) - 3 = -2 - 3 = -5 (horizontal)
sim(1, 1) - 5 = -1 - 5 = -6 (horizontal)
sim(1, 0) - 7 = -3 - 7 = -10 (horizontal)
sim(0, 2) - 1 = -5 - 1 = -6 (diagonal)
sim(0, 3) - 3 = -7 - 3 = -10 (vertical)
The maximum value comes from entry (1,2), and that is where the arrow points. If there
is more than one way of getting the maximum, we put arrows to indicate all the
possibilities. See, for example, entry (2,1).
After the array has been completely filled, we find the best alignment by tracing back
along the arrows. We start at the bottom right corner of the array and move according to
the direction of the arrow. The best alignment we get from Figure 6.3 is
A G T C A
: : | |
G C T C _
When there are multiple arrows emanating from an entry, we can follow any one of them.
So it is possible to have more than one optimal alignment. Most computer programs for
sequence alignment will report all different optimal alignments. It is important to note
that an optimal alignment is optimal only for the particular similarity score matrix and
gap penalty function used. When either of these is altered, the optimal alignment may also
change.
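The traceback can be sketched by storing, for every cell, the cell its maximum came from (the "arrow"). The code below is our own illustration: it fills the array again while recording one arrow per cell, so when several arrows are possible it follows only one of them and reports a single optimal alignment. The scoring defaults are those of the example.

```python
def align_global(s, t, w=lambda k: -1 - 2 * k, p=lambda a, b: 1 if a == b else -1):
    """Return (score, row for t, row for s) of one optimal global alignment."""
    m, n = len(s), len(t)
    sim = [[0] * (n + 1) for _ in range(m + 1)]
    back = [[None] * (n + 1) for _ in range(m + 1)]   # arrow: the cell we came from
    for j in range(1, n + 1):
        sim[0][j], back[0][j] = w(j), (0, 0)
    for i in range(1, m + 1):
        sim[i][0], back[i][0] = w(i), (0, 0)
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            options = [(sim[i - 1][j - 1] + p(s[i - 1], t[j - 1]), (i - 1, j - 1))]
            options += [(sim[i][j - k] + w(k), (i, j - k)) for k in range(1, j + 1)]
            options += [(sim[i - k][j] + w(k), (i - k, j)) for k in range(1, i + 1)]
            sim[i][j], back[i][j] = max(options, key=lambda o: o[0])

    top, bottom = [], []              # built from right to left, reversed at the end
    i, j = m, n                       # start at the bottom right corner
    while (i, j) != (0, 0):
        pi, pj = back[i][j]
        if pi < i and pj < j:         # diagonal arrow: pair two letters
            top.append(t[j - 1])
            bottom.append(s[i - 1])
        elif pi == i:                 # horizontal arrow: letters of t face a gap
            top.extend(reversed(t[pj:j]))
            bottom.extend("_" * (j - pj))
        else:                         # vertical arrow: letters of s face a gap
            top.extend("_" * (i - pi))
            bottom.extend(reversed(s[pi:i]))
        i, j = pi, pj
    return sim[m][n], "".join(reversed(top)), "".join(reversed(bottom))


score, row_t, row_s = align_global("GCTC", "AGTCA")
print(score)
print(row_t)
print(row_s)
```

For the example pair this reproduces the alignment shown above, with a total score of -1-1+1+1-3 = -3.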
The GCG program "Gap" uses the above algorithm to find the best global alignment of
two sequences.
Exercise With the same similarity scoring scheme, and gap penalty as in the example
above, find the best alignment between the pair of DNA sequences in the beginning of
this section.
In the description above, we try to find an alignment that gives the best overall similarity
score between the entirety of the two sequences. This is called an optimal global
alignment. At times, our aim is instead to find the pair of segments from the given
sequences that line up best with each other. This is called an optimal local alignment.
Local alignments are particularly useful when a new sequence has just been obtained from the
laboratory. The researcher would first like to identify any parts of the sequence that have
high similarity to known functional domains. The popular database search program
BLAST uses a local alignment algorithm.
The dynamic programming local alignment algorithm was developed in the early 1980's
(Smith and Waterman 1981, “Identification of common molecular subsequences”. J Mol
Biol. 147(1):195-7.) and is frequently referred to as the Smith-Waterman algorithm. It
shares the same basic concepts with the global algorithm, differing only in a few details.
First, an extra possibility is added to equation (6.1), allowing sim(s[1..i], t[1..j]) to take
the value of 0 if all other options have value less than 0. That is,
$$
\mathrm{sim}(s[1..i], t[1..j]) = \max \begin{cases}
0 \\
\mathrm{sim}(s[1..i], t[1..j-k]) + w(k), & k = 1, \ldots, j-1 \\
\mathrm{sim}(s[1..i-1], t[1..j-1]) + p(i, j) \\
\mathrm{sim}(s[1..i-k], t[1..j]) + w(k), & k = 1, \ldots, i-1
\end{cases}
\qquad (6.2)
$$
Consequently, the top row and left column of the array in Figure 6.3 will now be filled
with 0's instead of the w(k)'s as in global alignment. Taking the option 0 corresponds to
starting a new alignment. Since we are only looking for a local alignment, the alignment
can start anywhere in the two sequences. So, if the best alignment up to a certain point
has a negative score, it is better to start a new one at that point.
Second, the alignment can end anywhere in the sequences. So, instead of starting the
traceback from the bottom right corner, we look for the highest similarity value in the
array and start the traceback from there. The traceback ends when we meet a cell with
value 0, which corresponds to the start of the alignment.
If we follow equation (6.2) to find the best local alignment of the same pair of sequences
s and t, we will have the array in Figure 6.4, which indicates that the best local alignment
between these two sequences is the match of two nucleotide bases TC in the 3rd and 4th
positions of both sequences.
Figure 6.4 Dynamic programming algorithm for local alignment
The GCG programs that use dynamic programming algorithms for local pairwise
sequence alignments are BestFit and FrameAlign.
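The two modifications that turn the global algorithm into the local one (the extra option 0, and a traceback that starts at the largest entry and stops at a 0) can be sketched as follows. This is an illustrative transcription of equation (6.2), not the code of BestFit or BLAST; the function name and the returned tuple are our own choices, and only one optimal local alignment is reported.

```python
def align_local(s, t, w=lambda k: -1 - 2 * k, p=lambda a, b: 1 if a == b else -1):
    """Return (score, row for s, row for t, start in s, start in t), positions 1-based."""
    m, n = len(s), len(t)
    sim = [[0] * (n + 1) for _ in range(m + 1)]        # top row and left column stay 0
    back = [[None] * (n + 1) for _ in range(m + 1)]    # None marks the start of an alignment
    best, best_cell = 0, (0, 0)
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            options = [(0, None)]                                            # start afresh
            options.append((sim[i - 1][j - 1] + p(s[i - 1], t[j - 1]), (i - 1, j - 1)))
            options += [(sim[i][j - k] + w(k), (i, j - k)) for k in range(1, j)]
            options += [(sim[i - k][j] + w(k), (i - k, j)) for k in range(1, i)]
            sim[i][j], back[i][j] = max(options, key=lambda o: o[0])
            if sim[i][j] > best:
                best, best_cell = sim[i][j], (i, j)

    row_s, row_t = [], []
    i, j = best_cell                       # traceback starts at the highest entry
    while back[i][j] is not None:          # and ends at a cell of value 0
        pi, pj = back[i][j]
        if pi < i and pj < j:              # diagonal: pair two letters
            row_s.append(s[i - 1])
            row_t.append(t[j - 1])
        elif pi == i:                      # horizontal: letters of t face a gap
            row_s.extend("_" * (j - pj))
            row_t.extend(reversed(t[pj:j]))
        else:                              # vertical: letters of s face a gap
            row_s.extend(reversed(s[pi:i]))
            row_t.extend("_" * (i - pi))
        i, j = pi, pj
    return best, "".join(reversed(row_s)), "".join(reversed(row_t)), i + 1, j + 1


print(align_local("GCTC", "AGTCA"))
```

For s = GCTC and t = AGTCA the sketch finds the segment TC starting at position 3 of both sequences, with score 2, in agreement with the discussion of Figure 6.4.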
6.3 Database similarity search: BLAST and FASTA
BLAST is the acronym for Basic Local Alignment Search Tool. It uses the method of
Altschul et al. (JMB 215:403-410, 1990) to pick out sequences already collected in a
database that are similar to the query sequence. BLAST takes the query sequence input by
the user and compares it with each entry in the database, looking for segments with a high
degree of similarity. It picks out from the database those sequences that contain a
segment so similar to part or all of the query that the similarity is deemed statistically
significant (i.e., unlikely to occur by chance).
Exercise BLAST is available at the NCBI web site. Before you go on, it may be helpful
to visit http://www.ncbi.nlm.nih.gov/blast/ to take a look at the BLAST overview and go
through the exercise in the BLAST tutorial there.
The algorithm used in the current version of BLAST at NCBI can be summarized in three
main steps:
Step 1. Finding high-scoring segment pairs: BLAST compares each sequence in the database
with the query. From each sequence pair it first seeks equal-length
segments whose aggregate similarity score is maximal, in the sense that it cannot be
increased by extension or trimming. Such locally optimal alignments are called "high-scoring
segment pairs" or HSP's. The current version of BLAST requires that each HSP
must contain at least two non-overlapping pairs of words of length W (these word pairs
are called "hits" in BLAST jargon; default values for W are 3 for amino acid and 11 for
nucleotide sequences) satisfying the following requirements:
a) Their similarity score exceeds a threshold value T.
b) The offsets of the two word pairs are equal. If a word pair occurs at position x1 of the
first sequence and position x2 of the second sequence, the offset of the word pair is
defined to be x1 - x2.
c) The distance between the word pairs is no more than a preset upper limit A. The
distance between two word pairs (x1, x2) and (x'1, x'2) is defined to be the difference
between their first coordinates, x1 - x'1.
The rationale behind these criteria for finding HSP's is the observation that an
HSP with a large enough similarity score to eventually generate a statistically significant
local alignment is very likely to contain multiple hits with the same offset and within a
relatively short distance of one another. The chance of missing any HSP of interest
using this procedure is relatively small.
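Criteria (b) and (c) can be illustrated with a toy sketch. It is emphatically not BLAST's implementation: to keep it short we treat only exactly identical words as hits (so the threshold T of criterion (a) is not modeled), positions are 0-based, and the function names and the small test sequences are made up for the illustration.

```python
from collections import defaultdict
from itertools import combinations


def word_hits(query, subject, W):
    """All (x1, x2) such that the word of length W at x1 in query equals that at x2 in subject."""
    index = defaultdict(list)
    for x1 in range(len(query) - W + 1):
        index[query[x1:x1 + W]].append(x1)
    hits = []
    for x2 in range(len(subject) - W + 1):
        for x1 in index.get(subject[x2:x2 + W], []):
            hits.append((x1, x2))
    return hits


def two_hit_seeds(query, subject, W, A):
    """Pairs of non-overlapping hits with equal offset and distance at most A.

    Each seed is reported as (offset, x1 of first hit, x1 of second hit).
    """
    by_offset = defaultdict(list)
    for x1, x2 in word_hits(query, subject, W):
        by_offset[x1 - x2].append(x1)          # criterion (b): group hits by offset
    seeds = []
    for offset, starts in by_offset.items():
        for first, second in combinations(sorted(starts), 2):
            distance = second - first          # difference of first coordinates
            if W <= distance <= A:             # non-overlapping, criterion (c)
                seeds.append((offset, first, second))
    return seeds


print(two_hit_seeds("ACGTAGGCT", "TTACGCAGGCT", W=3, A=5))
```

In this made-up example all four hits share the offset -2, and the hit at query position 0 pairs with the hits at positions 4 and 5; lowering A to 3 leaves no qualifying pair.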
Step 2. Gapped extensions of HSP's: BLAST retains only those HSP's that exceed a
moderate score Sg, and attempts to extend each of them in both the leftward and
rightward directions while allowing gaps to be introduced. Sg is controlled so that no
more than about one gapped extension is invoked per 50 database sequences. A dynamic
programming type algorithm, with modifications to improve efficiency, is used. Whenever a gap is
opened or extended, a penalty is imposed according to a gap penalty function of the
form w(k) = a+bk with k being the length of the gap. The alignment(s) with the maximal
score will be assessed for statistical significance.
Step 3. Assess statistical significance of the maximal alignment score: If the fully
extended gapped alignment is deemed significant, the database sequence will be picked
and described in the output. The evaluation of statistical significance is based on
comparison with the rolling-die random sequence models described before (in Chapter 3).
For example, the random amino acid sequence model is generated by rolling an
icosahedral (20-faced) die a number of times equal to the length of the sequences
under comparison. The die is loaded according to the relative frequencies of occurrence
of the amino acids in the database.
The maximal alignment score M for two random sequences is a random variable.
Asymptotically, it follows an extreme value distribution as the lengths m and n of the
sequences tend to infinity. In practice, when m and n are large, the asymptotic distribution
yields a good approximation that can be used to calculate the probability that the maximal local
alignment score exceeds any given level. We shall discuss this more fully in the next
section.
From the probability distribution of the maximal alignment scores, one can determine the
probability of getting an alignment as good as the one observed. If this probability is
small (say < 0.05), the alignment is deemed statistically significant. In the BLAST output,
this probability p is converted to a bit score equal to $-\log_2 p$. The smaller the probability,
the larger the bit score.
One can also calculate the expected number E of times an alignment with such a score
would occur in a database of the same size as the one searched. If this expected number is
high, it means that the alignment can occur quite frequently by chance. On the other hand,
a low value of E indicates that the alignment is expected to occur very rarely by chance and
hence is worth further examination. BLAST lets you specify a parameter which discards those
alignments expected to occur more than a certain number (default is 10) of times.
6.4 Statistics for sequence alignments
Beneath the surface of the sequence alignment programs lie two important applications of
statistics. First, statistics play a key role in the construction of the similarity score
matrices. Second, the evaluation of the significance of the "best" alignment found by any
sequence alignment algorithm also depends on statistics.
The BLOSUM family of similarity score matrices
BLOSUM is the acronym for blocks substitution matrix. The name comes from the fact
that the values of the matrices come from a large collection of blocks of biologically
similar proteins. S. Henikoff and J.G. Henikoff (1991, Nucleic Acids Res. 19, 6565-6572)
designed an automated system, PROTOMAT, for obtaining a set of blocks given a group
of related proteins. This system was applied to a catalog of several hundred protein groups,
yielding a database of more than 2000 blocks.
Each block in this database consists of a number (say, d) of aligned amino acid
sequences. Suppose there are w columns in the alignment. We say that the block has
depth d and width w. From each column of d amino acids, one can form
1+2+...+(d -1) = d(d-1)/2
unordered pairs of amino acids. For example, if a column contains nine alanines and one
serine, one can form 10(9)/2 = 45 pairs, of which 36 are [A, A] and 9 are [A, S]. Gap letters in
the alignment are ignored and no pair is formed with gaps.
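The pair counting for a single column can be sketched with the Python standard library. The function name column_pairs and the use of "_" for the gap letter are our own conventions; each unordered pair is stored with its two letters in alphabetical order.

```python
from collections import Counter
from itertools import combinations


def column_pairs(column, gap_char="_"):
    """Count the unordered residue pairs formed within one column of a block."""
    residues = [c for c in column if c != gap_char]   # gap letters form no pairs
    return Counter(tuple(sorted(pair)) for pair in combinations(residues, 2))


counts = column_pairs("AAAAAAAAAS")   # the column with nine alanines and one serine
print(counts)
print(sum(counts.values()))           # d(d-1)/2 pairs in total
```

Summing such counters over every column of every block gives the frequency counts of all the amino acid pairs.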
Exercise If a column contains 2 A's, 1 S, 1 T and 1 _, list the possible pairs formed and
their frequencies.
When the procedure is repeated on every column of every protein block in the database,
we obtain the frequency counts of all the 210 (i.e., 1 + 2 +...+ 20) amino acid pairs. For
simplicity, we shall index the amino acids from 1 to 20 in some convenient order, (say,
alphabetically by names) and denote the frequency counts by fij, where i=1, ..., 20, j =
1,..., i. These frequency counts will be used to calculate the score matrix.
First, we need to calculate an "odds ratio" which is defined to be the ratio of the observed
relative frequencies to the expected relative frequencies of the amino acid pairs. The
observed relative frequencies are calculated as
$$
q_{ij} = f_{ij} \Big/ \sum_{i=1}^{20} \sum_{j=1}^{i} f_{ij} .
$$
Let us pretend that the entire database has only that column of 9 A's and 1 S described
before, where fAA = 36 and fAS = 9. Then qAA = 36/45 = 0.8 and qAS = 9/45 = 0.2.
The expected relative frequencies pij are calculated based on a rolling-die model where all
the amino acids in the protein block were generated independently. In such a model, we
can write pij = pipj where pi and pj are the probabilities of observing the two individual
amino acids in the database. These probabilities are estimated by the observed relative
frequencies. So, $\hat{p}_i = q_{ii} + \sum_{j \ne i} q_{ij}/2$. Hence the expected relative frequency of the pair is
estimated by
$$
\hat{p}_{ij} = \begin{cases}
\hat{p}_i^2, & i = j \\
2\hat{p}_i\hat{p}_j, & i \ne j
\end{cases}
$$
In the example, the expected relative frequency for [A, A] is 0.9 x 0.9 = 0.81, that of [A,
S] is 2 x 0.9 x 0.1 = 0.18, and that of [S, S] is 0.1 x 0.1 = 0.01.
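The arithmetic of this toy example can be checked with a short sketch. The function names are hypothetical, and the pair counts are stored in a dictionary keyed by the two letters of each unordered pair.

```python
def observed_frequencies(f):
    """q: each pair count divided by the total number of pairs."""
    total = sum(f.values())
    return {pair: count / total for pair, count in f.items()}


def residue_probabilities(q):
    """Estimated probability of each residue: q_ii plus half of every q_ij with j != i."""
    p = {}
    for (a, b), value in q.items():
        if a == b:
            p[a] = p.get(a, 0.0) + value
        else:
            p[a] = p.get(a, 0.0) + value / 2
            p[b] = p.get(b, 0.0) + value / 2
    return p


def expected_frequency(p, a, b):
    """Expected relative frequency of the unordered pair [a, b] under independence."""
    return p[a] ** 2 if a == b else 2 * p[a] * p[b]


f = {("A", "A"): 36, ("A", "S"): 9}   # the 9A-1S column
q = observed_frequencies(f)
p = residue_probabilities(q)
print(q)
print(p)
print(expected_frequency(p, "A", "A"), expected_frequency(p, "A", "S"),
      expected_frequency(p, "S", "S"))
```

Up to floating-point rounding, the printed values are the 0.8 and 0.2 of the observed frequencies, the residue probabilities 0.9 and 0.1, and the expected frequencies 0.81, 0.18 and 0.01 given above.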
The odds ratio of each pair is then calculated as $q_{ij} / \hat{p}_{ij}$. The base 2 logarithm,
measured in number of bits, of this odds ratio is referred to as a "lod ratio":
$$
\mathrm{lod} = \log_2 (q_{ij} / \hat{p}_{ij}) .
$$
The lod ratio is positive, zero, or negative according to whether the amino acid pair occurs more
frequently than expected, just as frequently as expected, or less frequently than expected.
A positive lod ratio indicates that the pair of amino acids frequently substitute for each
other in proteins with like functions. The pair usually have similar molecular structures
and biochemical functions. The lod ratios are multiplied by a scaling factor of 2 and then
rounded to the nearest integer value to produce the values in a BLOSUM matrix in half-bit
units.
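The conversion from frequencies to a matrix entry can be sketched as follows. The function name is ours, and the input values in the last two calls are made-up numbers chosen so that the odds ratios are exactly 4 and 1/4. Note that Python's built-in round resolves exact halves to the nearest even integer, which may differ from the rounding convention used to build the published matrices.

```python
import math


def blosum_entry(q_ij, p_ij, scale=2):
    """Scaled, rounded lod ratio: round(scale * log2(q_ij / p_ij)), in half-bit units."""
    return round(scale * math.log2(q_ij / p_ij))


# The toy 9A-1S column: both observed frequencies are close to expectation.
print(blosum_entry(0.8, 0.81), blosum_entry(0.2, 0.18))

# Hypothetical pairs observed 4 times more, and 4 times less, often than expected.
print(blosum_entry(0.02, 0.005), blosum_entry(0.005, 0.02))
```

A pair never observed in the database (q = 0) has no finite logarithm and would need separate handling, which this sketch does not attempt.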
To reduce multiple contributions to amino acid pair frequencies from the most closely
related members of a family, sequences are clustered within blocks and each cluster is
weighted as a single sequence in counting pairs (Henikoff, S., Wallace, J.C., and Brown,
J.P., 1990, Methods Enzymol. 183, 111-132.). This is done by specifying a clustering
percentage in which sequence segments that are identical for at least that percentage of
amino acids are grouped together. The BLOSUM matrix computed from this reduced
block of proteins is associated with that percentage. That is why we have the
BLOSUM62, BLOSUM80 matrices, etc.
The clustering procedure is best explained by the example given in Henikoff and
Henikoff (1991). Suppose the clustering percentage is set at 80%, and sequence A is
identical to sequence B at ≥ 80% of their aligned positions; then A and B are clustered
and their contributions are averaged in calculating pair frequencies. If C is identical to
either A or B at ≥ 80% of aligned positions, it is also clustered with them and the
contributions of A, B, and C are averaged, even though C might not be identical to both A
and B at ≥ 80% of the aligned positions. In the above example, if 8 of the 9 sequences
with A residues in the 9A-1S column are clustered, then the contribution of this column
to the frequency table is equivalent to that of a 2A-1S column, which contributes 2 [A, S]
pairs.
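One way to realize this weighting in code is sketched below. It rests on our reading of "weighted as a single sequence": every sequence in a cluster of size c carries weight 1/c, a pair of sequences from different clusters contributes the product of their weights, and pairs within one cluster are not counted. The function name and the cluster labels are hypothetical.

```python
from collections import Counter
from itertools import combinations


def weighted_column_pairs(column, clusters):
    """Cluster-weighted pair counts for one column.

    `column` holds one residue per sequence; `clusters` gives the cluster
    label of each sequence, in the same order.
    """
    size = Counter(clusters)
    counts = Counter()
    for (a, ca), (b, cb) in combinations(list(zip(column, clusters)), 2):
        if ca == cb:
            continue                      # one cluster acts as a single sequence
        counts[tuple(sorted((a, b)))] += 1.0 / (size[ca] * size[cb])
    return counts


# The 9A-1S column with 8 of the A-containing sequences in one cluster (label 0).
weighted = weighted_column_pairs("AAAAAAAAAS", [0] * 8 + [1, 2])
print(weighted)
```

Under these assumptions the column contributes 2 [A, S] pairs and 1 [A, A] pair, the same as an unclustered 2A-1S column.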
Assessing the statistical significance of alignment scores
The statistical theory was initially developed for the older version of BLAST, which only
looks for ungapped local alignments (i.e., HSP's). For simplicity, we shall focus our
discussion on this ungapped situation and indicate what changes are required to allow
for gapped alignments.
Again, a simple rolling-die model is assumed. The twenty amino acids occur randomly at
all positions with background probabilities pi. We require that the expected score for two
random amino acids, $\sum_{i,j} p_i p_j s_{ij}$, be negative.
Based on the theory of extreme values, it can be proved that between two sufficiently
long random letter sequences of lengths m and n, the number of high-scoring segment
pairs exceeding a certain score S can be well approximated by a Poisson random variable
with mean
$$
\mu = K m n e^{-\lambda S} .
$$
Here K and λ are mathematical parameters that depend on the letter composition of the
sequences as well as the matching scores in the matrix (see Karlin and Altschul 1990
"Methods for assessing the statistical significance of molecular sequence features by
using general scoring schemes" and references therein). They are computed by the
BLAST program and reported at the end of the BLAST output. The mathematical
derivation of the above statement is beyond the scope of this book, but at least the
formula makes intuitive sense. Doubling the length of either sequence under comparison
should double the number of HSPs attaining the given score S. Also for an HSP to attain
the score 2x, it must attain the score x twice in a row, so the expected number of such
HSPs should decrease exponentially with the score.
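Both scaling properties can be seen directly from the formula for the mean, $Kmn\,e^{-\lambda S}$. In the sketch below the function name and the values of K and $\lambda$ are placeholders chosen for illustration; they are not taken from any BLAST output.

```python
import math

def expected_hsps(K, lam, m, n, S):
    """Mean number of HSPs with score at least S: K * m * n * exp(-lam * S)."""
    return K * m * n * math.exp(-lam * S)

K, lam = 0.1, 0.3          # placeholder parameters (assumption)
base = expected_hsps(K, lam, m=100, n=200, S=50)

# Doubling the length of either sequence doubles the mean.
doubled = expected_hsps(K, lam, m=200, n=200, S=50)

# Raising the score threshold by 10 multiplies the mean by exp(-lam * 10).
higher = expected_hsps(K, lam, m=100, n=200, S=60)

print(doubled / base, higher / base)
```

The first ratio is 2 and the second is $e^{-3}$, independent of the sequence lengths.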
Here let us recall some of the results we discussed in Chapter 4. For a Poisson random
variable X with mean $\mu$, $P(X = 0) = e^{-\mu}$. Hence, $P(X \ge 1) = 1 - e^{-\mu}$. For a small value
of $\mu$ (say, < 0.01), $1 - e^{-\mu}$ is very close to $\mu$. This is seen very easily by those of you who
know the Taylor series expansion for the exponential function. It can also be checked by
trying out a few small values of $\mu$.
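Trying out a few small values takes only a few lines of Python. The function name is ours; `math.expm1` is used only to avoid rounding error when the mean is tiny.

```python
import math

def prob_at_least_one(mu):
    """P(X >= 1) = 1 - exp(-mu) for a Poisson random variable with mean mu."""
    return -math.expm1(-mu)

for mu in (0.1, 0.01, 0.001):
    p = prob_at_least_one(mu)
    # The Taylor series gives 1 - exp(-mu) = mu - mu^2/2 + ..., so the
    # relative difference between mu and p is roughly mu/2.
    print(mu, p, (mu - p) / mu)
```

The relative difference shrinks in proportion to the mean, which is why the approximation is safe once the mean is below about 0.01.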
So if S is large enough to make the Poisson mean $\mu$ small, the P-value for observing an HSP
with score S or more is approximately $Kmn\,e^{-\lambda S}$. The P-value tells how unlikely it is to
have an HSP with a score as high as S. Note, however, that this P-value depends on the length of
the sequences under comparison. To get a measure independent of the lengths, a bit-score
S' is defined to be the negative base 2 logarithm of the P-value with m and n factored out. The
mathematical properties of logarithms give the following relationship between the bit-scores
and raw scores.
$$S' = -\log_2(P/mn) = -\log_2\left(K e^{-\lambda S}\right) = \frac{\lambda S - \ln K}{\ln 2}$$
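The conversion from raw score to bit-score can be sketched as follows. The function name and the parameter values are assumptions for illustration; in practice K and $\lambda$ would be read from the end of the BLAST output.

```python
import math

def bit_score(raw_score, K, lam):
    """Bit-score S' = (lam * S - ln K) / ln 2."""
    return (lam * raw_score - math.log(K)) / math.log(2)

K, lam = 0.1, 0.3           # placeholder parameters (assumption)
S = 80                      # hypothetical raw score
S_bits = bit_score(S, K, lam)

# Consistency check against the definition: S' = -log2(K * exp(-lam * S)).
S_bits_from_definition = -math.log2(K * math.exp(-lam * S))

print(S_bits, S_bits_from_definition)
```

Since $2^{-S'} = Ke^{-\lambda S}$, the approximate P-value for sequences of lengths m and n is simply $mn\,2^{-S'}$, so bit-scores can be compared without knowing which scoring matrix produced them.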
The BLAST output also reports an E-value for each reported sequence. This is the
expected number of HSP's with score as high as that reported when the query sequence is
searched against a database of random sequences with the same base composition. If we
are comparing the query sequence, of length m, with just one database sequence of length
n, the expected number is given by $\mu = Kmn\,e^{-\lambda S}$. Now if the database contains s
sequences of average length n, we can expect to see
$s\mu = sKmn\,e^{-\lambda S} = KmN e^{-\lambda S}$ such HSP's, where N = sn is the total length of the
database. If this E-value is small, it indicates that a match with such a high score in a
database of the same size and composition is statistically unusual.
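The step from a single database sequence to a whole database can be sketched in Python. All names and numbers below are hypothetical; the point is only that the E-value is the single-sequence mean multiplied by the number of sequences, so the same score becomes less remarkable as the database grows.

```python
import math

def pairwise_mean(K, lam, m, n, S):
    """Expected number of HSPs with score >= S against ONE sequence of length n."""
    return K * m * n * math.exp(-lam * S)

def database_e_value(K, lam, m, s, n, S):
    """E-value against s sequences of average length n: K * m * N * exp(-lam * S),
    where N = s * n is the total database length."""
    N = s * n
    return K * m * N * math.exp(-lam * S)

K, lam = 0.1, 0.3            # placeholder parameters (assumption)
m, n, S = 250, 400, 80       # hypothetical query length, subject length, raw score

small_db = database_e_value(K, lam, m, s=1_000, n=n, S=S)
large_db = database_e_value(K, lam, m, s=100_000, n=n, S=S)

print(small_db, large_db)
```

With the database one hundred times larger, the E-value for the same raw score is one hundred times larger.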
The above discussion is in terms of HSP's rather than the gapped alignments that the
current version of BLAST allows. The reasoning with gapped alignments is quite
similar, except that the values of K and $\lambda$ are calculated somewhat differently. These new
parameters are used in the calculations of bit-scores and E-values in the BLAST output.
6.5 Multiple Sequence Alignment
Needleman and Wunsch (1970) remark that the dynamic programming algorithm can be
generalized to allow simultaneous alignment of more than two sequences. However, the
procedure requires large amounts of computer memory as well as computing time. Many
clever variations and adaptations of the basic pairwise alignment algorithm, some taking
advantage of parallel processing computer systems, have been proposed.
Examples of these are included in GCG programs for multiple sequence alignments:
PileUp, SeqLab, PlotSimilarity, Pretty, PrettyBox, Meme, ProfileMake, ProfileGap,
Overlap, NoOverlap, OldDistances. We shall not describe them here, but interested
readers can pursue the references contained in the GCG manual.
In many applications, however, a full alignment of multiple sequences is neither necessary
nor practical, especially when the sequences under comparison are long. All we need is to
find the matching segments in the sequences under investigation, just like finding local
alignments in pairwise sequence comparison. There is a class of algorithms, called
hash-coding algorithms, which locates matching segments among multiple long
sequences totaling millions of bases very efficiently. The key feature of this type of
algorithm is the construction of a “lookup table” of k-letter words or k-tuples (e.g., all
possible dinucleotides and trinucleotides). The method was first introduced into
molecular biology by Dumas and Ninio (1982) and was the basis of the database search
program FASTA, which, like BLAST, is extensively used.
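The lookup-table idea can be illustrated with a minimal Python sketch. This is a generic k-tuple index written for illustration, with hypothetical function names and toy sequences; it is not the FASTA implementation, nor a reconstruction of the Dumas and Ninio procedure.

```python
from collections import defaultdict

def build_lookup_table(sequences, k):
    """Map every k-letter word to the list of (sequence index, position)
    pairs at which it occurs. One pass over each sequence suffices."""
    table = defaultdict(list)
    for seq_id, seq in enumerate(sequences):
        for pos in range(len(seq) - k + 1):
            table[seq[pos:pos + k]].append((seq_id, pos))
    return table

def words_shared_by_all(table, n_sequences):
    """Return the words that occur in every sequence, with their locations."""
    return {
        word: hits
        for word, hits in table.items()
        if len({seq_id for seq_id, _ in hits}) == n_sequences
    }

# Toy input (assumption): three short DNA sequences and trinucleotide words.
seqs = ["ACGTAC", "TTACGT", "GACGTT"]
table = build_lookup_table(seqs, k=3)
shared = words_shared_by_all(table, len(seqs))
print(sorted(shared))        # ['ACG', 'CGT']
print(table["ACG"])          # [(0, 0), (1, 2), (2, 1)]
```

Because each word is found by a single table lookup, the work grows roughly in proportion to the total sequence length rather than with the product of the lengths, which is what makes the approach feasible for sequences totaling millions of bases. Adjacent shared words (here ACG and CGT) can then be merged into longer matching segments.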
Exercises: Applications of Multiple Sequence Alignments
1. Use multiple DNA sequence alignments to look for consensus sequences for
prediction of promoter sites and splice junctions.
2. Multiple sequence alignments on
Histones - sequences highly conserved - for unwinding the DNA to conduct activities
Kinases Pleckstrin homology domains - highly conserved in structure but only 1/100 conserved
amino acids.
Chapter References: