Database Searching and Pairwise Alignment Techniques

Module 4
Database Searching and Pairwise Alignment Techniques
AIMS

To explain the principles underlying local and global alignment programs

To explain what substitution matrices are and how they are used

To introduce the commonly used pairwise alignment programs

To explore the significance of alignment results
OBJECTIVES
The student should be able to:

Carry out FastA and Blast searches

Select appropriate substitution matrices

Evaluate the significance of alignment/search results
INTRODUCTION
Regardless of whether you are dealing with a DNA or protein sequence you will commonly want to
compare the sequence you are analysing to DNA or protein sequences held in a database. Comparing
a search sequence with the sequences held in a database involves pairwise comparison of the search
sequence with each of the database sequences in turn. It is the methodology of such pairwise
comparisons that is the subject matter of this module.
Similarity versus homology
One very important consideration to deal with before looking at the procedures which can be
employed to compare two DNA or protein sequences is to define two terms that are subject to much
misuse. The terms in question are ‘homology’ and ‘similarity’. If two genes have a common ancestor
they are said to be homologous or to exhibit homology. The two genes either had a common ancestor or
they didn’t, therefore terms employing percent homology are meaningless. However, when you
compare two related sequences there are likely to be many similarities between the two sequences and
these can be quantified to give a percent similarity. It is from the degree of similarity between two
sequences that homology is inferred.
Information theory examines the properties of messages. If we consider a protein to be a message then
we can calculate its information content in terms of bits. The calculation gives a value of about 4.19
bits per residue. Given an average protein size of 150 residues this corresponds to an information
content of about 630 bits. From this we can work out that the probability that two random sequences would
specify the same message is 2^-630 or about 10^-190. This implies that convergent evolution giving rise to
two similar sequences would be very rare, and consequently if two sequences exhibit significant
similarity it is most likely to have arisen because the two sequences descend from a common ancestor
and are therefore homologous.
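The arithmetic above can be checked with a few lines of Python. This is only a sketch: the figures of 4.19 bits per residue and 150 residues are taken from the text, and the variable names are our own. The unrounded product is 628.5 bits, which the text rounds to 630.

```python
import math

BITS_PER_RESIDUE = 4.19   # value quoted in the text
PROTEIN_LENGTH = 150      # average protein size quoted in the text

# Information content of an average protein, in bits
total_bits = BITS_PER_RESIDUE * PROTEIN_LENGTH

# log10 of 2**(-total_bits): the chance that a random sequence
# specifies exactly the same message
log10_probability = -total_bits * math.log10(2)

# Upper limit for comparison: 20 amino acids used with equal frequency
max_bits_per_residue = math.log2(20)

print(f"information content: {total_bits:.1f} bits")
print(f"log10 of the chance of an identical random message: {log10_probability:.1f}")
print(f"uniform 20-letter alphabet: {max_bits_per_residue:.2f} bits per residue")
```

The value of 4.19 is a little lower than the figure for a uniform 20-letter alphabet because amino acids are not used with equal frequency.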
The basic concepts
The basic concepts associated with the pairwise alignment of DNA and protein sequences can be
approached using a linguistic metaphor. The English alphabet contains 26 letters, that of DNA 4, and
that of protein 20. Sometimes additional characters may be added to these basic alphabets, particularly
where there is some degree of ambiguity over a position, e.g. N is often used for an unknown base and
X for an unknown amino acid residue (see Module 1). The way in which to align two identical sequences
of characters is obvious.
We can measure similarity or dissimilarity between pairs of sequences to give scores.
There are several ways in which sequence (dis)similarity is measured:
The Hamming Distance measures the number of different characters there are between two
sequences, such as with the following two sequences:
AGATCTAG TCGA
AGGCATCATGCAGT
which differ in 10 places, so their Hamming distance is 10.
The proportional or p-distance. This is the Hamming distance divided by the total sequence
length, so ranges from 0 to 1. In the above example the p-distance is 10/14.
The log-odds ratio. This is a measure of how unlikely it is that two sequences should be so
similar by chance. It is based on the observed frequencies of each of the characters (bases or amino
acids) in the sequences, and the probability of observing each homologous pair in the two
sequences. It is a score measuring similarity (the higher, the more similar), and is calculated by
adding up the scores taken from pre-calculated matrices (see PAM and BLOSUM matrices below) for all
the aligned pairs of characters.
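The first two measures are simple enough to sketch in Python. The function names and the two short test sequences below are our own illustrative choices, not part of the module; both sequences must already be the same length.

```python
def hamming_distance(a: str, b: str) -> int:
    """Count the positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return sum(1 for x, y in zip(a, b) if x != y)


def p_distance(a: str, b: str) -> float:
    """Hamming distance divided by the sequence length (ranges from 0 to 1)."""
    if not a:
        raise ValueError("sequences must not be empty")
    return hamming_distance(a, b) / len(a)


# Illustrative sequences (not taken from the text)
seq1 = "GATTACA"
seq2 = "GACTATA"
print(hamming_distance(seq1, seq2), p_distance(seq1, seq2))
```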
Gaps
Obviously genes can suffer insertions and deletions of one or more bases and the corresponding proteins
will contain insertions and deletions of amino acid residues. In order to compensate for such events it
is necessary to introduce gaps into an alignment, but not so many that the alignment becomes
unreliable. The most common method involves giving a penalty score (d) for opening a gap and
another penalty score (x) for extending the length of the gap. We make the gap-extension cost (x) less
than the gap-open cost (d) so that we don’t get too many gaps inserted, when fewer would do. This
method of assigning gap costs is called affine. Insertions and deletions are lumped together in the term
indels. Unfortunately, values for these penalty scores cannot be arrived at in the systematic way that
substitution matrices are constructed (see below), and the values used are arrived at empirically.
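A minimal sketch of an affine gap cost follows. It assumes one common convention, in which the first gap position costs d and each further position costs x; programs differ on this detail, and the default values of 10 and 0.5 are illustrative assumptions only.

```python
def affine_gap_penalty(length: int, gap_open: float = 10.0, gap_extend: float = 0.5) -> float:
    """Penalty for a single gap spanning `length` positions.

    Assumed convention: the first position costs gap_open (d) and every
    additional position costs gap_extend (x).
    """
    if length < 0:
        raise ValueError("gap length cannot be negative")
    if length == 0:
        return 0.0
    return gap_open + (length - 1) * gap_extend


# One gap of three positions is far cheaper than three separate single gaps
one_long_gap = affine_gap_penalty(3)
three_short_gaps = 3 * affine_gap_penalty(1)
print(one_long_gap, three_short_gaps)
```

Because x is smaller than d, an alignment with one long gap is penalised less than one with the same number of gap positions scattered as several short gaps.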
Local versus Global Pairwise Alignments
It will frequently be the case that the two sequences to be compared will not be homologous over their
entire length. There are several possible reasons for this:
1. We may have two sequences which have a gene in common but which we know have been
subjected to extensive recombination in the other regions, so we could not guarantee that they are
going to be similar throughout their length;
2. The process of evolution may lead to the formation of proteins that can be described as multimodular.
Within a single polypeptide there may be different functional and structural modules or domains.
Consequently when comparing two proteins there may be significant similarity in only one
small region of the two proteins (see Module 5).
This is reflected in the fact that programs designed to produce pairwise alignments of protein or DNA
sequences are designed to produce either global or local pairwise alignments. A global approach will
attempt to align two sequences along their entire length, whereas a local alignment will look for local
regions of similarity or subsequences.
The Needleman and Wunsch algorithm was devised for computing the global alignment of two
sequences whereas the Smith-Waterman algorithm finds the best local alignments, and both provide
the basis of several database searching programs. Both methods are dynamic programming algorithms
which operate by solving smaller, but similar sub-problems. If you would like to know more about
these algorithms try http://www.maths.tcd.ie/~lily/pres2/sld002.htm. ALION is a Web site where you
can use either method to compare two protein sequences using a variety of substitution matrices (see
below).
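To show how the two dynamic programming methods differ, here is a simplified sketch that returns only the best score, not the alignment itself. It assumes a simple match/mismatch score and a fixed cost per gap position instead of a substitution matrix and affine gaps; the function names and default scores are our own.

```python
def _pair_score(a: str, b: str, match: int, mismatch: int) -> int:
    return match if a == b else mismatch


def needleman_wunsch_score(s: str, t: str, match: int = 1, mismatch: int = -1, gap: int = -2) -> int:
    """Best global alignment score: both sequences are aligned end to end."""
    prev = [j * gap for j in range(len(t) + 1)]  # aligning an empty prefix costs gaps
    for i in range(1, len(s) + 1):
        curr = [i * gap]
        for j in range(1, len(t) + 1):
            curr.append(max(
                prev[j - 1] + _pair_score(s[i - 1], t[j - 1], match, mismatch),  # align the pair
                prev[j] + gap,       # gap in t
                curr[j - 1] + gap,   # gap in s
            ))
        prev = curr
    return prev[-1]


def smith_waterman_score(s: str, t: str, match: int = 1, mismatch: int = -1, gap: int = -2) -> int:
    """Best local alignment score: scores never drop below zero."""
    best = 0
    prev = [0] * (len(t) + 1)
    for i in range(1, len(s) + 1):
        curr = [0]
        for j in range(1, len(t) + 1):
            cell = max(
                0,  # start a fresh local alignment here
                prev[j - 1] + _pair_score(s[i - 1], t[j - 1], match, mismatch),
                prev[j] + gap,
                curr[j - 1] + gap,
            )
            curr.append(cell)
            best = max(best, cell)
        prev = curr
    return best


print(needleman_wunsch_score("ACGT", "AGT"))
print(smith_waterman_score("TTTGATTACATTT", "CCCGATTACACCC"))
```

The only differences are the first row and column (gap costs versus zeros), the extra zero in the maximum, and where the final score is read from (the last cell versus the best cell anywhere).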
The simplest method of sequence comparison is known as a dotplot. Essentially we have a grid (or
two dimensional array) with one sequence along the X-axis and the other along the Y-axis. Each
residue in turn in one sequence is compared with every residue in the other sequence and a dot is
put in the grid wherever the two residues are the same.
[Figure: dotplot of two short letter sequences; the axes read THERATSATONTHECAT and THECATSATONTHEMAT.]
Within the dotplot identical words, or subsequences, are defined by a diagonal line of dots (in red)
across the plot. If the two sequences were identical, the line along the main diagonal would be
unbroken. Identical, or similar, subsequences that are offset between the two sequences are also
detected, as lines parallel to the main diagonal.
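A text-only dotplot takes only a few lines of Python. This is a sketch with function names of our own choosing, and the two phrases used are purely illustrative; an asterisk stands in for each dot.

```python
def dotplot(x_seq: str, y_seq: str) -> list:
    """Return one text row per character of y_seq; '*' marks identical characters."""
    rows = []
    for y in y_seq:
        rows.append("".join("*" if x == y else "." for x in x_seq))
    return rows


def print_dotplot(x_seq: str, y_seq: str) -> None:
    """Print the grid with x_seq along the top and y_seq down the side."""
    print("  " + x_seq)
    for y, row in zip(y_seq, dotplot(x_seq, y_seq)):
        print(y + " " + row)


# Illustrative phrases: shared words show up as diagonal runs of '*'
print_dotplot("THECATSATONTHEMAT", "THERATSATONTHECAT")
```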
The dot plot for haemoglobin alpha chains from the Emperor penguin (along the top) and the rabbit
(down the side) shows very high similarity.
DOTTUP is a Web site where such dotplots can be produced.
PAM and BLOSUM matrices
It is well known that certain groups of amino acids have similar physico-chemical properties and
consequently the substitution of one amino acid by another from the same group is likely to have a less
deleterious effect on the protein than substitution by an amino acid from another group. These
substitutions are termed conservative and non-conservative respectively. In addition, a single base
change in a nucleotide sequence may not necessarily cause a change in the amino acid sequence
because of redundancy in the genetic code (a silent mutation). It would seem logical that a scoring
system that measured the similarity of two sequences i.e. the log-odds ratio would take account of
conservative and non-conservative substitutions and silent mutations. The first commonly accepted
approach to developing a scoring system which took account of observed patterns of substitution was
that of Dayhoff and his co-workers with the Point Accepted Mutation (PAM) model of evolution. 1
PAM unit is the extent of evolutionary divergence in which 1% of amino acid residues are altered.
They took an alignment of 15 very closely related proteins and then calculated a matrix that
represented the probability of a mutation altering one amino acid residue to any other amino acid on
the basis of 1 PAM. Obviously when comparing more distantly related protein PAM1 would not be
applicable and they extrapolated the PAM1 values to PAM250.
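The extrapolation from 1 PAM to 250 PAM, and the conversion of probabilities into log-odds scores, can be sketched on a toy example. Everything below is hypothetical: a four-letter alphabet in which A/B and C/D are "similar" pairs, an invented 1-PAM matrix in which 1% of residues change, and equal background frequencies. The scaling of 10 x log10 is the one usually quoted for PAM matrices.

```python
import math

ALPHABET = "ABCD"  # hypothetical residues: A/B are similar, C/D are similar

# Hypothetical 1-PAM matrix: TOY_PAM1[i][j] is the chance that residue i is
# replaced by residue j over 1 PAM. Each diagonal is 0.99, so 1% of residues change.
TOY_PAM1 = [
    [0.990, 0.008, 0.001, 0.001],
    [0.008, 0.990, 0.001, 0.001],
    [0.001, 0.001, 0.990, 0.008],
    [0.001, 0.001, 0.008, 0.990],
]
FREQ = [0.25, 0.25, 0.25, 0.25]  # assumed background frequencies


def mat_mul(x, y):
    """Multiply two square matrices held as lists of rows."""
    n = len(x)
    return [[sum(x[i][k] * y[k][j] for k in range(n)) for j in range(n)] for i in range(n)]


def extrapolate(pam1, units):
    """Raise the 1-PAM matrix to the given power by repeated multiplication."""
    n = len(pam1)
    result = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for _ in range(units):
        result = mat_mul(result, pam1)
    return result


def log_odds(matrix, freq, a, b):
    """10 * log10 of (chance of a becoming b) / (chance of meeting b at random)."""
    i, j = ALPHABET.index(a), ALPHABET.index(b)
    return 10 * math.log10(matrix[i][j] / freq[j])


TOY_PAM250 = extrapolate(TOY_PAM1, 250)
for a in ALPHABET:
    print(a, " ".join(f"{log_odds(TOY_PAM250, FREQ, a, b):6.2f}" for b in ALPHABET))
```

In this toy model identities score highest, substitutions within a similar pair score a little lower, and substitutions between the two groups score negatively, which is the pattern a real matrix shows for conservative and non-conservative changes.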
THE PAM250 MATRIX
Scores are shown for the upper triangle only; the matrix is symmetric.
    A   R   N   D   C   Q   E   G   H   I   L   K   M   F   P   S   T   W   Y   V
A   2  -2   0   0  -2   0   0   1  -1  -1  -2  -1  -1  -3   1   1   1  -6  -3   0
R       6   0  -1  -4   1  -1  -3   2  -2  -3   3   0  -4   0   0  -1  -2  -4  -2
N           2   2  -4   1   1   0   2  -2  -3   1  -2  -3   0   1   0  -4  -2  -2
D               4  -5   2   3   1   1  -2  -4   0  -3  -6  -1   0   0  -7  -4  -2
C                  12  -5  -5  -3  -3  -2  -6  -5  -5  -4  -3   0  -2  -8   0  -2
Q                       4   2  -1   3  -2  -2   1  -1  -5   0  -1  -1  -5  -4  -2
E                           4   0   1  -2  -3   0  -2  -5  -1   0   0  -7  -4  -2
G                               5  -2  -3  -4  -2  -3  -5   0   1   0  -7  -5  -1
H                                   6  -2  -2   0  -2  -2   0  -1  -1  -3   0  -2
I                                       5   2  -2   2   1  -2  -1   0  -5  -1  -4
L                                           6  -3   4   2  -3  -3  -2  -2  -1   2
K                                               5   0  -5  -1   0   0  -3  -4  -2
M                                                   6   0  -2  -2  -1  -4  -2   2
F                                                       9  -5  -3  -3   0   7  -1
P                                                           6   1   0  -6  -5   1
S                                                               2   1  -2  -3  -1
T                                                                   3  -5  -3   0
W                                                                      17   0  -6
Y                                                                          10  -2
V                                                                               4
The PAM model of protein sequence evolution can be criticized in a number of ways. Perhaps the
most important criticism is that the model assumes that all positions within a protein molecule are
equally changeable by mutation. In fact it is common to find that some residues, or indeed groups of
residues, are absolutely unchanged in a group of related proteins, whereas others vary. An example
would be critical active site residues in an enzyme.
Because the PAM matrices were derived from proteins which exhibited only slight (~15%)
evolutionary divergence, Henikoff and Henikoff (1992) derived a set of substitution matrices which
were based on sequences that were much more divergent than those used for the PAM matrices. The
BLOSUM (BLOcks SUbstitution Matrix) matrices are numbered by the percentage identity at which the
sequences used to build them were clustered: 80% for BLOSUM 80, 62% for BLOSUM 62, etc. The lower
the number, the more divergent the sequences for which the matrix is suited.
Pairwise alignments done with PAM or BLOSUM matrices look very similar, but differ in some of the
detail (see Exercises).
Local alignment
Suppose we have two sequences that we want to compare for local, as opposed to global, alignment. A
generic local alignment procedure might work like this:
1. choose one sequence to be searched against the other – here we shall call sequence q the query
sequence and sequence t the target sequence
2. divide the query sequence q into small subsequences, called words
3. for each word w of q, look along t to find whether there are any other words in t which are very
similar to w
4. use these matching words as “anchors” from which to build up a better alignment between q and t
5. assess how good this alignment is.
Both the methods below, FASTA and BLAST, use this general approach, but they differ in the ways in
which they assess similarity between words, and in the way in which they go on to build up the
alignments.
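The generic procedure can be sketched as follows. This is a deliberately simplified illustration and not how FASTA or BLAST are actually implemented: words must match exactly, anchors are extended only while characters stay identical, and the "assessment" is just the length of the longest match. All function names are our own.

```python
from collections import defaultdict


def word_index(seq: str, k: int) -> dict:
    """Map every word of length k in seq to the positions where it starts."""
    index = defaultdict(list)
    for i in range(len(seq) - k + 1):
        index[seq[i:i + k]].append(i)
    return index


def find_seeds(query: str, target: str, k: int = 3) -> list:
    """Steps 2 and 3: (query_pos, target_pos) pairs where identical words occur."""
    index = word_index(target, k)
    seeds = []
    for i in range(len(query) - k + 1):
        for j in index.get(query[i:i + k], []):
            seeds.append((i, j))
    return seeds


def extend_seed(query: str, target: str, i: int, j: int, k: int) -> tuple:
    """Step 4: grow an anchor in both directions while the characters are identical."""
    start_q, start_t = i, j
    while start_q > 0 and start_t > 0 and query[start_q - 1] == target[start_t - 1]:
        start_q -= 1
        start_t -= 1
    end_q, end_t = i + k, j + k
    while end_q < len(query) and end_t < len(target) and query[end_q] == target[end_t]:
        end_q += 1
        end_t += 1
    return start_q, start_t, end_q - start_q


def best_exact_match(query: str, target: str, k: int = 3):
    """Step 5: report the longest extended anchor as (query_pos, target_pos, length)."""
    extended = [extend_seed(query, target, i, j, k) for i, j in find_seeds(query, target, k)]
    return max(extended, key=lambda hit: hit[2]) if extended else None


print(best_exact_match("GGGATTACAGG", "TTATTACATT"))
```

Indexing the words of one sequence first means each word of the other sequence can be looked up directly instead of being compared against every position.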
FASTA AND BLAST
FASTA is a good tool for scanning databases to find sequences which are similar to your query
sequence. It uses a “Pearson and Lipman search” (Pearson & Lipman, 1988) to locate identical words
(k-tuples) in the sequences being compared, as detailed below, via a generalization of the dot plot
approach. The sequence of events in a FASTA search is as follows:
(i) the words in the query sequence are compared with each of the sequences in the database
to get matching words (up to 6 nucleotides, or two amino acids in a row which match)
(ii) the regions in which a good match has been made are rescored to accommodate
ambiguities in the sequences, conservative changes (those which replace an amino acid with a
similar one), and matches of shorter words
(iii) the algorithm checks to see whether some of the matching words can be concatenated
(joined up) while retaining the good match score
(iv) the best sequences found so far are aligned completely with the query sequence for display
to the user.
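The link between step (i) and the dot plot can be illustrated with a short sketch. Matching k-tuples that lie on the same diagonal of a dot plot share the same offset (target position minus query position), so counting matches per offset picks out the most promising diagonal. This covers only the first stage and is not the FASTA program itself; the function names and test sequences are our own.

```python
from collections import Counter


def ktuple_diagonal_counts(query: str, target: str, ktup: int = 2) -> Counter:
    """Count identical k-tuples on each diagonal (offset = target_pos - query_pos)."""
    positions = {}
    for j in range(len(target) - ktup + 1):
        positions.setdefault(target[j:j + ktup], []).append(j)
    diagonals = Counter()
    for i in range(len(query) - ktup + 1):
        for j in positions.get(query[i:i + ktup], ()):
            diagonals[j - i] += 1
    return diagonals


def best_diagonal(query: str, target: str, ktup: int = 2):
    """Return (offset, number of matching k-tuples) for the richest diagonal."""
    counts = ktuple_diagonal_counts(query, target, ktup)
    return counts.most_common(1)[0] if counts else None


# Illustrative protein fragments: the query sits two residues into the target
print(best_diagonal("MKTAYIAK", "GGMKTAYIAKGG"))
```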
BLAST stands for “Basic Local Alignment Search Tool”, and was developed by Altschul and colleagues
(Altschul et al., 1990). It works by comparing the query sequence against all the sequences in the
database to find the maximal segment pair, or MSP. Gaps are NOT permitted. The database segment
and the query segment will both be the same length, though they need not be the length of the query
sequence. BLAST is slightly less accurate than FASTA, but is faster.
BLAST searches through all the sequences in the database being used, and for each pair of
sequences finds this maximal segment pair. A segment pair is a matching of one subsequence (a
segment) in one sequence to a subsequence (segment) in the other. Since BLAST doesn't allow gaps,
these are going to be the same length, and for that same reason we expect the MSPs to be much
shorter than the query and target sequences. BLAST uses a mathematical formula to calculate the
probability that a segment pair with a given score could arise by chance in the two sequences – if the
chance is very low (as with high scores) then we would attach statistical significance to getting such a
high score. The algorithm returns all the segment pairs which had significantly high scores, ranked in
decreasing order of that significance.
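For a single pair of sequences the maximal segment pair can be found exhaustively by scanning every diagonal and keeping the best-scoring run. The sketch below does this with a simple match/mismatch score; the values +5 and -4 and the function name are illustrative assumptions, and real BLAST uses word seeding and substitution matrices so that it does not have to scan every diagonal.

```python
def maximal_segment_pair(query: str, target: str, match: int = 5, mismatch: int = -4) -> tuple:
    """Return (score, query_start, target_start, length) of the best ungapped segment pair."""
    best = (0, 0, 0, 0)
    # Each diagonal is identified by offset = target position - query position
    for offset in range(-(len(query) - 1), len(target)):
        i = max(0, -offset)
        j = i + offset
        running = 0
        start_i = i
        while i < len(query) and j < len(target):
            running += match if query[i] == target[j] else mismatch
            if running <= 0:
                # A segment that has dropped to zero is never worth keeping
                running = 0
                start_i = i + 1
            elif running > best[0]:
                best = (running, start_i, start_i + offset, i - start_i + 1)
            i += 1
            j += 1
    return best


print(maximal_segment_pair("CCGATTACACC", "TTGATTACATT"))
```

Both segments of the pair have the same length because no gaps are allowed along a diagonal.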
The significance of alignments
The most important question to be asked when two sequences have been aligned is whether the
alignment is significant, i.e. can it be taken as evidence that the two sequences are indeed
evolutionarily related. There is no reliable mechanism of doing this for global alignments, but a good
method exists for local alignments without gaps, so-called High-Scoring Segment Pairs (HSPs). The
probability of such an alignment occurring by chance (the p value) is derived by comparing the
observed score (S) with the expected distribution of scores. The size of the database being searched
will affect this probability since the larger the database, the larger the probability of a sequence
match by chance. The closer the p-value is to zero, the more significance can be
attached to the alignment.
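The module does not give the formula, but the statistics usually quoted for ungapped local alignments (Karlin-Altschul statistics) state that the expected number of chance hits with score at least S is E = K m n exp(-lambda S), where m and n are the query and database lengths, and that the p value is 1 - exp(-E). The sketch below assumes that formula; the values of lambda and K depend on the scoring system, and those used here are illustrative assumptions only.

```python
import math


def expected_hits(score: float, query_len: float, db_len: float,
                  lam: float = 0.318, k: float = 0.134) -> float:
    """E = K * m * n * exp(-lambda * S); lam and k are assumed, illustrative values."""
    return k * query_len * db_len * math.exp(-lam * score)


def p_value(e_value: float) -> float:
    """Chance of at least one hit this good arising at random: 1 - exp(-E)."""
    return -math.expm1(-e_value)


# The same score becomes less significant as the database grows
small_db = expected_hits(score=100, query_len=250, db_len=1e8)
large_db = expected_hits(score=100, query_len=250, db_len=2e8)
print(small_db, large_db, p_value(small_db))
```

When E is much smaller than 1 the p value and E are almost equal, which is why search programs often report E alone.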
Comparing FASTA with BLAST
There are a few obvious differences between FASTA and BLAST. On the one hand, FASTA permits
gaps, even though they're penalised, whereas BLAST, in its original form, does not (see below). It
follows that BLAST must find shorter matching subsequences,
because poor matches which could be accommodated with an insertion or deletion cannot fit into the
same segment pair (no gaps) without a significant loss of the score for the whole segment.
On the other hand, BLAST will return a great many potential matches, simply because it does not
construct the complete alignment from a given MSP. This means that there may well be a lot of extra
matches which are biologically unimportant and have just arisen by chance. For instance if you did a
BLAST search and it returned two segments in the same sequence that matched up well with your
query sequence, but which were separated on it by a sizeable gap, then you would have to decide for
yourself whether you considered the sequence to be a real hit, or whether it was unimportant.
Flavours of BLAST
BLAST comes in a variety of flavours depending on the type of search to be done:
Program    Database searched            Query sequence
blastp     protein                      protein
blastn     nucleotide                   nucleotide
blastx     protein                      translation of nucleotide
tblastn    translation of nucleotide    protein
tblastx    translation of nucleotide    translation of nucleotide
Recent developments of BLAST
The original BLAST program is restricted by its inability to introduce gaps into alignments. However, a
modified version, Gapped BLAST (Altschul et al., 1997), has been developed which is far more sensitive,
and this is now the standard version of BLAST available at sites such as the NCBI. Another
development of BLAST is PSI-BLAST (Position-Specific Iterated BLAST), which is able to detect
very remote homologues by taking the results of one search, constructing a profile, and then using this
profile to search the database again to find other homologues (the process can be repeated until no new
sequences are found).
Exercises
ALION is a program that carries out pairwise alignments
1. Use this site to compare two proteins of your choice.
(a) What difference do you observe when using the Smith-Waterman and Needleman-Wunsch
algorithms?
(b) What effect does using a different substitution matrix have?
(c) What effects do altering the gap-opening and gap-extension penalties have?
2. Use BLAST at the NCBI to do a search for homologues of human calnexin (accession number
AAB29309) (save the results for use in a later module).
References and useful links
Altschul, S.F., Madden, T.L., Schäffer, A.A., Zhang, J., Zhang, Z., Miller, W. & Lipman, D.J. (1997)
Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic
Acids Res. 25:3389-3402.
http://www.maths.tcd.ie/~lily/pres2/sld002.htm - A good Powerpoint presentation on the Needleman
& Wunsch and Smith-Waterman algorithms.
There are many places you can perform your own FastA search. Here are a few which are accessible
on the web:
• http://www2.ebi.ac.uk/fasta3/?request;
• http://www.arabidopsis.org/cgi-bin/fasta/TAIRfasta.pl;
• http://www.bio.cam.ac.uk/cgi-bin/fasta3/fasta3.pl – Version 3.2;
• http://genome-www2.stanford.edu/cgi-bin/SGD/nph-fastasgd for comparison with S. cerevisiae
sequences;
• http://fasta.genome.ad.jp/ in Japan.