Module 4 Database Searching and Pairwise Alignment Techniques

AIMS
To explain the principles underlying local and global alignment programs
To explain what substitution matrices are and how they are used
To introduce the commonly used pairwise alignment programs
To explore the significance of alignment results

OBJECTIVES
The student should be able to:
Carry out FastA and Blast searches
Select appropriate substitution matrices
Evaluate the significance of alignment/search results

INTRODUCTION
Regardless of whether you are dealing with a DNA or protein sequence, you will commonly want to compare the sequence you are analysing with DNA or protein sequences held in a database. Searching a database in this way involves pairwise comparison of the search sequence with each of the database sequences. It is the methodology of such pairwise comparisons that is the subject matter of this module.

Similarity versus homology
Before looking at the procedures which can be employed to compare two DNA or protein sequences, it is important to define two terms that are subject to much misuse: ‘homology’ and ‘similarity’. If two genes have a common ancestor they are said to be homologous, or to exhibit homology. Two genes either had a common ancestor or they did not, so terms such as ‘percent homology’ are meaningless. However, when you compare two related sequences there are likely to be many similarities between them, and these can be quantified to give a percent similarity. It is from the degree of similarity between two sequences that homology is inferred.

Information theory examines the properties of messages. If we consider a protein to be a message then we can calculate its information content in terms of bits. The calculation gives a value of about 4.19 bits per residue.
Given an average protein size of 150 residues this corresponds to an information content of about 630 bits. From this we can work out that the probability that two random sequences would specify the same message is 2^-630, or about 10^-190. This implies that convergent evolution giving rise to two similar sequences would be very rare; consequently, if two sequences exhibit significant similarity, it is taken to have arisen because the two sequences share a common ancestor and are therefore homologous.

The basic concepts
The basic concepts associated with the pairwise alignment of DNA and protein sequences can be approached using a linguistic metaphor. The English alphabet contains 26 letters, that of DNA 4, and that of protein 20. Sometimes additional characters may be added to these basic alphabets, particularly where there is some degree of ambiguity over a position, e.g. N is commonly used for an unknown base and X for an unknown amino acid residue (see Module 1). The way in which to align two identical sequences of characters is obvious. We can measure similarity or dissimilarity between pairs of sequences to give scores. There are several ways in which sequence (dis)similarity is measured:

The Hamming distance. This is the number of positions at which two sequences of equal length differ. For example, the following two sequences

AGATCTAG TCGA
AGGCATCATGCAGT

differ in 10 places, so their Hamming distance is 10.

The proportional or p-distance. This is the Hamming distance divided by the total sequence length, so it ranges from 0 to 1. In the above example the p-distance is 10/14 (about 0.71).

The log-odds ratio. This is a measure of how unlikely it is that two sequences should be so similar by chance. It is based on the observed frequencies of each of the characters (bases or amino acids) in the sequences, and the probability of observing each homologous pair in the two sequences.
It is a similarity score (the higher the score, the more similar the sequences), and is calculated by adding up the scores for each aligned pair of characters, taken from pre-calculated matrices that hold a score for every possible pair of characters (see PAM and BLOSUM matrices below).

Gaps
Genes can suffer insertions and deletions of one or more bases, and the corresponding proteins will contain insertions and deletions of amino acid residues. Insertions and deletions are lumped together in the term indels. In order to compensate for such events it is necessary to introduce gaps into an alignment, but not so many that the alignment becomes unreliable. The most common method involves giving a penalty score (d) for opening a gap and another penalty score (x) for each position by which the gap is extended. We make the gap-extension cost (x) less than the gap-open cost (d) so that we don't get many short gaps inserted when fewer, longer gaps would do. This method of assigning gap costs is called affine. Unfortunately, values for these penalty scores cannot be arrived at in the systematic way in which substitution matrices are constructed, and the values used are arrived at empirically.

Local versus Global Pairwise Alignments
It will frequently be the case that the two sequences to be compared will not be homologous over their entire length. There are several possible reasons for this:
1. We may have two sequences which have a gene in common but which we know have been subjected to extensive recombination in the other regions, so we cannot guarantee that they are going to be similar throughout their length.
2. The process of evolution may lead to the formation of proteins that can be described as multimodular. Within a single polypeptide there may be different functional and structural modules or domains. Consequently, when comparing two proteins there may only be significant similarity between one small region of the two proteins (see Module 5).
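Returning briefly to the affine gap costs described under Gaps above, the scheme can be sketched in a few lines of Python. The function name and the default penalties (d = 10, x = 1) are assumptions chosen only for illustration; real values are set empirically, and some programs charge the extension cost for the first gap position as well.

```python
def affine_gap_penalty(length, gap_open=10, gap_extend=1):
    """Penalty for one gap of `length` positions under an affine scheme.

    The first gapped position costs `gap_open` (d); every further position
    costs `gap_extend` (x). Both default values are illustrative assumptions.
    """
    if length < 0:
        raise ValueError("gap length cannot be negative")
    if length == 0:
        return 0
    return gap_open + (length - 1) * gap_extend


# One gap of five positions versus the same five positions split into two gaps.
one_long_gap = affine_gap_penalty(5)
two_short_gaps = affine_gap_penalty(2) + affine_gap_penalty(3)
print(one_long_gap, two_short_gaps)
```

Because x is smaller than d, the single long gap is penalised less than the two short gaps covering the same number of positions, which is the behaviour the affine scheme is designed to give.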
Because two sequences may be similar over only part of their length, programs that produce pairwise alignments of protein or DNA sequences are designed to produce either global or local alignments. A global approach will attempt to align two sequences along their entire length, whereas a local approach will look for local regions of similarity, or subsequences. The Needleman and Wunsch algorithm was devised for computing the global alignment of two sequences, whereas the Smith-Waterman algorithm finds the best local alignments; both provide the basis of several database searching programs. Both methods are dynamic programming algorithms, which operate by solving smaller, but similar, sub-problems. If you would like to know more about these algorithms try http://www.maths.tcd.ie/~lily/pres2/sld002.htm. ALION is a Web site where you can use either method to compare two protein sequences using a variety of substitution matrices (see below).

The simplest method of sequence comparison is known as a dotplot. Essentially we have a grid (or two-dimensional array) with one sequence along the X-axis and the other along the Y-axis. Each residue in one sequence is compared in turn with every residue in the other sequence, and a dot is put in the grid wherever the two residues are the same.

[Figure: dotplot of the sequence THERATSATONTHECAT against the sequence THECATSATONTHEMAT]

Within the dotplot, identical words, or subsequences, are defined by diagonal lines of dots (in red) across the plot. If the two sequences were identical, the line along the main diagonal would be unbroken. Identical, or similar, subsequences that lie at different positions in the two sequences are detected as lines parallel to the main diagonal. The dot plot for haemoglobin alpha chains from the Emperor penguin (along the top) and the rabbit (down the side) shows very high similarity. DOTTUP is a Web site where such dotplots can be produced.
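The dotplot procedure described above is simple enough to sketch directly. The following Python is a minimal illustration, not the method used by DOTTUP; the function names and the two short example sequences are made up for the purpose.

```python
def dotplot(x_seq, y_seq):
    """Return one text row per residue of y_seq.

    A '*' marks a grid cell where the residue on the X-axis equals the
    residue on the Y-axis; '.' marks every other cell.
    """
    return ["".join("*" if x == y else "." for x in x_seq) for y in y_seq]


def print_dotplot(x_seq, y_seq):
    """Print the grid with x_seq along the top and y_seq down the side."""
    print("  " + x_seq)
    for y, row in zip(y_seq, dotplot(x_seq, y_seq)):
        print(y + " " + row)


# Two short, made-up sequences that differ at a single position.
print_dotplot("THECATSAT", "THERATSAT")
```

Runs of `*` along a diagonal correspond to the shared words discussed in the text. Real dotplot programs usually compare short windows rather than single residues, to reduce the background of chance matches.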
PAM and BLOSUM matrices
It is well known that certain groups of amino acids have similar physico-chemical properties, and consequently the substitution of one amino acid by another from the same group is likely to have a less deleterious effect on the protein than substitution by an amino acid from another group. These substitutions are termed conservative and non-conservative respectively. In addition, a single base change in a nucleotide sequence may not necessarily cause a change in the amino acid sequence because of redundancy in the genetic code (a silent mutation). It would seem logical that a scoring system that measures the similarity of two sequences, i.e. the log-odds ratio, should take account of conservative and non-conservative substitutions and silent mutations.

The first commonly accepted approach to developing a scoring system which took account of observed patterns of substitution was that of Dayhoff and her co-workers, with the Point Accepted Mutation (PAM) model of evolution. 1 PAM unit is the extent of evolutionary divergence in which 1% of amino acid residues are altered. They took alignments of very closely related proteins (differing at less than about 15% of positions) and then calculated a matrix that represented the probability of a mutation altering one amino acid residue to any other amino acid over a distance of 1 PAM. Obviously, when comparing more distantly related proteins PAM1 would not be applicable, and so they extrapolated the PAM1 values to greater distances, such as PAM250.
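The extrapolation from PAM1 to larger distances can be illustrated with a toy example. In the Dayhoff model the PAM1 mutation probability matrix is multiplied by itself n times to give the PAMn probabilities, which are then converted to log-odds scores. The sketch below uses a hypothetical three-letter alphabet with made-up probabilities (not Dayhoff's data) and assumes equal background frequencies; all names are illustrative.

```python
import math

# Hypothetical PAM1-style matrix for a toy alphabet "abc": entry [i][j] is the
# probability that letter i is replaced by letter j over one small step.
# Letters a and b are "similar": they interchange more often than with c.
ALPHABET = "abc"
TOY_PAM1 = [
    [0.990, 0.008, 0.002],
    [0.008, 0.990, 0.002],
    [0.002, 0.002, 0.996],
]


def mat_mul(a, b):
    """Multiply two square matrices held as lists of rows."""
    size = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(size)) for j in range(size)]
            for i in range(size)]


def extrapolate(pam1, n):
    """Raise the one-step matrix to the n-th power (PAM1 -> PAMn)."""
    size = len(pam1)
    result = [[1.0 if i == j else 0.0 for j in range(size)] for i in range(size)]
    for _ in range(n):
        result = mat_mul(result, pam1)
    return result


def log_odds(prob_matrix, background):
    """Score = 10 * log10(observed probability / chance expectation)."""
    size = len(background)
    return [[10 * math.log10(prob_matrix[i][j] / background[j])
             for j in range(size)] for i in range(size)]


toy_pam250 = extrapolate(TOY_PAM1, 250)
scores = log_odds(toy_pam250, [1 / 3] * 3)  # assumed equal frequencies
for letter, row in zip(ALPHABET, scores):
    print(letter, [round(s, 1) for s in row])
```

In this toy example the a/b pair ends up with a higher score than the a/c pair, mirroring the way real substitution matrices reward conservative substitutions and penalise non-conservative ones.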
THE PAM250 MATRIX
Only the upper triangle is shown, since the matrix is symmetric.

    A   R   N   D   C   Q   E   G   H   I   L   K   M   F   P   S   T   W   Y   V
A   2  -2   0   0  -2   0   0   1  -1  -1  -2  -1  -1  -3   1   1   1  -6  -3   0
R       6   0  -1  -4   1  -1  -3   2  -2  -3   3   0  -4   0   0  -1  -2  -4  -2
N           2   2  -4   1   1   0   2  -2  -3   1  -2  -3   0   1   0  -4  -2  -2
D               4  -5   2   3   1   1  -2  -4   0  -3  -6  -1   0   0  -7  -4  -2
C                  12  -5  -5  -3  -3  -2  -6  -5  -5  -4  -3   0  -2  -8   0  -2
Q                       4   2  -1   3  -2  -2   1  -1  -5   0  -1  -1  -5  -4  -2
E                           4   0   1  -2  -3   0  -2  -5  -1   0   0  -7  -4  -2
G                               5  -2  -3  -4  -2  -3  -5   0   1   0  -7  -5  -1
H                                   6  -2  -2   0  -2  -2   0  -1  -1  -3   0  -2
I                                       5   2  -2   2   1  -2  -1   0  -5  -1  -4
L                                           6  -3   4   2  -3  -3  -2  -2  -1   2
K                                               5   0  -5  -1   0   0  -3  -4  -2
M                                                   6   0  -2  -2  -1  -4  -2   2
F                                                       9  -5  -3  -3   0   7  -1
P                                                           6   1   0  -6  -5   1
S                                                               2   1  -2  -3  -1
T                                                                   3  -5  -3   0
W                                                                      17   0  -6
Y                                                                          10  -2
V                                                                               4

The PAM model of protein sequence evolution can be criticized in a number of ways. Perhaps the most important criticism is that the model assumes that all positions within a protein molecule are equally changeable by mutation. In fact it is common to find that some residues, or indeed groups of residues, are absolutely unchanged in a group of related proteins, whereas others vary. An example would be critical active site residues in an enzyme.

Because the PAM matrices were derived from proteins which exhibited only slight (~15%) evolutionary divergence, Henikoff and Henikoff (1992) derived a set of substitution matrices based on sequences that were much more divergent than those used for the PAM matrices. Each BLOSUM (BLOcks SUbstitution Matrix) matrix is named for the similarity threshold used in building it: for BLOSUM 80, sequences sharing 80% or more identity are grouped together and counted as one; for BLOSUM 62 the threshold is 62%, and so on. Lower-numbered matrices are therefore suited to more divergent sequences. Pairwise alignments done with PAM or BLOSUM matrices look very similar, but differ in some of the detail (see Exercises).

Local alignment
Suppose we have two sequences that we want to compare for local, as opposed to global, alignment. A generic local alignment procedure might work like this:
1. choose one sequence to be searched against the other; here we shall call sequence q the query sequence and sequence t the target sequence
2.
divide the query sequence q into small subsequences, called words
3. for each word w of q, look along t to find whether there are any words in t which are very similar to w
4. use these matching words as “anchors” from which to build up a better alignment between q and t
5. assess how good this alignment is.
Both the methods below, FASTA and BLAST, use this general approach, but they differ in the ways in which they assess similarity between words, and in the way in which they go on to build up the alignments.

FASTA AND BLAST
FASTA is a good tool for scanning databases to find sequences which are similar to your query sequence. It uses a “Pearson and Lipman search” (Pearson & Lipman, 1988) to locate identical words (k-tuples) in the sequences being compared, as detailed below, via a generalization of the dotplot approach. The sequence of events in a FASTA search is as follows:
(i) the words in the query sequence are compared with each of the sequences in the database to get matching words (up to 6 nucleotides, or two amino acids in a row which match)
(ii) the regions in which a good match has been made are rescored to accommodate ambiguities in the sequences, conservative changes (e.g., those which don't change the amino acid), and matches of shorter words
(iii) the algorithm checks to see whether some of the matching words can be concatenated (joined up) while retaining the good match score
(iv) the best sequences found so far are aligned completely with the query sequence for display to the user.

BLAST stands for “Basic Local Alignment Search Tool”, and was developed by Altschul and co-workers (Altschul et al., 1990). It works by comparing the query sequence against all the sequences in the database to find the maximal segment pair, or MSP. Gaps are NOT permitted. The database segment and the query segment will both be the same length, though they need not be the length of the query sequence. BLAST is slightly less accurate than FASTA, but is faster.
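To make the idea of a maximal segment pair concrete, the sketch below finds the highest-scoring ungapped segment pair between two sequences by exhaustively scanning every diagonal. This is not how BLAST itself works (BLAST uses word hits as starting points so that it can avoid an exhaustive scan); the function name and the simple +1/-1 scoring are assumptions made for illustration, where a real search would use a substitution matrix.

```python
def maximal_segment_pair(q, t, match=1, mismatch=-1):
    """Exhaustively find the highest-scoring ungapped segment pair.

    Returns (score, q_start, t_start, length) using 0-based positions.
    The +1/-1 scoring is an illustrative assumption.
    """
    best = (0, 0, 0, 0)
    # Each diagonal is identified by the offset between positions in t and q.
    for offset in range(-(len(q) - 1), len(t)):
        i = max(0, -offset)  # first position on this diagonal in q
        j = max(0, offset)   # first position on this diagonal in t
        run_score = 0
        run_start = 0
        k = 0
        while i + k < len(q) and j + k < len(t):
            if run_score <= 0:
                # A non-positive running score cannot help: start a new segment.
                run_score = 0
                run_start = k
            run_score += match if q[i + k] == t[j + k] else mismatch
            if run_score > best[0]:
                best = (run_score, i + run_start, j + run_start,
                        k - run_start + 1)
            k += 1
    return best


print(maximal_segment_pair("CATSAT", "THECATSATON"))
```

Because no gaps are allowed, the two segments always have the same length, and a single mismatch inside an otherwise good segment simply lowers its score rather than being bridged by an insertion or deletion.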
BLAST searches through all the sequences in the database being used, and for each pairing of the query with a database sequence finds the maximal segment pair. A segment pair is a matching of one subsequence (a segment) in one sequence to a subsequence (segment) in the other. Since BLAST doesn't allow gaps, the two segments are going to be the same length, and for that same reason we expect the MSPs to be much shorter than the query and target sequences. BLAST uses a mathematical formula to calculate the probability that a segment pair with a given score could arise by chance in the two sequences; if that probability is very low (as with high scores) then we attach statistical significance to the score. The algorithm returns all the segment pairs which had significantly high scores, ranked in decreasing order of significance.

The significance of alignments
The most important question to be asked when two sequences have been aligned is whether the alignment is significant, i.e. can it be taken as evidence that the two sequences are indeed evolutionarily related. There is no reliable mechanism of doing this for global alignments, but a good method exists for local alignments without gaps, so-called High-Scoring Segment Pairs (HSPs). The probability of such an alignment occurring by chance (the p value) is derived by comparing the observed score (S) with the expected distribution of scores. The size of the database being searched will affect this probability, since the larger the database, the larger the probability of a sequence match by chance. The closer the p value is to zero, the more significance can be attached to the alignment.

Comparing FASTA with BLAST
There are a few obvious differences between FASTA and BLAST. On the one hand, FASTA permits gaps, even though they're penalised, whereas BLAST, in its original form, does not (see below).
That leads to the inevitable conclusion that BLAST must find shorter matching subsequences, because poor matches which could be accommodated with an insertion or deletion cannot fit into the same segment pair (no gaps) without a significant loss of score for the whole segment pair. On the other hand, BLAST will return a great many potential matches, simply because it does not construct the complete alignment from a given MSP. This means that there may well be a lot of extra matches which are biologically unimportant and have just arisen by chance. For instance, if you did a BLAST search and it returned two segments in the same sequence that matched up well with your query sequence, but which were separated on it by a sizeable gap, then you would have to decide for yourself whether you considered the sequence to be a real hit, or whether it was unimportant.

Flavours of BLAST
BLAST comes in a variety of flavours depending on the type of search to be done:

Program    Query sequence               Database searched
blastp     protein                      protein
blastn     nucleotide                   nucleotide
blastx     translation of nucleotide    protein
tblastn    protein                      translation of nucleotide
tblastx    translation of nucleotide    translation of nucleotide

Recent developments of BLAST
The original BLAST program is restricted by its inability to introduce gaps into an alignment. However, a modified version, Gapped BLAST (Altschul et al., 1997), has been developed which is far more sensitive, and this is now the standard version of BLAST available at sites such as the NCBI. Another development of BLAST is PSI-BLAST (Position-Specific Iterated BLAST), which is able to detect very remote homologues by taking the results of one search, constructing a profile, and then using this profile to search the database again to find other homologues (the process can be repeated until no new sequences are found).

Exercises
ALION is a program that carries out pairwise alignments.
1. Use this site to compare two proteins of your choice.
(a) What difference do you observe when using the Smith-Waterman and Needleman-Wunsch algorithms?
(b) What effect does using a different substitution matrix have?
(c) What effects do altering the gap-opening and gap-extension penalties have?
2. Use BLAST at the NCBI to do a search for homologues of human calnexin (accession number AAB29309) (save the results for use in a later module).

References and useful links
Altschul, S.F., Madden, T.L., Schäffer, A.A., Zhang, J., Zhang, Z., Miller, W. & Lipman, D.J. (1997) Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 25:3389-3402.
http://www.maths.tcd.ie/~lily/pres2/sld002.htm - A good PowerPoint presentation on the Needleman & Wunsch and Smith-Waterman algorithms.
There are many places where you can perform your own FastA search. Here are a few which are accessible on the web:
http://www2.ebi.ac.uk/fasta3/?request
http://www.arabidopsis.org/cgi-bin/fasta/TAIRfasta.pl
http://www.bio.cam.ac.uk/cgi-bin/fasta3/fasta3.pl (Version 3.2)
http://genome-www2.stanford.edu/cgi-bin/SGD/nph-fastasgd for comparison with S. cerevisiae sequences
http://fasta.genome.ad.jp/ in Japan