02/13/16 Leung: Page 1 of 16

Chapter 6 Sequence Alignment and Database Search

Many biological problems, such as constructing phylogenetic trees or deducing putative gene functions, can be approached through sequence alignment. An alignment is a display of two or more sequences, written one above another, that shows the similarities among the members of the collection. All sequences in an alignment must be of the same type: all DNA, all RNA, all amino acids, or all in one of the derived alphabets introduced in Chapter 3.

When two sequences in different living organisms show a fundamental functional similarity because they have descended from a common ancestor, they are said to be homologous to each other. Although the rule is not absolute, it is true in many cases that similar nucleotide or protein sequences are homologous. Conversely, homologous genes and proteins often have similar sequences because they descend from a common ancestor. That is why a good sequence alignment can lend insight into phylogeny and into the function of a gene or a protein.

It is important, however, to emphasize that sequence similarity and sequence homology are different concepts and must not be taken to be equivalent. Two sequences are either homologous or non-homologous; we cannot talk about a degree of homology. On the other hand, we can say that one pair of sequences is 95% similar while another pair is only 30% similar. Even though sequence homology and similarity are often observed together, one should never, without careful verification, take it for granted that one implies the other.

Finding a good alignment between even a pair of relatively short sequences (say, of length 50 each) is a formidable task for the human eye.
However, with the computing power now available, good alignments between a pair of sequences can be obtained so quickly that one can align a query sequence against every sequence in a huge database holding millions of sequences within a very reasonable amount of time (say, a few minutes). Such database searches are becoming very popular among geneticists and molecular biologists, because they can derive useful information about a newly sequenced stretch of DNA from similar sequences that have been studied previously.

We devote the first section of this chapter to familiarizing the reader with how sequence similarity is assessed. Section 2 explains the essence of the dynamic programming algorithm used in many popular sequence alignment programs. Section 3 turns to database search programs, which are perhaps the most used application of sequence alignment. Section 4 describes the statistics involved in evaluating the significance of the similarity between sequences described by an alignment. Finally, we discuss some multiple alignment techniques in Section 5.

6.1 Sequence Similarity

Consider the pair of DNA fragments AGTAGTCAAGA and AGAAGCTCAAGA, of length 11 and 12 nucleotide bases respectively. One cannot help noticing that these sequences look alike, and hence one would describe them as "similar" to each other. The similarity between the fragments is much more obvious to our eyes if we display them as follows:

(Alignment A)
A G T A G _ T C A A G A
| | : | |   | | | | | |
A G A A G C T C A A G A

A display of this kind is called an alignment. In an alignment, one sequence is stacked on top of the other. Since the two sequences may have different lengths, gaps are inserted at various places as necessary. Sandwiched between the two sequences is a line of symbols indicating whether the letters at corresponding positions of the two sequences are matches (|) or mismatches (:).
The above display is only one of the numerous possible alignments of the given pair of DNA sequences. For example, these two DNA fragments can also be displayed as

(Alignment B)
A G T A G _ _ _ _ T C A A G A
      | |         | | | | | |
_ _ _ A G A A G C T C A A G A

Indeed, if we allow ourselves to slide the first sequence on top of the second and to introduce gaps at arbitrary places as necessary, we can generate an enormous number of different alignments. However, some alignments reveal the similarity between the pair of sequences better than others. In our example, Alignment A obviously reveals the similarity between the sequence pair better than Alignment B. The goal of sequence alignment is to find the best alignment, the one that reveals the greatest amount of similarity between the two sequences. Sometimes there is more than one such best alignment.

This brings up the question of how to measure the similarity between two sequences when we are given an alignment of them. A simple way to measure the similarity expressed by an alignment is to assign a score to each individual position of the alignment according to whether there is a match, a mismatch, or a gap. The total of the scores at all the individual positions gives an overall score for the entire alignment. If we are interested only in a particular portion of the alignment, we can simply sum the scores in that portion. For example, if we assign a score of 1 to a match, -1 to a mismatch, and -2 to a position with the gap letter in one of the sequences, we get a score of

1+1-1+1+1-2+1+1+1+1+1+1 = 7

for Alignment A and a score of

-2-2-2+1+1-2-2-2-2+1+1+1+1+1+1 = -6

for Alignment B. Clearly Alignment A expresses more similarity of the sequence pair than Alignment B.
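The position-by-position scoring just described can be sketched in a few lines of Python. This is only an illustrative sketch: the function name `score_alignment` and its parameter names are hypothetical, and the gap letter is assumed to be written as `_`, as in the displays above.

```python
def score_alignment(top, bottom, match=1, mismatch=-1, gap=-2, gap_char="_"):
    """Sum the per-position scores of a pairwise alignment.

    `top` and `bottom` are the two rows of the alignment, already padded
    with gap letters so that they have the same length.
    """
    if len(top) != len(bottom):
        raise ValueError("both rows of an alignment must have the same length")
    total = 0
    for a, b in zip(top, bottom):
        if a == gap_char or b == gap_char:
            total += gap          # a gap letter in either sequence
        elif a == b:
            total += match        # identical letters
        else:
            total += mismatch     # different letters
    return total


alignment_a = ("AGTAG_TCAAGA", "AGAAGCTCAAGA")
alignment_b = ("AGTAG____TCAAGA", "___AGAAGCTCAAGA")
print(score_alignment(*alignment_a))  # 7
print(score_alignment(*alignment_b))  # -6
```

The defaults reproduce the scheme used in the text (1 for a match, -1 for a mismatch, -2 for a gap position); other values can be passed in to see how the ranking of alignments changes.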
Scoring functions of this kind, which depend only on the counts of matches, mismatches, and gap letters, do not take into account the various degrees of biochemical similarity among the different pairs of residues. This is particularly important when we are aligning amino acid sequences, because some of the 20 amino acids are more similar to each other in their biochemical properties than others. Substituting one amino acid with another of similar biochemical properties will not alter the structure and function of the protein molecule much. On the other hand, replacing one amino acid by another with entirely different biochemical properties can completely destroy the function of the protein molecule.

Commonly, the amino acids are grouped into four families, as displayed in Table 6.1. Readers who are interested in chemistry may want to look up a biochemistry or molecular biology textbook (e.g., ...) to examine the chemical structures of these amino acids. You will see that members of the same family resemble one another more than members of different families. For example, glutamic acid is much more similar to aspartic acid (both being acids) than to, say, cysteine. Leucine and isoleucine are almost identical in structure, and they can easily substitute for each other without altering the chemical properties of the protein too much.

To take into account the various degrees of similarity and dissimilarity among amino acids, we make use of scoring matrices. These are 20 by 20 matrices in which each entry indicates the similarity between the amino acid of the row and that of the column. Because the matrices are symmetric, it is sufficient to give the entries on and above the diagonal, or those on and below the diagonal. The other entries can be inferred by symmetry.
Table 6.1

Family            Members
Acidic            Aspartic acid, Glutamic acid
Basic             Lysine, Arginine, Histidine
Uncharged Polar   Asparagine, Glutamine, Serine, Threonine, Tyrosine
Nonpolar          Alanine, Glycine, Valine, Leucine, Isoleucine, Proline, Phenylalanine, Methionine, Tryptophan, Cysteine

The two big classes of scoring matrices are the PAM (Dayhoff 1972) and BLOSUM (Henikoff and Henikoff 1992) families of matrices. These families of matrices are constructed from statistical analysis of a carefully collected database together with the biologists' knowledge of the evolutionary relationships of the sequences in the collection. We shall explain the construction of the BLOSUM matrices in detail in Section 6.4. Figure 6.1 shows the BLOSUM 62 matrix, a popular member of the BLOSUM family.

When we are allowed to introduce gaps in an alignment, we need to assess how much the similarity is affected by the insertion of a gap and by the length of the gap. It is believed that extending an already opened gap does not have as devastating an effect as opening a new gap. While there are no general rules dictating what gap opening penalty and gap extension penalty to use, most sequence alignment programs use a gap penalty function of the form

w(k) = a + bk

for a gap of length k. Here a and b are respectively the gap opening and gap extension penalties, which are free parameters whose values the user chooses. In practice, when we try different values for these parameters and examine the alignments obtained, we generally get a feeling for which values produce alignments that exhibit the similarities of the sequences under comparison.
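As a small illustration of the gap penalty function w(k) = a + bk, the following Python sketch finds the gaps in one row of an alignment and adds up their penalties. The helper names are hypothetical, the gap letter is assumed to be `_`, the penalties are treated here as positive amounts to be subtracted from the score, and the values a = 1 and b = 2 are arbitrary illustrative choices rather than recommended settings.

```python
from itertools import groupby


def gap_penalty(k, a, b):
    """Penalty w(k) = a + b*k for a single gap of length k."""
    return a + b * k


def gap_lengths(aligned_row, gap_char="_"):
    """Lengths of the maximal runs of gap letters in one row of an alignment."""
    return [len(list(run)) for letter, run in groupby(aligned_row) if letter == gap_char]


def total_gap_penalty(rows, a, b, gap_char="_"):
    """Sum of w(k) over every gap in every row of the alignment."""
    return sum(
        gap_penalty(k, a, b)
        for row in rows
        for k in gap_lengths(row, gap_char)
    )


# Alignment B from Section 6.1: one gap of length 4 and one of length 3.
rows = ("AGTAG____TCAAGA", "___AGAAGCTCAAGA")
print(gap_lengths(rows[0]), gap_lengths(rows[1]))
print(total_gap_penalty(rows, a=1, b=2))
```

With these illustrative values, one long gap of length 4 costs 1 + 2(4) = 9, whereas four separate gaps of length 1 would cost 4(1 + 2) = 12, which reflects the idea that opening a gap is worse than extending one.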
    A  R  N  D  C  Q  E  G  H  I  L  K  M  F  P  S  T  W  Y  V
A   4 -1 -2 -2  0 -1 -1  0 -2 -1 -1 -1 -1 -2 -1  1  0 -3 -2  0
R      5  0 -2 -3  1  0 -2  0 -3 -2  2 -1 -3 -2 -1 -1 -3 -2 -3
N         6  1 -3  0  0  0  1 -3 -3  0 -2 -3 -2  1  0 -4 -2 -3
D            6 -3  0  2 -1 -1 -3 -4 -1 -3 -3 -1  0 -1 -4 -3 -3
C               9 -3 -4 -3 -3 -1 -1 -3 -1 -2 -3 -1 -1 -2 -2 -1
Q                  5  2 -2  0 -3 -2  1  0 -3 -1  0 -1 -2 -1 -2
E                     5 -2  0 -3 -3  1 -2 -3 -1  0 -1 -3 -2 -2
G                        6 -2 -4 -4 -2 -3 -3 -2  0 -2 -2 -3 -3
H                           8 -3 -3 -1 -2 -1 -2 -1 -2 -2  2 -3
I                              4  2 -3  1  0 -3 -2 -1 -3 -1  3
L                                 4 -2  2  0 -3 -2 -1 -2 -1  1
K                                    5 -1 -3 -1  0 -1 -3 -2 -2
M                                       5  0 -2 -1 -1 -1 -1  1
F                                          6 -4 -2 -2  1  3 -1
P                                             7 -1 -1 -4 -3 -2
S                                                4  1 -3 -2 -2
T                                                   5 -2 -2  0
W                                                     11  2 -3
Y                                                         7 -1
V                                                            4

Figure 6.1 The BLOSUM 62 amino acid substitution matrix.

6.2 Sequence alignment algorithms

6.2.1 Dot-matrix analysis

The first computer-aided method of sequence comparison is called "dot-matrix analysis", or simply dot-plot. The first published account of this method is by Gibbs and McIntyre (1970, The diagram, a method for comparing sequences. Eur. J. Biochem. 16: 1-11). Briefly, this method involves constructing a matrix with one of the sequences to be compared running horizontally across the bottom, and the other running vertically along the left-hand side. Each entry of the matrix is a measure of the similarity of the two residues at the corresponding positions of the horizontal and vertical sequences. In their paper, Gibbs and McIntyre use the simplest scoring system, which distinguishes only between identical (dot) and non-identical (blank) residues. However, one can also use graded measures that give chemically similar pairs of residues higher similarity scores, such as the BLOSUM and PAM matrices, and enter a dot whenever the similarity exceeds a prescribed value. Similar sequences tend to have many identical or chemically related residues along the main diagonal; hence conspicuous diagonal runs of dots signal regions of similarity.
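The simplest form of the dot matrix, with a dot for identical residues and a blank otherwise, can be sketched as follows. The function name `dot_matrix` is hypothetical, and for convenience this sketch prints the vertical sequence from top to bottom rather than with the horizontal sequence along the bottom as described above.

```python
def dot_matrix(horizontal, vertical):
    """Return one string per residue of `vertical`.

    Column c of row r holds '.' when vertical[r] == horizontal[c],
    and a blank otherwise.
    """
    return [
        "".join("." if v == h else " " for h in horizontal)
        for v in vertical
    ]


horizontal = "AGTAGTCAAGA"
vertical = "AGAAGCTCAAGA"
print("  " + horizontal)
for residue, row in zip(vertical, dot_matrix(horizontal, vertical)):
    print(residue + " " + row)
```

Runs of dots along a diagonal in the printed matrix correspond to stretches where the two sequences agree; a graded scoring matrix with a threshold could be substituted for the identity test.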
Simple as it is, dot-matrix analysis is still a popular tool for researchers to visually inspect the similarity between two sequences. It is often used as a first examination. From its output, the researcher can pick out regions of the two sequences on which a more detailed alignment will be performed.

Maizel and Lenk (1981, "Enhanced Graphic Matrix Analysis of Nucleic Acid and Protein Sequences", Proc. Natl. Acad. Sci. USA 78: 7665-7669) generalize the original ideas of Gibbs and McIntyre. At every position of the two sequences, a window of fixed size is laid down. A dot is entered in the matrix if the total similarity score of the two windowed fragments exceeds a prescribed threshold. Their algorithm is implemented in the GCG program "compare". The output of compare can be fed into the "dot-plot" program to draw the dot matrix.

Figure 6.2 The dot-plot output of the amino acid sequences of the human hemoglobin alpha and beta chains.

6.2.2 The dynamic programming algorithm

Needleman and Wunsch (1970, A general method applicable to the search for similarities in the amino acid sequence of two proteins. J. Mol. Biol. 48: 443-453) introduced an elegant algorithm for comparing two protein sequences. This general algorithm also works for aligning nucleic acid sequences. The algorithm belongs to a very large class of algorithms for finding optimal solutions. The essence of the algorithm is a technique known as dynamic programming.

For any letter sequence s, the segment of the sequence consisting of the letters from the beginning of the sequence up to the ith letter is called a prefix, and it is denoted by s[i].
The dynamic programming technique finds the optimal alignment by taking advantage of the optimal alignments already found for the prefixes of the sequences. Suppose s and t are two sequences of lengths m and n respectively. There are m+1 possible prefixes of s and n+1 prefixes of t, including the empty string. We arrange our calculations in an (m+1) x (n+1) matrix in which entry (i, j) contains the similarity between the prefixes s[i] and t[j]. This entry will be denoted by sim(i, j).

Let us illustrate the dynamic programming algorithm with an example. We shall try to align the two DNA sequences s = GCTC and t = AGTCA, with m = 4 and n = 5. Every base match receives a similarity score of +1 and every mismatch -1. The gap penalty function is chosen to be w(k) = -1-2k, where k is the length of the gap. In other words, a penalty of -3 is given to a gap of length 1, and the penalty grows by a further 2 for every additional position as the gap lengthens.

Figure 6.3 Dynamic programming algorithm for global alignment

We place s along the left margin and t along the top margin of a rectangular array. A special character "^" is introduced to indicate that the sequence begins at the next position. The 0th row and 0th column are initialized with the gap penalty function, with k being the length of the "gap" that has to be inserted at the beginning of either sequence. For instance, cell (0,3) has the value -7, because lining up the 0th character "^" of sequence s with the 3rd character "T" of sequence t produces a gap of length 3 at the beginning of the alignment, like this:

_ _ _ | sequence s begins here
A G T | the rest of sequence t here

The gap penalty, accordingly, is -1-2(3) = -7. For the rest of the array, cell (i, j) is filled with the similarity between the prefixes s[i] and t[j], computed recursively. Suppose we have already filled the entries at (i-1, j), (i-1, j-1), (i, j-1).
Then we can compute

\[
\mathrm{sim}(i,j) = \max
\begin{cases}
\mathrm{sim}(i,\, j-k) + w(k), & k = 1, \ldots, j \\
\mathrm{sim}(i-1,\, j-1) + p(i,j) \\
\mathrm{sim}(i-k,\, j) + w(k), & k = 1, \ldots, i
\end{cases}
\tag{6.1}
\]

where p(i,j) is the similarity score between the ith letter of sequence s and the jth letter of sequence t. In the scheme of this example, p(i,j) can only be +1 or -1, depending on whether the letters are identical or different. Note that w(k) is negative here, so adding it lowers the score. The reasoning behind equation (6.1) is that there are just these possible ways of obtaining an alignment between s[i] and t[j]:

(A) Align s[i] and t[j-k], and match a new gap of length k with the next k letters of t.
(B) Align s[i-1] and t[j-1], and match the ith letter of s with the jth letter of t.
(C) Align s[i-k] and t[j], and match a new gap of length k with the next k letters of s.

These possibilities are exhaustive because we cannot have two gap letters paired in the last column of the alignment. Scores of the best alignments between smaller prefixes are already stored in the array if we choose an appropriate order in which to compute the entries (e.g., fill the array row by row, left to right in each row; or fill the array column by column, top to bottom in each column).

As we enter each entry in the array following equation (6.1), we draw an arrow to indicate where the maximum value comes from. The options (A), (B), and (C) correspond to getting the value for the current cell from the horizontal, diagonal, and vertical directions respectively. For instance, the cell in row 1 and column 3 of the matrix contains the value of sim(1,3). This is obtained by taking the maximum among the following numbers:

sim(1,2) - 3 = -2 - 3 = -5 (horizontal)
sim(1,1) - 5 = -1 - 5 = -6 (horizontal)
sim(1,0) - 7 = -3 - 7 = -10 (horizontal)
sim(0,2) - 1 = -5 - 1 = -6 (diagonal)
sim(0,3) - 3 = -7 - 3 = -10 (vertical)

The maximum value comes from entry (1,2), and that is where the arrow points. If there is more than one way of getting the maximum, we draw arrows to indicate all the possibilities. See, for example, entry (2,1).
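The filling of the array by equation (6.1) can be sketched in Python as follows. This is a minimal sketch rather than the implementation used by any particular program: the function name `fill_global` is hypothetical, the entry sim(0, 0) is assumed to be 0 (aligning two empty prefixes), and only the scores are stored, not the arrows.

```python
def fill_global(s, t, match=1, mismatch=-1, w=lambda k: -1 - 2 * k):
    """Fill the (m+1) x (n+1) array of sim(i, j) values using equation (6.1)."""
    m, n = len(s), len(t)
    sim = [[0] * (n + 1) for _ in range(m + 1)]
    # 0th row and 0th column: a single leading gap of length j or i.
    for j in range(1, n + 1):
        sim[0][j] = w(j)
    for i in range(1, m + 1):
        sim[i][0] = w(i)
    # Remaining cells, row by row and left to right within each row.
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            p = match if s[i - 1] == t[j - 1] else mismatch
            candidates = [sim[i - 1][j - 1] + p]                           # (B) diagonal
            candidates += [sim[i][j - k] + w(k) for k in range(1, j + 1)]  # (A) horizontal
            candidates += [sim[i - k][j] + w(k) for k in range(1, i + 1)]  # (C) vertical
            sim[i][j] = max(candidates)
    return sim


sim = fill_global("GCTC", "AGTCA")
for row in sim:
    print(" ".join(f"{value:4d}" for value in row))
print(sim[1][3])  # -5, as in the worked example
```

Because every gap length k is tried for every cell, this direct transcription of (6.1) takes time proportional to mn(m+n); it is meant to mirror the equation, not to be efficient.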
After the array has been completely filled, we find the best alignment by tracing back along the arrows. We start at the bottom right corner of the array and move in the direction of the arrows. The best alignment we get from Figure 6.3 is

A G T C A
: : | |
G C T C _

When there are multiple arrows emanating from an entry, we can follow any one of them. So it is possible to have more than one optimal alignment. Most computer programs for sequence alignment will report all different optimal alignments.

It is important to note that an optimal alignment is optimal only for the particular similarity score matrix and gap penalty function used. When any of these is altered, the optimal alignment may also change. The GCG program "Gap" uses the above algorithm to find the best global alignment of two sequences.

Exercise
With the same similarity scoring scheme and gap penalty as in the example above, find the best alignment between the pair of DNA sequences in the beginning of this section.

In the description above, we try to find an alignment that gives the best overall similarity score between the entirety of the two sequences. This is called an optimal global alignment. At times, our aim is instead to find the segments of the given pair of sequences that line up best with each other. This is called an optimal local alignment. Local alignments are particularly useful when a new sequence has just been obtained from the laboratory. The researcher would first like to identify any parts of the sequence that have high similarity to known functional domains. The popular database search program BLAST uses a local alignment algorithm.

The dynamic programming local alignment algorithm was developed in the early 1980s (Smith and Waterman 1981, "Identification of common molecular subsequences". J. Mol. Biol. 147(1): 195-7) and is frequently referred to as the Smith-Waterman algorithm.
It shares the same basic concepts with the global algorithm, differing only in a few details. First, an extra possibility is added to equation (6.1), allowing sim(s[1..i], t[1..j]) to take the value 0 if all other options have values less than 0. That is,

\[
\mathrm{sim}(s[1..i],\, t[1..j]) = \max
\begin{cases}
0 \\
\mathrm{sim}(s[1..i],\, t[1..j-k]) + w(k), & k = 1, \ldots, j-1 \\
\mathrm{sim}(s[1..i-1],\, t[1..j-1]) + p(i,j) \\
\mathrm{sim}(s[1..i-k],\, t[1..j]) + w(k), & k = 1, \ldots, i-1
\end{cases}
\tag{6.2}
\]

Consequently, the top row and left column of the array in Figure 6.3 are now filled with 0's instead of the w(k)'s used in global alignment. Taking the option 0 corresponds to starting a new alignment. Since we are only looking for a local alignment, the alignment can start anywhere in the two sequences. So, if the best alignment up to a certain point has a negative score, it is better to start a new one at that point.

Second, the alignment can end anywhere in the sequences. So, instead of starting the traceback from the bottom right corner, we look for the highest similarity value in the array and start the traceback from there. The traceback ends when we meet a cell with value 0, which corresponds to the start of the alignment.

If we follow equation (6.2) to find the best local alignment of the same pair of sequences s and t, we obtain the array in Figure 6.4, which indicates that the best local alignment between these two sequences is the match of the two nucleotide bases TC in the 3rd and 4th positions of both sequences.

Figure 6.4 Dynamic programming algorithm for local alignment

The GCG programs that use dynamic programming algorithms for local pairwise sequence alignments are BestFit and FrameAlign.

6.3 Database similarity search - BLAST and FASTA

BLAST is the acronym for Basic Local Alignment Search Tool. It uses the method of Altschul et al. (JMB 215: 403-410, 1990) to pick out sequences already collected in a database that are similar to the query sequence.
BLAST takes the query sequence input by the user and compares it with each entry in the database, looking for segments with a high degree of similarity. It picks out from the database those sequences that contain a segment so similar to part or all of the query that the similarity is deemed statistically significant (i.e., unlikely to occur by chance).

Exercise
BLAST is available at the NCBI web site. Before you go on, it may be helpful to visit http://www.ncbi.nlm.nih.gov/blast/ to take a look at the BLAST overview and go through the exercise in the BLAST tutorial there.

The algorithm used in the current version of BLAST at NCBI can be summarized in three main steps:

Step 1. Finding high-scoring segment pairs: BLAST compares each sequence in the database with the query. It first seeks, from the sequence pair, equal-length sequence segments whose aggregate similarity score is maximal in the sense that it cannot be increased by extension or trimming. Such locally optimal alignments are called "high-scoring segment pairs", or HSP's. The current version of BLAST requires that each HSP contain at least two non-overlapping pairs of words of length W (these word pairs are called "hits" in BLAST jargon; the default values for W are 3 for amino acid sequences and 11 for nucleotide sequences) satisfying the following requirements:

a) Their similarity score exceeds a threshold value T.
b) The offsets of the two word pairs are equal. If a word pair occurs at position x1 of the first sequence and position x2 of the second sequence, the offset of the word pair is defined to be x1 - x2.
c) The distance between the word pairs is no more than a preset upper limit A. The distance between two word pairs (x1, x2) and (x'1, x'2) is defined to be the difference between their first coordinates, x1 - x'1.
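Requirements a) to c) can be illustrated with a brute-force Python sketch. This is not how BLAST itself is implemented (BLAST does not compare every word of the query with every word of the database sequence); the function names, the identity scoring function, and the values of W, T, and A used below are all assumptions chosen for the illustration. Positions are counted from 0.

```python
from collections import defaultdict


def word_hits(query, subject, W, T, score):
    """All word pairs (x1, x2) of length W whose similarity score exceeds T."""
    hits = []
    for x1 in range(len(query) - W + 1):
        for x2 in range(len(subject) - W + 1):
            total = sum(score(query[x1 + d], subject[x2 + d]) for d in range(W))
            if total > T:                      # requirement a)
                hits.append((x1, x2))
    return hits


def two_hit_seeds(hits, W, A):
    """Pairs of non-overlapping hits with equal offset and distance at most A."""
    by_offset = defaultdict(list)
    for x1, x2 in hits:
        by_offset[x1 - x2].append(x1)          # requirement b): group by offset
    seeds = []
    for offset, starts in by_offset.items():
        starts.sort()
        for i, first in enumerate(starts):
            for second in starts[i + 1:]:
                distance = second - first
                if W <= distance <= A:         # non-overlapping, requirement c)
                    seeds.append(((first, first - offset), (second, second - offset)))
    return seeds


def identity(a, b):
    """Illustrative score: 1 for identical letters, 0 otherwise."""
    return 1 if a == b else 0


hits = word_hits("ACGTTTACG", "ACGAAAACG", W=3, T=2, score=identity)
print(hits)
print(two_hit_seeds(hits, W=3, A=10))
```

In this toy example the word ACG occurs twice in each sequence, giving four hits, but only the two hits on the same diagonal (offset 0) and within distance A of each other qualify as a seed.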
The rationale behind these criteria for finding HSP's is the observation that an HSP with a similarity score large enough to eventually generate a statistically significant local alignment is very likely to contain multiple hits with the same offset and within a relatively short distance of one another. The chance of missing any HSP of interest using this procedure is relatively small.

Step 2. Gapped extensions of HSP's: BLAST retains only those HSP's that exceed a moderate score Sg, and then attempts to extend the alignment in both the leftward and rightward directions while allowing gaps to be introduced. Sg is controlled so that no more than about one gapped extension is invoked per 50 database sequences. A dynamic programming type algorithm, with modifications to improve efficiency, is used. Whenever a gap is opened or extended, a penalty is imposed according to a gap penalty function of the form w(k) = a+bk, with k being the length of the gap. The alignment(s) with the maximal score will be assessed for statistical significance.

Step 3. Assessing the statistical significance of the maximal alignment score: If the fully extended gapped alignment is deemed significant, the database sequence is picked and described in the output. The evaluation of statistical significance is based on comparison with the rolling-die random sequence models described in Chapter 3. For example, a random amino acid sequence is generated by rolling an icosahedral (20-faced) die a number of times equal to the length of the sequences under comparison. The die is loaded according to the relative frequencies of occurrence of the amino acids in the database. The maximal alignment score M for two random sequences is a random variable. Asymptotically, it follows an extreme value distribution as the lengths m and n of the sequences tend to infinity.
In practice, when m and n are large, the asymptotic distribution yields a good approximation that can be used to calculate the probability that the maximal local alignment score exceeds any given level. We shall discuss this more fully in the next section. From the probability distribution of the maximal alignment scores, one can determine the probability of getting an alignment as good as the one observed. If this probability is small (say < 0.05), the alignment is deemed statistically significant. In the BLAST output, this probability p is converted to a bit score equal to -log_2 p. The smaller the probability, the larger the bit score.

One can also calculate the expected number E of times an alignment with such a score would occur in a database of the same size as the one searched. If this expected number is high, the alignment can occur quite frequently by chance. On the other hand, a low value of E indicates that the alignment is expected to occur very rarely and hence is worth further examination. BLAST lets you specify a parameter that discards those alignments expected to occur more than a certain number (the default is 10) of times.

6.4 Statistics for sequence alignments

Beneath the surface of the sequence alignment programs lie two important applications of statistics. First, statistics plays a key role in the construction of the similarity score matrices. Second, the evaluation of the significance of the "best" alignment found by any sequence alignment algorithm also depends on statistics.

The BLOSUM family of similarity score matrices

BLOSUM is the acronym for blocks substitution matrix. The name comes from the fact that the values of the matrices are derived from a large collection of blocks of biologically similar proteins. S. Henikoff and J.G. Henikoff (1991, Nucleic Acids Res. 19, 6565-6572) designed an automated system, PROTOMAT, for obtaining a set of blocks given a group of related proteins.
This system was applied to a catalog of several hundred protein groups, yielding a database of more than 2000 blocks. Each block in this database consists of a number (say, d) of aligned amino acid sequences. Suppose there are w columns in the alignment. We say that the block has depth d and width w. From each column of d amino acids, one can form

1+2+...+(d-1) = d(d-1)/2

unordered pairs of amino acids. For example, if a column contains nine alanines and one serine, one can form 10(9)/2 = 45 pairs, of which 36 are [A, A] and 9 are [A, S]. Gap letters in the alignment are ignored, and no pair is formed with a gap.

Exercise
If a column contains 2 A's, 1 S, 1 T and 1 _, list the possible pairs formed and their frequencies.

When the procedure is repeated on every column of every protein block in the database, we obtain the frequency counts of all the 210 (i.e., 1 + 2 +...+ 20) amino acid pairs. For simplicity, we index the amino acids from 1 to 20 in some convenient order (say, alphabetically by name) and denote the frequency counts by f_ij, where i = 1, ..., 20 and j = 1, ..., i. These frequency counts are used to calculate the score matrix.

First, we need to calculate an "odds ratio", which is defined to be the ratio of the observed relative frequencies to the expected relative frequencies of the amino acid pairs. The observed relative frequencies are calculated as

\[
q_{ij} = f_{ij} \Big/ \sum_{i=1}^{20} \sum_{j=1}^{i} f_{ij} .
\]

Let us pretend that the entire database has only the column of 9 A's and 1 S described before, where f_AA = 36 and f_AS = 9. Then q_AA = 36/45 = 0.8 and q_AS = 9/45 = 0.2.

The expected relative frequencies p_ij are calculated from a rolling-die model in which all the amino acids in the protein block are generated independently. In such a model, the probability of observing amino acid i at one position and amino acid j at another is p_i p_j, where p_i and p_j are the probabilities of observing the two individual amino acids in the database. These probabilities are estimated by the observed relative frequencies.
So,

\[
\hat{p}_i = q_{ii} + \sum_{j \neq i} q_{ij} / 2 .
\]

Hence the expected relative frequency of the pair is estimated by

\[
\hat{p}_{ij} =
\begin{cases}
\hat{p}_i^{\,2}, & i = j \\
2\, \hat{p}_i\, \hat{p}_j, & i \neq j
\end{cases}
\]

where the factor 2 arises because an unordered pair of two different amino acids can occur in two orders. In the example, the expected relative frequency of [A, A] is 0.9 x 0.9 = 0.81, that of [A, S] is 2 x 0.9 x 0.1 = 0.18, and that of [S, S] is 0.1 x 0.1 = 0.01.

The odds ratio is then calculated for each entry as q_ij / \hat{p}_ij. The base 2 logarithm of this odds ratio, measured in bits, is referred to as a "lod ratio":

\[
\mathrm{lod} = \log_2 \left( q_{ij} / \hat{p}_{ij} \right) .
\]

The lod ratio is positive, zero, or negative according to whether the amino acid pair occurs more frequently than expected, just as frequently as expected, or less frequently than expected. A positive lod ratio indicates that the pair of amino acids frequently substitute for each other in proteins with like functions. Such a pair usually have similar molecular structures and biochemical functions. The lod ratios are multiplied by a scaling factor of 2 and then rounded to the nearest integer to produce the values in a BLOSUM matrix, in half-bit units.

To reduce multiple contributions to amino acid pair frequencies from the most closely related members of a family, sequences are clustered within blocks, and each cluster is weighted as a single sequence in counting pairs (Henikoff, S., Wallace, J.C., and Brown, J.P., 1990, Methods Enzymol. 183, 111-132). This is done by specifying a clustering percentage: sequence segments that are identical in at least that percentage of amino acids are grouped together. The BLOSUM matrix computed from this reduced block of proteins is associated with that percentage. That is why we have the BLOSUM62 matrix, the BLOSUM80 matrix, etc. The clustering procedure is best explained by the example given in Henikoff and Henikoff (1991).
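Before turning to that clustering example, the unclustered calculation described so far (pair counts, observed and expected relative frequencies, and scaled lod ratios) can be sketched in Python. The function names are hypothetical, the gap letter is assumed to be `_`, and pairs that are never observed are simply left without a score, since their lod ratio is undefined; the sketch also relies on Python's built-in rounding.

```python
from collections import Counter
from itertools import combinations
from math import log2


def pair_counts(columns, gap_char="_"):
    """Count unordered amino acid pairs f_ij over all columns, ignoring gaps."""
    counts = Counter()
    for column in columns:
        residues = [r for r in column if r != gap_char]
        for a, b in combinations(residues, 2):
            counts[tuple(sorted((a, b)))] += 1
    return counts


def lod_scores(counts, scale=2):
    """Return (q, p_hat, scores) computed from the pair counts."""
    total = sum(counts.values())
    q = {pair: f / total for pair, f in counts.items()}
    p_hat = Counter()
    for (a, b), q_ab in q.items():
        if a == b:
            p_hat[a] += q_ab
        else:                       # a mixed pair contributes half to each member
            p_hat[a] += q_ab / 2
            p_hat[b] += q_ab / 2
    scores = {}
    for (a, b), q_ab in q.items():
        expected = p_hat[a] ** 2 if a == b else 2 * p_hat[a] * p_hat[b]
        scores[(a, b)] = round(scale * log2(q_ab / expected))
    return q, dict(p_hat), scores


# The single-column "database" of nine A's and one S used in the text.
counts = pair_counts(["AAAAAAAAAS"])
q, p_hat, scores = lod_scores(counts)
print(counts)
print(q, p_hat, scores)
```

For this toy column the observed and expected frequencies are so close that both scores round to 0; a realistic matrix only emerges when the counts are accumulated over many blocks.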
Suppose the clustering percentage is set at 80%, and sequence A is identical to sequence B at 80% or more of their aligned positions. Then A and B are clustered, and their contributions are averaged in calculating pair frequencies. If C is identical to either A or B at 80% or more of the aligned positions, it is also clustered with them, and the contributions of A, B, and C are averaged, even though C might not be identical to both A and B at 80% of the aligned positions. In the above example, if 8 of the 9 sequences with A residues in the 9A-1S column are clustered, then the contribution of this column to the frequency table is equivalent to that of a 2A-1S column, which contributes 2 [A, S] pairs (together with one [A, A] pair).

Assessing the statistical significance of alignment scores

The statistical theory was initially developed for the older version of BLAST, which looks only for ungapped local alignments (i.e., HSP's). For simplicity, we focus on this ungapped situation and then indicate what changes are required to allow for gapped alignments. Again, a simple rolling-die model is assumed. The twenty amino acids occur randomly at all positions with background probabilities p_i. We require that the expected score for two random amino acids,

\[
\sum_{i,j} p_i\, p_j\, s_{ij} ,
\]

be negative, where s_ij is the matrix score for the pair of amino acids i and j.

Based on the theory of extreme values, it can be proved that, between two sufficiently long random letter sequences of lengths m and n, the number of high-scoring segment pairs exceeding a certain score S is well approximated by a Poisson random variable with mean

\[
\mu = K m n\, e^{-\lambda S} .
\]

Here K and \lambda are mathematical parameters that depend on the letter composition of the sequences as well as on the matching scores in the matrix (see Karlin and Altschul 1990, "Methods for assessing the statistical significance of molecular sequence features by using general scoring schemes", and references therein). They are computed by the BLAST program and reported at the end of the BLAST output.
The mathematical derivation of the above statement is beyond the scope of this book, but at least the formula makes intuitive sense. Doubling the length of either sequence under comparison should double the number of HSPs attaining the given score $S$. Also, for an HSP to attain the score $2x$, it must attain the score $x$ twice in a row, so one expects the number $E$ of such HSPs to decrease exponentially with the score.

Here let us recall some of the results we discussed in Chapter 4. For a Poisson random variable $X$ with mean $\mu$, $P(X = 0) = e^{-\mu}$. Hence, $P(X \geq 1) = 1 - e^{-\mu}$. For a small value of $\mu$ (say, $\mu < 0.01$), $1 - e^{-\mu}$ is very close to $\mu$. This is easily seen from the Taylor series expansion of the exponential function, $1 - e^{-\mu} = \mu - \mu^2/2 + \cdots$; it can also be demonstrated by trying out a few small values of $\mu$. So if $S$ is large enough to make $\mu = Kmne^{-\lambda S}$ small, the P-value for observing an HSP with score $S$ or more is approximately $Kmne^{-\lambda S}$.

The P-value tells how unlikely it is to have an HSP with a score as high as $S$. Note, however, that this P-value depends on the lengths of the sequences under comparison. To get a measure independent of the lengths, a bit-score $S'$ is defined to be the negative of the base 2 logarithm of the P-value with $m$ and $n$ factored out. The mathematical properties of logarithms give the following relationship between bit-scores and raw scores:

$$S' = -\log_2\left(P/mn\right) = -\log_2\left(Ke^{-\lambda S}\right) = \frac{\lambda S - \ln K}{\ln 2}$$

The BLAST output also reports an E-value for each reported sequence. This is the expected number of HSPs with a score as high as that reported when the query sequence is searched against a database of random sequences with the same base composition. If we are comparing the query sequence, of length $m$, with just one database sequence of length $n$, the expected number is given by $\mu = Kmne^{-\lambda S}$. Now if the database contains $s$ sequences of average length $n$, we can expect to see $s\mu = sKmne^{-\lambda S} = KmNe^{-\lambda S}$ HSPs, where $N = sn$ is the total length of the database.
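The relationships among the Poisson mean, the P-value, the bit-score, and the database E-value can be tried out numerically. The sketch below is only an illustration: the function names are invented here, and the values of `K`, `lam`, the sequence lengths, the raw score, and the database size are arbitrary placeholders rather than values taken from any BLAST run.

```python
import math

def expected_hsps(K, lam, m, n, S):
    """Poisson mean K*m*n*exp(-lam*S) for the number of HSPs with score S or more."""
    return K * m * n * math.exp(-lam * S)

def p_value(mu):
    """P(X >= 1) = 1 - exp(-mu) for a Poisson random variable X with mean mu."""
    return -math.expm1(-mu)  # expm1 keeps precision when mu is tiny

def bit_score(K, lam, S):
    """Bit-score S' = (lam*S - ln K) / ln 2."""
    return (lam * S - math.log(K)) / math.log(2)

# Placeholder parameters (assumptions); real K and lam are printed by BLAST.
K, lam = 0.13, 0.32
m, n, S = 250, 400, 60

mu = expected_hsps(K, lam, m, n, S)   # one query against one database sequence
bits = bit_score(K, lam, S)

s = 5_000                             # hypothetical number of database sequences
N = s * n                             # total database length
e_value = expected_hsps(K, lam, m, N, S)

print(mu, p_value(mu), bits, e_value)
```

Written in terms of the bit-score, the single-pair expectation is simply `m * n * 2 ** (-bits)`, which is why bit-scores obtained with different scoring matrices can be compared directly.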
If this E-value is small, it indicates that a match with such a high score in a database of the same size and composition is statistically unusual.

The above discussion is given in terms of HSPs rather than the gapped alignments that the current version of BLAST allows. The reasoning for gapped alignments is quite similar, except that the values of $K$ and $\lambda$ are calculated somewhat differently. These new parameters are used in the calculations of bit-scores and E-values in the BLAST output.

6.5 Multiple Sequence Alignment

Needleman and Wunsch (1970) remark that the dynamic programming algorithm can be generalized to allow simultaneous alignment of more than two sequences. However, the procedure requires large amounts of computer memory as well as computing time. Many clever variations and adaptations of the basic pairwise alignment algorithm, some taking advantage of parallel processing computer systems, have been proposed. Examples of these are included in the GCG programs for multiple sequence alignments: PileUp, SeqLab, PlotSimilarity, Pretty, PrettyBox, Meme, ProfileMake, ProfileGap, Overlap, NoOverlap, OldDistances. We shall not describe them here, but interested readers can pursue the references contained in the GCG manual.

In many applications, however, a full alignment of multiple sequences is neither necessary nor practical, especially when the sequences under comparison are long. All we need is to find the matching segments in the sequences under investigation, just like finding local alignments in pairwise sequence comparison. There is a class of algorithms, called hash-coding algorithms, which very efficiently locate matching segments among multiple long sequences totaling millions of bases. The key feature of this type of algorithm is the construction of a "lookup table" of k-letter words or k-tuples (e.g., all possible dinucleotides and trinucleotides).
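The lookup-table idea can be sketched in a few lines. This is a simplified illustration under assumed names (`build_lookup_table`, `shared_words`), not the implementation used by any particular program; it records where each k-tuple occurs and then reports the words found in every sequence, using the two DNA fragments from Section 6.1 as input.

```python
from collections import defaultdict

def build_lookup_table(sequences, k):
    """Map each k-letter word to the (sequence index, offset) pairs where it occurs."""
    table = defaultdict(list)
    for seq_id, seq in enumerate(sequences):
        for pos in range(len(seq) - k + 1):
            table[seq[pos:pos + k]].append((seq_id, pos))
    return table

def shared_words(table, n_sequences):
    """Keep only the words that occur in every one of the sequences."""
    return {
        word: hits
        for word, hits in table.items()
        if len({seq_id for seq_id, _ in hits}) == n_sequences
    }

seqs = ["AGTAGTCAAGA", "AGAAGCTCAAGA"]
table = build_lookup_table(seqs, k=3)
common = shared_words(table, len(seqs))
for word, hits in sorted(common.items()):
    print(word, hits)
```

Each sequence is scanned only once to build the table, and candidate matching segments are then read off from words that share a table entry, which is what makes this approach practical for long sequences.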
The method was first introduced into molecular biology by Dumas and Ninio (1982) and is the basis of the database search program FASTA, which, like BLAST, is used extensively.

Exercises

Applications of Multiple Sequence Alignments

1. Use multiple DNA sequence alignments to look for consensus sequences for prediction of promoter sites and splice junctions.

2. Multiple sequence alignments on:
   Histones - sequences highly conserved - for unwinding the DNA to conduct activities
   Kinases
   Pleckstrin homology domains - highly conserved in structure but only 1/100 conserved amino acids.

Chapter References: