BME230 Winter 2005 Project report Wing Wong

Profile-profile alignment using hidden Markov models

Wing Wong
Computer Science Department
San Jose State University
San Jose, CA 95192
408-924-1000
sjsuwingwong@yahoo.com

Abstract

To detect distantly related proteins, the evolutionary information contained in a multiple sequence alignment generally provides better results than a single query sequence. Profile-profile methods generate an alignment of two multiple sequence alignments. They have been shown to produce more accurate alignments and more sensitive homolog detection. The four steps in a profile-profile alignment are: (i) preparation of multiple alignments; (ii) construction of numerical profiles from the multiple alignments; (iii) aligning and scoring; (iv) statistical evaluation of the resulting alignments. Besides the classical dynamic programming approach, hidden Markov models can also be used in the alignment process. Different scoring schemes have been proposed by various researchers, and the family of log-odds based methods has been shown to perform slightly better. It is our aim to incorporate a profile-profile aligner into the SAM suite with different profile-profile scoring options.

1. INTRODUCTION

Proteins, RNAs, and other functional units in genomes are typically classified into families of related sequences and structures. A multiple alignment of sequences from the same family shows how they relate to each other and reveals their pattern of conservation or variation. Hidden Markov models (HMMs) provide a statistical means to model the probability distribution of a sequence family. Profile HMMs, in particular, represent the profiles of multiple alignments by incorporating position-specific gap penalties into the model of position-dependent character distributions. A trained HMM can then be used to discriminate between family and non-family members in a database sequence search.

To detect probable homologs of a protein sequence, one possible way is to align the query sequence to a library of profiles from known families. A score or expectation value of an alignment then gives a statistical measure of how closely the sequence is related to the particular family a profile represents. This is a sequence-profile alignment method as used in PSI-BLAST [1] or SAM-T2K [3].

An alternative approach is to use a profile-profile method. In a profile-profile method, a profile derived from the query sequence is aligned to a library of template profiles. Because more information is captured in a profile than in a single sequence, profile-profile methods are able to provide more accurate alignments and more sensitive homolog detection [8, 10, 12, 14, 2, 9]. The profile of the query sequence can be constructed from PSI-BLAST alignments. Different E-value cutoffs can be used to create the profile; cutoffs of 1e-2 and 1e-3 have been shown to perform well [13]. PSI-BLAST, which uses the standard sequence-profile method, does not recognize as many related sequences as a profile-profile method. It is, however, a decent generator of an initial profile because of its computational advantages.

The major difference among profile-profile methods lies in the way they assign a score to two profile positions. Each profile position is a vector containing the frequency of each amino acid type in that particular position. When comparing two frequency vectors, the similarity score can be calculated in different ways, such as sum-of-pairs scoring, probabilistic scoring, or measures based on information theory. Ohlson et al. have shown that several profile-profile methods perform at least 30% better than sequence-profile methods, both in detecting distantly related proteins and in producing alignments for these proteins [8].
Their conclusion that these methods are significantly better at fold recognition also agrees with earlier studies [10, 12, 14, 2, 9]. In addition, Mittelman et al. have shown that probabilistic scoring functions yield more accurate short seed alignments [7].

Profile-profile methods are becoming a standard tool in protein sequence analysis because of their ability to perform more sensitive searches and to produce better alignments. An understanding of how different methods perform should help us improve their performance.

This project examines the principles of profile-profile alignments and reviews the different scoring schemes devised by various researchers. Section 2 describes the steps and issues involved in a profile-profile method. Section 3 illustrates how hidden Markov models can be used to align two profiles. Section 4 describes four different scoring schemes. A discussion is given in Section 5. Possible future work is discussed in Section 6.

2. PROFILE-PROFILE ALIGNMENT

There are several major steps in a profile-profile alignment [10]:
(i) preparation of a multiple alignment of the homologous family of the target sequence;
(ii) construction of a position-specific numerical profile from the multiple alignment;
(iii) aligning and scoring the derived profile with the database of profiles;
(iv) statistical evaluation of the resulting alignments.

2.1 Preparation of multiple alignment

To obtain a multiple alignment, we have to consider what proteins to include in the generation of a profile. The choice of sequences has to balance two effects. The first is the inclusion of new information, which helps to increase the sensitivity in recognizing distant homologs. The second is the avoidance of errors, which lead to incorrect homology assignments. Outputs from PSI-BLAST are typically used in the generation of the profile needed. As an iterative method for sequence database searches, PSI-BLAST constructs a profile from the hits after each iteration.
The diversity of the profile sequences can be controlled by adjusting the E-value cutoffs. A higher E-value cutoff allows more distantly related sequences in the profile. In [10], Rychlewski et al. studied the correlation between the diversity of profile sequences and the sensitivity of two profile-profile methods in recognizing distant homologies. They showed that E-values between 1e-10 and 1e-3 give high recognition rates because, in this range, the sequences are undoubtedly related to the sequence for which the profile is created and yet divergent enough to yield important information about the sequence family.

2.2 Construction of numerical profile

A profile is a position-specific numerical representation of the residue distribution in a multiple sequence alignment. Each column in the multiple alignment is represented by a vector containing the frequencies of the 20 amino acid types plus the frequency of gaps at that particular position.

When constructing a profile, it is important to weight the sequences in the multiple alignment to compensate for sequence redundancy. Since closely related or even multiple entries of identical proteins are often included in database hits, the large number of redundant sequences will contribute more to the profile than the smaller number of divergent sequences, causing the loss of valuable information if a proper weighting scheme is not in use. A common remedy is to down-weight the contributions of residues from redundant sequences so that more divergent sequences have higher weights in the profile. In PSI-BLAST, the multiple alignment is purged, leaving only sequences with less than 98% identity to other sequences. Henikoff sequence weights [5] are then applied and a pseudocount method is used to estimate the number of independent observations [10].

2.3 Alignment and scoring

A profile represents a family of homologous proteins.
It can be compared to a database of protein sequences or a library of profiles. The comparison score for a given pair of positions in two profiles can be calculated in several different ways, such as sum-of-pairs scoring, probabilistic scoring, or measures based on information theory. The assignment of a score to two profile vectors should take into account two factors [8]: (1) a high score should be assigned only if the two vectors are similar; (2) even if the two vectors are similar, a high score should not be assigned if they correspond to a random distribution. In addition, the use of prior information in the form of a substitution matrix has been shown to improve performance in some profile-profile methods [8]. Section 4 discusses some of the scoring methods explored in earlier research.

2.4 Statistical evaluation of alignment results

To perform significance evaluation, some profile-profile aligners transform the alignment scores into E-values or Z-scores. For example, COMPASS (comparison of multiple protein alignments with assessment of statistical significance) [11] rescales the optimal scores of all profile pairs in the database to the extreme value distribution (EVD) to estimate E-values for the detected similarities between profiles. The Profile Comparer (PRC) [6] also estimates E-values for its observed log-odds scores by fitting them to an EVD.

3. USING HIDDEN MARKOV MODELS

Hidden Markov models (HMMs) can also be used to construct an alignment of two multiple sequence alignments. There are various published methods that perform profile-profile alignments using HMMs.

3.1 Comparison of Alignments by Constructing Hidden Markov Models (COACH)

COACH [2] aligns two multiple alignments by constructing a profile HMM from one alignment and aligning the other alignment to that HMM. The following example illustrates how COACH generates an alignment via an HMM.
Suppose we have a template alignment T from which a profile HMM is constructed. The input alignment A is the multiple sequence alignment to be aligned to the HMM. An output alignment C can be constructed by assigning columns in A to the emitter states in the HMM.

Figure 1. Example alignment of two multiple alignments via an HMM [2]

The profile HMM constructed from T has a match state for each column, as shown by the dotted arrows. The alignment of A to the HMM is done by assigning the columns of A to the emitter states in the model, as shown by the solid arrows. The resulting alignment C keeps the columns of both A and T intact. If one or more sequences in A visit an insert state, a column of gaps is inserted into T. If all sequences in A visit a delete state, a column of gaps is inserted into A.

An assignment of columns of A to the HMM uniquely determines the path that a sequence in A has to take through the HMM. An optimal alignment is one that maximizes the probability of C, which is the product of the probabilities of the paths implied by the alignment for each sequence in A. Edgar and Sjölander [2] showed that the Viterbi algorithm for finding a most probable path can be extended to handle multiple alignments.

3.2 Profile Comparer (PRC)

PRC [6] aligns two profile HMMs and assigns a score to the alignment. PRC generates a so-called pair HMM from the two profile HMMs to be aligned. States of the pair HMM are just pairs of ordinary profile HMM states (i.e. matches M, inserts I, and deletes D). A sample pair HMM topology is shown below.

Figure 2. Sample PRC pair hidden Markov model [6]

A transition in the pair HMM corresponds to simultaneous transitions in the two profile HMMs. For example, the transition $M_iD_j \to I_iM_{j+1}$ in the pair HMM is equivalent to the transitions $M_i \to I_i$ and $D_j \to M_{j+1}$ in the first and second model respectively.
Each transition probability in the pair HMM is computed as the product of the corresponding transition probabilities in the two profile HMMs. For a transition from $W_iX_j$ to $Y_kZ_l$, the probability is given by [6]:

$$P_{trans}^{pair}(W_iX_j \to Y_kZ_l) = P_{trans}^{HMM1}(W_i \to Y_k) \cdot P_{trans}^{HMM2}(X_j \to Z_l)$$

The emission probabilities for $M_iM_j$, $M_iI_j$, $I_iM_j$, and $I_iI_j$ can be calculated using different scoring schemes. PRC uses the dot product of the corresponding emission vectors [6]:

$$P_{em}^{pair}(X_iY_j) = \sum_{a \in \{A,C,D,\ldots\}} P_{em}^{HMM1}(a \mid X_i) \cdot P_{em}^{HMM2}(a \mid Y_j)$$

This dot product score can easily be interpreted as the joint emission probability and can be calculated efficiently. The other pair HMM states are silent. With the pair HMM, an alignment between two HMMs can be constructed using the Viterbi or Forward/Backward algorithms.

4. SOME SCORING METHODS

In this section, we look at four scoring methods that have been shown to produce more sensitive recognition and better alignments than PSI-BLAST [8].

4.1 Sum-of-pairs Scoring: Dot-Product

Dot-product scoring is used in many profile-profile methods including FFAS [10], a method introduced by Rychlewski et al. In FFAS, the vectors are balanced before the scores are calculated. First, each amino acid frequency $\alpha_i$ in the vector α is multiplied by the corresponding background frequency $freq_i$:

$$\alpha_i' = \alpha_i \cdot freq_i, \quad i = 1,\ldots,20$$

The background frequencies are the frequencies of the amino acids in the database. Next, the balanced vector is multiplied by a constant weight of 5 and added back to the original vector. The value 5 is based on the estimated average diversity of sequence profiles in the database.
$$\alpha_i'' = 5\alpha_i' + \alpha_i, \quad i = 1,\ldots,20$$

Finally, the fraction score for the vector is calculated and the vector of amino acid frequencies becomes a probability vector:

$$\alpha_i''' = \frac{(\alpha_i'')^2}{\sum_{j=1}^{20} (\alpha_j'')^2}$$

The dot-product score between two vectors can then be computed as:

$$score(\alpha, \beta) = \sum_{i=1}^{20} \alpha_i''' \, \beta_i'''$$

The dot-product score ranges from 0 to 1. A shift value can be added to the score when performing local alignments so that a score between two vectors is between 0 + shift and 1 + shift. The scores can also be normalized so that the standard deviation is equal to 1.

4.2 Information Theoretic Measure: prof_sim

A scoring method that uses an information theoretic measure was devised by Yona and Levitt [14] in their profile-profile comparison tool. Their profile similarity score measures the similarity between two probability distributions. The similarity score between two vectors in two profiles is defined as a combination of their statistical similarity and the significance of this similarity.

For two profile vectors α and β, the Kullback-Leibler (KL) divergence is defined as:

$$D^{KL}(\alpha, \beta) = \sum_{k=1}^{20} \alpha_k \log_2 \frac{\alpha_k}{\beta_k}$$

The Jensen-Shannon (JS) divergence is defined as:

$$D^{JS}(\alpha, \beta) = \lambda D^{KL}(\alpha, r) + (1 - \lambda) D^{KL}(\beta, r)$$

where $r = \lambda\alpha + (1 - \lambda)\beta$ can be considered as the most likely common source distribution of both distributions α and β, with λ as a prior weight. The divergence score $D^{JS}$ is symmetric and ranges from 0 to 1, where the score between two identical distributions is 0. A small $D^{JS}$ value indicates that the two profile columns are closely related and may be approximated by the common source distribution.

To assess the significance S of the similarity, the JS divergence of the common source r from the background is measured:

$$S = D^{JS}(r, freq)$$

where freq is the overall amino acid distribution in the database. This significance measure S reflects the probability that the source distribution is obtained by chance.
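
The divergence and significance measures defined above can be sketched in a few lines of Python. This is only an illustrative sketch, not the implementation used in [14]: the function names are hypothetical, the prior weight λ is assumed to be 0.5, terms with a zero frequency are assumed to contribute nothing to the KL sum, and the vectors in the usage note form a toy four-letter alphabet rather than the 20 amino acids.

```python
import math


def kl_divergence(p, q):
    """Kullback-Leibler divergence D_KL(p, q) in bits.

    Terms with p_k == 0 are skipped (assumed to contribute 0).
    """
    return sum(pk * math.log2(pk / qk) for pk, qk in zip(p, q) if pk > 0)


def common_source(p, q, lam=0.5):
    """Common source distribution r = lam * p + (1 - lam) * q."""
    return [lam * pk + (1 - lam) * qk for pk, qk in zip(p, q)]


def js_divergence(p, q, lam=0.5):
    """Jensen-Shannon divergence of p and q with prior weight lam."""
    r = common_source(p, q, lam)
    return lam * kl_divergence(p, r) + (1 - lam) * kl_divergence(q, r)


def significance(p, q, background, lam=0.5):
    """JS divergence of the common source of p and q from the background."""
    return js_divergence(common_source(p, q, lam), background, lam)
```

For example, with `alpha = [0.7, 0.1, 0.1, 0.1]`, `beta = [0.1, 0.7, 0.1, 0.1]` and a uniform background `[0.25, 0.25, 0.25, 0.25]`, `js_divergence(alpha, alpha)` is 0, while two columns with no residue type in common, such as `[1, 0, 0, 0]` and `[0, 1, 0, 0]`, reach the maximum divergence of 1. `significance(alpha, beta, background)` is small but positive because the common source of the two columns lies fairly close to the uniform background.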

Finally, the divergence score and the significance score are combined to give a single similarity score:

$$score(\alpha, \beta) = \frac{1}{2} \left(1 - D^{JS}(\alpha, \beta)\right) (1 + S)$$

Using this measure, similar distributions (D -> 0) having a common source far from the background (S -> 1) will have a similarity score close to 1. Scores for dissimilar distributions (D -> 1) close to the background distribution (S -> 0) will tend to 0. Distributions that are similar (D -> 0) but resemble the background (S -> 0) will have a score of ½. The combined similarity score reflects both divergence and significance. The similarity of two random distributions is not as significant as the similarity of two unique distributions.

4.3 Probabilistic Scoring: log_aver

Unlike the previous information theoretic measure, which measures only the similarity between two profiles but not the similarity between amino acids, the log average scoring proposed by von Öhsen et al. [12] explicitly uses the substitution matrix when scoring two frequency profiles. They extended the usual amino acid similarity score to the profile-profile situation, making the sequence-sequence score a special case of their formula. The log average score is defined as:

$$score(\alpha, \beta) = \ln \sum_{i=1}^{20} \sum_{j=1}^{20} \alpha_i \beta_j \exp\left(\frac{\ln 2}{2} \, BLOSUM62_{i,j}\right)$$

where $BLOSUM62_{i,j}$ is the value in the BLOSUM62 substitution matrix for amino acid i substituted by j. Because BLOSUM62 scores are log-odds values in half-bit units, the exponential converts each score back into an odds ratio, so the score is the logarithm of the average odds ratio over all replacements of an amino acid in the α vector by one in the β vector, weighted by their frequencies. A pair of vectors gets a high score if they have a similar distribution and are conserved. The log average score can be used directly for local alignment without any transformation or shift.

4.4 Probabilistic Scoring: prob_score

Mittelman et al.
modified the PICASSO scoring method introduced by Heger and Holm [4] and came up with a symmetric scheme called PICASSO3 [7]. The probabilistic score is defined as:

$$score(\alpha, \beta) = \sum_{i=1}^{20} \alpha_i \log \frac{\beta_i}{freq_i} + \sum_{i=1}^{20} \beta_i \log \frac{\alpha_i}{freq_i}$$

The highest score is obtained between two similar conserved positions. Positions that have a distribution similar to the background amino acid frequencies will have intermediate scores.

5. DISCUSSION

In [8], Ohlson et al. compared the four scoring methods above on their ability to recognize proteins related at different levels according to SCOP (Structural Classification of Proteins database), and on the quality of the alignments generated. The ability of the methods to separate structurally aligned residues from random residues at the family level, superfamily level, and fold level was tested. At the family level, prof_sim and prob_score show better performance. At the superfamily and fold levels, prob_score is the best method, followed by log_aver, which performs better than dot-product and prof_sim. Their study concluded that all the tested methods are significantly better at fold recognition than standard sequence-profile methods, which agrees with earlier studies [10, 12, 14, 2, 9]. The alignment quality is also better than for standard sequence-profile methods, which agrees with the result in [11].

One point to note is that the performance of a profile-profile method depends not only on the scoring method, but also on many other factors such as gap penalties, alignment methodology, and E-value calculations [8]. All these factors must also be optimized to obtain the best result. In fact, the difference in performance between the different profile-profile methods is quite small if the gap penalties are optimized separately for alignment generation and fold recognition.
However, the probabilistic scoring methods (log_aver and prob_score) do have a slight advantage as they are less sensitive to gap penalties. The two methods show good performance in both fold recognition and alignment quality using identical parameters, while the other methods have to use different parameters to obtain similar performance.

In a study evaluating different scoring methods on their ability to predict accurate short ungapped alignment segments, Mittelman et al. came to the conclusion that the log-odds based scoring methods they studied (PICASSO3 and COMPASS) and the method prof_sim perform significantly better than the sum-of-pairs, Pearson's correlation coefficient, or dot-product methods [7]. Their result suggests that the family of probabilistic methods (log-odds based methods and prof_sim) is able to provide better initial 'seeds' in the first step of a local profile-profile alignment.

One final note is that the improvement in recognition sensitivity and alignment quality comes at a cost. Profile-profile methods are, in general, much more demanding in time and memory than sequence-profile methods. They should be used when the computational cost can be justified.

6. FUTURE WORK

The remote homolog detection method currently used by SAM [3] relies on a sequence-profile comparison. The incorporation of a profile-profile aligner into the SAM suite should further improve its performance. We plan to test the different scoring methods and the possibility of incorporating a secondary structure component into the scoring system.

7. REFERENCES

[1] Altschul, S.F., Madden, T.L., Schaffer, A.A., Zhang, J., Zhang, Z., Miller, W. and Lipman, D.J. (1997) Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res., 25, 3389–3402.
[2] Edgar, R.C. and Sjölander, K. (2004) COACH: profile-profile alignment of protein families using hidden Markov models. Bioinformatics, 20(8), 1309–1318.
[3] Karplus, K., Barrett, C. and Hughey, R. (1998) Hidden Markov models for detecting remote protein homologies. Bioinformatics, 14(10), 846–856.
[4] Heger, A. and Holm, L. (2003) Exhaustive enumeration of protein domain families. J. Mol. Biol., 328, 749–767.
[5] Henikoff, S. and Henikoff, J.G. (1994) Position-based sequence weights. J. Mol. Biol., 243, 574–578.
[6] Madera, M. (currently unpublished) PRC, the Profile Comparer. http://supfam.org/PRC.
[7] Mittelman, D., Sadreyev, R. and Grishin, N. (2003) Probabilistic scoring measures for profile–profile comparison yield more accurate short seed alignments. Bioinformatics, 19, 1531–1539.
[8] Ohlson, T., Wallner, B. and Elofsson, A. (2004) Profile-profile methods provide improved fold-recognition: a study of different profile-profile alignment methods. Proteins: Structure, Function, and Bioinformatics, 57, 188–197.
[9] Pei, J., Sadreyev, R. and Grishin, N.V. (2003) PCMA: fast and accurate multiple sequence alignment based on profile consistency. Bioinformatics, 19, 427–428.
[10] Rychlewski, L., Jaroszewski, L., Li, W. and Godzik, A. (2000) Comparison of sequence profiles, strategies for structural predictions using sequence information. Protein Sci., 9, 232–241.
[11] Sadreyev, R. and Grishin, N. (2003) COMPASS: a tool for comparison of multiple protein alignments with assessment of statistical significance. J. Mol. Biol., 326, 317–336.
[12] von Öhsen, N., Sommer, I. and Zimmer, R. (2003) Profile–profile alignment, a powerful tool for protein structure prediction. Proc. Pacific Symp. Biocomp., 252–263.
[13] Wallner, B., Fang, H., Ohlson, T., Frey-Skott, J. and Elofsson, A. (2004) Using evolutionary information for the query and target improves fold recognition. Proteins, 54, 342–350.
[14] Yona, G. and Levitt, M. (2002) Within the twilight zone: a sensitive profile–profile comparison tool based on information theory. J. Mol. Biol., 315, 1257–1275.