BME 230 – Winter 2005
Proposal abstract rewrite
Wing Wong

Profile-profile alignment using hidden Markov models

Proteins, RNAs, and other functional units in genomes are typically classified into families of related sequences and structures. A multiple alignment of sequences from the same family shows how they relate to each other and reveals their pattern of conservation and variation. Hidden Markov models (HMMs) provide a statistical means of modeling the probability distribution of a sequence family. Profile HMMs, in particular, represent the profiles of multiple alignments by incorporating position-specific gap penalties into the model of position-dependent character distributions. A trained HMM can then be used to discriminate between family and non-family members in a database sequence search.

One way to detect probable homologs of a query sequence is to align the query to profiles built from known families. The score or expectation value of an alignment then gives a statistical measure of how closely the sequence is related to the family that the profile represents. This is the sequence-profile alignment method used in PSI-BLAST (Altschul et al., 1997) and SAM-T2K (Karplus et al., 1998). In a profile-profile method, a profile built from the query sequence is aligned to template profiles. Because a profile captures more information than a single sequence, profile-profile methods can provide more accurate alignments and more sensitive homolog detection.

The profile of the query sequence can be constructed from PSI-BLAST alignments. Different E-value cutoffs can be used to build the profile; cutoffs of 10⁻² and 10⁻³ have been shown to perform well (Wallner et al., 2004). PSI-BLAST, which uses the standard sequence-profile method, does not recognize as many related sequences as a profile-profile method.
It is, however, a reasonable generator of an initial profile because of its computational advantages.

The major difference among profile-profile methods lies in how they score a pair of profile positions. Each profile position is a vector containing the frequency of each amino acid type at that position. The similarity of two frequency vectors can be calculated in different ways, including sum-of-pairs scoring, probabilistic scoring, and measures based on information theory.

Ohlson et al. have shown that several profile-profile methods perform at least 30% better than sequence-profile methods, both in detecting distantly related proteins and in producing alignments for these proteins (Ohlson et al., 2004). Their conclusion that these methods are significantly better at fold recognition agrees with earlier studies (Rychlewski et al., 2000; von Öhsen et al., 2003; Yona and Levitt, 2002; Edgar and Sjölander, 2004; Pei et al., 2003). In addition, Mittelman et al. have shown that probabilistic scoring functions yield more accurate short seed alignments (Mittelman et al., 2003).

Together, these studies suggest that profile-profile methods should become the standard tool, given their ability to perform more sensitive searches and produce better alignments. Understanding how the different methods perform should help improve them. This project surveys the different methods for profile-profile alignment, in particular their scoring schemes, and will study the effect of those schemes on the performance of a profile-profile aligner.

References

Altschul,S.F., Madden,T.L., Schaffer,A.A., Zhang,J., Zhang,Z., Miller,W. and Lipman,D.J. (1997) Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res., 25, 3389–3402.

Edgar,R.C. and Sjölander,K.
(2004) COACH: profile-profile alignment of protein families using hidden Markov models. Bioinformatics, 20(8), 1309–1318.

Karplus,K., Barrett,C. and Hughey,R. (1998) Hidden Markov models for detecting remote protein homologies. Bioinformatics, 14(10), 846–856.

Mittelman,D., Sadreyev,R. and Grishin,N. (2003) Probabilistic scoring measures for profile–profile comparison yield more accurate short seed alignments. Bioinformatics, 19, 1531–1539.

Ohlson,T., Wallner,B. and Elofsson,A. (2004) Profile-profile methods provide improved fold-recognition: a study of different profile-profile alignment methods. Proteins: Structure, Function, and Bioinformatics, 57, 188–197.

Pei,J., Sadreyev,R. and Grishin,N.V. (2003) PCMA: fast and accurate multiple sequence alignment based on profile consistency. Bioinformatics, 19, 427–428.

Rychlewski,L., Jaroszewski,L., Li,W. and Godzik,A. (2000) Comparison of sequence profiles, strategies for structural predictions using sequence information. Protein Sci., 9, 232–241.

von Öhsen,N., Sommer,I. and Zimmer,R. (2003) Profile–profile alignment, a powerful tool for protein structure prediction. Proc. Pacific Symp. Biocomp., 252–263.

Wallner,B., Fang,H., Ohlson,T., Frey-Skott,J. and Elofsson,A. (2004) Using evolutionary information for the query and target improves fold recognition. Proteins, 54, 342–350.

Yona,G. and Levitt,M. (2002) Within the twilight zone: a sensitive profile–profile comparison tool based on information theory. J. Mol. Biol., 315, 1257–1275.

February 15, 2005
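
Illustrative note on position scoring. The proposal names sum-of-pairs scoring and information-theoretic measures as two ways to compare a pair of profile positions. The minimal Python sketch below shows what each could look like for two frequency vectors. It rests on several assumptions that are not part of the proposal: a toy three-letter alphabet instead of the 20 amino acids, a hypothetical +1/-1 match/mismatch matrix instead of a real substitution matrix, the Jensen-Shannon divergence as one possible information-theoretic measure, and function names chosen only for illustration. It is not the scoring function of any of the cited tools.

```python
import math

# Toy alphabet and a hypothetical match/mismatch matrix (assumptions for
# illustration only; a real aligner would use 20 amino acids and a
# substitution matrix such as one from the BLOSUM family).
ALPHABET = "ALK"
SUBST = {a: {b: (1.0 if a == b else -1.0) for b in ALPHABET} for a in ALPHABET}


def sum_of_pairs(p, q, subst):
    """Expected substitution score between two columns:
    sum over residues a, b of p[a] * q[b] * subst[a][b]."""
    return sum(p[a] * q[b] * subst[a][b] for a in p for b in q)


def jensen_shannon(p, q):
    """Jensen-Shannon divergence in bits between two columns defined over
    the same residues: 0 for identical columns, 1 for non-overlapping ones."""
    def kl(x, m):
        # Terms with x[a] == 0 contribute nothing and are skipped.
        return sum(x[a] * math.log2(x[a] / m[a]) for a in x if x[a] > 0)

    m = {a: 0.5 * (p[a] + q[a]) for a in p}
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


# Three example profile columns (frequencies sum to 1 in each).
col_a = {"A": 0.5, "L": 0.5, "K": 0.0}
col_b = {"A": 0.5, "L": 0.5, "K": 0.0}
col_c = {"A": 0.0, "L": 0.0, "K": 1.0}

if __name__ == "__main__":
    for name, other in (("col_b", col_b), ("col_c", col_c)):
        print(name,
              "sum-of-pairs:", sum_of_pairs(col_a, other, SUBST),
              "JS divergence:", jensen_shannon(col_a, other))
```

The two measures point in opposite directions: a higher sum-of-pairs value means more similar columns, whereas a lower divergence does, so a divergence has to be converted into a similarity before it can serve as an alignment score. Under this toy matrix, two identical mixed columns (col_a and col_b) receive a sum-of-pairs score of zero even though their divergence is zero, which illustrates why the choice of scoring scheme can change aligner behavior.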