BME 230 – Winter 2005
Proposal abstract rewrite
Wing Wong
Profile-profile alignment using hidden Markov models
Proteins, RNAs, and other functional units in genomes are typically classified into families
of related sequences and structures. A multiple alignment of sequences from the same
family shows how they relate to each other and reveals their pattern of conservation or
variation. Hidden Markov models (HMMs) provide a statistical means to model the
probability distribution of a sequence family. Profile HMMs, in particular, represent the
profiles of multiple alignments by incorporating position-specific gap penalties into the
model of position-dependent character distributions. A trained HMM can then be used to
discriminate between family and non-family members in a database sequence search.
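As a minimal sketch of the position-dependent character distributions mentioned above, the following Python code turns a toy multiple alignment into one frequency vector per column. It covers only the per-column frequencies, not the transition (gap) probabilities of a full profile HMM. The function name, the toy alignment, and the uniform pseudocount are illustrative assumptions, not part of any published tool.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def column_frequencies(alignment, pseudocount=1.0):
    """Return one residue-frequency dict per alignment column.

    Gap characters are skipped when counting. A uniform pseudocount
    (an assumption of this sketch) keeps unseen residues above zero.
    """
    length = len(alignment[0])
    if any(len(row) != length for row in alignment):
        raise ValueError("all aligned sequences must have the same length")
    profile = []
    for col in range(length):
        # Count only the standard amino acids observed in this column.
        counts = Counter(row[col] for row in alignment if row[col] in AMINO_ACIDS)
        total = sum(counts.values()) + pseudocount * len(AMINO_ACIDS)
        profile.append(
            {aa: (counts[aa] + pseudocount) / total for aa in AMINO_ACIDS}
        )
    return profile


# Hypothetical three-sequence alignment; '-' marks a gap.
toy_alignment = ["ACD-", "ACE-", "GCDK"]
profile = column_frequencies(toy_alignment)
```

Each entry of `profile` sums to one, and the fully conserved second column gives its highest frequency to C.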
One way to detect probable homologs of a query sequence is to align the query sequence
to profiles from known families. The score or expectation value of an alignment then
gives a statistical measure of how closely the sequence is related to the family that the
profile represents. This is the sequence-profile approach, as used in PSI-BLAST
(Altschul et al., 1997) or SAM-T2K (Karplus et al., 1998).
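To illustrate how a sequence can be scored against a profile, the sketch below sums per-position log-odds ratios for an ungapped placement of the sequence on the profile. This is a simplification under stated assumptions: real tools such as PSI-BLAST also handle gaps and convert scores to expectation values, and the two-letter alphabet, the numbers, and the function name here are hypothetical.

```python
import math


def log_odds_score(sequence, profile, background):
    """Sum log2(p_i(a) / q(a)) over positions of an ungapped alignment.

    profile[i][a] is the probability of residue a at position i and
    background[a] is its probability under a null model.
    """
    if len(sequence) != len(profile):
        raise ValueError("sequence and profile must have the same length")
    return sum(
        math.log2(column[residue] / background[residue])
        for residue, column in zip(sequence, profile)
    )


# Toy two-letter alphabet, chosen only to keep the example short.
toy_profile = [{"A": 0.8, "C": 0.2}, {"A": 0.1, "C": 0.9}]
toy_background = {"A": 0.5, "C": 0.5}

family_like = log_odds_score("AC", toy_profile, toy_background)
unrelated = log_odds_score("CA", toy_profile, toy_background)
```

A positive score means the sequence is more probable under the profile than under the background model; here `family_like` is positive and `unrelated` is negative.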
In a profile-profile method, a profile from the query sequence is aligned to template
profiles. Because more information is captured in a profile than in a single sequence,
profile-profile methods are able to provide more accurate alignments and more sensitive
homolog detection. The profile of the query sequence can be constructed from
PSI-BLAST alignments. Different E-value cutoffs can be used to create the profile;
cutoffs of 10^-2 and 10^-3 have been shown to perform well (Wallner et al.,
2004). PSI-BLAST, which uses the standard sequence-profile method, does not recognize
as many related sequences as a profile-profile method. It is, however, a decent generator
of an initial profile because of its computational advantages.
The major difference among profile-profile methods lies in the way they assign a score to
two profile positions. Each profile position is a vector containing the frequency of each
amino acid type in that particular position. When comparing two frequency vectors, the
similarity score can be calculated in different ways, including sum-of-pairs scoring,
probabilistic scoring, and measures based on information theory. Ohlson et al. have
shown that several profile-profile methods perform at least 30% better than
sequence-profile methods, both in detecting distantly related proteins and in producing
alignments for these proteins (Ohlson et al., 2004). Their conclusion that these methods are
significantly better at fold recognition also agrees with earlier studies (Rychlewski et al.,
2000; von Öhsen et al., 2003; Yona and Levitt, 2002; Edgar and Sjölander, 2004; Pei et
al., 2003). In addition, Mittelman et al. have shown that probabilistic scoring functions
yield more accurate short seed alignments (Mittelman et al., 2003).
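The three families of column scores named above can be sketched in Python as follows. These are generic textbook forms under stated assumptions, not the exact functions evaluated in the cited studies: a dot product stands in for a simple probabilistic score, the sum-of-pairs score is the expected substitution score over both columns, and the Jensen-Shannon divergence is one example of an information-theoretic measure. The three-letter alphabet and the toy substitution matrix are hypothetical.

```python
import math


def dot_product_score(p, q):
    """Probability that both columns emit the same residue."""
    return sum(a * b for a, b in zip(p, q))


def sum_of_pairs_score(p, q, subst):
    """Expected substitution score: sum_i sum_j p[i] * q[j] * subst[i][j]."""
    return sum(
        p[i] * q[j] * subst[i][j]
        for i in range(len(p))
        for j in range(len(q))
    )


def jensen_shannon_divergence(p, q):
    """Divergence in bits: 0 for identical columns, 1 for disjoint ones."""
    mean = [(a + b) / 2 for a, b in zip(p, q)]

    def kl(x, y):
        # Terms with x_i == 0 contribute nothing to the KL divergence.
        return sum(xi * math.log2(xi / yi) for xi, yi in zip(x, y) if xi > 0)

    return 0.5 * kl(p, mean) + 0.5 * kl(q, mean)


# Toy frequency vectors over a three-letter alphabet.
col_a = [0.7, 0.2, 0.1]
col_b = [0.6, 0.3, 0.1]   # similar to col_a
col_c = [0.1, 0.1, 0.8]   # dissimilar to col_a

# Toy substitution matrix: +1 for a match, -1 for a mismatch.
toy_subst = [[1, -1, -1], [-1, 1, -1], [-1, -1, 1]]
```

All three functions rank `col_b` as closer to `col_a` than `col_c` is, but on different scales. Note that the divergence is a distance, so a similarity score would use something like one minus its value.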
Together, these studies indicate that profile-profile methods should become the standard
tool, because they search more sensitively and produce better alignments. Understanding
how the different methods perform should help improve them. This project surveys the
different methods for profile-profile alignment and
in particular the different scoring schemes. Their effect on the performance of a
profile-profile aligner will be studied.
References
Altschul,S.F., Madden,T.L., Schaffer,A.A., Zhang,J., Zhang,Z., Miller,W. and
Lipman,D.J. (1997) Gapped BLAST and PSI-BLAST: a new generation of protein
database search programs. Nucleic Acids Res., 25, 3389–3402.
Edgar,R.C. and Sjölander,K. (2004) COACH: profile–profile alignment of protein
families using hidden Markov models. Bioinformatics, 20(8), 1309–1318.
Karplus,K., Barrett,C. and Hughey,R. (1998) Hidden Markov models for detecting
remote protein homologies. Bioinformatics, 14(10), 846–856.
Mittelman,D., Sadreyev,R. and Grishin,N. (2003) Probabilistic scoring measures for
profile–profile comparison yield more accurate short seed alignments. Bioinformatics,
19, 1531–1539.
Ohlson,T., Wallner,B. and Elofsson,A. (2004) Profile–profile methods provide improved
fold-recognition: a study of different profile–profile alignment methods. Proteins:
Structure, Function, and Bioinformatics, 57, 188–197.
Pei,J., Sadreyev,R. and Grishin,N.V. (2003) PCMA: fast and accurate multiple sequence
alignment based on profile consistency. Bioinformatics, 19, 427–428.
Rychlewski,L., Jaroszewski,L., Li,W. and Godzik,A. (2000) Comparison of sequence
profiles, strategies for structural predictions using sequence information. Protein Sci., 9,
232–241.
von Öhsen,N., Sommer,I. and Zimmer,R. (2003) Profile–profile alignment, a powerful
tool for protein structure prediction. Proc. Pacific Symp. Biocomp., 252–263.
Wallner,B., Fang,H., Ohlson,T., Frey-Skott,J., and Elofsson,A. (2004) Using
evolutionary information for the query and target improves fold recognition. Proteins, 54,
342–350.
Yona,G. and Levitt,M. (2002) Within the twilight zone: a sensitive profile–profile
comparison tool based on information theory. J. Mol. Biol., 315, 1257–1275.
February 15, 2005