BME230 Winter 2005
Project report
Wing Wong
Profile-profile alignment using hidden Markov models
Wing Wong
Computer Science Department
San Jose State University
San Jose, CA 95192
408-924-1000
sjsuwingwong@yahoo.com
Abstract
To detect distantly related proteins, the use of evolutionary information contained in a
multiple sequence alignment generally provides better results than the use of a single query
sequence. Profile-profile methods generate an alignment of two multiple sequence
alignments. They have been shown to produce more accurate alignments and more
sensitive homolog detection. The four steps in a profile-profile alignment are: (i)
preparation of multiple alignments; (ii) construction of numerical profiles from the
multiple alignments; (iii) aligning and scoring; (iv) statistical evaluation of the resulting
alignments. Besides the classical dynamic programming approach, hidden Markov
models can also be used in the alignment process. Different scoring schemes have been
proposed by various researchers and the family of log-odds based methods is shown to
produce slightly better performance. It is our aim to incorporate a profile-profile aligner
into the SAM suite with different profile-profile scoring options.
1. INTRODUCTION
Proteins, RNAs, and other functional units in genomes are typically classified into families
of related sequences and structures. A multiple alignment of sequences from the same
family shows how they relate to each other and reveals their pattern of conservation or
variation. Hidden Markov models (HMMs) provide a statistical means to model the
probability distribution of a sequence family. Profile HMMs, in particular, represent the
profiles of multiple alignments by incorporating position-specific gap penalties into the
model of position-dependent character distributions. A trained HMM can then be used to
discriminate between family and non-family members in a database sequence search.
To detect probable homologs of a protein sequence, one possible way is to align the
query sequence to a library of profiles from known families. A score or expectation value
of an alignment then gives a statistical measure of how closely the sequence is related to
the particular family a profile represents. This is a sequence-profile alignment method as
used in PSI-BLAST [1] or SAM-T2K [3].
An alternative approach is to use a profile-profile method. In a profile-profile method, a
profile derived from the query sequence is aligned to a library of template profiles.
Because more information is captured in a profile than in a single sequence, profile-profile
methods are able to provide more accurate alignments and more sensitive
homolog detection [8, 10, 12, 14, 2, 9]. The profile of the query sequence can be
constructed from PSI-BLAST alignments. Different E-value cutoffs can be used to create
the profile; cutoffs of 1e-2 and 1e-3 have been shown to produce good performance
[13]. PSI-BLAST, which uses the standard sequence-profile method, does not recognize
as many related sequences as a profile-profile method. It is, however, a decent generator
of an initial profile because of its computational advantages.
The major difference among profile-profile methods lies in the way they assign a score to
two profile positions. Each profile position is a vector containing the frequency of each
amino acid type in that particular position. When comparing two frequency vectors, the
similarity score can be calculated in different ways, such as sum-of-pairs scoring,
probabilistic based scoring, or measures based on information theory. Ohlson et al. have
shown that several profile-profile methods perform at least 30% better than sequence-profile
methods, both in detecting distantly related proteins and in producing alignments
for these proteins [8]. Their conclusion that these methods are significantly better at fold
recognition also agrees with earlier studies [10, 12, 14, 2, 9]. In addition, Mittelman et al.
have shown that probabilistic scoring functions yield more accurate short seed alignments
[7].
Profile-profile methods are becoming a standard tool in protein sequence analysis
because of their ability to perform more sensitive searches and to produce better
alignments. An understanding of how different methods perform should help us improve
their performance. This project examines the principles of profile-profile alignments and
reviews the different scoring schemes devised by various researchers. Section 2 describes
the steps and issues involved in a profile-profile method. Section 3 illustrates how hidden
Markov models can be used to align two profiles. Section 4 describes four different
scoring schemes. A discussion is given in Section 5, and possible future work is
discussed in Section 6.
2. PROFILE-PROFILE ALIGNMENT
There are several major steps in a profile-profile alignment [10]: (i) preparation of a
multiple alignment of the homologous family of the target sequence; (ii) construction of a
position specific numerical profile from the multiple alignment; (iii) aligning and scoring
the derived profile with the database of profiles; (iv) statistical evaluation of the resulting
alignments.
2.1 Preparation of multiple alignment
To obtain a multiple alignment, we have to consider what proteins to include in the
generation of a profile. The choice of sequences has to balance two effects. The first is
the inclusion of new information, which helps to increase the sensitivity in recognizing
distant homologs. The second is the avoidance of errors, which leads to incorrect
homology assignments. Outputs from PSI-BLAST are typically used in the generation of
the profile needed. As an iterative method for sequence database searches, PSI-BLAST
constructs a profile from the hits after each iteration. The diversity of the profile
sequences can be controlled by adjusting the e-value cutoffs. A higher e-value cutoff
allows more distantly related sequences in the profile. In [10], Rychlewski et al. studied
the correlation between the diversity of profile sequences and the sensitivity of two
profile-profile methods in recognizing distant homologies. They showed that e-values
between 1e-10 and 1e-3 give high recognition rates because, in this range, the sequences are
undoubtedly related to the sequence for which the profile is created and yet divergent
enough to yield important information about the sequence family.
2.2 Construction of numerical profile
A profile is a position-specific numerical representation of the residue distribution in a
multiple sequence alignment. Each column in the multiple alignment is represented by a
vector containing the frequencies of the 20 amino acid types plus the frequency of gap at
that particular position. When constructing a profile, it is important to weight the
sequences in the multiple alignment to compensate for sequence redundancy. Since
closely related or even multiple entries of identical proteins are often included in database
hits, the large number of redundant sequences will contribute more to the profile than the
smaller number of divergent sequences, causing the loss of valuable information if a
proper weighting scheme is not in use. A common remedy is to down-weight the
contributions of residues from redundant sequences so that more divergent sequences
have higher weights in the profile. In PSI-BLAST, the multiple alignment is purged
leaving only sequences with less than 98% identity to other sequences. Henikoff
sequence weights [5] are then applied and a pseudo count method is used to estimate the
number of independent observations [10].
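The weighting idea above can be illustrated with a minimal Python sketch of position-based sequence weights in the style of Henikoff and Henikoff [5], followed by a weighted frequency profile. It is a simplified illustration, not the PSI-BLAST implementation: the function names are hypothetical, gaps are simply skipped when computing weights, and no purging or pseudocounts are applied.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def henikoff_weights(alignment):
    """Position-based sequence weights, normalized to sum to 1.

    In each column a sequence receives 1 / (k * n), where k is the number of
    distinct residue types in the column and n is the number of sequences that
    share this sequence's residue. Skipping gaps is a simplifying assumption.
    """
    weights = [0.0] * len(alignment)
    for column in zip(*alignment):
        counts = Counter(c for c in column if c != "-")
        k = len(counts)
        for s, c in enumerate(column):
            if c != "-":
                weights[s] += 1.0 / (k * counts[c])
    total = sum(weights)
    return [w / total for w in weights]


def weighted_profile(alignment, weights):
    """One dict per column: weighted frequency of each residue and of the gap."""
    profile = []
    for column in zip(*alignment):
        freqs = {a: 0.0 for a in AMINO_ACIDS + "-"}
        for c, w in zip(column, weights):
            freqs[c] += w
        profile.append(freqs)
    return profile


# Toy alignment: two identical sequences and one divergent sequence.
msa = ["ACDA", "ACDA", "ACEG"]
weights = henikoff_weights(msa)
profile = weighted_profile(msa, weights)
```

In this toy alignment the divergent third sequence receives a larger weight than either of the two identical sequences, which is the intended effect of down-weighting redundancy.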
2.3 Alignment and scoring
A profile represents a family of homologous proteins. It can be compared to a database of
protein sequences or a library of profiles. The comparison score for a given pair of
positions in two profiles can be calculated in several different ways, such as sum-of-pairs
scoring, probabilistic scoring, or measures based on information theory. The assignment
of a score to two profile vectors should take into account two factors [8]: (1) a high score
should be assigned only if the two vectors are similar; (2) even if the two vectors are
similar, if they correspond to a random distribution, a high score should not be assigned.
In addition, the use of prior information in the form of a substitution matrix has been
shown to improve performance in some profile-profile methods [8]. Section 4 discusses
some of the scoring methods explored in earlier research.
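To make the aligning step concrete, the following Python sketch runs a Smith-Waterman style local alignment over profile columns, with the column comparison score passed in as a function. It is only a sketch under stated assumptions: the linear gap penalty, the three-letter toy alphabet, and the shifted dot-product score are illustrative choices, and the names are hypothetical rather than taken from any published aligner.

```python
def local_profile_alignment(profile_a, profile_b, column_score, gap=1.0):
    """Local alignment of two profiles with a linear gap penalty.

    Returns the best score and the list of aligned column index pairs.
    """
    n, m = len(profile_a), len(profile_b)
    H = [[0.0] * (m + 1) for _ in range(n + 1)]
    best, best_pos = 0.0, (0, 0)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diag = H[i - 1][j - 1] + column_score(profile_a[i - 1], profile_b[j - 1])
            H[i][j] = max(0.0, diag, H[i - 1][j] - gap, H[i][j - 1] - gap)
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)

    # Trace back from the best cell until a zero cell is reached.
    pairs = []
    i, j = best_pos
    while i > 0 and j > 0 and H[i][j] > 0:
        diag = H[i - 1][j - 1] + column_score(profile_a[i - 1], profile_b[j - 1])
        if H[i][j] == diag:
            pairs.append((i - 1, j - 1))
            i, j = i - 1, j - 1
        elif H[i][j] == H[i - 1][j] - gap:
            i -= 1
        else:
            j -= 1
    pairs.reverse()
    return best, pairs


def toy_score(col_a, col_b, shift=-0.5):
    """Dot product of two frequency vectors plus a shift (illustrative only)."""
    return sum(x * y for x, y in zip(col_a, col_b)) + shift


# Columns are frequency vectors over a hypothetical three-letter alphabet.
profile_a = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
profile_b = [[0, 0, 1], [1, 0, 0], [0, 1, 0], [0, 0, 1]]
best, pairs = local_profile_alignment(profile_a, profile_b, toy_score)
```

Any of the column scores discussed in Section 4 could be substituted for `toy_score`, provided unrelated columns score below zero on average so that local alignments terminate.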
2.4 Statistical evaluation of alignment results
To perform significance evaluation, some profile-profile aligners transform the alignment
scores into E-values or Z-scores. For example, the comparison of multiple protein
alignments with assessment of statistical significance (COMPASS) [11] rescales the
optimal scores of all profile pairs in the database to the extreme value distribution (EVD)
to estimate E-values for the detected similarities between profiles. The Profile Comparer
(PRC) [6] also estimates E-values for its observed log-odds scores by fitting them to an
EVD.
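As a rough illustration of the EVD-based evaluation described above, the sketch below fits a Gumbel distribution to a set of background scores by the method of moments and converts a score into an E-value. This is a generic textbook procedure and an assumption on our part; COMPASS and PRC use their own fitting procedures, and the background scores and database size shown here are made up.

```python
import math
import statistics

EULER_GAMMA = 0.5772156649015329


def fit_gumbel_moments(scores):
    """Estimate the Gumbel location (mu) and scale (beta) by the method of moments."""
    beta = statistics.stdev(scores) * math.sqrt(6) / math.pi
    mu = statistics.mean(scores) - EULER_GAMMA * beta
    return mu, beta


def e_value(score, mu, beta, n_comparisons):
    """Expected number of unrelated comparisons scoring at least `score`."""
    # P(S >= score) = 1 - exp(-exp(-(score - mu) / beta)); expm1 keeps precision
    # when the tail probability is very small.
    p_value = -math.expm1(-math.exp(-(score - mu) / beta))
    return n_comparisons * p_value


# Made-up scores of a query profile against unrelated profiles.
background = [10.2, 11.5, 9.8, 12.1, 10.9, 11.0, 13.4, 10.1, 9.5, 11.8]
mu, beta = fit_gumbel_moments(background)
e_high = e_value(20.0, mu, beta, n_comparisons=1000)
e_typical = e_value(11.0, mu, beta, n_comparisons=1000)
```

A score far in the right tail of the fitted distribution yields a small E-value, while a score near the bulk of the background yields an E-value close to the number of comparisons.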
3. USING HIDDEN MARKOV MODELS
Hidden Markov models (HMMs) can also be used to construct an alignment of two
multiple sequence alignments. There are various published methods that perform profile-profile alignments using HMMs.
3.1 Comparison of Alignments by Constructing Hidden Markov Models (COACH)
COACH [2] aligns two multiple alignments by constructing a profile HMM from one
alignment and aligning the other alignment to that HMM. The following example
illustrates how COACH generates an alignment via an HMM. Suppose we have a
template alignment T from which a profile HMM is constructed. The input alignment A is
the multiple sequence alignment to be aligned to the HMM. An output alignment C can
be constructed by assigning columns in A to the emitter states in the HMM.
Figure 1. Example alignment of two multiple alignments via an HMM [2]
The profile HMM constructed from T has a match state for each column, as shown by the
dotted arrows. The alignment of A to the HMM is done by assigning the columns of A to
the emitter states in the model, as shown by the solid arrows. The resulting alignment C
keeps both the columns of A and T intact. If one or more sequences in A visit an insert
state, a column of gaps is inserted into T. If all sequences in A visit a delete state, a
column of gaps is inserted into A. An assignment of columns of A to the HMM uniquely
determines the path that a sequence in A has to take through the HMM. An optimal
alignment is one that maximizes the probability of C, which is the product of the
probabilities of the paths implied by the alignment for each sequence in A. Edgar and
Sjölander [2] showed that the Viterbi algorithm to find a most probable path can be
extended to handle multiple alignments.
3.2 Profile Comparer (PRC)
PRC [6] aligns two profile HMMs and assigns a score to the alignment. PRC generates a
so-called pair HMM from the two profile HMMs to be aligned. States of the pair HMM
are just pairs of ordinary profile HMM states (i.e. matches M, inserts I, and deletes D). A
sample pair HMM topology is shown below.
Figure 2. Sample PRC pair hidden Markov model [6]
A transition in the pair HMM corresponds to the simultaneous transitions in the two
profile HMMs. For example, the transition M_i D_j -> I_i M_{j+1} in the pair HMM is equivalent
to the transitions M_i -> I_i and D_j -> M_{j+1} in the first and second model respectively. Each
transition probability in the pair HMM is computed as the product of the corresponding
transition probabilities in the two profile HMMs. For a transition from W_i X_j to Y_k Z_l, the
probability is given by [6]:
$$P_{trans}^{pair}(W_i X_j \to Y_k Z_l) = P_{trans}^{HMM1}(W_i \to Y_k) \, P_{trans}^{HMM2}(X_j \to Z_l).$$
The emission probabilities for M_i M_j, M_i I_j, I_i M_j, and I_i I_j can be calculated using different
scoring schemes. PRC uses the dot product of the corresponding emission vectors [6]:
$$P_{em}^{pair}(X_i Y_j) = \sum_{a \in \{A,C,D,\ldots\}} P_{em}^{HMM1}(a \mid X_i) \, P_{em}^{HMM2}(a \mid Y_j).$$
This dot product score can easily be interpreted as the joint emission probability and be
efficiently calculated. The other pair HMM states are silent. With the pair HMM, an
alignment between two HMMs can be constructed using the Viterbi or
Forward/Backward algorithms.
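The two pair-HMM quantities described in this section can be sketched in a few lines of Python. The dictionary layout, state labels, and probability values below are hypothetical and chosen only for illustration; this is not PRC's actual data structure or code.

```python
def pair_transition(trans_1, trans_2, w, y, x, z):
    """Probability of the pair-HMM transition (W, X) -> (Y, Z).

    Computed as the product of the transition W -> Y in the first profile HMM
    and the transition X -> Z in the second profile HMM.
    """
    return trans_1[(w, y)] * trans_2[(x, z)]


def pair_emission(emit_1, emit_2):
    """Dot product of two emission distributions over the same alphabet."""
    return sum(p * emit_2.get(residue, 0.0) for residue, p in emit_1.items())


# Hypothetical transition tables, keyed by (from_state, to_state).
trans_hmm1 = {("M1", "I1"): 0.1}
trans_hmm2 = {("D1", "M2"): 0.8}
p_trans = pair_transition(trans_hmm1, trans_hmm2, "M1", "I1", "D1", "M2")

# Hypothetical emission distributions over a two-letter alphabet.
p_emit = pair_emission({"A": 0.5, "C": 0.5}, {"A": 0.25, "C": 0.75})
```

The emission value can be read as the probability that the two states independently emit the same residue, which is why it is largest when both distributions are concentrated on the same residue.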
4. SOME SCORING METHODS
In this section, we look at four scoring methods that have been shown to produce more
sensitive recognition and better alignments than PSI-BLAST [8].
4.1 Sum-of-pairs Scoring: Dot-Product
Dot-product scoring is used in many profile-profile methods including the FFAS [10], a
method introduced by Rychlewski et al. In FFAS, the vectors are balanced before the
scores are calculated. First, each amino acid frequency α_i in the vector α is multiplied by
the corresponding background frequency freq_i:
$$\alpha_i' = \alpha_i \cdot freq_i, \quad i = 1,\ldots,20$$
The background frequencies are the frequencies of the amino acids in the database. Next,
the balanced vector is multiplied by a constant weight of 5 and added back to the original
vector. The value 5 is based on the estimated average diversity of sequence profiles in the
database.
$$\alpha_i'' = 5\alpha_i' + \alpha_i, \quad i = 1,\ldots,20$$
Finally, the fraction score for the vector is calculated and the vector of amino acid
frequencies becomes a probability vector:
$$\alpha_i''' = \frac{(\alpha_i'')^2}{\sum_{j=1}^{20} (\alpha_j'')^2}$$
The dot-product score between two vectors can then be computed as:
$$score(\alpha, \beta) = \sum_{i=1}^{20} \alpha_i''' \, \beta_i'''$$
The dot-product score ranges from 0 to 1. A shift value can be added to the score when
performing local alignments so that a score between two vectors is between 0 + shift and
1 + shift. The scores can also be normalized so that the standard deviation is equal to 1.
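The balancing steps described above can be written out as a short Python sketch. It follows one reading of the text: multiply by the background frequency, add five times the result back to the original vector, then square and normalize so the entries sum to 1. The function names, the four-letter toy alphabet, and the uniform background are assumptions made for illustration, and this is not the FFAS source code.

```python
def ffas_balance(column, background, weight=5.0):
    """Balance a frequency vector and turn it into a probability vector."""
    balanced = [a * f for a, f in zip(column, background)]
    mixed = [weight * b + a for b, a in zip(balanced, column)]
    squares = [m * m for m in mixed]
    total = sum(squares)  # assumes the column has at least one nonzero entry
    return [s / total for s in squares]


def dot_product_score(col_a, col_b, background, shift=0.0):
    """Dot product of two balanced vectors, plus an optional shift."""
    a = ffas_balance(col_a, background)
    b = ffas_balance(col_b, background)
    return sum(x * y for x, y in zip(a, b)) + shift


# Toy example over a hypothetical four-letter alphabet with uniform background.
background = [0.25, 0.25, 0.25, 0.25]
same = dot_product_score([1, 0, 0, 0], [1, 0, 0, 0], background)
different = dot_product_score([1, 0, 0, 0], [0, 1, 0, 0], background)
```

Under this reading, two columns fully conserved on the same residue reach the maximum score of 1, and two columns conserved on different residues score 0, before any shift is applied.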
4.2 Information Theoretic Measure: prof_sim
A scoring method that uses an information theoretic measure is devised by Yona and
Levitt [14] in their profile-profile comparison tool. Their profile similarity score
measures the similarity between two probability distributions. The similarity score
between two vectors in two profiles is defined as a combination of their statistical
similarity and the significance of this similarity.
For two profile vectors α and β, the Kullback-Leibler (KL) divergence is defined as:
$$D_{KL}(\alpha, \beta) = \sum_{k=1}^{20} \alpha_k \log_2 \frac{\alpha_k}{\beta_k}$$
The Jensen-Shannon divergence is defined as:
$$D_{JS}(\alpha, \beta) = \lambda D_{KL}(\alpha, r) + (1 - \lambda) D_{KL}(\beta, r)$$
where r = λα + (1 - λ)β can be considered as the most likely common source
distribution of both distributions α and β, with λ as a prior weight. With λ = 1/2, the divergence score
D_JS is symmetric and ranges from 0 to 1, where the score between two identical
distributions is 0. A small D_JS value indicates that the two profile columns are closely
related and may be approximated by the common source distribution.
To assess the significance S of the similarity, the JS divergence of the common source r
from the background is measured:
$$S = D_{JS}(r, freq)$$
where freq is the overall amino acid distribution in the database. This significance
measure S reflects the probability that the source distribution is obtained by chance.
Finally, the divergence score and the significance score are combined to give a single
similarity score:
$$score(\alpha, \beta) = \frac{1}{2} (1 - D_{JS}(\alpha, \beta)) (1 + S)$$
Using this measure, similar distributions (D -> 0) having common source far from the
background (S -> 1) will have a similarity score close to 1. Scores for dissimilar
distributions (D -> 1) close to the background distribution (S -> 0) will tend to 0.
Distributions that are similar (D -> 0) but resemble the background (S -> 0) will have a
score of ½. The combined similarity score reflects both divergence and significance. The
similarity of two random distributions is not as significant as the similarity of two unique
distributions.
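The limiting cases above can be checked numerically with a minimal Python sketch of this score. It assumes λ = 1/2, base-2 logarithms, and the combination score = (1/2)(1 - D_JS)(1 + S), which is the form consistent with the limiting cases just described. The function names and the four-letter toy alphabet are hypothetical, and this is not the prof_sim implementation.

```python
import math


def kl_divergence(p, q):
    """Kullback-Leibler divergence in bits; terms with p_k = 0 contribute 0."""
    return sum(pk * math.log2(pk / qk) for pk, qk in zip(p, q) if pk > 0)


def js_divergence(p, q, lam=0.5):
    """Jensen-Shannon divergence and the common source distribution r."""
    r = [lam * pk + (1 - lam) * qk for pk, qk in zip(p, q)]
    d = lam * kl_divergence(p, r) + (1 - lam) * kl_divergence(q, r)
    return d, r


def prof_sim_score(alpha, beta, background, lam=0.5):
    """Combine divergence D and significance S into a single similarity score."""
    d, r = js_divergence(alpha, beta, lam)
    s, _ = js_divergence(r, background, lam)
    return 0.5 * (1 - d) * (1 + s)


# Toy columns over a hypothetical four-letter alphabet with uniform background.
background = [0.25, 0.25, 0.25, 0.25]
conserved = [1.0, 0.0, 0.0, 0.0]
other = [0.0, 1.0, 0.0, 0.0]

identical_conserved = prof_sim_score(conserved, conserved, background)
identical_random = prof_sim_score(background, background, background)
disjoint = prof_sim_score(conserved, other, background)
```

Two identical conserved columns score above 1/2, two columns equal to the background score exactly 1/2, and two columns conserved on different residues score 0.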
4.3 Probabilistic Scoring: log_aver
Unlike the previous information theoretic measure which measures only the similarity
between two profiles but not the similarity between amino acids, the log average scoring
proposed by von Öhsen et al. [12] explicitly uses the substitution matrix when scoring
two frequency profiles. They extended the usual amino acid similarity score to the
profile-profile situation, making the sequence-sequence score a special case of their
formula. The log average score is defined as:
$$score(\alpha, \beta) = \ln \sum_{i=1}^{20} \sum_{j=1}^{20} \alpha_i \beta_j \exp\left(\frac{\ln 2}{2} \cdot BLOSUM62_{i,j}\right)$$
where BLOSUM62_{i,j} is the value in the BLOSUM62 substitution matrix for amino acid i
substituted by j. The exponential converts each half-bit BLOSUM62 score back into an odds
ratio, so the score is the logarithm of the frequency-weighted average odds ratio over all
pairs of amino acids drawn from the α vector and the β vector. A
pair of vectors gets a high score if they have a similar distribution and are conserved. Log
average score can be used directly for local alignment without any transformation or
shift.
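A minimal Python sketch of the log average score follows, written as the natural logarithm of the double sum of α_i β_j exp((ln 2 / 2) · M[i][j]) for a substitution matrix M in half-bit units. To stay short it uses a hypothetical three-letter alphabet with illustrative matrix values rather than the full BLOSUM62 table; the function name is also an assumption.

```python
import math


def log_aver_score(alpha, beta, matrix, scale=math.log(2) / 2):
    """Log of the frequency-weighted average odds ratio between two columns."""
    total = 0.0
    for i, a in enumerate(alpha):
        for j, b in enumerate(beta):
            total += a * b * math.exp(scale * matrix[i][j])
    return math.log(total)


# Illustrative half-bit substitution scores for a three-letter toy alphabet.
toy_matrix = [
    [4, 0, -3],
    [0, 9, -2],
    [-3, -2, 11],
]

# With one-hot columns the score reduces to the matrix entry times the scale,
# that is, the ordinary sequence-sequence score expressed in nats.
match = log_aver_score([1, 0, 0], [1, 0, 0], toy_matrix)
mismatch = log_aver_score([1, 0, 0], [0, 0, 1], toy_matrix)
mixed = log_aver_score([0.8, 0.1, 0.1], [0.7, 0.2, 0.1], toy_matrix)
```

The one-hot case shows how the sequence-sequence score arises as a special case of the profile-profile formula.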
4.4 Probabilistic Scoring: prob_score
Mittelman et al. modified the PICASSO scoring method introduced by Heger and Holm
[4] and came up with a symmetric scheme called PICASSO3 [7]. The probabilistic score
is defined as:
$$score(\alpha, \beta) = \sum_{i=1}^{20} \alpha_i \log \frac{\beta_i}{freq_i} + \sum_{i=1}^{20} \beta_i \log \frac{\alpha_i}{freq_i}.$$
The highest score is obtained between two similar conserved positions. Positions that have a
distribution similar to the background amino acid frequencies will have intermediate
scores.
5. DISCUSSION
In [8], Ohlson et al. compared the four scoring methods above on their ability to
recognize proteins related at different levels according to SCOP (Structural Classification
of Proteins database), and on the quality of alignments generated. The ability of the
methods to separate structurally aligned residues from random residues on family level,
superfamily level, and fold level was tested. At family level, prof_sim and prob_score
show better performance. On superfamily and fold levels, prob_score is the best method,
followed by log_aver which performs better than dot-product and prof_sim. Their study
concluded that all the tested methods are significantly better at fold-recognition than
standard sequence-profile methods, which agrees with earlier studies [10, 12, 14, 2, 9].
The alignment quality is also better than that of standard sequence-profile methods, which
is also in agreement with the result in [11].
One point to note is that the performance of a profile-profile method depends not only on
the scoring method, but also on many other factors such as gap-penalties, alignment
methodology, and E-value calculations [8]. All these factors must also be optimized to
obtain the best result. In fact, the difference in performance between the different profile-profile
methods is quite small if the gap-penalties are optimized separately for alignment
generation and fold recognition. However, the probabilistic scoring methods (log_aver
and prob_score) do have a slight advantage as they are less sensitive to gap-penalties.
The two methods show good performance in both fold recognition and alignment quality
using identical parameters while the other methods have to use different parameters to
obtain similar performance.
In a study to evaluate different scoring methods on their ability to predict accurate short
ungapped alignment segments, Mittelman et al. came to the conclusion that the log-odds
based scoring methods they studied (PICASSO3 and COMPASS) and the method
prof_sim perform significantly better than the sum-of-pairs, Pearson’s correlation
coefficient or the dot-product methods [7]. Their result suggests that the family of
probabilistic methods (log-odds based methods and prof_sim) is able to provide better
initial ‘seeds’ in the first step of a local profile-profile alignment.
One final note is that the improvement in recognition sensitivity and alignment quality
does come with a cost. Profile-profile methods are, in general, much more time and
memory demanding than profile-sequence methods. They should be used when the
computational cost can be justified.
6. FUTURE WORK
The remote homolog detection method currently used by SAM [3] relies on a sequence-profile
comparison. The incorporation of a profile-profile aligner into the SAM suite
should further improve its performance. We plan to test the different scoring methods and
the possibility of incorporating a secondary structure component into the scoring system.
7. REFERENCES
[1] Altschul,S.F., Madden,T.L., Schaffer,A.A., Zhang,J., Zhang,Z., Miller,W. and
Lipman,D.J. (1997) Gapped BLAST and PSI-BLAST: a new generation of protein
database search programs. Nucleic Acids Res., 25, 3389–3402.
[2] Edgar,R.C. and Sjölander,K. (2004) COACH: profile-profile alignment of protein
families using hidden Markov models. Bioinformatics, 20(8), 1309–1318.
[3] Karplus,K., Barrett,C. and Hughey,R. (1998) Hidden Markov models for detecting
remote protein homologies. Bioinformatics, 14(10), 846–856.
[4] Heger,A. and Holm,L. (2003) Exhaustive enumeration of protein domain families. J.
Mol. Biol., 328, 749–767.
[5] Henikoff,S. and Henikoff,J.G. (1994) Position-based sequence weights. J. Mol. Biol.,
243, 574–578.
[6] Madera,M. (currently unpublished) PRC, the Profile Comparer. http://supfam.org/PRC.
[7] Mittelman,D., Sadreyev,R. and Grishin,N. (2003) Probabilistic scoring measures for
profile–profile comparison yield more accurate short seed alignments. Bioinformatics,
19, 1531–1539.
[8] Ohlson,T., Wallner,B. and Elofsson,A. (2004) Profile-profile methods provide
improved fold-recognition: a study of different profile-profile alignment methods.
Proteins: Structure, Function, and Bioinformatics, 57, 188–197.
[9] Pei,J., Sadreyev,R. and Grishin,N.V. (2003) PCMA: fast and accurate multiple sequence
alignment based on profile consistency. Bioinformatics, 19, 427–428.
[10] Rychlewski,L., Jaroszewski,L., Li,W. and Godzik,A. (2000) Comparison of sequence
profiles, strategies for structural predictions using sequence information. Protein Sci.,
9, 232–241.
[11] Sadreyev,R. and Grishin,N. (2003) COMPASS: a tool for comparison of multiple
protein alignments with assessment of statistical significance. J. Mol. Biol., 326, 317–336.
[12] von Öhsen,N., Sommer,I. and Zimmer,R. (2003) Profile–profile alignment, a
powerful tool for protein structure prediction. Proc. Pacific Symp. Biocomp., 252–263.
[13] Wallner,B., Fang,H., Ohlson,T., Frey-Skott,J. and Elofsson,A. (2004) Using
evolutionary information for the query and target improves fold recognition. Proteins,
54, 342–350.
[14] Yona,G. and Levitt,M. (2002) Within the twilight zone: a sensitive profile–profile
comparison tool based on information theory. J. Mol. Biol., 315, 1257–1275.