Identifying protein folds with threading

advertisement
Linear Programming Optimization and a Double Statistical Filter for
Protein Threading Protocols.
Jaroslaw Meller1,2 and Ron Elber1*
1
Department of Computer Science
Upson Hall 4130
Cornell University
Ithaca NY 14853
2
Department of Computer Methods,
Nicholas Copernicus University,
87-100 Torun, Poland
* Corresponding author
Phone (607)255-7416
Fax (607)255-4428
e-mail ron@cs.cornell.edu
Running title: “LP based threading model”
Keywords: Linear Programming, Potential Optimization, decoy structures,
threading, gaps
1
Abstract
The design of scoring functions (or potentials) for threading, differentiating
native-like from non-native structures with a limited computational cost, is an active field
of research. We revisit two widely used families of threading potentials, namely the
pairwise and profile models.
To design optimal scoring functions we use linear programming (LP). The LP
protocol makes it possible to measure the difficulty of a particular training set in
conjunction with a specific form of the scoring function. Gapless threading demonstrates
that pair potentials have larger prediction capacity compared to profile energies.
However, alignments with gaps are easier to compute with profile potentials. We
therefore search and propose a new profile model with comparable prediction capacity to
contact potentials.
A protocol to determine optimal energy parameters for gaps using linear
programming is also presented. A statistical test, based on a combination of local and
global Z scores, is employed to filter out false positives. Extensive tests of the new
protocol are presented. The new model provides an efficient alternative for threading
with pair energies, maintaining comparable accuracy. The code, databases and a
prediction server are available at http://www.tc.cornell.edu/CBIO/loopp.
2
I.
Introduction
The threading approach [1-8] to protein recognition is an alternative to the sequenceto-sequence alignment. Rather than matching the unknown sequence Si to another
sequence S j (one dimensional matching) we match the sequence Si to a shape X j (three
dimensional matching). Experiments found a limited set of folds compared to a large
diversity of sequences, suggesting the use of structures to find remote similarities
between proteins. Hence, the determination of overall folds is reduced to tests of
sequence fitness into known and limited number of shapes.
The sequence-to-structure compatibility is commonly evaluated using reduced
representations of protein structures. Points in 3D space represent amino acid residues
and an effective energy of a protein is defined as a sum of inter-residue interactions. The
effective pair energies can be derived from the analysis of contacts in known structures.
Such knowledge-based, pairwise potentials are widely used in fold recognition [2,3,6,911], ab-initio folding [11-13] and sequence design [14-15].
Alternatively, one may define the so-called “profile” energy [1,5,16] taking the form
of a sum of individual site contributions, depending on the structural environment of a
site. For example, the solvation/burial state or the secondary structure can be used to
characterize different local environments. The advantage of profile models is the
simplicity of finding optimal alignments with gaps (deletions and insertions into the
aligned sequence) that allow the identification of homologous proteins of different length.
Using dynamic programming (DP) algorithm [17-20] one may compute optimal
alignments with gaps in polynomial time, as compared to the exponential number of all
possible alignments.
3
In contrast to profile models, the potentials based on pair energies do not lead to exact
alignments with dynamic programming. A number of heuristic algorithms, providing
approximate alignments, have been proposed [21]. However, they cannot guarantee an
optimal solution with less than exponential number of operations [22]. Another common
approach is to approximate the energy by a profile model (the so-called frozen
environment approximation) and to perform the alignment using DP [23].
In the present manuscript we evaluate several different scoring functions for
sequence-to-structure alignments, with parameters optimized by linear programming (LP)
[24-26]. The capacity of the energies is explored in terms of a number of protein shapes
that are recognized with a certain number of parameters. We propose a novel profile
model, designed to mimic pair energies, which is shown to have accuracy comparable to
that of other contact models. We discuss gap energies and introduce a double Z score
measure (from global and local alignments) to assess the results. The proposed threading
protocol emphasizes structural fitness (as opposed to sequence similarity) and can be
made a part of more complex fold recognition algorithms that use family profiles,
secondary structures and other patterns relevant for protein recognition.
II.
Theory and computational protocols
II.1. Functional form of the energy
In this section we formally define the families of pairwise and profile models. We
also introduce a novel THreading Onion Model (THOM), which is investigated in the
subsequent sections of the paper. In the widely used pairwise contact model, the score of
4
the alignment of a sequence S into a structure X is a sum of all pairs of interacting amino
acids,
E pairs    ij ( i ,  j , rij ) .
(1)
i j
The pair interaction model --  ij depends on the distance between sites i and j, and on the
types of the amino acids,  i and  j . The latter are defined by the alignment, as certain
amino acid residues are placed in sites i and j, respectively.
We consider here the widely used contact potential. If the geometric centers of the
side chains are closer than 6.4 Angstrom then the two amino acids are considered in
contact. The total energy is a sum of the individual contact energies:

 ij ( i ,  j , rij )   
1.0  rij  6.4 Ang 
 ,
otherwise 
 0
(2)
where i, j are the structure site indices (contacts due to sites in sequential vicinity
are excluded, i  3  j ),  ,  are indices of the amino acid types (we drop the
subscripts i and j for convenience) and   is a matrix of all the possible contact types.
For example, it can be a 20x20 matrix for the twenty amino acids. Alternatively, it can be
a smaller matrix if the amino acids are grouped together to fewer classes. Different
groups that are used in the present study are summarized in table 1. The entries of   are
the target of parameter optimization.
**PLACE TABLE 1 HERE **
The second type of energy function assigns “environment” or a profile to each of
the structural sites [1]. The total energy E profile is written as a sum of the energies of the
sites:
5
E profile   i ( i , X) .
(3)
i
As previously,  i denotes the type of an amino acid that was placed at site i of X . For
example if  i is a hydrophobic residue, and xi is characterized as a hydrophobic site, the
energy  i ( i , X) will be low (score will be high). If  i is charged then the energy will be
high (low score). The total score is given by a sum of the individual site contributions.
We consider two profile models. In THOM1 (THreading Onion Model 1), which
was used in the past as an effective solvation potential [1,2], the total energy of the
protein is a direct sum of the contributions from structural sites and can be written as
ETHOM1    i (ni ) .
(4)
i
The energy of a site depends on two indices: (a) the number of neighbors to the site -- ni
(a neighbor is defined by a cutoff distance -- formula (2)), and (b) the type of the amino
acid at site i --  i . For twenty amino acids and maximum of ten neighbors we have 200
parameters to optimize, a comparable number to the detailed pairwise model.
THOM1 provides a non-specific interaction energy, which as we show in section
III, has relatively low prediction ability when compared to pairwise interaction models.
THOM2 is an attempt to improve the accuracy of the environment model making it more
similar to pairwise interactions.
We define the energy  i ( ni , n j ) of a contact between structural sites i and j, where
ni is the number of neighbors to site i and n j is the number of neighbors to site j . The
type of amino acid at site i is  i . Only one of the amino acids in contact is known. The
total contribution to the energy of site i is a sum over all contacts to this site
6
i , THOM2 ( i  X)   ' (ni , n j ) . The prime indicates that we sum only over sites j that are
i
j
in contact with i (i.e. over sites j satisfying the condition 1.0  rij  6.4 Ang and
i  j  4 ). The total energy is finally given by a double sum over i and j ,
ETHOM 2  
  ( n , n ) .
i
'
i
i
j
(5)
j
It is possible to define an effective contact energy using THOM2:
Vijeff   i (ni , n j )   j (n j , ni ) .
(6)
Hence, we can formally express the THOM2 energy as a sum of pair energies,
ETHOM 2   Vijeff . The effective energy mimics the formalism of pairwise interactions.
i j
However, in contrast to the usual pair potential, the optimal alignments with gaps can be
computed efficiently with THOM2, since structural features alone determine the
“identity” of the neighbor.
** PLACE TABLE 2 HERE **
We use a coarse-grained model leading to a reduced set of structural environments
(types of contacts) as outlined in table 2. The use of a reduced set makes the number of
parameters (300 when all the 20 types of amino acids are considered) comparable to that
of the contact potential. Further analysis of the new model is included in section III.
II.2 Linear programming protocol for optimization of the energy parameters
Consider the alignment of a sequence S of length n , into a structure X of length m .
In order to optimize the energy parameters for the amino acid interactions (the gap
energies are discussed in section II.3) we employ the so-called gapless threading, in
which the sequence S is fitted into the structure X with no deletions or insertions. Hence,
7
the length of the sequence must be shorter or equal to the length of the protein chain. If n
is shorter than m we may try m  n  1 possible alignments varying the structural site of
the first residue (and the following sequence).
The energy (score) of the alignment of S into X is denoted by E ( S , X, p) , where X
stands (depending on the context) either for the whole structure or only for a substructure of length n. The energy function -- E ( S , X, p) depends on a vector p of q
parameters (so far undetermined).
Consider the sets of structures Xi  and sequences S j  . There is an energy value
for each of the alignments of the sequences
S 
j
into the structures Xi  . A good
potential will make the alignment of the “native” sequence into its “native” structure the
lowest in energy. Let X n be the native structure. A condition for an exact recognition is:
E ( S n , X j , p)  E ( S n , X n , p)  0
 jn.
(7)
In the set of inequalities (7) the coordinates and sequences are given and the unknowns
are the parameters that we need to determine.
The LP protocol makes it possible to measure the difficulty of a training set. The
number of parameters of the energy function that are necessary to satisfy all the
inequalities are derived from the set of structures, as defined in equation (7). Note also,
that while the statistical potentials are based on the analysis of native structures only, the
LP protocol is using sequences threaded through misfolded structures during the process
of learning. As a result the LP has the potential for accumulating more information,
attempting to put the energies of the misfolded sequence as far as possible from the
energy of the native state. In fact, the LP protocol can be used to optimize the Z score of
8
the distribution of energy gaps [27]. In the remaining part of this section we describe the
technique to solve the inequalities of equation (7).
The “profile” and pairwise interaction models that are considered in this work can
be written as a scalar product:
E   n ( X) p  n( X)  p ,
(8)

where p is the vector of parameters that we wish to determine. The index of the vector,
 , is running over the types of contacts or sites. For example, in the pairwise interaction
model the index  is running over amino acid pairs, whereas in the THOM1 model it is
running over the types of sites characterized by the identity of the amino acid at the site
and the number of its neighbors. n (X ) is the number of contacts, or sites of a specific
type found in the structure X.
Using the representation of equation (8) we may rewrite equation (7) as follows:
p  n j   p (n ( X j )  n ( X n ))  0
 j  n.
(9)

Standard linear programming tools can solve equation (9) for p. We use the BPMPD
program of Cs. Meszaros [28], which is based on the interior point algorithm. In the
present computations we seek a point in parameter space that satisfies the constraints and
we do not optimize a function in that space. Without a function to optimize the interior
point algorithm places the solution at the “maximally feasible” point, which is at the
analytic center of the feasible polyhedron that defines the “accessible” volume of
parameters [29, 27].
The set of inequalities that we wish to solve includes tens of millions of
constraints that could not be loaded into the computer memory directly (we have access
9
to machines with two to four Gigabytes of memory). Therefore, the following heuristic
approach was used. Only a subset of the constraints is considered, namely p  n  Cj 1 ,
J
with a threshold C chosen to restrict the number of inequalities to a manageable size
(which is about 500,000 inequalities for 200 parameters). Hence, during a single iteration,
we considered only the inequalities that are more likely to be relevant for further
improvement by being smaller than the cutoff C .
The subset p  n  Cj 1 is sent to the LP solver “as is”. If proven infeasible, the
J
calculation stops (no solution possible). Otherwise, the result is used to test the remaining
inequalities for violations of the constraints (equation 9). If no violations are detected the
process was stopped (a solution was found). If negative inner products were found in the
remaining set, a new subset of inequalities below C was collected. The process was
repeated, until it converged. Sometimes convergence was difficult to achieve and human
intervention in the choices of the inequalities was necessary. For example, mixing subsets
of inequalities from previous runs with the lowest inequalities that were obtained with the
new parameters helps to avoid the problem of “oscillating” between solutions.
II.3 Protocol for optimization of gap energies
In the following section we discuss the derivation of the energy terms for gaps and
deletions that enable the detection of homologs. We introduce an “extended” sequence,
S , which may include gap “residues” (spaces, or empty structural sites) and deletions
(removal of an amino acid, or an amino acid placed at a virtual structural site).
The gap residue, “-“, is considered to be another amino acid. We assign to it a score
(or energy) --   (X) according to its environment. Gap training is similar to the training
of other amino acid residues, in contrast to the usual ad-hoc treatment of gap energies.
10
The proposed treatment is also more symmetric than the different penalties for opening
and extending gaps.
The database of “native” and decoy structures is different, however, for gapless and
gap training. To optimize the gap parameters we need “pseudo-native” structures that
include gaps. We construct such “pseudo-native” conformations by removing the true
native shape X n of the sequence S n from the coordinate training set and replacing it by a
homologous structure -- X h . The best alignment of the native sequence, S n , into the
homologous structure, X h , with an initial guess of gap penalties, defines S n . The
extended sequence, S n , with gap residues at certain (fixed from now on) positions
becomes our new (pseudo-) native sequence of the structure X h .
We require that the alignment of S n into the homologous protein will yield the
lowest energy compared to all other alignments of the set. Hence, our constraints are:
E(Sn , X j , p)  E(Sn , Xh , p)  0
 j  h, n .
(10)
Equation (10) is different from equation (7) since we consider the “extended” set of
“amino acids” -- S instead of S and the native-like structure is X h instead of X n .
To limit the scope of the computations we optimize here the scores of the gaps
only. Thus, we do not allow the amino acid energies optimized separately (see section
III.2) to change while optimizing parameters for gaps. Moreover, the sequence S ,
obtained by a certain prior (e.g. structure-to-structure) alignment or from the
experimental data if available, is held fixed. In other words, threading of the extended
sequence with fixed positions and number of gap residues (treated now as any other
residue), S , against all other structures in the training set is used, in order to generate a
11
corresponding set of inequalities, (equation (10)). This optimization, while limited and
clearly not the final word on the topic, is still expected to be better than a guess. Further
studies of gap penalties are in progress [Galor, Meller and Elber, work in progress]. We
note that optimization of gaps has been attempted in the past [23,30].
In principle, one could optimize deletion penalties using a similar protocol. In the
present manuscript we exploit an assumed symmetry between insertion of a gap residue
to a sequence and the placement of a “delete” residue in a virtual structural site. The
deletion penalty is set equal to the cost of insertion averaged over two nearest structural
sites. No explicit dependence on the amino acid type is assumed.
II.4 Double Z score filter for gapped alignments
In sections III.4 and IV.2-4 we consider optimal alignments of an extended
sequence S with gaps into the library structures X j . We focus on the alignments of
complete sequences to complete structures (global alignments [17]) and alignments of
continuous fragments of sequences into continuous fragments of structures (local
alignment [18]). In global alignments opening and closing gaps (gaps before the first
residue and after the last amino acid) reduce the score. In local alignments gaps or
deletions at the C and N terminals of the highest scoring segment are ignored. Only one
local segment, with the highest score, is considered.
Threading experiments that are based on a single criterion (the energy) are usually
unsatisfactory [26,31]. While we do hope that the (free) energy function that we design is
sufficiently accurate so that the native state (the native sequence threaded through the
native structure) is the lowest in energy, this is not always the case. Our exact training is
for the training set and for gapless threading only (see section III). The results were not
12
extended to include exact learning with gaps, or exact recognition of structures of related
proteins that are not the native. Such extensions are difficult since the number of
inequalities for S is exponentially larger than the number of inequalities without gaps.
Other investigators use the Z score as an additional or the primary filter [19,32,4,6]
and we follow their steps. The novelty in the present protocol is the combined use of
global and local Z scores to assess the accuracy of the prediction. This filtering
mechanism, in addition to initial energy filter, provides an improved discrimination as
compared to a single Z score test.
The Z score is a dimensionless, “normalized” score, defined as:
Z
E  Ep
E2  E
2
.
(12)
The energy of the current “probe” i.e. the energy of the optimal alignment of a query
sequence into a target structure is denoted by E p . The averages, ... , are over “random”
alignments. The Z score measures the deviation of our “hits” from random alignments
(alignments with scores that are far from the “random” average value are more
significant). Following the common practice [32-34] we generate the distribution of
random alignments numerically, employing sequence shuffling. That is, we consider the
family of sequences obtained by permutations of the original sequence. The set of
shuffled sequences has the same amino acid composition and length as the native
sequence and all the shuffled sequences have the same energy in the unfolded state (the
energy of an amino acid with no contacts is set to zero).
In section III.4 we estimate numerically the probability P( Z p ) of observing a
Z score larger than Z p by chance for local threading alignments. The relatively high
13
likelihood of observing large Z scores for false positives makes predictions based on the
Z score test problematic. Therefore we propose an additional filtering mechanism, based
on a combination of Z scores for global and local alignments. The double Z score filter
eliminates false positives, missing a smaller number of correct predictions.
Global alignments (in contrast to local alignments) are influenced significantly by
a difference in the lengths of the structure and the threaded sequence. The matching of
lengths was considered too restricted in previous studies [35]. Nevertheless, using
environment dependent gap penalty, the Z score of the global alignment was proven a
useful independent filter (see sections III.4 and IV.2-4). We observe that good scores are
obtained for length differences (between sequence and structure) that are of order of ten
percent. On the other hand, when the differences in length are profound the global
alignment fails. Large differences imply identification of domains and not a whole
protein. This is a different problem, which the present manuscript does not address.
III.
Application to potential design and analysis
In this section we analyze and compare several pairwise and profile potentials,
optimized using the LP protocol. Given the training set, we either obtain a solution (exact
recognition on the training set), or the LP problem proves infeasible.
We use the infeasibility of a set to test the capacity of an energy model. We compare
the capacity of alternative energy models by inquiring how many native folds they can
recognize (before hitting an infeasible solution). The larger is the number of proteins that
are recognized with the same number of parameters, the better is the energy model. We
14
find that the “profile” potentials have in general lower capacity than the pairwise
interaction models.
III.1 Training and test sets
Two sets of protein structures and sequences are used for the training of
parameters in the present study. Hinds and Levitt developed the first set [31] that we call
the HL set. It consists of 246 protein structures and sequences. Gapless threading of all
sequences into all structures generated the 4,003,727 constraints (i.e. the inequalities of
equation (7)). The gapless constraints were used to determine the potential parameters for
the twenty amino acids. Since the number of parameters does not exceed a few hundred,
the number of inequalities is larger than the number of unknowns by many orders of
magnitude.
The second set of structures consists of 594 proteins and was developed by Tobi
et al. [25]. It is called the TE set and is considerably more demanding. It includes
structures chosen according to diversity of protein folds but also some homologous
proteins (up to 60 percent sequence identity), and poses a significant challenge to the
energy function. For example, the set is infeasible for the THOM1 model, even when
using 20 types of amino acids (see section III.2). The total number of inequalities that
were obtained from the TE set using gapless threading was 30,211,442. The TE set
includes 206 proteins from the HL set.
We developed two other sets that are used as testing sets to evaluate the new
potentials both in terms of gapped and gapless alignments. These test sets contain
proteins that are structurally dissimilar to the proteins included in the training sets,
specified by the RMS distance between the structures. A structure-to-structure alignment
15
algorithm, based on the overlap of the contact shells that are defined for the
superimposed side-chain centers in analogy with the THOM2 model (disregarding
however the identities of amino acids), was used (Meller and Elber, unpublished results).
The first testing set, which is referred to as S47, consists of 47 proteins: 25
proteins from the CASP3 competition [36] and 22 related structures, chosen randomly
from the list of DALI [37] relatives of the CASP3 targets. Using CASP3-related
structures is a convenient way of finding protein structures that are not sampled in the
training. None of the 47 structures has homologous counterparts in the HL set and only
six (representing three different folds) have counterparts in the TE set, with a cutoff for
structural (dis)-similarity of 12 angstrom RMSD (between the superimposed side-chain
centers).
The second test set, referred to as S1082, consists of 1082 proteins that were not
included in the TE set and which are different by at least 3 angstrom RMSD (measured,
as previously, between the superimposed side chain centers) with respect to any protein
from the TE set and with respect to each other. Thus, the S1082 set is a relatively dense
(but non-redundant up to 3 angstrom RMSD) sample of protein families. The training and
testing sets are available from the web [38].
III.2 Linear programming training of “minimal” models
This section addresses the question: What is the minimal number of parameters
that is required to obtain an exact solution for the HL and for the TE sets? By “exact” we
mean that each of the sequences picks the native fold as the lowest in energy using a
gapless threading procedure. Hence, all the inequalities in equation (7), for all sequences
S n and structures X j , are satisfied.
16
The pairwise model requires the smallest number of parameters (55) to solve the
HL set exactly (see table 3). Only ten types of amino acids were required: HYD POL
CHG CHN GLY ALA PRO TYR TRP HIS (see also table 1). The THOM1 and THOM2
models require 200 and 150 parameters respectively to provide an exact solution on the
same (HL) set (see table 3). It is impossible to find an exact potential (of the functional
forms we examined) for the HL set without (at least) ten types of amino acids. The
potentials optimized on the HL set are then used to predict the folds of the proteins of the
TE set. Again, we find that the pairwise interaction model performs better than THOM
models.
** PLACE TABLE 3 HERE **
An indication that THOM2 is a better choice than THOM1 is included in the next
comparison. It is impossible to find parameters that will solve exactly the TE set using
THOM1 (the inequalities form an infeasible set). The infeasibility is obtained even if 20
types of amino acids are considered. In contrast, both THOM2 and the pairwise
interaction model led to feasible inequalities if the number of parameters is 300 for
THOM2 and 210 for the pairwise potential. Note that the set of parameters that solved
exactly the TE set does not solve exactly the HL set since the latter set includes proteins
not included in the TE set.
We have also attempted to solve the TE set using pair energies and THOM2 with
a smaller number of parameters. The problem was proven infeasible even for 17 different
types of amino acids and only very similar amino acids grouped together (Leu and Ile,
Arg and Lys, Glu and Asp). Similarly, we failed to reduce the number of parameters by
17
grouping together structurally determined types of contacts in THOM2. Enhancing the
range of a “dense” site to be a site of seven neighbors or more also results in infeasibility.
III.3 Analysis of THOM2 model
As discussed in section II, the THOM1 potential provides a new set of parameters
for an effective solvation model that was used in the past. Since applying the LP protocol
we can only solve the HL set, the solution for that set gives our optimal THOM1
energies, as included in table 4.a. In this section we analyze in detail the THOM2 model,
which has significantly higher capacity than THOM1. However, the double layer of
neighbors makes the results more difficult to understand.
** PLACE TABLE 4 HERE **
In figure 1 we present a contour plot of the total contributions to the energies of
the native alignments in the TE set, as a function of the number of contacts in the first
shell, n , and the number of secondary contacts to a primary contact, n ' , respectively.
The results for two types of residues, lysine and valine, are presented. The contribution of
a given type of contact is defined as f    n, n ' , where   n, n ' is the energy of a
given type of contact and f is the frequency of that contact, observed in the TE set.
** PLACE FIGURE 1 HERE **
It is possible to find a very attractive (or repulsive) site that makes only negligible
contribution to the native energies since it is extremely rare (i.e. f is small). For specific
examples see table 5. By plotting f    n, n ' we emphasize the important contributions.
Hydrophobic residues with a large number of contacts stabilize the native alignment, as
opposed to polar residues that stabilize the native state only with a small number of
neighbors.
18
** PLACE TABLE 5 HERE **
It has been suggested that pairwise interactions are insufficient to fold proteins
and higher order terms are necessary [26]. It is of interest to check if the environment
models that we use catch cooperative, many-body effects. As an example we consider the
cases of valine-valine and lysine-lysine interactions. We use equation (6) to define the
energy of a contact. In the usual pairwise interaction the energy of a valine-valine contact
is a constant and independent of other contacts that the valine may have.
** PLACE TABLE 6 HERE **
In table 6 we list the effective energies of contacts between valine residues as a
function of the number of neighbors in the primary and secondary sites. The energies
differ widely from –1.46 to +3.01. The positive contributions refer to very rare type of
contacts. The plausible interpretation is that these rare contacts are used to enhance
recognition in some cases, due to specific “homologous features”. Significant differences
are observed also for the frequently occurring types of contacts that contribute in accord
with the “general principle” of rewarding contacts between hydrophobic sites. For
example, the effective energies of contacts between valine of five neighbors with another
valine of three, five or seven neighbors are equal to –0.44, -0.54, -0.61, respectively.
Hence, the THOM2 model includes significant cooperativity effects. The optimal
parameters for THOM2 potential are provided in table 4.b.
III.3 Training of gap energies
In this section we apply the linear protocol for the optimization of gap energies
described in section II.3. Training concerns the gap energies for THOM2 model only and
it is limited to a small set of carefully chosen homologous pairs. Despite the limited scope
19
of our training, we obtain satisfactory results in terms of recognition of remote homologs,
as discussed in the subsequent parts of the paper.
Pairs of homologous proteins from the following families were considered:
globins, trypsins, cytochromes and lysozymes (see table 7). The families were selected to
represent different folds. The globins are helical, trypsins are mostly  -sheets, and
lysozymes are  /  proteins. Note also that the number of gaps differs appreciably from
a protein to a protein. For example, S n includes only one gap for the alignment of 1ccr
(sequence) vs. 1yea (structure), and 22 gaps for 1ntp vs. 2gch. The structures of the
lysozymes 1lz5 and 1lz6 include engineered insertions that allow us to sample
experimentally observed gap locations.
** PLACE TABLE 7 HERE **
For the remaining families, the process of generating pseudo-native sequences is
as follows: For each pair of native and homologous proteins the alignment of the native
sequence S n into the homologous structure X h is constructed using THOM1 potential
with an initial guess for the gap energies, provided in table 8a. The ad hoc gap penalties
favor gaps at sites with few neighbors and they satisfy the following constraints: The gap
penalty should increase with the number of neighbors; The energy of a gap with n
contacts must be larger than the energy of an amino acid with the same number of
contacts (the gap energy must be higher otherwise gaps will be preferred to real amino
acids); The energy of amino acids without contacts is set to zero and therefore the gap
energy is greater than zero. Given the above constraints the initial gap penalties are tuned
up to minimize the discrepancies with the DALI [37] structure-to-structure alignments
20
(we choose not to use the DALI alignments directly since they involve deletions that are
not trained explicitly at this stage – see section II.3).
** PLACE TABLE 8 HERE **
The “pseudo-native” structures with extended sequences, obtained as described
above, are added to the HL set (while removing the original native structures). The
energy functional form that we used for the gaps is the same as for other amino acids in
the THOM2 model. “Gapless” threading into other structures of the HL set generates
about 200,000 constraints for the gap energies, which are solved using the LP solver. The
resulting gap penalties for THOM2 are given in table 8.b. The value of 10 is the maximal
penalty allowed by the optimization protocol that we used. Maximal penalty is assigned
to gaps that are found only in decoy states and have no native states to bound the penalty
at lower values. For example, using our initial guess for gap penalties we do not observe
gaps at the hydrophobic cores of pseudo-native structures. Gaps are usually found in
loops with significant solvent exposure and we have no information in our training set on
“native” gaps in sites that are neighbor-rich.
** PLACE TABLE 9 HERE **
In table 9 we show the results of optimal threading with gaps (using dynamic
programming) for myoglobin (1mba) against leghemoglobin (1lh2) structure. We show
the initial alignment (with the ad-hoc gap parameters from table 8.a), defining the
pseudo-native state, and the results for optimized gap penalties for THOM2. The location
of gaps in the initial alignment is to a large extent consistent with the DALI [37]
structure-to-structure alignment. Four (out of seven) insertions coincide with the DALI
superposition of the two structures, two insertions are shifted by three residues (see
21
caption for table 9). The THOM2 alignment (different from the initial set-up) is less
consistent with the DALI alignment. Interestingly, however, it provides a better
superposition of alpha helices. Note that the gaps appear (as expected) in loop regions
(e.g., the CD, EF, and GH loops). An exception is the gap at position 9 (in 1lh2), which is
located in the middle of the A helix instead of position 2, as suggested by the DALI
alignment. Further tests of alignments with gaps are presented in section IV.
In section IV we present also threading results for the pairwise TE potential, using
the so-called frozen environment approximation (FEA) [23]. In order to compute optimal
alignments with the FEA we need to set the gap penalties for the TE potential. Pairwise
models are not the focus of our study and we do not attempt to optimize gap energies for
the TE potential. Therefore, for the sake of fair comparison we introduce ad hoc gap
penalties based on a similar functional model, for both the TE and THOM2 potentials.
After some experimentation, the insertion penalties are chosen to be proportional to
the number of neighbors to a site,  TE (n)  0.2  (n  1) and  THOM 2 (n)  1.0  ( n  1) , for
the TE and THOM2 potentials (the averaged number of neighbors, n , in a class n
belongs to, is used for THOM2 – see table 2), respectively. This choice is consistent with
the trained THOM2 gap energies, which also penalize sites of no neighbors. The
proportionality coefficients were gauged using the same families that were used to train
THOM2 gap energies. However, no LP training was attempted. The deletion penalties are
also consistent with the THOM2 model and they are defined in the way described in
section II.3. For further comparisons with sequence-to-sequence alignments we also
introduce environment dependent gap penalties that are used for family recognition in
22
conjunction with the BLOSUM50 [39] substitution matrix,  B 50 (n)  (5  n)  8 (see
section IV.3).
III.4 Assessing the distribution of the Z scores for gapped alignments
In this section we compute numerical distributions of the Z scores for local and
global threading alignments, using the THOM2 model and the gap penalties of table 8.b,
trained in the previous section. Based on these distributions, we derive empirical cutoffs
for the double Z score test (discussed in section II.4) that filters out all the incorrect
predictions observed in our benchmark. Further tests of the specificity as well as
sensitivity of the double Z score filter are included in sections IV.2-4.
To establish a cutoff for the Z score (and not the energy itself) that eliminates false
positives, we estimate numerically the probability P( Z p ) of observing a Z score larger
than Z p by chance. The distribution of Z scores for random alignments is generated by
threading sequences of the S47 set through structures included in the HL set. The probe
sequences of known structures were selected to ensure no structural similarity between
the HL set and the structures of the probe sequences (see section III.1). Therefore any
significant hit in this set may be regarded as a false positive. Z scores of local alignments
are employed to estimate P  Z p  . The number of local alignments with “good” energies
(significantly lower than zero) is large, underlying the need for an additional selection
mechanism to eliminate false positives.
** PLACE FIGURE 2 HERE **
In local alignments a contribution due to a given contact should be only included if
it belongs to the alignment (which is not known to start with). This implies a “structural”
frozen environment approximation (see also section IV.4). When counting contacts we
23
assume that the sites in contact in the original structure belong to the aligned part of the
structure. This may result in spuriously low energies of local matches, making the Z score
of the local threading alignment an important filter.
As can be seen from figure 2, the attempted analytical fit to the gaussian
distribution underestimates the tail of the observed distribution. The analytical fit to the
extreme value distribution [40] provides, in turn, an upper bound for the tail. In the realm
of sequence comparison, the extreme value distribution has been used to model scores of
random sequence alignments for both: local, ungapped alignments [41], as well as local
alignments with gaps [42]. We, however, establish our thresholds based on the numerical
distribution.
The number of random alignments with Z score larger than three, for example, is
non-negligible (see the tail in figure 2 and also the analytical estimate in the caption for
figure 2). The expected number of false positives observed in N trials is N  P  Z p  .
Therefore, only relatively high Z scores (that would miss at same time many correct
predictions) may result in significant predictions, when searching large databases.
Restricting the Z score test to only best matches (according to energy) is insufficient. We
find that the double Z score filter performs better, eliminating false positives with a
smaller number of correct predictions that are dismissed as insignificant.
In figure 3 we present the joint probability distribution for global and local Z scores
for a population of false positives versus a population of correct predictions. The squares
at the upper right corner represent correct predictions, resulting from 331 native
alignments (of a sequence into its native structure) and homologous alignments (of a
sequence into a homologous structure) of the HL set proteins. The circles at the left lower
24
corner are incorrect predictions (false positives) obtained from the alignments of the
sequences of the S47 set against all structures in the HL.
The procedure is the same as the one used previously to generate the probability
density function for the Z scores of local alignments. However, the Z-scores are
computed using 1000 shuffled sequences for both global and local alignments, which is
sufficient for convergence. The converged results reduce somewhat the tails of the
distribution. For example, the number of false positives with a global Z score larger than
2.5 and a local Z score larger than 1.0 is equal to 3, as compared to 7 with only 100
shuffled sequences.
** PLACE FIGURE 3 HERE **
Figure 3 shows that the thresholds of 3.0 for global Z scores and of 2.0 for local Z
scores are sufficient to eliminate all the false predictions. These cutoffs result in a number
of misses (see also next section). However, this is the price we have to pay for high
confidence of our predictions. The total number of pairwise alignments for which we
compute the global and the local Z scores, and subsequently test for the presence of false
positives, is about 10000. Hence, we estimate that the probability of observing a single
false positive with a global and a local Z-score larger than the 3.0 and 2.0 thresholds is
smaller than 0.0001.
IV.
Tests of the model
There are four tests that we perform in this section on the THOM2 potential. First,
we compare the performance of the THOM2 and pair potentials from the literature, using
gapless alignments and the S1082 set of proteins. Next, we consider alignments with
25
gaps. We test the specificity and sensitivity of the double Z score filter that is employed
to assess the statistical significance of gapped alignments. Using the double Z score
filter, we analyze self-recognition for the S47 set of proteins that contains representatives
of folds not sampled in the training. Next, tests of family recognition are presented,
including the comparison of THOM2 results with those of a pairwise model using the
frozen environment approximation.
IV.1 Evaluation of THOM2 and pair potentials by gapless threading
To make a comparison to pairwise potentials and to test at the same time the
generalization capacity of THOM2, we use the S1082 set. This set does not contain
proteins included in the training set. However, as discussed in section III.1, the threshold
of 3 angstrom RMS for global structure-to-structure alignments (using side chain centers)
excludes only close structural homologs. Therefore, the S1082 set includes many
structural variations of the folds used in the training. In general, it is difficult to find
completely independent test sets when using training sets covering essentially all the
known folds. This problem concerns all the knowledge-based potentials considered here.
Using gapless threading we compare the performance of THOM2 to the
performance of five knowledge-based pairwise potentials. As can be seen from the table
10, the Godzik-Skolnick-Kolinski (GSK) potential [43] is the best in terms of the number
of inequalities that are not satisfied, followed by the Betancourt-Thirumalai (BT) [44],
Tobi-Elber (TE) [25], THOM2, Miyazawa-Jernigan (MJ) [45] and the Hinds-Levitt (HL)
[46] potentials. However, in terms of the number of proteins recognized exactly (i.e.
proteins with native energies lower than energies of all the decoys generated by gapless
26
threading into all the structures in the S1082 set), the HL potential is the best, followed
by TE, MJ, THOM2, BT, and GSK potentials.
The lack of correlation between the above two criteria is related to the fact that
some of the above potentials, while recognizing very well many proteins, fair quite
poorly for some of the proteins included in the S1082 set. Reducing the number of
violated inequalities becomes important when applying some additional filters to select
correct predictions from a small subset of energetically favorable matches (e.g. the Z
score test, see section III.4). Therefore, it would be desirable to satisfy both criteria at
same time (maximizing also the Z score of the distribution of energy gaps). From this
point of view the TE, MJ and THOM2 potentials seem to be somewhat better than the
other four potentials. Note that gapless training of energies is still a difficult problem, as
reflected in table 10. None of the widely used potentials has better than 90% success rate.
In a set of 1000 proteins this means many errors.
** PLACE TABLE 10 HERE **
The conclusion, which is important for the present work, is that the performance of
the THOM2 potential is comparable to the performance of pairwise potentials, including
the TE potential trained on the same set using a similar LP protocol. Since the proteins
used in this test were either not included in the training or represent at least considerable
variations of the structures included in the training, we conclude that the exact learning
on the training set does not result in over-fitting. This is further supported by the results
(presented in the next section) for the S47 set of proteins that represent folds not sampled
during the training.
27
IV.2 Self recognition by gapped alignments
We summarize first the performance of the THOM2 potential in terms of selfrecognition of the HL set proteins by optimal alignments and Z score filters. The HL set
was partially learned (using gapless threading). However, our training did not include the
Z score or the possibility of gaps. Successful predictions based on the Z score only, are
useful tests even if performed on the training set of structures. Additionally, there are
forty proteins in the HL set that were not included in the learning (TE) set.
For each sequence we generate all the global and local alignments into all the
structures in the HL set. Energy and Z score filters are considered. Of the total of 246
proteins, 234 native (global) alignments obtain the lowest energy and the highest Z score.
There are four native alignments resulting in weak Z scores. The four failures are
membrane proteins (from the photosynthetic reaction centers) that were not included in
the training set. Only five out of the remaining 242 native alignments obtain Z-scores
smaller than 3 (four alignments with Z scores larger than 2.5 and one alignment with a Z
score smaller than 2.5).
For the local alignments we use the Z score as the main filter since there are many
incorrect alignments with low energies. There are 226 local native alignments with Z
scores larger than 2 (177 of them of rank one and 35 of them of rank two). Among the
remaining 20 local native alignments, nine result in very low Z scores ( Z  1.0 ),
including six structures from the training set. Using the double Z score filter with the
conservative threshold of 3 for global Z scores and of 2 for local Z scores results in
dismissing 23 native alignments as insignificant.
28
In order to assess further the generalization capacity of THOM2 in terms of selfrecognition by optimal alignments, we use the S47 set again. The structures of S47
proteins were embedded in the structures of the TE set and the sequences of 25 proteins
representing different folds in the S47 set were aligned into all the structures of such
extended set. We observe that the native structures are found with high probability.
Twenty of twenty-five structures result in native alignments with global Z scores larger
than 3 and local Z scores larger than 2 (see table 11).
** PLACE TABLE 11 HERE **
A less encouraging observation is the sensitivity of the results to structural
fluctuations. The THOM2 model can identify related structures only if their distance is
not too large. Seven, out of 14 homologous structures with the DALI [37] Z score for
structure-to-structure alignment larger than 10, are detected with high confidence. Only
one homologous structure with the DALI Z score lower than 10 is detected.
We would like to point out that only six structures (three pairs of structures
representing three folds) of the S47 set had homologous counterparts in the training set. It
is therefore reassuring that most of the native structures and significant fraction of the
relatives are recognized both in terms of their energies and the Z scores. Moreover, there
are no further significant hits into other structures from the TE set. Hence, no false
positives above our confidence thresholds are observed in this test. We conclude that our
nearly exact learning (on a training set) preserves significant capacity for identification of
new folds using optimal alignments with gaps.
29
IV.3 Assessing the specificity of the protocol
We present here examples of family recognition (i.e. identification of homologs) in
terms of energy and double Z score filter. Only a few homologs are identified in a large
set of (decoy) structures. This allows us to assess the specificity of the protocol,
providing also a limited analysis of the sensitivity (see the next section for an extended
assessment of the sensitivity). The test set S1082 is used. Eight families that have at least
three representatives in the S1082 set are chosen to illustrate various aspects of THOM2
threading alignments as compared to DALI [37] structure-to-structure alignments as well
as sequence-to-sequence alignments. The latter ones are generated using SmithWaterman algorithm [18], with the BLOSUM50 [39] substitution matrix and structurally
biased gap penalties (see section III.3). Since we do not incorporate family profiles in our
threading protocol, we consider here only pairwise sequence alignments for comparison.
Similarly to threading, the confidence of sequence matches is estimated using Z
scores, defined by the distribution of scores for shuffled sequences. We find that
structurally biased gap penalties improve the recognition in case of weak sequence
similarity. We do not observe false positives (if there is no clear evidence of sharing a
common ancestor and a common function, then the structural dissimilarity is used to
define false positives) with more than 50% of the query sequence aligned and with a Z
score larger than 8 (for sequence alignment). Note that the distribution of Z scores for
sequence substitution matrices is different from that of threading potentials, with very
high Z score for highly homologous sequences.
Regarding the specificity of threading results for the families considered here, we
point out that there are only two energy based predictions with relatively high: global and
30
local threading Z scores that are false. They are still below our thresholds. The highest
scoring false positive, namely the alignment of the aspartyl protease 1htrB into the
xylanase 1clxA (Z scores of 3.7 and 1.5, when converged using 1000 shuffled sequences
- see table 12.d), is still below our cutoffs. The alignment of the zinc finger protein
1meyC into the Adrl DNA-binding domain 2adr is potentially the highest scoring false
positive among the sequence-based matches. However, even though 1meyC and 2adr are
structurally dissimilar according to DALI (RMS of 7.9 angstrom for 40 residues), they
share very high sequence similarity (42% for 55 residues), have similar function and are
classified as related folds (zinc finger design and classic zinc finger, respectively) by
SCOP. Other false positives due to the sequence-to-sequence alignments obtain Z scores
between 5 and 7, which makes predictions based on weak sequence similarity difficult.
** PLACE TABLE 12 HERE **
Regarding the sensitivity of the protocol, one finds first that all the native structures
are with the lowest energies and are recognized with high confidence in terms of the
double Z score filter. We observe a varying degree of success in the recognition of family
members and structural homologs, as illustrated in table 12.a to 12.h. Threading
predictions are very robust for RAS, lactoglobulin and glutathione transferase families. In
the case of the RAS family (table 12.a), a number of matches into remote structural
relatives that share certain structural motifs with the RAS fold is observed. The structural
similarity between lactoglobulins and bilin-binding proteins (that do not share detectable
sequence similarity) is recognized (see alignment of 2blg into 2apd in table 12.b).
Glutathione transferases 1aw9 and 1axdA, with very weak signals from sequence
alignments, are recognized as well.
31
On the other hand, there are families for which threading performance is erratic,
including phosphotransferase, cytochrome and zinc finger families that include matches
recognizable by sequence alignment, of similar length and significant structural
similarity, yet not recognized by threading (see table 12.c to 12.f). The results for the
pepsin-like acid proteases (table 12.g) demonstrate missing matches due to significant
differences in length, which are difficult to account for in global alignments. Local
sequence and threading alignments for proteases 1pfzA and 1lyaB result in high Z scores,
but no signal from global threading alignment is observed. The family of small toxins is
an example of relatively weak signals (both from threading and sequence alignment) that
are below our universal cutoffs for false positives (see table 12.h).
IV.4 Assessing protein family signals and the sensitivity of the protocol
Three families are considered here: globins (92 proteins), immunoglobins (Fv
fragments, 137 proteins) and the DNA-binding, POU-like domains (26 proteins).
Sequences of all family members are aligned optimally to all the structures in the family.
Both, the local and global alignments are generated for each sequence-structure pair and
the results are compared in terms of a simplified version of the double Z score filter
discussed before. Ideally, all the scores should be above the thresholds we presented
earlier. The scores should also correlate with the RMS distance. The THOM2 results are
compared to the results of the TE pairwise potential, which was trained on the same (TE)
set using LP protocol [25].
The alignments due to the TE potential are computed using the first iteration of the
frozen environment approximation (FEA) [23]. In THOM2, the number of neighbors to a
secondary site determines its identity, whereas in FEA it is approximated by the identity
32
of the native residue at that site. In principle, the FEA should be iterated until selfconsistency is achieved [23]. Alternative to FEA are global optimization techniques [22]
that are computationally expensive and difficult to use at the scale of testing presented
here. Purely structural characterization of contact types in THOM2 avoids this problem,
making the THOM2 potential amendable to dynamic programming, at least for global
alignments (see section IV.2).
Figures 4.a, 4.b and 4.c show the joint histograms of the sum of Z scores for local
and global THOM2 threading alignments (with trained gap penalties of table 8.b) vs. the
RMS deviations between the superimposed side chain centers (see section III.1), for
globins, immunoglobins and POU-like domains, respectively. The vertical lines in the
figures correspond to the sum of global and local Z scores equal to 5, which
approximately discriminates the high confidence matches (with the sum of local and
global Z scores larger than 5) and lower confidence matches that might be obscured by
the false positives. Nearly all pairs differing by less than 3 angstrom RMSD can be
identified by THOM2 threading alignments. Most of the matches in the range between 3
and 5 angstrom can be still identified with high confidence. Overall, 60%, 90% and 95%
of homologs with RMSD smaller than 5 angstrom are recognized, for POU, globins and
immunoglobins families, respectively. However, the number of matches with high
confidence quickly decreases with the growing RMS distance.
The population of matches that are difficult to identify by pairwise sequence-tosequence alignments, with structurally biased gap penalties (see sections III.3 and IV.3),
is represented by the filled squares. All the matches represented by circles can be
identified with high confidence by sequence-to-sequence alignments (i.e. they result in Z
33
scores larger than 8.0). Essentially all the pairs with RMSD smaller than 3 angstrom are
identified by sequence alignments as well. Below this threshold, we observe many
matches that can be still identified by threading but not by sequence alignment (filled
rectangles with the sum of threading Z scores higher than 5). We also found examples of
matches detected with high confidence by threading and not detected by PsiBLAST [47]
(with default parameters and the PDB database) in many of the families considered here,
for example: globins 1flp and 1ash, immunoglobin 2hfm and T-cell receptor 1cd8, toxins
1acw and 1pnh, lactoglobulin 2blg and bilin-binding protein 2apd, pheromons 2erl and
1erp, POU-like proteins 1akh and 1mbg. On the other hand, there are many sequence
alignment matches that are not detected by threading.
** PLACE FIGURE 4 HERE **
The performance of THOM2 and TE potentials is compared using one-dimensional
histograms for the sum of Z scores for local and global threading alignments. For the sake
of fair comparison, the ad hoc gap penalties, as defined in section III.3, are used for both
potentials. As can be seen from figures 4.d and 4.e, for globins and POU-like domains the
number of low Z scores for THOM2 is smaller than the number of low Z scores obtained
with the TE potential and FEA. For example, the number of low confidence matches
(which can be still roughly defined as matches below the cutoff of 5) for globins
increases from 2401 in case of THOM2 to 3350 (out of 8558 matches) in case of the TE
potential. One can also notice that the distribution of Z scores is different. The TE
potential yields many high Z scores for alignments into very close homologs as opposed
to lower scores for more divergent pairs.
34
The somewhat worse performance of the pairwise model for these two families
may result from the sub-optimality of the alignments that we generate using the FEA.
Interestingly, FEA with the TE potential fails also for a larger number of native
alignments. For example, in the family of DNA binding proteins the number of native
alignments with very low Z scores (smaller than 4) is equal to 7 for TE and only 2 for
THOM2.
On the other hand, there are families for which the TE potential works better. An
example is the family of the immunoglobins (figure 4.f). The FEA is expected to perform
well when the sequence similarity is sufficiently high, since the information about the
native sequences is used to generate optimal alignments. The divergence in terms of what
can be detected by sequence similarity is larger for globins and POU-like proteins than
for immunoglobins. For example, contrary to other families considered here, all the
immunoglobins with RMSD smaller than 4 angstrom can be detected by sequence
alignments (see figure 4.c). Therefore, good performance of the FEA with the TE
potential is expected in this case.
The above observation is further supported by the results of the FEA with the TE
potential for eight families from the S1082 set, considered in the previous section. We do
not include here detailed results. Instead, we summarize them. The threading results with
the FEA and the TE potential are robust (and comparable to the THOM2 results) for
RAS, SH3 and acid protease families that are represented by proteins of high sequence
similarity. The results of the FEA are considerably worse for lactoglobulins and
glutathione transferase families that are characterized by much lower success of
sequence-based recognition (see table 12). At the same time the FEA performs as poorly
35
as THOM2 for cytochrome and zinc finger families. An exception is observed for the
toxin family, for which the FEA performs considerably better than THOM2, although
there is no (or low) sequence similarity for some of the matches.
V.
Conclusions and final remarks
In the present manuscript we proposed and applied an automated procedure for the
design of threading models. The strength of the procedure, which is based on linear
programming tools, is the automation and the ability of continuous exact learning. The
LP protocol was used to evaluate different energy functions for accuracy and recognition
capacity. Keeping in mind the necessity for efficient threading algorithms with gaps we
selected the THOM2 model as our best choice.
Statistical filters based on local and global Z scores were outlined. We observe
that, while using conservative Z scores that essentially exclude false positives, the new
protocol recognizes correctly (without any information about sequences) most of the
family members with the RMS distance between the superimposed side chain centers of
up to 5 angstrom and differences in length of up to 10 percent. We also observe many
instances of successful recognition of family members that are not recognized by pair
energies with the so-called frozen environment approximation.
The present approach is based on fitness of sequences into structures. Nevertheless,
it is easily extendable to include also sequence similarity, family profiles or secondary
structures. Such complementary “signals” are often employed in conjunction with
pairwise potentials [9-11,16]. Threading protocols that are based exclusively on contact
models were shown (consistent with our observations) to be quite sensitive to variations
36
in structures [48]. The THOM2 model provides an alternative comparable in performance
to pairwise potentials. Therefore it can be used as a fast component of fold recognition
methods employing pair energies, which is the target of a future work.
Despite the limitations of the threading protocol that is based on the THOM2
potential and the double Z score filter (in terms of range of variations in structure and
length that can be recognized), we found a number of useful predictions for remote
homologs (e.g. [49]). Therefore, we decided to take part (group 280) in the recently held
Critical Assessment of Fully Automated protein Structure Prediction methods (CAFASP)
[50], even at the preliminary phase, without utilizing additional information as secondary
structures or family profiles. The performance of the LOOPP server [30] was about
average for all fold recognition targets (e.g. LOOPP missed some targets recognizable by
Psi-BLAST). However, in the category of difficult-to-recognize targets, it was ranked
among the best servers (rank 4 in the MaxSub 5.0 A evaluation), providing the best
predictions among the servers for two difficult targets (T0097 and T0102) [50].
Acknowledgements
This research was supported by an NIH NCRR grant to the Cornell Theory Center
(acting director – Ron Elber) for the developments of Computational Biology Tools. It
was further supported by a seed grant from DARPA to Ron Elber. Jaroslaw Meller
acknowledges also partial support from the Polish State Committee for Scientific
Research (grant 6 P04A 066 14).
37
References
1. Bowie JU, Luthy R, Eisenberg D, “A method to identify protein sequences that
fold into a known three-dimensional structure”, Science, 1991; 253:164-170
2. Jones DT, Taylor WR, Thornton JM, “A new approach to protein fold
recognition”, Nature 1992; 358:86-89
3. Sippl MJ, Weitckus S, “Detection of native-like models for amino acid sequences
of unknown three-dimensional structure in a database of known protein
conformations”, Proteins 1992; 13:258-271
4. Godzik A, Kolinski A, Skolnick J, “Topology fingerprint approach to the inverse
folding problem”, J. Mol. Biol. 1992; 227:227-238
5. Ouzounis C, Sander C, Scharf M, Schneider R, “Prediction of protein structure by
evaluation of sequence-structure fitness. Aligning sequences to contact profiles
derived from 3D structures”, J. Mol. Biol. 1993; 232:805-825
6. Bryant SH, Lawrence CE, “An empirical energy function for threading protein
sequence through folding motif”, Proteins 1993; 16:92-112
7. Matsuo Y, Nishikawa K, “Protein structural similarities predicted by a sequencestructure compatibility method”, Protein Sci. 1994; 3:2055-2063
8. Mirny LA and Shakhnovich EI, “Protein structure prediction by threading. Why it
works and why it does not”, J. Mol. Biol. 1998; 283:507-526
9. Jones DT, “GenTHREADER: An Efficient and Reliable Protein Fold Recognition
Method for Genomic Sequenecs”, J. Mol. Biol. 1999; 287:797-815
38
10. Panchenko AR, Marchler-Bauer A, Bryant SH, “Combination of threading
potentials and sequence profiles improves fold recognition”, J. Mol. Biol. 2000;
296: 1319-1331
11. Sternberg MJE, Bates PA, Kelley LA, MacCallum RM, “Progress in protein
structure prediction: assessment of CASP3”, Curr. Opin. Struct. Biol. 1999;
9:368-373
12. Liwo A, Oldziej S, Pincus MR, Wawak RJ, Rackovsky S, Scheraga HA, “A
united-residue force field for off-lattice protein structure simulations: functional
forms and parameters of long range side chain interaction potentials from protein
crystal data”, J. Comp. Chem. 1997; 18: 849-873
13. Xia Y, Huang ES, Levitt M, Samudrala R, “Ab initio construction of protein
tertiary structures using a hierarchical approach”, J. Mol. Biol. 2000; 300: 171185
14. Babajide A, Hofacker IL, Sippl MJ, Stadler PF, “Neural networks in protein
space: a computational study based on knowledge-based potentials of mean
force”, Folding Design 1997: 2: 261-269
15. Babajide A, Farber R, Hofacker IL, Inman J, Lapedes AS, Stadler PF, “Exploring
protein sequence space using knowledge based potentials”, J. Comp. Biol., 1999.
submitted
16. Elofsson A, Fischer D, Rice DW, Le Grand S, Eisenberg D, “A study of
combined structure-sequence profiles”, Folding & Design 1998; 1:451-461
39
17. Needleman SB, Wunsch CD, “A general method applicable to the search for
similarities in the amino acid sequences of two proteins”, J. Mol. Biol. 1970;
48:443-453
18. Smith TF, Waterman MS, “Identification of common molecular subsequences”, j.
Mol. Biol. 1981; 147:195-197
19. Johnson MS, Overington JP, Blundell TL, “Alignment and searching for common
protein folds using a data bank of structural templates”, J. Mol. Biol. 1993;
231:735-752
20. Croman HT, Leiserson CE, Rivest RL, “Introduction to algorithms”, The MIT
press, Cambridge MA 1985, chapter 16
21. Lathrop RH, Smith TF, “Global optimum protein threading with gapped
alignment and empirical pair score functions”, J. Mol. Biol. 1996; 255: 641-665
22. Lathrop RH, “The protein threading problem with sequence amino-acid
interaction preferences is NP-complete”, Protein Eng. 1994; 7:1059-1068
23. Goldstein RA, Luthey-Schulten ZA, Wolynes PG, “The statistical mechanical
basis of sequence alignment algorithms for protein structure prediction”, in
“Recent developments in theoretical studies of proteins”, Ed. Ron Elber, World
Scientific, Singapore, 1996, chapter 6.
24. Maiorov VN and Crippen GM, “Contact potential that recognizes the correct
folding of globular proteins”, J. Mol. Biol. 1992; 227:876-888
25. Tobi D, Shafran G, Linial N, Elber R, “On the design and analysis of protein
folding potentials”, Proteins: Structure Function and Genetics 2000, 40: 71-85
40
26. Vendruscolo M, Domany E, “Pairwise contact potentials are unsuitable for
protein folding”, J. Chem. Phys. 1998; 109:11101-11108
27. Meller J, Wagner M, Elber R, “Maximum feasibility guideline in the design and
analysis of protein folding potentials”, J. Comp. Chem. 2001, submitted
28. Meszaros CS, “Fast Cholesky factorization for interior point methods for linear
programming”, Computer and Mathematics with Applications 1996; 31:49-51
29. Adler I, Monteiro RDC, “Limiting behavior of the affine scaling continous
trajectories for linear programming problems”, Math. Program. 1991; 50:29-51
30. Taylor WR, Munro RE, “Multiple sequence threading: conditional gap
placement”, Folding & Design 1997; 2:S33-S39
31. Hinds DA, Levitt M, “Exploring conformational space with a simple lattice model
for protein structure”, J. Mol. Biol. 1994; 243:668-682
32. Bryant SH and Altschul SF, “Statistics of sequence-structure threading”, Curr.
Opin. Struct. Biol. 1995; 5:236-244
33. Fitch WM, “Random sequences”, J. Mol. Biol. 1983; 163:171-176
34. Altschul SF, Erickson BW, “Significance of nucleotide sequence alignments: a
method for random sequence permutation that preserves dinucleotide and codon
usage”, Mol. Biol. Evol. 1985; 2:526-538
35. Fischer D, Elofsson A, Rice D, Eisenberg D, “Assesing the performance of fold
recognition methods by means of a comprehensive benchmark”, in: Pacific
Symposium on Biocomputing, Hawaii 1996; 300-318
41
36. CASP3. Third community wide experiment on the critical assessment of
techniques for protein structure prediction, Proteins: Structure Function and
Genetics 1999, Suppl. 3, see also http://Predictioncenter.lnl.gov/casp3
37. Holm L, Sander C, “The FSSP database of structurally aligned protein fold
families”, Nucleic Acids Res. 1994; 22:3600-3609, see also DALI server:
http://www2.embl-ebi.ac.uk/dali
38. Meller J, Elber R, Learning, Observing and Outputting Protein Patterns (LOOPP)
-- a program for protein recognition and design of folding potentials,
http://www.tc.cornell.edu/CBIO/loopp
39. Henikoff S, Henikoff JG, “Amino acid substitution matrices from protein blocks”,
PNAS USA 1989; 89: 10915-10919
40. Gambel EJ, “Statistics of extremes”, New York: Columbia University Press, 1958
41. Karlin S, Altschul SF, “Methods for assessing the statistical significance of
molecular sequence features by using general scoring schemes”, Proc. Natl. Acad.
Sci. 1990; 87:2264-2268
42. Pearson WR, Lipman DJ, “Improved tools for biological sequence comparison”,
Proc. Natl. Acad. Sci. USA 1988; 85:2444-2448; Pearson WR, “Empirical
statistical estimates for sequence similarity searches”, J. Mol. Biol. 1998; 276:7184
43. Godzik A, Kolinski A, Skolnick J, “Are proteins ideal mixtures of amino acids?
Analysis of energy parameter sets”, Protein Sci. 1995; 4: 2107-2117
42
44. Betancourt MR, Thirumalai D, “Pair potentials for protein folding: choice of
reference states and sensitivity of predicted native states to variations in the
interaction schemes”, Protein Sci. 1999; 2: 361-369
45. Miyazawa S, Jernigan RL, “Residue-residue potentials with a favorable contact
pair term and an unfavorable high packing density term for simulation and
threading”, J. Mol. Biol. 1996; 256:623-644
46. Hinds DA, Levitt M, “A lattice model for protein structure prediction at low
resolution”, Proc Natl. Acad. Sci. USA 1992; 89: 2536-2540
47. Altschul SF, Madden TL, Schaffer AA, Zhang J, Zhang Z, Miller W, Lipman DJ,
“Gapped BLAST and PSI-BLAST: a new generation of protein database search
programs”, Nucleic Acid Res. 1997; 25:3389-3402
48. Bryant SH, “Evaluation of threading specificity and accuracy”, Proteins 1996; 26:
172-185
49. Frary A, Nesbitt TC, Frary A, Grandillo S, van der Knaap E, Cong B, Liu J,
Meller J, Elber R, Alpert KP, Tanksley SD, "Cloning Transgenic Expression and
Function of fw2.2: a Quantitative Trait Locus Key to the Evolution of Tomato
Fruit", Science 2000; 289: 85-88
50. Fischer D, et al. CAFASP-2: The Second Critical Assessment of Fully Automated
Structure Prediction Methods. Proteins, Dedicated CASP4 issue, submitted, 2001.
43
Figure legends
1. Contour plots of the total energy contributions to the native alignments in the TE
set for valine and lysine residues as a function of the number of neighbors in the
first and second shells. Figure 1.a shows that contacts involving valine residues
with five to six neighbors with other residues of medium number of neighbors
stabilize most of the native alignments. On the other hand (figure 1.b), only
contacts involving lysine residues with a small number of neighbors stabilize
native alignments.
2. The probability distribution function of the Z scores computed with local
threading alignments for the population of false positives. A set of 47 sequences
of proteins included in the S47 set is used to sample the distribution of the Z
scores for false positives (proteins of the S47 set have no homologs in the HL set see text for details). Each of the sequences is aligned to all the structures included
in the HL set. The Z scores are calculated for the two hundred best matches
(according to energy), using 100 shuffled sequences. The observed distribution of
Z scores for 6813 local threading alignments is represented by '+'. Note the
significant tail to the right, which means a relatively high likelihood of observing
false positives with large Z scores. The dotted line shows the attempted analytical
fit to the gaussian distribution, whereas the solid line the attempted fit to the
extreme value distribution (EVD). Note that actual distribution deviates
significantly from both. According to the analytical fit to the EVD, the probability
of
observing
a
Z
score
larger
than
P  Z P   1  exp   exp  1.313   Z P  0.466    ,
44
ZP
by chance
with
the
is
98%
equal
to
confidence
intervals: 1.313  0.112 and 0.466  0.079 . For example, the probability of
observing by chance a Z score larger than 4 is equal to 0.003. We emphasize
however, that the analytical fit to the extreme value distribution provides an upper
bound for the observed number of observed false positives.
3. The joint probability distribution for the Z scores of global and local alignments.
Circles at the lower left corner represent a population of 1081 false positives,
resulting from the alignments of the S47 set sequences (see figure 4) against all
structures in the HL set (100 best global and 200 best local matches are
considered, disregarding matches with positives energies of global alignments).
The best pair scoring false positive is slightly below the threshold (3,2). The
population in the right upper corner represents (square boxes) 331 pairs of HL
sequences aligned to HL structures with global Z scores larger than 2.5 and local
Z scores larger than 1. This set includes 236 native alignments and 95 non-native
alignments. There are 10 matches that are false positives (filled squares) and they
are all below the threshold (3,2). Stiffer energy constraints were employed with
only the 10 best global and 200 best local alignments considered. There is a
population of true positives below (2.5,1.0), which are not shown in the figure
(including 10 native alignments). However, the number of false positives below
this threshold makes predictions in this range difficult.
4. Comparison of family recognition by THOM2 and pair energies. The results of
THOM2 (with the trained gap penalties of table 8.b) for families of globins, POUlike domains and immunoglobins (Fv fragments) are shown in figures 4.a, 4.b and
4.c, respectively. The joint histograms of the sum of Z scores for local and global
45
threading alignments vs. the RMS deviations between the superimposed
(according to structure-to-structure alignments) side-chain centers are presented.
The population of matches that are difficult to identify by sequence-to-sequence
alignments is represented by the filled squares. Next, the THOM2 results are
compared to the results of Tobi-Elber (TE) pairwise potential [25], using ad hoc
gap penalties defined in section V.2. The TE potential was optimized using LP
protocol and the same training set. The first iteration of the so-called frozen
environment approximation (FEA) [23] is performed to obtain approximate
alignments for the TE potential. Figures 4.d, 4.e and 4.f show one-dimensional
histograms of the sum of Z scores for local and global threading alignments for
the globins, POU and immunoglobins families, respectively. Note, that the
number of low THOM2 Z scores (smaller than 5) is smaller for families of
globins and POU-like proteins. On the other hand the TE potential and the FEA
perform better for the family of immunoglobins, which is also easier for sequence
alignment methods (see text for details).
Table legends
1. The definitions of different groups of amino acids that are used in the present
study. Note that ten types of amino acids are found to be sufficient to solve the
Hinds-Levitt set either by pairwise interaction models or by THOM2. The amino
acid types are: HYD POL CHG CHN GLY ALA PRO TYR TRP HIS. The list
implies that when an amino acid appears explicitly, it is excluded from other
groups that may contain it. For example, HYD includes in this case CYS, ILE,
46
LEU, MET, and VAL while CHG includes ARG and LYS only, since the
negatively charged residues form a separate group.
2. Definitions of energy types for the THOM2 energy model. There are fifteen types
of energy terms in THOM2 that are based on contacts in the first and the second
contact layers. A contact between two amino acids is “on” if their distance is
smaller than 6.4 angstrom. We consider five types of contacts in the first layer and
three in the second layer. Thus there are 20 15  300 different energy terms for
twenty different amino acids. A reduced set of amino acids is associated with a
smaller number of parameters to optimize (for ten types of amino acids the
number of parameters is 10 15  150 ). The notation we used for each type of site
is based on a representative number of neighbors. The number of neighbors, n, in
a given class and its representative are given in the first column (for different
classes of sites in the first layer) and in the first row (for different classes of sites
in the second layer). The intersections between columns and rows correspond to
contacts of different types: a contact between two sites of medium number of
neighbors is denoted by  5, 5  , for example.
3. Comparing the capacity of different threading potentials. Capacity for recognition
of pairwise and profile threading potentials is measured by gapless threading on
HL and TE representative sets of proteins (see section III.1). THOM1 performs
significantly worse than pairwise potentials. THOM2 is showing a comparable
performance and is able to learn the TE set (see also table 10). For each potential
the number of amino acids types used and the resulting number of parameters are
reported. The training set used (either HL or TE) is indicated by a star (*) symbol.
47
The number of correct predictions for structures in HL and TE sets is given in the
second and third columns, respectively.
4. Parameters of some of the threading potentials trained using the LP protocol.
Numerical values of the energy parameters for THOM1 potential trained on HL
set of proteins (table 4.a) and THOM2 potential trained on TE set of proteins
(table 4.b, see text for details). The rows in the tables correspond to either
different types of sites (THOM1) or contacts (THOM2). The columns correspond
to different types of amino acids.
5. Characterization of native and decoy structures. Frequencies of different types of
sites (relevant for the training of THOM1) found in the native structures of HL set
as opposed to decoy structures generated using the HL set are presented in table
5.a. In THOM1 the type of site is defined by number of its neighbors (n).
Frequencies are defined by the percentage from the total number of 53012 native
sites in HL set and 556.14 millions of decoy sites generated using HL set,
respectively. Frequencies of different types of contacts (appropriate for the
training of THOM2) found in the native structures of TE set as opposed to decoy
structures generated using TE are given in table 5.b. Different classes of contacts
are specified in table 2. Frequencies are defined by the percentage from the total
number of 439364 native contacts in TE set and 10089.19 millions of decoy
contacts generated using TE set, respectively. The overall site and contact
distributions are split into distributions for hydrophobic and polar residues (as
defined in table 1), which are given in the parenthesis.
48
6. Cooperativity in effective pairwise interactions of the THOM2 potential. For a
pair of two amino acids  and  in contact, we have 25 different possible types
of contacts (and consequently 25 different effective energy contributions) as 
and 
may occupy sites that belong to one of the five different types
characterized by the increasing number of contacts in the first contact shell (see
table 2). Moreover, the 5 by 5 interaction matrix will be in general asymmetric.
The effective energies of contact between two VAL residues with a different
number of neighbors are given in table 6.a, whereas the energies of contacts
between two LYS residues are given in table 6.b.
7. Pairs of homologous structures used for the training of gap penalties. For each
pair the native and the homologous structures are specified by their PDB codes,
names and lengths in the first and second column, respectively. In the third
column, the similarity between the native and the homologous proteins is defined
in terms of the sequence identity [%], RMS distance [Ang] and length [number of
residues], as defined by structure-to-structure alignments obtained by submitting
the corresponding pairs to the DALI server [37].
8. The gap penalties for THOM2 model as trained by the LP protocol. The limited
set of homologous structures from table 7 is used. Initial guess of gap penalties
for different types of sites in the THOM1 model is given in table 8.a. Optimized
gap penalties for different types of contacts in the THOM2 model are given in
table 8.b. Penalties that are not specified explicitly are equal to the maximum
value of 10.0. Note that the training here is limited and will be extended in a
future work.
49
9. Comparison of alignments of myoglobin (1mba) sequence into leghemoglobin
(1lh2) structure. The THOM1 alignment with the initial gap penalties is included
in table 9.a, whereas the THOM2 alignment with trained gap penalties is included
in table 9.b. Note that the location of insertions in the initial alignment (which is
used for training of gap energies) is to a large extent consistent with the DALI
structure to structure alignment [37], which aligns: residues 2-50 of 1mba to 3-51
of 1lh2, residues 53-56 of 1mba to 52-55 of 1lh2 (implying deletions at positions
51 and 52 in 1mba), residues 59-80 of 1mba to 56-77 of 1lh2, residues 81-86 of
1mba to 82-87 of 1lh2, residues 87-121 of 1mba to 89-123 (with the implied
insertion at position 88 in 1lh2), residues 122-139 of 1mba to 126-143 of 1lh2
(implying two insertions at positions 124 and 125 in 1lh2) and residues 140-145
of 1mba to 145-150 of 1lh2 (with an insertion at position 144 in 1lh2),
respectively. Alpha helices in both structures are marked in bold. Note that F and
G helices are shifted considerably in the DALI alignment (there is no counterpart
of the D helix in 1lh2). The initial THOM1 alignment is in perfect agreement with
the DALI superposition between residues 88 and 150 of 1lh2, except for two
insertions at positions 128 and 147 (shifted by three residues with respect to the
DALI alignment). The insertions at positions 88, 125, 151 and 153 coincide with
the DALI alignment. The THOM2 alignment, with trained gap penalties of table
9.b, is in perfect agreement with the DALI superposition for residues 10 to 50 of
1lh2 (including A, B and C helices) and then departs from the DALI alignment,
overlapping E, F and G helices with a smaller shift.
50
10. Comparison of THOM2 and knowledge-based pairwise potentials using gapless
threading. The results of gapless threading on the S1082 set (see text for the
details) are reported. The results of THOM2 potential are compared to five other
knowledge-based pairwise potentials by: Betancourt and Thirumalai (BT) [44],
Hinds and Levitt (HL) [46], Miyazawa and Jerningan (MJ) [45], Godzik, Kolinski
and Skolnick (GKS) [43] and Tobi and Elber (TE) [25]. The latter potential was
trained using LP protocol and the same (TE) training set. Potentials are ordered
according to the number of proteins recognized exactly (out of 1082), given in the
2nd column. The number of inequalities that are not satisfied (out of
approximately 95 million inequalities generated from the S1082 set) is given in
the 3rd column (in the units of millions). The 4th column reports Z scores (i.e. the
ratios of the first and second moments) for the distributions of energy differences
between the native and misfolded structures. Note lack of correlation between the
number of proteins that are missed and the number of inequalities, which are not
satisfied. See text for more details.
11. Self-recognition for folds that were not learned. The S47 set of proteins is used in
order to test the self-recognition. It is also a test of the sensitivity of the results to
structural fluctuations for 25 different folds (of which 22 were not represented in
the training set) using the double Z score test. Pairs of homologous structures
belonging to the S47 set are specified in the first column (three folds are
represented by a single structure, for 2a2u its structural relative from the training
set is included), using their PDB codes and lengths (specified in the parenthesis).
If the domain is not specified and one refers to a multi-domain protein, then the A
51
(or first) domain is used. The results of global and local THOM2 threading
alignments of the 25 query sequences into an extended TE + S47 set are reported
in the 3rd and 4th column, respectively. Query sequences are indicated in italic
(for each pair the first line describes the native alignment and the second line an
alignment into a homologous structure). Two of 25 native alignments gave weak
signals (DNA binding protein 1blo and glycosidase 1bhe). Four other native
alignments (2a2u, 1byf, 1jwe and 1bkb) result in global Z scores somewhat
smaller than 3. The DALI [37] Z scores and RMS deviations for structure-tostructure alignments into native and homologous structures are reported in the
second column. Low DALI Z scores indicate that only short fragments of the
respective structures are aligned and the resulting RMS deviation may not be
representative. Most of the homologous structures with the DALI Z score larger
than 10 are recognized with a high confidence. The alignment of the 2a2u
sequence into the 1bbp structure was the only significant hit of any of the query
sequences into the structures included in the training (TE) set. Thus, no false
positives with scores above our confidence cutoffs were observed. High
confidence predictions (global Z score larger than 3.0 and local Z score larger
than 2.0) are indicated in bold.
12. Examples of predictions for families of homologous proteins. Eight families, with
a number of representatives included in the S1082 set, are used to illustrate
various degrees of success of our threading protocol in terms of sensitivity and
specificity. The results for global and local threading alignments using THOM2
potential, together with the results for (structurally biased) local sequence-to-
52
sequence alignments and DALI structure-to-structure alignments, are reported.
Names of proteins (PDB codes) and their lengths are included in columns 1 and 5,
respectively. Representatives used as query sequences that are aligned to all the
structures in the S1082 set are marked in bold. The Z scores reported in columns
2, 3 and 6, respectively, are computed using 50 shuffled sequences for a number
of alignments with the lowest energies: 20 best matches in case of global
threading (GT) alignments, 500 best matches in case of local threading (LT)
alignments and 50 best matches in case of local sequence (LS) alignments. The
rank of the energy of global threading alignments is reported in the 4th column.
The DALI [37] alignments between the (known) structure of a query and the
structure of a match are characterized in the last column: Z score, RMS distance,
length of the aligned fragment and the identity for this fragment are provided (a
star symbol indicates that comparisons with the FSSP representative of the query
structure are used instead of a direct DALI alignment). A dash symbol is used to
indicate the lack of a detectable (threading, sequence or structural) similarity.
Matches are ordered according to the sum of global and local threading Z scores
and according to Z scores of the local sequence alignments if no threading signal
is detected. False positives (defined as matches with DALI Z scores lower than
2.0) are indicated in italic. The highest scoring false positives for both: threading
and sequence alignments are reported for each family. Examples of families, with
successful threading predictions that do not share a detectable sequence similarity
or have a weak signal from sequence-to-sequence alignment (Z score lower than
8.0), are included in table 12.a, 12.b and 12.c. For families included in table 12.d,
53
12.e and 12.f threading is less successful, missing a number of family members
(of similar length) that can be detected by sequence-to-sequence alignment. Lack
of detection when the difference in length is significant (see table 12.g) is
expected and it is one of the limitations of the present protocol. For the last family
(table 12.h) the DALI results could not be retrieved and therefore the SCOP
classification is used to define structural relatives (i.e. proteins that do share the
knottins fold).
54
Table 1
Hydrophobic (HYD)
ALA CYS HIS ILE LEU MET PHE PRO TRP TYR VAL
Polar (POL)
ARG ASN ASP GLN GLY LYS SER THR
Charged (CHG)
ARG ASP GLU LYS
Negatively Charged (CHN)
ASP GLU
Table 2
Type of site*
n’=1,2; 1
n’=3,4,5,6; 5
n’  7; 9
n=1,2; 1
( 1,1 )
( 1, 5 )
( 1, 9 )
n=3,4; 3
( 3,1 )
( 3, 5 )
( 3, 9 )
n=5,6; 5
( 5, 1 )
( 5, 5 )
( 5, 9 )
n=7,8; 7
( 7, 1 )
( 7, 5 )
( 7, 9 )
n  9; 9
( 9, 1 )
( 9, 5 )
( 9, 9 )
Table 3
POTENTIAL
Hinds-Levitt set
Tobi-Elber set
Pairwise, HP model, par. free
200
456
Pairwise, 10 aa, 55 par
246*
504
Pairwise, 20 aa, 210 par
246*
530
Pairwise, 20 aa, 210 par
237
594*
THOM1, 20 aa, 200 par
246*
474
THOM2, 10 aa, 150 par
246*
478
THOM2, 20 aa, 300 par
246*
428
THOM2, 20 aa, 300 par
236
594*
55
Table 4
Table 4.A: THOM1
ALA ARG ASN ASP CYS GLN GLU GLY HIS ILE LEU LYS MET PHE PRO SER THR TRP TYR VAL
(1) -0.02
0.10 -0.22
0.02 -0.13
0.02
0.05 -0.05 -0.15 -0.17 -0.04
0.13 -0.40 -0.52
0.29 -0.02
(2) -0.06 -0.23 -0.07
0.20 -0.37
0.21 -0.03 -0.06 -0.05 -0.30 -0.22
0.12 -0.20 -0.25
0.24 -0.01 -0.10 -0.57 -0.27 -0.25
(3) -0.02 -0.01 -0.01
0.43 -0.72
0.09
0.10
0.05 -0.25 -0.48 -0.37
0.19 -0.66 -0.58
0.06
0.05 -0.12 -0.77 -0.37 -0.38
(4) -0.17
0.12
0.29
0.37 -0.70
0.22
0.40
0.14 -0.31 -0.64 -0.41
0.60 -0.50 -0.68
0.22
0.00
0.21 -0.36 -0.39 -0.36
(5) -0.13
0.22
0.20
0.68 -1.13
0.33
0.45
0.38
0.24 -0.53 -0.50
0.37 -0.39 -0.65
0.31
0.31
0.02 -0.65 -0.78 -0.51
(6)
0.02
0.32
0.17
0.43 -1.16
0.02
0.70
0.42
0.36 -0.57 -0.58
0.63 -0.80 -0.82
0.75
0.27
0.24 -0.46 -0.72 -0.51
(7)
0.12 -0.10
0.30
0.27 -0.76 -0.54
0.73 -0.44 -0.40
0.42
0.09
0.36
0.57 -0.66
0.25
0.02
0.36
0.15 -0.26 -0.74 -0.59
1.66 -1.03
1.13
2.23 -0.57 10.00 -0.38 -0.13
0.43 -1.27
0.46
0.39
0.20
(8) -0.07
0.91 -0.12 -0.01 -1.60
0.51
0.83
0.29 -0.71 -1.37 -0.72
(9)
1.36
0.82 10.00
(10)
0.83
0.11
0.35 -1.71
1.57 10.00 10.00 10.00 10.00 10.00 10.00
2.12
3.38 -0.33
1.03 10.00
0.83 10.00 -0.93 -0.47 10.00 10.00
0.02 -0.20 -0.23 -0.16
0.12 -0.39 -0.78
0.40 10.00 10.00 10.00 -0.78 10.00
Table 4.B: THOM2
ALA ARG ASN ASP CYS GLN GLU GLY HIS ILE
LEU LYS MET PHE PRO SER THR TRP TYR VAL
(1,1)
0.23 -0.03 -0.03 -0.08 -0.82 -0.26
(1,5)
-0.21 -0.26 -0.10
(1,9)
-6.01 -4.09 -5.42 -6.14 -7.27 -5.88 -5.80 -5.81 -4.75 -5.46 -5.85 -4.91 -4.97 -5.83 -6.17 -5.89 -5.89 -5.25 -6.79 -6.99
(3,1)
-0.01 -0.10 -0.17
0.02 -0.50 -0.09
0.11
0.31
(3,5)
-0.08
0.18
0.15
0.13 -0.69
0.12
0.24
0.04 -0.03 -0.29 -0.21
(3,9)
-0.29
0.06 -0.33
0.08 -0.78
0.18
0.02 -0.13 -0.47 -0.60 -0.49
(5,1)
0.13 -0.21
0.04
0.22 -0.15 -0.11
0.08
0.48
(5,5)
0.06
0.16
0.20
0.17 -0.60
0.13
0.18 -0.04 -0.25 -0.19
(5,9)
-0.65
0.68 -0.26 -0.19 -0.82 -0.09
0.43 -0.36 -0.19 -0.47 -0.42
(7,1)
6.29
5.50
5.56
6.02
5.09
5.55
5.68
6.10
5.70
(7,5)
0.17
0.29
0.36
0.39 -0.28
0.28
0.45
0.33
(7,9)
0.08
0.41
0.00 -0.15 -0.30
0.04 -0.27
0.05
0.69
(9,1)
10.00
4.50
6.05
5.21
4.00
5.94 10.00 10.00 10.00 10.00
(9,5)
0.26
0.30
0.26
0.71
0.41 -0.02
(9,9)
0.20
0.04 -0.37 -1.34 -1.19
0.20 -1.11
0.09
0.29
0.07 -0.12 -0.16 -0.02
0.03
0.05 -0.07 -0.50 -0.64 -0.28
0.00 -0.08
0.00
0.03 -0.31 -0.23 -0.13 -0.15 -0.29 -0.23
0.07 -0.09 -0.60 -0.40 -0.36
0.04
0.47
0.32
0.04 -0.10 -0.10
0.11 -0.20 -0.17 -0.02
0.40
0.06 -0.31 -0.29 -0.05
0.14
0.06
0.08 -0.36 -0.28 -0.17
0.09 -0.85 -0.07
0.19
0.23
0.15 -0.15
0.03 -0.27
0.19 -0.15 -0.32 -0.06 -0.15 -0.27
0.17
0.19
0.34 -0.07
0.02
0.26 -0.26 -0.28
0.09
0.11
0.02 -0.36 -0.30 -0.27
0.34
0.32
0.07
0.55
0.22
0.01
0.04 -0.46 -0.58
5.26
6.08
5.64
5.80
5.82
5.23
5.48
6.42
0.28 -0.08 -0.01
0.50
0.24 -0.16
0.42
0.13
0.34
0.04 -0.08 -0.03
0.67
0.06
0.03 -0.71
0.82
0.24 -0.36
5.59
4.91
6.02
9.61 10.00 10.00
0.52 -0.19
0.43
3.07
0.43
1.41 -1.33
6.94
3.22 -0.54
0.83 -0.09
1.37 -1.36
0.21 -0.20
5.59
0.04 -0.17
6.22
1.26 -0.15
1.06 -1.99 -0.25 -0.29
56
0.08 -0.32 -0.05
5.17
0.19
5.53
0.14 -0.25
5.88 10.00 10.00
0.52 -0.08
0.08
0.21
0.81 -0.53 -0.52
0.71
Table 5
A:
Type of site*
(1)
(2)
(3)
(4)
(5)
(6)
(7)
(8)
(9)
(10)
Native (HYD / POL)
16.97 (4.89 / 12.09)
17.30 (6.06 / 11.24)
17.72 (8.29 / 9.43)
16.60 (9.68 / 6.92)
14.62 (10.16 / 4.47)
9.96 (7.66 / 2.30)
4.95 (4.02 / 0.92)
1.57 (1.32 / 0.25)
0.26 (0.21 / 0.05)
0.04 (0.04 / 0.00)
Decoys (HYD / POL)
24.20 (11.72 / 12.48)
21.72 (10.52 / 11.20)
18.70 (9.06 / 9.64)
15.00 (7.28 / 7.73)
10.79 (5.24 / 5.55)
6.04 (2.94 / 3.10)
2.63 (1.28 / 1.35)
0.77 (0.38 / 0.40)
0.12 (0.06 / 0.06)
0.02 (0.01 / 0.01)
Type of contact
( 1,1 )
Native (HYD / POL)
5.09 (1.59 / 3.50)
Decoys (HYD / POL)
11.34 (5.48 / 5.85)
( 1, 5 )
9.02 (2.99 / 6.04)
12.69 (6.15 / 6.54)
( 1, 9 )
0.41 (0.15 / 0.26)
0.35 (0.17 / 0.18)
( 3,1 )
6.25 (2.88 / 3.37)
9.51 (4.60 / 4.91)
( 3, 5 )
24.09 (13.01 / 11.08)
26.59 (12.91 / 13.68)
( 3, 9 )
3.23 (1.88 / 1.35)
2.29 (1.12 / 1.18)
( 5, 1 )
2.77 (1.81 / 0.96)
3.18 (1.54 / 1.64)
( 5, 5 )
28.36 (20.96 / 7.40)
22.09 (10.75 / 11.34)
( 5, 9 )
6.85 (5.11 / 1.74)
3.84 (1.87 / 1.96)
( 7, 1 )
0.40 (0.31 / 0.09)
0.34 (0.16 / 0.17)
( 7, 5 )
9.56 (8.00 / 1.56)
5.84 (2.85 / 3.00)
( 7, 9 )
3.21 (2.60 / 0.61)
1.54 (0.75 / 0.79)
( 9, 1 )
0.01 (0.01 / 0.00)
0.01 (0.01 / 0.01)
( 9, 5 )
0.52 (0.44 / 0.08)
0.29 (0.15 / 0.14)
( 9, 9 )
0.23 (0.19 / 0.04)
0.09 (0.05 / 0.05)
B:
57
Table 6
A:
V( 1 )
V( 3 )
V( 5 )
V( 7 )
V( 9 )
V( 1 )
-0.56
-0.41
-0.17
-1.46
3.01
V( 3 )
-0.41
-0.34
-0.44
-0.30
-0.07
V( 5 )
-0.17
-0.44
-0.54
-0.61
-0.38
V( 7 )
-1.46
-0.30
-0.61
-0.49
-0.76
V( 9 )
3.01
-0.07
-0.38
-0.76
-1.03
K( 1 )
K( 3 )
K( 5 )
K( 7 )
K( 9 )
K( 1 )
-0.03
-0.03
-0.19
1.18
0.69
K( 3 )
-0.03
0.28
0.40
0.58
0.61
K( 5 )
-0.19
0.40
0.52
0.83
0.86
K( 7 )
1.18
0.58
0.83
1.34
0.38
K( 9 )
0.69
0.61
0.86
0.38
-0.59
B:
Table 7
Native
Homologous
Similarity
1mba (myoglobin, 146)
1lh2 (leghemoglobin, 153)
20%, 2.8 Ang, 140 res
1mba (myoglobin, 146)
1babB (hemoglobin, chain B, 146)
17%, 2.3 Ang, 138 res
1ntp (-trypsin, 223)
2gch (-chymotrypsin, 245)
45%, 1.2 Ang, 216 res
1ccr (cytochrome c, 111)
1yea (cytochrome c, 112)
53%, 1.2 Ang, 110 res
1lz1 (lysozyme, 130)
1lz5 (1lz1 + 4 res insert, 134)
99%, 0.5 Ang, 130 res
1lz1 (lysozyme, 130)
1lz6 (1lz1 + 8 res insert, 138)
99%, 0.3 Ang, 129 res
58
Table 8
A:
Type of site
(0)
(1)
(2)
(3)
(4)
(5)
(6)
(7)
(8)
(9)
Penalty
0.1
0.3
0.6
0.9
2.0
4.0
6.0
8.0
9.0
10.0
B:
Type of contact
(0)
( 1,1 )
Penalty
1.0
8.9
( 1, 5 )
5.7
( 1, 9 )
10.0
59
Table 9
A:
.........1.........2.........3.........4.........5.........
SLSAAEADLAGKSWAPVFANKNANGLDFLVALFEKFPDSANFFADFKGKSVADIKASPK
GALTESQAALVKSSWEEFNANIPKHTHRFFILVLEIAPAAKDLFSFLKGTSEVPQNNPE
.........1.........2.........3.........4.........5.........
1 - 59
1mba
1lh2
1 - 59
6.........7.........8......ii...9.........0.........1......
LRDVSSRIFTRLNEFVNNAANAGKMSA--MLSQFAKEHVGFGVGSAQFENVRSMFPGFV
LQAHAGKVFKLVYEAAIQLEVTGVVVTDATLKNLGSVHVSKGVADAHFPVVKEAILKTI
6.........7.........8.........9.........0.........1........
60 - 116
1mba
1lh2
60 - 118
...2..i..i.....3.........4..i...i.i
ASVAAP-PA-GADAAWTKLFGLIIDALK-AAG-AKEVVGAKWSEELNSAWTIAYDELAIVIKKEMDDAA
.2.........3.........4.........5...
117 - 146
1mba
1lh2
119 - 153
B:
........i.1.........2.........3.........4.........i....i.i.
SLSAAEAD-LAGKSWAPVFANKNANGLDFLVALFEKFPDSANFFADFKGK-SVAD-I-K
GALTESQAALVKSSWEEFNANIPKHTHRFFILVLEIAPAAKDLFSFLKGTSEVPQNNPE
.........1.........2.........3.........4.........5.........
1 - 55
1mba
1lh2
1 - 59
....6.........7........i.8.i........9.........0.........1..
ASPKLRDVSSRIFTRLNEFVNNA-ANA-GKMSAMLSQFAKEHVGFGVGSAQFENVRSMF
LQAHAGKVFKLVYEAAIQLEVTGVVVTDATLKNLGSVHVSKGVADAHFPVVKEAILKTI
6.........7.........8.........9.........0.........1........
56 - 112
1mba
1lh2
60 - 118
.......2.i........3.........4......
PGFVASVAA-PPAGADAAWTKLFGLIIDALKAAGA
KEVVGAKWSEELNSAWTIAYDELAIVIKKEMDDAA
.2.........3.........4.........5...
113 - 146
1mba
1lh2
119 – 153
Table 10
Potential
Recog. structs.
Nsat. ineqs. [M]
Z score
HL
915
1.84
1.14
TE
914
0.20
1.45
MJ
902
0.28
1.23
THOM2
877
0.24
1.35
BT
861
0.17
1.26
GSK
819
0.08
1.35
60
Table 11
name (len)
1hka
1vhi
2a2u
1bbp
2ezm
1qgo
1abe
1byf
1ytt
1jwe
1b79
1b7g
1a7k
1eug
1udh
1d3b
1b34
1dpt
1ca7
1bg8
1dj8
1qfj
1vid
1bkb
1eif
1b0n
1lmb
1bd9
1beh
1bhe
1rmg
1b9k
1qts
1eh2
1qjt
1bqv
1b4f
1ck2
1cn8
1bl0
1jhg
1bnk
1b93
1mjh
1bk7
1bol
1bvb
(158)
(139)
(158)
(173)
(101)
(257)
(305)
(123)
(115)
(114)
(102)
(340)
(358)
(225)
(244)
(72)
(118)
(114)
(114)
(76)
(79)
(226)
(214)
(132)
(130)
(103)
(87)
(180)
(180)
(376)
(422)
(237)
(247)
(95)
(99)
(110)
(82)
(104)
(104)
(116)
(101)
(100)
(148)
(143)
(190)
(222)
(211)
DALI
Z-sc (RMS)
33.0 (0.0)
4.3 (5.2)
33.8 (0.0)
11.6 (3.3)
55.3 (0.0)
46.0 (0.0)
6.4 (3.4)
29.5 (0.0)
16.4 (2.2)
26.9 (0.0)
18.7 (1.3)
61.5 (0.0)
25.1 (2.9)
43.0 (0.0)
30.8 (1.7)
18.4 (0.0)
13.4 (1.1)
24.8 (0.0)
18.7 (1.2)
19.1 (0.0)
16.2 (0.7)
42.7 (0.0)
7.1 (3.1)
25.1 (0.0)
17.4 (1.6)
19.5 (0.0)
8.0 (5.3)
38.8 (0.0)
36.0 (0.3)
70.2 (0.0)
36.9 (2.2)
39.7 (0.0)
36.1 (0.7)
24.3 (0.0)
7.6 (2.5)
20.9 (0.0)
3.2 (3.3)
26.0 (0.0)
14.3 (2.2)
24.9 (0.0)
3.4 (6.6)
24.9 (0.0)
31.4 (0.0)
6.1 (3.4)
37.2 (0.0)
19.7 (2.3)
37.3 (0.0)
THOM2
Glob Z-sc
7.1
0.2
2.5
3.5
3.7
5.6
0.5
1.8
-0.1
2.6
0.3
8.7
-0.4
3.4
-1.0
3.5
1.9
6.2
4.0
3.4
5.1
8.1
-2.0
2.7
3.5
4.7
0.3
4.5
7.4
6.7
0.9
8.1
3.5
6.0
3.6
3.5
0.0
5.2
5.3
0.5
1.1
5.4
4.0
0.3
7.7
0.1
5.3
61
THOM2
Loc Z-sc
7.1
0.3
4.0
3.0
3.2
7.6
0.4
2.8
1.4
2.3
1.3
8.8
-0.9
3.0
2.9
2.8
2.0
6.0
2.5
3.5
3.9
8.4
0.5
1.5
2.0
5.0
0.1
5.8
5.8
0.6
8.2
6.4
6.5
3.7
2.3
1.7
4.3
2.0
0.5
1.0
6.3
3.2
1.3
9.0
-1.0
4.3
Table 12
A: RAS family
Name
121p
1kao
3rabA
1ftn
1hurA
1kevA
1mioB
1hdeA
1ksaA
1cbf
GT
5.9
6.1
4.8
3.7
2.8
-
LT
9.5
5.9
3.2
3.8
3.6
3.6
3.5
3.4
2.7
-
ene
1
2
3
10
4
-
len
166
167
169
193
180
351
458
310
366
285
LS
74.8
46.4
17.1
21.7
9.3
5.1
DALI
36.1/0.0/166/100
28.5/1.4/166/49
27.5/1.4/165/31
22.9/1.8/161/35
14.8/2.5/147/15
3.1/4.0/83/11
3.5/3.6/99/10*
2.5/3.6/111/9
1.8/3.9/74/9
B: Lactoglobulin family
Name
2blg
1a3yA
1bj7
2apd
1mup
2pcfB
1lgbC
1nglA
GT
8.2
4.7
3.0
3.0
1.7
-
LT
10.0
3.8
3.1
2.1
2.5
3.0
-
ene
1
5
4
2
3
-
len
162
149
150
169
157
250
159
179
LS
79.9
10.6
4.4
8.9
6.0
5.8
DALI
35.1/0.0/162/100
17.6/2.4/140/17
17.8/2.4/142/18
11.8/3.0/138/15
19.2/2.2/146/16
12.0/3.5/136/14
C: Glutathione s-transferase family
name
2gsq
1axdA
1gsdA
1aw9
1gnwA
1clxA
1fhe
2ljrA
1ao7E
GT
7.0
2.0
3.2
4.3
-
LT
7.3
5.2
3.7
2.5
4.0
3.7
-
Ene
1
3
4
2
-
len
202
209
221
216
211
347
217
244
245
LS
87.3
3.3
16.5
4.9
5.0
11.5
5.1
4.9
DALI
37.5/0.0/202/100
18.1/2.9/190/17
25.1/2.1/200/29
18.4/3.1/194/19
17.1/3.1/187/17
0.5/3.9/50/9
20.9/2.3/195/25
15.7/3.1/195/18
-
62
C: Phosphotransferase (SH3 domain) family
name
1aww
2semA
1fynA
4hck
1hsq
1a3k
1gbrA
1ark
1nksA
GT
3.4
3.9
4.3
3.2
2.7
-
LT
4.6
2.6
2.1
3.5
4.0
3.1
-
ene
2
3
1
5
6
-
len
67
58
62
72
71
137
74
60
194
LS
19.3
13.8
49.8
28.4
10.8
3.2
13.9
12.3
5.5
DALI
8.1/1.7/56/36*
10.2/1.5/56/31*
9.5/1.7/56/47*
8.1/2.0/55/40*
9.8/1.4/54/26*
7.7/2.0/57/34*
7.6/1.9/56/20*
-
ene
1
-
len
124
107
333
322
449
82
107
LS
57.4
15.9
4.9
3.9
2.9
DALI
28.1/0.0/123/100
14.4/1.7/99/36
0.9/3.3/50/8
4.9/2.1/64/19
-
Ene
1
-
len
87
106
76
60
66
LS
36.8
6.5
11.4
9.9
DALI
8.9/1.8/82/51*
1.2/7.9/40/35*
4.9/2.6/58/33*
E: Cytochrome c family
name
2cxbA
1co6
1dsn
1crxA
1ndoA
451c
3cyr
GT
6.8
-
LT
6.0
3.4
3.1
-
F: Zinc finger family
name
1meyC
1jhb
1iml
2adr
2drpA
GT
5.5
-
LT
3.6
3.4
2.9
-
63
G: Aspartyl protease family
name
1htrB
4cms
2jxrA
1clxA
1egzA
1pfzA
1lyaB
2pia
GT
9.6
5.0
5.6
3.7
1.3
LT
8.3
5.7
3.7
1.5
3.6
2.9
2.4
-
ene
1
3
2
4
6
len
329
323
329
347
291
380
241
321
LS
95.0
47.5
43.9
32.2
32.0
6.0
len
31
29
28
30
39
35
40
61
LS
15.6
6.0
5.9
4.8
3.3
DALI
56.8/0.0/329/100
39.6/1.7/301/39*
37.0/2.1/307/41
31.7/2.4/298/29
9.3/2.4/83/59
0.5/4.3/49/6
H: Scorpion toxin-like family
name
1pnh
1acw
1mea
1bh4
1mtx
2pta
1ica
1ilmA
GT
2.8
2.8
2.5
1.1
1.4
-
LT
2.9
2.1
1.3
2.5
-
ene
2
1
4
5
10
-
64
Download