Linear Programming Optimization and a Double Statistical Filter for Protein Threading Protocols

Jaroslaw Meller1,2 and Ron Elber1*

1 Department of Computer Science, Upson Hall 4130, Cornell University, Ithaca NY 14853
2 Department of Computer Methods, Nicholas Copernicus University, 87-100 Torun, Poland
* Corresponding author. Phone (607)255-7416, Fax (607)255-4428, e-mail ron@cs.cornell.edu

Running title: “LP based threading model”

Keywords: Linear Programming, Potential Optimization, decoy structures, threading, gaps

Abstract

The design of scoring functions (or potentials) for threading, differentiating native-like from non-native structures with a limited computational cost, is an active field of research. We revisit two widely used families of threading potentials, namely the pairwise and profile models. To design optimal scoring functions we use linear programming (LP). The LP protocol makes it possible to measure the difficulty of a particular training set in conjunction with a specific form of the scoring function. Gapless threading demonstrates that pair potentials have larger prediction capacity than profile energies. However, alignments with gaps are easier to compute with profile potentials. We therefore search for, and propose, a new profile model with prediction capacity comparable to that of contact potentials. A protocol to determine optimal energy parameters for gaps using linear programming is also presented. A statistical test, based on a combination of local and global Z scores, is employed to filter out false positives. Extensive tests of the new protocol are presented. The new model provides an efficient alternative for threading with pair energies, maintaining comparable accuracy. The code, databases and a prediction server are available at http://www.tc.cornell.edu/CBIO/loopp.

I. Introduction

The threading approach [1-8] to protein recognition is an alternative to sequence-to-sequence alignment.
Rather than matching the unknown sequence $S_i$ to another sequence $S_j$ (one-dimensional matching), we match the sequence $S_i$ to a shape $\mathbf{X}_j$ (three-dimensional matching). Experiments found a limited set of folds compared to a large diversity of sequences, suggesting the use of structures to find remote similarities between proteins. Hence, the determination of overall folds is reduced to tests of sequence fitness into a known and limited number of shapes.

The sequence-to-structure compatibility is commonly evaluated using reduced representations of protein structures. Points in 3D space represent amino acid residues and an effective energy of a protein is defined as a sum of inter-residue interactions. The effective pair energies can be derived from the analysis of contacts in known structures. Such knowledge-based, pairwise potentials are widely used in fold recognition [2,3,6,9-11], ab-initio folding [11-13] and sequence design [14-15].

Alternatively, one may define the so-called “profile” energy [1,5,16], taking the form of a sum of individual site contributions that depend on the structural environment of a site. For example, the solvation/burial state or the secondary structure can be used to characterize different local environments.

The advantage of profile models is the simplicity of finding optimal alignments with gaps (deletions and insertions into the aligned sequence) that allow the identification of homologous proteins of different length. Using the dynamic programming (DP) algorithm [17-20] one may compute optimal alignments with gaps in polynomial time, even though the number of all possible alignments is exponential.

In contrast to profile models, the potentials based on pair energies do not lead to exact alignments with dynamic programming. A number of heuristic algorithms, providing approximate alignments, have been proposed [21]. However, they cannot guarantee an optimal solution with less than an exponential number of operations [22].
Another common approach is to approximate the energy by a profile model (the so-called frozen environment approximation) and to perform the alignment using DP [23].

In the present manuscript we evaluate several different scoring functions for sequence-to-structure alignments, with parameters optimized by linear programming (LP) [24-26]. The capacity of the energies is explored in terms of the number of protein shapes that are recognized with a certain number of parameters. We propose a novel profile model, designed to mimic pair energies, which is shown to have accuracy comparable to that of other contact models. We discuss gap energies and introduce a double Z score measure (from global and local alignments) to assess the results. The proposed threading protocol emphasizes structural fitness (as opposed to sequence similarity) and can be made a part of more complex fold recognition algorithms that use family profiles, secondary structures and other patterns relevant for protein recognition.

II. Theory and computational protocols

II.1. Functional form of the energy

In this section we formally define the families of pairwise and profile models. We also introduce a novel THreading Onion Model (THOM), which is investigated in the subsequent sections of the paper. In the widely used pairwise contact model, the score of the alignment of a sequence $S$ into a structure $\mathbf{X}$ is a sum over all pairs of interacting amino acids,

$$E_{pairs} = \sum_{i<j} \phi_{ij}(\alpha_i, \beta_j, r_{ij}) \tag{1}$$

The pair interaction model, $\phi_{ij}$, depends on the distance $r_{ij}$ between sites $i$ and $j$, and on the types of the amino acids, $\alpha_i$ and $\beta_j$. The latter are defined by the alignment, as certain amino acid residues are placed in sites $i$ and $j$, respectively. We consider here the widely used contact potential. If the geometric centers of the side chains are closer than 6.4 Angstrom then the two amino acids are considered in contact.
The total energy is a sum of the individual contact energies:

$$\phi_{ij}(\alpha_i, \beta_j, r_{ij}) = \begin{cases} \varepsilon_{\alpha\beta} & 1.0 < r_{ij} < 6.4\ \text{Å} \\ 0 & \text{otherwise} \end{cases} \tag{2}$$

where $i$, $j$ are the structure site indices (contacts due to sites in sequential vicinity are excluded, $i + 3 < j$), $\alpha$, $\beta$ are indices of the amino acid types (we drop the subscripts $i$ and $j$ for convenience) and $\varepsilon$ is a matrix of all the possible contact types. For example, it can be a 20x20 matrix for the twenty amino acids. Alternatively, it can be a smaller matrix if the amino acids are grouped together into fewer classes. Different groups that are used in the present study are summarized in table 1. The entries of $\varepsilon$ are the target of parameter optimization.

** PLACE TABLE 1 HERE **

The second type of energy function assigns an “environment” or a profile to each of the structural sites [1]. The total energy $E_{profile}$ is written as a sum of the energies of the sites:

$$E_{profile} = \sum_i \phi_i(\alpha_i, \mathbf{X}) \tag{3}$$

As previously, $\alpha_i$ denotes the type of an amino acid that was placed at site $i$ of $\mathbf{X}$. For example, if $\alpha_i$ is a hydrophobic residue and $x_i$ is characterized as a hydrophobic site, the energy $\phi_i(\alpha_i, \mathbf{X})$ will be low (score will be high). If $\alpha_i$ is charged then the energy will be high (low score). The total score is given by a sum of the individual site contributions.

We consider two profile models. In THOM1 (THreading Onion Model 1), which was used in the past as an effective solvation potential [1,2], the total energy of the protein is a direct sum of the contributions from structural sites and can be written as

$$E_{THOM1} = \sum_i \varepsilon_{\alpha_i}(n_i) \tag{4}$$

The energy of a site depends on two indices: (a) the number of neighbors to the site, $n_i$ (a neighbor is defined by a cutoff distance, formula (2)), and (b) the type of the amino acid at site $i$, $\alpha_i$. For twenty amino acids and a maximum of ten neighbors we have 200 parameters to optimize, a number comparable to that of the detailed pairwise model.
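To make the THOM1 definition concrete, the following is a minimal sketch of how neighbor counts and the energy of formula (4) could be evaluated for a gapless alignment. It is an illustration only: the function names, the dictionary layout of the parameters, the sequence-separation rule $|i - j| \ge 4$, and the capping of neighbor counts at ten are assumptions made here, not details of the authors' code.

```python
import math

CUTOFF = 6.4   # upper contact distance in Angstrom (from the text)
R_MIN = 1.0    # lower distance bound used in formula (2)
MIN_SEP = 4    # assumed sequence separation |i - j| >= 4
MAX_N = 10     # assumed cap: counts above ten share the n = 10 parameter


def neighbor_counts(coords):
    """Return the number of contacts n_i for every structural site.

    coords is a list of (x, y, z) side-chain centers, one per site.
    """
    counts = [0] * len(coords)
    for i in range(len(coords)):
        for j in range(i + MIN_SEP, len(coords)):
            if R_MIN < math.dist(coords[i], coords[j]) < CUTOFF:
                counts[i] += 1
                counts[j] += 1
    return counts


def thom1_energy(sequence, coords, eps):
    """Sum eps[(amino_acid, n_i)] over sites; sites with no contacts add zero.

    eps is a hypothetical parameter table keyed by (amino acid, neighbor count).
    """
    total = 0.0
    for aa, n in zip(sequence, neighbor_counts(coords)):
        if n > 0:
            total += eps.get((aa, min(n, MAX_N)), 0.0)
    return total
```

Because the neighbor counts depend on the structure alone, they can be computed once per template and reused for every threaded sequence.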
THOM1 provides a non-specific interaction energy which, as we show in section III, has relatively low prediction ability when compared to pairwise interaction models. THOM2 is an attempt to improve the accuracy of the environment model by making it more similar to pairwise interactions. We define the energy $\varepsilon_{\alpha_i}(n_i, n_j)$ of a contact between structural sites $i$ and $j$, where $n_i$ is the number of neighbors to site $i$ and $n_j$ is the number of neighbors to site $j$. The type of amino acid at site $i$ is $\alpha_i$. Only one of the amino acids in contact is known. The total contribution to the energy of site $i$ is a sum over all contacts to this site,

$$\phi_{i,THOM2}(\alpha_i, \mathbf{X}) = \sum_j{}' \, \varepsilon_{\alpha_i}(n_i, n_j).$$

The prime indicates that we sum only over sites $j$ that are in contact with $i$ (i.e. over sites $j$ satisfying the condition $1.0 < r_{ij} < 6.4$ Å and $|i - j| \ge 4$). The total energy is finally given by a double sum over $i$ and $j$,

$$E_{THOM2} = \sum_i \sum_j{}' \, \varepsilon_{\alpha_i}(n_i, n_j) \tag{5}$$

It is possible to define an effective contact energy using THOM2:

$$V_{ij}^{eff} = \varepsilon_{\alpha_i}(n_i, n_j) + \varepsilon_{\beta_j}(n_j, n_i) \tag{6}$$

Hence, we can formally express the THOM2 energy as a sum of pair energies, $E_{THOM2} = \sum_{i<j} V_{ij}^{eff}$. The effective energy mimics the formalism of pairwise interactions. However, in contrast to the usual pair potential, the optimal alignments with gaps can be computed efficiently with THOM2, since structural features alone determine the “identity” of the neighbor.

** PLACE TABLE 2 HERE **

We use a coarse-grained model leading to a reduced set of structural environments (types of contacts) as outlined in table 2. The use of a reduced set makes the number of parameters (300 when all the 20 types of amino acids are considered) comparable to that of the contact potential. Further analysis of the new model is included in section III.

II.2 Linear programming protocol for optimization of the energy parameters

Consider the alignment of a sequence $S$ of length $n$ into a structure $\mathbf{X}$ of length $m$.
In order to optimize the energy parameters for the amino acid interactions (the gap energies are discussed in section II.3) we employ the so-called gapless threading, in which the sequence $S$ is fitted into the structure $\mathbf{X}$ with no deletions or insertions. Hence, the length of the sequence must be shorter than or equal to the length of the protein chain. If $n$ is shorter than $m$ we may try $m - n + 1$ possible alignments, varying the structural site of the first residue (and the following sequence).

The energy (score) of the alignment of $S$ into $\mathbf{X}$ is denoted by $E(S, \mathbf{X}, \mathbf{p})$, where $\mathbf{X}$ stands (depending on the context) either for the whole structure or only for a substructure of length $n$. The energy function $E(S, \mathbf{X}, \mathbf{p})$ depends on a vector $\mathbf{p}$ of $q$ parameters (so far undetermined).

Consider the sets of structures $\{\mathbf{X}_i\}$ and sequences $\{S_j\}$. There is an energy value for each of the alignments of the sequences $\{S_j\}$ into the structures $\{\mathbf{X}_i\}$. A good potential will make the alignment of the “native” sequence into its “native” structure the lowest in energy. Let $\mathbf{X}_n$ be the native structure of the sequence $S_n$. A condition for an exact recognition is:

$$E(S_n, \mathbf{X}_j, \mathbf{p}) - E(S_n, \mathbf{X}_n, \mathbf{p}) > 0 \qquad \forall j \ne n \tag{7}$$

In the set of inequalities (7) the coordinates and sequences are given and the unknowns are the parameters that we need to determine.

The LP protocol makes it possible to measure the difficulty of a training set. The number of parameters of the energy function that are necessary to satisfy all the inequalities is derived from the set of structures, as defined in equation (7). Note also that, while the statistical potentials are based on the analysis of native structures only, the LP protocol uses sequences threaded through misfolded structures during the process of learning. As a result the LP has the potential for accumulating more information, attempting to put the energies of the misfolded sequence as far as possible from the energy of the native state.
In fact, the LP protocol can be used to optimize the Z score of the distribution of energy gaps [27]. In the remaining part of this section we describe the technique used to solve the inequalities of equation (7).

The “profile” and pairwise interaction models that are considered in this work can be written as a scalar product:

$$E = \sum_\gamma n_\gamma(\mathbf{X}) \, p_\gamma \equiv \mathbf{n}(\mathbf{X}) \cdot \mathbf{p} \tag{8}$$

where $\mathbf{p}$ is the vector of parameters that we wish to determine. The index of the vector, $\gamma$, runs over the types of contacts or sites. For example, in the pairwise interaction model the index runs over amino acid pairs, whereas in the THOM1 model it runs over the types of sites characterized by the identity of the amino acid at the site and the number of its neighbors. $n_\gamma(\mathbf{X})$ is the number of contacts, or sites, of a specific type found in the structure $\mathbf{X}$. Using the representation of equation (8) we may rewrite equation (7) as follows:

$$\mathbf{p} \cdot \Delta\mathbf{n}_j = \mathbf{p} \cdot \left( \mathbf{n}(\mathbf{X}_j) - \mathbf{n}(\mathbf{X}_n) \right) > 0 \qquad \forall j \ne n \tag{9}$$

Standard linear programming tools can solve equation (9) for $\mathbf{p}$. We use the BPMPD program of Cs. Meszaros [28], which is based on the interior point algorithm. In the present computations we seek a point in parameter space that satisfies the constraints and we do not optimize a function in that space. Without a function to optimize, the interior point algorithm places the solution at the “maximally feasible” point, which is at the analytic center of the feasible polyhedron that defines the “accessible” volume of parameters [29, 27].

The set of inequalities that we wish to solve includes tens of millions of constraints that could not be loaded into the computer memory directly (we have access to machines with two to four Gigabytes of memory). Therefore, the following heuristic approach was used. Only a subset of the constraints is considered, namely $\{\mathbf{p} \cdot \Delta\mathbf{n}_j < C\}_{j=1}^{J}$, with a threshold $C$ chosen to restrict the number of inequalities to a manageable size (which is about 500,000 inequalities for 200 parameters).
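The bookkeeping behind equation (9) and the threshold-based selection can be sketched as follows. This is a hedged illustration in plain Python, not the authors' implementation: the function names are hypothetical, the count vectors are assumed to be precomputed lists of contact or site counts, and the LP solve itself (done with BPMPD in the text) is deliberately left out.

```python
def difference_vectors(native_counts, decoy_counts):
    """Build the rows dn_j = n(X_j) - n(X_n) of equation (9).

    native_counts is the count vector of the native alignment and
    decoy_counts is a list of count vectors, one per decoy alignment.
    """
    return [[d - n for d, n in zip(decoy, native_counts)]
            for decoy in decoy_counts]


def dot(p, row):
    """Inner product p . dn_j."""
    return sum(a * b for a, b in zip(p, row))


def violated(p, rows):
    """Indices j with p . dn_j <= 0: decoys not strictly above the native."""
    return [j for j, row in enumerate(rows) if dot(p, row) <= 0.0]


def select_subset(p, rows, cutoff):
    """Indices of inequalities with p . dn_j < C for the current parameters p.

    Only this subset would be handed to an external LP solver in one iteration.
    """
    return [j for j, row in enumerate(rows) if dot(p, row) < cutoff]
```

In such a scheme, `select_subset` would be called with the current parameter vector to choose the constraints passed to the solver, and `violated` would then check the full set of inequalities against the returned parameters.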
Hence, during a single iteration, we considered only the inequalities that are more likely to be relevant for further improvement, namely those with inner products smaller than the cutoff $C$. The subset $\{\mathbf{p} \cdot \Delta\mathbf{n}_j < C\}_{j=1}^{J}$ is sent to the LP solver “as is”. If it is proven infeasible, the calculation stops (no solution is possible). Otherwise, the result is used to test the remaining inequalities for violations of the constraints (equation (9)). If no violations were detected, the process was stopped (a solution was found). If negative inner products were found in the remaining set, a new subset of inequalities below $C$ was collected. The process was repeated until it converged. Sometimes convergence was difficult to achieve and human intervention in the choice of the inequalities was necessary. For example, mixing subsets of inequalities from previous runs with the lowest inequalities obtained with the new parameters helps to avoid the problem of “oscillating” between solutions.

II.3 Protocol for optimization of gap energies

In the following section we discuss the derivation of the energy terms for gaps and deletions that enable the detection of homologs. We introduce an “extended” sequence, $\bar{S}$, which may include gap “residues” (spaces, or empty structural sites) and deletions (removal of an amino acid, or an amino acid placed at a virtual structural site).

The gap residue, “-”, is considered to be another amino acid. We assign to it a score (or energy), $\varepsilon_{gap}(\mathbf{X})$, according to its environment. Gap training is similar to the training of other amino acid residues, in contrast to the usual ad-hoc treatment of gap energies. The proposed treatment is also more symmetric than the different penalties for opening and extending gaps.

The database of “native” and decoy structures is different, however, for gapless and gap training. To optimize the gap parameters we need “pseudo-native” structures that include gaps.
We construct such “pseudo-native” conformations by removing the true native shape $\mathbf{X}_n$ of the sequence $S_n$ from the coordinate training set and replacing it by a homologous structure, $\mathbf{X}_h$. The best alignment of the native sequence, $S_n$, into the homologous structure, $\mathbf{X}_h$, with an initial guess of gap penalties, defines $\bar{S}_n$. The extended sequence, $\bar{S}_n$, with gap residues at certain (fixed from now on) positions becomes our new (pseudo-) native sequence of the structure $\mathbf{X}_h$. We require that the alignment of $\bar{S}_n$ into the homologous protein will yield the lowest energy compared to all other alignments of the set. Hence, our constraints are:

$$E(\bar{S}_n, \mathbf{X}_j, \mathbf{p}) - E(\bar{S}_n, \mathbf{X}_h, \mathbf{p}) > 0 \qquad \forall j \ne h, n \tag{10}$$

Equation (10) is different from equation (7) since we consider the “extended” set of “amino acids”, $\bar{S}$ instead of $S$, and the native-like structure is $\mathbf{X}_h$ instead of $\mathbf{X}_n$.

To limit the scope of the computations we optimize here the scores of the gaps only. Thus, we do not allow the amino acid energies optimized separately (see section III.2) to change while optimizing parameters for gaps. Moreover, the sequence $\bar{S}$, obtained by a certain prior (e.g. structure-to-structure) alignment or from the experimental data if available, is held fixed. In other words, threading of the extended sequence $\bar{S}$, with fixed positions and number of gap residues (treated now as any other residue), against all other structures in the training set is used in order to generate a corresponding set of inequalities (equation (10)).

This optimization, while limited and clearly not the final word on the topic, is still expected to be better than a guess. Further studies of gap penalties are in progress [Galor, Meller and Elber, work in progress]. We note that optimization of gaps has been attempted in the past [23,30]. In principle, one could optimize deletion penalties using a similar protocol.
In the present manuscript we exploit an assumed symmetry between insertion of a gap residue to a sequence and the placement of a “delete” residue in a virtual structural site. The deletion penalty is set equal to the cost of insertion averaged over two nearest structural sites. No explicit dependence on the amino acid type is assumed.

II.4 Double Z score filter for gapped alignments

In sections III.4 and IV.2-4 we consider optimal alignments of an extended sequence $\bar{S}$ with gaps into the library structures $\mathbf{X}_j$. We focus on the alignments of complete sequences to complete structures (global alignments [17]) and alignments of continuous fragments of sequences into continuous fragments of structures (local alignments [18]). In global alignments opening and closing gaps (gaps before the first residue and after the last amino acid) reduce the score. In local alignments gaps or deletions at the C and N termini of the highest scoring segment are ignored. Only one local segment, with the highest score, is considered.

Threading experiments that are based on a single criterion (the energy) are usually unsatisfactory [26,31]. While we do hope that the (free) energy function that we design is sufficiently accurate so that the native state (the native sequence threaded through the native structure) is the lowest in energy, this is not always the case. Our exact training is for the training set and for gapless threading only (see section III). The results were not extended to include exact learning with gaps, or exact recognition of structures of related proteins that are not the native. Such extensions are difficult since the number of inequalities for $\bar{S}$ is exponentially larger than the number of inequalities without gaps.

Other investigators use the Z score as an additional or the primary filter [19,32,4,6] and we follow their steps. The novelty in the present protocol is the combined use of global and local Z scores to assess the accuracy of the prediction.
This filtering mechanism, in addition to the initial energy filter, provides an improved discrimination as compared to a single Z score test.

The Z score is a dimensionless, “normalized” score, defined as:

$$Z = \frac{\langle E \rangle - E_p}{\sqrt{\langle E^2 \rangle - \langle E \rangle^2}} \tag{12}$$

The energy of the current “probe”, i.e. the energy of the optimal alignment of a query sequence into a target structure, is denoted by $E_p$. The averages, $\langle \ldots \rangle$, are over “random” alignments. The Z score measures the deviation of our “hits” from random alignments (alignments with scores that are far from the “random” average value are more significant).

Following the common practice [32-34] we generate the distribution of random alignments numerically, employing sequence shuffling. That is, we consider the family of sequences obtained by permutations of the original sequence. The set of shuffled sequences has the same amino acid composition and length as the native sequence and all the shuffled sequences have the same energy in the unfolded state (the energy of an amino acid with no contacts is set to zero).

In section III.4 we estimate numerically the probability $P(Z_p)$ of observing a Z score larger than $Z_p$ by chance for local threading alignments. The relatively high likelihood of observing large Z scores for false positives makes predictions based on the Z score test problematic. Therefore we propose an additional filtering mechanism, based on a combination of Z scores for global and local alignments. The double Z score filter eliminates false positives, missing a smaller number of correct predictions.

Global alignments (in contrast to local alignments) are influenced significantly by a difference in the lengths of the structure and the threaded sequence. The matching of lengths was considered too restricted in previous studies [35]. Nevertheless, using an environment dependent gap penalty, the Z score of the global alignment proved to be a useful independent filter (see sections III.4 and IV.2-4).
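As an illustration of equation (12) and of the double filter, here is a minimal sketch that estimates a Z score by sequence shuffling. Several pieces are assumptions introduced for the example: `score_fn` stands for whatever routine returns the optimal alignment energy of a sequence into a fixed target structure, the number of shuffles is arbitrary, and the thresholds passed to `passes_double_filter` are hypothetical, since this part of the text does not state numerical cutoffs.

```python
import random
import statistics


def z_score(sequence, score_fn, n_shuffles=100, seed=0):
    """Estimate Z = (<E> - E_p) / sqrt(<E^2> - <E>^2) by shuffling.

    score_fn(sequence) is assumed to return the energy of the optimal
    alignment of the given sequence into one fixed target structure.
    """
    rng = random.Random(seed)
    e_probe = score_fn(sequence)
    residues = list(sequence)
    energies = []
    for _ in range(n_shuffles):
        rng.shuffle(residues)  # permutation keeps composition and length
        energies.append(score_fn("".join(residues)))
    mean = statistics.fmean(energies)
    sigma = statistics.pstdev(energies)  # population std, as in equation (12)
    if sigma == 0.0:
        raise ValueError("shuffled energies have zero variance")
    return (mean - e_probe) / sigma


def passes_double_filter(z_global, z_local, min_global, min_local):
    """Accept a hit only if both Z scores exceed their (hypothetical) cutoffs."""
    return z_global > min_global and z_local > min_local
```

The same `z_score` routine would be run twice per candidate, once with a global and once with a local alignment scorer, before the two values are combined.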
We observe that good scores are obtained for length differences (between sequence and structure) that are of the order of ten percent. On the other hand, when the differences in length are profound the global alignment fails. Large differences imply identification of domains and not of a whole protein. This is a different problem, which the present manuscript does not address.

III. Application to potential design and analysis

In this section we analyze and compare several pairwise and profile potentials, optimized using the LP protocol. Given the training set, we either obtain a solution (exact recognition on the training set), or the LP problem proves infeasible. We use the infeasibility of a set to test the capacity of an energy model. We compare the capacity of alternative energy models by asking how many native folds they can recognize (before hitting an infeasible solution). The larger the number of proteins that are recognized with the same number of parameters, the better the energy model. We find that the “profile” potentials have in general lower capacity than the pairwise interaction models.

III.1 Training and test sets

Two sets of protein structures and sequences are used for the training of parameters in the present study. Hinds and Levitt developed the first set [31], which we call the HL set. It consists of 246 protein structures and sequences. Gapless threading of all sequences into all structures generated 4,003,727 constraints (i.e. the inequalities of equation (7)). The gapless constraints were used to determine the potential parameters for the twenty amino acids. Since the number of parameters does not exceed a few hundred, the number of inequalities is larger than the number of unknowns by many orders of magnitude.

The second set of structures consists of 594 proteins and was developed by Tobi et al. [25]. It is called the TE set and is considerably more demanding.
It includes structures chosen according to diversity of protein folds but also some homologous proteins (up to 60 percent sequence identity), and poses a significant challenge to the energy function. For example, the set is infeasible for the THOM1 model, even when using 20 types of amino acids (see section III.2). The total number of inequalities that were obtained from the TE set using gapless threading was 30,211,442. The TE set includes 206 proteins from the HL set.

We developed two other sets that are used as testing sets to evaluate the new potentials in terms of both gapped and gapless alignments. These test sets contain proteins that are structurally dissimilar to the proteins included in the training sets, as specified by the RMS distance between the structures. A structure-to-structure alignment algorithm, based on the overlap of the contact shells that are defined for the superimposed side-chain centers in analogy with the THOM2 model (disregarding however the identities of amino acids), was used (Meller and Elber, unpublished results).

The first testing set, which is referred to as S47, consists of 47 proteins: 25 proteins from the CASP3 competition [36] and 22 related structures, chosen randomly from the list of DALI [37] relatives of the CASP3 targets. Using CASP3-related structures is a convenient way of finding protein structures that are not sampled in the training. None of the 47 structures has homologous counterparts in the HL set and only six (representing three different folds) have counterparts in the TE set, with a cutoff for structural (dis)-similarity of 12 angstrom RMSD (between the superimposed side-chain centers).

The second test set, referred to as S1082, consists of 1082 proteins that were not included in the TE set and which are different by at least 3 angstrom RMSD (measured, as previously, between the superimposed side-chain centers) with respect to any protein from the TE set and with respect to each other.
Thus, the S1082 set is a relatively dense (but non-redundant up to 3 angstrom RMSD) sample of protein families. The training and testing sets are available from the web [38].

III.2 Linear programming training of “minimal” models

This section addresses the question: What is the minimal number of parameters that is required to obtain an exact solution for the HL and for the TE sets? By “exact” we mean that each of the sequences picks the native fold as the lowest in energy using a gapless threading procedure. Hence, all the inequalities in equation (7), for all sequences $S_n$ and structures $\mathbf{X}_j$, are satisfied.

The pairwise model requires the smallest number of parameters (55) to solve the HL set exactly (see table 3). Only ten types of amino acids were required: HYD POL CHG CHN GLY ALA PRO TYR TRP HIS (see also table 1). The THOM1 and THOM2 models require 200 and 150 parameters respectively to provide an exact solution on the same (HL) set (see table 3). It is impossible to find an exact potential (of the functional forms we examined) for the HL set without (at least) ten types of amino acids. The potentials optimized on the HL set are then used to predict the folds of the proteins of the TE set. Again, we find that the pairwise interaction model performs better than the THOM models.

** PLACE TABLE 3 HERE **

An indication that THOM2 is a better choice than THOM1 is included in the next comparison. It is impossible to find parameters that will solve exactly the TE set using THOM1 (the inequalities form an infeasible set). The infeasibility is obtained even if 20 types of amino acids are considered. In contrast, both THOM2 and the pairwise interaction model lead to feasible inequalities if the number of parameters is 300 for THOM2 and 210 for the pairwise potential. Note that the set of parameters that solved exactly the TE set does not solve exactly the HL set, since the latter set includes proteins not included in the TE set.
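The parameter counts quoted above can be checked by simple counting. The worked lines below assume a symmetric contact matrix for the pairwise model, ten neighbor classes for THOM1 (as stated in section II.1), and 15 structural contact classes for THOM2; the last number is an inference from the quoted totals (300 parameters for 20 amino acid types), since table 2 is not reproduced here.

```latex
\begin{align*}
N_{pair}(K)  &= \tfrac{1}{2} K (K + 1), & N_{pair}(10)  &= 55,  & N_{pair}(20)  &= 210, \\
N_{THOM1}(K) &= 10\,K,                  &               &       & N_{THOM1}(20) &= 200, \\
N_{THOM2}(K) &= 15\,K,                  & N_{THOM2}(10) &= 150, & N_{THOM2}(20) &= 300.
\end{align*}
```

Here $K$ denotes the number of amino acid types retained after grouping.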
We have also attempted to solve the TE set using pair energies and THOM2 with a smaller number of parameters. The problem was proven infeasible even for 17 different types of amino acids, with only very similar amino acids grouped together (Leu and Ile, Arg and Lys, Glu and Asp). Similarly, we failed to reduce the number of parameters by grouping together structurally determined types of contacts in THOM2. Enhancing the range of a “dense” site to be a site of seven neighbors or more also results in infeasibility.

III.3 Analysis of THOM2 model

As discussed in section II, the THOM1 potential provides a new set of parameters for an effective solvation model that was used in the past. Since, applying the LP protocol, we can solve only the HL set with THOM1, the solution for that set gives our optimal THOM1 energies, as included in table 4.a. In this section we analyze in detail the THOM2 model, which has significantly higher capacity than THOM1. However, the double layer of neighbors makes the results more difficult to understand.

** PLACE TABLE 4 HERE **

In figure 1 we present a contour plot of the total contributions to the energies of the native alignments in the TE set, as a function of the number of contacts in the first shell, $n$, and the number of secondary contacts to a primary contact, $n'$, respectively. The results for two types of residues, lysine and valine, are presented. The contribution of a given type of contact is defined as $f \cdot \varepsilon(n, n')$, where $\varepsilon(n, n')$ is the energy of a given type of contact and $f$ is the frequency of that contact, observed in the TE set.

** PLACE FIGURE 1 HERE **

It is possible to find a very attractive (or repulsive) site that makes only a negligible contribution to the native energies since it is extremely rare (i.e. $f$ is small). For specific examples see table 5. By plotting $f \cdot \varepsilon(n, n')$ we emphasize the important contributions.
Hydrophobic residues with a large number of contacts stabilize the native alignment, as opposed to polar residues that stabilize the native state only with a small number of neighbors.

** PLACE TABLE 5 HERE **

It has been suggested that pairwise interactions are insufficient to fold proteins and higher order terms are necessary [26]. It is of interest to check whether the environment models that we use capture cooperative, many-body effects. As an example we consider the cases of valine-valine and lysine-lysine interactions. We use equation (6) to define the energy of a contact. In the usual pairwise interaction the energy of a valine-valine contact is a constant, independent of other contacts that the valine may have.

** PLACE TABLE 6 HERE **

In table 6 we list the effective energies of contacts between valine residues as a function of the number of neighbors in the primary and secondary sites. The energies differ widely, from -1.46 to +3.01. The positive contributions refer to very rare types of contacts. The plausible interpretation is that these rare contacts are used to enhance recognition in some cases, due to specific “homologous features”. Significant differences are observed also for the frequently occurring types of contacts that contribute in accord with the “general principle” of rewarding contacts between hydrophobic sites. For example, the effective energies of contacts between a valine of five neighbors and another valine of three, five or seven neighbors are equal to -0.44, -0.54 and -0.61, respectively. Hence, the THOM2 model includes significant cooperativity effects. The optimal parameters for the THOM2 potential are provided in table 4.b.

III.3 Training of gap energies

In this section we apply the LP protocol for the optimization of gap energies described in section II.3. Training concerns the gap energies for the THOM2 model only and it is limited to a small set of carefully chosen homologous pairs.
Despite the limited scope of our training, we obtain satisfactory results in terms of recognition of remote homologs, as discussed in the subsequent parts of the paper.

Pairs of homologous proteins from the following families were considered: globins, trypsins, cytochromes and lysozymes (see table 7). The families were selected to represent different folds. The globins are helical, trypsins are mostly β-sheets, and lysozymes are α/β proteins. Note also that the number of gaps differs appreciably from protein to protein. For example, $\bar{S}_n$ includes only one gap for the alignment of 1ccr (sequence) vs. 1yea (structure), and 22 gaps for 1ntp vs. 2gch. The structures of the lysozymes 1lz5 and 1lz6 include engineered insertions that allow us to sample experimentally observed gap locations.

** PLACE TABLE 7 HERE **

For the remaining families, the process of generating pseudo-native sequences is as follows. For each pair of native and homologous proteins the alignment of the native sequence $S_n$ into the homologous structure $\mathbf{X}_h$ is constructed using the THOM1 potential with an initial guess for the gap energies, provided in table 8.a. The ad hoc gap penalties favor gaps at sites with few neighbors and they satisfy the following constraints: (a) the gap penalty should increase with the number of neighbors; (b) the energy of a gap with $n$ contacts must be larger than the energy of an amino acid with the same number of contacts (the gap energy must be higher, otherwise gaps will be preferred to real amino acids); (c) the energy of amino acids without contacts is set to zero and therefore the gap energy is greater than zero. Given the above constraints, the initial gap penalties are tuned to minimize the discrepancies with the DALI [37] structure-to-structure alignments (we choose not to use the DALI alignments directly since they involve deletions that are not trained explicitly at this stage; see section II.3).
** PLACE TABLE 8 HERE **

The “pseudo-native” structures with extended sequences, obtained as described above, are added to the HL set (while removing the original native structures). The energy functional form that we use for the gaps is the same as for other amino acids in the THOM2 model. “Gapless” threading into other structures of the HL set generates about 200,000 constraints for the gap energies, which are solved using the LP solver. The resulting gap penalties for THOM2 are given in table 8.b. The value of 10 is the maximal penalty allowed by the optimization protocol that we used. The maximal penalty is assigned to gaps that are found only in decoy states and have no native states to bound the penalty at lower values. For example, using our initial guess for gap penalties we do not observe gaps in the hydrophobic cores of pseudo-native structures. Gaps are usually found in loops with significant solvent exposure and we have no information in our training set on “native” gaps at sites that are neighbor-rich.

** PLACE TABLE 9 HERE **

In table 9 we show the results of optimal threading with gaps (using dynamic programming) for myoglobin (1mba) against the leghemoglobin (1lh2) structure. We show the initial alignment (with the ad hoc gap parameters from table 8.a), defining the pseudo-native state, and the results for optimized gap penalties for THOM2. The location of gaps in the initial alignment is to a large extent consistent with the DALI [37] structure-to-structure alignment. Four (out of seven) insertions coincide with the DALI superposition of the two structures, and two insertions are shifted by three residues (see the caption for table 9). The THOM2 alignment (different from the initial set-up) is less consistent with the DALI alignment. Interestingly, however, it provides a better superposition of alpha helices. Note that the gaps appear (as expected) in loop regions (e.g., the CD, EF, and GH loops).
An exception is the gap at position 9 (in 1lh2), which is located in the middle of the A helix instead of position 2, as suggested by the DALI alignment. Further tests of alignments with gaps are presented in section IV.

In section IV we also present threading results for the pairwise TE potential, using the so-called frozen environment approximation (FEA) [23]. In order to compute optimal alignments with the FEA we need to set the gap penalties for the TE potential. Pairwise models are not the focus of our study and we do not attempt to optimize gap energies for the TE potential. Therefore, for the sake of fair comparison we introduce ad hoc gap penalties based on a similar functional model for both the TE and THOM2 potentials. After some experimentation, the insertion penalties are chosen to be proportional to the number of neighbors to a site, $\epsilon_{TE}(n) = 0.2\,(n+1)$ and $\epsilon_{THOM2}(n) = 1.0\,(\bar{n}+1)$, for the TE and THOM2 potentials, respectively (the averaged number of neighbors, $\bar{n}$, in the class that n belongs to is used for THOM2; see table 2). This choice is consistent with the trained THOM2 gap energies, which also penalize sites with no neighbors. The proportionality coefficients were gauged using the same families that were used to train the THOM2 gap energies. However, no LP training was attempted. The deletion penalties are also consistent with the THOM2 model and they are defined in the way described in section II.3. For further comparisons with sequence-to-sequence alignments we also introduce environment dependent gap penalties that are used for family recognition in conjunction with the BLOSUM50 [39] substitution matrix, B 50 (n) (5 n) 8 (see section IV.3).

III.4 Assessing the distribution of the Z scores for gapped alignments

In this section we compute numerical distributions of the Z scores for local and global threading alignments, using the THOM2 model and the gap penalties of table 8.b, trained in the previous section.
Based on these distributions, we derive empirical cutoffs for the double Z score test (discussed in section II.4) that filters out all the incorrect predictions observed in our benchmark. Further tests of the specificity as well as sensitivity of the double Z score filter are included in sections IV.2-4.

To establish a cutoff for the Z score (and not the energy itself) that eliminates false positives, we estimate numerically the probability $P(Z_p)$ of observing a Z score larger than $Z_p$ by chance. The distribution of Z scores for random alignments is generated by threading sequences of the S47 set through structures included in the HL set. The probe sequences of known structures were selected to ensure no structural similarity between the HL set and the structures of the probe sequences (see section III.1). Therefore any significant hit in this set may be regarded as a false positive. Z scores of local alignments are employed to estimate $P(Z_p)$. The number of local alignments with “good” energies (significantly lower than zero) is large, underlining the need for an additional selection mechanism to eliminate false positives.

** PLACE FIGURE 2 HERE **

In local alignments a contribution due to a given contact should only be included if it belongs to the alignment (which is not known to start with). This implies a “structural” frozen environment approximation (see also section IV.4). When counting contacts we assume that the sites in contact in the original structure belong to the aligned part of the structure. This may result in spuriously low energies of local matches, making the Z score of the local threading alignment an important filter. As can be seen from figure 2, the attempted analytical fit to the Gaussian distribution underestimates the tail of the observed distribution. The analytical fit to the extreme value distribution [40] provides, in turn, an upper bound for the tail.
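The numerical estimate of the tail probability described above can be sketched as follows. This is an illustration under stated assumptions rather than the protocol of section II.4: we assume the common convention that the Z score is the mean energy of shuffled sequences minus the alignment energy, divided by the standard deviation of the shuffled energies, and all function names and numbers are hypothetical.

```python
import math


def z_score(energy, shuffled_energies):
    """Z score of an alignment energy relative to shuffled-sequence energies.

    Assumed convention: lower (better) energies give larger Z scores.
    """
    m = len(shuffled_energies)
    mean = sum(shuffled_energies) / m
    variance = sum((e - mean) ** 2 for e in shuffled_energies) / m
    return (mean - energy) / math.sqrt(variance)


def empirical_tail(z_scores, z_p):
    """Fraction of random alignments with a Z score larger than z_p."""
    return sum(1 for z in z_scores if z > z_p) / len(z_scores)


def gaussian_tail(z_p):
    """P(Z > z_p) for a standard normal distribution, for comparison."""
    return 0.5 * math.erfc(z_p / math.sqrt(2.0))


# Illustrative numbers only, not data from the benchmark.
z_example = z_score(-5.0, [-1.0, 1.0])
tail_example = empirical_tail([0.5, 1.5, 3.2, 3.5], 3.0)
print(z_example, tail_example, gaussian_tail(3.0))
```

Comparing `empirical_tail` with `gaussian_tail` at the same cutoff is one simple way to see whether a Gaussian fit underestimates the observed tail for a given set of random alignments.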
In the realm of sequence comparison, the extreme value distribution has been used to model scores of random sequence alignments, both for local ungapped alignments [41] and for local alignments with gaps [42]. We, however, establish our thresholds based on the numerical distribution. The number of random alignments with a Z score larger than three, for example, is non-negligible (see the tail in figure 2 and also the analytical estimate in the caption for figure 2). The expected number of false positives observed in N trials is $N \cdot P(Z_p)$. Therefore, only relatively high Z scores (which would at the same time miss many correct predictions) may result in significant predictions when searching large databases. Restricting the Z score test to only the best matches (according to energy) is insufficient. We find that the double Z score filter performs better, eliminating false positives while dismissing a smaller number of correct predictions as insignificant.

In figure 3 we present the joint probability distribution for global and local Z scores for a population of false positives versus a population of correct predictions. The squares at the upper right corner represent correct predictions, resulting from 331 native alignments (of a sequence into its native structure) and homologous alignments (of a sequence into a homologous structure) of the HL set proteins. The circles at the lower left corner are incorrect predictions (false positives) obtained from the alignments of the sequences of the S47 set against all structures in the HL set. The procedure is the same as the one used previously to generate the probability density function for the Z scores of local alignments. However, the Z scores are computed using 1000 shuffled sequences for both global and local alignments, which is sufficient for convergence. The converged results reduce somewhat the tails of the distribution.
For example, the number of false positives with a global Z score larger than 2.5 and a local Z score larger than 1.0 is equal to 3, as compared to 7 with only 100 shuffled sequences.

** PLACE FIGURE 3 HERE **

Figure 3 shows that the thresholds of 3.0 for global Z scores and of 2.0 for local Z scores are sufficient to eliminate all the false predictions. These cutoffs result in a number of misses (see also the next section). However, this is the price we have to pay for high confidence in our predictions. The total number of pairwise alignments for which we compute the global and the local Z scores, and subsequently test for the presence of false positives, is about 10000. Hence, we estimate that the probability of observing a single false positive with a global and a local Z score larger than the 3.0 and 2.0 thresholds is smaller than 0.0001.

IV. Tests of the model

There are four tests that we perform in this section on the THOM2 potential. First, we compare the performance of THOM2 and of pair potentials from the literature, using gapless alignments and the S1082 set of proteins. Next, we consider alignments with gaps. We test the specificity and sensitivity of the double Z score filter that is employed to assess the statistical significance of gapped alignments. Using the double Z score filter, we analyze self-recognition for the S47 set of proteins that contains representatives of folds not sampled in the training. Next, tests of family recognition are presented, including the comparison of THOM2 results with those of a pairwise model using the frozen environment approximation.

IV.1 Evaluation of THOM2 and pair potentials by gapless threading

To make a comparison to pairwise potentials and to test at the same time the generalization capacity of THOM2, we use the S1082 set. This set does not contain proteins included in the training set.
However, as discussed in section III.1, the threshold of 3 angstrom RMS for global structure-to-structure alignments (using side chain centers) excludes only close structural homologs. Therefore, the S1082 set includes many structural variations of the folds used in the training. In general, it is difficult to find completely independent test sets when using training sets covering essentially all the known folds. This problem concerns all the knowledge-based potentials considered here.

Using gapless threading we compare the performance of THOM2 to the performance of five knowledge-based pairwise potentials. As can be seen from table 10, the Godzik-Skolnick-Kolinski (GSK) potential [43] is the best in terms of the number of inequalities that are not satisfied, followed by the Betancourt-Thirumalai (BT) [44], Tobi-Elber (TE) [25], THOM2, Miyazawa-Jernigan (MJ) [45] and Hinds-Levitt (HL) [46] potentials. However, in terms of the number of proteins recognized exactly (i.e. proteins with native energies lower than the energies of all the decoys generated by gapless threading into all the structures in the S1082 set), the HL potential is the best, followed by the TE, MJ, THOM2, BT, and GSK potentials. The lack of correlation between the above two criteria is related to the fact that some of the above potentials, while recognizing many proteins very well, fare quite poorly for some of the proteins included in the S1082 set. Reducing the number of violated inequalities becomes important when applying additional filters to select correct predictions from a small subset of energetically favorable matches (e.g. the Z score test, see section III.4). Therefore, it would be desirable to satisfy both criteria at the same time (maximizing also the Z score of the distribution of energy gaps). From this point of view the TE, MJ and THOM2 potentials seem to be somewhat better than the other three potentials.
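The two criteria used above, the number of violated inequalities and the number of proteins recognized exactly, need not rank potentials in the same order. The following minimal sketch illustrates this with made-up energies; the function name, the data layout and the numbers are hypothetical and are not taken from table 10. An inequality is counted as violated when a decoy energy is not higher than the native energy of the same sequence.

```python
def gapless_summary(native_energy, decoy_energies):
    """Return (violated inequalities, proteins recognized exactly).

    native_energy[name]  -- energy of the sequence in its native structure
    decoy_energies[name] -- energies of the same sequence threaded (without
                            gaps) through other structures
    """
    violated = 0
    exact = 0
    for name, e_native in native_energy.items():
        # each decoy defines one inequality: E_decoy - E_native > 0
        bad = sum(1 for e in decoy_energies[name] if e <= e_native)
        violated += bad
        if bad == 0:
            exact += 1
    return violated, exact


# Illustrative energies for three proteins and two hypothetical potentials.
native = {"p1": -10.0, "p2": -8.0, "p3": -6.0}
decoys_a = {"p1": [-5.0, -4.0, -3.0], "p2": [-7.0, -2.0, -1.0], "p3": [-9.0, -8.0, -7.0]}
decoys_b = {"p1": [-11.0, -4.0, -3.0], "p2": [-9.0, -2.0, -1.0], "p3": [-5.0, -4.0, -3.0]}
summary_a = gapless_summary(native, decoys_a)
summary_b = gapless_summary(native, decoys_b)
print(summary_a, summary_b)
```

In this toy case the first potential fails badly on a single protein (three violated inequalities, two proteins recognized exactly), whereas the second fails marginally on two proteins (two violated inequalities, one protein recognized exactly), so each potential is the better one by a different criterion.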
Note that gapless training of energies is still a difficult problem, as reflected in table 10. None of the widely used potentials has a success rate better than 90%. In a set of 1000 proteins this means many errors.

** PLACE TABLE 10 HERE **

The conclusion, which is important for the present work, is that the performance of the THOM2 potential is comparable to the performance of pairwise potentials, including the TE potential trained on the same set using a similar LP protocol. Since the proteins used in this test were either not included in the training or represent at least considerable variations of the structures included in the training, we conclude that the exact learning on the training set does not result in over-fitting. This is further supported by the results (presented in the next section) for the S47 set of proteins that represent folds not sampled during the training.

IV.2 Self recognition by gapped alignments

We summarize first the performance of the THOM2 potential in terms of self-recognition of the HL set proteins by optimal alignments and Z score filters. The HL set was partially learned (using gapless threading). However, our training did not include the Z score or the possibility of gaps. Successful predictions based on the Z score only are useful tests even if performed on the training set of structures. Additionally, there are forty proteins in the HL set that were not included in the learning (TE) set.

For each sequence we generate all the global and local alignments into all the structures in the HL set. Energy and Z score filters are considered. Of the total of 246 proteins, 234 native (global) alignments obtain the lowest energy and the highest Z score. There are four native alignments resulting in weak Z scores. The four failures are membrane proteins (from the photosynthetic reaction centers) that were not included in the training set.
Only five out of the remaining 242 native alignments obtain Z scores smaller than 3 (four alignments with Z scores larger than 2.5 and one alignment with a Z score smaller than 2.5). For the local alignments we use the Z score as the main filter since there are many incorrect alignments with low energies. There are 226 local native alignments with Z scores larger than 2 (177 of them of rank one and 35 of them of rank two). Among the remaining 20 local native alignments, nine result in very low Z scores ($Z < 1.0$), including six structures from the training set. Using the double Z score filter with the conservative threshold of 3 for global Z scores and of 2 for local Z scores results in dismissing 23 native alignments as insignificant.

In order to assess further the generalization capacity of THOM2 in terms of self-recognition by optimal alignments, we use the S47 set again. The structures of the S47 proteins were embedded in the structures of the TE set and the sequences of 25 proteins representing different folds in the S47 set were aligned into all the structures of this extended set. We observe that the native structures are found with high probability. Twenty of twenty-five structures result in native alignments with global Z scores larger than 3 and local Z scores larger than 2 (see table 11).

** PLACE TABLE 11 HERE **

A less encouraging observation is the sensitivity of the results to structural fluctuations. The THOM2 model can identify related structures only if their distance is not too large. Seven out of 14 homologous structures with a DALI [37] Z score for structure-to-structure alignment larger than 10 are detected with high confidence. Only one homologous structure with a DALI Z score lower than 10 is detected. We would like to point out that only six structures (three pairs of structures representing three folds) of the S47 set had homologous counterparts in the training set.
It is therefore reassuring that most of the native structures and a significant fraction of the relatives are recognized both in terms of their energies and the Z scores. Moreover, there are no further significant hits into other structures from the TE set. Hence, no false positives above our confidence thresholds are observed in this test. We conclude that our nearly exact learning (on a training set) preserves significant capacity for identification of new folds using optimal alignments with gaps.

IV.3 Assessing the specificity of the protocol

We present here examples of family recognition (i.e. identification of homologs) in terms of energy and the double Z score filter. Only a few homologs are identified in a large set of (decoy) structures. This allows us to assess the specificity of the protocol, providing also a limited analysis of the sensitivity (see the next section for an extended assessment of the sensitivity).

The test set S1082 is used. Eight families that have at least three representatives in the S1082 set are chosen to illustrate various aspects of THOM2 threading alignments as compared to DALI [37] structure-to-structure alignments as well as sequence-to-sequence alignments. The latter are generated using the Smith-Waterman algorithm [18], with the BLOSUM50 [39] substitution matrix and structurally biased gap penalties (see section III.3). Since we do not incorporate family profiles in our threading protocol, we consider here only pairwise sequence alignments for comparison. Similarly to threading, the confidence of sequence matches is estimated using Z scores, defined by the distribution of scores for shuffled sequences. We find that structurally biased gap penalties improve the recognition in the case of weak sequence similarity.
We do not observe false positives (if there is no clear evidence of sharing a common ancestor and a common function, then the structural dissimilarity is used to define false positives) with more than 50% of the query sequence aligned and with a Z score larger than 8 (for sequence alignment). Note that the distribution of Z scores for sequence substitution matrices is different from that of threading potentials, with very high Z scores for highly homologous sequences.

Regarding the specificity of threading results for the families considered here, we point out that there are only two energy-based predictions with relatively high global and local threading Z scores that are false. They are still below our thresholds. The highest scoring false positive, namely the alignment of the aspartyl protease 1htrB into the xylanase 1clxA (Z scores of 3.7 and 1.5, when converged using 1000 shuffled sequences; see table 12.d), is still below our cutoffs. The alignment of the zinc finger protein 1meyC into the ADR1 DNA-binding domain 2adr is potentially the highest scoring false positive among the sequence-based matches. However, even though 1meyC and 2adr are structurally dissimilar according to DALI (RMS of 7.9 angstrom for 40 residues), they share very high sequence similarity (42% for 55 residues), have similar function and are classified as related folds (zinc finger design and classic zinc finger, respectively) by SCOP. Other false positives due to the sequence-to-sequence alignments obtain Z scores between 5 and 7, which makes predictions based on weak sequence similarity difficult.

** PLACE TABLE 12 HERE **

Regarding the sensitivity of the protocol, one finds first that all the native structures have the lowest energies and are recognized with high confidence in terms of the double Z score filter. We observe a varying degree of success in the recognition of family members and structural homologs, as illustrated in tables 12.a to 12.h.
Threading predictions are very robust for the RAS, lactoglobulin and glutathione transferase families. In the case of the RAS family (table 12.a), a number of matches into remote structural relatives that share certain structural motifs with the RAS fold is observed. The structural similarity between lactoglobulins and bilin-binding proteins (that do not share detectable sequence similarity) is recognized (see the alignment of 2blg into 2apd in table 12.b). Glutathione transferases 1aw9 and 1axdA, with very weak signals from sequence alignments, are recognized as well.

On the other hand, there are families for which threading performance is erratic, including the phosphotransferase, cytochrome and zinc finger families that include matches recognizable by sequence alignment, of similar length and significant structural similarity, yet not recognized by threading (see tables 12.c to 12.f). The results for the pepsin-like acid proteases (table 12.g) demonstrate missing matches due to significant differences in length, which are difficult to account for in global alignments. Local sequence and threading alignments for the proteases 1pfzA and 1lyaB result in high Z scores, but no signal from the global threading alignment is observed. The family of small toxins is an example of relatively weak signals (both from threading and sequence alignment) that are below our universal cutoffs for false positives (see table 12.h).

IV.4 Assessing protein family signals and the sensitivity of the protocol

Three families are considered here: globins (92 proteins), immunoglobulins (Fv fragments, 137 proteins) and the DNA-binding, POU-like domains (26 proteins). Sequences of all family members are aligned optimally to all the structures in the family. Both the local and global alignments are generated for each sequence-structure pair and the results are compared in terms of a simplified version of the double Z score filter discussed before.
Ideally, all the scores should be above the thresholds we presented earlier. The scores should also correlate with the RMS distance.

The THOM2 results are compared to the results of the TE pairwise potential, which was trained on the same (TE) set using the LP protocol [25]. The alignments due to the TE potential are computed using the first iteration of the frozen environment approximation (FEA) [23]. In THOM2, the number of neighbors to a secondary site determines its identity, whereas in the FEA it is approximated by the identity of the native residue at that site. In principle, the FEA should be iterated until self-consistency is achieved [23]. Alternatives to the FEA are global optimization techniques [22] that are computationally expensive and difficult to use at the scale of testing presented here. The purely structural characterization of contact types in THOM2 avoids this problem, making the THOM2 potential amenable to dynamic programming, at least for global alignments (see section IV.2).

Figures 4.a, 4.b and 4.c show the joint histograms of the sum of Z scores for local and global THOM2 threading alignments (with the trained gap penalties of table 8.b) vs. the RMS deviations between the superimposed side chain centers (see section III.1), for globins, immunoglobulins and POU-like domains, respectively. The vertical lines in the figures correspond to the sum of global and local Z scores equal to 5, which approximately discriminates the high confidence matches (with the sum of local and global Z scores larger than 5) from lower confidence matches that might be obscured by the false positives. Nearly all pairs differing by less than 3 angstrom RMSD can be identified by THOM2 threading alignments. Most of the matches in the range between 3 and 5 angstrom can still be identified with high confidence. Overall, 60%, 90% and 95% of homologs with RMSD smaller than 5 angstrom are recognized for the POU, globin and immunoglobulin families, respectively.
However, the number of matches with high confidence quickly decreases with growing RMS distance.

The population of matches that are difficult to identify by pairwise sequence-to-sequence alignments, with structurally biased gap penalties (see sections III.3 and IV.3), is represented by the filled squares. All the matches represented by circles can be identified with high confidence by sequence-to-sequence alignments (i.e. they result in Z scores larger than 8.0). Essentially all the pairs with RMSD smaller than 3 angstrom are identified by sequence alignments as well. Beyond this RMSD threshold, we observe many matches that can still be identified by threading but not by sequence alignment (filled rectangles with the sum of threading Z scores higher than 5). We also found examples of matches detected with high confidence by threading and not detected by PSI-BLAST [47] (with default parameters and the PDB database) in many of the families considered here, for example: globins 1flp and 1ash, immunoglobulin 2hfm and T-cell receptor 1cd8, toxins 1acw and 1pnh, lactoglobulin 2blg and bilin-binding protein 2apd, pheromones 2erl and 1erp, POU-like proteins 1akh and 1mbg. On the other hand, there are many sequence alignment matches that are not detected by threading.

** PLACE FIGURE 4 HERE **

The performance of the THOM2 and TE potentials is compared using one-dimensional histograms for the sum of Z scores for local and global threading alignments. For the sake of fair comparison, the ad hoc gap penalties, as defined in section III.3, are used for both potentials. As can be seen from figures 4.d and 4.e, for globins and POU-like domains the number of low Z scores for THOM2 is smaller than the number of low Z scores obtained with the TE potential and the FEA. For example, the number of low confidence matches (which can still be roughly defined as matches below the cutoff of 5) for globins increases from 2401 in the case of THOM2 to 3350 (out of 8558 matches) in the case of the TE potential.
One can also notice that the distribution of Z scores is different. The TE potential yields many high Z scores for alignments into very close homologs, as opposed to lower scores for more divergent pairs.

The somewhat worse performance of the pairwise model for these two families may result from the sub-optimality of the alignments that we generate using the FEA. Interestingly, the FEA with the TE potential also fails for a larger number of native alignments. For example, in the family of DNA-binding proteins the number of native alignments with very low Z scores (smaller than 4) is equal to 7 for TE and only 2 for THOM2.

On the other hand, there are families for which the TE potential works better. An example is the family of the immunoglobulins (figure 4.f). The FEA is expected to perform well when the sequence similarity is sufficiently high, since the information about the native sequences is used to generate optimal alignments. The divergence in terms of what can be detected by sequence similarity is larger for globins and POU-like proteins than for immunoglobulins. For example, contrary to the other families considered here, all the immunoglobulins with RMSD smaller than 4 angstrom can be detected by sequence alignments (see figure 4.c). Therefore, good performance of the FEA with the TE potential is expected in this case.

The above observation is further supported by the results of the FEA with the TE potential for the eight families from the S1082 set considered in the previous section. We do not include detailed results here. Instead, we summarize them. The threading results with the FEA and the TE potential are robust (and comparable to the THOM2 results) for the RAS, SH3 and acid protease families that are represented by proteins of high sequence similarity. The results of the FEA are considerably worse for the lactoglobulin and glutathione transferase families that are characterized by much lower success of sequence-based recognition (see table 12).
At the same time the FEA performs as poorly as THOM2 for the cytochrome and zinc finger families. An exception is observed for the toxin family, for which the FEA performs considerably better than THOM2, although there is no (or low) sequence similarity for some of the matches.

V. Conclusions and final remarks

In the present manuscript we proposed and applied an automated procedure for the design of threading models. The strength of the procedure, which is based on linear programming tools, is the automation and the ability of continuous exact learning. The LP protocol was used to evaluate different energy functions for accuracy and recognition capacity. Keeping in mind the necessity for efficient threading algorithms with gaps, we selected the THOM2 model as our best choice. Statistical filters based on local and global Z scores were outlined.

We observe that, while using conservative Z scores that essentially exclude false positives, the new protocol recognizes correctly (without any information about sequences) most of the family members with an RMS distance between the superimposed side chain centers of up to 5 angstrom and differences in length of up to 10 percent. We also observe many instances of successful recognition of family members that are not recognized by pair energies with the so-called frozen environment approximation.

The present approach is based on fitness of sequences into structures. Nevertheless, it is easily extendable to include also sequence similarity, family profiles or secondary structures. Such complementary “signals” are often employed in conjunction with pairwise potentials [9-11,16]. Threading protocols that are based exclusively on contact models were shown (consistent with our observations) to be quite sensitive to variations in structures [48]. The THOM2 model provides an alternative comparable in performance to pairwise potentials.
Therefore it can be used as a fast component of fold recognition methods employing pair energies, which is the target of a future work. Despite the limitations of the threading protocol that is based on the THOM2 potential and the double Z score filter (in terms of range of variations in structure and length that can be recognized), we found a number of useful predictions for remote homologs (e.g. [49]). Therefore, we decided to take part (group 280) in the recently held Critical Assessment of Fully Automated protein Structure Prediction methods (CAFASP) [50], even at the preliminary phase, without utilizing additional information as secondary structures or family profiles. The performance of the LOOPP server [30] was about average for all fold recognition targets (e.g. LOOPP missed some targets recognizable by Psi-BLAST). However, in the category of difficult-to-recognize targets, it was ranked among the best servers (rank 4 in the MaxSub 5.0 A evaluation), providing the best predictions among the servers for two difficult targets (T0097 and T0102) [50]. Acknowledgements This research was supported by an NIH NCRR grant to the Cornell Theory Center (acting director – Ron Elber) for the developments of Computational Biology Tools. It was further supported by a seed grant from DARPA to Ron Elber. Jaroslaw Meller acknowledges also partial support from the Polish State Committee for Scientific Research (grant 6 P04A 066 14). 37 References 1. Bowie JU, Luthy R, Eisenberg D, “A method to identify protein sequences that fold into a known three-dimensional structure”, Science, 1991; 253:164-170 2. Jones DT, Taylor WR, Thornton JM, “A new approach to protein fold recognition”, Nature 1992; 358:86-89 3. Sippl MJ, Weitckus S, “Detection of native-like models for amino acid sequences of unknown three-dimensional structure in a database of known protein conformations”, Proteins 1992; 13:258-271 4. 
Godzik A, Kolinski A, Skolnick J, “Topology fingerprint approach to the inverse folding problem”, J. Mol. Biol. 1992; 227:227-238 5. Ouzounis C, Sander C, Scharf M, Schneider R, “Prediction of protein structure by evaluation of sequence-structure fitness. Aligning sequences to contact profiles derived from 3D structures”, J. Mol. Biol. 1993; 232:805-825 6. Bryant SH, Lawrence CE, “An empirical energy function for threading protein sequence through folding motif”, Proteins 1993; 16:92-112 7. Matsuo Y, Nishikawa K, “Protein structural similarities predicted by a sequencestructure compatibility method”, Protein Sci. 1994; 3:2055-2063 8. Mirny LA and Shakhnovich EI, “Protein structure prediction by threading. Why it works and why it does not”, J. Mol. Biol. 1998; 283:507-526 9. Jones DT, “GenTHREADER: An Efficient and Reliable Protein Fold Recognition Method for Genomic Sequenecs”, J. Mol. Biol. 1999; 287:797-815 38 10. Panchenko AR, Marchler-Bauer A, Bryant SH, “Combination of threading potentials and sequence profiles improves fold recognition”, J. Mol. Biol. 2000; 296: 1319-1331 11. Sternberg MJE, Bates PA, Kelley LA, MacCallum RM, “Progress in protein structure prediction: assessment of CASP3”, Curr. Opin. Struct. Biol. 1999; 9:368-373 12. Liwo A, Oldziej S, Pincus MR, Wawak RJ, Rackovsky S, Scheraga HA, “A united-residue force field for off-lattice protein structure simulations: functional forms and parameters of long range side chain interaction potentials from protein crystal data”, J. Comp. Chem. 1997; 18: 849-873 13. Xia Y, Huang ES, Levitt M, Samudrala R, “Ab initio construction of protein tertiary structures using a hierarchical approach”, J. Mol. Biol. 2000; 300: 171185 14. Babajide A, Hofacker IL, Sippl MJ, Stadler PF, “Neural networks in protein space: a computational study based on knowledge-based potentials of mean force”, Folding Design 1997: 2: 261-269 15. 
Babajide A, Farber R, Hofacker IL, Inman J, Lapedes AS, Stadler PF, “Exploring protein sequence space using knowledge based potentials”, J. Comp. Biol. 1999, submitted
16. Elofsson A, Fischer D, Rice DW, Le Grand S, Eisenberg D, “A study of combined structure-sequence profiles”, Folding & Design 1998; 1:451-461
17. Needleman SB, Wunsch CD, “A general method applicable to the search for similarities in the amino acid sequences of two proteins”, J. Mol. Biol. 1970; 48:443-453
18. Smith TF, Waterman MS, “Identification of common molecular subsequences”, J. Mol. Biol. 1981; 147:195-197
19. Johnson MS, Overington JP, Blundell TL, “Alignment and searching for common protein folds using a data bank of structural templates”, J. Mol. Biol. 1993; 231:735-752
20. Cormen TH, Leiserson CE, Rivest RL, “Introduction to algorithms”, The MIT Press, Cambridge MA 1985, chapter 16
21. Lathrop RH, Smith TF, “Global optimum protein threading with gapped alignment and empirical pair score functions”, J. Mol. Biol. 1996; 255:641-665
22. Lathrop RH, “The protein threading problem with sequence amino-acid interaction preferences is NP-complete”, Protein Eng. 1994; 7:1059-1068
23. Goldstein RA, Luthey-Schulten ZA, Wolynes PG, “The statistical mechanical basis of sequence alignment algorithms for protein structure prediction”, in “Recent developments in theoretical studies of proteins”, Ed. Ron Elber, World Scientific, Singapore, 1996, chapter 6
24. Maiorov VN, Crippen GM, “Contact potential that recognizes the correct folding of globular proteins”, J. Mol. Biol. 1992; 227:876-888
25. Tobi D, Shafran G, Linial N, Elber R, “On the design and analysis of protein folding potentials”, Proteins: Structure Function and Genetics 2000; 40:71-85
26. Vendruscolo M, Domany E, “Pairwise contact potentials are unsuitable for protein folding”, J. Chem. Phys. 1998; 109:11101-11108
27.
Meller J, Wagner M, Elber R, “Maximum feasibility guideline in the design and analysis of protein folding potentials”, J. Comp. Chem. 2001, submitted
28. Meszaros CS, “Fast Cholesky factorization for interior point methods for linear programming”, Computer and Mathematics with Applications 1996; 31:49-51
29. Adler I, Monteiro RDC, “Limiting behavior of the affine scaling continuous trajectories for linear programming problems”, Math. Program. 1991; 50:29-51
30. Taylor WR, Munro RE, “Multiple sequence threading: conditional gap placement”, Folding & Design 1997; 2:S33-S39
31. Hinds DA, Levitt M, “Exploring conformational space with a simple lattice model for protein structure”, J. Mol. Biol. 1994; 243:668-682
32. Bryant SH, Altschul SF, “Statistics of sequence-structure threading”, Curr. Opin. Struct. Biol. 1995; 5:236-244
33. Fitch WM, “Random sequences”, J. Mol. Biol. 1983; 163:171-176
34. Altschul SF, Erickson BW, “Significance of nucleotide sequence alignments: a method for random sequence permutation that preserves dinucleotide and codon usage”, Mol. Biol. Evol. 1985; 2:526-538
35. Fischer D, Elofsson A, Rice D, Eisenberg D, “Assessing the performance of fold recognition methods by means of a comprehensive benchmark”, in: Pacific Symposium on Biocomputing, Hawaii 1996; 300-318
36. CASP3. Third community wide experiment on the critical assessment of techniques for protein structure prediction, Proteins: Structure Function and Genetics 1999, Suppl. 3, see also http://Predictioncenter.lnl.gov/casp3
37. Holm L, Sander C, “The FSSP database of structurally aligned protein fold families”, Nucleic Acids Res. 1994; 22:3600-3609, see also DALI server: http://www2.embl-ebi.ac.uk/dali
38. Meller J, Elber R, Learning, Observing and Outputting Protein Patterns (LOOPP): a program for protein recognition and design of folding potentials, http://www.tc.cornell.edu/CBIO/loopp
39.
Henikoff S, Henikoff JG, “Amino acid substitution matrices from protein blocks”, Proc. Natl. Acad. Sci. USA 1992; 89:10915-10919
40. Gumbel EJ, “Statistics of extremes”, New York: Columbia University Press, 1958
41. Karlin S, Altschul SF, “Methods for assessing the statistical significance of molecular sequence features by using general scoring schemes”, Proc. Natl. Acad. Sci. USA 1990; 87:2264-2268
42. Pearson WR, Lipman DJ, “Improved tools for biological sequence comparison”, Proc. Natl. Acad. Sci. USA 1988; 85:2444-2448; Pearson WR, “Empirical statistical estimates for sequence similarity searches”, J. Mol. Biol. 1998; 276:71-84
43. Godzik A, Kolinski A, Skolnick J, “Are proteins ideal mixtures of amino acids? Analysis of energy parameter sets”, Protein Sci. 1995; 4:2107-2117
44. Betancourt MR, Thirumalai D, “Pair potentials for protein folding: choice of reference states and sensitivity of predicted native states to variations in the interaction schemes”, Protein Sci. 1999; 8:361-369
45. Miyazawa S, Jernigan RL, “Residue-residue potentials with a favorable contact pair term and an unfavorable high packing density term for simulation and threading”, J. Mol. Biol. 1996; 256:623-644
46. Hinds DA, Levitt M, “A lattice model for protein structure prediction at low resolution”, Proc. Natl. Acad. Sci. USA 1992; 89:2536-2540
47. Altschul SF, Madden TL, Schaffer AA, Zhang J, Zhang Z, Miller W, Lipman DJ, “Gapped BLAST and PSI-BLAST: a new generation of protein database search programs”, Nucleic Acids Res. 1997; 25:3389-3402
48. Bryant SH, “Evaluation of threading specificity and accuracy”, Proteins 1996; 26:172-185
49. Frary A, Nesbitt TC, Frary A, Grandillo S, van der Knaap E, Cong B, Liu J, Meller J, Elber R, Alpert KP, Tanksley SD, “Cloning Transgenic Expression and Function of fw2.2: a Quantitative Trait Locus Key to the Evolution of Tomato Fruit”, Science 2000; 289:85-88
50. Fischer D, et al. CAFASP-2: The Second Critical Assessment of Fully Automated Structure Prediction Methods.
Proteins, Dedicated CASP4 issue, submitted, 2001.

Figure legends

1. Contour plots of the total energy contributions to the native alignments in the TE set for valine and lysine residues as a function of the number of neighbors in the first and second shells. Figure 1.a shows that contacts of valine residues that have five to six neighbors with other residues that have a medium number of neighbors stabilize most of the native alignments. On the other hand (figure 1.b), only contacts involving lysine residues with a small number of neighbors stabilize native alignments.
2. The probability distribution function of the Z scores computed with local threading alignments for the population of false positives. The 47 sequences of the proteins included in the S47 set are used to sample the distribution of the Z scores for false positives (proteins of the S47 set have no homologs in the HL set; see text for details). Each of the sequences is aligned to all the structures included in the HL set. The Z scores are calculated for the two hundred best matches (according to energy), using 100 shuffled sequences. The observed distribution of Z scores for 6813 local threading alignments is represented by '+'. Note the significant tail to the right, which means a relatively high likelihood of observing false positives with large Z scores. The dotted line shows the attempted analytical fit to the Gaussian distribution, whereas the solid line shows the attempted fit to the extreme value distribution (EVD). Note that the actual distribution deviates significantly from both. According to the analytical fit to the EVD, the probability of observing by chance a Z score larger than $Z_P$ is equal to $P(Z > Z_P) = 1 - \exp[-\exp(-1.313\,Z_P - 0.466)]$, with the 98% confidence intervals $1.313 \pm 0.112$ and $0.466 \pm 0.079$. For example, the probability of observing by chance a Z score larger than 4 is equal to 0.003.
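As a numerical check on the extreme value fit quoted in this legend, the tail probability can be evaluated directly. The following is a minimal Python sketch and not part of the LOOPP code: the function name and constants are introduced here for illustration only, and the sign convention (both fitted coefficients entering the inner exponent with a negative sign) is an assumption, chosen because it reproduces the value of about 0.003 quoted for a Z score of 4.

```python
import math

# Fitted EVD coefficients quoted in the legend of figure 2.
EVD_SLOPE = 1.313
EVD_OFFSET = 0.466


def evd_tail_probability(z_p: float) -> float:
    """Probability of observing by chance a Z score larger than z_p.

    Assumed form: P(Z > z_p) = 1 - exp(-exp(-1.313 * z_p - 0.466)).
    """
    return 1.0 - math.exp(-math.exp(-EVD_SLOPE * z_p - EVD_OFFSET))


if __name__ == "__main__":
    # Print the estimated false positive probability for a few thresholds.
    for z in (2.0, 3.0, 4.0):
        print(f"P(Z > {z:.1f}) = {evd_tail_probability(z):.4f}")
```

Under this assumed form the probability decreases monotonically with the threshold, so raising the Z score cutoff trades sensitivity for a smaller expected number of false positives.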
We emphasize, however, that the analytical fit to the extreme value distribution provides an upper bound for the observed number of false positives.
3. The joint probability distribution for the Z scores of global and local alignments. Circles at the lower left corner represent a population of 1081 false positives, resulting from the alignments of the S47 set sequences (see figure 4) against all structures in the HL set (100 best global and 200 best local matches are considered, disregarding matches with positive energies of global alignments). The best scoring false positive is slightly below the threshold (3,2). The population in the upper right corner (square boxes) represents 331 pairs of HL sequences aligned to HL structures with global Z scores larger than 2.5 and local Z scores larger than 1. This set includes 236 native alignments and 95 non-native alignments. There are 10 matches that are false positives (filled squares) and they are all below the threshold (3,2). Stiffer energy constraints were employed, with only the 10 best global and 200 best local alignments considered. There is a population of true positives below (2.5,1.0), which are not shown in the figure (including 10 native alignments). However, the number of false positives below this threshold makes predictions in this range difficult.
4. Comparison of family recognition by THOM2 and pair energies. The results of THOM2 (with the trained gap penalties of table 8.b) for families of globins, POU-like domains and immunoglobulins (Fv fragments) are shown in figures 4.a, 4.b and 4.c, respectively. The joint histograms of the sum of Z scores for local and global threading alignments vs. the RMS deviations between the superimposed (according to structure-to-structure alignments) side-chain centers are presented. The population of matches that are difficult to identify by sequence-to-sequence alignments is represented by the filled squares.
Next, the THOM2 results are compared to the results of the Tobi-Elber (TE) pairwise potential [25], using the ad hoc gap penalties defined in section V.2. The TE potential was optimized using the LP protocol and the same training set. The first iteration of the so-called frozen environment approximation (FEA) [23] is performed to obtain approximate alignments for the TE potential. Figures 4.d, 4.e and 4.f show one-dimensional histograms of the sum of Z scores for local and global threading alignments for the globin, POU and immunoglobulin families, respectively. Note that the number of low THOM2 Z scores (smaller than 5) is smaller for the families of globins and POU-like proteins. On the other hand, the TE potential and the FEA perform better for the family of immunoglobulins, which is also easier for sequence alignment methods (see text for details).

Table legends

1. The definitions of different groups of amino acids that are used in the present study. Note that ten types of amino acids are found to be sufficient to solve the Hinds-Levitt set either by pairwise interaction models or by THOM2. The amino acid types are: HYD, POL, CHG, CHN, GLY, ALA, PRO, TYR, TRP, HIS. The list implies that when an amino acid appears explicitly, it is excluded from other groups that may contain it. For example, HYD includes in this case CYS, ILE, LEU, MET, and VAL, while CHG includes ARG and LYS only, since the negatively charged residues form a separate group.
2. Definitions of energy types for the THOM2 energy model. There are fifteen types of energy terms in THOM2 that are based on contacts in the first and the second contact layers. A contact between two amino acids is “on” if their distance is smaller than 6.4 angstrom. We consider five types of contacts in the first layer and three in the second layer. Thus there are 20 × 15 = 300 different energy terms for twenty different amino acids.
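The binning of sites into THOM2 contact types described in this legend can be sketched in a few lines of Python. This is an illustrative sketch only: the function names are hypothetical and are not taken from the LOOPP code, and the class boundaries are those listed in table 2 (first layer: n = 1,2; 3,4; 5,6; 7,8; 9 or more; second layer: n' = 1,2; 3 to 6; 7 or more).

```python
def first_layer_class(n: int) -> int:
    """Representative number of neighbors for a site in the first contact layer."""
    if n < 1:
        raise ValueError("a site in contact has at least one neighbor")
    if n <= 2:
        return 1
    if n <= 4:
        return 3
    if n <= 6:
        return 5
    if n <= 8:
        return 7
    return 9


def second_layer_class(n_prime: int) -> int:
    """Representative number of neighbors for a site in the second contact layer."""
    if n_prime < 1:
        raise ValueError("a site in contact has at least one neighbor")
    if n_prime <= 2:
        return 1
    if n_prime <= 6:
        return 5
    return 9


def contact_type(n: int, n_prime: int) -> tuple[int, int]:
    """Contact type label, e.g. (5, 5) for two sites with a medium number of neighbors."""
    return (first_layer_class(n), second_layer_class(n_prime))


# Five first-layer classes times three second-layer classes.
N_CONTACT_TYPES = 5 * 3


def n_parameters(n_amino_acid_types: int) -> int:
    """Number of THOM2 energy parameters for a given amino acid alphabet size."""
    return n_amino_acid_types * N_CONTACT_TYPES
```

With these definitions, `n_parameters(20)` gives the 300 energy terms quoted above for the full amino acid alphabet.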
A reduced set of amino acids is associated with a smaller number of parameters to optimize (for ten types of amino acids the number of parameters is 10 × 15 = 150). The notation we use for each type of site is based on a representative number of neighbors. The number of neighbors, n, in a given class and its representative are given in the first column (for different classes of sites in the first layer) and in the first row (for different classes of sites in the second layer). The intersections between columns and rows correspond to contacts of different types: a contact between two sites with a medium number of neighbors is denoted by (5,5), for example.
3. Comparing the capacity of different threading potentials. The recognition capacity of pairwise and profile threading potentials is measured by gapless threading on the HL and TE representative sets of proteins (see section III.1). THOM1 performs significantly worse than pairwise potentials. THOM2 shows a comparable performance and is able to learn the TE set (see also table 10). For each potential the number of amino acid types used and the resulting number of parameters are reported. The training set used (either HL or TE) is indicated by a star (*) symbol. The number of correct predictions for structures in the HL and TE sets is given in the second and third columns, respectively.
4. Parameters of some of the threading potentials trained using the LP protocol. Numerical values of the energy parameters for the THOM1 potential trained on the HL set of proteins (table 4.a) and the THOM2 potential trained on the TE set of proteins (table 4.b, see text for details). The rows in the tables correspond to either different types of sites (THOM1) or contacts (THOM2). The columns correspond to different types of amino acids.
5. Characterization of native and decoy structures.
Frequencies of different types of sites (relevant for the training of THOM1) found in the native structures of the HL set, as opposed to decoy structures generated using the HL set, are presented in table 5.a. In THOM1 the type of a site is defined by the number of its neighbors (n). Frequencies are given as percentages of the total number of 53012 native sites in the HL set and of the 556.14 million decoy sites generated using the HL set, respectively. Frequencies of different types of contacts (appropriate for the training of THOM2) found in the native structures of the TE set, as opposed to decoy structures generated using the TE set, are given in table 5.b. Different classes of contacts are specified in table 2. Frequencies are given as percentages of the total number of 439364 native contacts in the TE set and of the 10089.19 million decoy contacts generated using the TE set, respectively. The overall site and contact distributions are split into distributions for hydrophobic and polar residues (as defined in table 1), which are given in parentheses.
6. Cooperativity in effective pairwise interactions of the THOM2 potential. For a pair of amino acids α and β in contact, we have 25 different possible types of contacts (and consequently 25 different effective energy contributions), as α and β may occupy sites that belong to one of the five different types characterized by the increasing number of contacts in the first contact shell (see table 2). Moreover, the 5 by 5 interaction matrix will in general be asymmetric. The effective energies of contact between two VAL residues with a different number of neighbors are given in table 6.a, whereas the energies of contacts between two LYS residues are given in table 6.b.
7. Pairs of homologous structures used for the training of gap penalties. For each pair the native and the homologous structures are specified by their PDB codes, names and lengths in the first and second column, respectively.
In the third column, the similarity between the native and the homologous proteins is defined in terms of the sequence identity [%], RMS distance [Ang] and length [number of residues], as defined by structure-to-structure alignments obtained by submitting the corresponding pairs to the DALI server [37].
8. The gap penalties for the THOM2 model as trained by the LP protocol. The limited set of homologous structures from table 7 is used. The initial guess of gap penalties for different types of sites in the THOM1 model is given in table 8.a. Optimized gap penalties for different types of contacts in the THOM2 model are given in table 8.b. Penalties that are not specified explicitly are equal to the maximum value of 10.0. Note that the training here is limited and will be extended in future work.
9. Comparison of alignments of the myoglobin (1mba) sequence into the leghemoglobin (1lh2) structure. The THOM1 alignment with the initial gap penalties is included in table 9.a, whereas the THOM2 alignment with trained gap penalties is included in table 9.b. Note that the location of insertions in the initial alignment (which is used for the training of gap energies) is to a large extent consistent with the DALI structure-to-structure alignment [37], which aligns: residues 2-50 of 1mba to 3-51 of 1lh2, residues 53-56 of 1mba to 52-55 of 1lh2 (implying deletions at positions 51 and 52 in 1mba), residues 59-80 of 1mba to 56-77 of 1lh2, residues 81-86 of 1mba to 82-87 of 1lh2, residues 87-121 of 1mba to 89-123 (with the implied insertion at position 88 in 1lh2), residues 122-139 of 1mba to 126-143 of 1lh2 (implying two insertions at positions 124 and 125 in 1lh2) and residues 140-145 of 1mba to 145-150 of 1lh2 (with an insertion at position 144 in 1lh2), respectively. Alpha helices in both structures are marked in bold. Note that the F and G helices are shifted considerably in the DALI alignment (there is no counterpart of the D helix in 1lh2).
The initial THOM1 alignment is in perfect agreement with the DALI superposition between residues 88 and 150 of 1lh2, except for two insertions at positions 128 and 147 (shifted by three residues with respect to the DALI alignment). The insertions at positions 88, 125, 151 and 153 coincide with the DALI alignment. The THOM2 alignment of table 9.b, obtained with the trained gap penalties, is in perfect agreement with the DALI superposition for residues 10 to 50 of 1lh2 (including the A, B and C helices) and then departs from the DALI alignment, overlapping the E, F and G helices with a smaller shift.
10. Comparison of THOM2 and knowledge-based pairwise potentials using gapless threading. The results of gapless threading on the S1082 set (see text for the details) are reported. The results of the THOM2 potential are compared to five other knowledge-based pairwise potentials by: Betancourt and Thirumalai (BT) [44], Hinds and Levitt (HL) [46], Miyazawa and Jernigan (MJ) [45], Godzik, Kolinski and Skolnick (GKS) [43] and Tobi and Elber (TE) [25]. The latter potential was trained using the LP protocol and the same (TE) training set. Potentials are ordered according to the number of proteins recognized exactly (out of 1082), given in the 2nd column. The number of inequalities that are not satisfied (out of approximately 95 million inequalities generated from the S1082 set) is given in the 3rd column (in units of millions). The 4th column reports Z scores (i.e. the ratios of the first and second moments) for the distributions of energy differences between the native and misfolded structures. Note the lack of correlation between the number of proteins that are missed and the number of inequalities that are not satisfied. See text for more details.
11. Self-recognition for folds that were not learned. The S47 set of proteins is used in order to test the self-recognition.
It is also a test of the sensitivity of the results to structural fluctuations for 25 different folds (of which 22 were not represented in the training set) using the double Z score test. Pairs of homologous structures belonging to the S47 set are specified in the first column (three folds are represented by a single structure; for 2a2u its structural relative from the training set is included), using their PDB codes and lengths (given in parentheses). If the domain is not specified and one refers to a multi-domain protein, then the A (or first) domain is used. The results of global and local THOM2 threading alignments of the 25 query sequences into an extended TE + S47 set are reported in the 3rd and 4th column, respectively. Query sequences are indicated in italic (for each pair the first line describes the native alignment and the second line an alignment into a homologous structure). Two of the 25 native alignments gave weak signals (DNA binding protein 1bl0 and glycosidase 1bhe). Four other native alignments (2a2u, 1byf, 1jwe and 1bkb) result in global Z scores somewhat smaller than 3. The DALI [37] Z scores and RMS deviations for structure-to-structure alignments into native and homologous structures are reported in the second column. Low DALI Z scores indicate that only short fragments of the respective structures are aligned and the resulting RMS deviation may not be representative. Most of the homologous structures with a DALI Z score larger than 10 are recognized with high confidence. The alignment of the 2a2u sequence into the 1bbp structure was the only significant hit of any of the query sequences into the structures included in the training (TE) set. Thus, no false positives with scores above our confidence cutoffs were observed. High confidence predictions (global Z score larger than 3.0 and local Z score larger than 2.0) are indicated in bold.
12. Examples of predictions for families of homologous proteins.
Eight families, with a number of representatives included in the S1082 set, are used to illustrate various degrees of success of our threading protocol in terms of sensitivity and specificity. The results for global and local threading alignments using the THOM2 potential, together with the results for (structurally biased) local sequence-to-sequence alignments and DALI structure-to-structure alignments, are reported. Names of proteins (PDB codes) and their lengths are included in columns 1 and 5, respectively. Representatives used as query sequences that are aligned to all the structures in the S1082 set are marked in bold. The Z scores reported in columns 2, 3 and 6, respectively, are computed using 50 shuffled sequences for a number of alignments with the lowest energies: the 20 best matches in the case of global threading (GT) alignments, the 500 best matches in the case of local threading (LT) alignments and the 50 best matches in the case of local sequence (LS) alignments. The rank of the energy of global threading alignments is reported in the 4th column. The DALI [37] alignments between the (known) structure of a query and the structure of a match are characterized in the last column: Z score, RMS distance, length of the aligned fragment and the identity for this fragment are provided (a star symbol indicates that comparisons with the FSSP representative of the query structure are used instead of a direct DALI alignment). A dash symbol is used to indicate the lack of a detectable (threading, sequence or structural) similarity. Matches are ordered according to the sum of global and local threading Z scores, and according to Z scores of the local sequence alignments if no threading signal is detected. False positives (defined as matches with DALI Z scores lower than 2.0) are indicated in italic. The highest scoring false positives for both threading and sequence alignments are reported for each family.
Examples of families with successful threading predictions that do not share a detectable sequence similarity, or that have only a weak signal from sequence-to-sequence alignment (Z score lower than 8.0), are included in tables 12.a, 12.b and 12.c. For the families included in tables 12.d, 12.e and 12.f threading is less successful, missing a number of family members (of similar length) that can be detected by sequence-to-sequence alignment. Lack of detection when the difference in length is significant (see table 12.g) is expected and is one of the limitations of the present protocol. For the last family (table 12.h) the DALI results could not be retrieved and therefore the SCOP classification is used to define structural relatives (i.e. proteins that share the knottins fold).

Table 1

Hydrophobic (HYD)           ALA CYS HIS ILE LEU MET PHE PRO TRP TYR VAL
Polar (POL)                 ARG ASN ASP GLN GLY LYS SER THR
Charged (CHG)               ARG ASP GLU LYS
Negatively Charged (CHN)    ASP GLU

Table 2

Type of site*    n’=1,2; 1    n’=3,4,5,6; 5    n’≥7; 9
n=1,2; 1         (1,1)        (1,5)            (1,9)
n=3,4; 3         (3,1)        (3,5)            (3,9)
n=5,6; 5         (5,1)        (5,5)            (5,9)
n=7,8; 7         (7,1)        (7,5)            (7,9)
n≥9; 9           (9,1)        (9,5)            (9,9)

Table 3

POTENTIAL                        Hinds-Levitt set    Tobi-Elber set
Pairwise, HP model, par. free    200                 456
Pairwise, 10 aa, 55 par          246*                504
Pairwise, 20 aa, 210 par         246*                530
Pairwise, 20 aa, 210 par         237                 594*
THOM1, 20 aa, 200 par            246*                474
THOM2, 10 aa, 150 par            246*                478
THOM2, 20 aa, 300 par            246*                428
THOM2, 20 aa, 300 par            236                 594*

Table 4

Table 4.A: THOM1
ALA ARG ASN ASP CYS GLN GLU GLY HIS ILE LEU LYS MET PHE PRO SER THR TRP TYR VAL
(1) -0.02 0.10 -0.22 0.02 -0.13 0.02 0.05 -0.05 -0.15 -0.17 -0.04 0.13 -0.40 -0.52 0.29 -0.02
(2) -0.06 -0.23 -0.07 0.20 -0.37 0.21 -0.03 -0.06 -0.05 -0.30 -0.22 0.12 -0.20 -0.25 0.24 -0.01 -0.10 -0.57 -0.27 -0.25
(3) -0.02 -0.01 -0.01 0.43 -0.72 0.09 0.10 0.05 -0.25 -0.48 -0.37 0.19 -0.66 -0.58 0.06 0.05 -0.12 -0.77 -0.37 -0.38
(4) -0.17 0.12 0.29 0.37 -0.70 0.22 0.40 0.14 -0.31 -0.64 -0.41 0.60 -0.50 -0.68 0.22 0.00 0.21 -0.36 -0.39 -0.36
(5) -0.13 0.22 0.20 0.68 -1.13 0.33 0.45 0.38 0.24 -0.53 -0.50 0.37 -0.39 -0.65 0.31 0.31 0.02 -0.65 -0.78 -0.51
(6) 0.02 0.32 0.17 0.43 -1.16 0.02 0.70 0.42 0.36 -0.57 -0.58 0.63 -0.80 -0.82 0.75 0.27 0.24 -0.46 -0.72 -0.51
(7) 0.12 -0.10 0.30 0.27 -0.76 -0.54 0.73 -0.44 -0.40 0.42 0.09 0.36 0.57 -0.66 0.25 0.02 0.36 0.15 -0.26 -0.74 -0.59 1.66 -1.03 1.13 2.23 -0.57 10.00 -0.38 -0.13 0.43 -1.27 0.46 0.39 0.20
(8) -0.07 0.91 -0.12 -0.01 -1.60 0.51 0.83 0.29 -0.71 -1.37 -0.72
(9) 1.36 0.82 10.00
(10) 0.83 0.11 0.35 -1.71 1.57 10.00 10.00 10.00 10.00 10.00 10.00 2.12 3.38 -0.33 1.03 10.00 0.83 10.00 -0.93 -0.47 10.00 10.00 0.02 -0.20 -0.23 -0.16 0.12 -0.39 -0.78 0.40 10.00 10.00 10.00 -0.78 10.00

Table 4.B: THOM2
ALA ARG ASN ASP CYS GLN GLU GLY HIS ILE LEU LYS MET PHE PRO SER THR TRP TYR VAL
(1,1) 0.23 -0.03 -0.03 -0.08 -0.82 -0.26
(1,5) -0.21 -0.26 -0.10
(1,9) -6.01 -4.09 -5.42 -6.14 -7.27 -5.88 -5.80 -5.81 -4.75 -5.46 -5.85 -4.91 -4.97 -5.83 -6.17 -5.89 -5.89 -5.25 -6.79 -6.99
(3,1) -0.01 -0.10 -0.17 0.02 -0.50 -0.09 0.11 0.31
(3,5) -0.08 0.18 0.15 0.13 -0.69 0.12 0.24 0.04 -0.03 -0.29 -0.21
(3,9) -0.29 0.06 -0.33 0.08 -0.78 0.18 0.02 -0.13 -0.47 -0.60 -0.49
(5,1) 0.13 -0.21 0.04 0.22
-0.15 -0.11 0.08 0.48
(5,5) 0.06 0.16 0.20 0.17 -0.60 0.13 0.18 -0.04 -0.25 -0.19
(5,9) -0.65 0.68 -0.26 -0.19 -0.82 -0.09 0.43 -0.36 -0.19 -0.47 -0.42
(7,1) 6.29 5.50 5.56 6.02 5.09 5.55 5.68 6.10 5.70
(7,5) 0.17 0.29 0.36 0.39 -0.28 0.28 0.45 0.33
(7,9) 0.08 0.41 0.00 -0.15 -0.30 0.04 -0.27 0.05 0.69
(9,1) 10.00 4.50 6.05 5.21 4.00 5.94 10.00 10.00 10.00 10.00
(9,5) 0.26 0.30 0.26 0.71 0.41 -0.02
(9,9) 0.20 0.04 -0.37 -1.34 -1.19 0.20 -1.11 0.09 0.29 0.07 -0.12 -0.16 -0.02 0.03 0.05 -0.07 -0.50 -0.64 -0.28 0.00 -0.08 0.00 0.03 -0.31 -0.23 -0.13 -0.15 -0.29 -0.23 0.07 -0.09 -0.60 -0.40 -0.36 0.04 0.47 0.32 0.04 -0.10 -0.10 0.11 -0.20 -0.17 -0.02 0.40 0.06 -0.31 -0.29 -0.05 0.14 0.06 0.08 -0.36 -0.28 -0.17 0.09 -0.85 -0.07 0.19 0.23 0.15 -0.15 0.03 -0.27 0.19 -0.15 -0.32 -0.06 -0.15 -0.27 0.17 0.19 0.34 -0.07 0.02 0.26 -0.26 -0.28 0.09 0.11 0.02 -0.36 -0.30 -0.27 0.34 0.32 0.07 0.55 0.22 0.01 0.04 -0.46 -0.58 5.26 6.08 5.64 5.80 5.82 5.23 5.48 6.42 0.28 -0.08 -0.01 0.50 0.24 -0.16 0.42 0.13 0.34 0.04 -0.08 -0.03 0.67 0.06 0.03 -0.71 0.82 0.24 -0.36 5.59 4.91 6.02 9.61 10.00 10.00 0.52 -0.19 0.43 3.07 0.43 1.41 -1.33 6.94 3.22 -0.54 0.83 -0.09 1.37 -1.36 0.21 -0.20 5.59 0.04 -0.17 6.22 1.26 -0.15 1.06 -1.99 -0.25 -0.29 0.08 -0.32 -0.05 5.17 0.19 5.53 0.14 -0.25 5.88 10.00 10.00 0.52 -0.08 0.08 0.21 0.81 -0.53 -0.52 0.71

Table 5

A:
Type of site*    Native (HYD / POL)       Decoys (HYD / POL)
(1)              16.97 (4.89 / 12.09)     24.20 (11.72 / 12.48)
(2)              17.30 (6.06 / 11.24)     21.72 (10.52 / 11.20)
(3)              17.72 (8.29 / 9.43)      18.70 (9.06 / 9.64)
(4)              16.60 (9.68 / 6.92)      15.00 (7.28 / 7.73)
(5)              14.62 (10.16 / 4.47)     10.79 (5.24 / 5.55)
(6)              9.96 (7.66 / 2.30)       6.04 (2.94 / 3.10)
(7)              4.95 (4.02 / 0.92)       2.63 (1.28 / 1.35)
(8)              1.57 (1.32 / 0.25)       0.77 (0.38 / 0.40)
(9)              0.26 (0.21 / 0.05)       0.12 (0.06 / 0.06)
(10)             0.04 (0.04 / 0.00)       0.02 (0.01 / 0.01)

B:
Type of contact    Native (HYD / POL)        Decoys (HYD / POL)
(1,1)              5.09 (1.59 / 3.50)        11.34 (5.48 / 5.85)
(1,5)              9.02 (2.99 / 6.04)        12.69 (6.15 / 6.54)
(1,9)              0.41 (0.15 / 0.26)        0.35 (0.17 / 0.18)
(3,1)              6.25 (2.88 / 3.37)        9.51 (4.60 / 4.91)
(3,5)              24.09 (13.01 / 11.08)     26.59 (12.91 / 13.68)
(3,9)              3.23 (1.88 / 1.35)        2.29 (1.12 / 1.18)
(5,1)              2.77 (1.81 / 0.96)        3.18 (1.54 / 1.64)
(5,5)              28.36 (20.96 / 7.40)      22.09 (10.75 / 11.34)
(5,9)              6.85 (5.11 / 1.74)        3.84 (1.87 / 1.96)
(7,1)              0.40 (0.31 / 0.09)        0.34 (0.16 / 0.17)
(7,5)              9.56 (8.00 / 1.56)        5.84 (2.85 / 3.00)
(7,9)              3.21 (2.60 / 0.61)        1.54 (0.75 / 0.79)
(9,1)              0.01 (0.01 / 0.00)        0.01 (0.01 / 0.01)
(9,5)              0.52 (0.44 / 0.08)        0.29 (0.15 / 0.14)
(9,9)              0.23 (0.19 / 0.04)        0.09 (0.05 / 0.05)

Table 6

A:
        V(1)     V(3)     V(5)     V(7)     V(9)
V(1)    -0.56    -0.41    -0.17    -1.46    3.01
V(3)    -0.41    -0.34    -0.44    -0.30    -0.07
V(5)    -0.17    -0.44    -0.54    -0.61    -0.38
V(7)    -1.46    -0.30    -0.61    -0.49    -0.76
V(9)    3.01     -0.07    -0.38    -0.76    -1.03

B:
        K(1)     K(3)     K(5)     K(7)     K(9)
K(1)    -0.03    -0.03    -0.19    1.18     0.69
K(3)    -0.03    0.28     0.40     0.58     0.61
K(5)    -0.19    0.40     0.52     0.83     0.86
K(7)    1.18     0.58     0.83     1.34     0.38
K(9)    0.69     0.61     0.86     0.38     -0.59

Table 7

Native                        Homologous                            Similarity
1mba (myoglobin, 146)         1lh2 (leghemoglobin, 153)             20%, 2.8 Ang, 140 res
1mba (myoglobin, 146)         1babB (hemoglobin, chain B, 146)      17%, 2.3 Ang, 138 res
1ntp (β-trypsin, 223)         2gch (γ-chymotrypsin, 245)            45%, 1.2 Ang, 216 res
1ccr (cytochrome c, 111)      1yea (cytochrome c, 112)              53%, 1.2 Ang, 110 res
1lz1 (lysozyme, 130)          1lz5 (1lz1 + 4 res insert, 134)       99%, 0.5 Ang, 130 res
1lz1 (lysozyme, 130)          1lz6 (1lz1 + 8 res insert, 138)       99%, 0.3 Ang, 129 res

Table 8

A:
Type of site    (0)    (1)    (2)    (3)    (4)    (5)    (6)    (7)    (8)    (9)
Penalty         0.1    0.3    0.6    0.9    2.0    4.0    6.0    8.0    9.0    10.0

B:
Type of contact    (0)    (1,1)    (1,5)    (1,9)
Penalty            1.0    8.9      5.7      10.0

Table 9

A:
      .........1.........2.........3.........4.........5.........
1mba  SLSAAEADLAGKSWAPVFANKNANGLDFLVALFEKFPDSANFFADFKGKSVADIKASPK  1 - 59
1lh2  GALTESQAALVKSSWEEFNANIPKHTHRFFILVLEIAPAAKDLFSFLKGTSEVPQNNPE  1 - 59
      .........1.........2.........3.........4.........5.........

      6.........7.........8......ii...9.........0.........1......
1mba  LRDVSSRIFTRLNEFVNNAANAGKMSA--MLSQFAKEHVGFGVGSAQFENVRSMFPGFV  60 - 116
1lh2  LQAHAGKVFKLVYEAAIQLEVTGVVVTDATLKNLGSVHVSKGVADAHFPVVKEAILKTI  60 - 118
      6.........7.........8.........9.........0.........1........

      ...2..i..i.....3.........4..i...i.i
1mba  ASVAAP-PA-GADAAWTKLFGLIIDALK-AAG-A  117 - 146
1lh2  KEVVGAKWSEELNSAWTIAYDELAIVIKKEMDDAA  119 - 153
      .2.........3.........4.........5...

B:
      ........i.1.........2.........3.........4.........i....i.i.
1mba  SLSAAEAD-LAGKSWAPVFANKNANGLDFLVALFEKFPDSANFFADFKGK-SVAD-I-K  1 - 55
1lh2  GALTESQAALVKSSWEEFNANIPKHTHRFFILVLEIAPAAKDLFSFLKGTSEVPQNNPE  1 - 59
      .........1.........2.........3.........4.........5.........

      ....6.........7........i.8.i........9.........0.........1..
1mba  ASPKLRDVSSRIFTRLNEFVNNA-ANA-GKMSAMLSQFAKEHVGFGVGSAQFENVRSMF  56 - 112
1lh2  LQAHAGKVFKLVYEAAIQLEVTGVVVTDATLKNLGSVHVSKGVADAHFPVVKEAILKTI  60 - 118
      6.........7.........8.........9.........0.........1........

      .......2.i........3.........4......
1mba  PGFVASVAA-PPAGADAAWTKLFGLIIDALKAAGA  113 - 146
1lh2  KEVVGAKWSEELNSAWTIAYDELAIVIKKEMDDAA  119 - 153
      .2.........3.........4.........5...

Table 10

Potential    Recog. structs.    Nsat. ineqs. [M]    Z score
HL           915                1.84                1.14
TE           914                0.20                1.45
MJ           902                0.28                1.23
THOM2        877                0.24                1.35
BT           861                0.17                1.26
GKS          819                0.08                1.35

Table 11

name (len): 1hka (158), 1vhi (139), 2a2u (158), 1bbp (173), 2ezm (101), 1qgo (257), 1abe (305), 1byf (123), 1ytt (115), 1jwe (114), 1b79 (102), 1b7g (340), 1a7k (358), 1eug (225), 1udh (244), 1d3b (72), 1b34 (118), 1dpt (114), 1ca7 (114), 1bg8 (76), 1dj8 (79), 1qfj (226), 1vid (214), 1bkb (132), 1eif (130), 1b0n (103), 1lmb (87), 1bd9 (180), 1beh (180), 1bhe (376), 1rmg (422), 1b9k (237), 1qts (247), 1eh2 (95), 1qjt (99), 1bqv (110), 1b4f (82), 1ck2 (104), 1cn8 (104), 1bl0 (116), 1jhg (101), 1bnk (100), 1b93 (148), 1mjh (143), 1bk7 (190), 1bol (222), 1bvb (211)
DALI Z-sc (RMS): 33.0 (0.0) 4.3 (5.2) 33.8 (0.0) 11.6 (3.3) 55.3 (0.0) 46.0 (0.0) 6.4 (3.4) 29.5 (0.0) 16.4 (2.2) 26.9 (0.0) 18.7 (1.3) 61.5 (0.0) 25.1 (2.9) 43.0 (0.0) 30.8 (1.7) 18.4 (0.0) 13.4 (1.1) 24.8 (0.0) 18.7 (1.2) 19.1 (0.0) 16.2 (0.7) 42.7 (0.0) 7.1 (3.1) 25.1 (0.0) 17.4 (1.6) 19.5 (0.0) 8.0 (5.3) 38.8 (0.0) 36.0 (0.3) 70.2 (0.0) 36.9 (2.2) 39.7 (0.0) 36.1 (0.7) 24.3 (0.0) 7.6 (2.5) 20.9 (0.0) 3.2 (3.3) 26.0 (0.0) 14.3 (2.2) 24.9 (0.0) 3.4 (6.6) 24.9 (0.0) 31.4 (0.0) 6.1 (3.4) 37.2 (0.0) 19.7 (2.3) 37.3 (0.0)
THOM2 Glob Z-sc: 7.1 0.2 2.5 3.5 3.7 5.6 0.5 1.8 -0.1 2.6 0.3 8.7 -0.4 3.4 -1.0 3.5 1.9 6.2 4.0 3.4 5.1 8.1 -2.0 2.7 3.5 4.7 0.3 4.5 7.4 6.7 0.9 8.1 3.5 6.0 3.6 3.5 0.0 5.2 5.3 0.5 1.1 5.4 4.0 0.3 7.7 0.1 5.3
THOM2 Loc Z-sc: 7.1 0.3 4.0 3.0 3.2 7.6 0.4 2.8 1.4 2.3 1.3 8.8 -0.9 3.0 2.9 2.8 2.0 6.0 2.5 3.5 3.9 8.4 0.5 1.5 2.0 5.0 0.1 5.8 5.8 0.6 8.2 6.4 6.5 3.7 2.3 1.7 4.3 2.0 0.5 1.0 6.3 3.2 1.3 9.0 -1.0 4.3

Table 12

A: RAS family
Name: 121p 1kao 3rabA 1ftn 1hurA 1kevA 1mioB 1hdeA 1ksaA 1cbf
GT: 5.9 6.1 4.8 3.7 2.8 -
LT: 9.5 5.9 3.2 3.8 3.6 3.6 3.5 3.4 2.7 -
ene: 1 2 3 10 4 -
len: 166 167 169 193 180 351 458 310 366 285
LS: 74.8 46.4 17.1 21.7 9.3 5.1
DALI: 36.1/0.0/166/100 28.5/1.4/166/49 27.5/1.4/165/31 22.9/1.8/161/35 14.8/2.5/147/15 3.1/4.0/83/11 3.5/3.6/99/10* 2.5/3.6/111/9 1.8/3.9/74/9

B: Lactoglobulin family
Name: 2blg 1a3yA 1bj7 2apd 1mup 2pcfB 1lgbC 1nglA
GT: 8.2 4.7 3.0 3.0 1.7 -
LT: 10.0 3.8 3.1 2.1 2.5 3.0 -
ene: 1 5 4 2 3 -
len: 162 149 150 169 157 250 159 179
LS: 79.9 10.6 4.4 8.9 6.0 5.8
DALI: 35.1/0.0/162/100 17.6/2.4/140/17 17.8/2.4/142/18 11.8/3.0/138/15 19.2/2.2/146/16 12.0/3.5/136/14

C: Glutathione s-transferase family
Name: 2gsq 1axdA 1gsdA 1aw9 1gnwA 1clxA 1fhe 2ljrA 1ao7E
GT: 7.0 2.0 3.2 4.3 -
LT: 7.3 5.2 3.7 2.5 4.0 3.7 -
ene: 1 3 4 2 -
len: 202 209 221 216 211 347 217 244 245
LS: 87.3 3.3 16.5 4.9 5.0 11.5 5.1 4.9
DALI: 37.5/0.0/202/100 18.1/2.9/190/17 25.1/2.1/200/29 18.4/3.1/194/19 17.1/3.1/187/17 0.5/3.9/50/9 20.9/2.3/195/25 15.7/3.1/195/18 -

D: Phosphotransferase (SH3 domain) family
Name: 1aww 2semA 1fynA 4hck 1hsq 1a3k 1gbrA 1ark 1nksA
GT: 3.4 3.9 4.3 3.2 2.7 -
LT: 4.6 2.6 2.1 3.5 4.0 3.1 -
ene: 2 3 1 5 6 -
len: 67 58 62 72 71 137 74 60 194
LS: 19.3 13.8 49.8 28.4 10.8 3.2 13.9 12.3 5.5
DALI: 8.1/1.7/56/36* 10.2/1.5/56/31* 9.5/1.7/56/47* 8.1/2.0/55/40* 9.8/1.4/54/26* 7.7/2.0/57/34* 7.6/1.9/56/20* -

E: Cytochrome c family
Name: 2cxbA 1co6 1dsn 1crxA 1ndoA 451c 3cyr
GT: 6.8 -
LT: 6.0 3.4 3.1 -
ene: 1 -
len: 124 107 333 322 449 82 107
LS: 57.4 15.9 4.9 3.9 2.9
DALI: 28.1/0.0/123/100 14.4/1.7/99/36 0.9/3.3/50/8 4.9/2.1/64/19 -

F: Zinc finger family
Name: 1meyC 1jhb 1iml 2adr 2drpA
GT: 5.5 -
LT: 3.6 3.4 2.9 -
ene: 1 -
len: 87 106 76 60 66
LS: 36.8 6.5 11.4 9.9
DALI: 8.9/1.8/82/51* 1.2/7.9/40/35* 4.9/2.6/58/33*

G: Aspartyl protease family
Name: 1htrB 4cms 2jxrA 1clxA 1egzA 1pfzA 1lyaB 2pia
GT: 9.6 5.0 5.6 3.7 1.3
LT: 8.3 5.7 3.7 1.5 3.6 2.9 2.4 -
ene: 1 3 2 4 6
len: 329 323 329 347 291 380 241 321
LS: 95.0 47.5 43.9 32.2 32.0 6.0
DALI: 56.8/0.0/329/100 39.6/1.7/301/39* 37.0/2.1/307/41 31.7/2.4/298/29 9.3/2.4/83/59 0.5/4.3/49/6

H: Scorpion toxin-like family
Name: 1pnh 1acw 1mea 1bh4 1mtx 2pta 1ica 1ilmA
GT: 2.8 2.8 2.5 1.1 1.4 -
LT: 2.9 2.1 1.3 2.5 -
ene: 2 1 4 5 10 -
len: 31 29 28 30 39 35 40 61
LS: 15.6 6.0 5.9 4.8 3.3
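The double Z score test referred to in the legends of tables 11 and 12 can be sketched as follows. This is a minimal Python illustration under stated assumptions, not the LOOPP implementation: the Z score is taken here in its conventional form (mean energy of the shuffled sequences minus the energy of the original sequence, divided by the standard deviation of the shuffled energies), `energy_fn` is a hypothetical stand-in for the optimal alignment energy of a sequence into a given structure, and the default cutoffs are the high-confidence thresholds quoted in the legend of table 11 (global Z score larger than 3.0 and local Z score larger than 2.0).

```python
import random
import statistics
from typing import Callable, List, Sequence


def shuffled_energies(sequence: str,
                      energy_fn: Callable[[str], float],
                      n_shuffles: int = 100,
                      seed: int = 0) -> List[float]:
    """Energies of randomly shuffled copies of the sequence (composition preserved)."""
    rng = random.Random(seed)
    residues = list(sequence)
    energies = []
    for _ in range(n_shuffles):
        rng.shuffle(residues)
        energies.append(energy_fn("".join(residues)))
    return energies


def z_score(native_energy: float, shuffled: Sequence[float]) -> float:
    """Assumed definition: (mean shuffled energy - native energy) / std of shuffled energies."""
    spread = statistics.pstdev(shuffled)
    if spread == 0.0:
        raise ValueError("shuffled energies have zero variance")
    return (statistics.fmean(shuffled) - native_energy) / spread


def passes_double_filter(global_z: float, local_z: float,
                         global_cutoff: float = 3.0,
                         local_cutoff: float = 2.0) -> bool:
    """High-confidence match: both the global and the local Z score exceed their cutoffs."""
    return global_z > global_cutoff and local_z > local_cutoff
```

For example, the first native alignment of table 11 (global and local Z scores of 7.1) passes the filter, whereas a match with a global Z score of 2.5 does not, regardless of its local Z score.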