3D Results Summary

CASP5 Methods Abstracts
123D_server (P0476) - 68 predictions: 68 3D
123D: an Old Program for Fold Recognition
N.Alexandrov
Ceres, Inc. Malibu, CA, USA
nicka@ceres-inc.com
I used the 123D+ web site at http://123d.ncifcrf.gov/ for making predictions.
The predictions were completely automatic, without any manual intervention;
the only exceptions were made for multi-domain proteins. For such proteins the
strongest local hit was cut out from the query sequence and the rest of the
sequence was submitted again. The program 123D+ uses PSI-BLAST generated
profiles for both the query sequence and the fold library, secondary structure
compatibility, and contact capacity potentials for finding the optimal
sequence-structure alignment. The fold library was constructed from the 40%
non-redundant ASTRAL set of SCOP 1.59 domains.
Accelrys (P0210) - 24 predictions: 24 3D
Comparative Modeling Using GeneAtlas™
Dana Haley-Vicente, Velin Spassov, Tina Yeh, Ken Butenhof,
Christoph Schneider, Azat Badretdinov and Lisa Yan
Accelrys Inc., 9685 Scranton Road, San Diego, CA 92121, USA
dhv@accelrys.com
GeneAtlas™ (1) is a high-throughput pipeline for automated protein structure
prediction and function annotation. For template structure identification it uses
PSI-BLAST searches and our fold recognition program, SeqFold. To maximize
homology recognition, both direct and reverse PSI-BLAST searches are
performed and the hits are combined. Automated model building is carried out
with Modeler, and models are evaluated using Profiles-3D Verify scores.
For CASP5 targets, we first use GeneAtlas to help identify and select
potential PDB templates, and then the alignments are adjusted manually with
the aid of various alignment tools (e.g. Align123) in the Homology module in
InsightII. Align123 is based on ClustalW and augmented with a secondary
structure match term added to the alignment score. If multiple templates are
used to build a model, structure-structure alignments are explored using
InsightII’s structure alignment tools, as well as Modeler’s MALIGN3D and the
protein structure alignment program CE. Subsequently the sequence-structure
alignment is carried out with Modeler’s Align2D. Multiple models are built
with Modeler, including the new loop refinement routine based on the
optimization of statistical pair potentials. Models are checked for proper
stereochemistry and evaluated by comparing the restraint violations reported
by Modeler and by the Profiles-3D Verify scores, which measure the
compatibility of each residue in the model with its environment.
In addition, some targets were selected to test two new methods that we have
developed, ChiRotor and Looper, for side-chain and loop prediction. ChiRotor
is a fast algorithm that predicts the conformation of all or part of the
amino-acid side chains with an average RMSD of about 1 Å for the core
residues. The loop-modeling program, Looper, produces a number of
energy-minimized loop backbone conformations ranked according to force-field
energy terms. Both algorithms are a combination of a discrete search in
dihedral angle space and CHARMm energy minimization.
1. Kitson et al. (2002) Functional annotation of proteomic sequences based
on consensus of sequence and structural analysis. Briefings in
Bioinformatics 3(1), 1-13.
Advanced-ONIZUKA (P0214) - 92 predictions: 92 3D
Fold Selection and Patchwork Energy Minimization
Kentaro Onizuka
Advanced Technology Research Laboratories,
Matsushita Electric Industrial Co. Ltd.
onizuka@mrit.mei.co.jp
The new method developed for CASP5 consists of three units.
1) Fold recognition unit
This unit selects ten to a hundred conformations that have relatively good
compatibility with the target protein sequence from approximately two thousand
non-redundant protein structures collected from PDB release 100. The selected
conformations are aligned to the target protein sequence. The compatibility of a
conformation with the target sequence is evaluated as the sum of
multi-dimensional mean-force potentials between all possible pairs of residues
in that conformation, with the target sequence aligned to it.
The multi-dimensional mean-force potentials E_abk are pairwise potentials
between two residues that depend on the residue types a and b, the sequence
separation k, and the six-dimensional relative configuration, whose components
are 1) the distance between the two residues, 2) the direction of residue b
from a, and 3) the orientation of b relative to a (three Euler angles). The
fold recognition unit, however, first employs singleton potentials, which
depend on only one residue type of the pair, in order to generate the energy
profile of each conformation in the non-redundant conformation data set. Then
the target sequence is aligned to each profile using a dynamic programming
algorithm. The compatibility of each conformation with the target sequence is
evaluated by calculating the total energy, which is the sum of pairwise
potentials according to that alignment. The energy minimization unit employs
pairwise potentials plus attractive force potentials, because energy
minimization using only the net mean-force potentials [1] generates an
extended conformation rather than a compact one. The attractive potentials
adopted here are proportional to the square of the distance between residues.
2) Patchwork energy minimization unit
This unit builds a protein conformation by concatenating structure segments
cut out of the conformations selected by the fold recognition unit. The
selected conformations are aligned to the target protein sequence. The
concatenation of conformations is done as follows: 1) select two (i-th and
j-th) conformations, each aligned to the target protein sequence; 2) choose a
residue M in the sequence as the crossover point; 3) generate the new
conformation by concatenating the segment from the N-terminus (of the target
sequence) to M of the j-th conformation and the segment from M to the
C-terminus (of the target sequence) of the i-th conformation. The minimization
algorithm is partially analogous to a genetic algorithm and also to dynamic
programming. The minimization procedure first sets several segment core
residues, which should never be crossover points. The core residues are those
having locally minimal energy, where the energy of each residue is calculated
as the average energy (sum of potentials involving that residue) over all the
selected conformations. The first concatenation step takes crossover points M
between the N-terminus and the first segment core residue. For the i-th
conformation, the best combination of M and j, giving the conformation with
minimal energy, is selected. The k-th step takes M between the (k-1)-th and
k-th core residues. The last step takes M between the last core residue and
the C-terminal residue. Finally, the best conformation having minimal energy
is selected from the remaining new conformations as the result of the energy
minimization.
3) Gap caulking unit
The protein conformation built by the patchwork energy minimization unit
contains some gaps, inserted or deleted during the alignment process. This
unit tries to caulk those gaps by searching the conformations (selected by the
fold recognition unit) for a combination of two gapless conformation segments
at that region which may substitute for the conformation segments containing
gaps.
The performance of the proposed minimization algorithm is intense, although
the algorithm does not logically guarantee an optimal solution. The
most difficult problem remaining is the potentials for minimization.
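The crossover step of the patchwork energy minimization unit can be illustrated with a minimal Python sketch. It is not the author's implementation: the coordinate representation, the function names and the caller-supplied `energy` function are all assumptions made for illustration, and segments are joined without any superposition or gap handling. The abstract's mean-force potentials are not reproduced here.

```python
from typing import Callable, List, Optional, Sequence, Tuple

Point = Tuple[float, float, float]
Conformation = List[Point]  # hypothetical: one point per target residue


def crossover(conf_i: Sequence[Point], conf_j: Sequence[Point], m: int) -> Conformation:
    """Residues before crossover point m come from conf_j, the rest from conf_i."""
    if len(conf_i) != len(conf_j):
        raise ValueError("conformations must be aligned to the same target length")
    return list(conf_j[:m]) + list(conf_i[m:])


def best_crossover(
    i: int,
    confs: Sequence[Sequence[Point]],
    lo: int,
    hi: int,
    energy: Callable[[Sequence[Point]], float],
) -> Tuple[float, Optional[int], int, Conformation]:
    """For conformation i, scan partners j and crossover points lo <= m < hi.

    Returns (energy, m, j, conformation) for the lowest-energy result; if no
    crossover improves on conformation i itself, m is None and j equals i.
    The window [lo, hi) stands for the residues between two consecutive core
    residues, so core residues are never used as crossover points.
    """
    best = (energy(confs[i]), None, i, list(confs[i]))
    for j, partner in enumerate(confs):
        if j == i:
            continue
        for m in range(lo, hi):
            candidate = crossover(confs[i], partner, m)
            e = energy(candidate)
            if e < best[0]:
                best = (e, m, j, candidate)
    return best
```

Running `best_crossover` once per window between consecutive core residues, from the N-terminus to the C-terminus, mirrors the step-by-step procedure described above; the energy function is passed in as a parameter because the abstract names the potentials as the hardest remaining problem.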
1. Sippl M.J. (1990) Calculation of Conformational Ensembles from Potentials
of Mean Force: An Approach to the Knowledge-based Prediction of
Local Structure in Globular Proteins. J. Mol. Biol., 213, 859-883.
2. Onizuka K., Noguchi T., Akiyama Y., Matsuda H. (2002) Using Data
Compression for Multidimensional Distribution Analysis. Intelligent
Systems May/June 2002, 48-54.
ALAX (P0234) - 39 predictions: 39 3D
A New Sequence Alignment Method ALAX and Its
Application to Homology Modeling
Atsushi Hijikata1, Tosiyuki Noguti2 and Mitiko Go1
1 Division of Biological Science, Graduate School of Science, Nagoya
University, 2 Saga Medical School
alax@bio.nagoya-u.ac.jp
One of the important issues in homology modeling is to obtain an accurate
sequence alignment. This is particularly true in the case of low sequence
identity (less than 30%) between the target and template proteins. At low
sequence identity, one of the difficulties lies in locating the
insertions/deletions (in/del) at proper positions. To accommodate the in/del
at correct locations, we developed a new sequence alignment method for protein
pairs with weak identity in their amino acid sequences. A new gap penalty
function was introduced that is based on the solvent accessibility of the
corresponding amino acid residues of the template structure. In the new
sequence alignment method, the gap penalty function and the Position Specific
Scoring Matrix (PSSM) of PSI-BLAST [1] were combined. The alignment method we
developed is named ALAX (ALignment based on ACCessibility). We used ALAX for
template-target sequence alignment and the homology modeling software FAMS in
CASP5/CAFASP3.
In CASP5/CAFASP3, we obtained the target models through the following
three steps.
1) Template structure selection
To identify a template structure, we used five iterations of PSI-BLAST against
the non-redundant protein sequence database (nr) of the NCBI. All the
sequences having an e-value lower than 0.1 were included in the PSSM
construction. Then, the PSSM was used to search against the PDB sequence
database. The PDB sequence with the lowest e-value was selected as the
template structure.
2) Target-template sequence alignment
To align the template and the target sequence, we used ALAX with the solvent
accessibility of the residues of the template structure and the PSSM
constructed in step 1).
3) Model building
Finally, the model was built with the FAMS [2] program according to the
alignment obtained by ALAX. All the processes of homology modeling, 1) to 3),
are fully automatic.
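The idea of combining PSSM scores with a gap penalty that depends on template solvent accessibility can be sketched as a small dynamic-programming routine. The abstract does not give the form of the ALAX gap penalty function, so the linear interpolation below, the penalty values, the function names and the use of a simple (non-affine) global alignment are all assumptions for illustration only.

```python
from typing import List, Sequence


def gap_penalty(rel_accessibility: float, buried: float = 8.0, exposed: float = 2.0) -> float:
    """Hypothetical penalty: costly at buried positions, cheap at exposed ones.

    rel_accessibility is a relative solvent accessibility clamped to [0, 1].
    """
    a = min(max(rel_accessibility, 0.0), 1.0)
    return buried + (exposed - buried) * a


def alignment_score(match: Sequence[Sequence[float]], accessibility: Sequence[float]) -> float:
    """Global alignment score with position-specific gap penalties.

    match[i][j] is the (PSSM-derived) score for pairing target position i
    with template residue j; accessibility[j] belongs to template residue j.
    """
    n_target, n_template = len(match), len(accessibility)
    if n_template == 0:
        raise ValueError("template must contain at least one residue")
    gaps = [gap_penalty(a) for a in accessibility]
    d: List[List[float]] = [[0.0] * (n_template + 1) for _ in range(n_target + 1)]
    for j in range(1, n_template + 1):
        d[0][j] = d[0][j - 1] - gaps[j - 1]          # template residue j unaligned
    for i in range(1, n_target + 1):
        d[i][0] = d[i - 1][0] - gaps[0]              # insertion before the template
        for j in range(1, n_template + 1):
            d[i][j] = max(
                d[i - 1][j - 1] + match[i - 1][j - 1],  # pair target i with template j
                d[i - 1][j] - gaps[j - 1],              # insertion next to template j
                d[i][j - 1] - gaps[j - 1],              # deletion of template j
            )
    return d[n_target][n_template]
```

With these toy settings an in/del placed at an exposed template position costs less than one placed at a buried position, which is the behaviour the accessibility-based gap penalty is meant to encourage.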
1. Altschul S.F. et al. (1997) Gapped BLAST and PSI-BLAST: a new
generation of protein database search programs. Nucleic Acids Res. 25
(17), 3389-3402.
2. Ogata K. and Umeyama H. (2000) An automatic homology modeling
method consisting of database searches and simulated annealing. J. Mol.
Graph. Model. 18 (3), 258-272, 305-306.
Aligners (P0064) - 31 predictions: 31 3D
Fold Recognition Using Only Boilerplate Methods of Database
Search and Multiple Sequence Alignment
Arcady Mushegian
Stowers Institute for Medical Research
arm@stowers-institute.org
I believe that most if not all approaches for predicting protein structure from
sequence form a continuum of methods, at the core of which lies probabilistic
modeling of evolutionarily related sequence families. (Ab initio methods may
be an exception, but they used to be practical mostly for short peptides). Thus,
there is no “threading” really distinct from “fold recognition” really distinct
from “homology modeling” – the difference is mainly in the atomic detail of
the resulting model.
In order to falsify, and thereby scientifically test, the above statement, one has
to demonstrate that various complementary physico-chemical approaches are 1)
not reducible to probabilistic modeling of protein sequence families and 2)
result in a statistically significant improvement over the methods that use
alignment information alone.
In order to provide a benchmark against which the level of improvement can be
scored, I applied the “no-new-methods” approach for structure prediction of
CASP5 targets. At the first step, I removed the targets that had a statistically
significant match (arbitrary cutoff E ≤ 10^-4) at the first iteration of the
PSI-BLAST program [1] to a sequence with a known (PDB) structure. These are
straight homology modeling targets, where the real issue is not fold recognition
but the RMSD of the model. I know nothing about methods of reducing RMSD.
I also left out several very short peptides. The result is 37 targets where fold
recognition, i.e., identification of and alignment to an appropriate template, is a
legitimate yet non-trivial task.
The main database search program was PSI-BLAST (the cutoff for inclusion into
a profile was set at 0.05 and composition-based statistics were used when
helpful). The program was run to convergence, and the homologs were collected
and used in new rounds of similarity searches. This is an important step
because none of the existing similarity search methods is assured to recover
all family members in one, even iterative, search [2]. If a template was not
discovered, the RPS-BLAST program [3] was used and proved helpful in two
cases. Several HMM-based applications were also employed but did not give any
gain in template identification.
Sequences of multiple family members, including target, template, and several
homologs with different degrees of similarity to both, were aligned using
MACAW [4] and T-COFFEE [5], then converted to the AL format (I thank
Ognen Duzlevski for giving me a converter program). The only manual check
was to assure that the alignment makes structural sense, i.e. that the major
elements of secondary structure are aligned, and their connectivity is possible
given the distances between the aligned elements in each structure. Loops were
not modeled if they could not be aligned on the basis of sequence similarity.
I submitted 28 models for 28 targets. The assessors are invited to see whether
the results are, on average, comparable with the ones achieved by more
sophisticated approaches.
1. Altschul S.F. et al. (1997) Gapped BLAST and PSI-BLAST: a new
generation of protein database search programs. Nucleic Acids Res. 25
(17), 3389-3402.
2. Aravind L., Koonin E.V. (1999) Gleaning non-trivial structural, functional
and evolutionary information about proteins by iterative database searches.
J Mol Biol. 287(5):1023-1040.
3. Schaffer A.A. et al. (1999). IMPALA: matching a protein sequence against
a collection of PSI-BLAST-constructed position-specific score matrices.
Bioinformatics. 15:1000-1011.
4. Schuler G.D. et al. (1991). A workbench for multiple alignment
construction and analysis. Proteins 9: 180-190.
5. Notredame C. et al. (2000). T-Coffee: A novel method for fast and accurate
multiple sequence alignment. J Mol Biol. 302: 205-217.
arby-scai (P0183) - 68 predictions: 68 3D
The Arby Automated Structure Prediction Server
Ingolf Sommer1, Niklas von Öhsen2
1 – Max-Planck-Institute for Informatics,
2 – Fraunhofer Institute for Scientific Computing and Algorithms
sommer@mpi-sb.mpg.de
Our fully automated protein structure prediction server Arby combines the
results of several fold recognition methods to find suitable templates in a
database of structural representatives of protein domains.
The method starts by constructing a set of subsequences from the query
sequence, each subsequence representing a hypothesis for a possible protein
domain. This is done by scanning against the InterPro database and using hits
as domain hypotheses [1]. Additional hypotheses are constructed using a
secondary structure prediction from PSIPRED [2]. Segments of predicted loops
are used as potential domain boundaries. Finally, the set of subsequences is
reduced to a reasonable size by removing subsequences that are highly similar
or short.
For each subsequence a multiple alignment is constructed by searching the NR
database using PSI-BLAST [3]. A frequency profile is calculated from this
multiple alignment using a slightly modified version of the Henikoff-Henikoff
sequence-weighting algorithm [4].
Each of the potential domains is then subjected to four different fold
recognition methods. Each method searches for an optimal structure in our
template database. The template database is a representative subset of the
SCOP domains with pairwise sequence identity lower than 40% [5, 6]. For each
of these template domains, a frequency profile was constructed as described
above for the targets. The first fold recognition method is PSI-BLAST, which is
used to search through our set of template domains (augmented by the NR
sequence database). The second one is the 123D threading program. It uses
frequency profiles on the target side and 3D structural information on the
template side [7, 8]. The third one is the JProp profile-profile alignment method
recently developed in our group [9, 10]. It compares frequency profiles on the
target side with profiles on the template side using the log average scoring
approach. The fourth method is again the JProp profile-profile alignment
program, but in this version it makes use of additional secondary structure
information on the target and template side (publication in preparation).
The quality of each of these search results is assessed using confidence
measures. For PSI-BLAST, these are readily available [11]; for the other
methods, they were developed in a recent study [12].
The target sequence is then annotated with all the produced quadruplets
(subsequence, fold recognition method, search result, confidence value).
Finally, we select a set of non-overlapping annotations along the sequence by
performing combinatorial optimization of a heuristic score based on the
confidence values. For each of these selected annotations, a separate protein
domain is predicted. The structure of this domain prediction is computed by
aligning the subsequence against the template structure using JProp.
The underlying machinery is a Java-based data flow engine, designed for
stability. Since it is general and independent of the specific pipeline (such
as the one described above), it can be used as infrastructure for other
projects as well: we developed a component framework in which all algorithms
and programs are encapsulated in small Java classes. Each of these components
specifies an algorithm to be executed along with its input parameters, the
output that it produces, and possible error conditions. The accompanying
engine provides a number of features for the components. First of all, the
input/output dependencies of components are resolved. Once all inputs for a
specific algorithm have been determined, the algorithm itself is scheduled for
execution. The components are executed in parallel on any number of CPUs, in
our case 10 CPUs of a SunFire 4800 server. A frequent problem in fully
automated systems is reliable error handling. We solve this problem by
catching potential error conditions and adaptively pruning the data-flow tree.
Additionally, persistence of the computed results is accomplished by using a
relational database, thus offering convenient and fast access to previously
computed results for identical input parameters.
The power of the structure prediction server is based on the use of modern
profile-profile algorithms for fold recognition, the quality assessment using
confidence measures, and the stable and powerful Java data flow engine. In
future work, we will use the latter technology as a basis for our bioinformatics
computing environment.
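The selection of a set of non-overlapping annotations along the sequence can be sketched as a weighted interval scheduling problem. This is only an illustration under stated assumptions: the abstract does not define the heuristic score, so the sketch simply maximizes the sum of confidence values, and the `Annotation` type and function name are hypothetical rather than part of Arby.

```python
from bisect import bisect_right
from typing import List, NamedTuple, Sequence


class Annotation(NamedTuple):
    start: int         # first residue of the subsequence (inclusive)
    end: int           # end of the subsequence (exclusive)
    method: str        # fold recognition method that produced the hit
    template: str      # search result (template domain identifier)
    confidence: float  # confidence value for this hit


def select_non_overlapping(annotations: Sequence[Annotation]) -> List[Annotation]:
    """Pick non-overlapping annotations with maximal total confidence."""
    anns = sorted(annotations, key=lambda a: a.end)
    ends = [a.end for a in anns]
    best = [0.0] * (len(anns) + 1)   # best[k]: optimum over the first k annotations
    take = [False] * len(anns)
    prev = [0] * len(anns)           # prev[k]: how many earlier annotations end by anns[k].start
    for k, a in enumerate(anns):
        prev[k] = bisect_right(ends, a.start, 0, k)
        with_a = best[prev[k]] + a.confidence
        if with_a > best[k]:
            best[k + 1] = with_a
            take[k] = True
        else:
            best[k + 1] = best[k]
    chosen: List[Annotation] = []
    k = len(anns)
    while k > 0:                     # backtrack through the decisions
        if take[k - 1]:
            chosen.append(anns[k - 1])
            k = prev[k - 1]
        else:
            k -= 1
    return chosen[::-1]
```

In this toy formulation two moderately confident hits covering different regions can outscore a single, slightly more confident hit that overlaps both, which is the kind of trade-off a combinatorial selection step has to resolve.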
Acknowledgements. In addition to the authors, the ARBY CAFASP3 Team
includes Mario Albrecht, Thomas Lengauer, Theo Mevissen, and Ralf Zimmer.
We thank Daniel Hanisch for providing contributions to the Java
implementation. Part of this research has been supported by BMBF grant no. 01
SF 9984/3 (Helmholtz Network for Bioinformatics).
1. Apweiler R. et al. (2001) The InterPro database, an integrated
documentation resource for protein families, domains and functional sites.
Nucleic Acids Res. 29 (1), 37-40.
2. Jones D.T. (1999) Protein secondary structure prediction based on
position-specific scoring matrices. J Mol Biol. 292 (2), 195-202.
3. Altschul S.F. et al. (1997) Gapped BLAST and PSI-BLAST: a new generation
of protein database search programs. Nucleic Acids Res. 25 (17), 3389-402.
4. Henikoff S. and Henikoff J.G. (1994) Position-based sequence weights. J
Mol Biol. 243 (4), 574-8.
5. Chandonia J.M., et al. (2002) ASTRAL compendium enhancements.
Nucleic Acids Res. 30 (1), 260-3.
6. Brenner S.E., Koehl P., and Levitt M. (2000) The ASTRAL compendium for
protein structure and sequence analysis. Nucleic Acids Res. 28 (1), 254-6.
7. Zien A., Zimmer R., and Lengauer T. (2000) A simple iterative approach
to parameter optimization. J Comput Biol. 7 (3-4), 483-501.
8. Alexandrov N.N., Nussinov R., and Zimmer R. (1996) Fast protein fold
recognition via sequence to structure alignment and contact capacity
potentials. Pac Symp Biocomput, 53-72.
9. Von Öhsen N., Sommer I., and Zimmer R. (2003) Profile-Profile Alignment:
A Powerful Tool For Protein Structure Prediction. In Pac Symp Biocomput.
10. Von Öhsen N. and Zimmer R. (2001) Improving profile-profile alignment
via log average scoring. Lecture Notes in Computer Science. 2149, 11-26.
11. Karlin S. and Altschul S.F. (1990) Methods for assessing the statistical
significance of molecular sequence features by using general scoring
schemes. Proc Natl Acad Sci U S A. 87 (6), 2264-8.
12. Sommer I., et al. (2002) Confidence measures for protein fold recognition.
Bioinformatics. 18 (6), 802-12.
AS2TS (P0081) - 26 predictions: 26 3D
AS2TS – A New Protein Structure Prediction Server
J. Zemla
Independence High School, Brentwood, CA, US
joanna_zemla@yahoo.com
We have attempted to predict structures of twenty-six CASP5 targets using a
preliminary version of a fully automated method AS2TS (Amino acid Sequence
to Tertiary Structure) [1].
The AS2TS server built 3D protein models using the top sequence-structure
alignment provided by PSI-BLAST [2] for a given target. Coordinates for loop
regions were assigned from a library of folds generated by the LGA program
(Local-Global Alignment) [3]. Side chains were added using the SCWRL program
[4]. Human intervention was limited to entering an amino acid sequence into the
AS2TS server and checking whether the model-building process ran to completion.
Our main goal during this round of CASP was to test the ability and
effectiveness of combining two independently working processes: a sequence
alignment method and a loop-building procedure. An analysis of the evaluation
results will help in further development of the AS2TS system.
1. Zemla A. http://protein.llnl.gov/as2ts
2. Altschul S.F., Madden T.L., Schaffer A.A., Zhang J., Zhang Z., Miller W.
& Lipman D.J. (1997). Gapped BLAST and PSI-BLAST: a new generation
of protein database search programs. Nucleic Acids Res 25(17), 3389-3402.
3. Zemla A. http://PredictionCenter.llnl.gov/local/lga/lga.html
4. Bower M., Cohen F.E. and Dunbrack R.L. Jr. (1997) Sidechain prediction
from a backbone-dependent rotamer library: A new tool for homology
modeling. J. Mol. Biol. 267, 1268-1282.
ATOME (P0464) - 318 predictions: 318 3D
Evaluation of an Automatic Pipeline, ATOME for Protein
Structure Modelling
G. Labesse, V. Catherinot, J.-L. Pons, L. Martin and D. Douguet
1 - Centre de Biochimie Structurale (CNRS), Montpellier, France
labesse@cbs.cnrs.fr
The fold compatibility between the targets and PDB entries was analyzed using
our recently developed meta-server [1]. Query sequences are sent
automatically to six distinct fold recognition or protein structure prediction
servers: 3D-PSSM [2], PDB-BLAST
(http://bioinformatics.burnhaminst.org/pdb_blast/), FUGUE [3],
GenTHREADER [4], SAM-T99 [5] and JPRED2 [6], with default parameters except
for PDB-BLAST (10 iterations). No particular treatment was applied to
multi-domain targets, as proper domain delimitation was not yet automated.
This likely led to partially incorrect alignments or to incorrect fold
recognition for a few targets.
As most “threaders” use the “frozen approximation”, each structural alignment
was further evaluated using T.I.T.O [7]. PSI-BLAST [8] on the SWISSPROT [9]
sequence database, run on the NPSA server [10], was used to search for
homologous sequences using the target sequence as a query. The homologs and
the target sequence were used to produce a multiple alignment using CLUSTALW.
This alignment was used to assess the structural alignments.
A consensus ranking was deduced for each template taking into account its
score and its ranking (both computed by the original server), the T.I.T.O score
and the level of sequence identity.
For all targets, three models were built directly using MODELLER 6.0 [11] for
the top-ranking structural alignments. Additional restraints for
MODELLER 6.0 were deduced from the template secondary structure assignment
made using P-SEA [12]. Models were evaluated using PROSA [13] and
Verify3D [14]. Side-chain modelling in the common core (as defined by the
target-template alignment) was also performed using SCWRL 2.8 [15] and
similarly evaluated, but not further refined.
For each target, all the three-dimensional models were ranked according to the
scores computed by PROSA [13] and Verify3D [14]. The top five models were
deposited for each target.
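One simple way to combine several criteria into a consensus ranking is to rank the candidates under each criterion separately and average the ranks. The sketch below shows that idea only; the abstract does not state how ATOME combines the server score, server rank, T.I.T.O score and sequence identity, so the mean-rank rule, the function name and the criterion keys are assumptions.

```python
from typing import Dict, List, Sequence


def consensus_rank(candidates: Dict[str, Dict[str, float]], criteria: Sequence[str]) -> List[str]:
    """Order candidates by their mean rank over the given criteria.

    candidates maps a template name to its value for each criterion, with
    higher values meaning better (a rank where lower is better can be passed
    as its negative). Ties in mean rank are broken alphabetically.
    """
    names = list(candidates)
    total = {name: 0.0 for name in names}
    for criterion in criteria:
        ordered = sorted(names, key=lambda n: candidates[n][criterion], reverse=True)
        for rank, name in enumerate(ordered, start=1):
            total[name] += rank
    return sorted(names, key=lambda n: (total[n] / len(criteria), n))


# Hypothetical values for three candidate templates.
templates = {
    "tmplA": {"server_score": 10.0, "tito_score": 5.0, "seq_identity": 30.0},
    "tmplB": {"server_score": 8.0, "tito_score": 7.0, "seq_identity": 25.0},
    "tmplC": {"server_score": 2.0, "tito_score": 1.0, "seq_identity": 10.0},
}
ranking = consensus_rank(templates, ["server_score", "tito_score", "seq_identity"])
```

Working with ranks rather than raw values avoids having to put scores from different servers on a common scale, at the cost of discarding how large the score differences are.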
1. Douguet D. et al. (2001) Easier threading through web-based comparisons
and cross-validations. Bioinformatics 17, 752-753.
2. Kelley L.A. et al. (2000) Enhanced Genome Annotation using Structural
Profiles in the Program 3D-PSSM. J. Mol. Biol. 299, 501-522.
3. Shi J. et al. (2001) FUGUE: sequence-structure homology recognition
using environment-specific substitution tables and structure-dependent gap
penalties. J. Mol. Biol. 310, 243-257.
4. McGuffin L.J. et al. (2000) The PSIPRED protein structure prediction
server. Bioinformatics 16, 404-405.
5. Karplus K. et al. (1998) Hidden Markov models for detecting remote
protein homologies. Bioinformatics 14, 846-856.
6. Cuff J.A. et al. (1998) Jpred: A Consensus Secondary Structure Prediction
Server. Bioinformatics 14, 892-893.
7. Labesse G. et al. (1998) A Tool for Incremental Threading Optimization
(T.I.T.O.) to help alignment and modelling of remote homologs.
Bioinformatics 14, 206-211.
8. Altschul S.F. et al. (1997) Gapped BLAST and PSI-BLAST: a new
generation of protein database search programs. Nucleic Acids Res. 25
(17), 3389-3402.
9. Bairoch A. et al. (2000) The SWISS-PROT protein sequence database and
its supplement TrEMBL in 2000. Nucleic Acids Res. 28, 45-48.
10. Combet C. et al. (2000) NPS@: network protein sequence analysis.
Trends Biochem. Sci. 25, 147-150.
11. Sali A. et al. (1993) Comparative protein modelling by satisfaction of
spatial restraints. J. Mol. Biol. 234, 779-815.
12. Labesse G. et al. (1997) P-SEA: a new efficient assignment of secondary
structure from Ca trace of proteins. CABIOS 13, 291-295.
13. Sippl M.J. (1993) Recognition of errors in three-dimensional structures of
proteins. Proteins 17, 355-362.
14. Eisenberg D. et al. (1997) VERIFY3D: assessment of protein models with
three-dimensional profiles. Methods Enzymol 277, 396-404.
15. Dunbrack R.L. et al. (1993) Backbone-dependent rotamer library for
proteins. Application to side-chain prediction. J Mol Biol. 230, 543-574.
Avbelj-Franc (P0341) - 25 predictions: 25 3D
Torsion Space Monte Carlo Simulations of Folding Using
Electrostatic Screening of Backbone and
Charged Side-Chain Interactions
F. Avbelj
National Institute of Chemistry
francl@sg3.ki.si
Three-dimensional structures of small proteins are predicted ab initio using
torsion space Monte Carlo simulations from sequence alone. Protein structures
in the data-bank are not used in this method. The method is based on the
electrostatic screening of main-chain and charged side-chain interactions.
The screening of main-chain electrostatic interactions by water solvation is
used to model the backbone conformational propensities (the electrostatic
screening model: ESM) [1-7]. The strongest support for the ESM has been
provided by the recent experimental studies, which demonstrated that an
enthalpic factor is involved in determining the preferences for α-helices and
β-strands [8-10].
The energy function in the Monte Carlo procedure contains: main-chain and
charged side-chain electrostatic interactions, electrostatic solvation free
energies of main-chain and charged side-chain groups, and hydrophobic effect.
The screening of charged side-chain electrostatics by water solvation is used to
model the interactions of charged side-chain groups. The interactions of polar
non-charged side-chain groups are ignored. The hydrophobic effect is modeled
by the long-range interactions. The main-chain and charged side-chain
electrostatic interactions are calculated using Coulomb's law with a dielectric
constant of 1. The electrostatic solvation free energies of polar main-chain and
charged side-chain groups (ESF) are calculated using the finite difference
Poisson-Boltzmann model (DelPhi) with PARSE parameter set [11]. The
electrostatic potential of the molecule is first calculated using a very large box
and large grid size. This potential then provides boundary conditions for more
accurate calculations of electrostatic potential around each residue (focusing).
Torsion space Monte Carlo simulations of small proteins are performed using
hierarchic condensation. In the first phase of simulation only the local
electrostatic energies and backbone solvation free energies of residues are
activated. After equilibration the protein molecules display native-like local
conformational propensities and dimensions characteristic for the highly
denatured proteins. The calculated NMR ³J(HNHα) coupling constants agree well
with those obtained from the COIL residues in experimental protein structures.
In this phase the β-strands are formed. In the second phase of simulation the
main-chain hydrogen bonds are included in the energy function. In this phase
α-helices and hairpins are formed. In the third phase of simulation the
long-range hydrophobic and electrostatic interactions between charged residues are
gradually included in the energy function. The electrostatic interactions
between charged residues are screened by the electrostatic solvation free
energies of charged side-chains. In this phase α-helices and β-strands gradually
condense into compact structures.
In order to improve sampling of the conformational space, a large number of
independent Monte Carlo simulations (~100) are performed. All heavy atoms
including polar hydrogens are included in simulations. Geometry of amino
acids is generated using the Discover residue library. Only torsion angles are
allowed to vary during simulations. The ω peptide bond torsion angles are fixed
to 180˚. Hard sphere repulsion is enforced by discarding conformations with
steric clashes. Pairs of atoms related by torsion angles are not checked for steric
clashes. Conformational space is sampled by varying torsion angles of proteins
using different types of moves. The Metropolis criterion is used to decide
whether to accept or reject the move. Temperature is 300 K.
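The acceptance rule used in such simulations can be illustrated with a minimal torsion-space Metropolis sketch. It is not the author's program: the energy function is supplied by the caller, energies are assumed to be in kcal/mol, only a single-angle perturbation move is shown, and the steric-clash filter and the fixed ω angles described above are omitted. All function and parameter names are illustrative.

```python
import math
import random
from typing import Callable, List, Sequence, Tuple

K_B = 0.0019872041  # Boltzmann constant in kcal/(mol*K); assumes kcal/mol energies


def metropolis_accept(delta_e: float, temperature: float = 300.0, rng=random) -> bool:
    """Metropolis criterion: always accept downhill moves, uphill with exp(-dE/kT)."""
    if delta_e <= 0.0:
        return True
    return rng.random() < math.exp(-delta_e / (K_B * temperature))


def torsion_monte_carlo(
    torsions: Sequence[float],
    energy: Callable[[Sequence[float]], float],
    n_steps: int,
    max_step_deg: float = 30.0,
    temperature: float = 300.0,
    seed: int = 0,
) -> Tuple[List[float], float]:
    """Perturb one torsion angle per step and apply the Metropolis criterion."""
    rng = random.Random(seed)
    current = list(torsions)
    e_current = energy(current)
    for _ in range(n_steps):
        k = rng.randrange(len(current))
        trial = list(current)
        moved = trial[k] + rng.uniform(-max_step_deg, max_step_deg)
        trial[k] = (moved + 180.0) % 360.0 - 180.0  # wrap into [-180, 180)
        e_trial = energy(trial)
        if metropolis_accept(e_trial - e_current, temperature, rng):
            current, e_current = trial, e_trial
    return current, e_current
```

Running many such simulations from different seeds corresponds to the independent runs used to improve sampling; a clash check would discard a trial conformation before the energy comparison.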
1. Avbelj F. and Moult J. (1995) Role of electrostatic screening in
determining protein main-chain conformational preferences. Biochemistry,
34, 755-764.
2. Avbelj F. and Fele L. (1998) Role of main-chain electrostatics,
hydrophobic effect, and side-chain conformational entropy in determining
the secondary structure of proteins. J. Mol. Biol., 279, 665-684.
3. Avbelj F. (2000) Amino acid conformational preferences and solvation of
polar backbone atoms in peptides and proteins. J. Mol. Biol., 300, 1337-61.
4. Avbelj F. and Moult J. (1995) The conformation of folding initiation sites
in proteins determined by computer simulation. Proteins: Struc., Funct.,
Genet., 23, 129-141.
5. Avbelj F. and Fele L. (1998) Prediction of the three dimensional structure
of proteins using the electrostatic screening model and hierarchic
condensation. Proteins: Struc., Funct., Genet., 31, 74-96.
6. Avbelj F. (1992) Use of a potential of mean force to analyze free energy
contributions in protein Folding. Biochemistry, 31, 6290-6297.
7. Avbelj F. and Baldwin R. L. (2002) Role of backbone solvation in
determining thermodynamic β-propensities of the amino acids. Proc. Natl.
Acad. Sci. U.S.A., 99, 1309-1313.
8. Luo P. and Baldwin R. L. (1999) Interactions between water and polar
groups of the helix backbone: An important determinant of helix
propensities. Proc. Natl. Acad. Sci. U.S.A., 96, 4930-4935.
9. Lorch M. et al. (2000) Effects of mutants on the thermodynamics of a
protein folding reaction: Implications for the mechanism of formation of
the intermediate and transition states. Biochemistry, 39, 3480-3485.
10. Thomas S. T. et al. (2001) Hydration of the peptide backbone largely
defines the thermodynamic propensity scale of residues at the C' position
of the C-capping box of α-helices. Proc. Natl. Acad. Sci. U.S.A., 98,
10670-10675.
11. Sitkoff D. et al. (1994) Accurate calculations of hydration free energies
using macroscopic solvent models, J. Phys. Chem., 98, 1978-198.
BAKER (P0002) - 377 predictions: 377 3D
Comparative Modeling Using Rosetta
D. Chivian1+, C.A. Rohl1+, C.E.M. Strauss2, P. Murphy1 and
D. Baker1
1 - University of Washington, 2 - Los Alamos National Laboratory,
+ - authors contributed equally
dabaker@u.washington.edu
Comparative modeling using Rosetta [1] comprises up to five steps: A)
detection of the best parent for each putative domain, B) sequence alignment to
that parent, C) modeling of structurally variable regions, D) optimization to
increase the physical reasonableness of the final model, and E) re-assembling
the complete chain when domains were parsed and processed individually.
(A) Homolog Detection
Queries were initially screened for simple Blast or PSI-Blast parents. Large
regions of query sequence without parent coverage were then submitted to the
Bioinfo meta-server and candidate parents from Pcons2 and Pcons3 [2] were
selected. Occasionally, parents with functions similar to that reported for the
query were also considered. Human intervention was then employed to select
the appropriate parent. Domains for which no significant matches were found
were modeled using the Rosetta de novo prediction protocol [3, and see the
de novo Rosetta abstract in this volume].
(B) Sequence Alignment
We employed a "kitchen sink" approach, called "K*SYNC", which produces
large sets of candidate alignments by varying the way in which information is
derived and used by a modified Smith-Waterman alignment algorithm. The
information used includes the similarity between PSI-Blast derived residue
substitution profiles for the query and parent, supplementing the parent residue
substitution profile with counts from its FSSP, matching predicted regular
secondary structure (PSIPRED, PHD, SAM, and/or JUFO) with three-state
collapsed DSSP assigned secondary structure, and position specific
obligateness and contiguousness as defined by the occupancy and degree of
gapping for the query and parent in the PSI-Blast multiple sequence alignment
and from the parent's FSSP multiple structural alignment.
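The alignment core that such a scheme varies can be sketched as a Smith-Waterman local alignment over a precomputed score matrix. This is a generic textbook version with a linear gap penalty, not the modified algorithm used in K*SYNC; the function name, the gap value and the idea of passing in one combined `score[i][j]` matrix (for example a weighted sum of profile similarity and secondary structure agreement) are assumptions made for illustration.

```python
def smith_waterman(score, gap=2.0):
    """Local alignment over a precomputed score[i][j] matrix (linear gaps).

    score[i][j] is the similarity of query position i and parent position j.
    Returns (best_score, pairs) with pairs as (query_index, parent_index).
    """
    n = len(score)
    m = len(score[0]) if score else 0
    H = [[0.0] * (m + 1) for _ in range(n + 1)]
    best, best_pos = 0.0, (0, 0)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            H[i][j] = max(0.0,
                          H[i - 1][j - 1] + score[i - 1][j - 1],  # match
                          H[i - 1][j] - gap,                      # gap in parent
                          H[i][j - 1] - gap)                      # gap in query
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)
    # Trace back from the best cell until the local score drops to zero.
    pairs = []
    i, j = best_pos
    while i > 0 and j > 0 and H[i][j] > 0.0:
        if H[i][j] == H[i - 1][j - 1] + score[i - 1][j - 1]:
            pairs.append((i - 1, j - 1))
            i, j = i - 1, j - 1
        elif H[i][j] == H[i - 1][j] - gap:
            i -= 1
        else:
            j -= 1
    pairs.reverse()
    return best, pairs
```

Re-running the same routine with differently weighted score matrices is one simple way to obtain a set of candidate alignments rather than a single answer.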
The ensemble of sequence alignments was converted to an ensemble of
three-dimensional template structures, and short to medium unaligned regions (< 17
residues) were modeled in the context of these templates using an abbreviated
insertion modeling procedure (see C below). Alignments containing insertions
that failed to produce conformations in agreement with the geometry of the
template stems were discarded from the ensemble. Remaining alignments were
ranked by evaluation of the structural models by several energy criteria.
Human intervention was employed to either select one of the high-ranking
alignments or to produce a new alignment by recombining the preferred
features of multiple high-ranking alignments.
(C) Insertion Modeling
Unaligned regions corresponding to gaps in the sequence alignment as well as
regions estimated likely to show significant structural divergence from the
parent structure were modeled by the Rosetta fragment assembly strategy in the
context of the fixed template. For regions < 17 residues, ~300 initial
conformations were selected from a database of known structures using
similarity of sequence, secondary structure, and stem geometry. Initial
conformations for longer regions were built up using three and nine residue
fragments. The conformations of all variable regions were then optimized using
fragment replacement and random angle perturbations. A gap closure term in
the potential in combination with conjugate gradient minimization was used to
ensure continuity of the peptide backbone. Optimization of variable regions
was accomplished by use of the standard Rosetta potential with centroid
representation of side chains, followed by optimization with explicit side
chains. All variable regions were optimized simultaneously, starting from a
random selection of initial conformations. Generally, ~1000 independent
optimizations were carried out. Variable regions were ranked independently by
energy and low energy conformations for each variable region combined into a
final model, manually ensuring that interacting variable regions were
compatible. For the purposes of evaluating alignments (see B above), variable
regions were modeled sequentially rather than simultaneously, stricter
geometry requirements were enforced in selecting initial conformations, and
the optimization step was severely truncated.
(D) Idealization and Optimization of Template Regions
To make the models more physically reasonable, most structural models were
modified to possess ideal backbone bond lengths and angles. Additionally,
residue clashes were alleviated using a combination of small backbone
perturbations and a rotamer-repacking algorithm. For most targets, models
pre- and post-optimization were submitted. For targets for which the
idealization and/or optimization resulted in significant backbone perturbation
(> ~1.5-2 Å), this step was eliminated.
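The check described above can be sketched as a deviation test between the model before and after optimization. The abstract does not say how the perturbation was measured; this sketch assumes a plain coordinate RMSD over matching backbone atoms that are already in the same frame, and the names `rmsd` and `keep_optimized` are hypothetical.

```python
import math


def rmsd(coords_a, coords_b):
    """Root-mean-square deviation between two equally long coordinate lists."""
    if not coords_a or len(coords_a) != len(coords_b):
        raise ValueError("coordinate lists must be non-empty and equally long")
    total = sum((a - b) ** 2
                for p, q in zip(coords_a, coords_b)
                for a, b in zip(p, q))
    return math.sqrt(total / len(coords_a))


def keep_optimized(pre, post, cutoff=1.5):
    """Keep the optimized model only if the backbone moved by at most cutoff (in Å)."""
    return rmsd(pre, post) <= cutoff
```

The default cutoff of 1.5 Å is the lower end of the range quoted in the text; a caller could equally pass 2.0.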
(E) Domain Assembly
Domain scope models were combined into a contiguous chain by fragment
insertion in the putative linker region, and evaluated by a coarse energy
function. Finally, side-chains were repacked [4] in either the single or the
multi-domain context.
1. Simons K.T. et al. (1997) Assembly of protein tertiary structures from
fragments with similar local sequences using simulated annealing and
Bayesian scoring functions. J Mol. Biol. 268 (1), 209-25.
2. Lundstrom J. et al. (2001) Pcons: a neural-network-based consensus
predictor that improves fold recognition. Protein Sci. 10 (11), 2354-62.
3. Bonneau R. et al. (2001) Rosetta in CASP4: progress in ab initio protein
structure prediction. Proteins 45 (S5), 119-26.
4. Kuhlman B. and Baker D. (2000) Native Protein sequences are close to
optimal for their structures. Proc. Natl. Acad. Sci. USA 97 (19), 10383-8.
BAKER (P0002) - 377 predictions: 377 3D
De Novo Structure Predictions Using Rosetta
P. Bradley1+, J. Meiler1+, K.M.S. Misura1+, W.R. Schief1+,
J. Schonbrun1+, W.J. Wedemeyer1+, O. Schueler-Furman1,
M Kuhn1, P. Murphy1, C.E.M. Strauss2, and D. Baker1
1 - University of Washington, 2 - Los Alamos National Laboratory,
+ - authors contributed equally
dabaker@u.washington.edu
De novo structure predictions for CASP5 were made using Rosetta. The basic
method has been described previously [1]. One of the fundamental
assumptions underlying Rosetta is that the conformations adopted by short (3-9
residue) segments of the target polypeptide chain are similar to those adopted
by related sequences in fragments of experimentally determined protein
structures. Fragment libraries for each three and nine residue segment of the
target polypeptide chain were extracted from the Protein Data Bank using a
profile-profile comparison method as described previously [2]. The
conformational space defined by these fragments is then searched using a
Monte Carlo procedure with an energy function favoring compact structures,
buried hydrophobic residues, and paired beta strands. 10,000 - 400,000
independent simulations were carried out for the target sequence and
homologous sequences (when available). Longer sequences were often parsed;
smaller segments were folded and served as nuclei for folding the remainder of
the chain.
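The fragment-based search described above can be illustrated with a toy sketch. It is a simplified Metropolis loop at fixed temperature, not the Rosetta protocol itself; the data layout (one `(phi, psi)` tuple per residue, a dictionary mapping a start position to candidate fragments), the caller-supplied `energy` function and all names are assumptions for illustration.

```python
import math
import random


def fragment_move(torsions, fragments, rng):
    """Return a copy of torsions with one library fragment inserted."""
    start = rng.choice(sorted(fragments))
    frag = rng.choice(fragments[start])
    if start + len(frag) > len(torsions):
        raise ValueError("fragment runs past the end of the chain")
    new = list(torsions)
    new[start:start + len(frag)] = frag
    return new


def fragment_monte_carlo(torsions, fragments, energy, steps=1000, kT=1.0, seed=0):
    """Search torsion space by fragment insertion with Metropolis acceptance."""
    rng = random.Random(seed)
    current, e_cur = list(torsions), energy(torsions)
    best, e_best = current, e_cur
    for _ in range(steps):
        trial = fragment_move(current, fragments, rng)
        e_try = energy(trial)
        # Accept downhill moves always, uphill moves with Boltzmann probability.
        if e_try <= e_cur or rng.random() < math.exp(-(e_try - e_cur) / kT):
            current, e_cur = trial, e_try
            if e_cur < e_best:
                best, e_best = current, e_cur
    return best, e_best
```

Running many such simulations with different seeds yields the ensemble of independent decoys that is later filtered and clustered.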
For sequences longer than 110 amino acid residues, the resulting models were
subjected to a filter which provided an even distribution of topologies generated
during the Monte Carlo search procedure, and reduced the number of models
with local contacts. The filtered models were then clustered as described in [3].
In some cases, representative decoys from each cluster were refined to improve
the hydrogen bonding of their beta sheets. For proteins with fewer than 110
residues, decoys were scored with the energy function as described above, and
the low free energy models were subjected to a Monte Carlo Minimization
procedure to relieve backbone atomic clashes. Following this, side chains were
built onto the models using Dunbrack’s backbone-dependent rotamer library
and the method described in [4], and a similar Monte Carlo Minimization
procedure was then used to minimize an all-atom energy function dominated by
Lennard-Jones interactions, an orientation dependent hydrogen bonding term,
and an implicit solvation model. Side chain conformations were periodically
optimized using a full combinatorial optimization procedure. Models with the
lowest free energy were selected.
Recent advances in the Rosetta method have been in the areas of decoy
discrimination and improvement of the energy function for small proteins; and
formation of beta sheets, generation of complex topologies and non-local
contacts, and development of a protocol to identify decoys which have
successfully incorporated these features for larger proteins. Attempts were
made to improve secondary structure packing in all decoys. We have also
attempted to compensate for incorrect secondary structure predictions in any
given region of the polypeptide chain, and to increase the conformational space
searched in regions where secondary structure could not be assigned with
confidence. A new method, JUFO (manuscript in preparation), has been
included in efforts to improve the accuracy of secondary structure prediction
and aid generation of a more robust fragment library for a given sequence.
1. Bonneau R. et al. (2001) Rosetta in CASP4: progress in ab initio protein
structure prediction. Proteins 45 (S5), 119-26.
2. Simons K.T. et al. (1997) Assembly of protein tertiary structures from
fragments with similar local sequences using simulated annealing and
Bayesian scoring functions. J Mol. Biol. 268 (1), 209-25.
3. Shortle D., Simons K.T. and Baker D. (1998) Clustering of low energy
conformations near the native structures of small proteins. Proc. Natl.
Acad. Sci. USA 95 (19), 11158-62.
4. Kuhlman B. and Baker D. (2000) Native Protein sequences are close to
optimal for their structures. Proc. Natl. Acad. Sci. USA 97 (19), 10383-8.
BAKER-ROBETTA (P0029) - 199 predictions: 199 3D
Automated Method for Full Chain Structure Prediction
Using Rosetta
D. Chivian, D.E. Kim, C.A. Rohl, L. Malmstrom, J. Meiler,
T. Robertson, and D. Baker
University of Washington
dabaker@u.washington.edu
We have automated our basic comparative modeling and de novo protocols in
an effort to determine the ability of the Rosetta [1] method to produce full chain
models without human intervention. The server, called “Robetta”, provides de
novo, comparative, or mixed models in which the appropriate method is
selected for each putative domain. Additionally, the server provides a
secondary structure prediction from the JUFO-3D [2] method.
Regrettably, the server had several shortcomings during the CAFASP-3
experiment. Much of the code was implemented just prior to the experiment
and not properly tested. It was not entirely free of logical errors, and some of
the models are probably quite poor for this reason. Additionally, the automated
methods that were implemented for CAFASP-3 employed reduced protocols
either in an effort to meet the time demands required of a server method, or
because they could not be completed in time for the experiment. In the interest
of brevity, this abstract will only discuss differences from the full de novo and
comparative modeling protocols (for full protocols, please see the “baker
group” methods abstracts in this volume and [3]).
(A) Homolog Detection and Domain Parsing
We developed a method, called “Ginzu”, to determine domains in the full chain
of the query and assign them for de novo or comparative modeling. It consisted
of sequential processing of the sequence with Blast, PSI-Blast, and Pcons2 [4]
in order to identify regions of the query with parent PDB coverage. A Blast
e-value of at most 0.001 or a Pcons2 confidence value of at least 1.5 was
considered sufficient to justify comparative modeling. A single parent for each
region of coverage was then selected based on confidence and length of
coverage. Next, putative domain boundaries were determined for both
comparative modeling and de novo regions of the chain. PSI-Blast detected
homologous sequences were clustered by region of coverage and assigned to
the query as non-overlapping domain regions in order of cluster size. Cut
points between domains were assigned at positions of reduced occupancy in the
PSI-Blast MSA and in loops strongly predicted by PSIPRED [5]. Domain lengths
for de novo regions were capped at ~200 residues, which is probably shorter
than many native domains, in recognition of the current difficulty Rosetta’s
de novo protocol has in producing good quality models for large domains when
generating a small decoy ensemble.
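The method assignment for a region can be sketched as a simple threshold test. The function and parameter names are hypothetical, and the sketch reads the Blast threshold as an upper bound on the e-value (smaller is more significant) and the Pcons2 threshold as a lower bound on the confidence score.

```python
def assign_method(blast_evalue=None, pcons_score=None,
                  evalue_cutoff=1e-3, pcons_cutoff=1.5):
    """Choose 'comparative' or 'de_novo' modeling for one query region."""
    # A sufficiently significant Blast / PSI-Blast hit justifies comparative modeling.
    if blast_evalue is not None and blast_evalue <= evalue_cutoff:
        return "comparative"
    # Otherwise fall back on the fold-recognition consensus score.
    if pcons_score is not None and pcons_score >= pcons_cutoff:
        return "comparative"
    # No confident parent was found for this region.
    return "de_novo"
```

For example, `assign_method(blast_evalue=0.5, pcons_score=2.0)` selects comparative modeling on the strength of the Pcons2 score alone.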
(B) Modeling
Domains were modeled either by the de novo or comparative modeling Rosetta
protocol. Reductions to the full protocols included generating a smaller decoy
ensemble for de novo domains, producing only one default weighted K*SYNC
alignment to the most confident parent for comparative modeling domains, and
not rebuilding short and medium loops for unaligned regions in comparative
models with our more rigorous protocol. Lastly, the final full chain model was
produced trivially by spacing the coordinates of each domain model by 100
Angstroms.
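The trivial assembly step can be sketched as a translation of each domain model. The abstract only states that domains were spaced by 100 Angstroms; placing each domain's centroid on the x-axis at multiples of the spacing is an assumption of this sketch, as is the function name.

```python
def space_domains(domains, spacing=100.0):
    """Concatenate domain models, putting domain k's centroid at (k * spacing, 0, 0).

    domains is a list of domains, each a list of (x, y, z) atom coordinates.
    """
    chain = []
    for k, atoms in enumerate(domains):
        if not atoms:
            continue  # nothing to place for an empty domain
        n = len(atoms)
        cx = sum(a[0] for a in atoms) / n
        cy = sum(a[1] for a in atoms) / n
        cz = sum(a[2] for a in atoms) / n
        # Shift the domain so its centroid sits at the k-th grid point.
        chain.extend((x - cx + k * spacing, y - cy, z - cz) for x, y, z in atoms)
    return chain
```

The result contains every residue of the full chain but makes no attempt to model the relative orientation of the domains.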
(C) Secondary Structure Prediction
JUFO-3D is a version of the JUFO neural-net secondary structure predictor that
uses Rosetta de novo decoys or comparative models in addition to PSI-Blast
multiple sequence information and an amino acid property profile to produce
three-state predictions.
1. Simons K.T. et al. (1997) Assembly of protein tertiary structures from
fragments with similar local sequences using simulated annealing and
Bayesian scoring functions. J Mol. Biol. 268 (1), 209-25.
2. Meiler J. and Baker D. (manuscript in preparation)
3. Bonneau R. et al. (2001) Rosetta in CASP4: progress in ab initio protein
structure prediction. Proteins 45 (S5), 119-26.
4. Lundstrom J. et al. (2001) Pcons: a neural-network-based consensus
predictor that improves fold recognition. Protein Sci. 10 (11), 2354-62.
5. Jones D.T. (1999) Protein secondary structure prediction based on
position-specific scoring matrices. J. Mol. Biol. 292 (2), 195-202.
Baldi (P0021) - 61 predictions: 61 3D
Baldi-CONpro (P0022) - 62 predictions: 62 RR
Baldi-SSpro (P0023) - 63 predictions: 63 SS
CMap23Dpro (P0253) - 1 prediction: 1 3D
CMapPro (P0255) - 0 predictions
SSpro2 (P0254) - 65 predictions: 65 SS
Automated ab initio Prediction of Protein Structure Through
Contact Maps by Recurrent Neural Networks
Gianluca Pollastri and Pierre Baldi
University of California, Irvine
gpollast@ics.uci.edu
The strategy we implemented to predict protein structure splits the problem into
three stages, as described in [1]. The first stage corresponds to modules that
predict structural features including secondary structure and relative solvent
accessibility. The second stage corresponds to modules that predict the contact
map of the protein at the amino acid level, using the primary sequence and the
structural features. The final stage is the reconstruction of 3D coordinates of Cα
atoms from the predicted contact map and secondary structure. All the steps are
entirely automated and performed without any human intervention.
The methods we use for secondary structure and relative solvent accessibility
prediction have been described in [2,3]. These methods try to overcome the
limitations of simple feed-forward networks and consist of BRNNs
(Bidirectional Recurrent Neural Networks) with the capability of capturing at
least partial long-ranged information without overfitting. The recurrent neural
networks are given as input PSI-BLAST profiles derived as described in [2].
Both in the case of secondary structure and relative solvent accessibility an
ensemble of networks is used for the final prediction. Secondary structure
predictions were submitted to CASP in two versions (SSpro2 and Baldi-SSpro),
trained on different data sets. Versions of the methods are also freely available
as web servers (SSpro, ACCpro) at the address:
http://promoter.ics.uci.edu/BRNN-PRED/
In the second step, we go from the primary sequence and the structural features
to the map of contacts between amino acids. Training a large neural network to
directly predict 3D coordinates from primary sequence information is in fact
likely to fail because the problem is highly degenerate. Translations and
rotations leave the structure invariant but greatly change the 3D coordinates. In
contrast, contact maps provide a topological representation of the structure that
is invariant under rotation and translation. Furthermore, contact maps typically
contain enough information to reconstruct the full structure even in the presence of
noise [4]. A previous attempt to predict protein contact maps is described in [5].
Our current approach to the problem rests on a generalization of the graphical
model underlying BRNNs, which process one-dimensional objects. The
generalization of this architecture to two-dimensional objects, such as contact
maps, is described in [6]. In its basic version the model consists of nodes
regularly arranged in 6 planes: one input plane, one output plane, and 4 hidden
planes. This graphical model is implemented with five feed-forward neural
networks, four representing transitions in the hidden planes given the input, one
representing the input-output transformation. The main advantages of this
model are that it chooses automatically an optimal context to base its decision
on, and that it can capture at least partial long-ranged information without
overfitting. We submitted contact map predictions to CASP at 8 and 12
Angstrom (respectively as CMapPro and Baldi-CONpro). The 12 Angstrom
predictor is trained on a larger data set and proves to be more reliable in our
tests, especially on proteins of length greater than 100 amino acids.
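To make the target of the second stage concrete, the following sketch computes an exact contact map from known coordinates at a chosen distance cutoff. It shows the representation, not the predictor. Which atom represents each residue is not stated above; the sketch assumes one point per residue, and the function name and the zeroed diagonal are illustrative choices.

```python
import math


def contact_map(coords, cutoff=8.0):
    """Binary contact map: entry (i, j) is 1 if residues i and j are within cutoff Å."""
    n = len(coords)
    return [[1 if i != j and math.dist(coords[i], coords[j]) < cutoff else 0
             for j in range(n)]
            for i in range(n)]
```

Because only pairwise distances enter, translating or rotating the coordinates leaves the map unchanged, which is the invariance property the text relies on. Raising the cutoff from 8 to 12 Å adds contacts between more distant residue pairs.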
In contrast with the first two stages that heavily rely on machine learning
methods, the last reconstruction step is addressed using distance geometry and
optimization techniques without learning. Our approach partly follows [4] but
with a number of significant modifications due to the fact that, in our case,
predicted maps differ from exact maps, as well as from random perturbations
of exact maps by uniform additive noise. In particular, in order to deal with the
specific properties of predicted contact maps we use: (1) semi-random moves
of variable length (combining a random vector and an attraction vector
directed towards putative contacts); (2) a bond length term in the energy
function to deal with unphysical bond lengths introduced by the moves; and (3)
a two-phase search with a first rough phase comprising large steps where only
the predicted contact map contributes to the energy, and a second refinement
phase comprising smaller steps that take into account the effects of chirality,
bond length and amino acid hard-core repulsion forces. The submitted 3D
structures are predicted using two different versions of the reconstruction
algorithm: the first version (CMap23Dpro) uses a direct in-house
implementation of [4], the other (Baldi) uses the modified version described
above.
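The semi-random move in item (1) can be sketched as below. The abstract does not specify how the random and attraction vectors are mixed, so the linear blend controlled by `weight`, the use of the centroid of predicted contact partners as the attraction target, and all names are assumptions made for illustration.

```python
import math
import random


def _unit(v):
    """Normalize a 3-vector; a zero vector is returned unchanged."""
    norm = math.sqrt(sum(c * c for c in v))
    return tuple(c / norm for c in v) if norm > 0.0 else (0.0, 0.0, 0.0)


def semi_random_move(coords, i, partners, step=1.0, weight=0.5, rng=random):
    """Propose a new position for residue i.

    partners lists the residues predicted to be in contact with i.
    weight = 0 gives a purely random move, weight = 1 a purely attractive one.
    """
    rand = _unit(tuple(rng.gauss(0.0, 1.0) for _ in range(3)))
    if partners:
        centroid = tuple(sum(coords[j][k] for j in partners) / len(partners)
                         for k in range(3))
        attract = _unit(tuple(centroid[k] - coords[i][k] for k in range(3)))
    else:
        attract = (0.0, 0.0, 0.0)
    direction = _unit(tuple((1.0 - weight) * rand[k] + weight * attract[k]
                            for k in range(3)))
    return tuple(coords[i][k] + step * direction[k] for k in range(3))
```

A two-phase search in the spirit of item (3) could call this with a large `step` first and a smaller `step` during refinement.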
1. Baldi P. and Pollastri G. (2002) Machine Learning Structural and
Functional Proteomics, IEEE Intelligent Systems (Intelligent Systems in
Biology II), March/April.
2. Pollastri G., Przybylski D., Rost B., Baldi P. (2002) Improving the
Prediction of Protein Secondary Structure in Three and Eight Classes
Using Recurrent Neural Networks and Profiles, Proteins, 47, 228-235.
3. Pollastri G., Baldi P., Fariselli P., Casadio R. (2002) Prediction of
Coordination Number and Relative Solvent Accessibility in Proteins,
Proteins, 47, 142-153.
4. Vendruscolo M., Domany E. (2000) Protein folding using contact maps.
Vitam Horm. 58, 171-212.
5. Fariselli P, Olmea O, Valencia A, Casadio R. (2001) Prediction of contact
maps with neural networks and correlated mutations, Protein Eng.
Nov;14(11), 835-43.
6. Pollastri G, Baldi P. (2002) Prediction of contact maps by GIOHMMs and
recurrent neural networks using lateral propagation from all four cardinal
corners, Bioinformatics. Jul;18 Suppl 1, S62-S70.
Bass-Michael (P0384) - 51 predictions: 51 3D
A Threading Approach to Structure Prediction
M. Bass and R. Luethy
Computational Biology, Amgen Inc.
mbass@amgen.com
The threading approach used here was employed to test the accuracy of a
threading method when applied to a variety of test sequences. In target
sequences that are similar to a known structure, if threading produced a
different alignment, the thread alignment was used to test if the method can
improve the residue shift error in alignments.
The threading method uses residue-based statistical potentials. The potentials
were calculated as log-odds of the interaction. Three potentials were used.
Each potential was given equal weight. The surface area potential was
evaluated by dividing the surface exposure into ten equal bins. The pairwise
interaction potential was calculated by measuring the closest atom-atom contact
between pairs of amino acids such that the pair of amino acids was at least five
amino acids apart. Only interactions between 2.5Å and 12.5Å were counted.
The backbone dihedral potential was calculated for each amino acid. The
dihedral angles were divided into 20 equal bins and separated according to
amino acid. The standard statistical potentials were calculated against a subset
of the Protein Data Bank (July 2002 release) such that no two proteins share
more than 35% sequence identity. This set was reduced by removing any
structures that fail a self-thread test. That is, a sequence must be able to find its
structure with the threading algorithm. This produced a unique subset of the
Protein Data Bank containing 2399 structures. A similar subset of the Protein
Data Bank was used for the query database. This subset contained proteins that
share no more than 45% sequence identity (2875 structures). The algorithm
produces gapped alignments without end penalties using an adaptation of the
Needleman-Wunsch algorithm [1]. The gap creation penalty is 3.5 and the gap
extension penalty is 0.7. The Z-score was calculated for all of the alignments
and alignments producing a Z-score in excess of 5.0 were considered.
WU-Blast [2] was also run for each target against the structural database to provide
a comparison alignment.
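The Z-score filter can be sketched as follows. The abstract does not state the reference distribution; this sketch assumes the Z-score of a template is its raw threading score standardized against the scores of all templates for the same target, with higher scores being better. The function name is hypothetical.

```python
import statistics


def significant_hits(raw_scores, cutoff=5.0):
    """Return template names whose Z-score exceeds cutoff.

    raw_scores maps template name -> raw alignment score (higher is better).
    """
    values = list(raw_scores.values())
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    if sd == 0.0:
        return []  # all templates score identically, nothing stands out
    return [name for name, s in raw_scores.items() if (s - mean) / sd > cutoff]
```

With a library of a few thousand structures, a hit needs to lie far out in the tail of the score distribution to pass a cutoff of 5.0.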
The sequence alignment was converted into a three-dimensional structure by
the following method. The alignments were converted to a C-alpha trace based
on the coordinates of the template structure. Residues around any gaps in the
alignment were allowed to vary according to the method of Luethy [3]. After
the structure optimization, all-atom coordinates were constructed in the
following way: first all coordinates from the PDB fragments were copied, then
missing backbone atoms were inserted by looking up the closest 5 residue
backbone fragment in PDB, finally missing side-chain atoms were copied from
the closest 5 residue fragment from PDB with the same residue in the middle.
The structure was then minimized with TINKER [4] using a steepest descent
method with fixed C-alpha atoms.
1. Needleman S.B. and Wunsch C.D. (1970) A General Method Applicable
to the Search for Similarities in the Amino Acid Sequence of Two
Proteins. J. Mol. Biol., 48, 443-453.
2. Gish W. (1996-2002) http://blast.wustl.edu
3. Luethy R. (2002) Unified Prediction Approach for Comparative Modeling
and ab initio Predictions. CASP5 Abstract.
4. Ponder J.W. and Richards F.M. (1987) An Efficient Newton-like Method
for Molecular Mechanics Energy Minimization of Large Molecules. J.
Comput. Chem., 8, 1016-1024 (http://dasher.wustl.edu/tinker/)
Bates-Paul (P0096) - 72 predictions: 72 3D
Comparative Modelling By In Silico Recombination of
Templates, Alignments and Models
Bruno Contreras-Moreira, Paul W. Fitzjohn, Marc Offman,
Graham R. Smith and Paul A. Bates
Biomolecular Modelling Laboratory
Cancer Research UK - London Research Institute
paul.bates@cancer.org.uk
After the CASP4 assessment it was concluded that template selection and
sequence alignment remain the main problems awaiting solution in the field of
comparative modelling [1]. Models were rarely found to be closer to the
experimental structures than the optimal template and often manual
intervention only marginally improved their quality. Similar problems were
found in the fold recognition category [2,4], suggesting that the same approach
may be applied in the search for possible solutions in both fields. During
CASP5 our group has tested a novel procedure to tackle these problems. This
new method was used to generate models for all 67 targets, with roughly half of
them classified as fold recognition targets by the CAFASP3 meta-server
(www.cs.bgu.ac.il/~dfischer/CAFASP3).
This procedure is named in silico protein recombination, as it is a
computational implementation of genetic recombination, a well known
mechanism for generating population variability, but at the protein level. For
each CASP5 target a population of models was generated from a variety of
templates and sequence alignments. Care was taken to assure that models had
similar length and were complete, adding missing loops when necessary and
smoothing their phi/psi geometry to permit later energy calculations and
minimizations. The algorithm can be outlined as:
initial population of models
↓
(1) grow population: r recombination + (1-r) mutation
↓
(2) select best proportion according to fitness
↓
converged? stop : otherwise back to (1)
This is a standard genetic algorithm with two genetic operators (recombination
and mutation) and a fitness function acting as an artificial selection agent. We
will now briefly describe each step in the protocol.
Initial population of models. Initially, our server Domain Fishing [3]
(www.bmm.icnet.uk/servers/3djigsaw/dom_fish) was used to define protein
domains within each target
sequence and to find suitable modelling templates. Resulting alignments were
inspected and corrected if suspected to be incorrect. If reasonable alternative
alignments could be found they too were added to the pool. When possible,
only alignments with bit-scores (average pssm-logodds+secondary structure
agreement/residue) around 2 were selected. In several cases annotations from
the templates or their corresponding PFAM families were used to check the
correctness of the alignment in active/binding sites. Usually several models
were built using the same template changing parts in the alignment. Models
from these alignments were built using our server 3D-JIGSAW [4]
(www.bmm.icnet.uk/servers/3djigsaw). Additional models were obtained from
the CAFASP3 server after inspection of the alignments to gain extra variability
in sequence alignments, templates used and exposed loops. These models were
taken from different sources, including
FAMS (physchem.pharm.kitasatou.ac.jp/FAMS),
Pmodeller (www.sbc.su.se/~arne/pcons) and
EsyPred3D (www.fundp.ac.be/urbm/bioinfo/esypred).
Models were inspected and missing parts, typically loops, added using in-house
software before going to the next step. In essence, this software explores phi/psi
space to allow a peptide (the missing loop) to connect a gap in a protein fold.
1. Growing the population by recombination and mutation. The initial
population was grown by randomly selecting pairs of protein models and
applying one of the two possible operators. In the case of recombination, the
models were superimposed based on their sequence alignment and a crossover
point drawn. Crossover was not permitted inside secondary structure elements.
The resulting recombinant model inherits the N-terminus from one parent and
the C-terminus from the other. In mutation events (occurring with frequency
1-r, where r is the recombination probability) a new protein model was obtained
by simply averaging its parents' coordinates after superimposition. In many
cases this process produced distorted side-chain conformations.
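The two operators can be sketched on toy data as below. The sketch assumes that both parents are already superimposed and have one coordinate tuple per residue, and that secondary structure is given as a string with `H` (helix), `E` (strand) and any other letter for loop; these conventions and the function names are assumptions for illustration.

```python
import random


def recombine(parent_a, parent_b, ss, rng=random):
    """Single-point crossover: N-terminal part from parent_a, C-terminal from parent_b.

    A cut between residues k-1 and k is forbidden when both lie in the same
    helix or strand, so crossover never falls inside a secondary structure element.
    """
    n = len(parent_a)
    allowed = [k for k in range(1, n)
               if not (ss[k - 1] == ss[k] and ss[k] in "HE")]
    if not allowed:
        raise ValueError("no crossover point outside secondary structure")
    k = rng.choice(allowed)
    return parent_a[:k] + parent_b[k:]


def mutate(parent_a, parent_b):
    """'Mutation' as described in the text: average the parents' coordinates."""
    return [tuple((x + y) / 2.0 for x, y in zip(p, q))
            for p, q in zip(parent_a, parent_b)]
```

Averaging coordinates does not preserve bond geometry, which is consistent with the distorted side chains noted in the text and the later need for energy minimization.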
2. Selecting the best proportion. Fitness function. The whole idea of the
algorithm is that it should be possible to obtain optimized mosaic models by
shuffling them in a rational way. The key point in this approach is thus the
choice of an appropriate fitness function. After some benchmarking
experiments (unpublished results) we chose a function that calculates a free
energy estimate based on two terms: protein contact pair-potentials and
side-chain solvation energies estimated from their solvent accessible area. This
function seems to yield a consistent measure of protein structural quality.
When each population reaches the upper limit (between 2 and 4 times its initial
size), this energy function is used to rank its members. Only the worst 25% of
the population is discarded at this point, to assure that quality models are not
lost prematurely.
3. Convergence criterion and final refinements. When the population has
converged to similar energies, there is no room for further generation of
variability and the evolution process stops. At this point the final population is
inspected. In most cases this consists of several representations of the same
protein conformation with average backbone deviations in the order of 0.1Å.
One of these representatives is then taken as the final model, which is carefully
inspected to detect unfavorable peptide conformations and a final energy
minimization using the CHARMM22 force field is performed. This procedure
is able to fix distorted side-chains. At this point we have a CASP5 unrefined
model.
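The three steps above can be tied together in a loop such as the following sketch. The operators and the energy function are passed in as placeholders, and the parameter names, the energy-spread convergence test with tolerance `tol` and the round limit are assumptions. The growth factor of 2 and the 25% discard fraction follow the values quoted in the text.

```python
import random


def evolve(population, recombine, mutate, energy, r=0.5, growth=2,
           discard=0.25, tol=1e-3, max_rounds=100, seed=0):
    """Grow, rank and prune a population of models until energies converge."""
    rng = random.Random(seed)
    pop = list(population)
    limit = growth * len(pop)
    for _ in range(max_rounds):
        # (1) Grow the population by recombination (probability r) or mutation.
        while len(pop) < limit:
            a, b = rng.sample(pop, 2)
            child = recombine(a, b, rng) if rng.random() < r else mutate(a, b)
            pop.append(child)
        # (2) Rank by energy (lower is fitter) and discard the worst fraction.
        pop.sort(key=energy)
        pop = pop[:len(pop) - int(discard * len(pop))]
        # (3) Stop once the surviving models have similar energies.
        if energy(pop[-1]) - energy(pop[0]) < tol:
            break
    return pop
```

Because the population is sorted before pruning, the lowest-energy model found so far is never discarded.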
In addition, for targets T0134, T0165, T0177 and T0185 we tested a further
refinement step consisting of running an all-atom, molecular dynamics
simulation inside a water box, with neutral total charge for around 0.5ns. For
these simulations we used the GROMACS package (www.gromacs.org) and
the OPLSAA force field. Snapshots taken from the trajectory were clustered
according to average backbone deviations and one conformation from the most
populated cluster was selected. After a few rounds of CHARMM22 energy
minimization, it was submitted as a refined model.
Insufficient computer resources prevented us from refining all targets.
1. Tramontano A., Leplae R. and Morea V. (2001) Analysis and Assessment
of Comparative Modeling Predictions in CASP4. Proteins suppl 5, 22-38.
2. Sippl M.J., Lackner P., Domingues F.S., Prlic A., Malik R., Andreeva A.
and Wiederstein M. (2001) Assessment of the CASP4 Fold Recognition
Category. Proteins suppl 5, 55-67.
3. Contreras-Moreira B. and Bates P.A. (2002) Domain Fishing: a first step in
protein comparative modelling. Bioinformatics 18, 1141-1142.
4. Bates P.A., Kelley L.A., MacCallum R.M. and Sternberg M.J.E. (2001)
Enhancement of Protein Modelling by Human Intervention in Applying the
Automatic Programs 3D-JIGSAW and 3D-PSSM. Proteins suppl 5, 39-46.
(www.bmm.icnet.uk/servers/3djigsaw)
Benner-steve (P0524) - 35 predictions: 18 3D, 17 SS
Evolution-based Structure Prediction
D.W. De Kee, T.J. McCormack, and S.A. Benner
Foundation for Applied Molecular Evolution
P.O. Box 13174, Gainesville FL 32604
benner@chem.ufl.edu
Predictions for fourteen CASP5 ab initio targets were submitted in a
collaborative effort to explore the potential for predicting secondary structure with
the transparent secondary structure prediction method [1]. The targets were
selected based on the availability of homologous protein sequences in adequate
numbers and evolutionary distributions in the MasterCatalog, a commercial
naturally organized database developed in collaboration with EraGen
Biosciences (Madison, WI).
Multiple alignments were generated using the automated DARWIN-server
[2]. Secondary structures were predicted based on automated heuristics to assign
surface, interior, active site and parsing residues by analysis of patterns of
conservation and variation among homologous protein sequences in light of
evolutionary models that interpret amino acid substitutions as the consequence
of neutral variation subjected to functional constraints [3].
For the targets with a homolog whose structure has been solved, multiple
alignment trials were performed. The alignments were executed with different
gap-opening and gap-extension penalties. The alignments were then evaluated
by visualizing them in relation to the solved structure, with the assumption that
the greatest sequence variation exists outside the boundaries of conserved
secondary structure motifs, i.e., α-helices and β-strands. Also, additional
homologous sequences were added to the alignments in order to obtain a family
profile, which allowed us to optimize the alignments, since key residues are
more likely to be universally conserved.
1. Benner S.A. and Gerloff S.D. (1990) Patterns of divergence in homologous
   proteins as indicators of secondary and tertiary structure: a prediction of
   the structure of the catalytic domain of protein kinases. Adv. Enzyme
   Regul. 31, 121-181.
2. Gonnet G.H. et al. (1992) Exhaustive matching of the entire protein
   sequence database. Science 256, 1443-1445.
3. Benner S.A. et al. (1994) Bona fide prediction of aspects of protein
   conformation. J. Mol. Biol. 235, 926-958.
Benner-steve (P0524) - 35 predictions: 18 3D, 17 SS
Evolution-based Structure Prediction Tools
Steven A. Benner, Danny De Kee, Thomas McCormack
University of Florida, Foundation for Applied Molecular Evolution
email: benner@chem.ufl.edu
In 1992, the first convincing tools were introduced for predicting protein
conformation from sequence data. These started with a set of aligned
homologous sequences for proteins diverging under functional constraints (1).
These were applied against the two ab initio targets presented in the CASP 1
prediction context, phospho-beta-galactosidase and synaptotagmin, and
generated correct tertiary structure models for both. The judges noted that these
represented the first two successful ab initio predictions in the CASP program
(2). In CASP 2, these tools generated another prediction, this time for the heat
shock protein 90 (3). Here, the prediction was sufficiently accurate that it
correctly assigned HSP90 as a distant homolog of gyrase, generated a
functional hypothesis for HSP90, and identified as incorrect certain
interpretations of experimental data concerning the function of HSP90.
Outside of the CASP project, the tools have been used to analyze the structures
of protein kinase, the pleckstrin homology domain, and ribonucleotide
reductase, among others, where the outcome has gone beyond modelling the
fold and has in each case answered questions of interest to biologists
and biomedical researchers working with these systems (4). A version of the
method has been applied to every protein sequence family in GenBank, and
these predictions are incorporated into the MasterCatalog, an interpretive
proteomics database marketed by EraGen Biosciences (Madison WI) (5). The
MasterCatalog helps identify diagnostic and therapeutic targets, assess the
value of animal models for human disease, and correlate genomic data with
function, starting with pathway interactions and extending to the cell, organism,
ecosystem, and planetary biosphere (6).
At the time that they were introduced, it was clear that evolution-based
structure prediction methods suffer from specific weaknesses inherent in their
formulation. These weaknesses are expected regardless of the details
of their implementation. Thus, the PHD tool, which implements the
same basic idea but in the form of a neural network, is expected to suffer from
the same weaknesses, and this has been suggested anecdotally. The purpose of
our participation in CASP5 is to generate a reference database of record of
predictions done using the 1992 method, which is described in detail both in
(1) and in the patent literature (7).
1. Benner S.A., Gerloff D.L. (1991) Patterns of divergence in homologous
   proteins as indicators of secondary and tertiary structure. The catalytic
   domain of protein kinases. Adv. Enzyme Regulat. 31, 121-181.
2. DeFay T., Cohen F.E. (1995) Evaluation of current techniques for ab initio
   protein structure prediction. Proteins 23, 431-445.
3. Gerloff D.L., Cohen F.E., Korostensky C., Turcotte M., Gonnet G.H.,
   Benner S.A. (1997) A predicted consensus structure for the N-terminal
   fragment of the heat shock protein HSP90 family. Proteins: Struct. Funct.
   Genet. 27, 450-458.
4. Benner S.A., Cannarozzi G., Chelvanayagam G., Turcotte M. (1997) Bona
   fide predictions of protein secondary structure using transparent analyses
   of multiple sequence alignments. Chem. Rev. 97, 2725-2843.
5. Benner S.A., Chamberlin S.G., Liberles D.A., Govindarajan S., Knecht L.
   (2000) Functional inferences from reconstructed evolutionary biology
   involving rectified databases. An evolutionarily-grounded approach to
   functional genomics. Research Microbiol. 151, 97-106.
Bilab (P0080) - 200 predictions: 200 3D
Tertiary Structure Prediction of Proteins Using Probability
Maps of Mainchain Torsion Angles for New Fold Targets and
Comparative Modeling Method for Other Targets
S. Nakamura1, T. Nishimura2, T. Ishida1, T. Miki1,
J. Sasaki1, K. Hibi1, T. Ishizuka1,3 and K. Shimizu1
1 - Department of Biotechnology, the University of Tokyo,
2 - Graduate School of Humanities and Sociology, the University of Tokyo,
3 - Faculty of Industrial Science and Technology, Tokyo University of Science
bilab@bi.a.u-tokyo.ac.jp
We have submitted tertiary structure prediction models for most CASP5
target proteins, the exceptions being T0136 and T0145. First we searched for
structural templates for the target sequence using PSI-BLAST and the 3D-PSSM
server against the Protein Data Bank. When we could not find any templates for the
target, we used an ab initio protein structure modeling tool named "ABLE",
developed in our laboratory, to produce prediction models. Otherwise we used
MODELLER to build prediction models based on the alignments of the
templates and the target.
Modeling with ABLE was based on energy minimization of a statistical potential
by simulated annealing. First, we built probability maps for mainchain
torsion angles (phi-psi) at each position of the target sequence. Sequence
similarity scores were calculated between the nine-residue sub-sequence of the
target at each position and all fragments of the same length from a tertiary
structure database. The NCBI non-redundant PDB (nrpdb) was used as this
database. From the nrpdb list with a cutoff p-value of 1.0e-7, proteins with
irregular residues, chain breaks or missing sidechains, and membrane proteins,
were eliminated; 1164 chains were used in total. The sequence similarity score
function was similar to that of Fischer et al. [1], including sequence identity
and the matching of secondary structure. The BLOSUM62 matrix was used for the
calculation of the sequence identity. Secondary structure prediction of the
target was performed using the PSIPRED server. Weight factors that emphasize
matching at the center of the nine-residue window were used. Probability maps
of mainchain phi-psi torsion angles were obtained from the phi-psi values of
the amino acids at the center of all nine-residue fragments with similarity
scores larger than a threshold. In this procedure, the contributions of
fragments with higher similarity scores were enhanced. Gaussian smoothing was
applied to these maps.
After building probability maps for each amino acid, a number of tertiary
structure models of the target were produced by minimizing the potential energy
with simulated annealing using these maps. The potential energy function was a
modification of that of Simons et al. [2]. The degree of buriedness of each amino
acid, contacts between residues, and the average distance between hydrophobic
residues were used to evaluate the match between the sequence and the
structure; hydrogen bonds between mainchains, packing of secondary
structure segments, excluded volume to avoid overlap of residues, and radius
of gyration were used to evaluate the plausibility of the model as a protein
tertiary structure. When we could not obtain compact structures for a target,
distance restraints between several residue pairs were added to the
potential energy function. The weight factor for each energy term was adjusted
for each target to obtain compact model structures and was changed as the
simulated annealing progressed. At each simulated annealing step, we changed
the mainchain phi-psi torsion angles at a random position to values drawn at
random according to the probability maps. About 200 to 5000 structures were
produced for each target by simulated annealing (about 30000 to 200000 steps
per run), followed by clustering of these structures. Up to five structures
nearest to the centers of large clusters were selected, and sidechain modeling
was performed for these structures using SCWRL. The order of
submission was determined by manual inspection of these structures.
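The probability maps and the torsion-angle move described for ABLE can be sketched as follows. This is a minimal illustration under stated assumptions, not the ABLE implementation: the 10-degree bin size, the smoothing width, and the function names `build_map` and `sample_angles` are all hypothetical, and the energy function and annealing schedule are left out.

```python
import numpy as np

BIN = 10            # degrees per bin (assumed resolution)
NB = 360 // BIN     # number of bins along phi and along psi

def build_map(phi_psi, weights, sigma_bins=1.0):
    """Weighted phi-psi histogram with periodic Gaussian smoothing.

    phi_psi: (phi, psi) pairs in degrees taken from the central residue of
             each matching fragment; weights: their similarity-based weights.
    """
    grid = np.zeros((NB, NB))
    for (phi, psi), w in zip(phi_psi, weights):
        i = int((phi + 180) % 360 // BIN)
        j = int((psi + 180) % 360 // BIN)
        grid[i, j] += w
    radius = max(1, int(round(3 * sigma_bins)))
    smooth = np.zeros_like(grid)
    for di in range(-radius, radius + 1):
        for dj in range(-radius, radius + 1):
            k = np.exp(-(di * di + dj * dj) / (2.0 * sigma_bins ** 2))
            # np.roll wraps around, which respects the periodicity of angles.
            smooth += k * np.roll(np.roll(grid, di, axis=0), dj, axis=1)
    return smooth / smooth.sum()

def sample_angles(pmap, rng):
    """Draw one (phi, psi) bin centre with probability given by the map."""
    flat = rng.choice(pmap.size, p=pmap.ravel())
    i, j = divmod(int(flat), NB)
    return i * BIN - 180 + BIN / 2, j * BIN - 180 + BIN / 2
```

A simulated annealing step would pick a random sequence position, call `sample_angles` on that position's map, rebuild the chain, and accept or reject the change according to the potential energy.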
Our procedure for modeling with MODELLER was as follows. We searched for
templates for the target using PSI-BLAST and 3D-PSSM. One or more
templates were selected considering the matching scores of these templates, the
agreement between their secondary and tertiary structures, and the results of
secondary structure prediction and tertiary structure prediction of the target by
CAFASP3 servers. The sequence alignment of the templates and the target
sequence was first obtained using PSI-BLAST or 3D-PSSM, and then
modified manually according to the secondary structures of each sequence and
the results of tertiary structure modeling by MODELLER. When more than one
template was used, all possible combinations of them for alignments
were tried. About 20 to 500 models were produced for each alignment using
MODELLER and a few models were selected according to the plausibility of the
secondary structures and the compactness of the models. If there were no
templates for some parts of the target sequence, ABLE with distance restraints
between residue pairs was used for those parts of the target. Sidechain modeling
was performed on these structures using SCWRL. Finally, up to five models
were selected and the order of submission was determined by manual
inspection.
1. Fischer D. et al. (1996) Protein fold recognition using sequence-derived
   predictions. Protein Science. 5 (5), 947-955.
2. Simons K.T. et al. (1999) Improved recognition of native-like protein
   structures using a combination of sequence-dependent and
   sequence-independent features of proteins. Proteins 34 (1), 82-95.
6. Benner S.A., Caraco M.D., Thomson J.M., Gaucher E.A. (2002) Planetary
   biology. Paleontological, geological, and molecular histories of life.
   Science 293, 864-868.
7. Benner S.A. (1999) Predicting Folded Structures of Proteins. US Patent
   5,958,784.
BioInfo.PL (P0006) - 75 predictions: 75 3D
3D-Jury
Leszek Rychlewski
BioInfoBank Institute, Poznań, Poland
leszek@bioinfo.pl
3D-Jury is a simple consensus structure prediction system, which shares
similarity with solutions employed in the field of ab initio fold recognition.
Recent advances in this area can be credited to the
application of non-energetic constraints such as preferences for high contact
order or the detection of clusters of abundant conformations. The experience
with ab initio prediction methods led to the conclusion that averages of
low-energy conformations obtained most frequently by folding simulations are
closer to the native structure than the conformation with the lowest energy. The
direct translation of these findings into the field of fold recognition by
threading methods would mean that the most abundant high-scoring models are
closer to the native structure than the model with the highest score. This is
the main rationale behind the 3D-Jury approach.
3D-Jury takes as input groups of models generated by a set of servers. All
models are compared with each other and a similarity score is assigned to each
pair, equal to the number of C-alpha atom pairs that are within 3.5 Å
after optimal superposition. If this number is below 40, the pair of models is
annotated as not similar and the score is set to zero. The cutoff value of 40 was
taken from previous benchmarking results and indicates a roughly 90% chance
that both models belong to the same fold class. The final 3D-Jury score of a
model is the sum of all similarity scores of the considered model pairs divided by
the number of considered pairs plus one. The 3D-Jury system can operate in
two modes, which differ in the set of model pairs that may be considered. The
best-model mode (3D-Jury-single) allows only one model from each server to
be used in the sum, while the all-models mode (3D-Jury-all) considers all
models from every server.
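The scoring rule described in this paragraph can be illustrated with a short sketch. It assumes the pairwise similarity counts (numbers of C-alpha pairs within 3.5 Å after superposition) are already available as a matrix; the function name `jury_scores` and the input layout are hypothetical, and treating a server with no other model as contributing zero in the best-model mode is an assumption.

```python
def jury_scores(models, sim, cutoff=40):
    """Compute 3D-Jury-like scores for every model.

    models: list of (server, model_number) tuples.
    sim:    square matrix (list of lists) of raw similarity counts,
            indexed like `models`.
    Returns (all_mode, single_mode) score lists.
    """
    n = len(models)
    servers = sorted({server for server, _ in models})

    def thresholded(p, q):
        # Pairs below the cutoff are annotated as not similar (score zero).
        return sim[p][q] if sim[p][q] >= cutoff else 0

    all_mode, single_mode = [], []
    for p in range(n):
        others = [q for q in range(n) if q != p]   # every pair except self
        # All-models mode: sum over considered pairs / (number of pairs + 1).
        all_mode.append(sum(thresholded(p, q) for q in others) / (len(others) + 1))
        # Best-model mode: only the best-matching model of each server counts.
        best_sum = 0
        for server in servers:
            vals = [thresholded(p, q) for q in others if models[q][0] == server]
            best_sum += max(vals, default=0)
        single_mode.append(best_sum / (len(servers) + 1))
    return all_mode, single_mode
```

The model with the highest score in the chosen mode would be taken as the consensus prediction.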
The two scores are defined as:

$$
\mathrm{3DjuryAll}(M_{a,b}) =
\frac{\sum_{i}^{N} \sum_{j,\; a \neq i \;\mathrm{OR}\; b \neq j}^{N_i} \mathrm{sim}(M_{a,b}, M_{i,j})}
     {1 + \sum_{i}^{N} \sum_{j,\; a \neq i \;\mathrm{OR}\; b \neq j}^{N_i} 1}
$$

$$
\mathrm{3DjurySingle}(M_{a,b}) =
\frac{\sum_{i}^{N} \max_{j,\; a \neq i \;\mathrm{OR}\; b \neq j}^{N_i} \mathrm{sim}(M_{a,b}, M_{i,j})}
     {1 + \sum_{i}^{N} 1}
$$

sim(M_{a,b}, M_{i,j}): similarity score between model M_{a,b} and model M_{i,j}
3DjuryAll: 3D-Jury score in the all-models mode
3DjurySingle: 3D-Jury score in the best-model mode
M_{a,b}: model number b from the server a
M_{i,j}: model number j from the server i
N: number of servers
N_i: number of top-ranking models from the server i (maximum 10)
The 3D-Jury system does not directly use the reliability score assigned to the
models by the servers. This does not necessarily mean that the information about
the original scores is lost. It can be expected that highly reliable models
produced by fold recognition methods have fewer ambiguities in the alignments
to template structures, which would result in higher similarity between models
generated on templates with the same fold and finally in higher 3D-Jury scores.
Biokol (P0258) - 23 predictions: 23 3D
Low-Mode Optimization
I. Kolossváry1,2
1 – BIOKOL Research LLC, 2 – Budapest University of Technology and
Economics
istvan@kolossvary.hu
The author introduced the low-mode conformational search procedure (LMOD)
a few years ago [1] for automated small molecule conformational analysis and
has further developed it to treat larger and larger systems with applications to
flexible active site docking [2], protein loop optimization [3] and most recently,
fully flexible induced fit docking [4]. The CASP5 experiment has been my first
attempt to use LMOD for making ab initio fold predictions.
The success of LMOD can mainly be attributed to taking full advantage of
correlation among the moving degrees of freedom. While the total number of
degrees of freedom in a large molecular system is prohibitive to treat them
independently, the high degree of correlation allows for a significant reduction
in dimensionality. Conformational interconversions proceed via concerted
atomic motions and can be described with only a few, non-correlated degrees of
freedom in terms of low-frequency/large-amplitude vibrational modes. LMOD
automatically generates its own, low-dimensional search space, which is
spanned by the low-frequency mode eigenvectors of the Hessian matrix.
LMOD can explore large systems efficiently, because the effective number of
degrees of freedom associated with collective motions responsible for
conformational dynamics is rather small even for proteins.
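The idea of a search space spanned by low-frequency Hessian eigenvectors can be illustrated with a toy sketch. It is not the LMOD code: the helper names, the zero-eigenvalue tolerance and the random choice of mode and direction are assumptions made for illustration, and a real search would follow each move with energy minimization.

```python
import numpy as np

def low_modes(hessian, k, zero_tol=1e-8):
    """Return the k lowest non-zero eigenvalues and their eigenvectors."""
    vals, vecs = np.linalg.eigh(hessian)   # eigenvalues in ascending order
    keep = vals > zero_tol                 # drop rigid-body (zero) modes
    return vals[keep][:k], vecs[:, keep][:, :k]

def low_mode_move(x, modes, amplitude, rng):
    """Displace coordinates x along one randomly chosen low mode."""
    mode = modes[:, rng.integers(modes.shape[1])]
    sign = rng.choice([-1.0, 1.0])
    return x + sign * amplitude * mode

# Toy Hessian: four particles on a line joined by unit springs.
H = np.array([[ 1.0, -1.0,  0.0,  0.0],
              [-1.0,  2.0, -1.0,  0.0],
              [ 0.0, -1.0,  2.0, -1.0],
              [ 0.0,  0.0, -1.0,  1.0]])
values, modes = low_modes(H, k=2)
x_new = low_mode_move(np.zeros(4), modes, amplitude=0.5,
                      rng=np.random.default_rng(1))
```

Even in this four-coordinate toy, the search is restricted to two collective directions rather than four independent ones, which is the dimensionality reduction the text refers to.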
Another way of looking at LMOD is by referring to domain motions. A domain
can be defined – with respect to a particular vibrational mode – as a set of
atoms in a protein, which move only a little with respect to each other, but
move considerably with respect to the rest of the protein. In other words, the
concerted motion of semi-rigid parts – the domains – of the protein can often
describe protein motions in a simplified way. It is clear that the required
number of degrees of freedom to describe such domain motion is only a tiny
fraction of the total number of degrees of freedom. It is instructive to note that
the conformational dynamics of certain classes of molecules can be described
by different types of “natural” motions, such as torsional rotation for small to
medium size acyclic molecules, or certain “kayaking” and “flapping” motions
for cycloalkanes [5]. For proteins, domain motion is the natural motion, not
torsional rotation. Levinthal’s paradox does not apply to LMOD, because
LMOD operates in normal mode space, not in torsion space.
LMOD has been utilized in CASP5 to refine crude protein models by
optimizing backbone and side chain conformations simultaneously. The initial
models for each prediction were selected from a diverse set of about 100,000
low-energy Cα-trace folds obtained by threading. The threading models were
clustered and the ten lowest energy cluster representatives were subjected to
LMOD optimization. First, all ten Cα models were turned into all-atom models
by fitting an approximate backbone and attaching side chains. The idea of
LMOD optimization is to refine and relax the backbone and the side chains
simultaneously by applying large-amplitude structural changes along the
low-frequency vibrational modes. The particular modes and their amplitudes were
selected randomly [1-4] to generate a mainly downhill trajectory on the
potential energy landscape (AMBER94 with GB/SA solvation). The five lowest
energy structures found during the course of several independent LMOD runs
were submitted to CASP5. LMOD runs were carried out on Linux clusters
using MacroModel 8.0. A universal LMOD optimization package will be
available from the author.
1. Kolossváry I., Guida W.C. (1996) Low mode search. An efficient,
   automated computational method for conformational analysis: Application
   to cyclic and acyclic alkanes and cyclic peptides. J. Am. Chem. Soc. 118
   (21), 5011-5019.
2. Kolossváry I., Guida W.C. (1999) Low-mode conformational search
   elucidated. Application to C39H80 and flexible docking of 9-deazaguanine
   inhibitors into PNP. J. Comput. Chem. 20 (15), 1671-1684.
3. Kolossváry I., Keserü G.M. (2001) Hessian-free low-mode conformational
   search for large scale protein loop optimization: Application to c-jun
   N-terminal kinase JNK3. J. Comput. Chem. 22 (1), 21-30.
4. Keserü G.M., Kolossváry I. (2001) Fully flexible low-mode docking.
   Application to induced fit in HIV integrase. J. Am. Chem. Soc. 123 (50),
   12708-12709.
5. Kolossváry I., Guida W.C. (1993) Comprehensive conformational analysis
   of the four- to twelve-membered ring cycloalkanes: Identification of the
   complete set of interconversion pathways on the MM2 potential energy
   hypersurface. J. Am. Chem. Soc. 115 (6), 2107-2119.
Bion (P0474) - 63 predictions: 63 SS
Secondary Structure Prediction with Shuffled Training by
SPAM
R. Shigeta and J.P. LeFlohic
Bion Bioinformatics Consulting
rtshigeta@yahoo.com
This instance of the Structure Prediction Application Metatool (SPAM) uses
two sequential neural networks. The first is a 15-75-3 sequence-to-structure
network which takes as input the actual residue and a PSIBLAST [2] position
specific sequence profile (PSSM). Similar to the JNET architecture [1], a
window of output from the first neural network is fed into a 15-55-3
structure-to-structure network. SPAM also feeds a copy of the original residue
and the PSSM probability data to the second network.
A non-redundant set of 504 protein sequences and structures from the Protein
Data Bank [3] was used as the training set, with a random 114 set aside for a
non-trained test set. Proteins without any identified secondary structure were
discarded. Upon loading, the sequences are broken into window-length training
patterns and shuffled such that the neural network is presented with each class
of secondary structure at each training step and a similar number of examples
of each structure.
Training proceeds in epochs until all the errors from the neural networks in the
application cease to change more than an epsilon value which must be assigned
by hand, between 1e-3 and 1e-5. A training epoch is defined as the
presentation of 10,000 patterns, and so the training cycles do not contain
exactly the same data.
The final prediction of beta, helix, or coil is selected by choosing the highest of
the three outputs for each residue. No weighting is applied to the outputs.
The confidence is calculated in the standard way as the difference between the
highest float output and the second highest one. CASP entries were then edited
by hand for improbable patterns in secondary structure.
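The final class selection and confidence value described above amount to an argmax and a top-two difference. A minimal sketch follows; the ordering of the three outputs as beta, helix, coil and the one-letter labels are assumptions, as is the function name.

```python
import numpy as np

CLASSES = ("E", "H", "C")   # beta, helix, coil; output order is assumed

def predict_with_confidence(outputs):
    """outputs: array of shape (n_residues, 3) with the raw network outputs.

    Returns the predicted class per residue and a confidence equal to the
    highest output minus the second highest one. No weighting is applied.
    """
    outputs = np.asarray(outputs, dtype=float)
    labels = [CLASSES[i] for i in outputs.argmax(axis=1)]
    ranked = np.sort(outputs, axis=1)          # ascending within each row
    confidence = ranked[:, -1] - ranked[:, -2]
    return labels, confidence
```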
1. Cuff J.A. and Barton G.J. (1999) Application of enhanced multiple
   sequence alignment profiles to improve protein secondary structure
   prediction. Proteins 40, 502-511.
2. Altschul S.F. et al. (1997) Gapped BLAST and PSI-BLAST: a new
   generation of protein database search programs. Nucleic Acids Res. 25
   (17), 3389-3402.
3. Berman H.M., Westbrook J., Feng Z., Gilliland G., Bhat T.N., Weissig H.,
   Shindyalov I.N., Bourne P.E. (2000) The Protein Data Bank. Nucleic Acids
   Research 28, 235-242.
Bionomix (P0475) - 61 predictions: 61 3D
STRUCTFAST: Structure Realization Utilizing Cogent Tips
From Aligned Structural Templates
J.F. Danzer and D.A. Debe
Eidogen, Inc.
jdanzer@stucturesoflife.com
While the alignment methods used in comparative modeling techniques have
recently begun to incorporate structural information from the homology
template, current techniques still do not capture much of the available
information from a multiple structure profile [1-2]. We have developed a novel
dynamic programming algorithm, STRUCTFAST, uniquely capable of
incorporating important information from a structural family directly into the
sequence alignment process [3]. This fully automated algorithm routinely
produces alignments that meet or exceed the quality obtained by an expert
human homology modeler.
Models are constructed from the automated alignments using a variant of the
assembly of rigid bodies technique [4] and unconserved side chains are built
using a standard rotamer library [5]. Models are refined via a standard
minimization procedure employing the Dreiding force field [6] in conjunction
with a surface area based solvation potential [7]. Quality of the final model is
assessed using the ProsaII program [8]. A final pG reliability score is computed
from the ProsaII scores [9].
1. Kelley L.A. et al. (2000) Enhanced genome annotation using structural
   profiles in the program 3D-PSSM. J. Mol. Biol. 299 (2), 499-520.
2. Shi J. et al. (2001) FUGUE: Sequence-structure homology recognition
   using environment-specific substitution tables and structure-dependent gap
   penalties. J. Mol. Biol. 310 (1), 243-257.
3. Debe D.A. et al. Unpublished work.
4. Marti-Renom M.A. et al. (2000) Comparative protein structure modeling
   of genes and genomes. Annu. Rev. Biophys. Biomol. Struct. 29, 291-325.
5. Tuffery P. et al. (1991) A new approach to the rapid-determination of
   protein side-chain conformations. J. Biomol. Struct. Dyn. 8 (6), 1267-1289.
6. Mayo S.L. et al. (1990) Dreiding -- A generic force-field for molecular
   simulations. J. Phys. Chem. 94 (26), 8897-8909.
7. Danzer J.F. et al. Unpublished work.
8. Sippl M.J. (1993) Recognition of errors in three-dimensional structures of
   proteins. Proteins 17 (4), 355-362.
9. Sanchez R. and Sali A. (1998) Large-scale protein structure modeling of
   the Saccharomyces cerevisiae genome. Proc. Natl. Acad. Sci. USA 95 (23),
   13597-13602.
Braun-Werner (P0024) - 65 predictions: 65 3D
Evaluating Sequence Alignment of Fold-Recognition Tools by
Quantitative Scoring of Physical-Chemical Property Based
Motifs
Venkatarajan Mathura, Ovidiu Ivanciuc, Numan Oezguen,
Catherine Schein, Yuan Xu and Werner Braun
Sealy Center for Structural Biology,
Department of Human Biological Chemistry and Genetics,
University of Texas Medical Branch, Galveston, TX 77555
werner@newton.utmb.edu
We participated in CASP5 to test, in a systematic way, our new methods for
incorporating sequence decomposition tools into our modeling methods.
Protein sequences of the CASP targets were separated into sequential motifs
which were then used to judge and improve the alignment between templates
and targets given by fold recognition methods. As in CASP4, we used the 3D
modeling suite EXDIS/DIAMOD/FANTOM to generate 3D models of the
targets, but this time the modeling was facilitated by incorporating the
individual programs into one central tool, MPACK.
Sequence motifs characterizing the protein families of the CASP5 targets were
automatically generated by our program MASIA [1]. These motifs are based on
conservation of physical-chemical properties, as we have previously
demonstrated that even distantly related proteins share contiguous segments
with similar patterns of physical-chemical properties. We have recently
developed five-dimensional quantitative descriptors for each of the 20 amino
acids based on a large number of physical-chemical properties [2].
Conservation of physical-chemical properties at equivalent residue positions in
a protein family is then defined by measuring the standard deviations and the
relative entropies of these descriptors.
For each target we prepared multiple alignments of the target sequence with
similar sequences from other organisms as identified in BLAST/PSIBLAST.
These multiple sequence alignments were then analyzed with MASIA to
identify regions where the distribution of the quantitative descriptors within the
multiple alignment is significantly different from the 'a priori' expected
distribution. Each motif is quantitatively expressed as a profile consisting of
vector magnitude, standard deviation and the relative entropy. The relative
entropy is used to measure the deviation of the actual observed distribution of
the descriptors from that of randomly occurring amino acids at a given
sequence position in the multiple alignment.
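The relative entropy used to score a motif position can be illustrated with a small sketch. It assumes a single scalar descriptor per amino acid and a background in which all residues in the descriptor table are equally likely; the binning scheme, the function name and any descriptor values passed in are hypothetical and are not the published five-dimensional descriptors or the MASIA implementation.

```python
import math
from collections import Counter

def column_relative_entropy(column, descriptor, n_bins=4):
    """Relative entropy (in nats) of descriptor values in one alignment column.

    column:     string of residues at one position of the multiple alignment.
    descriptor: dict mapping residue letter to a scalar property value.
    The observed histogram of descriptor values is compared with the
    histogram expected if residues were drawn uniformly from `descriptor`.
    """
    lo, hi = min(descriptor.values()), max(descriptor.values())
    width = (hi - lo) / n_bins or 1.0

    def bin_of(value):
        return min(int((value - lo) / width), n_bins - 1)

    background = Counter(bin_of(v) for v in descriptor.values())
    observed = Counter(bin_of(descriptor[r]) for r in column if r in descriptor)
    n_obs, n_bg = sum(observed.values()), sum(background.values())
    return sum((c / n_obs) * math.log((c / n_obs) / (background[b] / n_bg))
               for b, c in observed.items())
```

A fully conserved column yields a large value, while a column whose residues are spread like the background yields a value near zero; gap characters are simply skipped in this sketch.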
These profiles were then used by our program ALIGNSCORER to determine
which templates and alignments, from the various fold recognition servers that
participated in CAFASP, had the highest score according to their degree of fit
in the highly conserved areas. For some of the targets we combined the
alignment with high scoring motifs from different fold recognition servers and
from different templates. For targets with no significant differences in the
alignment for the motifs we used secondary structure predictions to generate an
improved alignment. In most cases, the profiles of the template molecule were
also determined.
In several cases, where there were few known sequences similar to the target
protein, we determined instead motifs (and molegos) in the suggested templates
from the PDB and matched these to the target in the alignment. For two targets
(T139,T170) we could not find suitable templates and made ab initio 3D
structure prediction based on secondary structure prediction from JPRED and
inside/outside prediction from MASIA.
The final models were judged according to stereochemical criteria
(PROCHECK) and energy (ECEPP) and where possible, by biological ones,
such as surface location of known glycosylation and inter-protein interaction
sites, and similar configuration of the active sites of enzymes.
We have included in all submission files a general description of our method
and a specific section with details for each target.
1. Zhu H. et al. (2000) MASIA: recognition of common patterns and
   properties in multiple aligned protein sequences. Bioinformatics 16, 950-951.
2. Venkatarajan M.S. and Braun W. (2001) New quantitative descriptors of
   amino acids based on multidimensional scaling of a large number of
   physical-chemical properties. J. Mol. Modeling 7, 445-453.
Brooks (P0373) - 252 predictions: 252 3D
Structure Prediction with Multiscale Modeling Methods
Michael Feig, Charles L. Brooks III
NIH Research Resource for Multiscale Modeling Tools in Structural Biology
The Scripps Research Institute, La Jolla, CA
brooks@scripps.edu, meikel@scripps.edu
We have used a multiscale modeling strategy to predict protein structures ab
initio or from templates based on sequence homology or fold recognition. In
our approach we combine low-resolution, lattice-based representations of
protein structures with models in full atomic detail using the newly developed
MMTSB Tool Set [1]. Such a modeling method was used to provide both speed
and accuracy while exploring conformational space to search for the global free
energy minimum that is presumed to coincide with the native fold.
In the standard protocol employed for most targets we first generated a large
number of conformations using lattice-based sampling with a modified version
of the program MONSSTER [2] that uses the replica exchange methodology
for enhanced sampling [3]. For ab initio predictions we started from random
extended chains using only information from secondary structure prediction
servers to guide the lattice simulations. If structural templates were available
from sequence homology or fold recognition, the lattice sampling protocol was
applied only to missing regions (from small loops to larger fragments) in the
context of the template structure. In some cases different models were built
from multiple templates and refined and ranked together at a later stage.
Templates were selected based on alignments provided through CAFASP or
from using public fold recognition servers, in particular 3D-pssm
(http://www.sbg.bio.ic.ac.uk/~3dppsm) and PDB-Blast
(http://bioinformatics.burnham-inst.org/pdb_blast/PB_help.html).
The lattice model conformations were subsequently rebuilt to complete
all-atom models using an accurate reconstruction procedure [4] and clustered
according to distance RMSD. The all-atom models were then minimized and
scored with an all-atom energy function using the CHARMM program [5]. As
part of the scoring function we used a new, highly accurate Generalized Born
implementation [6] to provide an implicit solvent description since this function
was found to be most effective in scoring CASP4 predictions [7]. A number of
structures were then selected from the clusters with the lowest average energy
scores, on the order of 10 to 100, and submitted to further refinement.
In the final refinement step, we used molecular dynamics simulations with
Generalized Born implicit solvation and replica exchange. In replica exchange
simulations, a number of simulations are run concurrently at different
temperatures while temperatures are exchanged periodically according to
Metropolis criteria based on the system’s total energy. In such simulations the
most favorable conformations populate the lowest temperature windows while
less favorable conformations at higher temperatures can search conformational
space more extensively for lower energy regions. This greatly enhances
sampling towards the global free energy minimum but also provides intrinsic
ranking of structures based on free energies from the average temperature for a
given replica during the course of a simulation. Accordingly, the final
structures at the lowest temperatures from the replica exchange simulations
were submitted as predictions.
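The Metropolis exchange test used in replica exchange can be written in a few lines. This is a generic sketch of the standard criterion from the replica exchange literature, not the MMTSB Tool Set code; the function names and the choice of kcal/mol units for the Boltzmann constant are assumptions.

```python
import math
import random

K_B = 0.0019872041   # Boltzmann constant in kcal/(mol K); units are assumed

def exchange_probability(e_i, t_i, e_j, t_j):
    """Metropolis probability of swapping the temperatures of two replicas.

    e_i, e_j: current energies of the two replicas.
    t_i, t_j: their current temperatures in kelvin.
    """
    beta_i, beta_j = 1.0 / (K_B * t_i), 1.0 / (K_B * t_j)
    delta = (beta_i - beta_j) * (e_i - e_j)
    return 1.0 if delta >= 0 else math.exp(delta)

def attempt_exchange(e_i, t_i, e_j, t_j, rng=random):
    """Return True if the swap is accepted."""
    return rng.random() < exchange_probability(e_i, t_i, e_j, t_j)
```

A swap that moves the lower-energy conformation to the lower temperature is always accepted, which is why favorable conformations accumulate in the lowest temperature windows.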
1. Feig M., Karanicolas J., Brooks C.L. III. (2001) MMTSB Tool Set. NIH
   Research Resource for Multiscale Modeling Tools in Structural Biology,
   The Scripps Research Institute.
   http://mmtsb.scripps.edu/software/mmtsbtoolset.html
2. Kolinski A., Skolnick J. (1994) Monte Carlo Simulations of Protein
   Folding. I. Lattice Model and Interaction Scheme. Proteins 18, 338-352.
3. Sugita Y., Okamoto Y. (1999) Replica-exchange molecular dynamics
   method for protein folding. Chem. Phys. Lett. 314, 141-151.
4. Feig M., Rotkiewicz P., Kolinski A., Skolnick J., Brooks C.L. III. (2000)
   Accurate Reconstruction of All-Atom Protein Representations From
   Side-Chain-Based Low-Resolution Models. Proteins 41, 86-97.
5. Brooks B.R., Bruccoleri R.E., Olafson B.D., States D.J., Swaminathan S.,
   Karplus M. (1983) CHARMM: A Program for Macromolecular Energy,
   Minimization, and Dynamics Calculations. J. Comp. Chem. 4, 187-217.
6. Lee M.S., Salsbury F.R. Jr., Brooks C.L. III. (2002) Novel Generalized
   Born Methods. J. Chem. Phys. 116, 10606-10614.
7. Feig M., Brooks C.L. III. (2002) Evaluating CASP4 Predictions with
   Physical Energy Functions. Proteins 49, 232-245.
Bujnicki-Janusz (P0020) - 215 predictions: 67 3D, 58 SS, 49 RR, 41 DR
Consensus Prediction Using Fragments of Fold-Recognition-Based Homology Models Weighted by the Score of
Inter-Residue Contacts and Compatibility
with the Local Environment
J.M. Bujnicki
International Institute of Molecular and Cell Biology (IIMCB) in Warsaw.
Trojdena 4, 01-109 Warsaw, Poland
iamb@genesilico.pl
A protein structure prediction strategy has been developed, which is applicable
to all prediction categories in CASP5: homology modeling, FR, prediction of
new folds, SS prediction, residue-residue (RR) distance prediction and order-disorder (DR) region prediction. It proceeds as follows:
i) a multiple sequence alignment is built for the target sequence using as many
homologs as possible, careful refinement is carried out to remove false
positives and correctly align weakly conserved motifs,
ii) FR and SS predictions are carried out for the target sequence/alignment
using as many different methods as possible,
iii) target-template alignments generated by FR methods are converted into
full-atom homology models (HM),
iv) the model structures are clustered to identify the most frequently reported
folds,
v) all models are evaluated using independent criteria to obtain uniformly
scaled values corresponding to the expected accuracy of each model,
vi) the quality of modeled segments is evaluated by calculation of the 3D
profile score in a moving-window scan,
vii) for each candidate fold, the superimposed models are used to generate a
low-resolution weighted consensus structure, using weights at the level of
entire models and individual amino acid residues; models and regions below
a certain cutoff are disregarded,
viii) the sequence-structure alignment for the core of the target is obtained from
superposition of the consensus model onto the family of template structures,
ix) the alignment of the core is refined to preserve the continuity of SS
elements,
x) the core of the target is modeled based on all templates,
xi) the loops of the target are modeled using cladistic criteria (of the loops
in the template structures, only those with similar length and sequence
are used to guide the modeling of loops in the target structure),
xii) the model is re-evaluated by calculation of the 3D profile score (using a
knowledge-based atomic-detailed potential) and poorly scoring regions are
subject to refinement (see step xvi),
xiii) the consensus SS is obtained from the SS calculated from the FR model
and the FR-independent SS prediction (step ii),
xiv) the weighted consensus of RR contacts is calculated from all intermediate
models for the selected fold (step iii), using weights from the 3D profile score (step vi),
xv) the disordered regions are predicted by analysis of R-factors and
unstructured regions in the superimposed template structures combined with
analysis of the local sequence composition in the target structure,
xvi) the 3D model is refined, using alternative FR alignments as the starting
points and restraints from the consensus SS, RR and DR predictions.
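Steps vi and vii can be sketched as follows. This is only an illustration under stated assumptions: the models are taken to be already superimposed, weights are assumed non-negative, and the function names, array layout and cutoff parameters are hypothetical rather than taken from the actual procedure.

```python
import numpy as np


def window_average(scores, window=5):
    """Moving-window mean of per-residue scores (shorter windows at the ends)."""
    scores = np.asarray(scores, dtype=float)
    half = window // 2
    return np.array([scores[max(0, i - half): i + half + 1].mean()
                     for i in range(len(scores))])


def weighted_consensus(coords, model_weights, residue_weights,
                       model_cutoff=0.0, residue_cutoff=0.0):
    """Weighted average of superimposed C-alpha traces.

    coords: (M, L, 3) coordinates of M superimposed models of length L.
    model_weights: (M,) one non-negative weight per model.
    residue_weights: (M, L) non-negative per-residue weights, for example
        window-averaged profile scores.
    Models or residues below their cutoff are ignored; positions left with
    no contributing model are returned as NaN.
    """
    coords = np.asarray(coords, dtype=float)
    mw = np.asarray(model_weights, dtype=float)[:, None]
    rw = np.asarray(residue_weights, dtype=float)
    keep = (mw >= model_cutoff) & (rw >= residue_cutoff)
    w = mw * rw * keep
    total = w.sum(axis=0)
    consensus = np.full(coords.shape[1:], np.nan)
    covered = total > 0
    summed = (w[:, :, None] * coords).sum(axis=0)
    consensus[covered] = summed[covered] / total[covered, None]
    return consensus
```

Combining a per-model weight with a per-residue weight lets a generally good model still be ignored in a locally poor segment, which is the intent of weighting "at the level of entire models and individual amino acid residues".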
In CASP5, I submitted predictions in all categories (TS, DR, RR, and SS).
Where possible, all types of prediction algorithms were queried with multiple
sequence alignments. Additional FR alignments and ab initio models for the
CASP5 targets were obtained from the CAFASP website. For homology
modeling, I used Modeller 6v1 [1] and Swiss-Model [2]. Following visual
inspection of the models, the conformation of selected sidechains was adjusted
by hand. Long loops with no counterpart in homologous structures were often
modeled by hand by partial extension of the neighboring secondary structure
elements. For evaluation of the local environment and inter-residue contacts I
used Verify3D [3], window 5aa. In most cases, I submitted only the best single
model.
1. Sali A. and Blundell T.L. (1993) Comparative protein modelling by
   satisfaction of spatial restraints. J. Mol. Biol. 234, 779-815.
2. Guex N. and Peitsch M.C. (1997) SWISS-MODEL and the Swiss-PdbViewer:
   an environment for comparative protein modeling.
   Electrophoresis 18, 2714-2723.
3. Luthy R., et al. (1992) Assessment of protein models with
   three-dimensional profiles. Nature 356, 83-85.
Bystroff (P0131) - 132 predictions: 45 3D, 40 SS, 45 RR, 2 DR
Contact Map Threading Using HMMSTR
Y. Shao and C. Bystroff
Department of Biology, Rensselaer Polytechnic Institute
shaoy@rpi.edu, bystrc@rpi.edu
During the CASP5 prediction, we developed a threading method to predict
protein contact maps. The target sequence was aligned with each of the
template sequences selected from the PDBSelect database [1]. Target contact
maps were generated from these alignments along with the template contact
maps. Then the predicted contact maps were evaluated by the contact free
energy calculation. The final contact map prediction was selected based upon
the contact free energy as well as three other parameters, along with extensive
use of intuition.
The templates we used were selected from the PDBSelect database and
satisfied the following conditions: X-ray structure available, high resolution
(<2.5Å) and not α-carbon only. The total number of templates was 1239. For
each template, a sequence profile was generated using PSI-BLAST [2].
First, the target sequence profile was aligned against each template sequence
profile by a Bayesian adaptive sequence alignment method [3]. For each
template, the method generated one alignment score but a large number of
alignments between the target and the template. If the alignment score was
below the threshold value, this template was rejected. Otherwise, if the
alignment score was higher than the threshold, 100 alignments were selected at
random for contact map alignment.
The alignment sets were then improved in two steps. First they were pruned by
the compactness score. The Bayesian alignment method tended to insert large
gaps into the alignments. In order to find which alignments had longer regions
that were aligned, for each alignment we calculated the length of the longest
aligned region (ignoring small gaps ≤ 3 residues). This length was defined as
the compactness score of the alignment. We kept the top-scoring 10 alignments
and discarded the remaining 90. Then the physicality of these 10 alignments
was checked. The checking was limited to the ends of the gaps. If, in the
template structure, the distance between the residues at the opposite ends of an
insertion was inconsistent with the sequence distance in the target, then those
residues were removed from the alignments. This procedure was repeated until
the gap distances in the template were consistent with the sequence distances in
the target. In other words, for all gaps (i,j),
$$D_{i'j'} \le 3.8\,\text{Å} \times |i - j| \qquad (1)$$
where i' and j' are the template positions aligned with i and j, and D is the
distance in Å between the alpha-carbons.
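The gap check can be illustrated with a short sketch, assuming the condition is that the Cα distance across a gap in the template may not exceed 3.8 Å per residue of sequence separation in the target. The alignment representation (a list of target/template index pairs) and the function name are hypothetical.

```python
import numpy as np

CA_STEP = 3.8  # maximum C-alpha to C-alpha distance per residue, in Å


def trim_unphysical_gaps(pairs, template_ca):
    """Remove aligned residues flanking gaps the target chain cannot span.

    pairs: iterable of (target_index, template_index) aligned positions.
    template_ca: (N, 3) array of template C-alpha coordinates.
    The check is repeated until every remaining gap is consistent.
    """
    pairs = sorted(pairs)
    changed = True
    while changed:
        changed = False
        for k in range(len(pairs) - 1):
            (i, ip), (j, jp) = pairs[k], pairs[k + 1]
            if j - i == 1 and jp - ip == 1:
                continue  # contiguous in both sequences, so no gap here
            dist = np.linalg.norm(template_ca[ip] - template_ca[jp])
            if dist > CA_STEP * abs(i - j):
                del pairs[k:k + 2]  # drop the residues at both gap ends
                changed = True
                break
    return pairs
```

Each pass removes two aligned positions, so the loop always terminates, at worst with an empty alignment.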
For each template, these 10 trimmed alignments along with the template
contact map were used to generate candidate contact map predictions. For
every two target residues that were aligned to the template, if those residues in
the template were in contact, then a contact was predicted for those two
residues in the target. Each candidate contact map (C) was then scored using
the "contact free energy".
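A minimal sketch of how a candidate contact map might be copied from a template through an alignment is given below. It assumes that template contacts are defined by a Cα distance below 8 Å (the cutoff quoted for the contact potentials) and that an alignment is a list of (target index, template index) pairs; both the representation and the function names are illustrative.

```python
import numpy as np


def template_contact_map(ca, cutoff=8.0):
    """Boolean contact map from (N, 3) C-alpha coordinates."""
    ca = np.asarray(ca, dtype=float)
    dist = np.linalg.norm(ca[:, None, :] - ca[None, :, :], axis=-1)
    return dist < cutoff


def transfer_contacts(pairs, template_contacts, target_length):
    """Predict a target contact wherever two aligned template residues touch."""
    predicted = np.zeros((target_length, target_length), dtype=bool)
    for a, (i, ip) in enumerate(pairs):
        for j, jp in pairs[a + 1:]:
            if template_contacts[ip, jp]:
                predicted[i, j] = predicted[j, i] = True
    return predicted
```

Target residues that are not aligned to the template receive no predicted contacts, so a heavily trimmed alignment yields a sparse candidate map.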
Contact free energy (CFE) was calculated in three steps, as stated in detail in
the following paragraphs. In brief, the HMM state contact potentials were pre-calculated from the database. Secondly, the contact potential map for the target
sequence was calculated. Finally, the candidate contact maps were multiplied
by the contact potential map to give the CFE score.
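The second and third steps can be sketched as follows, using the quantities defined in the following paragraphs. The sketch assumes that the state probabilities are held in an array gamma of shape (L, S), that the pre-calculated potentials are an array G indexed by separation, and that the CFE is the sum of E(i,j) minus the mean potential <E> over predicted contacts with j > i+3. Taking <E> over all pairs at least four residues apart is a further assumption.

```python
import numpy as np

S_MIN, S_MAX = 4, 20  # range of sequence separations with their own potential


def contact_potential_map(gamma, G):
    """E(i, j) as the probability-weighted sum of state-pair potentials.

    gamma: (L, S) HMM state probabilities for the target.
    G: (S_MAX - S_MIN + 1, S, S) potentials, G[s - S_MIN][p, q].
    Pairs closer than S_MIN in sequence are left as NaN.
    """
    length = gamma.shape[0]
    E = np.full((length, length), np.nan)
    for i in range(length):
        for j in range(i + S_MIN, length):
            s = min(j - i, S_MAX)  # separations above 20 reuse s = 20
            E[i, j] = E[j, i] = gamma[i] @ G[s - S_MIN] @ gamma[j]
    return E


def contact_free_energy(contacts, E):
    """Sum of (E - <E>) over predicted contacts with j > i + 3."""
    length = E.shape[0]
    upper = np.triu(np.ones((length, length), dtype=bool), k=S_MIN)
    mean_e = E[upper].mean()
    selected = upper & np.asarray(contacts, dtype=bool)
    return float((E[selected] - mean_e).sum())
```

Subtracting the mean potential keeps a candidate map from scoring well simply by predicting many contacts.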
The HMMSTR position-specific state probability matrices (γ matrices [4]) were
pre-calculated for all the 1239 templates using HMMSTR, a hidden Markov
model for local sequence-structure correlations in proteins [5]. The HMM state
contact potential between any two HMM states p and q (G(p,q,s)) was
calculated as the negative log of the ratio between the sum of the product of
HMM state probabilities for states p and q at residues i and i+s, respectively,
that are in contact (C-alpha distance less than 8Å) in the template database, and
the sum of the same product over all residue pairs i and i+s (Equation 2).
$$G(p,q,s) = -\log \frac{\sum_{\mathrm{PDBSelect}} \sum_{i \,\mid\, D_{i,i+s} < 8\,\text{Å}} \gamma(i,p)\,\gamma(i+s,q)}{\sum_{\mathrm{PDBSelect}} \sum_{i} \gamma(i,p)\,\gamma(i+s,q)} \qquad (2)$$
where γ(i,p) is the probability of state p at position i in the sequence/structure,
calculated using the forward/backward algorithm [1,5]. The sensitivity of
discriminating contacts from non-contacts was increased by calculating G as a
function of the sequence separation s (4 ≤ s ≤ 20). For sequence separations
greater than 20, s=20 was used. The total number of potential functions was
1037153, one for every pair of 247 Markov states in HMMSTR and every
separation distance from 4 to 20.
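The count quoted above follows from 247 × 247 state pairs times 17 separations. A minimal sketch of how such potentials could be accumulated is shown below. It assumes each template is supplied as a pair of arrays (state probabilities and Cα coordinates), uses the natural logarithm, and adds a small pseudocount to avoid taking the logarithm of zero. The pseudocount and the function name are assumptions, and pairs separated by more than 20 residues are simply not counted here.

```python
import numpy as np

S_MIN, S_MAX, CONTACT_CUTOFF = 4, 20, 8.0


def state_contact_potentials(templates, n_states, pseudocount=1e-6):
    """Return G with shape (17, n_states, n_states); G[s - S_MIN][p, q].

    templates: iterable of (gamma, ca) with gamma of shape (L, n_states)
    and ca of shape (L, 3).
    """
    n_sep = S_MAX - S_MIN + 1
    num = np.zeros((n_sep, n_states, n_states))
    den = np.zeros_like(num)
    for gamma, ca in templates:
        dist = np.linalg.norm(ca[:, None, :] - ca[None, :, :], axis=-1)
        length = len(gamma)
        for s in range(S_MIN, min(S_MAX, length - 1) + 1):
            a, b = gamma[:-s], gamma[s:]  # states at i and at i + s
            in_contact = np.diagonal(dist, offset=s) < CONTACT_CUTOFF
            num[s - S_MIN] += a[in_contact].T @ b[in_contact]
            den[s - S_MIN] += a.T @ b
    return -np.log((num + pseudocount) / (den + pseudocount))


# One potential per ordered state pair and separation from 4 to 20.
N_POTENTIALS = 247 * 247 * (S_MAX - S_MIN + 1)
```

State pairs that are in contact more often than expected from their overall co-occurrence receive a negative (favorable) potential.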
The contact potential between residues i and j (E(i,j)) in the target was
calculated in the following way. First the target sequence profile file was
generated by PSI-BLAST [2], then the HMM state probability matrix (γ) was
generated by HMMSTR [5] using the sequence profile. Finally, the contact
potential was calculated as the γ-weighted sum of the contact potentials,
G(p,q,s), for all HMM state pairs (p and q) with sequence separation s = |i-j|.
$$E(i,j) = \sum_{p} \sum_{q} \gamma(i,p)\,\gamma(j,q)\,G(p,q,s)$$
Finally, the CFE was calculated by summing the contact potential of all the
pairs of residues that were predicted to be in contact in the candidate contact
map, C.
$$\mathrm{CFE} = \sum_{i,j \,\mid\, C_{ij}=1 \,\wedge\, j>i+3} \left( E(i,j) - \langle E \rangle \right)$$
where <E> is the mean contact potential for the target. For each template, we
calculated the CFE for all the 10 target contact map candidates and chose the
one with the best energy as the contact map prediction for that template.
After we carried out the above procedure for every template in our dataset, we
usually generated several hundred target contact map predictions. How to
evaluate them and choose one as the final prediction became a problem in itself.
The decision was made by referring to 4 parameters: the CFE, the Bayesian
alignment score, the compactness score and the sequence length similarity
between the target and the template. The primary parameter was the CFE since
it represented the free energy of the sequence aligned to the template. But as we
observed during the CASP5 prediction, better alignments (which were
represented by Bayesian alignment score and compactness) and similar lengths
between the target and template sequences improved the (perceived) prediction
quality. The strategy we used during the CASP5 prediction was to first sort the
predicted contact maps by CFE, and then among the top-scoring 20 – 30
contact maps, choose (subjectively) one candidate as the final prediction by
judging each contact map by eye, along with the other 3 parameters, and
manually editing if necessary.
Often several of the top-scoring candidates contained the same fold. Consensus
was considered a strong indicator, especially if the fold was uncommon.
Multiple candidates were sometimes used to construct a single composite map.
If no promising prediction, consensus prediction or composite was found, then
an ab initio contact map was made, based on contact potential alone, and
rule-based filtering/editing.
1. Hobohm U. and Sander C. (1994) Enlarged representative set of protein
   structures. Protein Science 3, 522.
2. Altschul S.F. et al. (1997) Gapped BLAST and PSI-BLAST: a new
   generation of protein database search programs. Nucleic Acids Res. 25
   (17), 3389-3402.
3. Zhu J. et al. (1998) Bayesian adaptive sequence alignment algorithms.
   Bioinformatics 14 (1), 25-29.
4. Rabiner L.R. (1989) A tutorial on Hidden Markov Models and selected
   applications in speech recognition. Proc. IEEE 77, 257-286.
5. Bystroff C. et al. (2000) HMMSTR: a Hidden Markov Model for local
   Sequence-Structure Correlations in Proteins. J. Mol. Biol. 301 (2), 173-190.
Cam-Biochem (P0447) - 74 predictions: 74 3D
Iteration of Alignment and Model Building, Using Novel
Techniques for Modelling and Validation
T. L. Blundell, V. Bolanos-Garcia, S. C. Brewerton, D. F. Burke,
L. Chen, M. V. Cubellis, P. I. W. de Bakker, M. A. DePristo, A.
C. B. Drake, M. T. Ehebauer, H. S. Gweon, N. J. Harmer, M. L.
Kilkenny, B. S. Kochupurakkal, M. J. Lai, C. M. C. Lobley, S. C.
Lovell, R. N. Miguel, K. Mizuguchi, R. W. Montaluoa, B.
Popovic, R. P. Shetty, L. A. Stebbings, J. J. C. Thorpe
Department of Biochemistry, University of Cambridge, 80 Tennis Court Road,
Cambridge, CB2 1GA, United Kingdom
simon@cryst.bioc.cam.ac.uk
In the CASP 5 experiment the Cam-Biochem group submitted models of
proteins predicted to have either close or distant relationships to proteins of
known structure. For both of these categories we used a number of novel
methods, many of which have been developed since CASP 4.
In the closely related/comparative modelling class we used FUGUE[1] to
identify evolutionary relationships between the targets and HOMSTRAD[2]
families. Alignments were often edited by hand. Backbone models of
conserved regions were built using SCORE[3], and backbone models of
variable regions using RAPPER[4,5] or SLOOP[6]. SCORE defines conserved
regions of the parents based on geometric features. It then selects the
appropriate conserved parent fragment based on the match between it and the
target sequence. The target sequence/parent fragment match is derived from
environmentally-constrained substitution tables. RAPPER combines both ab
initio and knowledge-based methods with model selection using state-of-the-art
energy functions. The ab initio method samples conformations using
fine-grained phi/psi state sets and side chain conformer libraries under
idealized-geometry, excluded-volume and chain-closure restraints, generating
thousands of plausible conformations. These conformations, supplemented with
compatible fragments from the PDB, are ranked by an all-atom statistical
potential (RAPDF) and by the AMBER force field with the generalized Born
continuum solvation model. SLOOP selects loops from a database of loop
conformations based on sequence/structure profiles and surrounding secondary
structure.
Side chains were then built with CELIAN[7], which models side chain
conformations by (1) borrowing side chain conformations, where appropriate,
from parent structures and (2) building remaining side chains from a
high-quality rotamer library[8], optimising the packing by using a SCWRL-like[9]
algorithm. Rules determining when to borrow conformations from parent
structures are based on an analysis of substitutions in defined structural
environments.
In the distantly related/fold recognition category we used FUGUE[1] in
conjunction with other fold recognition servers to identify potential
homologues and MODELLER to build models. FUGUE uses environment-specific substitution tables derived from the HOMSTRAD database, along with
structure-dependent gap penalties, to construct structural profiles. The latest
version (FUGUE2[10]) uses structural profiles enriched by homologous
sequences. In most cases, variable regions were rebuilt using RAPPER.
In previous CASPs, we have had a degree of success due to the care taken in
the production and validation of the alignment, and in validation of the
model[11, 12]. An important element of our strategy in CASP 5 was the use of
our new program, HARMONY3[13] which validates models and alignments. It
uses substitution probabilities, amino-acid propensities and a unique "alignment
flexibility" score to predict which regions of the alignment are likely to be
incorrect. Based on this, and on visual examination of alignments after
annotation with JOY[14], we iterated rounds of alignment and model building.
In CASP 4 we applied strict validation criteria which resulted in us submitting a
small number of high-quality models. In CASP 5 we took equal care with the
alignment and modelling procedure, but submitted a substantially larger
number of models.
1. Shi J. et al. (2001) FUGUE: sequence-structure homology recognition
   using environment-specific substitution tables and structure-dependent gap
   penalties. J. Mol. Biol. 310, 243-257.
2. de Bakker P.I.W. et al. (2001) HOMSTRAD: Adding sequence
   information to structure-based alignments of homologous protein families.
   Bioinformatics 17, 748-749.
3. Deane C.M. et al. (2001) SCORE: predicting the core of protein models.
   Bioinformatics 17 (6), 541-550.
4. DePristo M.A. et al. (2002) Ab initio construction of polypeptide
   fragments: Efficient generation of consistent, representative ensembles.
   Proteins Structure, Function and Genetics, in press.
5. de Bakker P.I.W. et al. (2002) Ab initio construction of polypeptide
   fragments: II. Accuracy of loop decoy discrimination by an all-atom
   statistical potential and the AMBER force field with the Generalized Born
   solvation model. Proteins Structure, Function and Genetics, in press.
6. Burke D.F. et al. (2001) Improved loop prediction from sequence alone.
   Protein Engineering 14 (7), 473-478.
7. Chen L. et al. unpublished.
8. Lovell S.C. et al. (2000) The Penultimate Rotamer Library. Proteins
   Structure, Function and Genetics 40, 389-408.
9. Bower et al. (1997) Sidechain prediction from a backbone-dependent
   rotamer library. A new tool for homology modelling. J. Mol. Biol. 267,
   1268-1282.
10. Shi J. et al. unpublished.
11. Burke D.F. et al. (1999) An iterative structure-assisted approach to
    sequence alignment and comparative modelling. Proteins Structure,
    Function and Genetics Suppl 3, 55-60.
12. Williams et al. (2002) Sequence-structure homology recognition by
    iterative alignment refinement and comparative modelling. Proteins
    Structure, Function and Genetics Suppl 5, 92-97.
13. Shi J. (2001) Thesis, University of Cambridge.
14. Mizuguchi et al. (1998) JOY: protein sequence-structure representation
    and analysis. Bioinformatics 14, 617-623.
Camacho-Carlos (P0098) - 46 predictions: 46 3D
Automated Consensus Method of Alignment for Accurate
Comparative Modeling
Jahnavi C. Prasad, Sandor Vajda, Carlos J. Camacho
Bioinformatics Program, Boston University, Boston, MA 02215
ccamacho@bu.edu
The quality of the alignment has been cited as the major determinant of the
accuracy of the final predicted structure in comparative modeling. Errors that
occur in the alignment stage cannot be corrected in the later stages.
In addition, the target may significantly diverge from the template in certain
regions, thus making it undesirable to model the entire target from that
template. Therefore, in addition to having the best possible alignment, it is also
important to identify the target regions that are likely to be structurally
dissimilar from the template. We have developed and completely automated a
method that addresses both these issues.
Several methods exist for alignment. In a separate benchmarking analysis [1],
we tested ten widely used methods and selected five of them in a hierarchical
manner so that we cover a broad range of alignments. We have developed a
methodology to build a consensus based on the alignments from variations of
the finally selected five methods. Each position in the consensus alignment is
assigned a confidence level. Then the regions reliable for homology modeling
are predicted by applying criteria involving secondary structure and solvent
exposure profile of the template, predicted secondary structure of the target,
consensus confidence level, template domain boundaries and structural
continuity of the predicted region with other predicted regions.
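The idea of a per-position confidence level can be illustrated with a small sketch. The abstract does not say how the confidence is computed, so the rule below (the fraction of methods that agree on the most frequent template position) and the dictionary representation of an alignment are assumptions made purely for illustration.

```python
from collections import Counter


def consensus_alignment(alignments, target_length):
    """Combine several target-to-template alignments position by position.

    alignments: list of dicts mapping target index -> template index
        (positions aligned to a gap are simply absent).
    Returns a list of (target_index, template_index_or_None, confidence).
    """
    consensus = []
    for pos in range(target_length):
        votes = Counter(a[pos] for a in alignments if a.get(pos) is not None)
        if not votes:
            consensus.append((pos, None, 0.0))
            continue
        template_pos, count = votes.most_common(1)[0]
        consensus.append((pos, template_pos, count / len(alignments)))
    return consensus
```

Positions on which all methods agree receive a confidence of 1.0, while positions where the methods disagree or place gaps are flagged with low values and can be excluded from modeling.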
The method is best suited to predict accurate (structural) alignments given a
template. For CAFASP3 we implemented a template search algorithm that
resembles PDB-BLAST. This method provided templates for X number of
targets, mostly homology models.
Alignments are then obtained from the five chosen methods. Method 1 is a
SAM-T99 [2] based method that uses a PSI-BLAST[4] multiple alignment of
target and template hits as its seed. The second method is similar but based on
HMMER[3]. Method 3 is a simple pairwise alignment by BLAST[4]. This
alignment is used only when it is long enough and the E-value for it is
acceptable. The fourth and fifth methods are again SAM-T99 and HMMER based
methods that use HSSP alignments [5] of the template as their seed alignment
for generating the first iteration HMM.
Then the consensus alignment is built from these alignments. Regions
suitable for homology modeling are then 'selected' in several stages, some of
which will be mentioned here. The first stage involves selection of highly
confident alignment regions corresponding to buried regions in the template
structure. In the second stage, these selected core regions are extended on
both sides to lower confidence levels until a glycine is encountered. Regions
that are buried, inside an alpha helix on the template side and satisfying a
certain consensus threshold are then selected. Similarly, regions corresponding
to beta sheets on the template are then selected subject to certain other
criteria. If the terminal regions of template or target are suspected to be
loose, i.e. dissimilar, then such regions are deselected.
After the selection procedure, the selected regions of target are predicted by
simply following the backbone of corresponding regions of the template. If the
template is multi-domain, target regions corresponding to each domain are
predicted separately. In such cases, full predictions were submitted as first
models and the domains as subsequent models. The entire method is automatic
and is available as a server at
http://structure.bu.edu/cgi-bin/consensus/consensus.cgi
1. Prasad J.C., Comeau S.R., Vajda S., Camacho C.J. Confident Homology
   Modeling Based On Consensus Alignment. Submitted for publication.
2. Hughey R. and Krogh A. (1995) SAM: Sequence alignment and modeling
   software system. Technical Report UCSC-CRL-95-7, Univ. of Calif.,
   Santa Cruz, CA.
3. Eddy S.R. (1998) Profile hidden Markov models. Bioinformatics 14,
   755-763.
4. Altschul S.F. et al. (1997) Gapped BLAST and PSI-BLAST: a new
   generation of protein database search programs. Nucleic Acids Res. 25
   (17), 3389-3402.
5. Holm L. and Sander C. (1996) Mapping the protein universe. Science
   273, 595-602.
Camacho-Carlos (P0099) - 184 predictions: 184 3D
Building Protein Structures from Spare Parts
Carlos J. Camacho
Department of Biomedical Engineering, Boston University, Boston, MA 02215
ccamacho@bu.edu
Protein structure motifs can be divided into two qualitatively different groups:
more or less unique disordered regions (mostly solvent-exposed loops), and a
rather limited set of recurrent motifs or spare parts (helices, sheets, alpha-turn-alpha, alpha-turn-beta, etc). Given the more than 10,000 crystal structures
available to date, it is very likely that most spare parts are already present in the
Protein Data Bank (PDB). With this in mind, we have developed an algorithm
named Consensus [1-2] that attempts to detect spare part motifs even in
sequences with no apparent sequence similarity with other proteins. We
generalized this automated procedure to select motifs from multiple templates,
and then combined them to predict protein models including side chains. Based
on sequence, template secondary structure, secondary structure prediction, and
simple energetic constraints (no functional information), we apply this new
technology to all CASP5 targets.
Specifically, for any given target sequence, we run Consensus on multiple
templates selected from CAFASP3 results, obtaining confident structural
alignments of different length. Then, guided by the confidence level associated
with each motif, we put together the pieces and build the scaffold of the protein
by methodically matching the junctions. We did not put much effort into predicting
all disordered regions. The main strength of the method is the ability of
Consensus to confidently select the portion of templates that bear structural
similarities with the target.
For CASP5, we had just finished the implementation of this software.
Therefore, some intermediate steps were done manually. Since the main
application we envision for the models generated by our technique is large-scale prediction of physical interactions between proteins, our first aim was to
predict only one high confidence model for each target. For practical reasons,
we often submitted one model and smaller portions of the same model. For the
first 13 targets (deadline before Aug 24th) the order of our submissions was
based on length (shorter alignment first) and not overall confidence. We
submitted multiple different models only for very few targets (mainly for the
short NMR targets). Due to time constraints we were not able to finish the
analysis for Targets 187 and 194, so we only submitted preliminary models. In
summary, we submitted 185 structures: one structure for 11 targets, two for 14,
three for 21, four for 13, and for 6 targets we submitted all 5 possible
models.
1. Prasad J.C., Comeau S.R., Vajda S., Camacho C.J. (2002) Confident
   homology modeling based on consensus alignment. Submitted for
   publication.
2. Prasad J.C., Vajda S., Camacho C.J. (2002) Using Consensus to predict
   confident structural alignments. Abstract CASP5 Asilomar Meeting.
CaspIta (P0108) - 133 predictions: 70 3D, 63 SS
Secondary Structure Prediction by Consensus and Homology
S. C. E. Tosatto1, M. Albrecht2, A. Cestaro1,
S. Toppo1 and G. Valle1
1 - CRIBI Biotechnology Centre, Universita' di Padova
2 - Max-Planck-Institut fuer Informatik
silvio@cribi.unipd.it
The secondary structure of the CASP targets was predicted by an automated
consensus of three public servers: Psi-Pred [1], ssPRO [2] and Sam-T2K [3].
The consensus is built from a jury decision of the three servers. In case of a tie,
where possible, the neighbouring secondary structure is extended, otherwise the
‘coil’ state is predicted. The prediction is then filtered to remove single
mispredictions (e.g. ‘EEEHEE’). Secondary structure elements shorter than two
(strand) or three residues (helix) are converted to ‘coil’.
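The consensus and filtering rules can be sketched as below. Several details are assumptions made for illustration: predictions are equal-length strings over H (helix), E (strand) and C (coil), a tie is resolved by extending the state of the preceding residue when one of the servers proposed it, and a single misprediction is any residue whose two neighbours agree with each other but not with it.

```python
from collections import Counter
from itertools import groupby

MIN_LENGTH = {"E": 2, "H": 3}  # shortest strand and helix that are kept


def jury_consensus(predictions):
    """Majority vote per residue; ties extend the preceding state or give coil."""
    consensus = []
    for column in zip(*predictions):
        state, votes = Counter(column).most_common(1)[0]
        if votes == 1:  # every server predicts a different state
            previous = consensus[-1] if consensus else "C"
            state = previous if previous in column else "C"
        consensus.append(state)
    return "".join(consensus)


def smooth_singletons(ss):
    """Replace a lone residue flanked by two identical states (EEEHEE)."""
    chars = list(ss)
    for i in range(1, len(chars) - 1):
        if chars[i - 1] == chars[i + 1] != chars[i]:
            chars[i] = chars[i - 1]
    return "".join(chars)


def drop_short_elements(ss):
    """Convert strands shorter than two and helices shorter than three to coil."""
    out = []
    for state, run in groupby(ss):
        length = len(list(run))
        if length < MIN_LENGTH.get(state, 0):
            state = "C"
        out.append(state * length)
    return "".join(out)


def predict(predictions):
    """Full pipeline: jury vote, singleton smoothing, short-element removal."""
    return drop_short_elements(smooth_singletons(jury_consensus(predictions)))
```

Applying the smoothing before the length filter means that an element interrupted by one stray residue is repaired rather than discarded.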
Targets with high sequence identity (>= 60%) to a template structure were
predicted using a different approach. In an attempt to reach higher prediction
rates, the true secondary structure of a constructed 3D model is used. The
model is constructed from the PSI-BLAST [4] alignment using comparative
modeling techniques as described in another abstract. The program DSSP [5] is
then used to extract the secondary structure.
1. Jones D.T. (1999) Protein secondary structure prediction based on
   position-specific scoring matrices. J. Mol. Biol. 292 (2), 195-202.
2. Baldi P. et al. (1999) Exploiting the past and the future in protein
   secondary structure prediction. Bioinformatics 15, 937-946.
3. Karplus K. et al. (2001) What is the Value Added by Human Intervention
   in Protein Structure Prediction? Proteins Suppl 5, 86-91.
4. Altschul S.F. et al. (1997) Gapped BLAST and PSI-BLAST: a new
   generation of protein database search programs. Nucleic Acids Res. 25
   (17), 3389-3402.
5. Kabsch W., Sander C. (1983) Dictionary of protein secondary structure:
   Pattern recognition of hydrogen-bonded and geometrical features.
   Biopolymers 22, 2577-2637.
CaspIta (P0108) - 133 predictions: 70 3D, 63 SS
Modeling Protein Structures Using a Combination of
Biological Information, Fold Recognition and
Loop Modeling Methods
S. C. E. Tosatto1, A. Cestaro1, E. Bindewald2,
F. Fogolari3 and G. Valle1
1 - CRIBI Biotechnology Centre, Universita' di Padova
2 - Bioinformatics Center, Buffalo University
3 - Science and Technology Dept., Universita' di Verona
silvio@cribi.unipd.it
The first step in predicting a protein structure consists in searching for database
information on the target sequence. Four cycles of PSI-BLAST [1] are
performed on the NR database of protein sequences with an e-value inclusion
threshold of 0.005. Only sequences with at least 30% sequence identity, e-value
<= 10^-10 and alignment of length at least 2/3 of the target sequence are
considered. In difficult cases the e-value inclusion threshold in PSI-BLAST is
relaxed to 0.02 and sequences up to an e-value of 10^-5 aligned for over half of
the target sequence are considered. The domain structure of the target is searched
in parallel on the PFAM database [2].
If the results allow the unambiguous identification of a function for the target,
the existence of structures associated with this function and their degree of
homology to the target are verified. For easier targets this is accomplished by
using the previously generated sequence profile to scan the PDBAA database
with PSI-BLAST (i.e. PDBBLAST protocol). If the target is associated with a
PFAM domain, the alignment is analyzed with respect to conserved and/or
functionally important residues and the position of gaps; the presence of a
representative structure is also verified. In order to accumulate more functional
information, the target is scanned for PROSITE patterns [3]. Only patterns
compatible with the characteristics of the target are considered, e.g. all
eukaryotic patterns are excluded in bacterial targets.
Once a plausible function has been established, the relevant structural and
biological features (e.g. substrate, interacting cofactors and/or ions, cellular
localization) are retrieved from the literature. For targets where no template
structure has been identified with sufficient confidence, the program
MANIFOLD [4] is used to suggest possible folds.
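The hit-selection thresholds quoted for the initial PSI-BLAST search can be written as a simple filter. The sketch below is illustrative only: the Hit record and function name are hypothetical, and it assumes that the 30% identity requirement is kept unchanged in the relaxed setting used for difficult cases, which the text does not state.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    identity: float       # fractional sequence identity, 0 to 1
    evalue: float         # PSI-BLAST e-value of the hit
    aligned_length: int   # number of target residues covered


def select_hits(hits, target_length, relaxed=False):
    """Keep hits passing the identity, e-value and coverage thresholds.

    Default: identity >= 30%, e-value <= 1e-10, coverage >= 2/3.
    Relaxed: e-value <= 1e-5 and coverage over one half.
    """
    selected = []
    for hit in hits:
        if hit.identity < 0.30:
            continue
        if relaxed:
            passes = (hit.evalue <= 1e-5
                      and hit.aligned_length > 0.5 * target_length)
        else:
            passes = (hit.evalue <= 1e-10
                      and hit.aligned_length >= 2.0 / 3.0 * target_length)
        if passes:
            selected.append(hit)
    return selected
```

Running the filter twice, first with the default and then with the relaxed thresholds, shows which additional sequences are admitted only for difficult targets.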
MANIFOLD is a fold recognition program based on sequence, secondary
structure, accessibility and functional similarity. For sequence similarity, it uses
the output of PDBBLAST performed on the SCOP 1.53 [5] database of domain
structures with a very large e-value cutoff (200). The consensus of secondary
structure, described in a different abstract, is compared to the predicted
secondary structure of all SCOP 1.53 domains using the segment overlap
criterion. The sequence and secondary structure similarities are augmented with
accessibility, predicted with ACCPRO [6], and function, as expressed by the
enzyme code (EC) number (where applicable), to rank the structures. Each of
the four features is weighted using a non-linear scoring function that mimics
the behaviour of a neural network.
The top 20 results from MANIFOLD are compared to the previously collected
functional information. The templates are re-ranked based on the characteristics
shared with the target, mainly function and interacting molecules. The SCOP
classes of all candidate templates are analyzed to establish the functional
similarity. If a sufficiently similar structure is found, a template is selected. If
the SCOP classes are very diverse and the target function uncertain, the search
is abandoned without selecting a template.
Depending on how the template was selected, one of three protocols is used to
align target and template sequences. In all cases, the automatically generated
alignment is inspected visually by locating proposed insertions and deletions on
the template structure. The position of insertions and deletions is shifted in
order to optimize their position relative to secondary structure elements.
For comparative modeling targets, i.e. those easily detected by PDBBLAST, 4
rounds of PSI-BLAST search are first performed for the target sequence against
the NR database of protein sequences. The sequences aligned in this way are
used to build a hidden Markov model (HMM) using the HMMer package [7].
This HMM is then used to align target and template.
Fold recognition targets which are believed to be related on the sequence level
are subjected to an extended version of the previous protocol. Rather than
constructing only an HMM of the target sequence, this is also done with the
template sequence. Different similarity cutoffs for inclusion in the HMMs are
empirically used to select a sequence alignment that will conserve most of the
secondary structure elements.
The third and most difficult case is for fold recognition targets which are either
believed to be related only on the secondary structure level or for which the
previous protocol fails to produce a satisfactory alignment. In these cases the
alignment is constructed from the output of the MANIFOLD program.
MANIFOLD uses a global alignment heuristic based on optimizing the
segment overlap measure of the secondary structure elements. In practice, these
alignments are a starting point for manual intervention, due to their
fragmented nature.
The model was generated using the package HOMER [8]. This involves the
following steps. First a raw model of the conserved parts is constructed from
the template. The backbone 3D coordinates of target amino acids aligned with
the template are copied. The coordinates of conserved side chains are modeled
at this stage, with only the Cβ atoms being copied for all others. SCWRL [9] is
used to place all missing side chain atoms. Some basic checks are performed on
the model, e.g. the RAPDF [10] and solvation [11] energies are calculated and
the model inspected visually, to exclude obvious errors.
Insertions and deletions are reconstructed after raw model generation using an
enhanced version of the fast divide & conquer loop modeling method [12]. This
method uses a database of pre-calculated loop fragments derived from a
Ramachandran plot distribution of (phi,psi) torsion angles to generate a set of
candidate loops. Candidates with steric clashes or amino acids in prohibited
areas of the Ramachandran distribution are filtered out. The remaining
solutions are then ranked according to a combination of RAPDF energy and
geometric fit to the anchor regions. Enhancements to the published protocol
consist of adding side chains with SCWRL to the top 20 solutions after ranking
before performing a restrained local minimization, typically 500 cycles of
conjugate gradient, for each solution. These solutions are then re-ranked
according to their CHARMM [13] force field energy and inspected visually to
select the most appropriate solution. The extended protocol was found to
improve the overall quality of the results. After repeating the above procedure
for all insertions and deletions, the model is subjected to a limited local
minimization with CHARMM, typically 100 steps of steepest descent, to
reduce bad contacts. The final model is again inspected visually before being
submitted.
1. Altschul S.F. et al. (1997) Gapped BLAST and PSI-BLAST: a new
generation of protein database search programs. Nucleic Acids Res. 25
(17), 3389-3402.
2. Bateman A. et al. (2000) The Pfam protein families database. Nucleic
Acids Res. 28, 263-266.
3. Falquet L. et al. (2002) The PROSITE database, its status in 2002. Nucleic
Acids Res. 30, 235-238.
4. Bindewald E. et al. (2002) In preparation.
5. Murzin A.G. et al. (1995) SCOP: A structural classification of proteins
database for the investigation of sequences and structures. J. Mol. Biol.
247, 536-540.
6. Pollastri G. et al. (2002) Prediction of coordination number and relative
solvent accessibility in proteins. Proteins 47(2), 142-53.
7. Eddy S.R. (1998) Profile Hidden Markov Models. Bioinformatics 14 (9),
755-763.
8. Tosatto S.C.E. et al. (2002) In preparation.
9. Bower M.J. et al. (1997). Prediction of protein side-chain rotamers from a
backbone-dependent rotamer library: A new homology modeling tool. J.
Mol. Biol. 267, 1268-1282.
10. Samudrala R., Moult J. (1998) An all-atom distance-dependent conditional
probability discriminatory function for protein structure prediction. J. Mol.
Biol. 275, 895-916.
11. Jones D.T. (1999) GenTHREADER: An efficient and reliable protein fold
recognition method for genomic sequences. J. Mol. Biol. 287, 797-815.
12. Tosatto S.C.E. et al. (2002) A divide and conquer approach to fast loop
modeling. Protein Eng. 15(4), 279-286.
13. MacKerell J.A.D. et al. (1998) All-hydrogen empirical potential for
molecular modeling and dynamics studies of proteins using the
CHARMM22 force field. J. Phys. Chem. B 102, 3586-3616.
CBC-FOLD (P0008) - 151 predictions: 151 3D
What’s so Good About Real Proteins?
Ajay K. Royyuru, Ruhong Zhou, Prasanna Athma, B. David
Silverman, Gelonia Dent and Rosalia Tungaraza
Computational Biology Center, IBM Thomas J. Watson Research Center,
Yorktown Heights, NY 10598, USA
ajayr@us.ibm.com
The basic protocol used in our CASP5 effort was to create initial candidate
structures with Comparative Modeling or Fold Recognition processes and to
refine them with enhanced sampling methods using an all-atom force field with
a continuum solvent model. A newly developed hydrophobic profiling
procedure was used to filter candidates before and after the refinement.
For targets with high sequence similarity to existing proteins in PDB, we used
Comparative Modeling. The target sequence was aligned to each sequence in
the pdb_select95 database using PSI-BLAST. The alignment score was based on
the BLOSUM62 substitution matrix, and the sequence with the highest score was
selected as the best template. To generate an all-atom model, we copied
coordinates of the backbone atoms of aligned residues from the template to the
target. Sidechain coordinates were copied if the atom type in the target was
identical or very similar to that in the template. Coordinates of remaining
sidechain atoms and gaps in the template were generated with random initial
locations and the simulated annealing protocol in X-PLOR. The candidate
structure was then subjected to hydrophobic profiling and refinement. For targets
without sufficient sequence similarity, we used a Fold Recognition process. We
used 3DJury alignments from the CAFASP3 server to build all-atom models
corresponding to every alignment and scored and refined them using the same
protocol.
One interesting question about protein structures is: What is so good about real
proteins? Recently, a detailed spatial profiling of hydrophobicity in native
proteins revealed that the shape of the 2nd-order moment profiles is
comparable, yielding a nearly constant quantity called the hydrophobic ratio [1]. The
shape of the hydrophobic profile is similar for 5387 non-redundant globular
protein domains in PDB, with hydrophobic ratio 0.71 +/- 0.08. This surprising
and interesting feature was thus used in our new procedure, Hydro, to screen
candidate structures. Furthermore, a new hydrophobic score has been defined
to detect native-like proteins from decoy structures. Thorough tests on three
widely used decoy sets showed very encouraging results [2]. We also examined
the correlations between hydrophobicity and proximity to the surface of
residues along the sequence. The hydrophobic moment profiling and
hydrophobic score provided useful complementary information to the force
field calculations and were used as a filtering tool before and after refinement.
The candidate structures were then refined with enhanced conformational space
sampling using the Replica Exchange Method [3]. The method utilizes high
temperature walkers to cross over the energy barriers, which would otherwise
be difficult for low temperature walkers to overcome. The OPLSAA force field
was adopted with a continuum solvent model, the Surface Generalized Born
(SGB) model. The following procedure was used:
(a) Conjugate gradient minimization followed by a short molecular dynamics
equilibration at 310 K;
(b) Extensive conformational space search with the Replica Exchange
Method, using 21 replicas at temperatures ranging from 300 K to 500 K;
(c) Conformations sampled at 310 K are clustered to retain structures that
differ by at least 1 Å (1st clustering);
(d) Structures from each cluster bin are minimized;
(e) Structures are ranked by OPLSAA/SGB energy to identify the one with the
lowest energy;
(f) The ensemble of structures sampled at the various temperatures is examined
to identify those within 1 Å of the lowest energy structure identified above;
these structures are then clustered again to retain structures that differ by
>0.25 Å (2nd clustering);
(g) Structures from the 2nd clustering are minimized for 1000 steps;
(h) Structures are finally ranked by OPLSAA/SGB energy;
(i) The five lowest energy structures are analyzed through structural alignment
to identify distinct and optimal models.
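The replica exchange step in (b) can be illustrated with a minimal sketch of a 21-replica temperature ladder and the standard Metropolis exchange criterion. The geometric spacing of temperatures, the energy units (kcal/mol) and all function names are assumptions; the abstract does not describe how the ladder was spaced or how exchanges were scheduled.

```python
import math
import random

K_B = 0.0019872041  # Boltzmann constant in kcal/(mol*K)


def temperature_ladder(t_min=300.0, t_max=500.0, n_replicas=21):
    """Geometrically spaced temperatures (the spacing rule is an assumption)."""
    ratio = (t_max / t_min) ** (1.0 / (n_replicas - 1))
    return [t_min * ratio ** i for i in range(n_replicas)]


def swap_probability(energy_i, temp_i, energy_j, temp_j):
    """Metropolis acceptance probability for exchanging two replicas.

    P = min(1, exp[(beta_i - beta_j) * (E_i - E_j)]) with beta = 1 / (k_B T).
    """
    beta_i = 1.0 / (K_B * temp_i)
    beta_j = 1.0 / (K_B * temp_j)
    delta = (beta_i - beta_j) * (energy_i - energy_j)
    return 1.0 if delta >= 0.0 else math.exp(delta)


def attempt_swaps(energies, temps, rng=random):
    """Try to exchange each neighbouring pair once.

    Configurations are represented here only by their energies; a real
    simulation would swap coordinates (or temperatures) as well.
    """
    energies = list(energies)
    for i in range(len(temps) - 1):
        p = swap_probability(energies[i], temps[i], energies[i + 1], temps[i + 1])
        if rng.random() < p:
            energies[i], energies[i + 1] = energies[i + 1], energies[i]
    return energies
```

An exchange is always accepted when the colder replica currently holds the higher-energy conformation, which is how low-energy structures found at high temperature migrate down the ladder.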
Predictions for 23 targets were obtained through Comparative Modeling and 35
through Fold Recognition.
1. Silverman B. D. (2001), Hydrophobic moments of protein structures:
spatially profiling the distribution. Proc. Natl. Acad. Sci. 98, 4996-5001.
2. Zhou R. and Silverman B. D. (2002), Detecting native protein folds among
large decoy sets with hydrophobic moment profiling, Pac. Symp.
Biocomput. 02, 673-84.
3. Zhou R., Berne B. J. and Germain R. (2001) The free energy landscape for
β-hairpin folding in explicit water, Proc. Natl. Acad. Sci. 98, 14931-14936.
CBRC (P0041) - 385 predictions: 279 3D, 105 SS, 1 DR
Integrating a New Fold Recognition Method with an
Exhaustive Molecular Modeling System: FORTE1 and
FORTE-SUITE
K. Tomii1, T. Hirokawa1, T. Noguchi1,
A. Suenaga2 and Y. Akiyama1
1 - Computational Biology Research Center,
National Institute of Advanced Industrial Science and Technology, Japan
2 - Bioinformatics Group, RIKEN Genomic Science Center, Japan
akiyama-yutaka@aist.go.jp
The CBRC team attempted to submit TS/AL prediction results for all CASP5
targets. Basically, our prediction method is a pipeline composed of two or three
steps: (1) fold recognition and alignment by the new FORTE1 program, (2)
exhaustive 3-D structure modeling by the FORTE-SUITE system and, if
needed, (3) molecular dynamics simulation for further structure refinement.
(1) Fold recognition and alignment by FORTE1
We have devised a novel profile-profile comparison technique to increase the
sensitivity of fold recognition and improve alignment accuracy. The FORTE1
program by Tomii et al. [1] has distinct features of measuring similarity
between two profiles as compared with other published methods, such as FFAS
[2] and the method developed by Yona and Levitt [3], which exploit alignment
information. The FORTE1 program utilizes the sequence profiles of both a
target and templates to predict the structure of target sequence. The sequences
of templates were derived from the ASTRAL [4] (version 1.59) 40% identity
list and the selected PDB entries which are not registered in SCOP (1.59
release) database. Using an exceptionally powerful computational resource (the
Magi cluster, http://www.cbrc.jp/magi/), we ran up to 20 PSI-BLAST
iterations against the NCBI non-redundant database to prepare the profiles of
both target and templates. In profile comparisons the global-local
algorithm was employed to build an optimal alignment of a query sequence
profile onto a template one. Statistical significance of each alignment score was
estimated by calculating Z-score with a simple log-length correction. The
candidates of the templates were sorted by Z-scores, and then prediction results
in the AL format were submitted.
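A minimal sketch of a length-corrected Z-score is given below. The abstract states only that a "simple log-length correction" was used, so the specific form chosen here (dividing the raw score by the natural log of the template length before standardizing over the library) is an assumption, as are the function and argument names.

```python
import math
import statistics


def length_corrected_z_scores(hits):
    """hits: list of (template_id, raw_score, template_length), length > 1.

    Returns (template_id, z) pairs sorted by decreasing Z-score.
    """
    # Assumed correction: normalise each raw score by ln(template length).
    corrected = [(tid, raw / math.log(length)) for tid, raw, length in hits]
    values = [c for _, c in corrected]
    mean = statistics.mean(values)
    stdev = statistics.pstdev(values)
    if stdev == 0.0:
        # All corrected scores are identical: no hit stands out.
        return [(tid, 0.0) for tid, _ in corrected]
    z = [(tid, (c - mean) / stdev) for tid, c in corrected]
    return sorted(z, key=lambda pair: pair[1], reverse=True)
```

With this form, two templates with the same raw score are separated by length, the shorter one receiving the higher Z-score.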
(2) 3D structure modeling and evaluation
3D models were built with the target-parent(s) alignments from FORTE1. Two
molecular modeling programs, MODELLER [5] and SegMod [6], were used.
The modeling scheme of MODELLER is to optimize probability density
functions for each of the restraint features of the model, while SegMod is a
segment match modeling using a database of known protein X-ray structures.
For longer-loop cases, we gave priority to the SegMod results. Human
inspection, multiple sequence alignments and secondary structure predictions
were also used to refine the target-parent(s) alignments where possible. The
modeling process we followed is divided into two categories by FORTE1
Z-score level: (a) for CM and easy FR targets, simple modeling was done using
only promising parents with very high Z-scores, and (b) for FR/NF targets,
exhaustive modeling was performed utilizing available parents (maximum 100
parents each for both modeling programs) with a medium Z-score level, and
final models were selected based on the structural quality score (q-score)
calculated by Verify3D [7]. For the latter process, Hirokawa developed a
semi-automatic exhaustive modeling and evaluation system on a parallel computer
environment, called FORTE-SUITE. The q-score ordering was in some cases
overridden by human intervention, when related knowledge was available from
literature or other bioinformatics analysis results.
(3) Molecular dynamics simulation for further structure refinement
When the q-score of the model was not sufficiently high (typically not greater
than 0.5), we performed molecular dynamics calculations in order to obtain
a structure with a better q-score. In all simulations, the parm96 force field was
adopted and an explicit water model was used. Non-bonded interactions
were calculated without cutoff approximation using the Barnes-Hut tree algorithm.
For CASP5, we employed two different parallelized MD programs on different
parallel computers. The first program is called the MolTreC2 [8] running on the
Magi cluster (976 Gflops in total, 30 Gflops in average for a simulation) at
CBRC and was operated by Noguchi and Akiyama. The second program is a
modified version of AMBER6.0 [9] (AMBER for the MDM special-purpose
computer) running on a PC with two MD-Grape2 boards (16 chips, 240 Gflops)
and was operated by Suenaga at RIKEN. With the MolTreC2 simulation, we
tried 14 CASP5 targets, and 7 models for 5 of these targets were submitted
because they showed improved q-scores (T0132_4, T0140_5, T0153_2,
T0155_2, T0180_{1,2,3}). The simulations typically ran for one or two
nanoseconds. With the AMBER simulation, we tried three CASP5 targets,
and 3 models for 2 of these targets were submitted (T0129_4,
T0135_{1,2}).
(4) Secondary structure prediction
We have also submitted SS predictions for all CASP5 target proteins based on
our fold recognition techniques. The SS-FORTE program was newly developed
for CASP5 and uses the secondary structures of 50 template proteins
suggested by FORTE1, with each template weighted according to its FORTE1
Z-score. We also combined the result with that of our previous program, the New
SSThread [10], which utilizes an averaged output from other threading methods.
The secondary structure predictions were mainly done by Noguchi.
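The Z-score-weighted consensus used by SS-FORTE might look roughly like the sketch below. It assumes each template's secondary structure has already been mapped onto the target through its FORTE1 alignment, with '-' marking uncovered residues; the three-state alphabet, the coil default and the function name are illustrative assumptions.

```python
from collections import defaultdict


def weighted_ss_consensus(target_length, template_predictions):
    """template_predictions: list of (z_score, ss_string) pairs.

    Each ss_string has one character per target residue: 'H', 'E', 'C',
    or '-' where the template alignment does not cover that residue.
    Positions covered by no template default to coil ('C').
    """
    consensus = []
    for pos in range(target_length):
        votes = defaultdict(float)
        for z_score, ss in template_predictions:
            state = ss[pos]
            if state != "-" and z_score > 0.0:
                votes[state] += z_score  # each template votes with its Z-score
        consensus.append(max(votes, key=votes.get) if votes else "C")
    return "".join(consensus)
```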
1. Tomii K. et al. (2002) Fold recognition using FORTE1 server, in CASP5.
2. Rychlewski L. et al. (2000) Comparison of sequence profiles. Strategies
for structural predictions using sequence information. Protein Science. 9
(2), 232-241.
3. Yona G. et al. (2002) Within the twilight zone: a sensitive profile-profile
comparison tool based on information theory. J. Mol. Biol. 315 (5),
1257-1275.
4. Chandonia J.M. et al. (2002) ASTRAL compendium enhancements.
Nucleic Acids Res. 30 (1), 260-263.
5. Sali A. and Blundell T.L. (1993) Comparative protein modeling by
satisfaction of spatial restraints, J. Mol. Biol. 234, 779-815.
6. Levitt M. (1992) Accurate modeling of protein conformation by automatic
segment matching, J. Mol. Biol. 226, 507-533.
7. Luthy R., Bowie J.U. and Eisenberg D. (1992) Assessment of protein
models with three-dimensional profiles, Nature 356, 83-85.
8. Misoo K, Akiyama Y. et al. (2000) Development of Molecular Dynamics
Programs for Protein with a Parallelized Barnes-Hut Code, Proc. HPCAsia 2000, 1103-1111.
9. Case D.A., et al. (1999) Amber 6.0, University of California San
Francisco.
10. Noguchi T. et al. (2001) Prediction of Protein Secondary Structure Using
the Threading Algorithm and Local Sequence Similarity, Research. Comm.
in Biochem., Cell & Mol. Biol., 5, 115-131.
CBSU (P0417) - 173 predictions: 173 3D
An Attempt to Improve over the CAFASP Prediction Servers
by Means of Manual Intervention
D. Ripoll and J. Pillardy
Computational Biology Service Unit, Cornell Theory Center - Cornell
University; Rhodes Hall, Ithaca NY 14853-3801
cbsu@tc.cornell.edu
The process of structure prediction of the CASP5 targets was carried out using
any sequence and structure information that was possible for us to gather. The
principal source of structural information for each target was obtained from the
CAFASP summary web page[1]. The templates used in the structure generation
of our models were selected using the following conditions: (a) CAFASP
predictions from different servers that consistently pointed to a structure (or
domain of a structure) from PDB[2]; (b) If the servers provided predictions with
low-level of confidence, only those templates that were consistent with the
secondary structure prediction of the respective target sequence were further
analyzed; (c) Server predictions showing similarities to isolated fragments only
were discarded.
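The three conditions can be read as a simple filter over the server hits, sketched below. The consensus size, the data layout and the function name are assumptions, and the actual selection was made manually from the CAFASP summary pages rather than by a script.

```python
from collections import Counter


def shortlist_templates(server_hits, ss_compatible, min_servers=3):
    """Apply conditions (a)-(c) to a set of server predictions.

    server_hits: dict mapping server name -> (template_id, fragment_only).
    ss_compatible: set of template ids judged consistent with the target's
        secondary structure prediction.
    min_servers: hypothetical number of agreeing servers counted as consensus.
    """
    # (c) discard predictions that match isolated fragments only
    usable = [tid for tid, fragment_only in server_hits.values() if not fragment_only]
    votes = Counter(usable)
    selected = set()
    for tid, n_servers in votes.items():
        if n_servers >= min_servers:
            selected.add(tid)      # (a) several servers agree on this template
        elif tid in ss_compatible:
            selected.add(tid)      # (b) weak hit, but secondary structure agrees
    return sorted(selected)
```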
For each template, attempts were made to improve the sequence alignments
provided by the servers by generating all-atom 3D models[3] with reasonable
loops and hydrophilic/hydrophobic arrangements of the residues. In addition,
in the process of model generation of some targets, we tried to use information
from structural neighbors of the template protein. The Combinatorial Extension
method[4] was used to obtain the corresponding sequence alignments of
proteins having low sequence similarity and low Cα rms deviations with the
template. This information was used to attempt to improve the alignment
between the target and template sequences obtained from the servers.
1. http://www.cs.bgu.ac.il/~dfischer/CAFASP3/summaries/index.html
2. Berman H.M., Westbrook J., Feng Z., Gilliland G., Bhat T.N., Weissig H.,
Shindyalov I.N., Bourne P.E. (2000). The Protein Data Bank. Nucleic
Acids Research 28, 235-242.
3. Šali A., Blundell T.L. (1993). Comparative protein modelling by
satisfaction of spatial restraints. J. Mol. Biol. 234, 779-815.
4. Shindyalov I.N., Bourne P.E. (1998). Protein structure alignment by
incremental combinatorial extension (CE) of the optimal path. Protein
Engineering 11(9), 739-747.
Celltech (P0028) - 347 predictions: 347 3D
Error Detection in Sequence-Structure Alignment Using
HARMONY3
J. Shi
Celltech R&D Inc, 1631 220th Street SE, Bothell, WA 98021, USA
jshi@sea.celltechgroup.com
MOTIVATION: In the comparative modeling practice, the accuracy of
sequence-structure alignment is the dominant factor of the quality of final
models [1; 2; 3]. The initial sequence-structure alignment for distant
homologues, even produced by experienced modeling experts, often contains
erroneous regions. In practice, predictors can identify alignment errors that
have significant impact on the model quality according to the problems they
find in the 3D models built based on the alignment, and can thus iteratively
refine the alignments and improve the quality of the models [4]. Unfortunately,
this is a subjective and time-consuming process, and requires detailed insights
into protein structures, which is not an option for novices or large-scale
modeling efforts. No methods have yet been reported to automate this process.
OBJECTIVES: An automated procedure, HARMONY3, has been developed
recently in Blundell’s lab to tackle this problem [5]. Our key objectives in
CASP5 are to validate the capability of HARMONY3 in identifying erroneous
alignment regions that have significant impact on the model quality, and to test
the potential of integrating this procedure into fully automated comparative
modeling platform.
METHODS: In the CASP5 experiment, we used FUGUE [6] to identify
structural templates of the target, and to produce sequence-structure alignment.
HARMONY3 was then used to build models (using MODELLER [7]) and to
predict problematic alignment regions (see below for detailed description on
the HARMONY3 methodology). If any problematic regions were found, the
alignment was manually adjusted and then passed to HARMONY3 again to
form a refinement iteration. Due to limited resources (only 1 predictor available),
the number of iterations was limited to 2, and no more than 2 hours of human
intervention were spent on each target. As a result, neither literature nor
functional information was used for the predictions.
Thus, the performance of HARMONY3 can be evaluated by the comparison
between our results and the results of the FUGUE servers, which are registered
with CAFASP3 under the names of FUGUE2 and FUGUE3. We are interested
to see whether 2 hours of human intervention on the problematic
sequence-structure alignment regions, as predicted by HARMONY3, could improve the
accuracy of the final models. If the HARMONY3 predictions were incorrect,
our results should be no better than the results of the FUGUE servers. We are
also interested to see whether the structural templates were indeed incorrect
when HARMONY3 indicated global alignment errors, even after manual
refinement.
The HARMONY3 protocol [5] consists of the following steps. (a) Five models
were generated from a given sequence-structure alignment using MODELLER
[7]. (b) The observed local structural environment, as defined by main-chain
conformation, solvent accessibility and hydrogen-bonding status, was
calculated from the models for each residue. (c) The observed amino acid
distribution at each sequence position was derived from an alignment between
the target sequence and its sequence homologues collected by PSI-Blast [8]. (d)
An amino acid propensity score M was calculated from the agreement between
the observed local environment of each residue and the expected value based
on known structures. (e) An amino acid substitution score N was calculated
from the agreement among the observed amino acid distribution and two
expected distributions, one being predicted from the local environment of the
models and the environment-specific substitution tables [6], and the other from
an environment-independent substitution table. (f) A local alignment flexibility
score F was calculated for each position of the given sequence-structure
alignment. (g) Local alignment errors were predicted by evaluating the M, N
and F scores and averaging the results over 5 models.
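Step (g) can be illustrated with the sketch below, which averages the per-residue M and N scores over the models and reports contiguous stretches of suspect positions. How HARMONY3 actually combines M, N and F is not stated here, so the rule used (M and N both low while F is high), the cutoffs and the minimum region length are all hypothetical.

```python
def average_over_models(per_model_scores):
    """Average per-position scores over models (list of equal-length lists)."""
    n_models = len(per_model_scores)
    return [sum(col) / n_models for col in zip(*per_model_scores)]


def flag_regions(m_scores, n_scores, f_scores,
                 m_cut=0.0, n_cut=0.0, f_cut=0.5, min_length=3):
    """Return (start, end) index pairs (inclusive) of suspect alignment regions.

    m_scores, n_scores: one list of per-position scores per model.
    f_scores: a single per-position list (F depends on the alignment only).
    A position is suspect when the model-averaged M and N scores are both
    below their cutoffs and the flexibility score F is above its cutoff.
    """
    m_avg = average_over_models(m_scores)
    n_avg = average_over_models(n_scores)
    suspect = [m < m_cut and n < n_cut and f > f_cut
               for m, n, f in zip(m_avg, n_avg, f_scores)]
    regions, start = [], None
    for i, bad in enumerate(suspect + [False]):  # sentinel closes a final run
        if bad and start is None:
            start = i
        elif not bad and start is not None:
            if i - start >= min_length:
                regions.append((start, i - 1))
            start = None
    return regions
```

Averaging before thresholding means that a position flagged in only one of the five models is unlikely to be reported, which matches the stated aim of reducing noise from random modeling errors.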
The M and N scores were used to assess the correctness of the model. It has
been reported that the amino acid propensity score M could be used to describe
whether a model is reasonable from the perspective of sequence-structure
compatibility [9]. However, this score uses only the structural information
derived from the models, and makes no use of the readily available information
from the sequence homologues of the target. In HARMONY3, the amino acid
substitution score N was introduced to take into account such information. The
assumption was that the structural constraints on amino acid substitutions
provide extra predictive power on the amino acid distribution at each sequence
position [6]. Thus, if the local environment of the model is correct, the
position-specific amino acid distribution derived from sequence homologues of the
target should agree better with the expected distribution derived from the
environment-specific substitution tables, than with that derived from the
environment-independent substitution table.
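The comparison behind the N score can be sketched as a log-odds between the two expected distributions, weighted by the observed frequencies. This is an illustrative stand-in, not the published HARMONY3 formula; the function name, the frequency floor and the dictionary layout are assumptions.

```python
import math


def substitution_log_odds(observed, env_specific, env_independent, floor=1e-6):
    """Observed-frequency-weighted log-odds of two expected distributions.

    All arguments map amino acid one-letter codes to frequencies. A positive
    value means the observed distribution agrees better with the
    environment-specific expectation than with the environment-independent one.
    """
    score = 0.0
    for aa, freq in observed.items():
        if freq <= 0.0:
            continue
        # Floor the expected frequencies so that log() never sees zero.
        p_env = max(env_specific.get(aa, 0.0), floor)
        p_bg = max(env_independent.get(aa, 0.0), floor)
        score += freq * math.log(p_env / p_bg)
    return score
```

For a position modeled as buried, an observed column rich in Leu and Ile would score positive against a flat background, while the same column at a position modeled as exposed would tend to score negative and hint at a misalignment.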
The local alignment flexibility score F was introduced to account for the
observation that local alignment errors are more likely to occur in the regions of
low sequence identity and the regions where many insertion/deletions are
found. Furthermore, there is a problem with the M and N scores: the empirically
derived structural restraints cannot be applied towards functional sites, where
the amino acid conservation/substitution is mainly constrained by functional
reasons. The score F can also “mask” functional sites and minimize such
problems, because functional sites are usually well conserved.
The combination of the M, N and F scores indicated the problems in a model
that were most likely introduced by alignment errors. Averaging over 5 models
further reduced the noise from random modeling errors. The final result was
mapped to the sequence-structure alignment to indicate erroneous alignment
regions.
1. Moult J., et al. (1999). Critical assessment of methods of protein structure
prediction (CASP): round III. Proteins Suppl(3), 2-6.
2. Venclovas C., et al. (1999). Some measures of comparative performance in
the three CASPs. Proteins Suppl(3), 231-7.
3. Johnson M.S., et al. (1994). Knowledge-based protein modeling. Crit Rev
Biochem Mol Biol 29(1), 1-68.
4. Williams M.G., et al. (2001). Sequence-structure homology recognition by
iterative alignment refinement and comparative modeling. Proteins
Suppl(5), 92-7.
5. Shi J. & Blundell T.L. (unpublished).
6. Shi J., et al. (2001). Fugue: sequence-structure homology recognition using
environment-specific substitution tables and structure-dependent gap
penalties. J Mol Biol 310(1), 243-57.
7. Sali A., et al. (1993). Comparative protein modelling by satisfaction of
spatial restraints. J Mol Biol 234(3), 779-815.
8. Altschul S.F., et al. (1997). Gapped BLAST and PSI-BLAST: a new
generation of protein database search programs. Nucleic Acids Res 25(17),
3389-402.
9. Luthy R., et al. (1992). Assessment of protein models with
three-dimensional profiles. Nature 356(6364), 83-5.
CHEN-WENDY (P0264) - 37 predictions: 37 3D
A Newtonian Force-Based Algorithm for Mixed-Integer
Optimization in Comparative Protein Modeling
J.L. Pellequer1, G. Imbert1, O. Pible1 and S-w. W. Chen2
1 - CEA Valrhô – Centre de Marcoule – DSV/DIEP – Unit of post-genomique
Biochemistry and Nuclear Toxicology. BP17171 – 30207 Bagnols sur Cèze –
France, 2 - 13 ave. de la Mayre – 30200 Bagnols sur Cèze – France
cmft551@yahoo.com
Our comparative modeling approach is based on semi-automatic prediction
schemes with continual user intervention. Putative template molecules were
identified using the CAFASP3 web server, with strong
emphasis placed on threading methods. Protein sequences of identified putative
templates were aligned with each other using CLUSTALW/T-COFFEE. When the
resulting multiple alignments were unsatisfactory, we applied a new
methodology that collects all protein sequences of putative templates and
re-aligns them to select the most appropriate one. Subsequently, a pair-wise
alignment was generated by taking into account the locations of secondary
structure elements of the selected template. Indel locations were identified
through an in-depth analysis of the three-dimensional structure of the selected
template. In modelling CASP5 targets, most of the time was spent on the sequence
alignment.
The positions of side-chain atoms were placed using our automatic program.
Replaced side chains were clustered and optimised in two steps: first, rotamers
(from Tufféry et al. 1991. J. Biomol. Struct. Dynam. 8:1267-1289) of side
chains were optimised at a cluster level, and second, the chi dihedral angles of
each side chain were minimized at a residue level. Most CASP5 targets
resulted in one large cluster (>50 residues) and several other small ones. We
employed the Powell algorithm to perform minimization. We chose the
non-bonded energy (in the 12-6-1 format) with CHARMM-22 all-atom force field
parameters as a scoring function.
Loop closure was partially constructed automatically. In a similar manner to
side-chain minimization, the backbone conformation of loops was also
determined using the Powell algorithm. The side-chain conformations of loop
residues were taken into account by optimising their rotamer. Identification of
edged residues and conformational refinements of loops for residue deletions
and insertions were manually performed using computational graphics.
Additionally, minor refinements were performed by XPLOR energy
minimization in the CHARMM-22 force field. To avoid over-minimization, the
convergence criterion was set to between 1 and 4 kcal/mol/Å while the
Coulombic interaction was turned off for minimizing side-chain atoms. Each
model was visually scrutinized to identify potential conflicts in side-chain
conformations and to maximize side-chain-to-main-chain hydrogen bonds.
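The non-bonded energy in the 12-6-1 format used as the side-chain scoring function (a 12-6 Lennard-Jones term plus a 1/r Coulomb term) can be written out as in the sketch below. It uses the CHARMM-style Lennard-Jones form and electrostatic constant; the data layout, the function names and the omission of bonded-neighbour exclusions are simplifying assumptions, not a description of the authors' program.

```python
import math

COULOMB_CONSTANT = 332.0716  # kcal*Angstrom/(mol*e^2), CHARMM-style value


def pair_energy(r, eps, r_min, q_i, q_j, use_coulomb=True):
    """12-6 Lennard-Jones plus 1/r Coulomb energy for one atom pair (kcal/mol).

    r: distance in Angstrom; eps: well depth (positive); r_min: distance of
    the Lennard-Jones minimum; q_i, q_j: partial charges in units of e.
    """
    ratio6 = (r_min / r) ** 6
    lj = eps * (ratio6 * ratio6 - 2.0 * ratio6)
    coulomb = COULOMB_CONSTANT * q_i * q_j / r if use_coulomb else 0.0
    return lj + coulomb


def nonbonded_energy(atoms, params, use_coulomb=True):
    """Sum pair energies over all atom pairs.

    atoms: list of (x, y, z, charge, type); params: dict mapping a sorted
    (type, type) tuple to (eps, r_min). Exclusions for bonded neighbours are
    omitted for brevity.
    """
    total = 0.0
    for a in range(len(atoms)):
        for b in range(a + 1, len(atoms)):
            xa, ya, za, qa, ta = atoms[a]
            xb, yb, zb, qb, tb = atoms[b]
            r = math.dist((xa, ya, za), (xb, yb, zb))
            eps, r_min = params[tuple(sorted((ta, tb)))]
            total += pair_energy(r, eps, r_min, qa, qb, use_coulomb)
    return total
```

A function of this shape is what a derivative-free optimiser such as Powell's method would evaluate while varying rotamers or chi angles; the `use_coulomb` switch corresponds to turning the Coulombic interaction off.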
CHIMERA (P0153) - 94 predictions: 94 3D
Comparative Modeling Using CHIMERA Modeling System
Mayuko Takeda-Shitaka, Chieko Chiba, Hirokazu Tanaka,
Daisuke Takaya and Hideaki Umeyama
Kitasato University
shitakam@pharm.kitasato-u.ac.jp
Our laboratory registered group CHIMERA [1, 2] in CASP5 and group FAMS
[3, 4] in CAFASP3. The procedure of group FAMS, a fully automatic modeling
system, is very important and essential for large-scale genome modeling. In
some cases, however, procedures using human intervention are more
accurate than fully automated modeling procedures. Therefore we tried to
construct more accurate models with human intervention.
We constructed 3D model structures using the CHIMERA modeling system, which
predicts protein structure based on homology modeling methods using more
than one reference protein. The modeling procedure is 1) selection of reference
proteins, 2) alignments, and 3) construction of model structures. CHIMERA is
a partially automatic modeling system that enables human intervention at
the necessary stages.
Selection of reference proteins
Searches for reference proteins are based on results shown by group FAMS
(see abstract of group FAMS). We select reference proteins according to the
target information given beforehand by the CASP5 organizers, related papers,
secondary structure predictions and the CAFASP3 meta-server.
Alignments
We generate alignments taking biologically important regions, secondary
structure predictions, homology, hydrophobic core etc. into consideration.
Multiple templates are used when possible.
Construction of model structures
First, the main chain is constructed by loop searches if necessary. Second, side
chains are replaced by suitable amino acids, keeping the original side-chain
torsional angles where possible. Short contacts within 2.0 angstrom are
removed.
Accuracy of the models depends on selection of reference proteins and on
generating alignments. If reference proteins and alignments are wrong, model
structures become wrong even though the modeling software is reliable.
Therefore, we laid emphasis on these steps. The results show a tendency for us
to select the same reference proteins as group FAMS in high homology cases,
and different ones in low homology cases. Even when the reference
proteins were the same, we manually modified alignments to maintain active site
residues, secondary structures, hydrophobic core etc.
1. Yoneda, T., Komooka, H. and Umeyama, H. (1997) A computer modeling
study of the interaction between tissue factor pathway inhibitor and blood
coagulation factor Xa. J. Protein Chem. 16, 597-605.
2. Takeda-Shitaka, M. and Umeyama, H. (1998) Effect of exceptional valine
replacement for highly conserved Ala55 in serine proteases. FEBS Lett.
425, 448-452.
3. Ogata, K. and Umeyama, H. (2000) An automatic homology modeling
method consisting of database searches and simulated annealing. J. Mol.
Graphics Mod. 18, 258-272.
4. Iwadate M., Ebisawa K. and Umeyama H. (2001) Comparative Modeling
of CAFASP2 Competition. Chem-Bio Informatics Journal 1, 136-148.
CHIMERAX (P0170) - 74 predictions: 74 3D
Full Length Protein Modeling Using CHIMERA eXtending
Procedure
Genki Terashi, Ryota Yamatsu, Youji Kurihara, Mayuko Takeda-Shitaka,
Mitsuo Iwadate and Hideaki Umeyama
Kitasato University
kuriharay@pharm.kitasato-u.ac.jp
PSI-BLAST and other programs provide different alignments, including several
alignments covering different ranges of the target and reference proteins.
Homology modeling is performed for each alignment, and the resulting template
models are used to construct the full-length protein. Generally, a full-length
protein is not obtained from a single alignment because of the low homology
between the target sequence and the reference. In such cases, several template
models are connected by overlapping a few amino acid residues between
neighboring template models. In addition, other connecting methods are used to
make full-length protein models.
Method
Making of template models and selection of base model
After template models are made for many alignments, a primary candidate is
selected as the base reference protein for the homology modeling. The primary
base protein is selected from the modeling products of the FAMS [1, 2], FAMSD and
CHIMERA [3, 4] teams in Umeyama’s group. The selection criteria are the
alignment length and the degree of matching between the secondary structures
predicted for the target and those calculated for the reference.
Architecture of secondary structure database
Fragment structures containing more than two secondary structure elements are
modeled using the initially obtained alignments, and these are stored as a
secondary database.
Connecting base model with template models
Other template models are fitted onto the base protein model, step by step, until the
full-length protein is produced.
Extension of models with secondary structure database
N-terminal or C-terminal regions of the target that the fitted models do not
cover are modeled by a similar RMS fitting
procedure, using the modeling database based upon secondary structures.
Model refinement
Finally, to refine the protein model connected from several modeled
proteins, we use the FAMS (full automatic modeling system) program again.
Results and Discussion
For a high-homology primary reference protein, the model covers almost the
full length of the target protein, because nearly complete alignments
between target and reference proteins are obtained, with only small
insertions and deletions. As a result, the region over which the model must
be extended is very small, and the planned extension of the primary base protein is
very easy. In the case of low homology to the primary reference
protein, however, some models are extended substantially when template
models are fitted onto the primary base protein. Even so, some low-homology models have
good overall modeling features, because the locally modeled regions are thought to
be comparatively correct structures: the template and base models are built from
reliable alignments with low E-values.
1. Iwadate M., Ebisawa K., and Umeyama H. (2001) Homology modeling of
CAFASP2 competition, Chem-Bio Informatics J. 1 (4), 136-148.
2. Ogata K., and Umeyama H. (2000) An automatic homology modeling
method consisting of database searches and simulated annealing. J. Mol.
Graph. Model. 18 (3), 258-272, 305-306.
3. Yoneda T., Komooka H. and Umeyama H. (1997) A computer modeling
study of the interaction between tissue factor pathway inhibitor and blood
coagulation factor Xa. J. Protein Chem. 16, 597-605.
4. Takeda-Shitaka M. and Umeyama H. (1998) Effect of exceptional valine
replacement for highly conserved Ala55 in serine proteases. FEBS Lett.
425, 448-452.
CIRB (P0397) - 263 predictions: 200 3D, 63 RR
Prediction of the Residue-Residue Contacts With Neural
Networks
P. Fariselli1, O. Olmea2, A. Valencia2 and R. Casadio1
1 - CIRB and Dept. of Biology, University of Bologna, Via Irnerio 42, 40126 Bologna, Italy.
2 - Protein Design Group, CNB-CSIC, Cantoblanco, Madrid 28049, Spain.
piero@biocomp.unibo.it, casadio@alma.unibo.it
We use an ab initio method based on neural networks to predict residue-residue
contacts. Our networks were trained to learn the association rules between the
covalent structure of each protein from a selected data base and its contact map.
The neural network implemented here is similar to that previously described in
[1] and called NET, including in the input code evolutionary information in the
form of sequence profile.
For training the network, we use a large set of non-homologous proteins of
known 3D structure. The list includes all proteins in the PDB-select list of non
sequence-redundant protein structures whose chains were not interrupted and
for which alignments with more than 15 sequences were obtained: in total our
set includes 173 proteins [1,2].
We consider two residues to be in contact when the Euclidean distance between
the coordinates of the corresponding C-beta atoms is lower than 8 Å
($\|r_i - r_j\| < 8$ Å).
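The contact definition above can be illustrated with a minimal sketch. The function name `contact_map` and the toy coordinates are hypothetical and serve only to show the 8 Å C-beta threshold; they are not taken from the authors' software.

```python
import math

def contact_map(cb_coords, cutoff=8.0):
    """Return a symmetric boolean matrix: True where the C-beta distance < cutoff (in Å)."""
    n = len(cb_coords)
    cmap = [[False] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            if math.dist(cb_coords[i], cb_coords[j]) < cutoff:
                cmap[i][j] = cmap[j][i] = True
    return cmap

# Toy C-beta coordinates (hypothetical, not from a real structure)
coords = [(0.0, 0.0, 0.0), (5.0, 0.0, 0.0), (12.0, 0.0, 0.0)]
cmap = contact_map(coords)
```

The diagonal is left as `False` here; whether a residue counts as in contact with itself is a convention the abstract does not state.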
The topology of the neural network consists of: (i) a single output neuron, which
codes for contact (output value close to 1) and non-contact (output value close
to 0); (ii) a hidden layer of 8 neurons; (iii) an input coding of 1050 input
neurons, which represent the ordered pairs (in the parallel and anti-parallel
pairing of two segments, each 3 residues long) as described in [1].
The basic novelty is that, after the neural network predictions, a filter procedure
is applied to impose an upper bound on the possible number of contacts
per residue. Based on the output value, the procedure eliminates the least
probable contacts for those residues whose number of predicted contacts is
larger than 10. Finally, the backbone connectivity and the secondary
structure predictions are included in the filtering procedure. This is done in two
ways: 1) setting the intra-helical predictions as contacts, 2) increasing or
decreasing the number of contacts among strands depending on their relative
network activation values. The average accuracy, measured as the number of
correct contacts divided by the number of predicted contacts and evaluated with a
cross-validation procedure, is about 0.20.
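A minimal sketch of the per-residue cap and of the accuracy measure described above. The greedy removal order (least probable pair first) is an assumption about how the filter might be realized, and the names `cap_contacts` and `accuracy` are hypothetical.

```python
def cap_contacts(scores, max_per_residue=10):
    """scores: dict {(i, j): network output}, i < j. Returns the set of retained pairs."""
    kept = set(scores)
    # Visit pairs from least to most probable and drop a pair while either
    # of its residues still has more than max_per_residue retained contacts.
    for pair in sorted(scores, key=scores.get):
        i, j = pair
        count_i = sum(1 for p in kept if i in p)
        count_j = sum(1 for p in kept if j in p)
        if count_i > max_per_residue or count_j > max_per_residue:
            kept.discard(pair)
    return kept

def accuracy(predicted, native):
    """Number of correct contacts divided by number of predicted contacts."""
    return len(predicted & native) / len(predicted) if predicted else 0.0

# Toy example: residue 0 is predicted in contact with residues 1..12
scores = {(0, j): j / 100 for j in range(1, 13)}
kept = cap_contacts(scores)
```

In this toy case the two lowest-scoring pairs are removed, leaving residue 0 with exactly 10 contacts.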
1. Fariselli P. et al. (2001) Prediction of contact maps with neural networks
and correlated mutations. Protein Eng 14, 835-843.
2. Fariselli P. et al. (2001) Progress in predicting inter-residue contacts of
proteins with neural networks and correlated mutations. Proteins: Suppl 5,
157-162.
CIRB (P0397) - 263 predictions: 200 3D, 63 RR
Building on the Basics: Protein Fold Recognition/Threading
Based on “Generalized” Profile Alignments
E. Capriotti1,3, P. Fariselli2, I. Rossi2,3, and R. Casadio2
1 - Dept. of Physics/CIRB, University of Bologna, Italy, 2 - Dept. of
Biology/CIRB, University of Bologna, Italy, 3 - BioDec srl, Bologna, Italy
ivan@biocomp.unibo.it, casadio@alma.unibo.it
We developed Tangram, a method for protein fold recognition/threading based
on sequence profile alignments.
The profile-profile comparison algorithm known as BASIC [1] can be
generalized to represent any type of “property” profile comparison. Assuming
that A and B are two strings of symbols, $P_A$ and $P_B$ are the rectangular matrices
representing the position-specific frequency of the alphabet symbols composing
the strings (superscript T indicates a matrix transpose operation), and S is a
(symmetric) substitution matrix, it can be derived that the matrix D, defined as:
$$D = P_A^T \, S \, P_B$$
represents the “dot” matrix for the profile comparison of the two strings. This
can be efficiently computed by means of standard linear algebra routines. The
D matrix can be searched for high-scoring alignments by means of standard
techniques.
In Tangram, for a given target/template comparison, we compute a generalized
dot matrix D as follows:
$$D = a\,[P_A^T \, Bl_{62} \, P_B] + b\,[P_{A,ss}^T \, S_{ss} \, P_{B,ss}] + c\,[P_A^T \, C \, P_B]$$
where a, b and c are the weights of the linear combination, $P_X$ and $P_{X,ss}$ are the
composition and secondary-structure profiles, $Bl_{62}$ is the BLOSUM62 [2]
substitution matrix, $S_{ss}$ is a secondary-structure substitution matrix proposed in
[3], and C is a long-range contact capacity potential [4]. Then D is searched for the
top scoring alignment using the local Smith-Waterman dynamic programming
algorithm [5].
The composition profiles are generated by multiple alignment of the sequences
reported from a three-iteration PSI-BLAST [6] search on the Non-Redundant
database, using an inclusion threshold of $E = 10^{-3}$. Secondary structure profiles
for the target are generated by means of a neural network predictor [7]. Our
template set comprises 2167 PDB structures whose sequence homology is less
than 40%, which has also been used to derive the long-range contact capacity
potential.
On a set comprising 185 template/target couples of PDB structures that share
the same SCOP label with less than 25% sequence identity, we measured 70%
accuracy in detecting the correct SCOP assignment.
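As an illustration of the generalized dot matrix and its search by local dynamic programming, here is a minimal sketch on a two-letter toy alphabet. The function names, the linear gap penalty, and the toy matrices are assumptions made for the example; Tangram's actual weights, gap scheme, and matrices are not given in the abstract.

```python
def dot_matrix(PA, S, PB):
    """D = PA^T S PB, with PA (K x n) and PB (K x m) profiles and S a K x K matrix."""
    K, n, m = len(S), len(PA[0]), len(PB[0])
    SPB = [[sum(S[k][l] * PB[l][j] for l in range(K)) for j in range(m)]
           for k in range(K)]
    return [[sum(PA[k][i] * SPB[k][j] for k in range(K)) for j in range(m)]
            for i in range(n)]

def combine(matrices, weights):
    """Weighted sum of equally sized dot matrices, e.g. a*D1 + b*D2 + c*D3."""
    n, m = len(matrices[0]), len(matrices[0][0])
    return [[sum(w * M[i][j] for w, M in zip(weights, matrices)) for j in range(m)]
            for i in range(n)]

def smith_waterman_score(D, gap=1.0):
    """Best local alignment score over D, using a linear gap penalty (an assumption)."""
    n, m = len(D), len(D[0])
    H = [[0.0] * (m + 1) for _ in range(n + 1)]
    best = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            H[i][j] = max(0.0,
                          H[i - 1][j - 1] + D[i - 1][j - 1],
                          H[i - 1][j] - gap,
                          H[i][j - 1] - gap)
            best = max(best, H[i][j])
    return best

# Toy data: alphabet {a, b}; both strings are "ab" with one-hot profile columns
S = [[1, -1], [-1, 1]]
PA = [[1, 0], [0, 1]]   # rows = symbols, columns = positions
PB = [[1, 0], [0, 1]]
D = dot_matrix(PA, S, PB)
D_mix = combine([D, D], [0.5, 0.5])
score = smith_waterman_score(D_mix)
```

With one-hot profiles the dot matrix reduces to the substitution scores of the aligned symbols, which is why `D` equals `S` in this toy case.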
1. Rychlewski J. et al. (1998) Fold and function predictions for Mycoplasma
genitalium proteins. Fold. Des. 3, 229-238.
2. Henikoff S. et al. (1998) Superior performance in protein homology
detection with the BLOCKS database server. Nucleic Acids Res. 26, 309-312.
3. Wallqvist A. et al. (2000) Iterative sequence/secondary structure search for
protein homologs: comparison with amino acid sequence alignments and
application to fold recognition in genome databases. Bioinformatics 16,
988-1002.
4. Alexandrov N.N. et al. (1996) Fast protein fold recognition via sequence to
structure alignment and contact capacity potentials. Pac Symp Biocomput.,
53-72.
5. Smith T.F. and Waterman M.S. (1981) Identification of common molecular
subsequences. J. Mol. Biol. 147, 195-197.
6. Altschul S.F. et al. (1997) Gapped BLAST and PSI-BLAST: a new
generation of protein database search programs. Nucleic Acids Res. 25
(17), 3389-3402.
7. Jacoboni I. et al. (2000) Predictions of protein segments with the same
amino acid sequence and different secondary structure: a benchmark for
predictive methods. Proteins 41, 535-544.
DelCLAB (P0050) - 310 predictions: 310 3D
Protein Folding Prediction by Spectral Analysis Methods
Carlos A. Del Carpio-Muñoz
Lab. for BioInformatics. Dept. of Ecological Eng.
Toyohashi University of Technology,
Tempaku. Toyohashi. 441-8580
carlos@translell.eco.tut.ac.jp
Our methodology [1-2] combines the spectral analysis of the physicochemical
properties of the amino acids constituting a particular sequence with a
consensus analysis of the secondary structure predicted for that sequence by
several previously reported methodologies, including those participating in the
CAFASP contest.
1. Spectral Representation of a Protein Folding
We adopted a well-known technique of front-end processing in robust
automatic speech recognition (ASR), the objective of which is to preserve
critical linguistic information while suppressing irrelevant information such as
speaker-specific characteristics, channel characteristics, and noise. This
analysis-synthesis technique is based on the transformation of a signal into its
cepstrum, which is a measure of the periodic wiggliness of a frequency response
plot. The cepstrum is calculated as the logarithm of the power spectrum of a
signal and leads to a logarithmic periodogram for which the spectral envelope is
obtained as a smooth curve connecting the main local peaks of the
fine structure of the frequency spectrum.
Applied to the profile of physicochemical features of
the protein sequence, the technique allows extraction of information in the form of the
spectral envelope, which is used to model the relationship between the primary
and tertiary structures of a protein.
2. Spectral Alignment and Protein Structure Similarity
Comparison of two sequences is then reduced to an alignment of spectral
envelopes representing the primary structures. After the profile of
physicochemical characteristics is obtained, it is converted to the frequency domain by
applying a Fourier transform.
Spectral similarities are then obtained by aligning the spectra representing two
primary sequences using a dynamic programming (DP) algorithm. The
hypothesis underlying the methodology is that patterns that are similar in
spectral space represent similar folding patterns in proteins. In
ordinary sequence alignment by DP, a penalty is imposed when no match
occurs and the search can continue in both the vertical and horizontal
directions; this cannot be done with spectral matching, since it would give the
match operation unlimited flexibility. To avoid this negative effect when
using DP for spectral alignment, a gradient constraint is imposed on the search so that the
match continues smoothly in the diagonal direction. This amounts to reducing
the number of gaps in the DP process, since a horizontal or vertical advance is
allowed only once and consecutive gaps do not occur.
The similarity of two sequences in frequency space is then computed as the
Euclidean distance between the spectral harmonics. Values close to zero indicate
high similarity, while higher values indicate increasing dissimilarity.
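The spectral representation and the distance measure described above can be sketched as follows. This is only an illustration under several assumptions: a Kyte-Doolittle hydropathy scale stands in for the AAINDEX parameters, the profile is mean-centered, a naive discrete Fourier transform is used, and harmonics are compared index by index instead of through the constrained DP alignment. All function names are hypothetical.

```python
import cmath
import math

# Kyte-Doolittle hydropathy values, used here as one example property scale
HYDROPATHY = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5, "Q": -3.5, "E": -3.5,
    "G": -0.4, "H": -3.2, "I": 4.5, "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8,
    "P": -1.6, "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}

def log_power_spectrum(sequence, scale=HYDROPATHY, eps=1e-12):
    """Log power spectrum of the property profile, harmonics 1..n//2."""
    profile = [scale[aa] for aa in sequence]
    n = len(profile)
    mean = sum(profile) / n
    centered = [x - mean for x in profile]
    spectrum = []
    for k in range(1, n // 2 + 1):
        coeff = sum(x * cmath.exp(-2j * math.pi * k * t / n)
                    for t, x in enumerate(centered))
        spectrum.append(math.log(abs(coeff) ** 2 / n + eps))
    return spectrum

def spectral_distance(spec_a, spec_b):
    """Euclidean distance over the harmonics shared by both spectra."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(spec_a, spec_b)))

spec_1 = log_power_spectrum("AAAALLLL")
spec_2 = log_power_spectrum("ALALALAL")
d_same = spectral_distance(spec_1, spec_1)
d_diff = spectral_distance(spec_1, spec_2)
```

Identical spectra give a distance of zero, matching the convention that values close to zero mean high similarity.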
3. Dominant Parameters and the SCOP Data Base
Finding the parameters for which the alignments are optimal (within a protein
folding category) leads to the determination of the dominant physicochemical
properties for a particular fold, class, family and finally super-family of
proteins.
This evaluation is performed for the proteins recorded in the SCOP database.
The analysis is carried out after preprocessing the
structural information found in each family and super-family in the database.
To maintain sequence diversity within each super-family,
sequences with more than 80% similarity are deleted. Physicochemical
parameters are obtained from the AAINDEX database, which is a compilation
of 434 amino acid indices for the twenty naturally occurring residues.
Dominant physicochemical parameters for each class of proteins are obtained
by aligning the spectra of all pairs of proteins constituting the class,
using each of the 434 amino acid indices. The five indices for which the
spectral alignment scores are the highest are selected as the dominant
physicochemical parameters.
4. Threading of the Target Sequence onto the Template of a Candidate
Folding Pattern
Since the methodology allows the comparison of amino acid sequences of
different lengths, it poses some difficulties when the candidate
folding patterns are used as templates for modeling the target protein. Since the
number of amino acids in the candidate may be larger or smaller than that in
the target, we propose a combination of two procedures to
build an unknown protein from instances derived from an analogical analysis such as
the one presented here. The two procedures are:
Derivation of a consensus secondary structure based on secondary structure prediction
methodologies.
A threading algorithm based on a Genetic Algorithm.
We describe each of these procedures briefly:
4.1) Consensus Secondary Structure.
Several methodologies for predicting the secondary structure of a sequence of
amino acid residues have been proposed in the literature; more than 20 methods
are available through the Internet. In addition, a consensus
secondary structure obtained from analysis of the CAFASP contest gives
an idea of the secondary structure of the target. We have developed an
alignment of secondary structures between that of the recognized folding
pattern (the template structure output by the spectral analysis described above)
and the consensus secondary structure.
4.2) Threading based on a Genetic Algorithm
This algorithm threads the target sequence of amino acid
residues onto the template candidate. The procedure uses the results of the
secondary structure alignment from the preceding step and then builds the
target structure by cutting and/or inserting pieces of backbone in the template
so as to achieve the consensus secondary structure. To constrain
distortions of the backbone to a reasonable difference in RMS, this threading
operation is executed by a genetic algorithm, the penalty function being that of
least RMS deviation.
1. Del Carpio C.A. and Yoshimori A. (2002) Fully Automated Protein
Tertiary Structure Prediction Using Fourier Transform Spectral Methods.
Protein Structure Prediction: Bioinformatic Approach. Edited by: Igor
Tsigelny. University of California. International University Line Inc. 173-197.
2. Del Carpio-Muñoz C.A. (2002) Folding Pattern Recognition in Proteins
Using Spectral Analysis Methods. Genome Informatics. In Press.
Doniach (P0401) - 42 predictions: 42 3D
Ab Initio Protein Structure Prediction Method - Topological
Assembly Of Predicted Secondary Structure Segments
Wenjun Zheng1 and Sebastian Doniach1,2
1 – Department of Physics, 2 – Department of Applied Physics
Stanford University, Stanford, CA94305
zhengwj@stanford.edu
The basic idea of our method is to assemble predicted secondary structure
segments (helices and strands) so that the 'hydrophobic' center of mass of each
segment is in contact with at least one other.
These hydrophobic contacts are reached in two ways. First, if the number
of secondary structure segments is small, human observation is used to specify
a short list of possible contact schemes, and folding simulations are run to satisfy
them as pairwise contact constraints. Second, if the number of secondary
structure segments is large and the number of combinations is beyond human
enumeration, we run the folding simulation directly multiple times, trying
to pair each helix/strand with a non-predefined counterpart, thus generating
multiple models for later selection. The folding simulation is implemented by a
Monte Carlo simulated annealing process in which the fitness score is compiled
from the contact constraints, Rg and the hydrophobic burial score [1].
To better model the formation of beta sheets, we construct all topologically
different ways of forming a beta sheet from the predicted beta strands before
running the folding simulation.
The selection of the top 5 predictions is based on the following criteria: 1.
compactness measured by Rg of all non-loop residues; 2. burial of hydrophobic
residues [1]; 3. human observation of helix docking and strand pairing.
1. Huang E. S., Subbiah S. & Levitt M. (1995). Recognizing native folds by
the arrangement of hydrophobic and polar residues. J Mol Biol 252 (5),
709-720.
DOROTA (P0589) - 1 prediction: 1 3D
Structure Based Modeling of the Target T0190
D. Sawicka
Lawrence Livermore National Laboratory, Livermore, California
sawicka1@llnl.gov
The AS2TS server [1] was used to select the main template required for model
building and also to identify possible templates for loop building. Structure
based modeling of the high homology target T0190 was performed using the
crystal structure of human transthyretin (pdb code 1f41) as the main
template [2]. Modeler [3] was used to generate the 3D coordinates based on
1f41 as the template, using an alignment generated by PSI-BLAST [4]. Only one
region in the target required loop building, for which the human GM2-activator
protein (pdb code 1g13) was used [5]. Side chains were built using the
SCWRL program [6], and PROCHECK was used to assess the quality of the
final model [7].
1. Zemla A.: http://protein.llnl.gov/as2ts
2. Hornberg A., Eneqvist T., Olofsson A., Lundgren E., Sauer-Eriksson A.E.
(2000) A Comparative analysis of 23 structures of the amyloidogenic
protein transthyretin. J Mol Biol, 302, 649.
3. Sali A. and Blundell T.L. (1993) Comparative protein modelling by
satisfaction of spatial restraints. J Mol Biol, 234, 779-815.
4. Altschul S.F., Madden T.L., Schaffer A.A., Zhang J., Zhang Z., Miller W.
and Lipman D.J. (1997) Gapped BLAST and PSI-BLAST: a new
generation of protein database search programs. Nucleic Acids Res, 25,
3389-3402.
5. Wright CS, Li S-C, Rastinejad F. (2000) Crystal Structure of Human
GM2-Activator protein with a novel beta-cup topology. J Mol Biol, 304, 411.
6. Bower M.J., Cohen F.E. and Dunbrack R.L., Jr. (1997) Prediction of
protein side-chain rotamers from a backbone-dependent rotamer library: a
new homology modeling tool. J Mol Biol, 267, 1268-1282.
7. Laskowski R. A., MacArthur M. W., Moss D. S. & Thornton J. M. (1993)
PROCHECK: a program to check the stereochemical quality of protein
structures. J Appl Cryst, 26, 283-291.
Dunbrack (P0329) - 46 predictions: 46 3D
Comparative Modeling of CASP5 Targets
R. L. Dunbrack, Jr., Y. Li, G. Wang, and A. A. Canutescu
Fox Chase Cancer Center, Philadelphia PA USA
RL_Dunbrack@fccc.edu
We used two different profile-profile alignment/search methods to identify
known structures homologous to the CASP5 targets. Profiles for all sequences
in the PDB and for the target sequences were derived with PSI-BLAST[1]
using a common procedure as follows. The non-redundant protein sequence
database was searched for 5 rounds with E-value cutoff ("-h") for inclusion in
the position-specific score matrix of 0.002. We checked for drift by noting
whether hits in one round of PSI-BLAST with E-value better than 0.002 were
present in subsequent rounds with E-value worse than 0.002. If drift occurred,
then we used the last round without drift. All hits identified with E-value less
than 10.0 were saved and placed in a new database. Multiple sequence
alignments were then created by searching this database with PSI-BLAST. The
multiple alignment was culled at 98% sequence identity to remove redundant
sequence information, and the sequences were weighted according to the
method of Henikoff and Henikoff [2] to produce the sequence profile.
We have developed two new scoring mechanisms for profile-profile
alignments. The first is a Dirichlet mixture substitution matrix (DIMSUM)
analogous to ordinary amino acid substitution matrices, but in which the scores
represent probabilities of substituting profile columns for one another. The
columns in the profiles are represented as components of a Dirichlet mixture
developed from multiple sequence alignments and structural characteristics
(secondary structure and surface exposure). The DIMSUM matrices were
developed from structure alignments of homologous proteins using the CE
program [3] in a manner similar to the BLOSUM matrices [4]. The
profile-profile alignments are performed with a standard local-alignment dynamic
programming algorithm.
The second scoring method is a combination of an amino acid substitution
matrix and a matrix that represents the probability of predicted secondary
structure in one profile (the CASP target) aligning to known secondary
structure in the PDB entry. This matrix (SSAAC) was also developed from
structure alignments by determining the substitution rates of predicted
secondary structure in one protein in each structural alignment versus known
secondary structure in the other protein. We combined both DIMSUM and
SSAAC with a structure-derived amino acid substitution matrix (SDM) [5],
applied to the two profile columns, such that the score is
$\sum_{i,j} p_i q_j S_{ij}$, where $p_i$ and $q_j$ are the probabilities of amino acid
types i and j in the two
columns and $S_{ij}$ is the element from the substitution matrix. We use a gap
penalty scheme that is dependent on the evolutionary distance of the two
profiles. The scoring schemes were optimized at 50% SDM/50% DIMSUM for
the DIMSUM method and 65% SDM/35% SSAAC for the SSAAC method.
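The column score and the weighted combination can be sketched as below. The two-letter alphabet, the matrix values, and the function names are hypothetical; only the form of the sum over i, j of p_i q_j S_ij and the 50/50 and 65/35 weightings come from the text.

```python
def column_score(p, q, S):
    """Expected substitution score between two profile columns: sum_ij p_i q_j S_ij."""
    return sum(p[a] * q[b] * S[a][b] for a in p for b in q)

def combined_score(p, q, S, other_score, w_sdm=0.5):
    """Mix the SDM column score with a second term (e.g. DIMSUM or SSAAC)."""
    return w_sdm * column_score(p, q, S) + (1.0 - w_sdm) * other_score

# Toy two-letter alphabet and made-up substitution values
S = {"A": {"A": 2.0, "G": -1.0}, "G": {"A": -1.0, "G": 2.0}}
p = {"A": 0.5, "G": 0.5}   # column from the target profile
q = {"A": 1.0, "G": 0.0}   # column from the template profile

sdm = column_score(p, q, S)
dimsum_like = combined_score(p, q, S, other_score=1.5, w_sdm=0.5)
ssaac_like = combined_score(p, q, S, other_score=1.5, w_sdm=0.65)
```

Here `other_score` is a placeholder for whatever the DIMSUM or SSAAC matrix would return for the same pair of columns.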
We applied both the DIMSUM and SSAAC methods to identify homologues of
known structure for the CASP targets, and chose parent structures and
alignments that subjectively appeared to represent the best alignments, either by
length, biological function, or alignment quality. This alignment was then
optimized by visual examination of the known structure, with modest changes
made to the alignments near insertion-deletion regions. Using visual
examination, we chose segments of the parent protein to be replaced by new
sequence of different length (i.e., indels) using a loop modeling method we are
developing. In most cases, we chose to replace the entire loop between the
flanking regular secondary structures. If the loop was too long to model, in
some cases we chose shorter segments that seemed likely to be affected by the
insertion or deletion.
We have developed an energy function that determines ranking of
Ramachandran conformations for a protein loop segment based purely on its
sequence. The function is derived from database analysis using Bayesian
methods. The analysis provides probabilities of Ramachandran conformations
for each position in a loop as a function of the amino acid type of the residue at
that position as well as the amino acid type and conformation of the residue
previous to that position and the amino acid type and conformation of the
residue following it. This allows us to search the entire space of Ramachandran
conformations and sort their energies. We find empirically that 99% of true
conformations can be found in low energy conformations (<1.5 kcal/mol per
residue) from this function. Generally the correct loop conformation is found in
the top 100 conformations sorted by energy.
We built 100 random conformations for the top 100 Ramachandran
conformations by sampling from phi,psi distributions for each of the 20 amino
acids from loop residues in the PDB. We have developed a new loop closure
method using an algorithm from robotics called “cyclic coordinate descent”
(CCD). The loop closure problem occurs in nearly all loop building methods,
and the current algorithms such as “random tweak” [6] are slow and sometimes
do not converge. CCD works by altering the structure of an initial loop
conformation that is built from the N-terminal anchor and does not close the loop at
the C-terminal anchor. It alters one dihedral angle at a time to optimize overlap
of the C-terminal residue of the constructed loop and the fixed C-terminal
anchor residue. Each move in the process is defined by the solution to an
equation in one variable (the solution is an inverse tangent). This is in contrast to
tweak, which alters all dihedrals and requires matrix inversion. CCD is very fast
and converges 99.95% of the time. It fails occasionally only for very short,
highly extended loops, where it may get stuck in a local minimum. CCD is also
very flexible in that any desired constraints can be applied to each dihedral
angle as a Metropolis criterion either to accept a proposed move or to reject it.
We added a Ramachandran probability map, so that moves to higher
probability phi,psi conformations for the residue type were accepted, and lower
probability conformations were accepted with probability of p(new)/p(old).
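The one-variable update at the heart of CCD can be shown with a simplified planar analogue. This is a sketch under loud assumptions: a 2D chain of rigid links with a single end point replaces the 3D backbone with dihedral angles and a three-atom C-terminal overlap, and no Ramachandran acceptance test is applied. The names `forward` and `ccd_close` are hypothetical.

```python
import math

def forward(angles, lengths):
    """Joint positions of a planar chain anchored at the origin (relative angles)."""
    pts = [(0.0, 0.0)]
    heading = 0.0
    for a, length in zip(angles, lengths):
        heading += a
        x, y = pts[-1]
        pts.append((x + length * math.cos(heading), y + length * math.sin(heading)))
    return pts

def ccd_close(angles, lengths, target, tol=1e-3, max_sweeps=1000):
    """Adjust one joint angle at a time so the chain end approaches the target."""
    angles = list(angles)
    for _ in range(max_sweeps):
        for i in range(len(angles)):
            pts = forward(angles, lengths)
            px, py = pts[i]
            ux, uy = pts[-1][0] - px, pts[-1][1] - py    # joint -> current end
            vx, vy = target[0] - px, target[1] - py      # joint -> target
            # Closed-form optimum for this single angle: an inverse tangent
            angles[i] += math.atan2(ux * vy - uy * vx, ux * vx + uy * vy)
        if math.dist(forward(angles, lengths)[-1], target) < tol:
            break
    return angles

lengths = [1.0, 1.0, 1.0, 1.0]
closed = ccd_close([0.3, 0.3, 0.3, 0.3], lengths, target=(2.0, 1.5))
end_point = forward(closed, lengths)[-1]
```

Each update rotates everything downstream of joint `i` about that joint, so the end point moves onto the line from the joint to the target; no matrix inversion is involved.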
We used CCD to close each of the 10,000 trial conformations per loop (100
random conformations for each of the top 100 Ramachandran conformations). We
are in the process of
developing and optimizing an appropriate energy function for choosing the best
conformation from the set of trial conformations. For CASP, we performed two
calculations on the 10,000 trials: first, we built the side chains onto the loop
conformation using our program SCWRL [7], using the rest of the protein
model as a steric frame; second, we used CHARMM [8] to minimize the
energy of the loop conformation (including side chains built with SCWRL). We
used these two energies to choose a Ramachandran conformation that appeared
most frequently at low energy. One of these conformations was used as the
predicted loop and placed into the structure.
We used SCWRL to predict the side chains on the entire structure, including
the constructed loops. Finally, we performed a brief CHARMM energy
minimization of the structure to relieve short contacts that result from the
rotamer assumption in SCWRL and to fix the phi,psi conformations of
nonPro→Pro mutations.
1. Altschul S.F. et al. (1997) Gapped BLAST and PSI-BLAST: a new
generation of protein database search programs. Nucleic Acids Res. 25
(17), 3389-3402.
2. Henikoff S. and Henikoff J.G. (1994) Position-based sequence weights. J.
Mol. Biol. 243 (4), 574-578.
3. Shindyalov I.N. and Bourne P.E. (1998) Protein structure alignment by
incremental combinatorial extension (CE) of the optimal path. Protein
Eng. 11 (9), 739-747.
4. Henikoff S. and Henikoff J.G. (1993) Performance evaluation of amino
acid substitution matrices. Proteins 17 (1), 49-61.
5. Prlic A., Domingues F.S., and Sippl M.J. Structure-derived substitution
matrices for alignment of distantly related sequences. Protein Eng. 13 (8),
545-550.
6. Shenkin P.S. et al. (1987) Predicting antibody hypervariable loop
conformation. I. Ensembles of random conformations for ringlike
structures. Biopolymers 26 (12), 2053-2085.
7. Bower M.J., Cohen F.E., and Dunbrack R.L. Jr. Prediction of protein
side-chain rotamers from a backbone-dependent rotamer library: a new
homology modeling tool. J. Mol. Biol. 267 (5), 1268-1282.
8. MacKerell A.D. et al. All-atom empirical potential for molecular modeling
and dynamics studies of proteins. J. Phys. Chem. B102, 3586-3616.
Dunker-Keith (P0355) - 195 predictions: 195 DR
Predicting Intrinsic Protein Disorder
A. Keith Dunker1, Pedro Romero1, Xiahong Li1, Ethan C. Garner1,
Celeste J. Brown1, Predrag Radivojac2
1 - Washington State University, 2 - Temple University
dunker@mail.wsu.edu
Many protein segments and a few whole proteins are unstructured under their
putative physiological conditions. These “intrinsically disordered sequences”
have important functions. Recognizing the over-simplification of partitioning
into two states, order and disorder, and recognizing that all protein structure is
condition-dependent, we are nevertheless focusing on the simplified problem of
predicting intrinsic order and disorder from amino acid sequence. From our
point of view, a region existing as an ensemble of Ramachandran φ, ψ angles,
whether static or dynamic, is intrinsically disordered.
PONDR is a collection of neural network Predictors of Natural Disordered
Regions. MODEL 1 is an integration of three predictors, one for each terminus
and one for internal sequences [1-2]. For the internal sequences, a training set
of 15 disordered regions having a total of 1149 residues was compiled and
balanced by an equal number of ordered residues taken randomly from
NRL_3D. Of the 15 disordered regions in the training set, 8 were characterized
by X-ray diffraction (PDB IDs: 2tbv, 2ts1, 1aui, 1bgw, 1elo, 1af3, 1ati and
1lbh) and 7 by NMR (SW IDs: prio_mouse, h5_chick, flgm_salty, regn_lambd,
hsf_klula, and hmgi_human, and PIR accession: S50866).
From an initial pool of 31 attributes, a branch and bound search was used to
select 10 attributes that gave the best collective discrimination between order
and disorder in the training set using a Mahalanobis distance criterion. The 31
attributes in the initial pool included the 20 amino acid compositions, two
different hydropathy scales, flexibility index, alpha-moment, beta-moment, net
charge (K + R - D - E), aromatic composition (W + F + Y), coordination
number, codon number, alphabet size, and side chain volumes. The attributes
selected by this process were fraction of W, Y, F, D, E, K, R, aromatic
composition, coordination number, and net charge.
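A minimal sketch of how the composition-based attributes might be computed for a sequence window. Normalizing every attribute by the window length is an assumption, the function name is hypothetical, and the coordination number is omitted because it needs a per-residue scale that the abstract does not give.

```python
def composition_attributes(window):
    """Composition-style attributes for one sequence window (one-letter codes)."""
    n = len(window)
    count = window.count
    # Fractions of the seven individually selected residue types
    attrs = {f"frac_{aa}": count(aa) / n for aa in "WYFDEKR"}
    # Aromatic composition (W + F + Y) and net charge (K + R - D - E),
    # both expressed per residue (an assumption)
    attrs["aromatic"] = (count("W") + count("F") + count("Y")) / n
    attrs["net_charge"] = (count("K") + count("R") - count("D") - count("E")) / n
    return attrs

attrs = composition_attributes("KKDEWFYAAA")
```

This yields nine of the ten selected inputs; the tenth would be added from whichever coordination-number scale is used.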
The back-propagation learning algorithm was used to train a feedforward
neural network having the ten selected attributes as inputs, a fully connected
hidden layer of ten neurons and a single output. To estimate errors, the training
was repeated on 5 training subsets, each containing 80% of the data (the held-out
20% portions being disjoint), with 3 different
initializations, so neural network training was repeated 5 x 3 = 15 times. Once
the accuracy was established by this 5-fold cross-validation procedure, a new neural
network was trained to the same accuracy using all the data.
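The 5 x 3 = 15 training runs can be enumerated as in the sketch below, assuming a standard 5-fold split in which each run trains on 80% of the data and holds out a disjoint 20%. The function name and the use of integer seeds for the initializations are hypothetical.

```python
import random

def cross_validation_runs(n_items, n_folds=5, n_inits=3, seed=0):
    """List every (fold, initialization) training run with its train/test indices."""
    idx = list(range(n_items))
    random.Random(seed).shuffle(idx)
    folds = [idx[k::n_folds] for k in range(n_folds)]   # disjoint held-out sets
    runs = []
    for k, test in enumerate(folds):
        held_out = set(test)
        train = [i for i in idx if i not in held_out]
        for init_seed in range(n_inits):
            runs.append({"fold": k, "init_seed": init_seed,
                         "train": train, "test": test})
    return runs

runs = cross_validation_runs(100)
```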
To enable prediction from the first to the last residue in a protein, disorder was
partitioned according to position, with the development of different predictors
for N-terminal, and C-terminal regions. These predictors used 8 inputs.
The integration of the three predictors is carried out in 3 steps. First, predictions
are made by the three predictors over their respective domains, with
overlapping predictions for positions 11 - 14 by the N-terminal and internal
predictors, and, for a protein of length M, with overlapping predictions from
M-14 to M-11 by the C-terminal and internal predictors. Second, the values for
each of the 4 pairs of overlapping predictions are averaged. Third, the now
integrated prediction outputs are smoothed by averaging over sliding windows
of 9 amino acids, with the first and last 4 sequence positions being assigned the
unsmoothed prediction output values from the N- and C-terminal predictors,
respectively. This integrated predictor is used herein. Studies have shown that
neural network scores are only roughly equivalent to probabilities.
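The three integration steps can be sketched as follows. This is a minimal illustration, not the PONDR implementation: each predictor's output is assumed to be a dictionary from 1-based position to score, every position is assumed to be covered by at least one predictor, and the function name is hypothetical.

```python
def integrate_predictions(n_term, internal, c_term, length, window=9):
    """Merge three positional predictors and smooth the result.

    Positions covered by two predictors get the mean of both scores.
    The merged scores are then averaged over a sliding window; the first
    and last window // 2 positions keep their unsmoothed values.
    """
    half = window // 2
    merged = []
    for pos in range(1, length + 1):
        values = [p[pos] for p in (n_term, internal, c_term) if pos in p]
        merged.append(sum(values) / len(values))
    smoothed = list(merged)
    for i in range(half, length - half):
        smoothed[i] = sum(merged[i - half:i + half + 1]) / window
    return smoothed
```

With the overlaps described above, the N-terminal dictionary would hold positions 1 to 14, the internal one positions 11 to M-11, and the C-terminal one positions M-14 to M.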
Disorder is indicated by a score greater than 0.5. Short strings of
amino acids are erroneously predicted to be disordered more often than long
strings of predicted disorder. The predictor used here has a per residue error
rate of 22%, whereas the error rate for consecutive lengths of disorder 30 or
longer is only 3% and 40 residues or longer is only 0.4%. In the remarks
section of each prediction, consecutive lengths of disorder 17 residues or
shorter were determined to be not significant, except at the termini. Ten or
more consecutive residues at the termini were considered significant.
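The significance rule for predicted disorder can be expressed as a simple run-length filter. The sketch below assumes that "at the termini" means a run that includes the first or last residue; the function name and the 1-based output convention are choices made here for illustration.

```python
def significant_disorder(scores, cutoff=0.5, min_internal=18, min_terminal=10):
    """Return 1-based (start, end) runs of significant predicted disorder.

    Internal runs of 17 residues or shorter are discarded; runs touching
    either terminus are kept when they span 10 or more residues.
    """
    n = len(scores)
    runs = []
    start = None
    # A sentinel below the cutoff closes a run that reaches the C-terminus.
    for i, score in enumerate(list(scores) + [float("-inf")]):
        if score > cutoff and start is None:
            start = i
        elif score <= cutoff and start is not None:
            end = i - 1
            at_terminus = start == 0 or end == n - 1
            needed = min_terminal if at_terminus else min_internal
            if end - start + 1 >= needed:
                runs.append((start + 1, end + 1))
            start = None
    return runs
```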
MODEL 2 is a linear predictor known as PONDR VL2 (S. Vucetic, C.J.
Brown, A.K. Dunker, Z. Obradovic, Supervised Partitioning of Disordered
Proteins, in progress). PONDR VL2 was trained on 145 disordered regions
whose lengths were 40 amino acids or longer. These regions were identified
either by missing electron density in X-ray crystal structures, or by author's
designation for NMR, circular dichroism or proteolysis. The ordered training
set was composed of 130 completely ordered proteins with no sequence
similarity. (Both datasets are available at http://disorder.chem.wsu.edu). The
attributes used for training this predictor were 19 amino acid compositions
(excluding F), flexibility and sequence complexity. These values were
calculated for a window size of 41. A collapsing window size was used for the
termini of each sequence. The algorithm used for prediction was ordinary least
squares regression with ordered windows designated 0 and disordered windows
designated 1. In this case, prediction values can range below 0 and above 1;
any value below 0.5 is considered ordered and greater than or equal to 0.5 is
considered disordered.
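A linear predictor of this kind can be sketched as below. The weights are placeholders, not the fitted PONDR VL2 coefficients, and only composition terms are shown (flexibility and sequence complexity are omitted). "Collapsing window" is interpreted here as a window truncated at the sequence ends, which is an assumption.

```python
def collapsing_windows(seq, size=41):
    """Return one window per residue, truncated at the sequence termini."""
    half = size // 2
    return [seq[max(0, i - half):i + half + 1] for i in range(len(seq))]


def linear_disorder_scores(seq, weights, bias=0.0, size=41):
    """Score each residue as bias + sum over residues of weight * frequency.

    `weights` maps amino acid letters to hypothetical regression
    coefficients; letters without a weight contribute zero.
    """
    scores = []
    for win in collapsing_windows(seq, size):
        composition_term = sum(weights.get(aa, 0.0) for aa in win) / len(win)
        scores.append(bias + composition_term)
    return scores


def classify(scores, cutoff=0.5):
    """Label scores at or above the cutoff as disordered ('D'), else 'O'."""
    return "".join("D" if s >= cutoff else "O" for s in scores)
```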
MODEL 3 is based on an ensemble of feed-forward neural networks, combined
using bagging methodology and augmented using an order/disorder boundary
predictor. The dataset of disordered proteins had 154 disordered regions that
were 30 residues long or longer. The dataset of ordered proteins had 290
completely ordered proteins. (Both datasets are available at
http://disorder.chem.wsu.edu). All networks were trained using the
Levenberg-Marquardt algorithm with at most 200 iterations, and were optimized for
detecting intrinsic disorder of length 30 residues or longer.
The ensemble of predictors contains 50 feed-forward neural networks. Each
predictor in the ensemble uses 20 input attributes: 19 are amino acid
frequencies in a sliding window of 41 residues, and the last attribute is
sequence complexity. The structure of each network is 20 x 5 x 2. All hidden
neurons use a logistic activation function. All output neurons use a linear
activation function. Output nodes approximate the conditional posterior
probability of each class provided that the number of examples used for
training was large enough and the training algorithm found a global optimum.
Predictions were windowed using an output window of length 31.
The order/disorder boundary predictor [3] is a logistic regression predictor that
uses the frequencies of specific residues at each position on either side of an
order/disorder boundary. Training sequences came from the same data sets as
the ensemble predictors, however, there were only 123 order/disorder
boundaries in the disorder set. Inputs to the predictors are the frequencies of
amino acids at each position in a window of 24 residues. To reduce
dimensionality, only amino acids that showed significantly different
frequencies at an order/disorder boundary relative to the same positions in
completely disordered and completely ordered segments were used as inputs.
Dimensionality was further reduced using principal component analysis.
After bagging the ensemble of neural network predictors, the order/disorder
boundary predictor was used, and the predictions were recalculated. The final
predictor is therefore a boundary-augmented, bagged ensemble of feed-forward
neural networks.
1. Romero P. et al. (2001) Sequence complexity of disordered protein.
Proteins: Struc. Func. Genetics 42, 38-48.
2. Li X., et al. (1999) Predicting protein disorder for N-, C-, and internal
regions. Genome Informatics 10, 30-40.
3. Radivojac P., et al. (2003) Prediction of boundaries between intrinsically
ordered and disordered protein regions. Pacific Symp Biocomputing (in
press).
ESyPred3D (P0034) - 36 predictions: 36 3D
ESyPred3D: Prediction of Proteins 3D Structures
C. Lambert and E. Depiereux
Unité de Recherche en Biologie Moléculaire, Facultés Universitaires
Notre-Dame de la Paix, rue de Bruxelles 61, 5000 Namur, Belgium
christophe.lambert@fundp.ac.be
The aim of our work is to propose a reliable automatic method for homology
modeling (ESyPred3D[1]), especially when the protein of interest shares a low
percentage of identities (20-30%) with the chosen template.
Our strategy consists of the usual steps for homology modeling: search for the
template in databanks, target-template alignment and modeling. Currently, our
method does not provide any assessment of the model.
For the template search, we used four iterations of PSI-BLAST[2] on the
non-redundant protein database (nr) of the NCBI. All sequences having an
expected value lower than 0.001 are included in the profile building. The
template is chosen as the sequence of known structure (PDB) that has the
lowest expected value. The search in the nr databank also gives us a
large number of similar sequences.
As far as possible, two sets of sequences are built. The first one contains the 50
best hits below the expected value cutoff of 0.001. The second one contains a
subset of the sequences, after dropping overly redundant ones. This method aims
at creating different conditions to run multiple alignment programs and
extracting different consensus in order to raise the confidence of the
sequence-structure alignment.
The two sets are then submitted to five alignment programs: ClustalW[3],
Dialign2[4], Match-Box[5], Multalin[6] and T-Coffee[7]. A pairwise alignment
between the target and template sequences is extracted from each multiple
alignment. All the pairwise alignments, including the one provided by
PSI-BLAST, are used to generate a database of aligned positions (boxes). A neural
network is used to assign a score to each box. The most confident boxes are taken
as anchor points for the building of the final sequence-structure alignment. A
three-dimensional model is built using MODELLER [8] on this final alignment.
ESyPred3D web site: http://www.fundp.ac.be/urbm/bioinfo/esypred
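The idea of pooling aligned positions from several pairwise alignments can be illustrated as follows. This sketch replaces the neural-network box scoring with a plain vote count, which is a simplification introduced here and not the ESyPred3D scoring scheme; all function names are hypothetical.

```python
from collections import Counter


def aligned_pairs(target_row, template_row, gap="-"):
    """Return the set of (target_index, template_index) aligned residues.

    Both rows are equal-length gapped strings; indices are 0-based
    positions in the ungapped sequences.
    """
    pairs = set()
    i = j = 0
    for a, b in zip(target_row, template_row):
        if a != gap and b != gap:
            pairs.add((i, j))
        if a != gap:
            i += 1
        if b != gap:
            j += 1
    return pairs


def box_support(alignments):
    """Count how many pairwise alignments contain each aligned position."""
    counts = Counter()
    for target_row, template_row in alignments:
        counts.update(aligned_pairs(target_row, template_row))
    return counts


def anchor_points(alignments):
    """Return positions supported by every alignment (simple consensus)."""
    counts = box_support(alignments)
    return sorted(pair for pair, c in counts.items() if c == len(alignments))
```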
1. Lambert C. et al. (2002) ESyPred3D: Prediction of proteins 3D structures.
Bioinformatics. 18 (9), 1250-1256
2. Altschul S.F. et al. (1997) Gapped BLAST and PSI-BLAST: a new
generation of protein database search programs. Nucleic Acids Res. 25
(17), 3389-3402
3. Thompson J.D. et al. (1994) CLUSTAL W: improving the sensitivity of
progressive multiple sequence alignment through sequence weighting,
position-specific gap penalties and weight matrix choice, Nucleic Acids
Res. 22, 4673-4680
4. Morgenstern B. et al. (1998) DIALIGN: Finding local similarities by
multiple sequence alignment. Bioinformatics 14, 290-294
5. Depiereux E. et al. (1997) Match-Box server: a multiple sequence
alignment tool placing emphasis on reliability. Comput. Appl. Biosci. 13,
249-256
6. Corpet F. (1988) Multiple sequence alignment with hierarchical clustering.
Nucleic Acids Res. 16, 10881-10890
7. Notredame C. et al. (2000) T-Coffee: A novel method for fast and accurate
multiple sequence alignment. J. Mol. Biol. 302(1), 205-217
8. Sali A. et al. (1993) Comparative protein modelling by satisfaction of
spatial restraints. J. Mol. Biol. 234(3), 779-815.
evolutionaries (P0180) - 99 predictions: 99 3D
A Phylogenomic Approach to Fold Prediction
Kimmen Sjölander1, Emma Hill1, David Konerding1,
Steven Brenner1, Andrej Sali2 and Andras Fiser2
1 – UC Berkeley, 2 – Rockefeller University
kimmen@uclink.berkeley.edu
The Berkeley Evolutionaries approach focused on the use of phylogenetic
inference to guide our selection of structural templates for targets and to
produce an alignment of a target and template pair. These alignments were
used as the basis for model construction (using MODELLER), from which the
best model was chosen using a statistical potential function (PROSAII).
1. HMM library construction. We constructed an HMM library using the
SCOP PDB40 sequence set (Astral version 1.57, with 4013 domains) as seeds,
the UCSC SAM-T99 software to cluster and align homologous sequences in the
NR database to each seed, and the UCSC fw0.5 software to construct a general
HMM for each cluster. We then ran BETE (Bayesian Evolutionary Tree
Estimation) to construct a phylogenetic tree and identify subfamilies (Section
6), and constructed subfamily HMMs (Section 7). This HMM library was
expanded to include HMMs for new sequences submitted to PDB since the
Astral 1.57 PDB40 database had been completed.
2. Creation of a set of sequence homologs to PDB structures. We generated
a consensus sequence for each subfamily in the HMM library, creating a
representative set of ‘sequences’ for each cluster. We collected all consensus
sequences into one large file (“sfreps.seqs”), containing 552,000 sequences,
each of which is mapped to a specific PDB40 domain.
3. Target identification. Each target went through a multi-stage analysis, using
first those methods which are computationally efficient but less sensitive, and
continuing (as necessary) to the more computationally expensive and sensitive
methods. In cases where a target appeared to be composed of multiple domains,
we constructed HMMs for each domain separately, and each domain was
treated as a separate target with all stages of the target identification process
performed independently. All putative target-template matches were assessed
using various criteria (see Section 4). Stage 1. Target is scored against the
general HMMs in the HMM library. Stage 2. Target is scored against
subfamily HMMs for high-scoring clusters. Stage 3. The FlowerPower
algorithm (Section 5) is used to identify homologs to the target from the NR
database, and to construct a multiple sequence alignment (MSA). The target
homologs are scored against the general HMMs in the HMM library, followed
by scoring of subfamily HMMs for high-scoring clusters. Stage 4. We construct
general and subfamily HMMs for the target and homologs (using BETE); the
HMMs are then used to score the PDB database. Stage 5. The target general
HMM is used to score the ‘sfreps.seqs’ file (Section 2), followed by scoring all
the top hits against the target subfamily HMMs. Stage 6. We score the NR
database with the general HMM constructed for the target to find additional
homologs to the target. Any accepted sequences are included in the target
HMM training set, and stages 3-5 are repeated.
4. Assessing the likelihood of a target-template match. We used a three-stage
alignment analysis: (1) analysis of the MSA for the target and homologs to
identify key residues, conserved motifs, regions of variability, and so on; (2) an
identical analysis of any template MSAs (including literature search for
experimentally determined key residues); and (3) construction and analysis of a
joint alignment of the members of the target and template families. In cases
where the target-template sequence similarity was very low, we used a variety
of alignment methods, including SATCHMO (Section 8), and joint HMM
construction to generate a joint alignment. Joint HMM construction employs
subfamily HMMs to detect intermediate sequences from the target and template
homolog sets. These sequences are mutually aligned, using one of the
subfamily HMMs, followed by FlowerPower expansion of the joint HMM
training set, until the target and template structure can be aligned accurately.
Structural alignments (from DALI) were also used as inputs to FlowerPower,
for inclusion of sequence homologs, and construction of general and subfamily
HMMs. These structurally informed HMMs were then used to align the target
and generate a pairwise alignment between the target and template. All
alignments produced for the target and template and their homologs were then
inspected for agreement at predicted or known key positions in either the target
or template structure.
5. FlowerPower clustering and multiple sequence alignment. FlowerPower
integrates clustering and alignment into a single process, and thus has obvious
similarities to both PSI-BLAST and SAM-T99. There are two fundamental
differences which distinguish our approach. First, instead of using a single
HMM or profile to expand the existing cluster, we use a set of HMMs: a
general HMM for the family as a whole, and a subfamily HMM for each
subfamily found by BETE. Each HMM competes for all sequences, and is used
to align those sequences most closely related to it. Because subfamily HMMs
have specificity for individual subfamilies included in previous iterations, they
prevent profile drift (i.e., sequences identified as homologous in early iterations
will continue to be identified as homologs at all subsequent iterations) and
result in improved alignment, particularly in regions of overall diversity among
family members. Second, FlowerPower uses alignment quality control after
each cluster expansion step, to prevent the potential intrusion of non-homologs
and/or poorly aligned homologs.
6. Bayesian Evolutionary Tree Estimation (BETE). BETE[1] employs
agglomerative clustering to construct a tree, given an input multiple sequence
alignment. Initially all sequences are in separate classes, and form the leaves of
the tree. For each sequence in the alignment we construct a profile to represent
the amino acid probabilities at each position, using Dirichlet mixture priors.
We measure the distance between all pairs of classes, using a symmetrized
form of relative entropy between the profiles, and find the two closest classes.
We then estimate a new profile to represent all the sequences in the merged
class, compute the distances between the new class and the other classes, and
join the closest pair. Dirichlet mixture densities are also used to compute the
encoding cost of the set of alignments defined at each stage of the
agglomeration; the point during the agglomeration producing the minimum
encoding cost defines a cut of the tree into subtrees, and a decomposition of the
sequences into subfamilies. BETE was used at Celera Genomics (in
combination with subfamily HMM construction) to produce the functional
classification of the proteins encoded in the human genome[2].
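The distance used to pick the next pair of classes to merge can be sketched as below. The abstract does not say which symmetrized form of relative entropy BETE uses; the sum of the two directed divergences is assumed here. Profiles are assumed to be lists of columns with strictly positive probabilities (as a prior-smoothed profile would have), and the function names are hypothetical.

```python
import math
from itertools import combinations


def symmetric_relative_entropy(profile_p, profile_q):
    """Sum over columns of KL(p||q) + KL(q||p).

    Uses the identity KL(p||q) + KL(q||p) = sum((p - q) * log(p / q)).
    All probabilities must be strictly positive.
    """
    total = 0.0
    for col_p, col_q in zip(profile_p, profile_q):
        for p, q in zip(col_p, col_q):
            total += (p - q) * math.log(p / q)
    return total


def closest_pair(profiles):
    """Return indices (i, j) of the two classes with the smallest distance."""
    return min(
        combinations(range(len(profiles)), 2),
        key=lambda ij: symmetric_relative_entropy(profiles[ij[0]], profiles[ij[1]]),
    )
```

In an agglomerative loop, the pair returned by `closest_pair` would be merged, a new profile estimated for the merged class, and the search repeated.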
7. Subfamily HMM construction and performance. For each subfamily, and
at each position, we use Dirichlet mixture priors to weight the contribution of
amino acids aligned by other subfamilies. This enables subfamily HMMs to
retain specificity in homolog detection without sacrificing sensitivity, even
when individual subfamilies may contain very few members. Experimental
validation on the PDB40 datasets shows that subfamily HMMs provide very
high specificity of classification of novel sequences, and improve sensitivity of
homolog detection, particularly in the case of fragment detection[3].
8. Simultaneous Alignment and Tree Construction using Hidden Markov
mOdels (SATCHMO). SATCHMO[4] simultaneously estimates a
phylogenetic tree and generates a set of multiple sequence alignments, one for
each node in the tree. Because SATCHMO requires sequences in each subtree
to retain their mutual alignment when two subtrees are joined, sequences are
not allowed to get out of register with others that are closely related. The
multiple sequence alignment at each node models the consensus structure held
by the sequences descending from that node; at the root, the alignment predicts
the ‘conserved core structure’ shared by all members of the family, with
alignments increasing in length and in specificity as a path is traced from the
root towards a leaf. In experiments on the BAliBASE benchmark alignment
database, SATCHMO is shown to perform comparably to ClustalW and the UCSC
SAM alignment ‘tune-up’ software.
9. Building a full-atom model. Once alignments of the target sequence with
several different candidate template structures or alternative alignments with a
given template were obtained, MODELLER was used to construct all-atom
models of the target[5]. MODELLER implements comparative modeling by
satisfaction of spatial restraints. For each template selection and alignment, 20
models were built and subsequently evaluated by statistical potential functions
in PROSAII[6]. The template selections and alignments were iteratively refined
by hand to increase the PROSAII Z-scores of the corresponding models. In
addition, well-defined insertions were modelled with the 'ab initio' loop
modeling module of MODELLER[7]. For the difficult modeling cases
involving remotely related templates, some predicted helices and strands were
also explicitly restrained to maintain the predicted fold. In a few cases, the
target sequence was modeled in complex with corresponding cofactors or
inhibitors. After the final alignment and template selection were found by
optimizing the PROSAII Z-score, the best among the final 20 models was
selected based on the value of the MODELLER objective function.
1. Sjolander K. (1998) Phylogenetic inference in protein superfamilies:
analysis of SH2 domains. Proc Int Conf Intell Syst Mol Biol 6: p. 165-74.
2. Venter J.C. et al. (2001) The sequence of the human genome. Science
291(5507): p. 1304-51.
3. Christopher W. and Sjolander K. The sum of the parts is greater than the
whole: Protein classification using subfamily HMMs. Submitted.
4. Edgar R. and Sjolander K. Simultaneous Sequence Alignment and Tree
Construction using Hidden Markov Models, To appear in Pacific
Symposium on Biocomputing. 2003. Kauai, HI.
5. Sali A. and Blundell T.L. (1993) Comparative protein modelling by
satisfaction of spatial restraints. J Mol Biol. 234(3): p. 779-815.
6. Sippl M.J. (1993) Recognition of errors in three-dimensional structures of
proteins. Proteins 17(4): p. 355-62.
7. Fiser A., Do R.K., and Sali A. (2000) Modeling of loops in protein
structures. Protein Sci 9(9): p. 1753-1773.
FAMS (P0168) - 324 predictions: 324 3D
Homology Modeling Server, FAMS
M. Iwadate, R. Yamatsu, G. Terashi, R. Arai and H. Umeyama
School of Pharmaceutical Sciences, Kitasato University
iwadatem@pharm.kitasato-u.ac.jp
We developed a homology-modeling server, FAMS [1], which runs the homology
modeling software FAMS (Full Automatic Modeling System) [2]. Alignments were
generated with several BLAST-type programs (BLAST, PSI-BLAST, RPS-BLAST and
IMPALA [3]) together with our own alignment and re-alignment software.
After many model structures had been calculated from the various alignments,
a scoring function was defined that takes into account the E-values of the
alignments, the hydrophobic interactions of the model structures, and their
secondary structures. The 5 highest-scoring models were submitted.
When the target protein had high sequence homology to a known structure in
the PDB database, the BLAST-type programs produced many alignments. FAMS
modeling was run on each alignment, so many models were produced. Given the
48-hour CAFASP3 deadline, the alignments for each target were prioritized by
sorting on E-value.
When there was little or no homology to a known reference structure, our
original alignment algorithm worked effectively and always yielded many model
structures, for all 67 CAFASP3 targets. In total, 335 structures (67 * 5)
were submitted from this FAMS server.
The computing time actually used for each target depended strongly on the
number of query sequences sharing the same deadline day, and a complete
calculation sometimes required more than 48 hours. To choose the best
structure from many candidates within the limited time, we built a Linux PC
cluster with 150 CPUs on which the FAMS programs run. This allowed a large
number of model structures to be calculated from a large number of
alignments.
1. Iwadate M., Ebisawa K. and Hideaki U. (2001) Homology modeling of
CAFASP2 competition, Chem-Bio Informatics J. 1 (4), 136-148
2. Ogata K. and Hideaki U. (2000) An automatic homology modeling
method consisting of database searches and simulated annealing, J. Mol.
Graph. Model., 18 (3), 258-272, 305-306
3. Altschul S.F. et al. (1997) Gapped BLAST and PSI-BLAST: a new
generation of protein database search programs. Nucleic Acids Res. 25
(17), 3389-3402
FAMSD (P0169) - 322 predictions: 322 3D
Homology Modeling Server, FAMSD
M. Iwadate, R. Yamatsu, G. Terashi, R. Arai and H. Umeyama
School of Pharmaceutical Sciences, Kitasato University
iwadatem@pharm.kitasato-u.ac.jp
We developed a homology-modeling server, FAMSD, an advanced version of
FAMS [1]. This server runs the homology modeling software FAMS (Full
Automatic Modeling System) [2], together with several BLAST-type programs [3]
and our own re-alignment software.
FAMSD differs from the FAMS server in two main respects. 1) The SCOP domain
database was used for the homology search instead of the PDB database. 2) The
E-value threshold was set high, at 10, in all of the BLAST-type alignment
programs. A scoring function calculated from the modeled structures and the
E-value was defined to determine the priority order, and the 5 models with
the highest scores were submitted.
For target proteins with high homology to a known reference structure,
similar alignments and many template protein coordinates were selected from
the FAMS server, and the BLAST-type programs produced many alignments at the
high E-value threshold. Many models were therefore produced. Given the
48-hour CAFASP3 deadline, the alignments for each target were prioritized by
sorting on E-value. For targets with little or no homology to a known
reference structure, many model structures were still always obtained, for
all 67 CAFASP3 targets, because of the high E-value threshold. In total, 334
structures were submitted from this server (for target T0129, FAMSD returned
4 structures).
The FAMSD server introduced new techniques: a model fitting algorithm and an
alignment combining algorithm. For some targets these algorithms worked
effectively, judging by the score mentioned above. The alignment combining
algorithm was also used in the FAMS server.
1. Iwadate M., Ebisawa K. and Hideaki U. (2001) Homology modeling of
CAFASP2 competition, Chem-Bio Informatics J. 1 (4), 136-148
2. Ogata K. and Hideaki U. (2000) An automatic homology modeling
method consisting of database searches and simulated annealing, J. Mol.
Graph. Model., 18 (3), 258-272, 305-306
3. Altschul S.F. et al. (1997) Gapped BLAST and PSI-BLAST: a new
generation of protein database search programs. Nucleic Acids Res. 25
(17), 3389-3402
FFAS03 (P0309) - 314 predictions: 314 3D
FFAS03: Automated Profile-Profile Distant Homology
Recognition Server Applied to Fold Recognition.
L. Jaroszewski1 and A. Godzik2
1 – JCSG Bioinformatics, UCSD, 2 – The Burnham Institute
lukasz@sdsc.edu
FFAS03 (Fold and Function Assignment System 03) automated fold
recognition server is based on the dynamic programming alignment of
sequence profiles and builds on the FFAS algorithm described previously [1-3].
FFAS03 sequence profiles are calculated from multiple sequence alignments
obtained with PSI-BLAST [4] with a special weighting system. Five iterations
of PSI-BLAST were applied to collect sequences from the non-redundant NCBI
database after clustering it at 85% sequence identity with the CD-HIT program
[5] and masking low-complexity regions with the SEG program [6]. FFAS03 uses
sequences identified by PSI-BLAST with E-values below 0.01 and the
alignment from the PSI-BLAST output.
A two-dimensional weighting system is based on the matrix of all-to-all
similarity within the homologous family. In addition, FFAS03 performs a
normalization of the matrix containing the comparison scores between all
positions of both aligned profiles. The final (normalized) score is obtained
from the raw dynamic programming score by comparison with an empirically
obtained distribution of raw scores on the representative domain library of
different folds based on the SCOP [7] database.
1. Rychlewski L., Jaroszewski Ł., Li W. & Godzik A. (2000) Comparison of
sequence profiles. Strategies for structural predictions using sequence
information. Protein Science 9, 232-241
2. Jaroszewski Ł., Rychlewski L. & Godzik A. (2000). Improving the quality
of twilight-zone alignments. Protein Science 9, 1487-1496
3. Jaroszewski Ł., Li W., & Godzik A. (2002) In the search for more accurate
alignments in the twilight zone. Protein Science 11:1702-13
4. Altschul S.F. et al. (1997) Gapped BLAST and PSI-BLAST: a new
generation of protein database search programs. Nucleic Acids Res. 25
(17), 3389-3402
5. Li W, Jaroszewski Ł, Godzik A. (2002) Clustering of highly homologous
sequences to reduce the size of large protein databases. Bioinformatics 17:
282-283
6. Wootton J.C. (1994) Non-globular domains in protein sequences:
automated segmentation using complexity measures. Comput Chem.
18:269-85.
7. Murzin A.G., Brenner S.E., Hubbard T., Chothia C. (1995). SCOP: a
structural classification of proteins database for the investigation of
sequences and structures. J. Mol. Biol 247:536-540.
Flohil (P0545) - 3 predictions: 3 3D
Completion and Refinement of 3-D Homology Models With
Restricted Molecular Dynamics
J.A. Flohil, S.W. de Leeuw
Delft University of Technology, Department of Applied Physics
Lorentzweg 1, 2628 CJ Delft, The Netherlands
j.a.flohil@tnw.tudelft.nl
The modeling and refinement protocol was applied according to the method
described in [1].
http://www3.interscience.wiley.com/cgibin/fulltext?ID=95016016&PLACEBO=IE.pdf
1. Flohil J.A. et al. (2002) Completion and Refinement of 3-D Homology
Models with Restricted Molecular Dynamics: Application to Targets
47, 58, and 111 in the CASP Modeling Competition and Posterior Analysis.
Proteins 48, 593-604.
Flohil (P0545) - 3 predictions: 3 3D
Molecular Dynamics Simulation of Hydrophobic Collapsing
From a Model in Extended State
J.A. Flohil, S.W. de Leeuw
Delft University of Technology, Department of Applied Physics
Lorentzweg 1, 2628 CJ Delft, The Netherlands
j.a.flohil@tnw.tudelft.nl
An initial simulation model was created by mutation of a fully extended
polyglycine into the target sequence. GROMACS [1] with the GROMOS96 force
field was used for all simulations; the applied parameters included periodic
boundary conditions, temperature coupling and long-range electrostatics.
The system was initially built up by adding a 0.6 nm shell of explicit water
around the stretched model, and the system was placed in a 12x12x150 nm box.
An energy minimization was performed before each of the following simulations.
After adding or renewing the water, a 10 ps run with position restraints on
the protein was done to equilibrate the water.
A main collapsing run of 1 ns was performed. This run started with all amino
acid positions harmonically restrained; every 10 ps one consecutive residue
was released from its restraints until the complete chain was able to move
freely. The first restraint was released at the N-terminal side, the last at
the C-terminal side. If the residue releasing procedure was completed before
the end of the simulation, the remaining time was used to continue without
restraints. A snapshot of the 1 ns trajectory was recorded every 1 ps, and for
each snapshot the radii of gyration of the protein atoms about the x, y and z
axes were computed. The frame with the model in the most compact state was
selected for further refinement. Drifting water molecules having no contact
with the water shell or protein were removed from the box, the dimensions of
the box were maximally reduced, and empty holes in the box were filled with
water molecules. The simulation was restarted to run for another 3 ns, and
based on the evolution of forming secondary and tertiary segments, as well as
backbone-backbone contacts and free energy of the water, the best model
conformation was selected for submission.
1. Lindahl E., et al. (2001) GROMACS 3.0: A Package for Molecular
Simulation and Trajectory Analysis. Journal Molecular Modeling, 7, 306-317.
FISCHER (P0427) - 161 predictions: 161 3D
Fold Recognition Using 3D-Shotgun
D. Fischer1 and N. Siew1,2
1
Bioinformatics/Computer Science, 2 Dept. of Chemistry
Ben Gurion University, Beer-Sheva, Israel
dfischer@cs.bgu.ac.il
Fully automated structure-prediction methods can currently produce reliable
models for only a fraction of the target sequences. However, using a number of
semi-automated procedures, human-expert predictors are often able to produce
more and better predictions than automated methods. We have recently
developed a novel, fully automatic, fold-recognition meta-predictor, named
3D-SHOTGUN [1], that incorporates some of the strategies human predictors have
successfully applied. This new method is reminiscent of the so-called
cooperative algorithms of Computer Vision. The input to 3D-SHOTGUN is the
top models predicted by a number of independent fold-recognition methods.
The meta-predictor consists of four steps:
Assembly of hybrid models,
Confidence assignment,
Selection, and
Model Refinement.
The first three steps are fully automated within the bioinbgu fold-recognition
server. MaxSub [2] and LiveBench tests have demonstrated that 3D-SHOTGUN
is more sensitive than any of the individual methods, and the
predicted hybrid models are, on average, more similar to their corresponding
native structures than those produced by the individual servers. The models
produced by bioinbgu were submitted to CAFASP. The fourth step, which
includes model refinement using Modeller, is not yet part of the server,
although it is also fully automatic. Fischer’s group predictions to CASP were
the results of the application of this fourth step to the server’s 3D-SHOTGUN
predictions.
1. Fischer D. (2002) 3D-SHOTGUN: A Novel, Cooperative, Fold-Recognition
Meta-Predictor. Proteins, In press.
2. Siew N, Elofsson A, Rychlewski L, and Fischer D. (2000). MaxSub: an
Automated Measure for the Assessment of Protein Structure Prediction
Quality. Bioinformatics 16 (9), 776-85.
Floudas-C.A. (P0011) - 15 predictions: 15 3D
ASTRO-FOLD: Ab-Initio Tertiary Structure Prediction of
Proteins
Christodoulos A. Floudas and John L. Klepeis
Department of Chemical Engineering, Princeton University, Princeton, NJ
floudas@titan.princeton.edu
ASTRO-FOLD is an integrated methodology for the ab-initio structure
prediction of proteins based on an overall deterministic global optimization
framework coupled with mixed-integer optimization. The novel four-stage
approach combines the classical and new views of protein folding, while using
free energy calculations and integer linear optimization to predict the location
of helical segments and the topology of beta-sheet structures and disulfide
bridges, respectively. Detailed atomistic-level energy modeling and the
deterministic global optimization method, αBB, coupled with torsion angle
dynamics, form the basis for the final tertiary structure prediction [1-3].
The first stage of the approach involves the identification of helical segments.
This is accomplished through detailed atomistic-level energy modeling of
overlapping subsequences of the overall protein sequence using the selected
force-field (e.g., ECEPP/3 [4]). The amino-acid sequence is first decomposed
into subsequences of overlapping oligopeptides (e.g., pentapeptides,
heptapeptides, nonapeptides). For instance, using heptapeptides, the following
subsequences are generated: 1-7, 2-8, 3-9, . . . etc. For each subsequence,
global optimization is used to generate an ensemble of low energy
conformations along with the global minimum energy conformation [5].
Rigorous free energies that include entropic, cavity formation, polarization and
ionization contributions, and involve solution of the Poisson-Boltzmann
equation, are calculated for a subset of conformations for each oligopeptide
system. Finally, these free energy values are combined to determine helical
propensities for each residue by calculating equilibrium occupational
probabilities for each possible helical cluster [6].
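The decomposition into overlapping oligopeptides described above can be sketched in a few lines. This is only an illustration of the windowing step (1-7, 2-8, 3-9, ...); the function name and the example sequence are hypothetical, and none of the energy modeling or free energy calculations is represented.

```python
def overlapping_oligopeptides(sequence, length=7):
    """Return (start, end, peptide) tuples using 1-based, inclusive residue numbers.

    Hypothetical helper: it only enumerates the overlapping windows and does
    not perform any of the energy calculations described in the abstract.
    """
    if length < 1:
        raise ValueError("length must be positive")
    return [
        (i + 1, i + length, sequence[i:i + length])
        for i in range(len(sequence) - length + 1)
    ]


# Made-up 10-residue sequence, decomposed into heptapeptides.
windows = overlapping_oligopeptides("ACDEFGHIKL", length=7)
for start, end, peptide in windows:
    print(f"{start}-{end}: {peptide}")
```

Changing `length` to 5 or 9 gives the pentapeptide and nonapeptide decompositions mentioned in the text.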
The second stage focuses on the prediction of beta-sheet and disulfide bridge
topology through the analysis of amino acid properties that are based on residue
hydrophobicities. The approach, which borrows key concepts from a
mathematical framework developed in the area of process synthesis of chemical
systems [7], is based on the idea that beta-structure formation relies on a
hydrophobic driving force. To model this force, it is necessary to predict
contacts between hydrophobic residues. The first important component of the
approach is the postulation of a beta-strand superstructure that encompasses all
alternative beta-strand arrangements. A novel mathematical model is then
formulated to provide the formation of ordered structural features, such as
beta-sheets and disulfide-bridge connectivity. The solution of this integer linear
programming problem, with the objective being the maximization of the
hydrophobic contact energy, provides a rank ordered list of preferred
hydrophobic residue contacts, beta strand topologies and disulfide bridge
connectivities [8].
The third stage involves the derivation of restraints based on helical and
beta-sheet predictions in the form of dihedral angle and atomic distance restraints to
enforce the predicted secondary and tertiary arrangements. Additional
restraints are determined for the intervening loop residues connecting helical
and strand regions through novel application of free energy simulation [9-11].
More specifically, the identified loops are extended on each side to incorporate
three additional amino acids of both secondary structure elements that the loop
connects. Each set of three flanking amino acids is constrained to be in its
respective secondary structure state (e.g., helix, beta-strand). Then, a series of
free energy calculations are conducted through the principles of overlapping
oligopeptides, similar to the free energy calculations used in the helix
prediction stage. The objective of these calculations is to produce improved
bounds on the dihedral angle and backbone distances within the loop residues.
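As a small illustration of the loop extension step, the sketch below extends a loop by three residues on each side and records the secondary structure state imposed on the flanking residues. The function name, the one-letter state codes and the clipping at the chain ends are assumptions made for the example; the abstract does not say how loops near the termini are handled.

```python
def extended_loop(loop_start, loop_end, chain_length,
                  state_before, state_after, flank=3):
    """Extend a loop (1-based, inclusive) by `flank` residues on each side.

    Returns the extended segment and a dict mapping each flanking residue to
    the secondary structure state it is held in. Clipping to the chain ends
    is an assumption of this sketch.
    """
    start = max(1, loop_start - flank)
    end = min(chain_length, loop_end + flank)
    fixed = {}
    for residue in range(start, loop_start):
        fixed[residue] = state_before      # e.g. "H" for a preceding helix
    for residue in range(loop_end + 1, end + 1):
        fixed[residue] = state_after       # e.g. "E" for a following strand
    return (start, end), fixed


# Hypothetical loop 20-25 between a helix and a beta-strand in a 100-residue chain.
segment, fixed_states = extended_loop(20, 25, 100, "H", "E")
print(segment, fixed_states)
```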
The fourth and final stage of the approach involves the prediction of the tertiary
structure of the full protein sequence. The problem formulation, which relies on
dihedral angle and atomic distance restraints introduced from the previous stages,
as well as on detailed atomistic energy modeling, represents a nonconvex
constrained global optimization problem. This problem is solved through the
combination of a deterministically based global optimization approach, the αBB,
and torsion angle dynamics [1,11]. A distributed computing framework for each
stage of the proposed approach has been developed, and our predictions in the
CASP5 competition employ this parallel implementation.
1. Klepeis J.L. and Floudas C.A. (2002) Ab-Initio Tertiary Structure
Prediction of Proteins, Journal of Global Optimization, in press.
2. Floudas C.A. (2000) Deterministic Global Optimization: Theory,
Algorithms and Applications, Kluwer Academic Publishers.
3. Klepeis J.L. et al. (2002) Deterministic global optimization and ab-initio
approaches for the structure prediction of polypeptides, dynamics of
protein folding, and protein-protein interactions, Advances in Chemical
Physics 120, 265-457.
4. Nemethy et al. (1992) Energy parameters in polypeptides. 10. Improved
geometrical parameters and nonbonded interactions for use in the ECEPP/3
algorithm with applications to proline-containing peptides, Journal of
Physical Chemistry 96, 6472-6484.
5. Klepeis J.L. and Floudas C.A. (1999) Free energy calculations for peptides
using deterministic global optimization, Journal of Chemical Physics, 110,
7491-7512.
6. Klepeis J.L. and Floudas C.A. (2002) Ab-Initio Prediction of Helical
Segments in Polypeptides, Journal of Computational Chemistry, 23, 1-22.
7. Floudas C.A. (1995) Nonlinear and Mixed-Integer Optimization:
Fundamentals and Applications, Oxford University Press.
8. Klepeis J.L. and Floudas C.A. (2002) Prediction of Beta-Sheet Topology
and Disulfide Bridges in Polypeptides, Journal of Computational
Chemistry, in press.
9. Klepeis J.L. and Floudas C.A. (2002) Analysis and Prediction of Loop
Segments in Protein Structures, in preparation.
10. Klepeis J.L., Pieja M.J. and Floudas C.A. (2002) A new class of hybrid
global optimization algorithms for peptide structure prediction: integrated
hybrids, Computer Physics Communications, in press.
11. Klepeis J.L., Pieja M.J. and Floudas C.A. (2002) Hybrid global
optimization algorithms for protein structure prediction: alternating
hybrids, Biophysical Journal, in press.
FM-AF (P0571) - 17 predictions: 17 3D
Folding Machine: Coarse-Grained Folding Dynamics Using
an Implicit-Solvent Potential
A. Colubri1, A. Fernández2,3
1 - Department of Chemistry, University of Chicago, IL 60637,
2 - Institute for Biophysical Dynamics, University of Chicago, IL 60637,
3- Instituto de Matemática, Universidad Nacional del Sur, Consejo Nacional de
Investigaciones Científicas y Técnicas, Bahía Blanca 8000, Argentina
acolubri@uchicago.edu
The CASP5 predictions submitted by this group were generated with a
coarse-grained ab-initio folding algorithm implemented as a computer program called
Folding Machine (FM). This algorithm is based on three key simplifications:
1. The folding dynamics is computed in the space of backbone torsional angles
by assuming constant bond lengths and plane angles. Furthermore, the
algorithm takes advantage of the local geometrical constraints imposed on
backbone motion: the torsional dynamics is subordinated to a discrete process
of hopping between the Ramachandran basins accessible for each residue.
2. The conformations of the side-chains are space-averaged by using soft
spheres to represent the actual side-chain geometry. This is based on the
assumption that the timescales of side-chain dihedral motions are much faster
than those of the main-chain torsional angles.
3. An empirical intramolecular potential with an implicit treatment of the
solvent is used to quantify the stability of the conformations generated along
the folding pathway. This potential is constructed so that distinctive local
environments shaped by the chain during folding are treated through
many-body correlations defining a rescaling of the two-body energy contributions.
Earlier versions of the algorithm are described in [1-2]. The FM has been used
previously to study pathway diversity [3-4] and cooperativity [5-6] in protein
folding. Detailed analyses of points 1, 2 and 3 are given in these references.
The dynamics can be briefly described as follows:
a. At time t, the extent of structural involvement of each residue is quantified
by means of our empirical potential function. In this way, we assign a certain
probability of basin hopping to each residue.
b. According to these hopping probabilities, the residues that change basin in
the interval [t, t+dt] (where dt = 10^-8 s) are determined, and an initial selection
of phi-psi coordinates is made for those residues.
c. Using the resulting selection of basins as a constraint, a simulated annealing
optimization of the protein conformation is performed in order to improve the
nascent secondary and tertiary structures. This intra-basin optimization is not
performed all the time, but with a given frequency which is a parameter of our
algorithm.
Now the main aspects of steps a, b and c will be presented (for more details see
refs. [3-4]). Let P(k, t) denote the hopping probability of residue k at
instant t, that is, the probability that residue k undergoes a basin change at time t.
This quantity is defined by P(k, t) = exp[-dG(k, t)/RT], where dG(k, t) is the
change in free energy associated with the basin hopping of k, assuming that all
the interactions that depend on residue k are destroyed during this basin change.
The empirical potential function used to evaluate dG(k, t) has the following
terms: U_excluded-volume + U_solvophobic + U_coulombic + U_dipolar + U_hydrogen-bond.
In a zeroth-order approximation, generically denoted U0, each of these terms can be
expressed as a sum of pairwise contributions U0(i, j). Under this approximation,
the potential function does not reflect the effect of local solvent environments
on the stability of the dielectric-dependent interactions (solvophobic, coulombic
and hydrogen-bond). In order to incorporate this effect, we rescale the
zeroth-order contribution of each pair, U0(i, j), by introducing renormalization factors
fi and fj which depend on the level of desolvation of residues i and j. Thus the
rescaled pairwise energy which implicitly includes the solvent effect is U(i, j) =
fi fj U0(i, j), where fi = fi(Li) and Li is the extent of desolvation of residue i.
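The two formulas above, P(k, t) = exp[-dG(k, t)/RT] and U(i, j) = fi fj U0(i, j), can be written out as a minimal sketch. Several things here are assumptions and not part of the abstract: the energy units (kcal/mol), the temperature, treating dG <= 0 as a certain hop, and drawing one uniform random number per residue to decide which residues hop in a time step. The renormalization factors are simply passed in as numbers, because the form of fi(Li) is not given in the text.

```python
import math
import random

R_KCAL = 1.987204e-3  # gas constant in kcal/(mol K); the unit choice is an assumption


def hopping_probability(dG, temperature=300.0):
    """P(k, t) = exp(-dG / RT); dG <= 0 is treated as a certain hop (assumption)."""
    if dG <= 0.0:
        return 1.0
    return math.exp(-dG / (R_KCAL * temperature))


def residues_that_hop(dG_by_residue, rng, temperature=300.0):
    """Pick the residues that change basin in one time step (hypothetical scheme)."""
    return [
        k for k, dG in sorted(dG_by_residue.items())
        if rng.random() < hopping_probability(dG, temperature)
    ]


def rescaled_pair_energy(u0_ij, f_i, f_j):
    """U(i, j) = f_i * f_j * U0(i, j), with the factors supplied by the caller."""
    return f_i * f_j * u0_ij


rng = random.Random(0)
hopping = residues_that_hop({1: 0.0, 2: 1000.0}, rng)  # made-up dG values
print(hopping, rescaled_pair_energy(-2.0, 0.5, 0.5))
```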
The accessible Ramachandran basins for each amino acid and the intra-basin
distributions of phi-psi points were obtained by analyzing the torsional
coordinates of the chains listed in the culled PDB database [7]. Using this data,
the parameters for the discrete basin dynamics were calculated. For example,
the probability of adopting a basin is given by its relative lacunar area.
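One way to read the last sentence is that each accessible basin is chosen with probability proportional to its area. The sketch below follows that reading, which is an interpretation and not something the abstract spells out; the basin names and area values are invented for illustration and are not taken from the culled PDB analysis.

```python
import random


def basin_probabilities(areas):
    """Normalize basin areas so that they sum to one."""
    total = sum(areas.values())
    if total <= 0:
        raise ValueError("total basin area must be positive")
    return {basin: area / total for basin, area in areas.items()}


def sample_basin(areas, rng):
    """Draw one basin with probability proportional to its area."""
    probabilities = basin_probabilities(areas)
    names = list(probabilities)
    weights = [probabilities[name] for name in names]
    return rng.choices(names, weights=weights, k=1)[0]


# Invented areas for the basins accessible to one residue type.
toy_areas = {"alpha": 3.0, "beta": 5.0, "left": 2.0}
probs = basin_probabilities(toy_areas)
chosen = sample_basin(toy_areas, random.Random(1))
print(probs, chosen)
```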
In order to include local correlations between the basin dynamics of neighbor
residues, the I-sites library of local sequence-structure patterns [8] is used to
complement the information encoded in the distributions of Ramachandran
basins. This is done in the following way: when a certain region of the chain
matches in sequence one of the I-sites motifs and its backbone geometry is
close enough to the basin assignment of this motif, then the torsional
coordinates of the motif are applied to all the residues of that region of the
chain. The FM also includes the possibility of using sequence-structure motifs
obtained from the PHD server [9].
The structure optimization step requires a secondary structure assignment done
in parallel with the simulation. The secondary structure assignment algorithm
built in the FM is roughly similar to the DSSP algorithm [10], but is more
error-tolerant in order to recognize imperfectly formed structures. It is also able
to detect the structure topology, that is, the pattern of connections between the
secondary elements, information which is needed in the subsequent structure
optimization.
The structure optimization algorithm is designed to minimize Q(S1, S2) with
restrictions of the form Rn(S1,n, S2,n) < qn*, n = 1, 2, ..., where S1, S2, S1,n, S2,n
stand for different secondary structures detected in the chain, and Q(S1, S2),
Rn(S1,n, S2,n) denote different quantities defined between structures, for
example, the total interaction energy, the hydrogen-bond energy, the alignment
variable (the angle between the end-to-end vectors of both structures), etc.
Using this general optimization algorithm, different optimization schemes can
be constructed in the FM, for example: to minimize the hydrogen-bond energy
between two strands preserving the remaining interactions, or to improve the
alignment pattern between some set of nascent secondary structures without
interfering with the tertiary structures already formed.
1. Fernández A., Colubri A., Appignanesi G. (2001) Finding the
collapse-inducing nucleus in a folding protein. J. Chem. Phys. 114, 8678-8684
2. Fernández A., Appignanesi G., Colubri A. (2001) Semiempirical
prediction of protein folds. Phys. Rev. E 64, 21901-21914
3. Colubri A., Fernández A. (2002) Pathway diversity and concertedness in
protein folding: an ab-initio approach. J. Biomol. Struct. & Dyn. 19, 739-764
4. Fernández A., Colubri A. (2002) Pathway Heterogeneity in Protein
Folding. Proteins: Func., Struct. & Genetics 48, 293-310
5. Fernández A., Colubri A., Berry R.S. (2001) Topologies to geometries in
protein folding: hierarchical and non-hierarchical scenarios. J. Chem. Phys.
114, 5871-5887
6. Fernández A., Colubri A., Berry R.S. (2002) Three bodies correlations in
protein folding: the origin of cooperativity. Physica A 307, 235-259
7. Culled PDB website:
http://www.fccc.edu/research/labs/dunbrack/culledpdb.html
8. Bystroff C., Baker D. (1998) Prediction of Local Structure in Proteins
Using a Library of Sequence-Structure Motifs. J. Mol. Biol. 281, 565-577
9. Rost B., Sander C. (1993) Prediction of protein secondary structure at
better than 70% accuracy. J. Mol. Biol. 232, 584-599.
10. Kabsch W., Sander C. (1983) Dictionary of protein secondary structure:
Pattern recognition of hydrogen-bonded and geometrical features.
Biopolymers 22, 2577–2637
FORTE1 (P0290) - 276 predictions: 276 3D
Fold Recognition Using the FORTE1 Server
K. Tomii and Y. Akiyama
Computational Biology Research Center
National Institute of Advanced Industrial Science and Technology
k-tomii@aist.go.jp
We attempted to submit the prediction results for all CASP5 targets in order to
evaluate a novel fold recognition technique we have devised (K. Tomii,
manuscript in preparation). To this end we have constructed a fold recognition
system FORTE1 based on our new technique.
From earlier experience with distant comparative modeling and fold recognition,
we have recognized that alignment accuracy remains the most important factor
influencing model quality [1-2]. We have also realized that fold
recognition methods utilizing evolutionary information outperform other
methods [3]. Thus, we have devised a novel profile-profile comparison
technique to increase the sensitivity of fold recognition and improve alignment
accuracy. Our method differs from other published methods, such as FFAS [4]
and the method developed by Yona and Levitt [5], which exploit alignment
information, in how it measures the similarity between two profiles.
The FORTE1 system utilizes the sequence profiles of both a target and
templates to predict the structure of target sequence. The sequences of
templates are derived from the ASTRAL [6] (version 1.59) 40% identity list
and the selected PDB entries, which are not registered in SCOP [7] (1.59
release) database. We mainly update the template library according to the
update of PDB database. Because a very powerful computational resource
(http://www.cbrc.jp/magi/) is available at our center, CBRC, up to 20 PSI-BLAST [8]
iterations are performed to prepare the profiles of both
target and templates. The profiles are rebuilt against the NCBI non-redundant
database about every half month during the prediction season.
In profile comparisons the global-local algorithm is employed to build an
optimal alignment of a query sequence profile onto a template one. Statistical
significance of each alignment score is estimated by calculating Z-score with a
simple log-length correction. The candidate templates are sorted by Z-scores,
and the prediction results are then submitted in AL format. These
steps are carried out in a fully automatic manner. The server is available at
http://www.cbrc.jp/forte1/.
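The abstract does not give the form of the log-length correction, so the following is only one plausible sketch and should not be read as the FORTE1 procedure: raw alignment scores are regressed on the logarithm of template length by least squares, and the residuals are converted to Z-scores. The function name and the toy scores and lengths are hypothetical.

```python
import math


def log_length_z_scores(scores, lengths):
    """Z-scores of alignment scores after removing a linear trend in ln(length).

    Assumed scheme: ordinary least squares of score on ln(length), followed by
    standardization of the residuals.
    """
    x = [math.log(n) for n in lengths]
    count = len(scores)
    mean_x = sum(x) / count
    mean_y = sum(scores) / count
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, scores))
    slope = sxy / sxx if sxx > 0 else 0.0
    residuals = [yi - (mean_y + slope * (xi - mean_x)) for xi, yi in zip(x, scores)]
    sd = math.sqrt(sum(r * r for r in residuals) / count)
    if sd == 0:
        return [0.0] * count
    return [r / sd for r in residuals]


# Toy data: the last template scores far above the length trend.
toy_scores = [20.0, 24.0, 28.0, 32.0, 60.0]
toy_lengths = [100, 200, 400, 800, 150]
z = log_length_z_scores(toy_scores, toy_lengths)
ranking = sorted(range(len(z)), key=lambda i: z[i], reverse=True)
print(ranking)
```

Sorting templates by the resulting values mirrors the ranking step described above.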
1. Venclovas C. et al. (2001) Comparison of performance in successive
CASP experiments. Proteins. Suppl. 5, 163-170.
2. Tramontano A. et al. (2001) Analysis and assessment of comparative
modeling predictions in CASP4. Proteins. Suppl. 5, 22-38.
3. Fischer D. et al. (2001) CAFASP2: The second critical assessment of fully
automated structure prediction methods. Proteins. Suppl. 5, 171-183.
4. Rychlewski L. et al. (2000) Comparison of sequence profiles. Strategies
for structural predictions using sequence information. Protein Science. 9
(2), 232-241.
5. Yona G. et al. (2002) Within the twilight zone: a sensitive profile-profile
comparison tool based on information theory. J. Mol. Biol. 315 (5), 1257-1275.
6. Chandonia J.M. et al. (2002) ASTRAL compendium enhancements.
Nucleic Acids Res. 30 (1), 260-263.
7. Murzin A. et al. (1995) SCOP: a structural classification of proteins
database for the investigation of sequences and structures. J. Mol. Biol. 247
(4), 536-540.
8. Altschul S.F. et al. (1997) Gapped BLAST and PSI-BLAST: a new
generation of protein database search programs. Nucleic Acids Res. 25
(17), 3389-3402.
Friesner (P0112) - 174 predictions: 174 3D
Bridging the Gap Between Physical Chemistry and
Bioinformatics
Matthew P. Jacobson1,4, Yuling An1, Tyler J. F. Day2, Volker A.
Eyrich2, Ramy S. Farid2, John R. Gunn2, Susan Harrington1, Xin
Li1, David L. Pincus1, Chaya S. Rapp3, Daron M. Standley2, and
Richard A. Friesner1,*
1 Columbia University Department of Chemistry, 2 Schrödinger, Inc., 3 Yeshiva
University, 4 Currently: UCSF Department of Pharmaceutical Chemistry.
* rich@chem.columbia.edu
Overview
The major emphasis of our participation in CASP5 was the integration of
knowledge-based and physics-based methods for protein structure prediction.
For proteins with low sequence identity, threading methods employing novel
alignment techniques and residue based potential energy functions are used to
identify remote homologues and build low resolution structures. When
identification of the template family is straightforward, the alignment methods
are combined with a physical chemistry based energy function (all-atom force
field including electrostatics, Generalized Born model for the polar component
of the solvation free energy, and a function describing the nonpolar component
of the solvation free energy) which is deployed at one or more points during the
modeling process: model construction, refinement, and/or final scoring. This
energy function is of course substantially more computationally expensive than
most scoring functions used for building and refining protein models, and we
have invested substantial effort towards the development of new sampling
algorithms to accelerate convergence.
Specific notable aspects of our methodology include 1) deliberate sampling of
helix conformations, to complement the usual sampling of side chains and
loops, 2) the construction of several models with biological symmetry and/or
the explicit inclusion of ligands (by analogy to the templates), to improve local
structural details of the models, 3) the use of a new alignment algorithm that
employs both sequence and secondary structure information, 4) the use of a
well-validated fold recognition algorithm that identified reasonable templates
for several difficult targets, including all-alpha helical targets (T129, T139,
T170), that were classified as challenging cases on the CAFASP website, 5) ab
initio generation of unaligned beta sheet regions, and 6) the use of a composite
model building facility, when a combination of multiple templates appeared to
be superior to any single template.
Alignments
Our alignment algorithm used both an amino acid substitution matrix and
secondary structure matching using a profile built from several prediction
servers [1]. A variable gap penalty was employed, with a larger penalty
assigned to gaps within secondary structure elements. In a majority of
comparative modeling cases, the alignments generated with this algorithm were
broadly consistent with alignments generated by other algorithms, such as
PDB-Blast, with differences confined mostly to loops, and some minor shifts (a
few residues) in secondary structure regions. In a few cases, including T133,
the alignment produced using our algorithm differed substantially from others,
although these were mostly targets at the outer fringes of comparative
modeling. In these cases, we used our alignments in preference to the others,
because our algorithm was designed to operate seamlessly over a large range of sequence
identity, ranging from true fold recognition cases, where secondary structure
matching dominates the alignment scoring, to homology modeling, where
sequence matching dominates. Regions with significant variability among
different alignment algorithms, including our own, were isolated for sampling
after model building. In a majority of cases, multiple models were built using
several different alignments and/or different templates, each model was refined
independently, and the lowest energy resultant structure was submitted.
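The variable gap penalty mentioned at the start of this section can be illustrated with a very small sketch. The penalty values, the function name and the use of a one-letter secondary structure string (H helix, E strand, C coil) are assumptions for the example; the abstract states only that gaps within secondary structure elements are penalized more heavily.

```python
def gap_penalty(position, secondary_structure,
                loop_penalty=2.0, element_penalty=8.0):
    """Return the cost of opening a gap at `position` (0-based).

    Gaps inside helices ("H") or strands ("E") cost more than gaps in coil.
    The two penalty values are placeholders, not the authors' parameters.
    """
    state = secondary_structure[position]
    return element_penalty if state in ("H", "E") else loop_penalty


# Hypothetical predicted secondary structure for a nine-residue stretch.
ss = "HHHHCCEEE"
penalties = [gap_penalty(i, ss) for i in range(len(ss))]
print(penalties)
```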
Composite templates were constructed for certain comparative modeling and
many fold recognition targets. With respect to the latter, when the choice of
template was ambiguous (about 15% of the cases), we generated alignments
based on composite templates by enumerating variable regions within a group
of structurally related proteins [2]. This procedure, which in general resulted in
many thousands of candidate structures, was followed by a hierarchical filtering
process that utilized clustering and statistics-based scoring [3].
Model Refinement and Scoring
We have created a new software package, PLOP (Protein Local Optimization
Program) [4], which is explicitly designed to build and refine protein models
using physical chemistry all-atom force fields and implicit solvent models
(specifically a Generalized Born model). The primary emphasis has been on
the development of new sampling algorithms that complement molecular
dynamics and utilize knowledge-based statistics and fast steric screening to
reduce computational expense. The hierarchy of sampling algorithms includes
direct minimization, combinatorial side chain optimization, loop sampling, and
sampling of helix positions/orientations. Together, these sampling algorithms
permit energy-based refinement of homology models. Automated, iterative
refinement was carried out in a parallel manner on up to 20 processors until the
energy ceased to decrease substantially (typically 1–3 days).
The lowest level sampling algorithm is direct minimization, which is
accomplished using a novel multi-scale algorithm based on the Truncated
Newton method [5]. Side chain conformational sampling [6,7] is accomplished
primarily through the use of highly detailed rotamer libraries developed by
Xiang and Honig [8]. Loop prediction utilizes a dihedral angle sampling
procedure for the backbone degrees of freedom to generate many loop
candidates (10^2–10^6), followed by clustering, and finally side chain
optimization and complete energy minimization on representative
conformations in a hierarchical manner. Finally, because helix positions are
also variable among homologs, particularly at relatively low sequence identity,
we have implemented a helix sampling algorithm, in which rigid body (6
degrees of freedom) sampling of helix positions is coupled with loop prediction
on either side of the helix, side chain re-packing, and energy minimization. In
all of these sampling algorithms, the energy function consisted of the all-atom
OPLS force field [7,9,10] and SGB/NP solvent model [11,12].
The sampling strategy for model refinement was informed by 1) location and
number of gaps in the sequence alignment, 2) structural diversity among
proteins in the same family as the template, as evidenced by multiple structure
alignments, 3) non-conservative amino acid substitutions, particularly those
involving Gly and Pro, 4) known structural problems with the initial model
(e.g., steric clashes), and 5) local sequence similarity in different portions of the
alignment.
Case-Specific Strategies
Composite Model Building: The submitted models for T132, T149 (N-terminal
domain), T186, and T192 were each constructed from two homologous
templates in PLOP. Models for T136, T146, T147, T162, T172 (domain 2),
T173, T174, T181, T187, and T194 were constructed from several structurally
similar proteins using the automated method of generating and filtering
composite templates.
Biological Symmetry: Several targets (T151, T160, T167, T184 N-terminal
domain, T189, T190) were specified to be homodimers or homotetramers (or
assumed to be so based on the template protein). Neglect of the inter-chain
interactions could lead to significant errors, for example due to exposure of
hydrophobic residues that are buried in the biologically relevant complex. For
this reason, we used symmetry operations, derived from the template structures,
to replicate the monomer appropriately. All copies of the monomer retain the
same conformation at all times during the refinement, thus reducing the
sampling effort required [6].
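Replicating a monomer with symmetry operations amounts to applying a rotation and a translation to every atom. The sketch below shows this for a list of coordinates in plain Python; the function names and the example twofold axis are hypothetical, and in practice the operations would be derived from the template structure as described above.

```python
def apply_operation(coords, rotation, translation):
    """Apply x' = R x + t to each (x, y, z) coordinate."""
    return [
        tuple(
            sum(rotation[row][col] * point[col] for col in range(3)) + translation[row]
            for row in range(3)
        )
        for point in coords
    ]


def build_oligomer(monomer, operations):
    """Return one copy of the monomer per (rotation, translation) pair.

    Every copy is generated from the same monomer, so all chains share one
    conformation, which is the property the text relies on.
    """
    return [apply_operation(monomer, rot, trans) for rot, trans in operations]


identity = ([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]], [0.0, 0.0, 0.0])
# Example twofold rotation about the z axis (made up for illustration).
twofold_z = ([[-1.0, 0.0, 0.0], [0.0, -1.0, 0.0], [0.0, 0.0, 1.0]], [0.0, 0.0, 0.0])

monomer = [(1.0, 2.0, 3.0), (4.0, 0.0, -1.0)]
dimer = build_oligomer(monomer, [identity, twofold_z])
print(dimer)
```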
Explicit Inclusion of Ligands: HETATM groups were explicitly included in
model building and refinement for several targets, either because the CASP
instructions specified their existence, or because the biological function of the
protein would presumably require a cofactor. Examples include the fatty acid
ligand and several strongly conserved water molecules in T137, a Zn ion in
T141, the CoA cofactor in T169, two Co ions in T182, and a Mg ion in the
N-terminal domain of T184.
Ab Initio Sheet Construction: Unaligned beta sheets were constructed using an
algorithm that enumerates all possible ways of combining unpaired strands and
extending existing sheets, given a fixed assignment of strand residues [13].
New sheet topologies were screened according to strand-strand connectivity
[14] and loop crossing [15], and scored based on hydrophobic contacts [14].
Each model topology was then simulated independently and ranked as with
other FR predictions. This method was applied to several difficult targets,
including T130, T140, T148, T149, and T156.
1. An Y.L. and Friesner R.A. (2002) A novel fold recognition method using
composite predicted secondary structures. Proteins 48, 352–366.
2. An Y.L. (2002) Homology-based protein structure prediction: fold
recognition and alignment. Ph.D. Thesis, Columbia University.
3. Eyrich V.A. et al. (2001) Ab initio protein structure prediction using a
size dependent tertiary folding potential. Adv. Chem. Phys., 120.
4. Jacobson M.P. and Friesner R.A. Unpublished.
5. Schlick T. and Overton M. (1987) A powerful truncated Newton method
for potential energy minimization. J. Comput. Chem. 8, 1025–1039.
6. Jacobson M.P. et al. (2002) On the role of crystal packing forces in
determining protein sidechain conformations. J. Mol. Biol. 320, 597–608.
7. Jacobson M.P. et al. (2002) Force field validation using protein sidechain
prediction. J. Phys. Chem. B, accepted.
8. Xiang Z. and Honig B. (2001) Extending the accuracy limits of prediction
for side-chain conformations. J. Mol. Biol. 311, 421–430.
9. Jorgensen W.L. et al. (1996) Development and testing of the OPLS
all-atom force field on conformational energetics and properties of organic
liquids. J. Am. Chem. Soc. 118, 11225–11236.
10. Kaminski G.A. et al. (2001) Evaluation and reparameterization of the
OPLS-AA force field for proteins via comparison with accurate quantum
chemical calculations. J. Phys. Chem. B 105, 6474–6487.
11. Ghosh A. et al. (1998) Generalized Born model based on a surface
integral formulation. J. Phys. Chem. B 102, 10983–10990.
12. Gallicchio E. et al. (2002) The SGB/NP hydration free energy model
based on the Surface Generalized Born solvent reaction field and novel
non-polar hydration free energy estimators. J. Comput. Chem. 23, 517–529.
13. Gunn J.R. and Friesner R.A. Unpublished.
14. Klepeis J.L. and Floudas C.A. (2002) Prediction of beta-sheet topology and
disulfide bridges in polypeptides. Preprint.
15. Ruczinski I. et al. (2002) Distribution of beta sheets in proteins with
applications to structure prediction. Proteins 48, 85–97.
FROST-MIG (P0047) - 72 predictions: 72 3D
FROST: a filter based fold recognition method
A. Marin1, J. Pothier2, K. Zimmermann1 and J-F. Gibrat1
1 - Mathématique Informatique et Génome, INRA, Jouy-en-Josas, 78352 cedex,
FRANCE 2 - Atelier de Bioinformatique, 12 rue Cuvier, 75005 Paris, France
gibrat@jouy.inra.fr
The FROST method consists of four components: i) a library of cores, ii) a
fitness function that measures the compatibility of a sequence to a fold, iii) an
algorithm for optimal alignment of the sequence onto the fold and iv) a
statistical analysis of the raw scores.
We have clustered all the sequences of proteins in the PDB into groups having
more than 35% identical residues. The best structure from each group (based on
miscellaneous criteria such as the resolution, the number of missing residues,
etc.) was then chosen as a group representative. The corresponding core is
defined as the conserved secondary structure elements, disregarding the loops
of this representative.
The fitness function is based on two distinct sets of parameters. The first set
(1D parameter set) involves only one site (sites are defined as the Cα atoms of the
protein residues) in the structure. The second set (3D parameter set) involves
pairs of sites in contact in the structure. The first set only requires the
knowledge of a degenerate version of the 3D structure, namely, the list of the
structural states. These states are defined by their secondary structures (H helix,
E strand, C coil) and their buried state (b buried, e exposed). The second set, on
the other hand, requires the knowledge of the true 3D structure. Parameters for
both sets are calculated using a definition of information due to Fano [1]. 1D
parameters are a direct extension of BLOSUM matrices in which the known
state of the residues, i.e., Hb, He, Eb, Ee, Cb, Ce, is taken into consideration. 3D
parameters are a further extension in which one considers the cost of replacing a
pair of amino acids in contact in a given structural context by another pair.
Using the 1D parameters we align a profile corresponding to the core (for
which we know the residue states) with a profile corresponding to the query
sequence using a suitably modified dynamic programming algorithm.
Insertions/deletions inside core elements are strongly penalized. For 3D
parameters that involve pairs of residues it is not possible to use the dynamic
programming algorithm and we have to use an exact but slow branch and
bound algorithm [2] for the smallest proteins or a faster heuristic method for the
largest proteins.
Each set of parameters is used in a different filter. Each filter provides, for the
query sequence, scores for being aligned with the database cores. These scores are
only meaningful when they are normalized. In addition one must also evaluate
the significance of the normalized score. In FROST this is done empirically for
each core in the database by aligning a set of true protein sequences unrelated
to the core, thus providing an empirical distribution of scores (see
[3] for details).
We have developed a test database that allows us to empirically determine
which threshold for the filter normalized scores must be used to obtain a given
rate of error (say 1% or 5%). Note that results of both 1D and 3D filters are
simultaneously considered. For this we use an approach similar to the technique
known as support vector machine [4] where points in an M-dimensional space
are separated into different classes by hyperplanes. Here M=2 and hyperplanes
are just lines. Using the test database we have determined the position of the
lines for obtaining a given error rate.
Using the test database we showed that for an error rate of 1% we are able to
detect 60% of all related pairs. Under the same conditions, PSI-BLAST only detects
30% of the related pairs.
1. Fano R.M., Transmission of information: a statistical theory of
communication. MIT Press, Cambridge, 1961.
2. Lathrop R.H. and Smith T.F., (1996) Global optimum protein threading
with gapped alignment and empirical pair score functions. J. Mol. Biol.
255, 641-665.
3. Marin A., et al (2002), FROST: a filter based fold recognition method,
Proteins, 49, 493-509.
4. Vapnik V., Statistical learning theory, Wiley, New York, 1998.
FUGUE2 (P0014) - 330 predictions: 330 3D
FUGUE3 (P0226) - 330 predictions: 330 3D
Homology Recognition Using Environment-Specific
Substitution Scores Enriched with Homologous Sequence
Information
K. Mizuguchi, T.L. Blundell, H.S. Gweon, J. Shi* and
L.A. Stebbings
Department of Biochemistry, University of Cambridge, 80 Tennis Court Road,
Cambridge CB2 1GA, UK,
kenji@cryst.bioc.cam.ac.uk
Our sequence-structure homology recognition program FUGUE
(http://www-cryst.bioc.cam.ac.uk/fugue/)[1] has been ranked among the top
fold recognition servers in CAFASP2 and LiveBench exercises. Unlike most
other fold recognition servers, it utilizes environment-specific substitution
tables and structure-dependent gap penalties, where scores for amino acid
matching and insertions/deletions are evaluated depending on the local
environment of each amino acid residue in a known structure. Key features
defining local environments (such as secondary structure, solvent accessibility
and hydrogen-bonds) have been examined and various weighting parameters
optimized using extensive benchmark sets [1]. The program has been
successfully used to identify novel homologies [2-3].
Since CASP4, FUGUE has been updated substantially and we now call the new
version FUGUE2. While the original version constructs a position-specific
scoring table (profile) for each family in the HOMSTRAD database
(http://www-cryst.bioc.cam.ac.uk/homstrad)[4] using the environment-specific
substitution tables, the new version enriches it by adding information derived
from the homologous sequences. This is done by taking each sequence from the
HOMSTRAD structural alignment, running PSI-BLAST [5] against the NCBI
nr database and combining all the sequence alignments with the original
structural alignment. The new profile is calculated by assuming that all the
homologues adopt the same environments as the known structures. Our new
benchmarking suggests that the enriched profiles can indeed recognize 10%
more homologues than the original profiles (unpublished). These improvements
are particularly pronounced for protein families that have only one
representative known structure. These single-member families account for more
than half the total HOMSTRAD families.
One important issue is whether multiply-aligned structures with divergent
sequences (as in some HOMSTRAD families) always contain more information
and can improve the fold/homology recognition performance, or rather they can
increase the noise and thus it is better to use profiles derived from individual
single structures. To test this, we set up two separate servers. FUGUE2
searches against a library of profiles, which are derived from the HOMSTRAD
families as described above. FUGUE3 uses an enlarged library, which includes,
in addition to the original HOMSTRAD profiles, all representative single
structures in the PDB, as defined in the culled pdb [6]. The search programs
themselves are identical in FUGUE2 and FUGUE3. Although we cannot draw
firm conclusions at this stage, it appears that in some cases, the single-structure
profiles in FUGUE3 produced significant Z-scores, whereas the multiple-structure
profiles in FUGUE2 failed to recognize the target. Thus, using a
highly redundant structural library such as that in FUGUE3 may further
improve the performance of FUGUE.
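For readers unfamiliar with Z-scores in this setting, the following is a minimal sketch. It assumes the Z-score compares a raw alignment score against a background of scores from shuffled or unrelated sequences; the function name and all numbers are hypothetical, and this is not presented as the actual FUGUE implementation.

```python
from statistics import mean, pstdev


def z_score(raw_score, background_scores):
    """Number of standard deviations by which raw_score exceeds the background mean."""
    mu = mean(background_scores)
    sigma = pstdev(background_scores)
    if sigma == 0:
        raise ValueError("background scores have zero variance")
    return (raw_score - mu) / sigma


# Toy background scores (hypothetical), e.g. from shuffled query sequences.
background = [10.0, 12.0, 14.0, 16.0, 18.0]
print(z_score(30.0, background))
```

Under this reading, a profile "produces a significant Z-score" when its score for the target stands well above the scores that the same profile gives to unrelated sequences.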
1. Shi J. et al. (2001) FUGUE: sequence-structure homology recognition
using environment-specific substitution tables and structure-dependent
gap penalties. J. Mol. Biol. 310, 243-257.
2. Shirai H. et al. (2001) A novel superfamily of enzymes that catalyze the
modification of guanidino groups. Trends Biochem. Sci. 26, 465-468.
3. Witty M. et al. (2002) Structure of the periplasmic domain of
Pseudomonas aeruginosa TolA: evidence for an evolutionary relationship
with the TonB transporter protein. EMBO J. 21, 4207-4218.
4. Mizuguchi K. et al. (1998) HOMSTRAD: a database of protein structure
alignments for homologous families. Protein Sci. 7, 2469-2471.
5. Altschul S.F. et al. (1997) Gapped BLAST and PSI-BLAST: a new
generation of protein database search programs. Nucleic Acids Res. 25
(17), 3389-3402.
6. Wang G. and Dunbrack R.L. Jr. (2002) PISCES: a protein sequence
culling server. Bioinformatics, submitted.
* Present Address: Celltech R&D Inc., 1631 220th Street SE, Bothell, WA 98021, USA
Garnier-Kloczkowski (P0396) - 91 predictions: 91 SS
Combining the GOR V Algorithm With Evolutionary
Information for the Protein Secondary Structure Prediction
from the Amino Acid Sequence
A. Kloczkowski1, R.L. Jernigan1 and J. Garnier2
1 Laboratory of Experimental and Computational Biology, NCI, NIH
2 Analytical Biostatistics Section, Laboratory of Structural Biology, CIT, NIH
jgarnier@jouy.inra.fr
The GOR algorithm has been modified by employing the evolutionary
information provided by multiple sequence alignments, adding triplet
statistics and optimizing various parameters (GOR V [1, 2]). The PSI-BLAST
multiple sequence alignments were used after 5 iterations with an E value of
5 × 10^-4.
The methodological procedure was based on the calculation of the matrices of
the probabilities of the (H, E, C) secondary structure states, P_H(i, j), P_E(i, j)
and P_C(i, j), for each j-th residue in the i-th alignment (with the inclusion of
alignment gaps). The gaps were skipped by the GOR program in the
calculation of the probabilities of various secondary structure conformations,
but the information about them was retained for the averaging purposes. Then
the averages over alignments, <P_H(j)>, <P_E(j)> and <P_C(j)>, were calculated at
the j-th position in the alignment by summing P_H(i, j) (and similarly P_E(i, j) and
P_C(i, j)) over i, and by dividing this sum by the number of alignments,
excluding (in the alignment count) alignments with gaps at the j-th position. In
the alignment matrix A, columns containing gaps in the query sequence were
skipped, contracting the size of the matrix A to the original length of the query
sequence. The prediction of the secondary structure conformation for the j-th
residue was based on the set of three probabilities {<P_H(j)>, <P_E(j)>,
<P_C(j)>}. The secondary structure of the j-th residue was assigned to the
conformation with the largest probability value max{<P_H(j)>, <P_E(j)>,
<P_C(j)>}, modified by introducing decision constant thresholds. The coil state was
predicted only if the calculated probability of the coil conformation was
greater than the probability of the other states (H, E) by the imposed thresholds
(0.15 for E and 0.075 for H).
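The averaging and thresholded assignment described above can be sketched as follows. The data layout (one row of (P_H, P_E, P_C) tuples per aligned sequence, with None at gaps, and a gap-free query in the first row) is an assumption made for illustration, as is the rule of choosing the larger of H and E whenever coil does not clear both thresholds.

```python
def average_probabilities(prob_rows):
    """Average (P_H, P_E, P_C) over aligned sequences at each position.

    prob_rows[i][j] is a 3-tuple for residue j of aligned sequence i, or None
    at a gap. Row 0 is assumed to be the query and to contain no gaps, so
    every column has at least one contributing sequence.
    """
    averages = []
    for j in range(len(prob_rows[0])):
        # Sequences with a gap at position j are left out of the count.
        column = [row[j] for row in prob_rows if row[j] is not None]
        averages.append(tuple(sum(p[k] for p in column) / len(column) for k in range(3)))
    return averages


def assign_state(p_h, p_e, p_c, margin_h=0.075, margin_e=0.15):
    """Predict coil only if it beats H and E by the stated margins."""
    if p_c > p_h + margin_h and p_c > p_e + margin_e:
        return "C"
    # Assumed fallback: the larger of the helix and strand probabilities.
    return "H" if p_h >= p_e else "E"


rows = [
    [(0.6, 0.1, 0.3), (0.2, 0.2, 0.6)],  # query
    [(0.4, 0.2, 0.4), None],             # homologue with a gap at position 1
]
prediction = "".join(assign_state(*avg) for avg in average_probabilities(rows))
print(prediction)
```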
All calculations of the parameters for the observed states were performed with
the translation of the eight state DSSP assignments into the three secondary
structure states H, E and C as follows: DSSP states H and E were
translated to H and E in the three state code, and all other letters of the DSSP
code were translated to coil (C). Additionally helices shorter than 5 residues
(HHHH or less) and sheets shorter than 3 residues (EE or E) were considered as
coils. Similarly the GOR algorithm has a built-in correction scheme, which
removes secondary structure segments that are too short (helices shorter than 4
residues, and sheets shorter than 3 residues) treating them as the most likely
prediction errors.
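The reduction of DSSP assignments to three states, with short segments relabelled as coil, can be sketched as below. The function name is hypothetical; the length limits are the ones stated for the observed states (helices shorter than 5 residues and sheets shorter than 3 residues become coil).

```python
from itertools import groupby

MIN_HELIX = 5   # helices of 4 residues or fewer become coil
MIN_STRAND = 3  # strands of 2 residues or fewer become coil


def reduce_dssp(dssp):
    """Map an eight-state DSSP string to H/E/C and relabel short segments as coil."""
    # Step 1: H and E are kept, every other DSSP letter becomes coil.
    three_state = "".join(s if s in ("H", "E") else "C" for s in dssp)
    # Step 2: runs of H or E that are too short are converted to coil.
    out = []
    for state, run in groupby(three_state):
        length = len(list(run))
        too_short = (state == "H" and length < MIN_HELIX) or (
            state == "E" and length < MIN_STRAND
        )
        out.append(("C" if too_short else state) * length)
    return "".join(out)


print(reduce_dssp("HHHHGGEEETTHHHHHEE"))  # prints CCCCCCEEECCHHHHHCC
```

The built-in correction of the GOR predictions mentioned in the text would use the same pattern with a helix limit of 4 instead of 5.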
When the PSI-BLAST alignment detected a sequence from the PDB with
an E value smaller than 5 × 10^-4 to the query sequence, two prediction
models were given. Model 1 used the PSI-BLAST alignment to transfer the observed
conformation of the PDB sequence to the predicted conformation of the aligned
residues of the query sequence; if more than one PDB sequence was below the
E value threshold, no more than three of them were taken. Most of the three observed
conformations were identical, subject to alignment errors or ends of secondary
structures. Only the DSSP secondary structures displayed by the PDB site were
used, although they are somewhat at variance with the crystallographer assignments and
our own DSSP assignments for the calculation of the GOR parameters. Model 2
was the prediction made as described in the paragraphs above, without taking
into account the observed conformation of the PDB sequences. The order
was chosen expecting that model 1 should be the most accurate even if PSI-BLAST
alignments might differ from structural alignments.
1. Kloczkowski A. et al. (2002) Protein Secondary Structure Prediction
Based on the GOR Algorithm Incorporating Multiple Sequence Alignment
Information. Polymer 43, 441-449.
2. Kloczkowski A. et al. (2002) Combining the GOR V Algorithm With
Evolutionary Information for Protein Secondary Structure Prediction from
Amino Acid Sequence. Proteins: Structure, Function, and Genetics 49,
154-166.
GEM (P0359) - 76 predictions: 76 3D
Comparative Modeling in CASP5
H. Scheib1,2, K. Koretke1, A. Diemand1,2, M. Word1, C. Combet1,3
and E. Migliavacca1,2
1 GlaxoSmithKline, 2 Swiss Institute of Bioinformatics,
3 Institut de Biologie et Chimie des Protéines, University of Lyon
holger.scheib@isb-sib.ch
Comparative modeling in CASP5 was divided into five steps: (1) template
structure selection, (2) alignment, (3) model building, (4) loop building, and (5)
model structure refinement. Of the 65 possible targets, our group submitted
model structures for 40 targets in comparative modeling.
Template structure selection. Template structures were obtained mainly from
SENSER[1] and Match2, and to a lesser extent from 3D-PSSM[2]. In
cases where multiple templates were available for a target, a multistep process
was used: (1) the set of templates was superimposed using either DeepView
(formerly SwissPDBViewer)[3] or STAMP[4], producing a structure-based
multiple sequence alignment; (2) a multiple sequence alignment of the
target protein family was created; (3) the two alignments were combined using
both sequence similarity and mapping of the target's predicted secondary
structure elements to the corresponding ones in the templates. The final
template was selected based on the best "fit" of the target sequence to a
putative template.
Alignment. The target sequence was manually aligned to a single or a set of
superimposed template structures using DeepView or MACAW[5]. The
alignment was guided by conserved residues as identified by multiple sequence
alignments of related proteins, secondary structure prediction (PHD[6] and
results from CAFASP3 server[7]), InterPro signature sequences[8] as well as
hydrophobicity patterns. We attempted to move insertions and deletions into
loop regions to conserve the secondary structure elements, i.e. the core of the
protein. However, in rare cases gaps were placed violating these conditions in
order to retain an overall homologous and compact shape of the resulting
model.
Model building. Models were built using either DeepView or MODELLER[9].
The advantage of using DeepView is that the user keeps full control over the
modeling process, since user intervention is allowed at any step. In
DeepView, coordinates were assigned to the target sequence by applying the
“Build Preliminary Model” option. In MODELLER, the default parameters
were used.
Loop building. When generating models using DeepView, most loops were
built manually, since the automatic loop building facility implemented in
DeepView has its limitations. A loop database scan was carried out to identify
possible solutions for a loop. In cases where this approach failed, loops were
built de novo. Anchor points were chosen manually taking into account the
local environment of the putative loop and neighboring biologically relevant
residues. The resulting loops were ordered according to their force field energy
as implemented in DeepView. The most suitable loop was usually selected
among the top five proposed solutions with the lowest force field energy. This
procedure was applied to loops in models generated by MODELLER wherever
gaps of more than 2 residues occurred.
Model structure refinement. The resulting structures were energy minimized
applying 100 to 200 steps of either Steepest Descent or Conjugate Gradient to
the model using the GROMOS96 force field[10] as implemented in DeepView.
Unfavorable side chain conformations were identified using “Amino Acids
Making Clashes”, “Amino Acids Making Clashes With Backbone” and a force
field energy report. Steric problems were fixed by changing the side chain
rotamer of one or more residues in the affected area. After fixing side chains,
another 100 steps of either Steepest Descent or Conjugate Gradient
minimization were carried out. In all cases, clashes could be removed by this
procedure.
1. Koretke K.K. et al. (2002) Fold recognition without folds. Protein Sci. 11,
1579-1579.
2. Kelley L.A. et al. (2000) Enhanced genome annotation using structural
profiles in the program 3D-PSSM. J. Mol. Biol. 299, 499-520.
3. Guex N. and Peitsch M.C. (1997) SWISS-MODEL and the Swiss-PdbViewer:
an environment for comparative protein modeling.
Electrophoresis 18, 2714-2723.
4. Russell R.B. & Barton G.J. (1992) Multiple protein sequence alignment
from tertiary structure comparison: assignment of global and residue
confidence levels. Proteins 14, 309-323.
5. Schuler G.D., Altschul S.F., Lipman D.J. (1991) A workbench for multiple
alignment construction and analysis. Proteins 9, 180-190.
6. Rost B. (1996) PHD: predicting one-dimensional protein structure by
profile based neural networks. Methods Enzym. 266, 525-539.
7. http://www.cs.bgu.ac.il/~dfischer/CAFASP3/summaries/index.html
8. Apweiler R. (2001) The InterPro database, an integrated documentation
resource for protein families, domains and functional sites. Nucleic Acids
Res. 29, 37-40.
9. Sali A. and Blundell T.L. (1993) Comparative protein modelling by
satisfaction of spatial restraints. J. Mol. Biol. 234, 779-815.
10. van Gunsteren et al. (1996) Biomolecular Simulation, the GROMOS96
Manual and User Guide, Vdf Hochschulverlag AG an der ETH Zuerich,
Zuerich, Switzerland, 1-1042.
GEM (P0359) - 76 predictions: 76 3D
Applying Sequence Homology and Secondary Structure
Prediction in Fold Recognition
K. Koretke1, H. Scheib1 and A. Lupas2
1 - GlaxoSmithKline, 2 - Max-Planck-Institute for Developmental Biology
Kristin.K.Koretke@gsk.com
Summary. All CASP5 targets were submitted to the sensitive search routine
program SENSER[1] to gather information about each sequence's homologous
space. Secondary structure and other fold recognition predictions were obtained
from CAFASP2 server[2]. Additional sequence searches were done using
regular expression patterns and HMMs. If a protein of known structure
appeared to match the properties of the target, alignments were generated using
a combination of MACAW[3], PSI-Blast[4] and HMMer[5] with final
adjustments made by mapping conserved hydrophobicity patterns with
secondary structure elements. A total of 18 targets were predicted as fold
recognition; 4 had templates identified using SENSER, 4 were predicted based
on distant homology detected through other methods and 10 were predicted
through secondary structure patterns.
Details. SENSER is a multi-step program that uses PSI-Blast to search
sequence space and identify distantly related sequences for a given query
sequence. In the first step SENSER performs a PSI-Blast search with the target
sequence for a maximum of 6 iterations. Proteins identified in the search are
divided into a significant sequence space, containing those sequences with an E
value lower than 10^-2, and a 'trailing end' of sequences between 10^-2 and 10.
Because some of the proteins detected may contain unrelated domains, all
proteins are trimmed to the actual region detected in the PSI-Blast run.
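The first SENSER step, as described, splits PSI-Blast hits by E-value. A minimal sketch follows; the input format (pairs of sequence identifier and E-value) and the function name are assumptions, while the two cut-offs are the ones quoted in the text.

```python
SIGNIFICANT_E = 1e-2     # hits below this E-value form the significant sequence space
TRAILING_MAX_E = 10.0    # upper bound of the 'trailing end'


def partition_hits(hits):
    """Split (sequence_id, e_value) pairs into significant and trailing-end lists."""
    significant, trailing = [], []
    for seq_id, e_value in hits:
        if e_value < SIGNIFICANT_E:
            significant.append(seq_id)
        elif e_value <= TRAILING_MAX_E:
            trailing.append(seq_id)
        # Hits with E-values above 10 are ignored.
    return significant, trailing


hits = [("seqA", 1e-30), ("seqB", 5e-3), ("seqC", 0.5), ("seqD", 9.0), ("seqE", 50.0)]
print(partition_hits(hits))
```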
In the second step, transitive searches are used to expand the significant
sequence space. Only proteins within the significant sequence space that have
less than 30% identity to the target sequence are used as starting points for
further PSI-Blast searches, in order to avoid redundant searches, i.e. those that
produce similar profiles and sequence spaces. This value was chosen as it is a
frequently quoted threshold for the 'twilight zone', below which sequences
cannot be confidently said to be homologous.
In the third step trailing-end sequences are tested for their ability to back-validate, i.e. detect any sequence of the significant sequence space of the target
in PSI-Blast. Because several PSI-Blast searches were performed to establish
the significant sequence space, trailing-end sequences are pooled and ranked
first by number of occurrences and second by E-value, before being tested. If a
trailing-end sequence back-validates, its significant sequence space is added to
that of the target. The process is then repeated until no further sequences are
detected.
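The pooling and ranking of trailing-end sequences before back-validation can be sketched as below. The text specifies ranking first by number of occurrences and second by E-value; using the best (lowest) E-value seen for each sequence across searches is an assumption, as are the function name and the input layout.

```python
from collections import defaultdict


def rank_trailing_end(search_results):
    """Rank pooled trailing-end sequences by occurrence count, then by E-value.

    search_results is a list with one entry per PSI-Blast search; each entry
    is a list of (sequence_id, e_value) trailing-end hits.
    """
    counts = defaultdict(int)
    best_e = {}
    for hits in search_results:
        for seq_id, e_value in hits:
            counts[seq_id] += 1
            best_e[seq_id] = min(e_value, best_e.get(seq_id, e_value))
    # Most frequent first; ties broken by the lowest E-value seen.
    return sorted(counts, key=lambda s: (-counts[s], best_e[s]))


searches = [
    [("x", 0.5), ("y", 2.0)],
    [("y", 1.0), ("z", 0.1)],
    [("y", 3.0), ("x", 0.05)],
]
print(rank_trailing_end(searches))  # prints ['y', 'x', 'z']
```

Each sequence in the ranked list would then be used as a PSI-Blast query and accepted only if it detects a member of the target's significant sequence space.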
Domains were automatically predicted for each target and identified using
alignment information from the final iteration of the target sequence's PSI-Blast
run. A domain was identified if a significant sequence aligned to less than 50%
of the target sequence. The boundaries of a domain were determined by the
maximum overlap of a target to all of the significant sequences that overlapped
the same region and only aligned to less than 50% of the target sequence. If a
domain was predicted, a new SENSER run was initiated with the target
sequence trimmed to the predicted domain region.
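The domain rule above can be illustrated with a short sketch. It assumes that hits are given as 1-based inclusive (start, end) ranges on the target, and it interprets "maximum overlap" as the union of all overlapping partial hits; both the interpretation and the function name are assumptions rather than a description of the SENSER code.

```python
def predict_domains(target_length, hit_ranges, max_fraction=0.5):
    """Return merged (start, end) regions covered by hits spanning < 50% of the target."""
    # Keep only significant hits that align to less than half of the target.
    partial = sorted(
        (start, end)
        for start, end in hit_ranges
        if (end - start + 1) / target_length < max_fraction
    )
    domains = []
    for start, end in partial:
        if domains and start <= domains[-1][1]:
            # Overlaps the current domain: extend its right boundary.
            domains[-1] = (domains[-1][0], max(domains[-1][1], end))
        else:
            domains.append((start, end))
    return domains


print(predict_domains(300, [(1, 290), (10, 120), (30, 140), (200, 280)]))
# prints [(10, 140), (200, 280)]
```

Each predicted region would then seed a new search with the target sequence trimmed to that region.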
If SENSER identified a potential template structure, its match with the target
was evaluated using predicted secondary structure, the occurrence of sequence
patterns, and biochemical information. The alignment was generated using
MACAW, HMMer, PSI-Blast or a combination of these methods to produce
the alignment that seemed most plausible to us based on conserved residues,
hydrophobicity, and secondary structure.
If SENSER did not identify a potential template structure, regular expression
patterns, predicted secondary structure, other fold recognition predictions and
biochemical information were used to search for possible templates. In
addition, in cases where the target was only a fragment of a larger protein, the
entire protein was used in sequence searches. If a template was judged to match
the properties of the target, an alignment was produced using MACAW,
HMMer, Clustal[6], or a combination of these methods, to produce the
alignment that seemed most plausible to us based on conserved residues,
hydrophobicity, and secondary structure.
1. Koretke K.K., Russell R.B., Lupas A.N. (2002) Folds without a Fold.
Protein Science 11(6):1575-9.
2. http://www.cs.bgu.ac.il/~dfischer/CAFASP3/summaries/index.html
3. Schuler G.D., Altschul S.F., Lipman D.J. (1991) A workbench for multiple
alignment construction and analysis. Proteins 9, 180-190.
4. Altschul S.F. et al. (1997) Gapped BLAST and PSI-BLAST: a new
generation of protein database search programs. Nucleic Acids Res. 25
(17), 3389-3402.
5. Eddy S.R. (1998) Profile hidden Markov models. Bioinformatics 14, 755-763.
6. Thompson J.D., Higgins D.J., and Gibson T.J. (1994) CLUSTALW:
improving the sensitivity of progressive multiple sequence alignment
through sequence weighting, positions-specific gap penalties and weight
matrix choice. Nucleic Acids Res. 22, 4673-4680.
GeneSilico (P0517) - 195 predictions: 86 3D, 64 SS, 45 RR
From Fold-Recognition Analysis via the Genesilico MetaServer, to Modeling and Refinement by Several Predictors, to
Uniform Evaluation and Generation of Hybrid Models
M. Kurowski, M. Feder, J. Kosinski, I. Cymerman, J. Sasin,
J.M. Bujnicki
International Institute of Molecular and Cell Biology (IIMCB) in Warsaw.
Trojdena 4, 01-109 Warsaw, Poland
iamb@genesilico.pl
Assessments of protein structure prediction (CASP, CAFASP, Livebench) have
demonstrated that fold recognition (FR) methods can identify remote
similarities when standard sequence search methods fail, but the reported
target-template alignments are often only partially correct, leading to models
with misfolded parts. The use of additional information, such as secondary
structure (SS), and/or localization of ligand-binding residues can help to
improve the target-template alignments. Moreover, models constructed from
multiple parents are often found to be more accurate than models constructed
from single parents only. The final prediction accuracy can therefore be
improved if the best fragments obtained from various FR alignments can be
judiciously combined to generate a consensus model.
Based on our experience with the meta-server approach to protein structure
prediction, and both fully automated and expert-refined fold-recognition
analysis in CASP4 and CAFASP2, we developed a novel fold-recognition
gateway, which combines the useful features of other meta-servers available
previously with the greater flexibility of the input (the beta version of the new
tool is available at http://genesilico.pl/meta).
We attempted to identify as many homologs of the target
sequence as possible. For this purpose, we created a database of putative
translation products (length>20aa) of all unfinished genomes, whose sequences
were publicly available. This allowed a roughly two-fold increase of the size of
the non-redundant database. For a few targets, this increased
the size of the multiple sequence alignment from ~3 sequences to >10
sequences and allowed much better delineation of conserved and variable regions. The
alignments were used to divide the query sequence into domain-size fragments.
Fold recognition analysis was carried out for the single sequences, for the
individual domains, as well as for the alignment sections corresponding to the
individual domains. In the case of submission of alignments for the fold-recognition analysis, two options were used: i) columns with >30% of gaps
were deleted (i.e. only the core regions were analyzed) and ii) gaps were treated
as unknown characters (X) (i.e. the variable regions of the target sequence were
“extended” to the maximal size, using the longest insertions present in
homologous sequences as the reference).
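The two ways of preparing an alignment for submission can be sketched as follows. The function names are hypothetical and the alignment is assumed to be a list of equal-length strings with "-" as the gap character; the 30% limit is the one stated in the text.

```python
def drop_gappy_columns(alignment, max_gap_fraction=0.30):
    """Option i): delete columns in which more than 30% of the sequences have a gap."""
    n_seqs = len(alignment)
    keep = [
        j
        for j in range(len(alignment[0]))
        if sum(seq[j] == "-" for seq in alignment) / n_seqs <= max_gap_fraction
    ]
    return ["".join(seq[j] for j in keep) for seq in alignment]


def gaps_to_unknown(alignment):
    """Option ii): treat gaps as unknown residues (X), keeping every column."""
    return [seq.replace("-", "X") for seq in alignment]


alignment = ["MK-LV", "MKALV", "M--LV", "MK-LI"]
print(drop_gappy_columns(alignment))  # prints ['MKLV', 'MKLV', 'M-LV', 'MKLI']
print(gaps_to_unknown(alignment))     # prints ['MKXLV', 'MKALV', 'MXXLV', 'MKXLI']
```

Option i) restricts the analysis to the core, whereas option ii) keeps the longest insertions so that variable regions are represented at their maximal length.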
Results of fold-recognition analysis carried out via our meta-server for all the
variants of the target sequence as well as all FR and ab initio predictions
obtained from the CAFASP website were collected and presented to a team of
six human predictors. Their varying experience in protein sequence analysis
and modeling notwithstanding, all members of the GeneSilico team attempted
to build and refine the models independently. They used different software
(Modeller, Swiss-model, MOE, WhatIf, and ICM-Pro) and applied different
refinement protocols. The purpose of this exercise was to sample the “model
space” in the vicinity of the solution suggested by the consensus between the
fold-recognition methods. This sampling was not meant to be too extensive and
was carried out with an assumption that the knowledge-based refinement
carried out by the human predictor in the case of crude FR models is superior to
the refinement carried out by the fully automated procedures. All models were
evaluated using independent criteria (Verify3D and ProsaII) and compared with
each other. Following superposition of all modeled structures, a consensus
model was built from the best-scoring fragments of all models. The consensus
model was re-evaluated and further refined or its parts were replaced with parts
from other models, if generation of a “hybrid” model resulted in deterioration
of the score due to apparent incompatibility of fragments of different
preliminary models. Following the manual correction of selected sidechains
and energy minimization with the GROMOS forcefield, the refined model was
submitted in the TS format. The same procedure applied to the targets in the
homology modeling, fold recognition and “novel folds” categories.
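The idea of assembling a consensus from the best-scoring fragments can be illustrated with a deliberately simplified sketch. It assumes that all models are already superimposed and cover the same residues, that a per-residue quality score is available for each model (higher is better), and that fragments are fixed-length windows; none of these details is stated in the abstract, where fragment choice also involved manual inspection.

```python
def choose_fragments(per_residue_scores, window=5):
    """For each window of residues, pick the model with the highest summed score.

    per_residue_scores maps a model name to a list with one score per residue.
    Returns a list of (start, end, model_name) with end exclusive.
    """
    names = list(per_residue_scores)
    length = len(per_residue_scores[names[0]])
    choice = []
    for start in range(0, length, window):
        end = min(start + window, length)
        best = max(names, key=lambda n: sum(per_residue_scores[n][start:end]))
        choice.append((start, end, best))
    return choice


# Toy per-residue scores for two hypothetical models of a 10-residue target.
scores = {
    "model_A": [0.5] * 5 + [0.1] * 5,
    "model_B": [0.2] * 5 + [0.4] * 5,
}
print(choose_fragments(scores))  # prints [(0, 5, 'model_A'), (5, 10, 'model_B')]
```

A hybrid built this way would still need the re-evaluation step described in the text, because fragments taken from different models may be mutually incompatible.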
The final model and all the well-scoring parts of the intermediate models were
used to calculate the average residue-residue separation distances (submitted as
the RR category). The secondary structure of the final model was inferred
according to DSSP and combined with the independent alignment/sequence-based prediction to generate the output in the SS format. For targets with no
reliable models of the tertiary structure, the independent SS prediction was
based solely on alignment-based predictions.
GERLOFF (P0240) - 9 predictions: 9 3D
Incorporation of Constraints Derived from Active/Functional
Site Predictions in Protein Tertiary Structure Assembly
R. Schmid, D. C. Soares, Z. A. M. Hussein, B. J. Mitchell,
R. S. Hamilton and D. L. Gerloff
Biocomputing Research Unit & Structural Biochemistry Group,
Institute of Cell and Molecular Biology,University of Edinburgh, UK
d.gerloff@ed.ac.uk
We have submitted tertiary structure predictions for five CASP5 target proteins
in order to investigate how knowledge and/or predictions about
functional sites in these proteins can be used in combination with established
structure prediction methods. The degrees of difficulty assigned to the
prediction targets, and the categories in which our predictions are considered,
vary. Similarly, the way in which functional site information is used, and its
impact on the final model varies slightly from target to target.
Our primary postulates are that: (a), the interchange between structure and
functional prediction (or knowledge) leads to improvement at both ends;
(b), formulation/adaptation of systematic fold-specific heuristics and function-specific heuristics is possible, at least for certain folds and functions; and
(c), prediction of structure/function can go beyond trying to find re-occurrences
of known cases.
While we found little opportunity within the set of CASP5 targets to
demonstrate and/or test postulate (b), we attempted to use function prediction/
knowledge in all predictions we submitted. Primarily, we used predicted key
residues in proteins presumed to function as enzymes to “anchor” threading
alignments (in T0130, T0173, and to an extent in T0136 and T0132) so that
their arrangement in the model would allow catalysis. In T0129, we could not
find a suitable fold template and used the presumed proximity of presumed
functional residues to guide the assembly of helices ab initio.
Besides our emphasis on formulating distance and/or geometrical constraints
for our models based on functional site prediction from multiple sequence
alignments, another unifying link between our submissions is the consideration
of predicted Surface/Interior/ActiveSite/Parse positions (SIAP-predictions)
according to the approach by Benner & Gerloff [1], in refining our
predictions/threading alignments. Prediction of key residues from multiple
sequence alignments (typically generated using ClustalX on sequences
retrieved by standard BLAST/PSI-BLAST searches on the nr database, with
subsequent manual editing (!)) was generally based on complete, or high,
conservation of functional type amino acids, sometimes taking into
consideration patterns of conservation similar to those described in [1]. The
choice of template structures used in our predictions was often influenced by
the publicly available CAFASP2 predictions by automated servers, albeit not
exclusively. Here again, the compatibility between the folds and biologically
sensible arrangements of predicted key residues was our primary criterion in
non-obvious cases. Secondary structure predictions by CAFASP2-servers were
used by default but often refined according to [1] and in the course of
modeling. Refer to individual prediction headers for further details of interest,
particularly the speculative functional roles of individual predicted key residues
in predictions where this was at all possible. These blind predictions of
functional aspects are influenced by the structure predictions as much as vice
versa.
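As an illustration of the conservation criterion described above, the following minimal Python sketch flags alignment columns in which a single functional-type amino acid is completely or highly conserved. The residue set FUNCTIONAL, the 0.9 threshold, the function name and the toy alignment are assumptions made for this example; the predictions themselves relied on manually edited alignments and expert judgement that the sketch does not reproduce.

```python
from collections import Counter

# Assumed set of "functional type" amino acids (illustrative, not from the abstract).
FUNCTIONAL = set("DEHKRSTCYNQ")


def candidate_key_positions(alignment, min_fraction=0.9):
    """Return (column, residue, fraction) for highly conserved functional residues."""
    if not alignment:
        return []
    length = len(alignment[0])
    if any(len(seq) != length for seq in alignment):
        raise ValueError("all aligned sequences must have the same length")
    hits = []
    for col in range(length):
        column = [seq[col] for seq in alignment]
        # Most frequent character in this column and how often it occurs.
        residue, count = Counter(column).most_common(1)[0]
        fraction = count / len(column)
        # Gaps and non-functional residues are never reported.
        if residue in FUNCTIONAL and fraction >= min_fraction:
            hits.append((col, residue, fraction))
    return hits


# Toy alignment (hypothetical sequences, already aligned).
toy_alignment = [
    "MKHDLAG",
    "MRHDIAG",
    "LKHDVSG",
    "MKHELAG",
]
print(candidate_key_positions(toy_alignment))
```

Lowering min_fraction (for example to 0.75) also reports columns with "high" rather than complete conservation, which would then need manual inspection.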
While the “manual component” in our CASP-predictions is obviously
significant, our goal is to identify systematic aspects in the way biochemists’
knowledge influences (and quite often improves) tertiary structure predictions,
with the goal of providing “refinement modules” for existing automated
methods. Besides functional site assembly, consideration of the usually
observed pseudo-symmetry in protein quaternary structures is under-explored
in our field, and we believe that the prediction of (non-transient) quaternary
structure besides tertiary structure would be a highly relevant addition to future
CASPs. Interesting quaternary structure cases in the targets we considered were
T0132 and T0136. Again, further development of efforts in this direction
could benefit both tertiary and quaternary structure prediction.
1. Benner S.A., Cannarozzi G.M., Gerloff D., Turcotte M. and
Chelvanayagam G. (1997) Bona fide predictions of protein secondary
structure using transparent analyses of multiple sequence alignments.
Chem. Reviews 97, 2725-2843
Ginalski (P0453) - 71 predictions: 71 3D
Modeling of CASP5 Target Proteins with 3D-CAM
K. Ginalski1,2
1 - Interdisciplinary Centre for Mathematical and Computational Modelling,
Warsaw University, Warsaw, Poland, 2 - BioInfoBank Institute, Poznań, Poland
kginal@bioinfo.pl
For the fifth round of Critical Assessment of Techniques for Protein Structure
Prediction (CASP5), 67 target proteins were modeled using the 3D-Consensus
Alignment Method (3D-CAM). The issue of sequence-to-structure alignment of
target sequences with their respective parent structures was the main emphasis,
and as shown in previous rounds of CASP, this part of the modeling procedure
is the major source of errors. The critical steps in modeling, namely selection of
template(s) and generation of the sequence-to-structure alignment, were based on
the results of secondary structure prediction and tertiary fold recognition
carried out using the Meta Server [1].
Initially, related proteins with known structures were identified from the
consensus of the Meta Server results. For difficult targets, template (fold)
identification was based on the results of the 3D-Jury method (Rychlewski L.,
unpublished). Structural determinants of the fold were then analyzed: all the
structures representing a given fold, and corresponding structural alignment
extracted from the FSSP database [2], were inspected for both conservation and
variability of the structural elements. Conservation of specific residues and
contacts responsible for maintaining tertiary structure, and critical for substrate
binding and/or catalysis, were also established. Additionally, homologous
sequences that matched the targets were collected with PSI-BLAST searches
[3] performed against the non-redundant protein sequence database and
unfinished genomes until profile convergence. The CLUSTAL W program [4]
was used to generate multiple sequence alignments for sets of sequences
containing target, and other closely-related proteins, to identify conserved
residues within the family.
All alignments produced by different servers interacting with the Meta Server
were inspected for both variability and violation of structural integrity. Initial
alignment was obtained by taking, in most cases, the common alignment for
each region (mainly for each secondary structure element), taking into account
the structural alignment of templates where possible, within the context of the
structural and sequential constraints identified above. In some cases close
homologues were also submitted to the Meta Server as the query sequences.
For regions that displayed low stability (i.e. highly dependent on the server),
possible alignment variants were derived manually, guided mainly by
secondary structure predictions.
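The idea of taking the common alignment for each region and flagging server-dependent regions can be sketched as follows. This is only an illustrative assumption of how such a consensus might be computed: each server alignment is represented as a hypothetical dictionary mapping target positions to template positions, and the 0.6 agreement threshold is arbitrary. The procedure described above was largely manual and also used structural constraints, which this sketch ignores.

```python
from collections import Counter


def consensus_alignment(server_alignments, target_length, min_agreement=0.6):
    """For each target position return (template index or None, agreement, stable).

    server_alignments: list of dicts {target position: template position};
    a position missing from a dict counts as "unaligned" (None) for that server.
    """
    if not server_alignments:
        raise ValueError("at least one server alignment is required")
    n_servers = len(server_alignments)
    result = []
    for pos in range(target_length):
        votes = Counter(aln.get(pos) for aln in server_alignments)
        template_idx, count = votes.most_common(1)[0]
        agreement = count / n_servers
        result.append((template_idx, agreement, agreement >= min_agreement))
    return result


# Three hypothetical server alignments for a 4-residue target.
servers = [
    {0: 10, 1: 11, 2: 12, 3: 13},
    {0: 10, 1: 11, 2: 13, 3: 14},
    {0: 10, 1: 11, 2: 12},
]
consensus = consensus_alignment(servers, target_length=4)
unstable = [pos for pos, (_, _, stable) in enumerate(consensus) if not stable]
print(unstable)  # positions where alternative alignment variants would be derived
```

Positions reported as unstable correspond to the "low stability" regions for which alternative alignment variants would be considered.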
All plausible alternative sequence-to-structure alignments were tested by
building 3D molecular models for the target sequence with the Homology
module of InsightII (Accelrys Inc., San Diego, CA). Backbone conformation
was taken from the template structure, and only non-conserved side chains
were substituted. Modeling of loops that contained insertion and deletion
regions was skipped in this procedure. Models were then subjected to detailed
evaluation, mainly by visual inspection of structural consistency and using
Verify3D [5] and ProsaII [6] energy profiles. Such a 3D evaluation procedure
enabled selection of final sequence-to-structure alignments.
Final models of target proteins were built using the MODELLER program [7].
Where possible, more than one template protein was used, after
superimposition of their molecular structures. The overall quality of each
modeled structure was checked in detail with the WHAT_CHECK program [8].
No energy minimization procedures were employed.
1. Bujnicki J.M. et al. (2001) Structure prediction meta server. Bioinformatics
17 (8), 750-751.
2. Holm L. et al. (1996) Mapping the protein universe. Science 273 (5275),
595-603.
3. Altschul S.F. et al. (1997) Gapped BLAST and PSI-BLAST: a new
generation of protein database search programs. Nucleic Acids Res. 25
(17), 3389-3402.
4. Thompson J.D. et al. (1994) CLUSTAL W: improving the sensitivity of
progressive multiple sequence alignment through sequence weighting,
position-specific gap penalties and weight matrix choice. Nucleic Acids
Res. 22 (22), 4673-4680.
5. Luthy R. et al. (1992) Assessment of protein models with three-dimensional
profiles. Nature 356 (6364), 83-85.
6. Sippl M.J. (1993) Recognition of errors in three-dimensional structures of
proteins. Proteins 17 (4), 355-362.
7. Sali A. et al. (1993) Comparative protein modelling by satisfaction of
spatial restraints. J. Mol. Biol. 234 (3), 779-815.
8. Hooft R.W. et al. (1996) Errors in protein structures. Nature 381 (6580),
272.
harrison (P0188) - 43 predictions: 43 3D
Estimated Distance Matrices and Self-Assembling Models
John Petock1, Ping Liu1, Irene T. Weber1, and Robert W.
Harrison2,1
1- Department of Biology, 2- Department of Computer Science, Georgia State
University
rharrison@cs.gsu.edu
Three potential improvements in methodology were explored in our
submissions to CASP5. These were applied to both ab initio and similarity
modeling. The problem of modeling inserted and deleted regions in homology
modeling is related to the problem of modeling a structure from sequence alone
and therefore we expect that techniques developed for one problem will be
partially transferable to the other. The improvements were: 1) using Floyd’s
algorithm to generate a full rank distance matrix for homology modeling, 2) an
improved soft hydrophobicity potential with structural fragments and an
improved version of self-assembling neural network model for ab initio
modeling, and 3) a non-stationary multiple sequence alignment algorithm for
initial alignments.
The fundamental problem in insertion/deletion modeling is to generate
coordinates from potential energy information alone. Unfortunately, molecular
mechanics potentials will not generate a unique minimum structure by energy
minimization, although sometimes solvated molecular dynamics will converge
to a reasonable model. Therefore it is a common practice[1] to augment the
potentials by searching the structural database and finding structural fragments
that fit the known parts of the structure. These are then used to generate an
initial guess for the missing parts of the structure. The basic problem with this
approach is that while there are O(N^2) distances that should be estimated for a
well-conditioned full rank model building problem, this approach only
produces O(N) distance estimates.
Floyd’s algorithm is a dynamic
programming algorithm that uses repeated iterations of the triangle inequality
to fill in missing distances. Floyd’s algorithm produces a strict upper-bound for
each distance, but is not a reliable predictor of lower-bounds. Distance terms
were added to the molecular mechanics potential from three sources: 1)
distances in the initial model were added with tight error bounds, 2) distances
were added from searching the structure database with error bounds derived
from the deviation to the model, and 3) upper-bound only distance terms were
added by using Floyd’s algorithm. These distances were then used as distance
restraints during the initial part of the model building to generate reasonable
insert/deletion structures.
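The way repeated application of the triangle inequality fills in missing upper bounds can be sketched with the standard Floyd-Warshall recurrence. This is a minimal illustration, not the code used for the submissions: the function name, the dictionary input format and the toy distances are assumptions, and known distances are simply treated as upper bounds.

```python
import math


def floyd_upper_bounds(n, known):
    """Return an n x n matrix of distance upper bounds.

    known maps (i, j) index pairs to distance estimates; every missing pair
    starts at infinity and is tightened with d(i, j) <= d(i, k) + d(k, j).
    """
    d = [[0.0 if i == j else math.inf for j in range(n)] for i in range(n)]
    for (i, j), dist in known.items():
        d[i][j] = d[j][i] = min(d[i][j], dist)
    for k in range(n):
        for i in range(n):
            for j in range(n):
                via_k = d[i][k] + d[k][j]
                if via_k < d[i][j]:
                    d[i][j] = via_k
    return d


# Toy chain of four points with only consecutive distances known (values assumed).
known = {(0, 1): 3.8, (1, 2): 3.8, (2, 3): 3.8}
bounds = floyd_upper_bounds(4, known)
print(bounds[0][3])  # upper bound obtained by chaining the three known distances
```

The result bounds each distance from above only; as the text notes, it says nothing reliable about lower bounds, so such terms would be applied as upper-bound-only restraints.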
Ab initio models were generated for several systems where no close homolog
was detected. The basic approach is a development of our earlier self-assembling methods, where the self-assembly of proteins and other polymers is
modeled by the self-assembly of a Kohonen neural network[2]. Two
improvements were implemented. First, a softer hydrophobicity potential was
used, and second the relaxation scheme in the Kohonen network was improved
after extensive studies. The softer hydrophobicity potential meant that we
could use short structural restraints (10mers) to impose local secondary
structure preferences, and take the best structure as our model, rather than use a
longer structural restraint and average the distances over an ensemble of
models. Structural restraints were generated with a diagonal programming
algorithm that maximized the similarity between overlapping segments while
picking the individual segments based on local sequence similarity. The
improvements in the implementation of the Kohonen network changed the
initial relaxation radius from a small fixed value to a successively decreasing
value that started half the size of the space for the molecule and linearly
contracted to a small value. This resulted in faster and more accurate
convergence in test systems. This algorithm is capable of producing left and
right-handed folds with all L amino acids, so both hands of the fold were
converted to all-atom models and submitted.
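The change in the relaxation radius described above, from a fixed value to one that starts at half the size of the space and contracts linearly, can be written as a simple schedule. The sketch below is an assumption about the functional form only: the function name, the final radius of 1.0 and the number of steps are hypothetical, and it does not implement the Kohonen network itself.

```python
def relaxation_radius(step, total_steps, space_size, final_radius=1.0):
    """Radius that shrinks linearly from space_size / 2 to final_radius.

    step runs from 0 to total_steps - 1.
    """
    if total_steps < 2:
        return final_radius
    start = space_size / 2.0
    fraction = step / (total_steps - 1)
    return start + (final_radius - start) * fraction


# Five-step schedule for a space of size 40 (arbitrary units, assumed values).
schedule = [relaxation_radius(s, 5, 40.0) for s in range(5)]
print(schedule)
```

A large initial radius lets early updates move most of the chain together, while the small final radius restricts late updates to local adjustments.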
The alignment approach was altered in a small but important manner. Earlier
we had implemented a non-stationary alignment scheme that maximized the
correlation between the distributions of amino acids at each position along two
sequences. This approach worked by constructing a local set of moments of the
distribution and assumed that two sequences were aligned when the
distributions were similar. The algorithm was implemented as a dynamic
programming algorithm and is fast enough to scan the entire PDB database. The
problem with this work is that pairs of sequences can be misleading. It is
possible for the closest alignment between a pair of sequences to not be the best
alignment between a sequence and the structural class of the protein. Multiple
sequence algorithms attempt to remedy this problem by building a discrete
model for the structural class from several high similarity sequences. The
CASP5 target and several other sequences were aligned by conventional
approaches, which are perfectly adequate for high homology (80%+, no gaps),
and this aligned set was then used to derive the target vs. starting point
alignment.
The models were built using the program AMMP with the sp4 potential set[3].
For homology models the structures were constrained to the initial coordinates
while building the unknown parts, and then the constraints were released. The
ab initio models were built using a reduced atom potential, and then converted
to an all-atom model. Distance restraints were implemented with a split
harmonic potential where the potential is zero between an upper and a lower
bound. Nine ab initio models (targets 129, 132, 135, 138, and 157) and 29
homology models (targets 132, 137, 139, 143, 144, 150, 151, 153, 154, 155,
156, 158, 160, 163, 167, 178, 179, 182, 183, 188, 190, 192, 193 and 194) were
submitted.
1. Bates P.A., Sternberg M.J. (1999) Model building by comparison at
CASP3: Using expert knowledge and computer automation. Proteins 37
(s3), 47-54.
2. Harrison R.W. (1999) A self-assembling neural network for modeling
polymer structure. J. Math. Chem 26, 125-137.
3. Bagossi P., Zahuczky G., Tozser J., Weber I.T., and Harrison R.W. (1999)
Improved parameters for generating partial charges: correlation with
observed dipole moments. J. Mol. Model 5, 143-152.
Head-Gordon (P0271) - 93 predictions: 93 3D
A Physical Approach to Protein Structure Prediction
Teresa Head-Gordon1, Silvia Crivelli2, Oliver Kreylos3, Betty
Eskow4, Harry Choi1, Richard Byrd4, and Robert Schnabel4
1 Department of Bioengineering, University of California, Berkeley,
2 NERSC, Lawrence Berkeley National Laboratory,
3 Department of Computer Science, University of California, Davis,
4 Department of Computer Science, University of Colorado, Boulder
TLHead-Gordon@lbl.gov
The Stochastic Perturbation with Soft Constraints (SPSC) is a global
optimization method that uses some information from known proteins to
predict secondary structure, but not in the tertiary structure predictions or in
generating the terms of the physics-based energy function [1-4]. Our approach
is also characterized by the use of an all atom energy function that includes a
novel hydrophobic solvation function derived from experiments that shows
promising ability for energy discrimination against misfolded structures [5-7].
We competed for the first time in CASP4, where we showed that our approach
is more effective on targets for which less information from known proteins is
available. Our SPSC method produced the best prediction for one of the most
difficult targets of the competition, a new fold protein of 240 amino acids [4].
The SPSC algorithm is a two-phased approach in which the first phase
generates starting structures which are local minima containing predicted
secondary structure, and the second phase improves upon the starting structures
using both global and local optimizations. The most substantial differences
between our CASP4 and CASP5 methods are in Phase I. In CASP4, Phase I
begins with a starting structure that is the fully extended chain, and locates
good minimizers through local minimizations with soft constraints. The soft
constraints are derived from predictions of secondary structure obtained from
PSIPRED [8], and encourage the formation of helices and sheets through
the use of penalty (reward) functions; the strength of a penalty (reward)
function depends on the strength of a neural network prediction. In CASP5, all
starting structures were generated with a new inverse kinematics (IK) tool
developed by Kreylos and co-workers [9], that allows for interactive
manipulation of local and global dihedral angle moves.
Using PSIPRED predictions, the IK tool forms helices and (isolated) strand
structures based on ideal geometric definitions of the two types of local
secondary structure. This tool allowed us to create a diverse population of
initial configurations regardless of the size and topology of the targets. It was
used to form different starting beta-sheet topologies for proteins with predicted
strands, in part based on the full list of open sheet motifs described in Ruczinski
et al. [10], although we did not use their scoring function for ordering or
eliminating certain sheet topologies. We also used the IK tool to generate some
starting sheet topologies that were not representative of those found by
Ruczinski et al. [10], but seemed possible given the mechanics of the chain.
These IK structures were then locally minimized and added for Phase II
optimization. We also relied less on strand prediction by trying a new
strategy that uses the IK tool to form all strand structure for all backbone
dihedral angles predicted not to be helical. We have found that the global
optimization itself forms its own sheet topologies; this is especially
important for cases where secondary structure predictions are poor or
ambiguous, as was the case for targets like T145.
The local minimizers resulting from Phase I contain predicted secondary
structure but do not contain any significant tertiary structure. Phase II improves
those minimizers through global minimizations in a sub-space of the torsion
angles of amino acids predicted to be coil. A brute-force search is avoided by
selectively doing a local minimization based on whether a new proposed start
structure lies within a certain distance metric of another structure, and whether
its energy is lower than other existing structures; if a new start structure lies
within the distance metric, and is higher in energy, it is assumed to lie within an
existing basin of attraction, and is rejected from further computational
consideration. This global optimization approach is one of the few that provides
a theoretical guarantee of finding the global optimum, and is general in the sense
that subspaces of arbitrary dimension can be explored. However, in practice,
the amount of work required to reach the theoretical guarantee is prohibitive for
large subspaces. Because the guarantee is more readily attained for
low-dimensional problems, we select a subset of ~6-10 variables from the space of
torsion angles predicted to be coil, and a global optimization is performed on
the selected torsion angles as variables while keeping the rest temporarily fixed
at their current values. The global optimization produces a number of local
minimizers in the subspace of torsion angles chosen, and a number of those
conformations with low energy values are selected for local minimizations in
the full variable space. The new minimizers obtained from the local
minimizations are merged into the current list, are clustered and ordered by
energy value and the second phase starts again. The process repeats for a
number of iterations, until no further progress is made according to the
stopping criteria described below.
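The screening rule that avoids a brute-force search (skip the local minimization when a proposed start structure is close to a known structure and higher in energy) can be sketched as below. The abstract does not specify the distance metric, so the periodic RMS torsion difference, the cutoff value and all function names here are assumptions made for illustration.

```python
import math


def torsion_distance(a, b):
    """RMS difference between two torsion-angle vectors in degrees (periodic)."""
    total = 0.0
    for x, y in zip(a, b):
        diff = (x - y + 180.0) % 360.0 - 180.0  # wrap into [-180, 180)
        total += diff * diff
    return math.sqrt(total / len(a))


def should_minimize(candidate, candidate_energy, known, cutoff):
    """Return False when the candidate is assumed to lie in an existing basin.

    known is a list of (torsion vector, energy) pairs for existing structures.
    """
    for angles, energy in known:
        close = torsion_distance(candidate, angles) < cutoff
        if close and candidate_energy > energy:
            return False
    return True


# One known minimizer and three hypothetical start structures.
known = [([60.0, -45.0], -10.0)]
print(should_minimize([62.0, -47.0], -5.0, known, cutoff=10.0))
print(should_minimize([62.0, -47.0], -12.0, known, cutoff=10.0))
print(should_minimize([170.0, 100.0], -5.0, known, cutoff=10.0))
```

Only candidates that are either far from every known structure or lower in energy than their near neighbours are passed on to the more expensive local minimization.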
At the end of each Phase II run the algorithm returns between 60 and 140 of the
lowest energy configurations found thus far. These structures are clustered into
groups in which members of a given cluster are within 5-15Å r.m.s.d. of each
other (lower bound for small proteins, upper bound for large proteins) and the
members of each cluster are energy ranked. The best configuration from each
cluster is used as a starting point for the next round of Phase II. Our experience
is that convergence correlates with no new distinct clusters, and an energy
value that is no longer changing. In general, our first prediction was one of
the three lowest-energy members of the lowest-energy cluster, our second
prediction was one of the three lowest-energy members of the second-lowest-energy
cluster, and so on.
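The clustering and energy ranking of the returned configurations can be illustrated with a simple greedy scheme. The abstract does not state which clustering algorithm was used, so the leader-style clustering below, the function name and the one-dimensional stand-in for r.m.s.d. are assumptions for this sketch only.

```python
def cluster_by_energy(structures, distance, cutoff):
    """Greedy clustering of (name, energy) pairs.

    Structures are visited from lowest to highest energy; each joins the first
    cluster whose lowest-energy member is within cutoff, or starts a new one.
    Clusters come out ordered by their best energy, members ordered by energy.
    """
    clusters = []
    for name, energy in sorted(structures, key=lambda s: s[1]):
        for cluster in clusters:
            if distance(name, cluster[0][0]) <= cutoff:
                cluster.append((name, energy))
                break
        else:
            clusters.append([(name, energy)])
    return clusters


# Toy data: a 1-D coordinate stands in for a structure, |x - y| for r.m.s.d.
coords = {"a": 0.0, "b": 1.0, "c": 20.0, "d": 21.0}
energies = [("a", -50.0), ("b", -48.0), ("c", -55.0), ("d", -40.0)]
clusters = cluster_by_energy(
    energies, lambda x, y: abs(coords[x] - coords[y]), cutoff=5.0
)
representatives = [cluster[0][0] for cluster in clusters]
print(representatives)  # lowest-energy member of each cluster, best cluster first
```

The first entry of each cluster is its lowest-energy member, which corresponds to the configuration carried into the next round of Phase II.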
The SPSC algorithm uses the AMBER 95 molecular mechanics energy
function to represent the protein-protein interactions. We have added an
empirical solvation free energy term that models the hydrophobic effect as a
two-body interaction between all aliphatic carbon centers, and is motivated by
our recent experimental and theoretical work to determine the role of hydration
forces of model protein systems [5-7]. In addition to other validation studies of
the potential [1-4], a validation using the set of protein misfolds in the decoys
database [http://dd.stanford.edu/] is in progress. Our recent preliminary tests
found the native structure as the 11th lowest in energy relative to the several
hundred subtly misfolded structures of 2CRO. Further validation on the full set
of decoy structures is planned, but was put aside to compete in CASP5.
The method is parallelizable as different subspaces can be searched
independently. The global optimization algorithm runs on the T3E and IBM/SP
at NERSC running up to 256 processors, the T3E at the Pittsburgh SC, a 40-processor IBM/SP, and a 32-node cluster of Compaq DS20's. The
parallelization uses a new load balancing technique that is general for large tree
search problems using a hierarchical approach [11]. We selected targets by
considering the CAFASP servers' estimate of what might be a new fold or
difficult fold recognition target. We submitted predictions on 20 targets with
strength percentages below 60% according to CAFASP statistics, and ranging
in size from 53 to 417 amino acids. Some predictions on the largest targets
had not converged, although convergence would certainly be possible with at
least a few more weeks of lead time.
1. Crivelli S., Head-Gordon T., Byrd R. H., Eskow E., Schnabel R. (1999). A
hierarchical approach for parallelization of a global optimization method
for protein structure prediction. Lecture Notes in Computer Science,
Euro-Par '99, Amestoy, Berger, Dayde, Duff, Fraysse, Giraud, Ruiz (eds.),
578-585.
2. Crivelli S., Philip T.M., Byrd R., Eskow E., Schnabel R., Yu R.C.,
Head-Gordon T. (2000). A global optimization strategy for predicting protein
tertiary structure: α-helical proteins. Computers & Chemistry 24, 489-497.
3. Azmi A., Byrd R.H., Eskow E., Schnabel R., Crivelli S., Philip T.M.,
Head-Gordon T. (2000). Predicting protein tertiary structure using a global
optimization algorithm with smoothing. Optimization in Computational
Chemistry and Molecular Biology: Local and Global Approaches, Floudas
and Pardalos, editors (Kluwer Academic Publishers, Netherlands), 1-18.
4. Crivelli S., Eskow E., Bader B., Lamberti V., Byrd R., Schnabel R.,
Head-Gordon T. (2002). A physical approach to protein structure prediction.
Biophys. J. 82, 36-49.
5. Pertsemlidis A., Soper A.K., Sorenson J.M. & Head-Gordon T. (1999).
Evidence for microscopic, long-range hydration forces for a hydrophobic
amino acid. Proc. Natl. Acad. Sci. 96, 481-486.
6. Sorenson J.M., Hura G., Soper A.K., Pertsemlidis A. & Head-Gordon T.
(1999). Determining the role of hydration forces in protein folding. Feature
Article for J. Phys. Chem. B 103, 5413-5426.
7. Hura G., Sorenson J.M., Glaeser R.M. & Head-Gordon T. (1999). Solution
x-ray scattering as a probe of hydration-dependent structuring of aqueous
solutions. Perspectives in Drug Discovery and Design 17, 97-118.
8. McGuffin L.J., Bryson K., Jones D.T. (2000). The PSIPRED protein
structure prediction server. Bioinformatics 16, 404-405.
9. Kreylos O., Hamann B., Max N., Bethel W. & Crivelli S. (2002).
Interactive Protein Manipulation, Tech. Report CSE-2002-28, UC Davis.
10. Ruczinski I., Kooperberg C., Bonneau R., Baker D. (2002). Distributions
of beta sheets in proteins with application to structure prediction. Proteins:
Structure, Function and Genetics 48, 85-97.
11. Crivelli S. & Head-Gordon T. (2002). A New Load Balancing Technique
for the Solution of Large Tree Search Problems using a Hierarchical
Approach. In preparation for IBM Research Journal.
HMMSPECTR (P0025) - 285 predictions: 285 3D
Protein Structure Prediction Using Hidden Markov Models
Based System (HMMSPECTR)
I.F. Tsigelny, Y.V. Sharikov, A.P. Kornev, V. Kotlovyi,
L.F. Ten Eyck
University of California, San Diego, San Diego Supercomputer Center
itsigeln@sdsc.edu
For CASP5 predictions we used an advanced version of the HMMSPECTR
system (http://hmm-spectr.sdsc.edu) that was initially implemented in CASP4
(group TSIGELNY (Tsigelny) PO274). The system is based on searching for
the best alignments between the target primary sequence and members of a
Hidden Markov Model (HMM) library of protein structural homologs [1]. As
compared with the previous version of HMMSPECTR, we have increased the
size of the HMMs library and developed new search approaches.
Structural classification of the library was made in accordance with SCOP [2].
The total size of the used HMMs-library was 26263 models, including three
sub-libraries derived from three different sources. The first was the HMMs
library of SCOP superfamilies, SUPERFAMILY v.1.59, downloaded from
http://supfam.mrc-lmb.cam.ac.uk/SUPERFAMILY. The second and the third
sub-libraries have been created using protein structure alignments made by two
different programs: CE[3] and the original structure alignment tool SA
(http://www.npaci.edu/CCMS/wsat) [4]. In the case of SA implementation
alignments were made within each SCOP family, while in CE the alignments
have been built within a group of structural homologs selected from the PDB
(Protein Data Bank) for typical representatives (“title proteins”) of each SCOP
family. The HMMs were built using the HMMER package [5], with 8 different
filters for gaps in HMM matrix columns (from 10% to 90% gaps allowed).
While building HMMs using the SA engine for families with a small number of
proteins, we included up to five of the best versions of alignments for each pair
of compared proteins. Such an enhancement of HMMs basis diminished the
influence of its reduction to the homologs’ primary sequences in the sparsely
populated families.
The next step of protein structure prediction was the alignment of a target
primary sequence with suggested template proteins. The above-mentioned
SCOP “title proteins” were considered as template proteins for each HMM. The
alignment of the target protein and the title protein was made using the
HMMER package with the BLOSUM62 substitution matrix.
The final step of the prediction was assessment of the alignments obtained. The
assessment was made using multiple parameters, including the HMM-score,
Secondary Structure score (SS-score) and Hydrogen Bonds score (HB-score).
The HMM-score was calculated directly by the HMMER package. For
assessment of the secondary structure identity between the target and the
template proteins an original approach for the target secondary structure
prediction was applied.
The secondary structure prediction was performed by analyzing information
presented in the PDB and creating a database of all secondary structure patterns
related to particular primary sequences. Patterns of 6 or more residues were
taken into account. Different secondary structure patterns related to the
identical primary sequence patterns were averaged. After that, a BLAST-like
search of the primary sequence patterns presented in the created database was
made in the target protein primary sequence. The predicted secondary structure
of the target protein was calculated from a sum of all secondary structure
patterns aligned. SS-score was derived from comparison of the predicted
secondary structure of the target protein with the secondary structure of the
template protein.
To minimize the dependence of the scores on alignment length, all of the
alignments were separated into groups of similar length (alignments within each
group did not differ by more than 5% of the target protein length). Within every
group the alignments were ranked twice: according to HMM-score and
according to SS-score. A joint weighted rank was used for each alignment: the
HMM-score weight was 100%; the SS-score weight did not exceed 50% and
depended on the secondary structure prediction quality coefficient. The three
best alignments belonging to three unique proteins were selected for the
subsequent analysis from each group.
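The grouping and joint ranking described above can be sketched as follows. This is an illustration under stated assumptions rather than the HMMSPECTR implementation: alignments are hypothetical dictionaries, length groups are formed with fixed bins of 5% of the target length, the joint rank is taken as the HMM rank plus ss_weight times the SS rank, and the quality-dependent SS weight is reduced to a single parameter capped at 0.5.

```python
from collections import defaultdict


def rank_positions(items, key):
    """Map item index to rank, where rank 1 is the highest score."""
    order = sorted(range(len(items)), key=lambda i: key(items[i]), reverse=True)
    return {idx: rank for rank, idx in enumerate(order, start=1)}


def select_alignments(alignments, target_length, ss_weight=0.5, per_group=3):
    """Pick the best alignments of distinct proteins within each length group.

    alignments: list of dicts with keys "protein", "length", "hmm", "ss".
    """
    if not 0.0 <= ss_weight <= 0.5:
        raise ValueError("ss_weight is capped at 0.5 in this sketch")
    bin_width = 0.05 * target_length
    groups = defaultdict(list)
    for aln in alignments:
        groups[int(aln["length"] // bin_width)].append(aln)
    selected = {}
    for bin_id, members in groups.items():
        hmm_rank = rank_positions(members, key=lambda a: a["hmm"])
        ss_rank = rank_positions(members, key=lambda a: a["ss"])
        # Lower joint rank is better; the HMM rank carries full weight.
        order = sorted(
            range(len(members)),
            key=lambda i: 1.0 * hmm_rank[i] + ss_weight * ss_rank[i],
        )
        chosen, seen = [], set()
        for i in order:
            protein = members[i]["protein"]
            if protein not in seen:
                seen.add(protein)
                chosen.append(members[i])
            if len(chosen) == per_group:
                break
        selected[bin_id] = chosen
    return selected


# Hypothetical alignments for a 200-residue target.
toy = [
    {"protein": "1abc", "length": 151, "hmm": 80.0, "ss": 0.6},
    {"protein": "1abc", "length": 153, "hmm": 75.0, "ss": 0.9},
    {"protein": "2xyz", "length": 155, "hmm": 70.0, "ss": 0.8},
    {"protein": "3pqr", "length": 158, "hmm": 60.0, "ss": 0.7},
    {"protein": "4lmn", "length": 95, "hmm": 50.0, "ss": 0.5},
]
selection = select_alignments(toy, target_length=200)
largest = max(selection.values(), key=len)
print([a["protein"] for a in largest])
```

In the toy data the second alignment of "1abc" overtakes the first because its better SS rank outweighs its slightly worse HMM rank, and only one alignment per protein is kept.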
In the final stage the HB-score was taken into account. Calculation of the
HB-score was performed by locating hydrogen bonds in template proteins using the
HBPLUS program [6]. To exclude hydrogen bonds involved in alpha-helix
formation, the bonds with the donor and acceptor separated by less than 5
residues were not considered, and the pairs of the target protein residues
corresponding to each hydrogen bond in the template protein were estimated
according to their ability to form a hydrogen bond. Finally, the total
bonus/penalty score was normalized by the number of analyzed hydrogen
bonds.
The HB-score was used for selection between two alignments of the same
template protein as well as for distinguishing between template proteins with
close HMM/SS-ranks. For example, if a template protein had two alignments –
a short one with a high HMM-score and a long one with relatively low HMM-score, but the HB-score for the long alignment was high enough, the long
alignment was considered as a probable candidate for the target structure
prediction.
If several template proteins were aligned with different parts of the target
protein, multi-domain prediction (CASP5 – AL format) was presented (e.g.
T0139 #5, T0162 #4, T0173 #5). For some of the target proteins a new
approach using multi-aligned HMMs was tested. It was used when different
HMMs were aligned with different parts of a target protein (overlapped or not).
In this case the aligned HMM matrices were cut in accordance with the
maximum scores and merged into one ‘synthetic’ HMM matrix. This HMM
was used to search through PDB primary sequences. The protein aligned with
the HMM was considered to be the most probable structure template for the
target protein.
1. Tsigelny I. et al. (2002) Hidden Markov Models-based system (HMMSPECTR)
for detecting structural homologies on the basis of sequential
information. Protein Eng. 15(5), 347-352.
2. Murzin A.G. et al. (1995) SCOP: a structural classification of proteins
database for the investigation of sequences and structures. J. Mol. Biol.
247(4), 536-540.
3. Shindyalov I.N., Bourne P.E. (1998) Protein structure alignment by
incremental combinatorial extension (CE) of the optimal path. Protein
Engineering 11(9), 739-747.
4. Kotlovyi V. et al. (2002) A flexible method for structural alignment:
Application to structure prediction assessments. In Protein structure
prediction: Bioinformatic Approach (ed. I.F.Tsigelny), pp. 433-447.
International University Line, La Jolla, CA.
5. Eddy S.R. (1998) Profile hidden Markov models. Bioinformatics 14(9),
755-763.
6. McDonald I.K., Thornton J.M. (1994) Satisfying hydrogen bonding
potential in proteins. J. Mol. Biol. 235(5), 777-793.
Ho-Kai-Ming (P0437) - 129 predictions: 129 3D
Three Dimensional Threading Approach to Protein Structure
Recognition
Kai-Ming Ho, Haibo Cao, Yungok Ihm,
Zhong Gao, Cai-Zhuang Wang and Drena Dobbs
Iowa State University
kmh@ameslab.gov
Our protein recognition scheme uses a threading approach in which candidate
structures are represented by contact matrices following the work of Miyazawa
and Jernigan[1]. The alignment of the target sequence on a template structure is
determined by a scoring function consisting of a sum of all residue-residue
contacts with hydrophobic strengths evaluated using the Li, Tang and Wingreen
parameterization[2] of the Miyazawa Jernigan matrix[1]. Contributions of local
secondary structure preference are included by multiplying the raw score by an
enhancement factor equal to (1+ alpha*(Nright-Nwrong)/Naligned) where
Naligned is the number of residues aligned to the structure, Nright is the
number of residues where the secondary structure (helix, sheet or loop) of the
template agrees with the result from secondary structure predictions and
Nwrong is the number of residues where they disagree. The secondary structure
predictions are obtained from jpred, psipred and samt99 (as posted on the
CAFASP website) and only predictions where all three methods agree are
counted.
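The enhancement factor described above can be sketched as follows. This is a minimal illustration, not the authors' code: the one-letter state strings (H, E, L), the one-to-one pairing of template and predicted states, and the value of alpha are all assumptions, since the abstract does not give them.

```python
def consensus_ss(jpred: str, psipred: str, samt99: str) -> str:
    """Per-residue consensus of three predictions; '-' marks any disagreement."""
    return "".join(a if a == b == c else "-" for a, b, c in zip(jpred, psipred, samt99))


def enhancement_factor(template_ss: str, consensus: str, alpha: float) -> float:
    """Return 1 + alpha * (Nright - Nwrong) / Naligned.

    template_ss and consensus hold one state per aligned residue.
    Positions without a three-way consensus ('-') count toward Naligned
    but toward neither Nright nor Nwrong.
    """
    n_aligned = len(template_ss)
    if n_aligned == 0:
        return 1.0
    n_right = n_wrong = 0
    for t, p in zip(template_ss, consensus):
        if p == "-":
            continue  # predictors disagree: position is not counted
        if t == p:
            n_right += 1
        else:
            n_wrong += 1
    return 1.0 + alpha * (n_right - n_wrong) / n_aligned


# Hypothetical usage: alpha = 0.5 is an illustrative value only.
cons = consensus_ss("HHEL", "HHEL", "HHLL")
factor = enhancement_factor("HHEE", cons, alpha=0.5)
```

The raw threading score would then be multiplied by `factor`.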
Our searches are divided into two classes: for targets that have significant
sequence similarities to proteins of known structure (homology modeling or
HM targets) as indicated by the results from various servers posted on the
CAFASP website, we run threading studies on the suggested structural families
(using the structural classification from the SCOP and ASTRAL database). If
the threading score is above our threshold, we stop the search. For non HM
targets, threading studies are done with ~14000 protein structures selected from
the ASTRAL domain library[3]. This dataset covers ~1500 out of ~1800
domain families in the ASTRAL database and includes all domain families
which are shorter than 300 residues. For longer target sequences, we augment
the above database with additional families with lengths from 300 to 600
residues. To facilitate recognition, we perform our initial threading studies not
on the whole target sequence but on short subsequences with lengths ~100-120
selected from different positions on the target. Once we have hits with
threading scores above threshold, threading studies are performed on the
selected families using longer and longer subsequences representing a larger
and larger fraction of the target. In some cases, high-scoring fragments are
pasted together to yield a more complete structural prediction for the whole
protein. In the last part of the prediction process, final PDB geometries are
generated from high-scoring template structures using the alignment obtained
from the threading studies. The final modeling and refinement are done with
the software package JACKAL (J. Xiang, Columbia University) and/or the
MODELLER and PROCHECK packages.
1. Miyazawa S. and Jernigan R.L. (1985) Estimation of effective interresidue
contact energies from protein crystal structures: Quasi-chemical
approximation. Macromolecules 18, 534-552
2. Li H., Tang C., and Wingreen N.S. (1997) Nature of driving force for
protein folding: A result from analyzing the statistical potential. Phys. Rev.
Lett. 79, 765-768
3. Brenner S.E., Koehl P., and Levitt M. (2000) The astral compendium for
protein structure and sequence analysis. Nucleic Acids Res. 28, 254-256
HOGUE-SLRI (P0267) - 254 predictions: 254 3D
Homology Modeling using a Novel Flexible Fragment
Assembly Approach and Ab Initio Prediction Using
Distributed Computing
H.J. Feldman1,2, M. Dumontier1,2 and C.W.V. Hogue1,2
1 – Samuel Lunenfeld Research Institute, Mount Sinai Hospital
2 – Department of Biochemistry, University of Toronto
hogue@mshri.on.ca
We employed two distinct approaches for structure prediction, depending on
whether homology with a protein of known structure existed for the target.
Initially, CDDSearch (http://www.ncbi.nlm.nih.gov/structure/cdd) was
employed to identify protein families with significant similarity to the target.
We also checked for low E-value hits from PDB-BLAST on the CAFASP site.
If a template sequence with an E-value below 0.01 was found (from either
PDB-BLAST or CDD) we proceeded with the homology modeling procedure
outlined below. Otherwise, we assigned the target to the Distributed Folding
Project (described below).
A total of 39 targets were modeled using homology modeling in the following
way. The family with the best E-value which also contained a protein of
known structure was aligned to the query CASP target, and the most similar
sequence with a structure in that family was chosen as the template for
homology modeling. In the case of multi-domain proteins, the best hit from
CDD for each domain was used as template. Template structure(s) were
manually inspected, and gaps manually adjusted when necessary to ensure they
fell on loop regions and not in the center of secondary structure elements.
Next, using a modified version of our TRADES algorithm [1] the backbone
alpha-carbon trajectory of the template was recorded, and a trajectory
distribution built with the new sequence of the target. Each gapless stretch of
alignment was replaced by a single fragment from the recorded trace. Where
gaps occurred in the alignment, fragments were built to span the gaps. These
fragments were created as follows.
The 'takeoff angles' were recorded starting from one residue prior to the gap
and ending one residue following the gap, on the template structure. These
consisted of six degrees of freedom - the distance between the start and end of
the gap, two Cα virtual angles (i.e. angles between three consecutive Cα atoms
– ‘virtual’ because they are not covalently bonded) and three Cα virtual
dihedrals. Then three Cα atoms from each side of the gap were placed in space,
according to the recorded takeoff angles. Alpha carbons required to fill the gap
were then given arbitrary starting co-ordinates within the gap region, and a
steepest descent energy minimization carried out. For the purposes of this
minimization, the energy function consisted of Cα virtual bond length
restraints, Cα virtual angle restraints, and a van der Waals term. The three
anchoring atoms on either side of the gap were held fixed during the
minimization. Finally, the resulting loop was incorporated as a fragment using
its own Cα trace.
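The geometric quantities recorded as takeoff angles (a distance, virtual angles over three consecutive alpha carbons, and virtual dihedrals over four) can be computed as in the sketch below. It only illustrates the primitives. Which atoms define each of the six degrees of freedom is not stated in the abstract, and the dihedral sign convention used here is an assumption.

```python
import math


def _sub(a, b):
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])


def _dot(a, b):
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]


def _cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def distance(p, q):
    """Euclidean distance between two points."""
    d = _sub(p, q)
    return math.sqrt(_dot(d, d))


def virtual_angle(p0, p1, p2):
    """Angle in degrees at p1 formed by three consecutive alpha carbons."""
    u, v = _sub(p0, p1), _sub(p2, p1)
    cos_a = _dot(u, v) / (math.sqrt(_dot(u, u)) * math.sqrt(_dot(v, v)))
    return math.degrees(math.acos(max(-1.0, min(1.0, cos_a))))


def virtual_dihedral(p0, p1, p2, p3):
    """Dihedral in degrees over four consecutive alpha carbons (cis = 0, trans = 180)."""
    b0, b1, b2 = _sub(p1, p0), _sub(p2, p1), _sub(p3, p2)
    n1, n2 = _cross(b0, b1), _cross(b1, b2)
    y = math.sqrt(_dot(b1, b1)) * _dot(b0, n2)
    x = _dot(n1, n2)
    return math.degrees(math.atan2(y, x))
```

For example, `virtual_angle((1, 0, 0), (0, 0, 0), (0, 1, 0))` is a right angle, and four coplanar points in a zig-zag give a dihedral of 180 degrees.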
Fragments are not completely rigid but rather are allowed a small amount of
flexibility by adding some random noise to each Cα position relative to the
previous backbone. Then roughly 1000 structures were generated using the
fragments obtained from the previous steps and our Foldtraj software, with
bump checking disabled, and due to the slight flexibility in the fragments, some
variation occurs in this pool of structures.
Using a slightly modified version of a statistical residue-based potential [2]
which we have termed 'crease energy', the best five structures were chosen.
These were then refined with a steepest-descent minimization using the
CHARMM EEF1 force field to resolve any steric clashes but without
significantly changing the structure (typically 1 Å RMSD between the refined
and unrefined structures).
A total of 13 targets were predicted with the help of distributed computing
using an ab initio approach. Using a modified version of our TRADES
algorithm [1] we incorporated secondary structure prediction from PsiPred [3]
and performed random walks in Ramachandran space. Sidechains were placed
probabilistically using Dunbrack's backbone dependent rotamer library [4]. All
residues are chirally and sterically valid, having a minimum of non-hydrogen
van der Waals collisions.
Up to 1 billion structures were generated for each target using the Distributed
Folding Project framework (http://www.distributedfolding.org/). This allowed
us to make use of spare CPU cycles on thousands of computers across the
world to sample structures.
Finally, from the pool of generated structures various statistics were collected
including radius of gyration, exposed surface area, exposed hydrophobic
surface area, and energy score according to three different scoring functions:
the EEF1 solvation term, a modified version of a statistical residue-based
potential [2] which also compared actual secondary structure content to
predicted content, and a species-specific contact potential developed in our lab.
Structures with radii of gyration greater than 120% * 2.59 * N^0.346, where N
is the number of residues in the protein, were all discarded. This ensured only
compact structures were retained.
The best structures were chosen based on their energy scores. The top 2-5
structures for each of the three energy functions used were visually inspected
and five chosen for submission.
1. Feldman H.J. and Hogue C.W.V. (2000) A Fast Method to Sample Real
Protein Conformational Space. Proteins 39 (2), 112-131.
2. Bryant S.H. and Lawrence C.E. (1993) An Empirical Energy Function for
Threading Protein Sequence through the Folding Motif. Proteins 16 (1),
92-112.
3. Jones D.T. (1999) Protein secondary structure prediction based on
position-specific scoring matrices. J. Mol. Biol. 292 (2), 195-202.
4. Dunbrack R.L., Jr. and Karplus M. (1993) Backbone-dependent rotamer
library for proteins. Application to side-chain prediction. J.Mol.Biol. 230
(2), 543-574.
Holm (P0090) - 38 predictions: 38 3D
Adaptive Profile Alignment
A. Heger and L. Holm
Institute of Biotechnology, P.B. 56, 00014 University of Helsinki, Finland
Liisa.Holm@Helsinki.fi
The method uses only sequence information and has two steps: (1) select a set
of homologous sequences that includes both the target and template proteins,
(2) align target and template sequences using many intermediate sequences as
stepping stones. For the first step, we used the transitivity of homology to
search for connected sets of sequence-similar proteins, as it has been shown
that profile-based sequence similarity searching fails to detect a large fraction
of more distant homology relationships. The second step used a novel
algorithm, MaxFlow, which in our own tests has improved both the reliability
and coverage of alignments compared to PSI-Blast. In particular, MaxFlow is
capable of generating accurate alignments between proteins that are only
indirect PSI-Blast neighbours. If the target and template are distant
homologues, they can have genuinely different amino acid preferences, which
cannot be reasonably modelled by a single profile. MaxFlow mimics
evolutionary adaptation in that it allows the profile model to change gradually
through many intermediate stages.
Data: Sequence analysis was based on our PairsDb database, which organizes
one million non-identical protein sequences (nrdb100 set) into hierarchical
clusters. Nrdb90 is a representative subset generated at the 90 % identity level,
and nrdb40 is a representative subset generated at the 40 % identity level. All
sequences in nrdb100 are mapped to the nrdb90 or nrdb40 representative via
Blast [1] alignments. The database also stores the results of all-against-all
PSI-Blast searches [1] in the nrdb40 set.
Homology detection: Templates for structure prediction were selected based on
overlapping sequence neighbourhoods. Sequence neighbours were defined as
reciprocal PSI-Blast hits, i.e., profiles seeded from protein A or B detected
protein B or A, respectively, with an e-value < 1. We used a pre-computed
library of sequence neighbours of all PDB structures. If any sequence
neighbour of the target was included in the library, we counted a hit. We then
pooled the counts per SCOP superfamily. The superfamily with the most votes
was chosen as template-set. Typically, there was one SCOP superfamily that
had a distinctly higher count than any other. We gain in sensitivity compared to
a plain PSI-Blast search, as the target A and template B need not be direct
PSI-Blast neighbours but may be linked by an intermediate (domain) X as in A-X,
X-B.
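The voting step can be illustrated with a short sketch. The data layout (a dictionary from sequence identifier to the superfamilies of the PDB structures it neighbours) and all identifiers below are hypothetical; the abstract does not describe how the pre-computed library is stored.

```python
from collections import Counter


def vote_superfamily(target_neighbours, pdb_neighbour_library):
    """Pool hits per superfamily.

    target_neighbours: sequence ids that are neighbours of the target.
    pdb_neighbour_library: maps a sequence id to the superfamilies of the
    PDB structures that have that sequence as a neighbour (assumed layout).
    """
    votes = Counter()
    for seq_id in target_neighbours:
        for superfamily in pdb_neighbour_library.get(seq_id, ()):
            votes[superfamily] += 1  # one hit per shared neighbour
    return votes


# Hypothetical toy data: three of the target's neighbours appear in the library.
library = {"x1": ["sf_A"], "x2": ["sf_A"], "x3": ["sf_B"]}
votes = vote_superfamily(["x1", "x2", "x3", "x4"], library)
template_set = votes.most_common(1)[0][0]
```

Here `template_set` is the superfamily with the most votes; a real run would also check that its count stands out from the runner-up, as the text notes it typically does.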
Sequence alignment: The notion of scoring alignments for consistency – rather
than amino acid similarity – has been around for a long time [2-4]. The input to
MaxFlow is a library of pairwise alignments. The input set of sequences was
the union of the sequence neighbours of the template set and of the target. The
pairwise alignments were taken from the PSI-Blast all-against-all database.
Provided that the target and template are in the same connected component, the
pairwise alignment library implies a transitive alignment between target and
template via a number of intermediates. There are, of course, very many
choices of intermediates that will lead to mutually inconsistent alignments
between the proteins at the start and end of the chain. The classical multiple
sequence alignment problem aims to reconcile such inconsistencies using ad
hoc objective functions, usually a sum-of-pairs score, leading to NP-complete
optimisation problems [4]. MaxFlow uses a novel type of objective function
for transitive alignment, which is based on a path score. The path score
measures the total support in the alignment library for pairing a given pair of
residues of two proteins. The algorithmic advantage is that we need only
address the standard pairwise alignment problem, which can be solved exactly.
Empirically, MaxFlow’s consistency score correlates with the reliability of
alignment, so that one can select more reliable core parts of the alignment, but
this information could not be entered into the AL format of CASP.
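The idea of measuring support for a residue pairing in a library of pairwise alignments can be illustrated with the simplest case of a single intermediate. The sketch below is not the MaxFlow algorithm or its path score, whose details are not given here; it only counts how many intermediates X link residue i of the target to residue j of the template through the alignments A-X and X-B. The dictionary layout is an assumption.

```python
from collections import defaultdict


def transitive_support(target_to_mid, mid_to_template):
    """Count intermediates supporting each (target residue, template residue) pair.

    target_to_mid: {intermediate id: {target residue index: intermediate residue index}}
    mid_to_template: {intermediate id: {intermediate residue index: template residue index}}
    """
    support = defaultdict(int)
    for mid, t2m in target_to_mid.items():
        m2t = mid_to_template.get(mid, {})
        for i, k in t2m.items():
            j = m2t.get(k)
            if j is not None:
                support[(i, j)] += 1  # target i and template j both align to residue k of X
    return dict(support)


# Hypothetical toy library with two intermediates that partly disagree.
a_to_x = {"X1": {0: 0, 1: 1}, "X2": {0: 5, 1: 6}}
x_to_b = {"X1": {0: 10, 1: 11}, "X2": {5: 10, 6: 12}}
support = transitive_support(a_to_x, x_to_b)
```

The resulting counts could serve as a residue-pair score matrix for a standard pairwise dynamic programming alignment, which is the algorithmic advantage the text describes.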
1. Altschul S.F. et al. (1997) Gapped BLAST and PSI-BLAST: a new
generation of protein database search programs. Nucleic Acids Res. 25
(17), 3389-3402
2. Vingron M., Argos P. (1991) Motif recognition and alignment for many
sequences by comparison of dot-matrices. J. Mol. Biol. 218, 33-43
3. Notredame C., Higgins D.G., Heringa J. (2000) T-Coffee: A novel method
for fast and accurate multiple sequence alignment. J Mol Biol 302, 205-217
4. Notredame C. (2002) Recent progress in multiple sequence alignment: a
survey. Pharmacogenomics 3, 131-144
Honig (P0110) - 113 predictions: 113 3D
Comparative Modeling Using HMAP, NEST, Troll and
Physical-Chemical Principles
Zhexin Xiang1,2 , Donald Petrey1,2, Cinque Soto2,
Chris Tang3 and Barry Honig1,2
1 Howard Hughes Medical Institute, 2 Department of Biochemistry and
Molecular Biophysics, 3 Integrated Program in Cellular, Molecular and
Biophysical Studies, Columbia University
bh6@columbia.edu
Overview - We participated in the fold recognition and homology sections of
the experiment using primarily in-house software. Much of this software is
novel and has not yet been published. The in-house software we used includes
HMAP (a hybrid sequence and structure based alignment between query and
template profiles), NEST (a new homology modeling program that is based on
an artificial evolution method), SCAP [1] and LOOPY [2] (a side-chain and
loop prediction program based on the colony energy approach), Troll [3]
/GRASP2 (an interactive program which contains all of the features of GRASP
plus multiple structure alignments and an easy to use graphical user interface
that displays both sequence and structure alignments), DIFALN/BINGO (a
graphical program to display and manually tune sequence alignments between
HMAP and CAFASP servers) and physical-chemical based energy functions to
evaluate alternate conformations.
Our strategies for fold recognition and homology modeling were very similar.
For fold recognition we generally attempted targets where HMAP detected
templates with a reasonable e-value threshold, or where we felt that HMAP
improved the alignments that came from the CAFASP servers. On occasion, we
noticed that CAFASP servers would detect significant hits where HMAP did
not. In all cases, this happened because the hit detected by the servers was not
in our database. Thus, we built a profile using HMAP for the new template and
used it to generate our own alignments. If we felt we had nothing to add beyond
what the servers listed, we decided not to submit that target.
For each target we would perform the following: 1) build 3D models for
sequence alignments from HMAP and selected CAFASP servers; 2) evaluate
each model with our own energy functions and with Verify3D [4]; and 3)
identify regions of the sequence where multiple structure alignments of family
members revealed either similarities or differences. If differences were
identified, we generally used energetic criteria to decide between models, but
on occasion used intuition derived from visually inspecting the alignments. The
alignments were adjusted based on the energy criteria and steps 1-3 above were
carried out again. This process was repeated until a satisfactory structure was
generated. One area where visual inspection was particularly useful was in
deleting insertions. In many cases we could easily delete loops and even some
secondary structure elements while minimally perturbing the structure.
Our strategy for homology modeling was closely related to that used in fold
recognition but with a few additional steps. Since NEST works so rapidly we
were able to use regions from different templates where we believed they
provided better local templates, and then fuse the ends of these regions into our
original template with a loop closure procedure [2]. In general, we did not try to
keep the target as close as possible to the template. We realized that this was a
risky procedure but we felt it important to test our ability, for example using the
refinement module of NEST, to try to relax the structure. This was sometimes
done with manual input. For example we always tested for buried charges and
unless we could visually identify a potential ion-pairing partner we would
either change the alignment or try to change the structure. This involved both
backbone and side chain movement.
Methods - HMAP is a fold-recognition and alignment program that relies on
profile-to-profile dynamic programming. Template profiles were derived from
SCOP-defined protein domains (version 1.57 at the time we built our database)
and consisted of several different types of information that could be derived
from the sequence and structure of a protein. In CASP5, our templates
primarily used information derived from secondary structure, fixed-length
sequence motifs, automated multiple structure alignments and sequence-based
profiles. Position-specific gap penalties were derived from the secondary
structure profiles generated from multiple alignments of structurally related
proteins. The results were stored in the form of a database of structural
templates. Profiles were calibrated so that the statistical significance of a hit
could be estimated. When a new target was released from CASP, we built a
query profile for the sequence based on its sequence-based profile and
secondary structure prediction (using a consensus between PSI-PRED [5], PHD
[6] and JNET [7]). The alignments given by HMAP were manually assessed
and then fed to the homology-modeling program NEST.
NEST is a homology modeling program based on an artificial evolution method
(http://trantor.bioc.columbia.edu/~xiang/jackal). The program can build and
refine homology models based on single, composite or multiple templates.
Given an alignment between a query sequence and a template, the alignment
can be considered as a list of operations such as residue mutation, insertion or
deletion. Building a structure for the query sequence based on the template is a
process of performing these operations. Each operation will disturb the
template structure and involves an energy cost, either positive or negative. The
model building starts from the operation with the least energy cost and so on.
Each operation is finished with a slight energy minimization to remove atomic
clashes. The final structure is then subjected to more thorough energy
minimization. The minimization is done in torsion angle space. The energy
function consists of the following terms: van der Waals energy, hydrophobic,
electrostatics, torsion angle energy, hydrogen-bond network energy of the
template, and statistical energy of a residue’s solvent accessibility. The
structure refinement module in NEST can refine the models in four levels:
energy minimization of clashing atoms, refinement of insertion and deletion
regions, refinement in all loop regions and refinement in all α/β regions.
Refinement of loop regions is done using LOOPY and refinement of side-chain
conformations is performed using SCAP, where both SCAP and LOOPY use
the colony energy approach to account for the flexibility of side chains and
loops on the protein surface. Refinement of helix or sheet regions is done by a
procedure similar to LOOPY, but the hydrogen-bond constraints in the regular
secondary structure regions are applied so that the refinement does not disrupt
the original hydrogen bond network.
Models were evaluated by comparing energies of the models using a protocol
that combines an extensive molecular mechanics minimization with an
evaluation of the total electrostatic energy using the finite-difference
Poisson-Boltzmann method. Powell minimization using an all-hydrogen model and
CHARMM22 parameters and a dielectric constant of 10 was performed. Low
energy structures were considered for submission. This procedure was
combined with visual evaluation of the models using the program GRASP2
written with the Troll software library of molecular analysis and visualization
tools. In addition to the molecular graphics, surface display, and electrostatic
features of the original version of GRASP, GRASP2 now integrates structure
alignment and sequence display/alignment tools into the graphical user
interface. These tools allow a user to conveniently search a database of
domains for proteins that are structurally homologous to a given template, and
to simultaneously display/compare different alignments to a template or
alignments to different templates. This is accomplished by carrying out a
multiple structure alignment of a set of templates and then adding alignments of
a query to each template to the multiple structure alignment. Structure
alignments were generated as follows. First, equivalent secondary structure
elements are identified using a double-dynamic programming algorithm. Once
structurally equivalent secondary structure elements are identified, structurally
equivalent residues are identified by superposing the end-points of the
equivalent secondary structure elements and then carrying out an iterative
process of sequence alignment. Residue similarity at this stage is a simple
function of the distance between alpha-carbons given the current rigid body
superposition. A sequence alignment is determined using this similarity score
and rigid body superposition is carried out again. This process is repeated until
the root-mean-square deviation of aligned alpha-carbon atoms changes by no
more than a given threshold.
The simultaneous display/comparison of alignments and structures allows convenient
identification of structural features that may be responsible for differences in
the more objective evaluation criteria such as calculation of molecular
mechanics energies or Verify3D profiles [4] and contributed significantly to the
decision as to which model/alignment to submit.
1. Xiang Z. and Honig B. (2001) Extending the Accuracy Limits of
Prediction for Side Chain Conformations. J. Mol. Biol. 311:421-430.
2. Xiang Z., Soto C. and Honig B. (2002) Evaluating Conformational Free
Energies: The Colony Energy and its Application to the Problem of Loop
Prediction. Proc. Natl. Acad. Sci. USA 99:7432-7437.
3. Petrey D. and Honig B. (2000) Free Energy Determinants of Tertiary
Structure and the Evaluation of Protein Models. Protein Science 9:2181-2191.
4. Luthy R., Bowie J.U. and Eisenberg D. (1992) Assessment of Protein
Models with Three-Dimensional Profiles. Nature 356:83-85.
5. Jones D. (1999) Protein secondary structure prediction based on
position-specific scoring matrices. J Mol Biol. 292(2):195-202.
6. Rost B. (1996) PHD: predicting one-dimensional protein structure by
profile-based neural networks. Methods Enzymology. 266:525-39.
7. Cuff J.A. and Barton G.J. (2000) Application of multiple sequence
alignment profiles to improve protein secondary structure prediction.
Proteins. 40(3): 502-11.
Huber-Torda (P0351) - 83 predictions: 83 3D
Fold Recognition and Sequence to Structure Alignment
Using Wurst
T. Huber1, J.B. Procter2 and A.E. Torda2
1 - Department of Mathematics, The University of Queensland, Australia,
2 - Zentrum für Bioinformatik Hamburg, University of Hamburg, Germany
huber@maths.uq.edu.au, procter@zbh.uni-hamburg.de,
torda@zbh.uni-hamburg.de
Our calculations were performed with "wurst" [1], a locally written protein
structure prediction package. Fold recognition calculations were done using a
two-step approach where completely different score functions are used for
alignment and ranking of models [2].
The first score function was based on optimized Bayesian-mixture models
designed to measure sequence to structure compatibility in small fragments.
This treats both sequence and backbone angles as statistical descriptors. Despite
the unusual formalism, it allows one to build a score matrix for sequence to
structure alignments and is easily combined with a term accounting for sequence
similarity. For ranking models, these terms were mixed with a low-resolution
(five site / residue), z-score optimized score function [3].
Alignments were calculated using a Smith and Waterman style local alignment
and extended using the globally optimal Needleman and Wunsch algorithm.
Gap penalties and relative contributions of different terms were optimized
using a simplex method to produce the best models, in a geometric sense, for a
set of structurally similar proteins. The penalty function only considered the
quality of a model and did not refer to any “ideal” alignment.
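The local alignment stage can be sketched as a Smith-Waterman recursion over a precomputed sequence-to-structure score matrix. This is a generic textbook version under simplifying assumptions (a single linear gap penalty with an illustrative value), not the wurst implementation, and it omits the subsequent Needleman-Wunsch extension.

```python
def smith_waterman(score, gap=2.0):
    """Local alignment over score[i][j] with a linear gap penalty.

    score[i][j] is the compatibility of sequence residue i with template site j.
    Returns (best_score, list of aligned (i, j) index pairs).
    """
    n = len(score)
    m = len(score[0]) if n else 0
    H = [[0.0] * (m + 1) for _ in range(n + 1)]
    best, best_pos = 0.0, (0, 0)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            H[i][j] = max(0.0,
                          H[i - 1][j - 1] + score[i - 1][j - 1],  # match
                          H[i - 1][j] - gap,                      # gap in template
                          H[i][j - 1] - gap)                      # gap in sequence
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)
    # Trace back from the best cell until the local score drops to zero.
    pairs = []
    i, j = best_pos
    while i > 0 and j > 0 and H[i][j] > 0:
        if H[i][j] == H[i - 1][j - 1] + score[i - 1][j - 1]:
            pairs.append((i - 1, j - 1))
            i, j = i - 1, j - 1
        elif H[i][j] == H[i - 1][j] - gap:
            i -= 1
        else:
            j -= 1
    pairs.reverse()
    return best, pairs
```

In the optimization described above, the gap penalty and the weights of the terms that make up `score` would be the parameters tuned by the simplex search.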
The library of candidate models was built using a clustering of known proteins,
based on their apparent similarity in the appropriate score function, rather than
a conventional sequence or structure measure.
Finally, models were regularized using a distance geometry code.
1. Huber T. et al. (1999) SAUSAGE: Protein threading with flexible force
fields. Bioinformatics 15, 1064-1065.
2. Huber T. and Torda A.E. (1999) Protein sequence threading, the alignment
problem and a two-step strategy. J. Comput. Chem. 20, 1455-1467.
3. Huber T. and Torda A.E. (1998) Protein sequence threading, the alignment
problem and a two-step strategy. Protein Sci. 7, 142-149.
4. Russel A.J. and Torda A.E. (2002) Protein sequence threading – averaging
over structures. Proteins 47, 496-505.
I-sites/Bystroff (P0132) - 64 predictions: 64 3D
Fully Automated Ab Initio Tertiary Structure Prediction
Using I-SITES, HMMSTR and ROSETTA
C. Bystroff and Y. Shao
Department of Biology, Rensselaer Polytechnic Institute
shaoy@rpi.edu, bystrc@rpi.edu
The Monte Carlo fragment insertion method for protein tertiary structure
prediction (ROSETTA) of Baker and others, has been merged with the I-SITES
library of sequence structure motifs and the HMMSTR hidden Markov model
for local structure in proteins, to form a new public server for the ab initio
prediction of protein tertiary structure
(www.bioinfo.rpi.edu/~bystrc/hmmstr/server.html). The server also predicts
fragment structure, backbone angles, and secondary structure. The CASP4
results for this server are presented in [1].
The following is a short description of each of the main programs in the order
they appear in the server script.
Generate a multiple sequence alignment
Single sequences were submitted to PSI-BLAST[2], which returns a multiple
sequence alignment. The multiple sequence alignment was converted to a
sequence profile.
Predict I-sites motifs
The sequence profile is compared in a sliding-window fashion with each of the
261 I-sites Library scoring matrices [3]. The score is mapped to a confidence.
The server returns a list of fragment predictions, expressed as backbone angles,
sorted by confidence. The highest confidence fragments are referred to as
"I-sites predictions," the whole list as "I-sites fragments."
Generate a fragment moveset
The I-sites fragment list was converted to a ROSETTA move set containing
libraries of length 3 and 9 peptides for each window of the sequence. Each
I-sites fragment with length L ≥ 9 was divided into L-9+1 subsegments of length
9. The 25 highest confidence fragments were kept for each 9-residue window
in the query. If fewer than 25 high confidence fragments were found, then the
list was augmented by extending 7 and 8 residue I-sites fragments. A similar
procedure was done for the moveset of length 3.
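The windowing arithmetic above can be sketched as follows. The tuple layout for fragments is an assumption made for illustration, and the step that extends 7- and 8-residue fragments when fewer than 25 candidates are found is left out.

```python
def subsegments(start, angles, window=9):
    """Split a fragment of length L >= window into L - window + 1 overlapping pieces.

    Returns (query position, backbone angles) for each piece.
    """
    length = len(angles)
    return [(start + k, angles[k:k + window]) for k in range(length - window + 1)]


def build_moveset(fragments, window=9, keep=25):
    """Keep the `keep` highest-confidence pieces for each window start.

    fragments: iterable of (confidence, start position, backbone angles);
    this layout is hypothetical.
    """
    by_position = {}
    for confidence, start, angles in fragments:
        if len(angles) < window:
            continue  # shorter fragments are only used in the augmentation step
        for position, piece in subsegments(start, angles, window):
            by_position.setdefault(position, []).append((confidence, piece))
    return {
        position: sorted(cands, key=lambda c: c[0], reverse=True)[:keep]
        for position, cands in by_position.items()
    }
```

An 11-residue fragment starting at position 4, for instance, contributes 11-9+1 = 3 pieces, at positions 4, 5 and 6. Calling `build_moveset(fragments, window=3)` gives the analogous length-3 library.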
Restrain high-confidence regions
High confidence I-sites predictions were restrained to their predicted backbone
angles to increase the efficiency of ROSETTA. Fragment insertion was allowed
in the restrained regions, but moves were rejected if any angle deviated by
more than 60° from the I-sites prediction. A maximum of one-third of the
residues could be restrained.
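A restraint test of this kind needs an angular difference that respects the wraparound at ±180°. The sketch below assumes each residue carries a (phi, psi) pair in degrees; which backbone angles the server actually checks is not stated.

```python
def angle_diff(a, b):
    """Smallest absolute difference between two angles in degrees."""
    d = (a - b) % 360.0
    return min(d, 360.0 - d)


def move_allowed(proposed, restraints, max_dev=60.0):
    """Reject a move if any restrained angle deviates by more than max_dev.

    proposed: {residue index: (phi, psi)} after the trial fragment insertion.
    restraints: {residue index: (phi, psi)} from high-confidence predictions.
    """
    for residue, target in restraints.items():
        if residue not in proposed:
            continue  # the move does not touch this restrained residue
        if any(angle_diff(p, t) > max_dev for p, t in zip(proposed[residue], target)):
            return False
    return True
```

With this definition, 170° and -170° differ by 20°, so a move between them would pass the 60° test.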
Divide long sequences
If the target sequence had more than 36 un-restrained residues after the
previous step, then it was divided into overlapping segments having about 36
un-restrained residues each. Adjacent segments overlap by at least 18
un-restrained residues, plus any intervening restrained segments.
Assemble fragments
ROSETTA[4] searches protein conformational space using fragment insertion
moves and a Monte Carlo acceptance criterion. An insertion point in the target
is selected at random, then a fragment (either length 3 or 9) is selected at
random from the fragment library. The backbone angles are changed to those of
the fragment, new coordinates are computed from the backbone angles, and the
move is accepted or rejected, using Monte Carlo. The energy function [5] is
composed of structure-based Bayesian conditional probability expressions,
drawn from the PDBselect database [6].
The probability of acceptance depends on the change in energy, and the
temperature setting (T). T is initially set to a high value so that most
physically-possible moves are accepted, then decreased linearly over 12,000
moves. The optimal temperature schedule depends on the length of the chain
being simulated, or more specifically, the number of degrees of freedom. In this
automated procedure, a fixed temperature schedule was used and the length of
the input sequence was restricted to a narrow range.
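The acceptance step with a linearly decreasing temperature can be sketched as below, assuming the standard Metropolis rule. The start and end temperatures are hypothetical placeholders, since the abstract gives only the number of moves.

```python
import math
import random

N_MOVES = 12000  # from the text
T_START = 2.0    # hypothetical starting temperature
T_END = 0.1      # hypothetical final temperature (kept above zero)


def temperature(step, n_moves=N_MOVES, t_start=T_START, t_end=T_END):
    """Linear interpolation from t_start at step 0 to t_end at the last step."""
    frac = step / (n_moves - 1)
    return t_start + (t_end - t_start) * frac


def accept(delta_e, temp, rng=random):
    """Metropolis rule: always accept downhill moves, uphill with exp(-dE/T)."""
    if delta_e <= 0:
        return True
    return rng.random() < math.exp(-delta_e / temp)
```

A simulation loop would call `accept(new_energy - old_energy, temperature(step))` after each trial fragment insertion.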
Re-assemble split sequences
A total of 15 fragment predictions were produced by ROSETTA for each
segment. The 5 best predictions for adjacent segments were re-combined by
exhaustive splicing. Starting with two sets of overlapping segment predictions,
all possible crossover hybrid models were made and the five with the lowest
energy were saved for the next round, or for final output.
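Exhaustive splicing of two overlapping segments can be illustrated as follows. The representation (each prediction as a list with one entry per residue, the right segment starting at residue `offset` of the left one, all left predictions of equal length) and the `energy` callback are assumptions made for the sketch.

```python
from itertools import product


def splice(left, right, offset, cut):
    """Follow `left` up to residue `cut` (exclusive), then `right` from `cut` on.

    `right` begins at residue `offset` in the numbering of `left`.
    """
    return left[:cut] + right[cut - offset:]


def best_hybrids(lefts, rights, offset, energy, keep=5):
    """Build every crossover hybrid in the overlap and keep the lowest-energy ones."""
    cuts = range(offset, len(lefts[0]) + 1)  # crossover points inside the overlap
    hybrids = [
        splice(left, right, offset, cut)
        for left, right in product(lefts, rights)
        for cut in cuts
    ]
    return sorted(hybrids, key=energy)[:keep]
```

With five predictions per segment, this evaluates 25 pairs at every crossover point in the overlap, and the five retained hybrids become the input for splicing with the next segment.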
1. Bystroff C. & Shao Y. (2002). Fully automated ab initio protein structure
prediction using I-SITES, HMMSTR and ROSETTA. Bioinformatics 18,
S54-61.
2. Altschul S.F. et al. (1997) Gapped BLAST and PSI-BLAST: a new
generation of protein database search programs. Nucleic Acids Res. 25
(17), 3389-3402
3. Bystroff C. & Baker D. (1998). Prediction of local structure in proteins
using a library of sequence-structure motifs. J Mol Biol 281, 565-77.
4. Simons K.T., Kooperberg C., Huang E. & Baker D. (1997). Assembly of
protein tertiary structures from fragments with similar local sequences
using simulated annealing and Bayesian scoring functions. J Mol Biol 268,
209-25.
5. Simons K.T., Ruczinski I., Kooperberg, C., Fox B.A., Bystroff C. & Baker
D. (1999). Improved recognition of native-like protein structures using a
combination of sequence-dependent and sequence-independent features of
proteins. Proteins 34, 82-95.
6. Hobohm U. and Sander C. (1994) Enlarged representative set of protein
structures. Protein Science 3, 522.
INFORMAX (P0326) - 24 predictions: 24 3D
Modeling of CASP5 Targets with GenoMax 3.3 ™ Homology
Modeling Tool
Feodor Tereshchenko
InforMax Inc.
feodor@informaxinc.com
The homology modeling algorithm is described in [1]. Alignments were created
in GenoMax 3.3 ™ environment [2] and manually edited.
The following predictions were submitted:
T0133: Residues 293-304 modeled.
Template - 1SP1.
T0135: Residues 42-100.
Template – 1FCE.
T0137: Residues 1-131.
Template – 1B56.
T0141: Residues 3-73.
Template - 1MIM.
T0144: Residues 5-172.
Template - 1DYW_A.
T0150: Residues 2-97.
Template - 1CN9_A.
T0154: Residues 12-286.
Template - 1IHO_A.
T0155: Residues 4-118.
Template - 2DHN.
T0158: Two models submitted.
Model No. 1: Residues 12-317.
Template - 1JJI_B.
Model No. 2: Residues 10-319.
Template - 1EVQ_A.
T0160: Residues 11-128.
Template - 2MSP_A.
T0163: Residues 1-216.
Template - 1EL7_A.
T0169: Residues 17-69.
Template - 1CZF_B.
T0171: Residues 43-130.
Template - 1OIL_A.
T0175: Residues 14-211.
Template - 1JG4_A.
T0177: Residues 41-83.
Template - 1FYJ_A.
T0178: Residues 96-166.
Template - 1JCL_A.
T0179: Residues 4-275.
Template - 1JQ3_D.
T0180: Residues 10-30.
Template - 1F6D_A.
T0182: Two models submitted.
Model No. 1: Residues 6-245.
Template - 1C27_A.
Model No. 2: Residues 6-246.
Template - 1MAT.
T0190: Residues 6-114.
Template - 1FTP_A.
T0193: Residues 94-203.
Template - 2SCU_A.
T0194: Residues 164-233.
Template - 1D4U_A.
1. Tereshchenko F., Daraselia N. (2000) A homology modeling algorithm for
protein tertiary structure prediction. CASP4 Proceedings, Pacific Grove,
CA. 50-51.
2. GenoMax 3.3 Users Manual. (2001) InforMax Inc. Bethesda, MD.
Irback (P0559) - 20 predictions: 20 3D
Hierarchical All-Atom Procedure for Protein Structure
Prediction
G. Favrin1, A. Irbäck1, B. Samuelsson1,
F. Sjunnesson1 and S. Wallin1
1 – Complex Systems Division, Department of Theoretical Physics,
Lund University, Sölvegatan 14A, SE-22362 Lund, Sweden
anders@thep.lu.se
Our calculations are performed using an all-atom model with a minimalistic
potential [1], in combination with PSIPRED [2] secondary structure
predictions. The potential function of this model consists of three terms
representing excluded volume, hydrogen bonds and hydrophobic attraction,
respectively. The model has been tested on an α-helix and a β-hairpin, using
the same parameters for both peptides [1]. To be able to deal with larger chains,
we proceed in a hierarchical manner, using secondary structure predictions.
The first step in our calculations is to perform separate simulations of different
fragments of the chain. The secondary structure prediction guides the choice of
fragments to be studied, and is sometimes also used to bias parts of the
fragments towards α-helix or β-strand structure. Structures obtained from these
simulations are then taken as input for simulations of larger fragments. This is
iterated until the full chain has been studied. If no reasonable structure is found,
the calculations are restarted from the beginning or some intermediate level.
To analyze the structures obtained from the simulations, we use a simple
clustering algorithm, based on root-mean-square deviations.
Our conformational search is Monte Carlo-based. Only torsional degrees of
freedom are considered. Backbone torsion angles are updated using different
move sets at different places along the chain. For parts of the chain to be
refined only, we use a semi-local "small-step" algorithm. For other parts, we use
both this update and more drastic updates that lead to non-local deformations of
the chain.
1. Irbäck A., Samuelsson B., Sjunnesson F. and Wallin S. (2002)
Thermodynamics of α- and β-structure formation in proteins. Preprint LU
TP 02-28, available at www.thep.lu.se/complex/publications.html.
2. Jones D.T. (1999) Protein secondary structure prediction based on
position-specific scoring matrices. J. Mol. Biol. 292 (2), 195-202.
IST-ZORAN (P0454) - 195 predictions: 195 DR
Three Ensemble Predictors of Protein Disorder
K. Peng, S. Vucetic, and Z. Obradovic,
Center for Information Science and Technology, Temple University
zoran@ist.temple.edu
To predict protein disorder in CASP proteins, we used three different
predictors. The first was based on attributes derived from amino acid
compositions, without incorporating evolutionary information [1]. The other two
predictors incorporated evolutionary information [2]: one was built using a
training set enhanced by homologues of the true disordered regions, and the
other was built from family profiles obtained from PSI-BLAST [3]. All three
predictors are ensembles of 10 feed-forward neural networks constructed from
the same training data.
Predictor1. The first predictor constructs 20 attributes (18 amino acid
frequencies, average flexibility and sequence complexity) at each sequence
position using an input window of length W_in centered at the position. The raw
predictions are averaged over an output window of length W_out to obtain the
final prediction for a given position. The dataset included 150 proteins with
disordered regions longer than 30 consecutive residues and 290 completely
ordered proteins. By comparing the predictor performance, we selected the best
window size combination as W_in = 41 and W_out = 61. The accuracy using
30-fold cross-validation was 76.08% on disordered regions and 91.11% on ordered
proteins.
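The output-window averaging described for Predictor1 can be sketched as follows. This is only an illustration, not the authors' code: the function name smooth_predictions is hypothetical, and truncating the window at the sequence ends is an assumption, since the abstract does not say how the ends are handled. Only the window length W_out = 61 comes from the text.

```python
def smooth_predictions(raw, w_out=61):
    """Average raw per-residue scores over a centered window of length w_out.

    Assumption: near the sequence ends the window is truncated to the
    residues that actually exist.
    """
    if w_out < 1 or w_out % 2 == 0:
        raise ValueError("w_out must be a positive odd integer")
    half = w_out // 2
    smoothed = []
    for i in range(len(raw)):
        lo = max(0, i - half)
        hi = min(len(raw), i + half + 1)
        window = raw[lo:hi]
        smoothed.append(sum(window) / len(window))
    return smoothed
```

A larger W_out suppresses isolated high or low raw scores, which suits long disordered regions such as those in the training set.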
Predictor2. To incorporate evolutionary information, the second predictor was
trained using disordered examples taken from 150 disordered proteins as well
as their homologues found by PSI-BLAST search against the non-redundant
database (nr), while ordered examples were still taken from 290 completely
ordered proteins. Homologues with too high E-values (>1e-05) or too low
E-values (<1e-30) were excluded. Using random sampling, the disordered
examples from the same family had an equal chance to be selected for training.
This predictor constructs a set of composition-based attributes and averages the
raw predictions in the same manner as predictor1. Given W_in = 41, W_out = 61,
the accuracy using 30-fold cross-validation was 80.77% on disordered regions
and 89.87% on ordered proteins.
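The homologue filter and the within-family sampling described for Predictor2 could look roughly like the sketch below. The function names and the (name, E-value) data layout are assumptions made for illustration. The two E-value thresholds are the ones quoted in the text.

```python
import random


def select_homologues(hits, e_max=1e-5, e_min=1e-30):
    """Keep (name, e_value) hits whose E-value lies in [e_min, e_max].

    Hits above e_max are too unreliable; hits below e_min are treated as
    near-duplicates of the query and add little new information.
    """
    return [(name, e) for name, e in hits if e_min <= e <= e_max]


def sample_family_example(family_examples, rng):
    """Draw one disordered example from a family with equal probability."""
    return rng.choice(family_examples)
```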
Predictor3. The third predictor constructs attributes from the family profiles
obtained by PSI-BLAST search against the non-redundant database (nr). The
substitution scores for each amino acid are averaged over an input window of
length W_in to obtain 20 profile-based attributes. They are used along with the
average flexibility and the sequence complexity attributes calculated over the
same input window of length W_in. As in predictor1, the raw
predictions are also averaged over an output window of length W_out. Given W_in
= 41 and W_out = 61, the accuracy using 30-fold cross-validation was 79.17% on
disordered regions and 91.34% on ordered proteins.
1. Vucetic S. et al. (2001) Methods for Improving Protein Disorder
Prediction, Int'l Joint Conf. on Neural Networks 2001.
2. Peng K. et al. (unpublished) Improving Protein Disorder Prediction by
Incorporating Evolutionary Information and Optimizing Knowledge
Representation.
3. Altschul S.F. et al. (1997) Gapped BLAST and PSI-BLAST: a new
generation of protein database search programs. Nucleic Acids Res. 25
(17), 3389-3402.
Jager (P0582) - 7 predictions: 7 3D
Protein Minimization by Multiscale Decomposition
Lukas Jager
University of Bonn, Institute for Applied Mathematics, Department for
Scientific Computing and Numerical Simulation
jager@iam.uni-bonn.de
The tertiary structures of the target proteins are predicted using a novel
multilevel minimization algorithm. The proteins are described by the
CHARMM [1] force field with an all-atom representation including hydrogens.
The minimization procedure employs a three-level decomposition of the protein
where the coarse-level potentials are automatically computed. This
computation is based on the all-atom representation only. For the minimization
itself we use an outer (global) Basin-Hopping algorithm [2]. The
Basin-Hopping algorithm accepts or rejects each minimum, which is calculated from
a disturbed configuration of the last accepted minimum, according to a standard
Metropolis probability. To speed up convergence, each outer minimization is
preconditioned by a combined Basin-Hopping on the coarsest level with a local
minimization on the other levels. Our numerical results clearly show that this
multilevel sampling strategy significantly improves the efficiency compared to
a simple local minimization algorithm [3]. The details of the method are as
follows:
Starting from the all-atom representation of the protein, two additional
representations with fewer particles are calculated. The first coarse
representation of the protein, which we call the "endless" representation, is built
by simply removing all ends of the protein, i.e., all atoms with only one bond
neighbor are removed. To adjust the force field, the potential parameters of the
remaining atoms are recalculated. The second coarse level consists only of hard
spheres, each representing one amino acid. The spheres are located at the center
of "mass" of the amino acids, where the "epsilon" parameter of the
Lennard-Jones potential on the finest level, which determines the depth of the potential
minimum, is used for weighting. The radius of the sphere is defined by the
maximum distance between an atom of an amino acid and the sphere's center.
The potential on the coarsest level consists only of bonds between adjacent
spheres and a (coarse) Lennard-Jones potential where the (coarse) parameters
are computed from the all-atom representation on the finest level.
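The construction of one coarse-level sphere can be illustrated with a short sketch. It assumes each atom is given as (x, y, z, epsilon) and that the magnitude of epsilon is used as the weight; the function name coarse_sphere and this data layout are hypothetical, not taken from the author's program.

```python
import math


def coarse_sphere(atoms):
    """Return (center, radius) of the sphere representing one amino acid.

    atoms: list of (x, y, z, epsilon) tuples for the residue's atoms.
    The center is the epsilon-weighted mean position; the radius is the
    largest distance from the center to any atom of the residue.
    Assumption: |epsilon| is used as the weight, so the sign convention of
    the force-field parameter file does not matter.
    """
    weights = [abs(a[3]) for a in atoms]
    total = sum(weights)
    if total == 0:
        raise ValueError("all epsilon weights are zero")
    center = tuple(
        sum(w * a[k] for w, a in zip(weights, atoms)) / total for k in range(3)
    )
    radius = max(math.dist(a[:3], center) for a in atoms)
    return center, radius
```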
The coarsest model of the protein is then minimized using the Basin-Hopping
algorithm described above. The "endless" representation is rearranged
according to the lowest found minimum on the coarsest level and updated by
local minimization. Finally the all-atom representation is minimized locally
starting from the structure which was calculated by refining the "endless"
representation. Each minimum found by this procedure is taken as one step of
the global Basin-Hopping algorithm and thus accepted or rejected depending on
the potential energy on the finest level. Finally the configuration with the
lowest potential energy is taken as the best guess for the protein configuration.
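The outer Basin-Hopping loop with Metropolis acceptance can be sketched on a one-dimensional toy function. This is an illustration under stated assumptions, not the multilevel method itself: the local minimizer is a crude gradient descent standing in for the real force-field minimization, and all names, step sizes and the temperature are hypothetical.

```python
import math
import random


def local_minimize(f, x, step=1e-3, iters=2000, h=1e-6):
    """Crude gradient descent with a numerical derivative (toy stand-in)."""
    for _ in range(iters):
        grad = (f(x + h) - f(x - h)) / (2 * h)
        x -= step * grad
    return x


def basin_hopping(f, x0, n_steps=200, perturb=2.0, temperature=1.0, seed=0):
    """Perturb the last accepted minimum, re-minimize, then accept or reject.

    Acceptance uses the Metropolis probability exp(-delta / temperature);
    the lowest minimum seen so far is returned.
    """
    rng = random.Random(seed)
    current = local_minimize(f, x0)
    best = current
    for _ in range(n_steps):
        trial = local_minimize(f, current + rng.uniform(-perturb, perturb))
        delta = f(trial) - f(current)
        if delta <= 0 or rng.random() < math.exp(-delta / temperature):
            current = trial
            if f(current) < f(best):
                best = current
    return best
```

On a double-well function such as x**4 - 4*x**2 + x, a search started in the shallower well is expected to reach the deeper one once the perturbation is large enough to cross the barrier.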
1. MacKerell A.D. Jr., Brooks B., Brooks C.L., Nilsson L., Roux B., Won Y.,
Karplus M. (1998) CHARMM: The Energy Function and Its
Parameterization with an Overview of the Program, The Encyclopedia of
Computational Chemistry, 1, 271-277, P. v. R. Schleyer et al., editors John
Wiley & Sons: Chichester.
2. Wales D.J., Doye J.P.K. (1997) Global Optimization by Basin-Hopping
and the Lowest Energy Structures of Lennard-Jones Clusters Containing
Up to 110 Atoms. Chem. Phys. Lett. 269, 408-412.
3. Jager L. (2002) Zur globalen Minimierung von Energie-Funktionen.
Diplomarbeit, Institut für Angewandte Mathematik, Universität Bonn.
jive (P0506) - 37 predictions: 37 3D
JIVE: Protein Structure Prediction by the Assembly of Local
Supersecondary Structural Motifs
David F. Burke, and Tom L Blundell
Department of Biochemistry, University of Cambridge, 80 Tennis Court Road,
Cambridge, CB2 1GA, United Kingdom
dave@cryst.bioc.cam.ac.uk
In the CASP5 experiment, targets whose models had low confidence
values across the CAFASP3 servers were selected to be modelled by JIVE.
JIVE predicts the structure of small continuous domains of proteins by the
assembly of fragments of local supersecondary motifs. Initially, homologous
sequences were identified using PSI-BLAST [1] and secondary structure
prediction was performed using PHD [2]. The conformational classes of the
supersecondary fragments were predicted using SLOOP [3-5]. SLOOP uses
sequence/structure profiles derived from a database of loops clustered on the
conformation of the loop and surrounding secondary structures. These
fragments were then assembled using a Monte Carlo simulation. Unsuitable
models were rejected based on excluded volume and a distance-dependent
conditional probability function [6].
The generated structures were then searched against protein structures from
both the HOMSTRAD [7] database of homologous families and the
CAMPASS [8] database using the program SEA [9]. Potential hits were then
analysed further for validation. In all, 17 targets were submitted.
1. Altschul S.F. et al. (1997) Gapped BLAST and PSI-BLAST: a new
generation of protein database search programs. Nucleic Acids Res.
25(17):3389-402.
2. Rost B. et al. (1994) PHD-an automatic mail server for protein secondary
structure prediction. Comput Appl Biosci. 10(1):53-60.
3. Donate L.E. et al. (1996) Conformational analysis and clustering of short
and medium size loops connecting regular secondary structures: a database
for modeling and prediction. Protein Sci. 5(12):2600-16.
4. Rufino S.D. et al. (1997) Predicting the conformational class of short and
medium size loops connecting regular secondary structures: application to
comparative modelling. J Mol Biol. 267(2):352-67.
5. Burke D.F. et al. (2001) Improved Loop prediction from sequence alone.
Protein Engineering 14(7):473-478.
6. Samudrala R. et al. (1998) An all-atom distance-dependent conditional
probability discriminatory function for protein structure prediction. J Mol
Biol. 275(5):895-916.
7. Mizuguchi K. et al. (1998) HOMSTRAD: a database of protein structure
alignments for homologous families. Protein Science 7:2469-2471.
8. Sowdhamini R. et al. (1996) A database of globular protein structural
domains: clustering of representative family members into similar folds.
Fold Des 1(3):209-20.
9. Rufino S.D. et al. (1994) Structure-based identification and clustering of
protein families and superfamilies. J Comput Aided Mol Des 8(1):5-27.
Jones (P0067) - 121 predictions: 68 3D, 53 SS
Fold Recognition Using THREADER and GenTHREADER
D. T. Jones & L. McGuffin
Bioinformatics Unit, Dept. of Computer Science, University College London,
Gower Street, London WC1E 6BT
dtj@cs.ucl.ac.uk
THREADER3 is the latest incarnation of our original program to implement
threading [1] (D.T. Jones et al. Nature 358, 86-89, 1992) and although it now
incorporates a number of new features (in particular the use of PSI-BLAST [2]
profiles), and a more refined set of potentials, the overall components of the
current implementation remain more or less unchanged since CASP2. The fold
library and potentials used throughout CASP5 were based on representative
protein chains from the FSSP data bank and domains found in SCOP V1.57.
As for our CASP4 predictions, the raw threading output was evaluated using a
neural network (similar to that used in GenTHREADER [3]) trained to
discriminate between correct and incorrect fold recognition matches. This
method is still very experimental, but it was used for all "non-obvious"
prediction targets. Final predictions were based on the neural network output.
Predictions for targets where the neural network output (range 0-1) of the top
match was < 0.5 were not submitted (but were still considered for ab initio
prediction if the size permitted). Only a single prediction was submitted for
each target, unless either a second fold had an equal score to the top hit or, in a
few cases, more than one alignment was generated with and without
secondary structure prediction inputs.
Targets with obvious homology to existing structures were predicted using
GenTHREADER and mGenTHREADER [3] as submitted to the CAFASP3
prediction section. However, in making CASP5 submissions, we also
considered other models obtained from the CAFASP3 web server. A new
program called MODCHECK was used to evaluate the ensemble of collected
structures in order to identify the model predicted to have the highest accuracy.
MODCHECK is based on the same potentials used for THREADER3, but
makes use of a large number of shuffled sequence re-alignments in order to
estimate the specificity of the initial sequence-structure alignment.
1. Jones D.T., Taylor W.R. & Thornton J.M. (1992) A new approach to
protein fold recognition. Nature 358, 86-89.
2. Altschul S.F. et al. (1997) Gapped BLAST and PSI-BLAST: a new
generation of protein database search programs. Nucleic Acids Res. 25
(17), 3389-3402.
3. Jones D.T. (1999) GenTHREADER: An efficient and reliable protein fold
recognition method for genomic sequences. J. Mol. Biol. 287, 797-815.
Jones-NewFold (P0068) - 214 predictions: 87 3D, 63 SS, 64 DR
FRAGFOLD, PSIPRED and DISOPRED: Methods for
Prediction Of New Folds and Elements of Local Structure
D. T. Jones, J. Ward & L. McGuffin
Bioinformatics Unit, Dept. of Computer Science, University College London,
Gower Street, London WC1E 6BT
dtj@cs.ucl.ac.uk
For CASP5 targets which we could not reliably predict using fold recognition
methods, our FRAGFOLD [1] method was used to generate up to 5 structures.
This approach to protein tertiary structure prediction is based on the assembly
of recognized supersecondary structural fragments taken from highly resolved
protein structures using a simulated annealing algorithm.
For all targets (including CM and FR targets), secondary structure was
predicted using PSIPRED [2-3]. PSIPRED predictions in CASP5 (as opposed
to CAFASP3) were generated with a database updated at the CASP deadline
rather than the CAFASP deadline. Also for CASP targets which were obviously
multidomain, PSIPRED predictions were made for the individual domains and
then combined. Two new programs were tried at CASP5: PSIPRED-SVM and
DISOPRED. PSIPRED-SVM is a reimplementation of PSIPRED using Support
Vector Machines rather than neural networks. PSIPRED-SVM was trained on a
much smaller dataset than PSIPRED, and yet appears to have equal, if not
slightly better performance from our own cross-validation benchmarks.
DISOPRED makes use of a variation of the original PSIPRED method to
predict disordered regions in proteins. Regions which are predicted by
PSIPRED to be coil regions are further analysed using a second neural network
trained to identify disordered regions. At present, a crude training set has been
used for this network, which is derived by defining missing regions in protein
structures (determined by alignment of the PDB SEQRES records with the
ATOM records) as regions likely to be disordered. We hope to refine this
training set by manual inspection, and by examination of NMR structure
ensembles.
Our 3-D submissions were calculated using the following procedure:
1. Selection of fragment library. At each sequence position a list of 10
structural fragments is generated. These fragments (supersecondary motifs or
fixed length fragments) are taken from 200 highly resolved protein chains with
no chain breaks. The selection process involves ranking the fragments in order
of potential energy Z-scores (an ungapped alignment is used for this ranking),
and excluding fragments based on the PSIPRED secondary structure prediction.
2. Simulation. A classic Metropolis scheme is employed in running the
simulation. Random moves are made by selecting either one of the preselected
10 fragments at a randomly chosen sequence position, or a free choice is made
from all 3-5 residue fragments from the entire fold library. These moves are
first tested to ensure that the generated structure is physically possible (steric
checks) and then accepted if the Metropolis criterion is met. The starting
temperature for the simulation is selected by making 500 random moves to the
starting conformation and calculating the largest absolute energy change
between any two moves. The simulation is started at a temperature
corresponding to 10 times this delta-E, and the temperature is halved after
either 5000 random moves have been accepted by the Metropolis criterion, or a
total of 50000 sterically allowable moves have been tested.
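The temperature schedule in step 2 can be sketched as follows. This is an illustration only: the callback propose_delta, which returns the energy change of one sterically allowed trial move, is hypothetical, and "largest absolute energy change" is read here as the largest |delta-E| among the probing moves, which is one possible interpretation of the text. The factor 10, the halving, and the limits of 5000 accepted and 50000 tested moves come from the abstract.

```python
import math
import random


def initial_temperature(delta_energies, factor=10.0):
    """Start at `factor` times the largest |delta-E| seen in the probing moves."""
    return factor * max(abs(d) for d in delta_energies)


def metropolis_accept(delta_e, temperature, rng):
    """Always accept downhill moves; accept uphill ones with exp(-dE/T)."""
    return delta_e <= 0 or rng.random() < math.exp(-delta_e / temperature)


def anneal(propose_delta, t_start, n_stages,
           accepted_limit=5000, tested_limit=50000, seed=0):
    """Run n_stages temperature stages, halving the temperature after each.

    A stage ends when accepted_limit moves were accepted or tested_limit
    moves were tested, whichever comes first.
    Returns (schedule, total_energy_change), where schedule lists
    (temperature, accepted, tested) per stage.
    """
    rng = random.Random(seed)
    temperature = t_start
    total_change = 0.0
    schedule = []
    for _ in range(n_stages):
        accepted = tested = 0
        while accepted < accepted_limit and tested < tested_limit:
            delta = propose_delta(rng)
            tested += 1
            if metropolis_accept(delta, temperature, rng):
                accepted += 1
                total_change += delta
        schedule.append((temperature, accepted, tested))
        temperature /= 2.0
    return schedule, total_change
```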
3. Potentials. The THREADER V3.0 potentials were used. These are
distance-dependent potentials of mean force compiled from a non-redundant set of
protein chains with resolutions < 2.6 Angstroms. Predicted secondary structure
was not incorporated in the objective function. Energies were summed over
homologous sequences. In addition to the mean force terms, simple terms were
added to take account of hydrogen bonding in beta-sheets and steric hindrance.
An improved solvation potential was used which allowed us to dispense with
an additional chain compactness term.
4. Final model selection. Between 500 and 2000 separate simulations were run
in parallel with different random seed values on a farm of 75 dual-CPU Linux
machines. The final structures were clustered using a fast iterative rigid-body
clustering program (RMSDCLUST) and the representatives of the largest
clusters were submitted (up to the CASP maximum of 5) as final predictions.
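The abstract does not describe how RMSDCLUST works, so the sketch below is not that program. It shows one common greedy scheme for picking representatives of the largest clusters, assuming a precomputed symmetric matrix of pairwise RMSD values and a hypothetical distance cutoff.

```python
def greedy_cluster(dist, cutoff):
    """Greedy clustering on a symmetric pairwise-distance matrix.

    Repeatedly picks the unassigned structure with the most unassigned
    neighbours within `cutoff`, makes it the cluster representative, and
    removes the whole cluster. Clusters come out largest first.
    Returns a list of (representative_index, member_indices).
    """
    unassigned = set(range(len(dist)))
    clusters = []
    while unassigned:
        best_center, best_members = None, []
        for i in sorted(unassigned):
            members = [j for j in sorted(unassigned) if dist[i][j] <= cutoff]
            if len(members) > len(best_members):
                best_center, best_members = i, members
        clusters.append((best_center, best_members))
        unassigned -= set(best_members)
    return clusters
```

Taking the first five entries of the returned list would give up to five representatives, matching the CASP submission limit mentioned above.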
1. Jones D.T. (1997) Successful ab initio prediction of the tertiary structure of
NK-Lysin using multiple sequences and recognized supersecondary
structural motifs. PROTEINS. Suppl. 1, 185-191.
2. Jones D.T. (1999) Protein secondary structure prediction based on
position-specific scoring matrices. J. Mol. Biol. 292 (2), 195-202.
3. http://www.psipred.net
Kaznessis (P0548) - 15 predictions: 15 3D
Guiding Molecular Mechanics Simulations of Protein Folding
with Correlated Mutations Analysis
Yiannis Kaznessis, Himanshu Khandelia and Spyros Vicatos
Department of Chemical Engineering and Materials Science, and Digital
Technology Center, University of Minnesota, Minneapolis, MN 55455
yiannis@cems.umn.edu
proposed to be proximal in the structure of each target protein. Proximity is
declared if the C-alpha atoms are closer than 6 Angstrom.
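The correlated mutations analysis in this abstract scores pairs of alignment columns by a correlation coefficient computed from amino acid property descriptors. A minimal sketch of that idea is given below, assuming an ungapped alignment and a single property scale supplied by the caller. The function names are hypothetical, and the scale stands in for the principal-component descriptors, which are not given in the text. The separation rule |i-j| > 4 and the count of 18 pairs are from the abstract.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient; returns 0.0 for a constant column."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / math.sqrt(sxx * syy)


def correlated_pairs(msa, scale, min_separation=5, top_n=18):
    """Rank column pairs (i, j) with j - i >= min_separation by correlation.

    msa: list of equal-length aligned sequences (assumed to contain no gaps).
    scale: dict mapping residue letter to a property value.
    Returns up to top_n tuples (correlation, i, j), highest first.
    """
    length = len(msa[0])
    columns = [[scale[seq[p]] for seq in msa] for p in range(length)]
    scored = []
    for i in range(length):
        for j in range(i + min_separation, length):
            scored.append((pearson(columns[i], columns[j]), i, j))
    scored.sort(reverse=True)
    return scored[:top_n]
```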
MOLECULAR DYNAMICS
CHARMM [5] was used to carry out molecular dynamics simulations (MD).
The protein's initial configuration was a linear chain. The simulation was
carried out in infinite dilution in a continuum dielectric of epsilon=70.0. A
restraining spring-like attractive potential was used to bring together the
C-alpha atoms of the amino acids predicted to be close by the CMA. The strength
of the spring constant was varied in different simulations, so that the energy of
the constraints ranged between 0.5% and 3% of the total energy of the protein.
The protein was minimized for 5000 steps using the SHAKE algorithm. The
minimized structure was then subjected to 10 ps of heating by velocity scaling
from 23.3 K to 323.3 K. MD was then carried out for 11 ns with a time step of
2 femtoseconds. The constraints were then relaxed and MD was carried out for
another 11 ns with the same time step. We picked the protein conformations
having the lowest energies in the latter half of the simulation (i.e. without
constraints).
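The spring-like restraint on predicted contacts can be illustrated with a small energy function. The exact functional form used in the simulations is not given, so this sketch assumes a half-harmonic term that is zero once the two C-alpha atoms are within the 6 Angstrom proximity threshold. The function name, the spring constant and the data layout are hypothetical.

```python
import math


def restraint_energy(coords, pairs, k=1.0, d0=6.0):
    """Sum of half-harmonic restraint energies over predicted contact pairs.

    coords: dict mapping residue index to the (x, y, z) C-alpha position.
    pairs: iterable of (i, j) residue index pairs predicted to be proximal.
    A pair contributes 0.5 * k * (d - d0)**2 only while its distance d
    exceeds d0, so the term is purely attractive.
    """
    energy = 0.0
    for i, j in pairs:
        d = math.dist(coords[i], coords[j])
        if d > d0:
            energy += 0.5 * k * (d - d0) ** 2
    return energy
```

Scaling k up or down changes the share of the total energy carried by the restraints, which is the quantity the abstract reports varying between 0.5% and 3%.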
A combination of correlated mutations analysis and molecular dynamics
simulations was used to predict the structure of the target sequences.
CORRELATED MUTATIONS ANALYSIS
Multiple sequence alignments (MSA) were built by processing the target
sequences with PSI-BLAST and ClustalW [1,2].
A correlated mutations analysis (CMA) was performed on each MSA to predict
pairs of amino acids that are proximal in the structure of the protein [3].
Specifically, the CMA consists of calculating three different correlation
coefficients for all pairs of positions in the MSA. Principal component analysis
was used on 142 experimentally determined amino acid properties [4] to filter
out three orthogonal descriptors of amino acid properties. The first principal
component is associated with the hydrophobicity of the residues, the second is a
measure of size, and the third is related to electronic properties of amino acids.
The descriptors were used to calculate correlation coefficients for each pair of
positions in the MSA. Pairs of positions distant in the alignment (|i-j|>4) and
having the highest coefficients were used to form a set of distance constraints in
molecular dynamics simulations. Eventually, 18 pairs of amino acids were
1. Altschul S.F. et al. (1997) Gapped BLAST and PSI-BLAST: a new
generation of protein database search programs. Nucleic Acids Res. 25
(17), 3389-3402.
2. Thompson J.D., Higgins D.G. & Gibson T.J. (1994) CLUSTAL W:
improving the sensitivity of progressive multiple sequence alignment
through sequence weighting, positions-specific gap penalties and weight
matrix choice. Nucleic Acids Res. 22, 4673-4680.
3. Neher E. (1994) How frequent are correlated changes in families of protein
sequences?, Proc. Natl. Acad. Sci., 91, 98-102.
4. Kawashima S., Ogata H. & Kanehisa M. (1999) AAindex: amino acid index
database. Nucleic Acids Res. 27, 368-369.
5. Brooks B.R., Bruccoleri R.E., Olafson B.D., States D.J., Swaminathan S.
& Karplus M. (1983). CHARMM: a program for macromolecular energy,
minimization, and dynamics calculations. J. Comp. Chem. 4, 187-217.
keasar (P0429) - 90 predictions: 90 3D
Protein Structure Prediction with an Ant Lion Town
Potential
N. Kalisman and C. Keasar
Ben-Gurion Univ. of the Negev
keasar@cs.bgu.ac.il
Due to the inherently rough energy landscape, conformational search
simulations of proteins hardly ever converge to the global minimum. This
so-called "multiple minima problem" is generally considered a major obstacle for
protein structure prediction. An Ant Lion Town Potential (ALTP) is an attempt
to take advantage of this malady. Inspired by the impressive earthworks of the
small insect [1] as well as a previous scientific work [2], an ALTP has local
minima with wide basins of attraction. As a result, reduction of the protein's
conformation space is achieved by the convergence of large regions of the
space into single points, namely the local minima.
Wide basins of attraction for local minima are achieved by using five types of
energy terms: soft-atom van der Waals [2], long-range hydrophobic term,
cooperative hydrophilic term, cooperative hydrogen bond term and soft
distance constraint term. The cooperative hydrogen bonding term is used to
bias the conformation search towards a predicted secondary structure. The soft
distance constraint term is used to enforce disulfide bonds and (when available)
structural insight from fold recognition or homology modeling. The ALTP
allows a rapid generation of decoy sets for protein structure prediction by
repeated torsion-angle energy minimization of random starting points [3].
During the CASP5 experiment our group predicted 17 targets. Due to computer
time and memory limitations we restricted ourselves to proteins and protein
fragments of up to 140 residues. For each target a decoy set of 10,000 to
60,000 models was built. The submitted models were selected from the lowest
energy percent of the decoys by clustering and visual inspection.
We have predicted eleven targets which were given low fold recognition scores
by the servers and meta-servers which participated in the CAFASP experiment
[4] (T0131, T0135, T0136 fragment, T0148 fragment, T0149 fragment, T0157,
T0170, T0172 fragment, T0173 two fragments, T0180 and T0181 fragment).
These were considered ab-initio targets.
Secondary structure was assigned to these targets based on the consensus of
several secondary structure prediction sites. The predicted secondary structures
of remote, but clearly related, homologs of the targets were used to confirm the
prediction. Decoy sets were generated and models were chosen for submission
as described above. When the secondary structure seemed ambiguous two
decoy sets were generated independently. Submitted models were taken from
both.
The ALTP approach to protein structure prediction was originally developed
for ab-initio predictions. We believe, though, that predicting the structure of
large insertions in fold recognition/homology modeling has much in common
with ab-initio prediction. In this experiment we have tried for the first time to
use the ALTP scheme for fold recognition/homology modeling targets.
We predicted six such targets (T0130, T0138, T0139, T0140 fragment, T0176 and
T0188). For each of them we used the most reliable parts of the top-ranking
CAFASP [4] model as a template. The distances between the alpha carbons
of the template structure were used as (soft) constraints to the energy
minimization simulations. Otherwise, the prediction was performed as with
the ab-initio targets.
While differing in quite a few details, the prediction scheme presented here is
very similar to one used by the Levitt group in CASP4 [5].
1. http://waynesword.palomar.edu/pljuly97.htm
2. Head-Gordon T. and Stillinger F.H. (1993) Predicting polypeptide and
protein structures from amino-acid-sequence – Antlion method applied to
melittin. Biopolymers 33, 293-303.
3. Levitt M. (1983) Protein folding by restrained energy minimization and
molecular dynamics. J. Mol. Biol. 170, 723-764.
4. http://www.cs.bgu.ac.il/~dfischer/CAFASP3/
5. http://predictioncenter.llnl.gov/casp4/abstracts/casp4-abstracts-ab.html#38
KGI-QMW (P0015) - 19 predictions: 19 3D
A Bayesian Network Model for Protein Fold and
Superfamily Recognition
D.L. Wild1, A. Raval1,3, and M. Saqi2
1 – Keck Graduate Institute,
2 – Queen Mary School of Medicine and Dentistry, London,
3 – Claremont Graduate University, Dept. of Mathematics
david_wild@kgi.edu
A library of Bayesian network models based on SCOP superfamilies and
homologous sequences in SWISS-PROT was constructed using the methods
outlined in [1] and references therein. The Bayesian network approach is a
framework which combines graphical representation and probability theory,
which includes, as a special case, hidden Markov models. Our implementation
is a Bayesian network which simultaneously learns amino acid sequence,
secondary structure and residue accessibility for proteins of known
three-dimensional structure. An awareness of the errors inherent in predicted
secondary structure and residue accessibility may be incorporated into the
model by means of a confusion matrix. The Bayesian network models we have
utilized for CASP 5 can thus be seen as extensions of hidden Markov models to
incorporate multiple observations and confusion matrices.
In preparation for CASP 5, we modeled 89 superfamilies from SCOP 1.59
using BN1-PRED models and 53 of these using both BN1-PRED and BN2
models [1]. The BN1-PRED models are trained and tested with predicted
secondary structure and residue accessibility whilst the BN2 models are trained
with DSSP-calculated secondary structure and residue accessibility and tested
with predicted secondary structure and residue accessibility. For the BN2
models, a confusion matrix is applied to allow for errors in secondary structure
and residue accessibility prediction.
A complete list of modeled superfamilies is given at
http://public.kgi.edu/~wild/BN/CASPtrained.html. For the superfamilies that
have a low number of representative domains in SCOP 1.59, only BN1-PRED
models were trained, using the representative domain sequences themselves as
well as sequence homologs of these domains in order to create a larger training
set.
The recent release of targets for CASP 5 contained a list of 67 targets. Out of
these, we submitted predictions for 18 targets (after filtering out Psi-Blast [2]
detectable targets that we deemed suitable for comparative modeling). We first
carried out a secondary structure and residue accessibility prediction for each
target using JNET [3]. Barring some exceptional targets (targets 130 and 136,
described below), the typical method of prediction for each target was as
follows:
1) The target was scored against both BN2 and BN1-PRED model
libraries by evaluating its posterior probability for belonging to each
superfamily in the library.
2) The top 3 superfamilies (as ranked by the posterior score) according to
both BN2 and BN1-PRED models were then identified and compared
against predictions of other automated fold recognition methods. Any
superfamily prediction that we considered to be in the wrong structural
class (based on predicted secondary structure) was removed from the
top-3 list.
3) The closest template of the target within each of the top ranking
superfamilies was found using one of two procedures:
(a) Comparing the posterior score of the target to the posterior scores
of the representative domains within the superfamily, and
identifying the domain whose posterior score was closest to the
posterior score of the target as the template.
(b) Comparing the Fisher score vector [4] of the target to the Fisher
score vectors of the representative domains within the
superfamily, and identifying the domain whose Fisher score
vector was closest to that of the target (with "closeness" defined
in terms of Euclidean distance between Fisher score vectors).
4) After identifying the closest templates within each of the top-ranking
superfamilies, the sequences were then individually aligned to the
target using either a global (GCG program GAP) or local (GCG
program SSEARCH) sequence alignment, depending on the length
difference between the template and the target.
5) The template(s) that gave the best alignment were then reported as
predictions in AL format.
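The two template-selection procedures in step 3 amount to nearest-neighbour lookups, which the sketch below illustrates. The function names and the dictionary layout are hypothetical, and computing the posterior scores and Fisher score vectors themselves is outside the scope of this example.

```python
import math


def closest_by_posterior(target_score, domain_scores):
    """Procedure (a): domain whose posterior score is closest to the target's.

    domain_scores: dict mapping domain name to its posterior score.
    """
    return min(domain_scores, key=lambda d: abs(domain_scores[d] - target_score))


def closest_by_fisher(target_vec, domain_vecs):
    """Procedure (b): domain with the smallest Euclidean distance between
    its Fisher score vector and that of the target.

    domain_vecs: dict mapping domain name to a vector of the same length
    as target_vec.
    """
    return min(domain_vecs, key=lambda d: math.dist(target_vec, domain_vecs[d]))
```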
Templates for targets 130 and 136 were identified using Psi-Blast, as follows.
For target 130, Psi-Blast against the nr database identified the target as a
nucleotidyltransferase. A search against SCOP for the keyword
“nucleotidyltransferase” found the domain 1kny, which gave a good alignment
to the target and was reported in the final prediction. For target 136, Blast
detected a carboxyl transferase conserved domain and Psi-Blast reported a hit
against 1bob, an N-acyltransferase that gave a good alignment to the target.
1bob was reported in the final prediction.
1. Raval A. et al. (2002) A Bayesian network model for protein fold and
remote homologue recognition. Bioinformatics 18 (6), 788-801.
2. Altschul S.F. et al. (1997) Gapped BLAST and PSI-BLAST: a new
generation of protein database search programs. Nucleic Acids Res. 25
(17), 3389-3402.
3. Cuff J.A. and Barton G.J. (2000) Application of multiple sequence
alignment profiles to improve secondary structure prediction. Proteins 40,
502-511.
4. Jaakkola T. et al. (2000) A discriminative framework for detecting remote
protein homologies. J. Comput. Biol. 7 (1-2), 95-114.
KIAS (P0531) - 479 predictions: 176 3D, 303 SS
Prediction of Protein Secondary Structure using PREDICT,
a Novel Method Based on Pattern Matching
Keehyung Joo1, Ilsoo Kim1, Julian Lee1,
Seung-Yeon Kim1, Sung Jong Lee1,2, and Jooyoung Lee1
1 School of Computational Sciences, Korea Institute for Advanced Study
2 Department of Physics, Suwon University
jlee@kias.re.kr
We introduce a novel method for the secondary structure prediction, PREDICT
(PRofile Enumeration DICTionary). This method uses a concept of distance
between patterns. For a given protein sequence, this method uses PSI-BLAST
(Position Specific Iterative Basic Local Alignment Search Tool) to generate
profiles, which define patterns for amino acid residues. Each pattern is
compared with those in the pattern database generated from PDB (Protein Data
Bank), and the patterns close to the query pattern are selected to determine the
secondary structure of the query residue. This method combines the idea of the
nearest-neighbor method of Yi and Lander [1] with the profile generating
technology of PSI-BLAST [2].
In more detail, we first generate profiles for the query sequence and
also for those in the database using the PSI-BLAST multiple sequence
alignment. The profile defines a pattern for each residue of these sequences by
considering seven neighboring residues to the left and right of the given
residue, which makes a window of size 15. The pattern is a 15 x 21 matrix
where 21 stands for 20 amino acid types plus one indicating the N and C
terminal ends of the protein sequence. The elements of this matrix are the
relative frequencies of amino acids observed in the multiple sequence alignment.
The distance between patterns j and k is defined by
$D_{jk} = \sum_{\ell} W_{\ell} \, |P^{j}_{\ell} - P^{k}_{\ell}|$,
where $P^{j}_{\ell}$ ($\ell = 1, \ldots, 315$) is the $\ell$-th component of pattern j,
and $W_{\ell}$ are the weight parameters. For each pattern in the database, we
calculate the secondary
structure according to the DSSP (Dictionary of Protein Secondary Structure)
[3]. We use the 3 state classification for the secondary structure : H (helix), E
(extended), and C (coil). We compare the query pattern with those in the
pattern database. This database is called the first-layer database, and consists of
7777 proteins selected from the PDB, with 1988085 residues. With a preset
number N, we choose the N nearest patterns. A naïve way for the prediction
would be to use the secondary structure of the majority of these N patterns as
the prediction. We call it the first-layer prediction. However, instead of
performing the first-layer prediction, we count the number of patterns
corresponding to H, E, C, which defines a 15 x 4 matrix for the query residue.
We construct this matrix for each residue in the set of non-homologous proteins
CB513, consisting of 513 proteins with 84119 residues. These matrices
constitute the second-layer pattern database. The 15 x 4 matrix pattern of a
query residue is now compared with those in the second-layer pattern database,
and the N closest ones are chosen. The secondary structure of the majority of
these patterns is used as the final prediction.
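The weighted distance and the naive first-layer vote described above can be sketched as follows. This is only an illustration of the nearest-neighbor idea, not the two-layer PREDICT implementation; the toy patterns have 4 components instead of 315, and all numbers are made up.

```python
from collections import Counter

def pattern_distance(p, q, weights):
    """Weighted L1 distance D = sum_l W_l * |p_l - q_l| between flattened patterns."""
    return sum(w * abs(a - b) for w, a, b in zip(weights, p, q))

def nearest_labels(query, database, weights, n):
    """Labels (H, E or C) of the n database patterns closest to the query."""
    ranked = sorted(database, key=lambda entry: pattern_distance(query, entry[0], weights))
    return [label for _, label in ranked[:n]]

def majority_vote(labels):
    """Most common label among the neighbors (ties are broken arbitrarily)."""
    return Counter(labels).most_common(1)[0][0]

# Toy data: (flattened pattern, DSSP state) pairs with made-up values.
weights = [1.0, 2.0, 2.0, 1.0]
database = [
    ([0.9, 0.1, 0.0, 0.0], "H"),
    ([0.8, 0.2, 0.0, 0.0], "H"),
    ([0.0, 0.1, 0.9, 0.0], "E"),
    ([0.1, 0.0, 0.1, 0.8], "C"),
]
query = [0.85, 0.15, 0.0, 0.0]
print(majority_vote(nearest_labels(query, database, weights, n=3)))
```

In the second layer, the same distance-and-vote machinery would be applied to the matrices of neighbor counts rather than to the profile patterns themselves.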
Since we expect the pattern elements near the center residue to be more important
in defining the distance, we use an initial guess for the weights,
$W_{\ell} = |8 - |8 - j||^2$, where j (j = 1, …, 15) is the index labeling the residue
corresponding to element ℓ. We call this parameter set W0. We then optimize
the parameters. We use the first-layer prediction for this purpose, and the
database used consists of the patterns from CB513 set only. We also use a
different method of prediction in this case. We select three sets of 100 nearest
patterns whose secondary structures are H, E, and C, respectively. We calculate
the average distance of the patterns in each set from the query pattern. We use
the secondary structure of the nearest group to the query residue as the
prediction. The Q3 value for W0 is 71.0%. For the residues whose predicted
secondary structures are different from the correct one, the parameters are
modified by a small amount in such a way that the set with correct secondary
structure becomes closer to the query pattern relative to the other groups. This
procedure is iterated, with the final percentage of correct prediction being
73.1%. We call this optimized parameter set W300. W300 is used only in the step
of the first-layer prediction.
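A minimal sketch of the prediction rule used during weight optimization is given below. It assumes the initial weights are read as |8 − |8 − j||² (largest at the window center) and that the 15 x 21 pattern is flattened row by row; both are assumptions of this sketch. The iterative weight update itself is not shown because its exact form is not given in the text.

```python
def initial_weights(window=15, n_types=21):
    """Initial guess W0: one weight per pattern element, largest at the center."""
    center = (window + 1) // 2  # 8 for a window of 15
    weights = []
    for j in range(1, window + 1):
        w = abs(center - abs(center - j)) ** 2
        weights.extend([w] * n_types)  # same weight for all 21 columns of row j
    return weights

def class_mean_prediction(distances_by_class, k=100):
    """Predict the class whose k nearest patterns are closest on average.

    distances_by_class maps a label ("H", "E", "C") to the distances between
    the query pattern and every database pattern carrying that label.
    """
    means = {}
    for label, distances in distances_by_class.items():
        nearest = sorted(distances)[:k]
        if nearest:  # skip classes with no patterns at all
            means[label] = sum(nearest) / len(nearest)
    return min(means, key=means.get)

w0 = initial_weights()
print(len(w0), w0[0], w0[7 * 21])

# Made-up distances; k is reduced from 100 to 2 for the toy example.
toy = {"H": [1.0, 2.0, 3.0], "E": [0.5, 0.6, 5.0], "C": [4.0, 4.0, 4.0]}
print(class_mean_prediction(toy, k=2))
```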
The predictions with parameters W300 and N=200, W0 and N=200, W300 and
N=100, W0 and N=200,300 were chosen as models 1, 2, 3, 4 and 5, respectively.
A preliminary result for the Q3 value of the secondary structure prediction of
the CB513 set, using the 7777-protein set as the database, is about 80%.
1. Yi T. et al. (1993) Protein Secondary Structure Prediction using
Nearest-Neighbor Methods. J. Mol. Biol. 232, 1117-1129.
2. Altschul S.F. et al. (1997) Gapped BLAST and PSI-BLAST: a new
generation of protein database search programs. Nucleic Acids Res. 25
(17), 3389-3402.
3. Kabsch W. et al. (1983) Dictionary of Protein Secondary Structure: Pattern
Recognition of Hydrogen-Bonded and Geometrical Features. Biopolymers
22, 2577-2637.
KIAS (P0531) - 479 predictions: 176 3D, 303 SS
Prediction of Protein Tertiary Structure using PROFESY,
a Novel Method Based on Pattern Matching
and Fragment Assembly
Julian Lee, Seung-Yeon Kim, Keehyung Joo,
Ilsoo Kim, Saejoon Kim, and Jooyoung Lee
School of Computational Sciences, Korea Institute for Advanced Study
jlee@kias.re.kr
We introduce a novel method for the tertiary structure prediction, PROFESY
(PROFile Enumerating SYstem). This method utilizes secondary structure
prediction information and fragment assembly. The secondary structure
prediction is performed using the method PREDICT (PRofile Enumeration
DICTionary) recently developed by our group, which uses a concept of
distance between patterns. For a given protein sequence, this method uses
PSI-BLAST to generate profiles, which define patterns for amino acid residues.
Each pattern is compared with those in the pattern database generated from
PDB, and the patterns close to the query pattern are selected to determine the
secondary structure of the query residue. In order to construct the tertiary
structure, we also collect the backbone dihedral angles along with these
patterns. These constitute a library of fragments for a given protein sequence.
By construction, the secondary structure of the tertiary structure obtained from
PROFESY agrees with the one predicted by PREDICT. In order to obtain
the optimal tertiary packing of these secondary structure elements, we define a
score function based on the number of long-range hydrogen bonds, the radius
of gyration, and the inter-residue Lennard-Jones interactions to avoid steric
clashes. Replacement of fragments by the ones in the library is carried out, so
that the score function is minimized. The score function minimization is
performed by a powerful global optimization method, conformational space
annealing (CSA) [1], which enables one to sample diverse low-lying
minima of the score function. A square-well type function is used for the
radius of gyration: whenever the radius of gyration is greater than an
upper bound $R_{max}$, only the radius of gyration is minimized; otherwise,
the other terms in the score function are used.
$R_{max} = (3 N_{res}/0.026/3.14)^{1/3}$ was
used for targets T147-T163, where $N_{res}$ is the number of residues, and a
smaller value of $R_{max} = (3 N_{res}/0.026/3.14)^{1/3}/1.2$ was used for targets
T167-T194. The hydrogen bonding term was introduced into the score function for
targets T167-T194. Since the hydrogen bonding term favors alpha helices, we
included the hydrogen bonding energy terms only between residues separated
by more than 5 positions in sequence. This restriction was implemented for
targets T178-T194. SASA solvation terms in CHARMM were used for T129-T134, and
ASAP solvation terms were used for T147-T163, but the resulting conformations
were not satisfactory. We realized that there are no side-chains in our models
and that it is consequently unphysical to use the solvation terms. Therefore, the
solvation terms were discarded for targets T167-T194.
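The radius-of-gyration bound and the square-well switch can be sketched as below. The formula for the bound follows the text; returning a (regime, value) pair so that conformations inside the well always rank ahead of those outside it is an assumption of this sketch, since the abstract does not say how the two regimes are compared.

```python
def r_max(n_res, shrink=1.0):
    """Upper bound on the radius of gyration, as given in the text.

    shrink=1.0 corresponds to targets T147-T163 and shrink=1.2 to T167-T194.
    """
    return (3 * n_res / 0.026 / 3.14) ** (1.0 / 3.0) / shrink

def square_well_score(radius_of_gyration, other_terms, bound):
    """Return a sortable key for one conformation (smaller is better).

    Outside the well only the radius of gyration counts; inside the well only
    the remaining terms (e.g. hydrogen bonds, Lennard-Jones) count.
    """
    if radius_of_gyration > bound:
        return (1, radius_of_gyration)
    return (0, other_terms)

bound = r_max(100)
print(round(bound, 2), round(r_max(100, shrink=1.2), 2))
print(square_well_score(20.0, -5.0, bound))
print(square_well_score(10.0, -5.0, bound))
```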
To select the five best conformations, we first clustered the
conformations, selected the five largest clusters, and then chose the
conformation at the center of each cluster. For targets T167-T194, a score
function was used as the selection criterion. For T167-T175, the structure
with the largest number of hydrogen bonds was chosen from each cluster. For
T176 – T194 we introduced a new criterion based on burial of hydrophobic
residues and exposure of hydrophilic residues. This term is based on the
exposed volume with Reduced Radius Independent Gaussian Sphere (RRIGS)
approximation [2]. This score function was not implemented directly in the
conformation search algorithm because of time constraints, but was used as the
selection criterion for top structures.
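The selection step can be sketched as follows, assuming that cluster labels and a per-conformation selection score (for example, the hydrogen bond count) are already available. The clustering itself is not shown because the abstract does not specify the algorithm.

```python
from collections import defaultdict

def select_models(cluster_ids, scores, n_models=5):
    """Pick one conformation from each of the n_models largest clusters.

    cluster_ids[i] is the cluster label of conformation i and scores[i] is its
    selection score; higher is better. Returns the indices of the chosen
    conformations, largest cluster first.
    """
    members = defaultdict(list)
    for index, cluster in enumerate(cluster_ids):
        members[cluster].append(index)
    largest = sorted(members.values(), key=len, reverse=True)[:n_models]
    return [max(group, key=lambda i: scores[i]) for group in largest]

# Toy example: six conformations in three clusters, keeping two models.
cluster_ids = [0, 0, 0, 1, 1, 2]
hbond_counts = [3, 7, 5, 2, 9, 4]
print(select_models(cluster_ids, hbond_counts, n_models=2))
```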
1. Lee J. et al. (1997) New optimization method for conformational energy
calculations on polypeptides: Conformational Space Annealing. J. Comp.
Chem. 18 (9), 1222-1232;
2. Lee J. et al. (1998) Conformational analysis of the 20-residue
membrane-bound portion of Melittin by Conformational Space Annealing.
Biopolymers 46, 103-115;
3. Lee J. et al. (1999) Conformational Space Annealing by parallel
computations: extensive conformational search of Met-enkephalin and the
20-residue membrane-bound portion of Melittin. Int. J. Quant. Chem. 75,
255-265;
4. Lee J. et al. (1999) Energy-based de novo protein folding by
conformational space annealing and an off-lattice united-residue force
field: Application to the 10-55 fragment of staphylococcal protein A and to
apo calbindin D9K. Proc. Natl. Acad. Sci. USA 96, 2025-2030;
5. Liwo A. et al. (1999) Protein structure prediction by global optimization of
a potential energy function. Proc. Natl. Acad. Sci. USA 96, 5482-5485;
6. Lee J. et al. (1999) Calculation of protein conformation by global
optimization of a potential energy function. PROTEINS: Structure,
Function, and Genetics 3:204-208;
7. Lee J. et al. (2000) Hierarchical energy-based approach to protein-structure
prediction: Blind-test evaluation with CASP3 targets. Int. J. Quant.
Chem. 77, 90-117
8. Augspurger J.D. et al. (1996) An efficient, differentiable hydration potential
for peptides and proteins. J. Comp. Chem. 17 (13), 1549-1558.
Kim-Park (P0442) - 65 predictions: 65 SS
Protein Secondary Structure Prediction by Support
Vector Machines and Position-specific Scoring Matrices
H. Kim and H. Park
University of Minnesota, Twin Cities, MN 55455, U.S.A.
hskim@cs.umn.edu, hpark@cs.umn.edu
The prediction of protein secondary structure is an important problem for the
prediction of the tertiary structure of a protein. The SVMpsi method, which uses
support vector machines (SVMs) and the position specific scoring matrix (PSSM)
generated from PSI-BLAST, is introduced; it achieves better prediction
accuracy [1].
The final position-specific scoring matrix from PSI-BLAST [2] against the
SWALL non-redundant protein sequence database is used. We applied PFILT
[3] to mask out regions of low complexity sequences, the coiled coil region,
and transmembrane spans. For PSI-BLAST, the E-value threshold for inclusion
of 0.001 and three iterations were applied to search the non-redundant sequence
database.
We designed two additional tertiary classifiers based on the one-versus-one scheme
and the directed acyclic graph scheme [4]. The one-versus-one classifier for
secondary structure prediction takes the majority result of three binary
classifiers, H/E, E/C, and C/H. Many test results show that one-versus-one
classifiers are more accurate than one-versus-rest classifiers because the
one-versus-rest scheme often needs to deal with two data sets of very different
sizes, i.e., unbalanced training data [5]. However, a potential problem of the
one-versus-one scheme is that the voting might suffer from
incompetent classifiers. For example, when the test point is helix, the result
from the one-versus-one classifier E/C, which is not related to helix,
inappropriately contributes to the decision. We can reduce this problem by
using the directed acyclic graph (DAG) scheme, which can classify a new data
point after 2 binary classifications for a 3-class problem without influence from
incompetent classifiers. For example, if the test point is predicted to be E
(not C) by the E/C classifier, then the H/E classifier is applied, while if the
point is predicted to be not sheet (~E) by the E/C classifier, the C/H classifier
is applied to tell whether it is coil or helix.
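The two decision schemes can be compared in a short sketch. The three binary classifiers are passed in as plain callables, standing in for trained SVMs; the coil fallback on a three-way tie is an assumption of this sketch, not something stated in the abstract.

```python
from collections import Counter

def one_versus_one(x, he, ec, ch):
    """Majority vote over the three pairwise classifiers.

    Each classifier is a callable returning one of its two labels,
    e.g. he(x) returns "H" or "E".
    """
    votes = Counter([he(x), ec(x), ch(x)])
    label, count = votes.most_common(1)[0]
    # A 1-1-1 tie is possible with three classes; fall back to coil (assumption).
    return label if count >= 2 else "C"

def dag(x, he, ec, ch):
    """Directed acyclic graph scheme: two binary decisions per point."""
    if ec(x) == "E":   # coil has been ruled out
        return he(x)   # decide between helix and sheet
    return ch(x)       # sheet has been ruled out: coil or helix

# Stub classifiers for a helix residue; E/C has no competent answer for it.
he = lambda x: "H"
ec = lambda x: "E"
ch = lambda x: "H"
print(one_versus_one(None, he, ec, ch), dag(None, he, ec, ch))
```

In the DAG scheme the E/C result only selects which classifier runs next, so it never casts a vote of its own for the final label.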
The new protein secondary structure prediction method SVMpsi achieves
Q_3 = 76.1% and SOV94 = 79.6% on the RS126 set
and Q_3 = 76.6% and SOV94 = 80.1% on the CB513 set in seven-fold
cross validation, which outperforms the other existing methods that we are
aware of. We prepared the KP480 set from the CB513 set; on it, the prediction
accuracy of SVMpsi was Q_3 = 78.5% and SOV94 = 82.8%. Moreover, we assembled 136
protein sequences for a blind test. The blind test results were Q_3 = 77.2% and
SOV94 = 81.8%. This shows that the support vector machine approach is
another good method for predicting protein secondary structure. The major
improvement of the new SVMpsi method is obtained from the PSI-BLAST
PSSM profile and the new optimization strategy in SVM for maximizing the
Q_3 measure.
1. Kim H. and Park H. (2002) Protein Secondary Structure Prediction by
Support Vector Machines and Position-specific Scoring Matrices.
Submitted to Bioinformatics.
2. Altschul S.F. et al. (1997) Gapped BLAST and PSI-BLAST: a new
generation of protein database search programs. Nucleic Acids Res. 25
(17), 3389-3402.
3. Jones D.T. and Swindells M.B. (2002) Getting the most from
PSI-BLAST. Trends in Biochemical Sciences 27, 161-164.
4. Heiler M. (2002) Optimization Criteria and Learning Algorithms for Large
Margin Classifiers. Diploma Thesis, University of Mannheim.
5. Hsu C.W. and Lin C.J. (2002) A comparison of methods for multi-class
support vector machines. IEEE Transactions on Neural Networks 13, 415-425.
LAMBERT-Christophe (P0035) - 131 predictions: 131 3D
Evaluation of Different Methods for Comparative Modeling
C. Lambert, N. Léonard, B. Damien and E. Depiereux
Unité de Recherche en Biologie Moléculaire, Facultés Universitaires
Notre-Dame de la Paix, rue de Bruxelles 61, 5000 Namur, Belgium
christophe.lambert@fundp.ac.be
The aim of our work is to compare different approaches for comparative
modeling.
Model 1 was built by running ESyPred3D [1] on the best template chosen by
the CAFASP3 jury.
Model 2 was built by homology modeling using the MOE [4] program and the
best template selected by the PDB-BLAST feature of MOE. No minimization
or dynamics were performed. This method was used to provide a comparison with
the more complex modeling techniques used in our group.
Model 3 was built by homology modeling using the MOE [4] program and the
best templates selected by the PDB-BLAST feature of MOE. The best template was
chosen as the primary template and 1-3 others were used to model parts not
present in the primary template. The model obtained was compared with models
obtained from Swiss-Model and ESyPred3D. The model was then thoroughly
analyzed and corrected accordingly using restraints on angles, distances and
dihedrals. Finally, the model was subjected to energy minimization and molecular
dynamics until only a minimal number of dihedral, angle and distance errors
remained.
Models 4 and 5 were built by using ESyPred3D [1] to find possible
templates and pairwise alignments between the query sequence and each
chosen template. A multiple structure alignment between all templates was
built using the STAMP [3] program. The pairwise alignments between the
query and each template, and the multiple structure alignment, were combined
to obtain a multiple alignment between the query and all templates. This
multiple sequence alignment was used by MODELLER [2] to build the final
model.
ESyPred3D web site: http://www.fundp.ac.be/urbm/bioinfo/esypred
1. Lambert C. et al. (2002) ESyPred3D: Prediction of proteins 3D structures.
Bioinformatics 18 (9), 1250-1256.
2. Sali A. et al. (1993) Comparative protein modelling by satisfaction of
spatial restraints. J. Mol. Biol. 234 (3), 779-815.
3. Russell R.B. et al. (1992) Multiple protein sequence alignment from
tertiary structure comparison: assignment of global and residue confidence
levels. Proteins 14 (2), 309-323.
4. Chemical Computing Group Inc., Montreal, Quebec, Canada.
LIBELLULA (P0230) - 216 predictions: 216 3D
A New Web Server For Fold Recognition Evaluation
O. Graña1, D. Juan1, F. Pazos2, P. Fariselli3, R. Casadio3 and A. Valencia1
1 Protein Design Group, National Center for Biotechnology, CNB-CSIC, Spain
2 ALMA Bioinformatica, Tres Cantos, Madrid, Spain
3 CIRB Biocomputing Unit and Lab. of Biophysics, Dept. of Biology, University
of Bologna, Italy
This approach improves the selection of correct folds from among the results of
two methods implemented as web servers (SAM-T99 [2] and 3D-PSSM [3]).
LIBELLULA is based on the training of a system of neural networks with
models generated by the servers and a set of associated characteristics such as
the quality of the sequence-structure alignment, distribution of sequence
features (sequence conserved positions and apolar residues) and compactness of
the resulting models.
It has been implemented as an automatic system available as a public web
server at http://www.pdg.cnb.uam.es/servers/libellula.html.
1. Juan D., Graña O., Pazos F., Fariselli P., Casadio R. and Valencia A. A
neural network approach to evaluate fold recognition results. Accepted in
Proteins: Structure, Function, and Genetics.
2. Karplus K., Barrett C., Hughey R. Hidden Markov models for detecting
remote protein homologies. Bioinformatics 14, 846-856.
3. Kelley L.A., MacCallum R.M., Sternberg M.J. Enhanced genome
annotation using structural profiles in the program 3D-PSSM. J. Mol. Biol.
299, 499-520.
and -sheets, and transfer energies of side-chains from water to the protein
interior [2]. Next, a set of 3D models, which correspond to the selected lowenergy sequence-structure alignments, must be evaluated using more rigorous
all-atom free energy functions that have been derived recently from
mutagenesis data [3]. In the beginning of the CASP5 experiment, our program
MIMIC was only at the initial development stage, with many important options
(including all-atom threading) not presently implemented. Therefore, its actual
performance still remains to be tested.
Lomize-Andrei (P0288) - 76 predictions: 76 3D
Fold Recognition and Homology Modeling of Protein Cores
A.L Lomize, I.D Pogozheva, and H.I. Mosberg
College of Pharmacy, University of Michigan, Ann Arbor, MI
almz@umich.edu
3D models of 66 CASP5 target proteins were generated using several different
techniques. 30 targets were modeled simply by homology using PSI-BLAST
searches, our in-house software, and QUANTA. The initial PSI-BLAST
alignments were often corrected to superimpose hydrophobic residues of a
target and the water-inaccessible positions of the experimental template, and to
remove any gaps within regular secondary structures, which is a part of our
threading approach. The applied side-chain packing procedure provided
geometrical “tracing” of template side-chains, removal of significant
hindrances, and optimization of hydrogen bonding.
The remaining 36 targets had no or very low sequence homology to
experimental structures. They were modeled using the following steps: (1)
prediction of secondary structure, -sheet topology, and general protein
“architecture” based on the hydrophobicity patterns observed in multiple
sequence alignments [1]; (2) fold recognition using the program MIMIC that is
under development in our group; (3) selection of the most probable template
also taking into account predictions of 3D-PSSM server and the biological
function of the target; (4) refinement of target-template alignment and
generation of the corresponding full atomic model. The modeling included
human intervention. Only T0131 was modeled ab initio by a hierarchic
assembly of -helices and -sheets [2].
Our fold recognition program, MIMIC, was designed to identify the lowest
energy sequence-structure alignment, and the lowest energy experimental
template. The optimal and suboptimal sequence-structure alignments are
generated first using the dynamic programming algorithm and approximate
energy functions. At this step, thermodynamic stability of the protein core
(excluding nonregular loops) is estimated as the sum of backbone energy,
secondary structure propensities, interactions between side-chains in -helices
A-101
1. Lomize A.L. and Mosberg H.I. (1997) Thermodynamic model of secondary
structure for alpha-helical peptides and proteins. Biopolymers 42 (2),
239-269.
2. Lomize A.L., Pogozheva I.D., and Mosberg H.I. (1999) Prediction of
protein structure: The problem of fold multiplicity. Proteins 37 (Suppl. 3),
199-203.
3. Lomize A.L., Reibarkh M.Y., and Pogozheva I.D. (2002) Interatomic
potentials and solvation parameters from protein engineering data for
buried residues. Protein Sci. 11 (8), 1984-2000.
luethy (P0419) - 240 predictions: 240 3D
Unified Prediction Approach for Comparative Modelling
and Ab-initio Predictions
R. Luethy
Amgen Inc.
roland@luethy.net
Sequence and structural similarities vary gradually between proteins in the
same family. Here, the same overall approach to structure prediction was
attempted for all target classes. The first step was to generate a
multiple sequence alignment for the target, then the sequence profile method
[2] was used to find similar sequences in PDB [1]. The multiple sequence
alignment was checked and adjusted manually if needed. It was also used to
guess domain boundaries in potential multidomain proteins. For potential
multidomain proteins a separate multiple sequence alignment was made for
each predicted domain. The highest scoring PDB sequences were then aligned
with the target sequence profile and a database of structural fragments was
generated from these alignments. A fragment was defined by an ungapped
region in the alignment. These fragments were then used in a folding procedure
which has the following components [3]:
1) A simplified representation of protein structures that can be locally
modified.
2) A structure modification method based on selecting blocks from known 3D
structures.
3) Evaluation of structures and optimization.
The simplified models are based on a sequence of internal coordinates: the
torsion angles between four consecutive Cα atoms and the angles between three
Cα atoms. Different structures were generated by randomly selecting
blocks from the database of structural blocks, which was derived from the
profile alignments, and substituting them into the model. To evaluate
structures, Cartesian coordinates for the Cα and Cβ atoms were reconstructed
using constants for all distances and for the angles needed to reconstruct the
Cβ positions. These structures were then evaluated using knowledge-based
potentials derived from known 3D structures. The potentials used were a residue specific
pair-wise distance potential, a residue specific number of contacts potential, a
compactness function and a penalty for too close contacts.
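The block substitution move can be sketched as follows. The representation of the model as a list of (angle, torsion) pairs and the (start, block) layout of the block library are assumptions of this sketch; the evaluation potentials are not shown because their functional forms are not given in the abstract.

```python
import random

def substitute_block(model, block, start):
    """Return a copy of the model with a block of internal coordinates replaced.

    model and block are lists of (bond_angle, torsion_angle) pairs, one per
    position; start is the 0-based position where the block is placed.
    """
    if start < 0 or start + len(block) > len(model):
        raise ValueError("block does not fit inside the model")
    return model[:start] + list(block) + model[start + len(block):]

def random_move(model, block_library, rng=random):
    """Pick a random block from the library and substitute it into the model.

    block_library is a list of (start, block) pairs, as might be derived from
    ungapped regions of the profile alignments.
    """
    start, block = rng.choice(block_library)
    return substitute_block(model, block, start)

# Toy model of six positions in an extended conformation (made-up angles).
model = [(90.0, 180.0)] * 6
library = [(2, [(92.0, 50.0), (91.0, 55.0)])]
print(random_move(model, library))
```

Returning a new list rather than modifying the model in place makes it easy to keep the previous structure when a move is rejected by the optimizer.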
For targets suitable for comparative modeling the following additions were
made: if the sequence similarity between the target and its best match in PDB
was significant, the fragments from the corresponding structure were inserted
into the starting model and were not allowed to change during the optimization.
Their relative positions were constrained by a distance matrix derived from the
known structure.
After the structure optimizations, all atom coordinates were reconstructed in the
following way: first all coordinates from the PDB fragments were copied, then
missing backbone atoms were inserted by looking up the closest 5-residue
backbone fragment in PDB, and finally missing side-chain atoms were copied from
the closest 5-residue fragment from PDB with the same residue in the middle.
The structure was then minimized with TINKER [4] using a steepest descent
method with fixed Cα atoms.
1. Berman H.M., Westbrook J., Feng Z., Gilliland G., Bhat T.N., Weissig H.,
Shindyalov I.N., Bourne P.E. (2000) The Protein Data Bank. Nucleic Acids
Research 28, 235-242.
2. Gribskov M., McLachlan A.D. and Eisenberg D. (1987) Profile analysis:
Detection of distantly related proteins. PNAS 84, 4355-4358.
3. Zhu J. and Luethy R. (2002) Three-dimensional structure prediction using
simplified structure models and Bayesian block fragments. In: Protein
structure prediction. Bioinformatics approach. Ed. Igor F. Tsigelny,
International University Line, pp. 85-107.
4. Ponder J.W. and Richards F.M. (1987) An Efficient Newton-like Method
for Molecular Mechanics Energy Minimization of Large Molecules. J.
Comput. Chem. 8, 1016-1024 (http://dasher.wustl.edu/tinker/).
Lund-Ole (P0391) - 39 predictions: 39 3D
X3M – a Computer Program to Extract 3D Models
O. Lund, M. Nielsen, C. Lundegaard and P. Worning.
Center for Biological Sequence Analysis, Biocentrum-DTU, Building 208,
Technical University of Denmark, DK-2800 Lyngby, Denmark
lund@cbs.dtu.dk
Summary
A novel method was developed for fold recognition/homology modeling, in
which a large sequence database is iteratively searched to construct a sequence
profile until a template can be found in a database of proteins with known
structure. The method differs from the PDB-BLAST method in that a sequence
profile is only made if a template is not readily found in the database of known
structures. A sequence profile is subsequently made for the template, using the
same number of PSI-BLAST iterations that were used to identify it. Query and
template sequences are subsequently aligned using a score based on
profile-profile comparisons. The alignment score is modified so as to ensure
that unreliable parts of the alignment are discarded.
Background
A problem often encountered when doing iterative sequence searches in a
database is that the search may go astray and start picking up unrelated
sequences, often with hydrophobic or low complexity regions. It has been found
that using PSI-BLAST [1] to build a profile from a sequence database and
subsequently using this profile to search a database of proteins with known
structures (PDB-BLAST) works better than searching one merged database [2].
We have developed a method related to PDB-BLAST where we only perform
iterative searches against the sequence database if no match can be found in the
database of proteins with known structure.
It has been shown that methods based on profile-profile alignment can produce
more accurate alignments than methods based on sequence-profile or
sequence-sequence alignment [3]. A number of different methods for scoring two
profiles against each other have been suggested in recent years: the average score
between all amino acid pairs according to the probability distribution in each
profile [2], the probability that the same amino acid is found in given positions
in the two profiles (the dot product of the amino acid probability vectors) [4],
the probability that two amino acid distributions are the same [5], or
combinations of different profile-profile scores with other scoring terms [6].
Kelley et al. [7] use the average alignment score of the query profile with the
template sequence and the query sequence with the template profile for fold
recognition. Here we take that average for each residue pair and use it as a
scoring matrix for the alignment algorithm. This approach has the advantage
that it reduces to the classical sequence-sequence alignment in the case that no
homologous proteins can be found.
In CASP4 Venclovas [8] successfully selected correctly aligned regions by
discarding regions which aligned differently in different blast searches. Another
way to select for reliable parts of the alignment is to change the scoring matrix
that is used to align the two proteins. It has been found that scoring matrices
with low PAM values (corresponding to high BLOSUM values) are appropriate
for making shorter alignments [9]. Subtracting a number from the scoring
matrix also leads to shorter but more accurate alignments [10,3]. BLOSUM
alignment scores S are often measured in half bits and derived from log odds
scores, $S_{ij} = 2\log_2(Q_{ij}/(P_i P_j))$ [11]. In this case subtracting two from
the alignment score corresponds to demanding that the probability $Q_{ij}$ of
finding amino acids i and j aligned must be more than twice as big as the
background probability $P_i P_j$ in order
for S to be positive. We have used this method in an attempt to make a reliable
profile-profile alignment.
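The effect of subtracting a constant from a half-bit log-odds score can be checked numerically. The probabilities below are made up for illustration; only the relation between the odds ratio and the sign of the shifted score matters.

```python
import math

def half_bit_score(q_ij, p_i, p_j):
    """Log-odds score in half bits: S = 2 * log2(Q_ij / (P_i * P_j))."""
    return 2.0 * math.log2(q_ij / (p_i * p_j))

def shifted_score(q_ij, p_i, p_j, shift=2.0):
    """Score after subtracting a constant from the half-bit score."""
    return half_bit_score(q_ij, p_i, p_j) - shift

# Background P_i * P_j = 0.01; vary the odds ratio Q_ij / (P_i * P_j).
p_i = p_j = 0.1
for ratio in (1.5, 2.0, 4.0):
    q = ratio * p_i * p_j
    print(ratio, round(half_bit_score(q, p_i, p_j), 3), round(shifted_score(q, p_i, p_j), 3))
```

With a shift of two half bits, the shifted score is positive only when the odds ratio exceeds two, which is the condition stated in the text.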
Databases
A fasta file containing all pdb entries (pdb) was downloaded from NCBI
(ftp://ftp.ncbi.nih.gov/blast/db/pdbaa.Z). A non redundant database of known
protein sequences (sp) was compiled from files downloaded from Swiss-prot
(ftp://ftp.ebi.ac.uk/pub/databases/sp_tr_nrdb/fasta/*.fas.gz). PDB entries were
downloaded from RCSB (ftp://ftp.rcsb.org/pub/pdb/data/structures/all/pdb/).
Template identification
The program blastpgp [1] was used to search the databases. In order to find a
template, the query sequence was run against the pdb database. If a template
could not be found with an E value of less than 0.05 the sequence was run two
iterations against sp, and a binary checkpoint file was saved as well as the
position specific scoring matrix in ASCII format (blastpgp does not update
these files after the last iterations, so the saved files correspond to the profile
obtained after the first iteration). The checkpoint file was used to restart a
blastpgp search of the query sequence against the pdb database. The procedure
of iteratively using the sp database to generate a profile that in turn is used to
search the pdb database was continued until a template was found with an E
value of less than 0.05 or a total number of five iterations against the pdb
database had been performed.
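The control flow of the template search can be sketched as below. The two callables `search_pdb` and `iterate_sp` are hypothetical stand-ins for blastpgp runs against the pdb and sp databases, and counting the initial sequence-only search as one of the five pdb searches is an assumption of this sketch.

```python
def find_template(query, search_pdb, iterate_sp, e_cutoff=0.05, max_pdb_searches=5):
    """Alternate between the structure and sequence databases.

    search_pdb(query, checkpoint) -> (template_id, e_value) or None
    iterate_sp(query, checkpoint) -> new checkpoint (profile) after further
        iterations against the sequence database
    Returns (template_id or None, number of rounds run against sp).
    """
    checkpoint = None  # no profile yet: plain sequence search
    sp_rounds = 0
    for attempt in range(max_pdb_searches):
        hit = search_pdb(query, checkpoint)
        if hit is not None and hit[1] < e_cutoff:
            return hit[0], sp_rounds
        if attempt < max_pdb_searches - 1:
            checkpoint = iterate_sp(query, checkpoint)
            sp_rounds += 1
    return None, sp_rounds

# Fake searches: the template only becomes detectable after two profile rounds.
def fake_search_pdb(query, checkpoint):
    e_value = 0.001 if (checkpoint or 0) >= 2 else 1.0
    return ("1abc", e_value)

def fake_iterate_sp(query, checkpoint):
    return (checkpoint or 0) + 1

print(find_template("MKV", fake_search_pdb, fake_iterate_sp))
```

The number of sp rounds is returned because the template profile is later built with the same number of iterations that were needed to find the template.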
Alignment
If a template was identified, we attempted to improve the alignment by
performing a profile-profile alignment. In order to make a sequence profile for
the template sequence we ran the template sequence the same number of
iterations as the query sequence against the sp database and saved the scoring
matrix in ASCII format. If no sequence profile was generated for either the
query or the template sequence, it was constructed from a blosum62 matrix
[11]. A scoring matrix $S_{ij}$ was constructed based on the two profiles:
$S_{ij} = (QP_i(TA_j) + TP_j(QA_i))/2 - 1$
where $QP_i(TA_j)$ is the score of residue j in the template sequence with the
profile at position i in the query sequence, and $TP_j(QA_i)$ is the score of
residue i in the query sequence with the profile at position j in the template
sequence.
These two scores were averaged and 1 was subtracted to reduce the lengths of
the alignments and make them more accurate. The query was then aligned to
the template using a local alignment algorithm [12], with a maximum number
of gaps set to 20, a first gap penalty of 11, and a gap elongation penalty of 1.
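As an illustration of the alignment step, the sketch below builds the averaged profile-profile score matrix and runs a local alignment with affine gaps on it. Several things here are assumptions: profiles are represented as one dictionary per position mapping residue letters to scores, a gap of length k is taken to cost 11 + (k - 1), and the limit of 20 gaps is not enforced. Only the optimal score is returned, without traceback.

```python
from typing import Dict, List

Profile = List[Dict[str, float]]  # one {residue letter: score} dict per position


def score_matrix(query: str, template: str,
                 q_prof: Profile, t_prof: Profile) -> List[List[float]]:
    """S[i][j] = (QP_i(TA_j) + TP_j(QA_i)) / 2 - 1, as defined in the text."""
    return [[(q_prof[i][template[j]] + t_prof[j][query[i]]) / 2.0 - 1.0
             for j in range(len(template))]
            for i in range(len(query))]


def local_alignment_score(S: List[List[float]],
                          gap_first: float = 11.0,
                          gap_extend: float = 1.0) -> float:
    """Best local alignment score with affine gaps (Smith-Waterman/Gotoh)."""
    n = len(S)
    m = len(S[0]) if n else 0
    neg = float("-inf")
    H = [[0.0] * (m + 1) for _ in range(n + 1)]  # best score ending at (i, j)
    E = [[neg] * (m + 1) for _ in range(n + 1)]  # ending with a gap along the row
    F = [[neg] * (m + 1) for _ in range(n + 1)]  # ending with a gap along the column
    best = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            E[i][j] = max(H[i][j - 1] - gap_first, E[i][j - 1] - gap_extend)
            F[i][j] = max(H[i - 1][j] - gap_first, F[i - 1][j] - gap_extend)
            H[i][j] = max(0.0, H[i - 1][j - 1] + S[i - 1][j - 1], E[i][j], F[i][j])
            best = max(best, H[i][j])
    return best
```

Subtracting 1 from every cell lowers the expected score of unrelated positions, which is what shortens the resulting local alignments.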
Modeling
The corresponding atoms derived from the alignment can be extracted from the
template file and used as a starting point for the homology modeling. Missing
atoms were added using the segmod program [13] from the GeneMine package
(www.bioinformatics.ucla.edu/genemine/). The structures can then be refined
using the encad program [14] also from the GeneMine package. The modeling
step was not in place for CASP5 so only alignments were submitted.
Submissions
Alignments were submitted for 41/67 (61%) of the targets (T0130, T0132,
T0133, T0137, T0140, T0141, T0142, T0143, T0144, T0149, T0150, T0151,
T0152, T0153, T0154, T0155, T0158, T0160, T0163, T0164, T0165, T0166,
T0167, T0169, T0171, T0172, T0175, T0178, T0179, T0182, T0183, T0184,
T0185, T0186, T0188, T0189, T0190, T0191, T0192, T0193, T0195). We only
submitted alignments for targets where we estimated that it was at least 95%
certain that we had identified the correct fold. We furthermore sought to
perform the alignment in such a way that regions where a reliable alignment
could not be made were excluded. We look forward to seeing whether this strategy
worked and to comparing our results with those submitted by other groups.
1. Altschul S.F. et al. (1997) Gapped BLAST and PSI-BLAST: a new
generation of protein database search programs. Nucleic Acids Res. 25
(17), 3389-3402.
2. Rychlewski L., Zhang B., Godzik A. (1998) Fold and function predictions
for Mycoplasma genitalium proteins. Fold Des. 3 (4), 229-38.
3. Jaroszewski L., Rychlewski L., Godzik A. (2000) Improving the quality of
twilight-zone alignments. Protein Sci. 9 (8), 1487-96.
4. Lyngsø R.B., Pedersen C.N., Nielsen H.R. (1999) Metrics and similarity
measures for hidden Markov models. Proc Int Conf Intell Syst Mol Biol.
178-86.
5. Yona G., Levitt M. (2000) Towards a complete map of the protein space
based on a unified sequence and structure analysis of all known proteins.
Proc Int Conf Intell Syst Mol Biol. 8, 395-406.
6. Fischer D. (2000) Hybrid fold recognition: combining sequence derived
properties with evolutionary information. Pac Symp Biocomput. 119-30.
7. Kelley L.A., MacCallum R.M., Sternberg M.J. (2000) Enhanced genome
annotation using structural profiles in the program 3D-PSSM. J Mol Biol.
299 (2), 499-520.
8. Venclovas C. (2001) Comparative modeling of CASP4 target proteins:
combining results of sequence search with three-dimensional structure
assessment. Proteins Suppl 5, 47-54.
9. Altschul S.F. (1991) Amino acid substitution matrices from an information
theoretic perspective. J Mol Biol. 219 (3), 555-65.
10. Vogt G., Etzold T., Argos P. (1995) An assessment of amino acid
exchange matrices in aligning protein sequences: the twilight zone
revisited. J Mol Biol. 249 (4), 816-31.
11. Henikoff S., Henikoff J.G. (1992) Amino acid substitution matrices from
protein blocks. Proc Natl Acad Sci U S A. 89 (22), 10915-9.
12. Smith T.F., Waterman M.S. (1981) Identification of common molecular
subsequences. J Mol Biol. 147 (1), 195-7.
13. Levitt M. (1992) Accurate modeling of protein conformation by automatic
segment matching. J. Mol. Biol. 226 (2), 507-533.
14. Levitt M., Hirshberg M., Sharon R. and Daggett V. (1995) Potential
energy function and parameters for simulations of the molecular dynamics
of proteins and nucleic acids in solution. Computer Physics Comm. 91,
215-231.
MacCallum (P0393) - 130 predictions: 130 SS
Evolved Post-processing of PSIPRED Predictions
R. M. MacCallum
Stockholm Bioinformatics Center, Stockholm University, Sweden
maccallr@sbc.su.se
I have submitted completely automated secondary structure (SS) predictions for
every target using a novel post-processing method for PSIPRED[2] predictions.
In outline, a set of conditional reassignments (rules) is applied to each
predicted secondary structure element, based on local and global information in
the raw PSIPRED output. The rules are evolved with genetic programming, an
evolutionary search technique which operates on trees.
PSIPRED predictions were run locally because the .ss2 output files are needed.
The database used during the PSI-BLAST[1] search phase of PSIPRED was
“nr” from the NCBI [ ftp://ftp.ncbi.nih.gov/blast/db/ ]. This was downloaded
once on 22 July 2002 and used for all targets. Complexity filtering of the
database was performed exactly as recommended in the PSIPRED
documentation. BLAST version 2.2.2 was used.
The .ss2 files are read into object oriented data structures which store for each
residue the helix, strand and coil probabilities (neural network outputs).
Consecutive residues of the same predicted secondary structure type are
grouped into objects called “elements”. All subsequent operations are done at
the element level, i.e. from the “element's eye view”.
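The original implementation is in Perl; the Python sketch below only illustrates the data structures described above. It assumes the usual PSIPRED .ss2 layout of six whitespace-separated columns (index, residue, predicted state, then coil, helix and strand outputs in that order), and the class and function names are invented for illustration.

```python
from dataclasses import dataclass
from itertools import groupby
from typing import List


@dataclass
class Residue:
    aa: str
    ss: str        # predicted state: 'H', 'E' or 'C'
    coil: float
    helix: float
    strand: float


@dataclass
class Element:
    ss: str
    residues: List[Residue]


def parse_ss2(text: str) -> List[Residue]:
    """Read per-residue predictions, skipping comments and blank lines."""
    residues = []
    for line in text.splitlines():
        fields = line.split()
        if line.startswith("#") or len(fields) != 6:
            continue
        _, aa, ss, coil, helix, strand = fields  # assumed column order
        residues.append(Residue(aa, ss, float(coil), float(helix), float(strand)))
    return residues


def group_elements(residues: List[Residue]) -> List[Element]:
    """Group consecutive residues with the same predicted state into elements."""
    return [Element(ss, list(run))
            for ss, run in groupby(residues, key=lambda r: r.ss)]
```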
The goal is to create an object method (a Perl subroutine) which will adjust the
secondary structure prediction of each predicted element towards the “correct”,
DSSP-based, assignment. This is done with a population based stochastic
search in program tree space with fitness selection, crossover and mutation; or
genetic programming as it is better known. Fitness is measured as Q3. The
adjustment of the prediction is based on decisions made from information about
neighbouring elements and the global prediction. The building blocks for the
genetic programming can be summarised as: IF THEN flow control statements,
numeric inequality conditions, functions which return numeric values, global
constants, arithmetic operations and finally “action” methods, where the
secondary structure prediction is altered in some way.
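For reference, Q3 is the fraction of residues whose three-state prediction (helix, strand, coil) matches the observed assignment. A minimal sketch, assuming both are given as equal-length strings over H, E and C:

```python
def q3(predicted: str, observed: str) -> float:
    """Fraction of positions where the three-state prediction is correct."""
    if len(predicted) != len(observed):
        raise ValueError("predicted and observed must have the same length")
    if not predicted:
        raise ValueError("empty sequences have no Q3")
    matches = sum(p == o for p, o in zip(predicted, observed))
    return matches / len(predicted)
```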
Many methods/functions and global constants are available to the genetic
programming but not all of them are used, so to save space I just describe one
of the “best of run” programs, in fact the one used for the CASP predictions,
shown here in pseudo-code:
IF ((fwd_helix(2)) AND (prev_strand(B) OR A<4)) THEN
    reassign_this_element_weighting(
        helices by 4,
        coils by 3,
        strands by 1.3*B/(this_element_length() - fwd_strand(A)))
IF (this_element_lowest_helix_probability() > 0.29) THEN
    reassign_this_element_weighting(
        helices by 67*this_element_mean_coil_probability(),
        coils by 4,
        strands by 48.45 + 9.9*C - C*D)
Where fwd_helix(2) returns 1 if one of the next 2 elements is a helix and 0
otherwise; prev_strand(B) returns 1 if there's a strand in the previous B
elements. Hopefully this_element_lowest_helix_probability() and
this_element_mean_coil_probability() are self-explanatory: they perform
calculations on the current element's PSIPRED per-residue network outputs.
Finally, reassign_this_element_weighting() recalculates the secondary structure
assignment on a winner-takes-all basis from the per-residue helix, strand and
coil PSIPRED network outputs applying to each a specified weighting. As a
result, element boundaries may change. Sometimes no changes are made to the
PSIPRED prediction.
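The winner-takes-all reassignment can be illustrated as below. This is a sketch of the idea only, not the evolved Perl method: the function name, the tuple layout (coil, helix, strand) and the tie-breaking order are assumptions.

```python
from typing import List, Tuple


def reassign_weighted(probs: List[Tuple[float, float, float]],
                      w_helix: float, w_coil: float, w_strand: float) -> str:
    """Re-call each residue's state after weighting the network outputs.

    probs holds one (coil, helix, strand) tuple per residue of the element.
    Ties are broken in the order helix, coil, strand (an arbitrary choice).
    """
    states = []
    for coil, helix, strand in probs:
        scores = {"H": helix * w_helix, "C": coil * w_coil, "E": strand * w_strand}
        states.append(max(scores, key=scores.get))
    return "".join(states)
```

Because each residue is re-called independently, one element can split into several after reweighting, which is why element boundaries may change.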
What does the evolved subroutine do? The general strategy agrees quite nicely
with common sense: it disregards strand predictions if there are globally very
few strands. In other words orphan strands are not tolerated, because only in
rare circumstances can they form sheets. In addition to the global number of
strands, it also seems to be using “longer-range” information,
i.e. the presence or absence of helices or strands a small number of elements
away, which will often be more than the +/-7 residues of the PSIPRED neural
network window. These extra details may or may not be significant, and in any
case the improvement in Q3 on the training and test sets (ASTRAL SCOP 10%
identity) was only around +0.4%. This is preliminary work and I need to
explore the fitness landscape and different ways to represent and manipulate the
input data, and hopefully get at least +1% improvement in time for CASP6.
Global quantities from original PSIPRED prediction:
A = mean length of strands
B = number of strands
C = mean length of helices
D = length of longest helix
1. Altschul S.F. et al. (1997) Gapped BLAST and PSI-BLAST: a new
generation of protein database search programs. Nucleic Acids Res. 25
(17), 3389-3402.
2. Jones D.T. (1999) Protein secondary structure prediction based on
position-specific scoring matrices. J. Mol. Biol. 292 (2), 195-202.
Martin-Andrew (P0471) - 55 predictions: 55 3D
CASP5 Comparative Modelling
D. Talbot, N.W. Boxall, A.L. Cuff, H. Fooks, R.C. Gibson, E.G.
Hutchinson, B.S. Lattimore, E.F. Murphy,
S.J. Wills, A.C.R. Martin
School of Animal & Microbial Sciences, The University of Reading,
Whiteknights, PO Box 228, Reading RG6 6AJ, UK.
a.c.r.martin@reading.ac.uk
We attempted only targets that could be addressed by comparative modelling
with sequence identities between target and principal parent from 13% to 71%.
As shown previously [1], sequence alignment is the primary factor in obtaining
good quality models and the focus of our effort has been in getting good
alignments. MODELLER-6 [2] was used to generate the actual models with no
additional refinement.
Our strategy proceeded as follows. (1) Determine whether a target was suitable
for comparative modelling; (2) reject targets judged to be ‘too hard’; (3)
determine the alignment using PSI-BLAST [3], structural alignment
information and hand-modifications; (4) build models using MODELLER-6;
(5) screen alternative models.
Target suitability and rejection. PSI-BLAST searches of the target sequence
were made against non-redundant Genpept plus non-redundant PDB sequences.
This database was regularly updated throughout the CASP prediction season. In
the case of distant homologues, predictions were also made with GenThreader
[4] and/or SAMT99 [5] to help confirm homology. Targets were judged as ‘too
hard’ if they contained a large number of indels or if individual indels were
longer than 5 residues.
Alignment. This was the most complex phase and the subject of most of our
efforts. In general, the alignments used were those generated by PSI-BLAST.
The initial maximal-coverage PSI-BLAST alignment and the final alignment
were both used and examined in the light of the structure. In most cases, it was
observed that the final alignment from PSI-BLAST placed indels in structurally
more acceptable positions. Where necessary, alignments were hand-modified in
light of the structure to minimize the structural impact of indels.
Where more than one PDB parent was available, these were first aligned
structurally using SSAP [6]. The PSI-BLAST alignments of the target against
these parents were then applied and hand-modified to resolve conflicts between
the alignments.
In the case of T0133 (13% sequence identity), a novel alignment procedure was
also used in which secondary structure elements from the two parents, aligned
using SSAP, were converted into pseudo-profiles which were aligned against
the target using global dynamic programming.
Model Generation. Models were generated using MODELLER-6 with the
DO_LOOPS option set. In some cases multiple models were generated, but in
the main, default options were chosen. No further energy refinement was
performed on the resulting models.
Model Screening. Pseudo-energies were calculated using the RAM potential
[7] (obtained from http://prostar.carb.nist.gov/) and the percentage of residues in
core Ramachandran areas was obtained from PROCHECK [8]. A combination
of these scores and visual inspection of the models was used to make a final
selection.
1. Martin A.C.R., et al. (1997) Assessment of comparative modelling in
CASP2. Proteins: Struct., Funct., Genet. Suppl 1, 14–28.
2. Marti-Renom M.A., et al. (2000) Comparative protein structure modelling
of genes and genomes. Ann. Rev. Biophys and Biomol. Struct., 29, 291–
325.
3. Altschul S.F. et al. (1997) Gapped BLAST and PSI-BLAST: a new
generation of protein database search programs. Nucleic Acids Res. 25,
3389–3402.
4. Jones D.T. (1999) GenTHREADER: An efficient and reliable protein fold
recognition method for genomic sequences. J. Mol. Biol. 287, 797–815.
5. Karplus K., et al. (1998) Hidden Markov Models for detecting remote
protein homologies. Bioinformatics 14, 846–856.
6. Taylor W.R. and Orengo C.A. (1989) Protein structure alignment. J. Mol.
Biol. 208, 1–22.
7. Samudrala R. and Moult J. (1998) An all-atom distance-dependent
conditional probability discriminatory function for protein structure
prediction. J. Mol. Biol. 275, 895–916.
8. Laskowski et al. (1993) PROCHECK – a program to check the
stereochemical quality of protein structures. J. Appl. Cryst. 26, 283–291.
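The ‘too hard’ rule stated under Target suitability and rejection in the Martin-Andrew abstract can be written as a small check on a pairwise alignment. This is only a sketch: the abstract gives the 5-residue limit but no figure for a “large number” of indels, so `max_indels` is a hypothetical threshold, and terminal gaps are counted as indels here for simplicity.

```python
import re
from typing import List


def indel_lengths(aligned_target: str, aligned_parent: str) -> List[int]:
    """Lengths of all gap runs ('-') in either row of a pairwise alignment."""
    lengths = []
    for row in (aligned_target, aligned_parent):
        lengths.extend(len(m.group()) for m in re.finditer(r"-+", row))
    return lengths


def too_hard(aligned_target: str, aligned_parent: str,
             max_indel_len: int = 5, max_indels: int = 10) -> bool:
    """True if any indel exceeds max_indel_len or there are too many indels.

    max_indel_len = 5 follows the text; max_indels is an assumed value.
    """
    lengths = indel_lengths(aligned_target, aligned_parent)
    return len(lengths) > max_indels or any(n > max_indel_len for n in lengths)
```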
Meller-Adamczak (P0441) - 23 predictions: 23 3D
Reading Weak Threading Signals for Difficult Fold
Recognition Targets
J. Meller and R. Adamczak
Pediatric Informatics, Children’s Hospital Research Foundation, University of
Cincinnati, 3333 Burnet Avenue, Cincinnati, OH 45229
jmeller@chmcc.org
Challenging fold recognition targets, with significant sequence and structure
variations with respect to known proteins, often result in largely correct
matches of low statistical confidence. In other words, primary, secondary and
tertiary structure signals, which are used to make predictions, are too weak
compared to what is expected by chance, given certain background probability
distributions associated with our NULL models. However, context dependent a
priori knowledge (e.g. about binding partners) imposes additional constraints
that may be utilized in order to postulate a better reference model and to
estimate what is expected by chance among putative matches satisfying such
constraints.
Many groups in CASP4 used strategies combining automated predictions with
manually enhanced and further validated annotations. However, adding
biological insights and a priori knowledge to recognition protocols proved to
be difficult to automate. In our experience, one encouraging example of such an
approach was an effective protein length filter used in the LOOPP server
threading protocol during CAFASP2 to enhance prediction for difficult targets
[1-3]. Without using family profiles and secondary structures, the LOOPP
server provided best models for three difficult targets and was ranked as the
third best server in the category of difficult targets [3].
Here, we attempted to combine the threading based LOOPP predictions with
further biological insights and manual evaluation of putative matches,
following the strategy of other groups and our own experience. We used the
CAFASP3 prediction servers to estimate the difficulty of a given target and (with
one exception) only those resulting in low consistency among the servers were
chosen to test our ability to improve upon the initial LOOPP prediction. We
would like to stress that we did not use the new incarnation of the LOOPP
server at the Cornell Theory Center [6], but the one used during the previous CASP
assessment [2] that we also continue using for annotations of divergent
genomic sequences [4-5].
The initial, high scoring threading matches were used to build a library of
related folds and their variations and the target sequences were realigned using
our flavor of structurally biased sequence alignment [1]. Such alignments
proved to be more reliable compared to “pure” threading alignments based on a
“local” contact model (THOM2), developed by Meller and Elber [1].
Nevertheless, sequence alignments are often statistically insignificant for
remote homologs and they were used here in the context of the initial threading
matches.
The level of the observed inconsistency between the alternative alignments was
used as one of the filtering criteria. Other criteria, used for some of the targets,
included analysis of amino acid residue packing in the models implied by the
(local) alignments in terms of pair distribution functions [7] and consistency
with strongly predicted secondary structure elements, using our novel protocol
[8]. The alignments with putative matches were next analyzed for consistency
with their biological role, investigating (using extensive literature searches) all
known interactions of the matches and their structural implications.
In several cases our approach did not result in a clear winner. The decision
whether a model should be submitted and what should be the ranking of the
models was then made based on intuitive and esthetic comparisons of the
models. We allowed ourselves to submit up to three models per target in order
to evaluate our ranking in some cases. A detailed description accompanied each
model submitted to the CASP server.
1. Meller J. and Elber R. (2001) Linear Programming Optimization and a
Double Statistical Filter for Protein Threading Potentials. Proteins 45, 241.
2. http://ser-loopp.tc.cornell.edu/loopp_old.html
3. http://www.cs.bgu.ac.il/~dfischer/CAFASP2; see also Proteins CASP
Sup.
4. Frary A., Nesbitt T., Frary A., Grandillo S., vd Knaap E., Cong B., Liu J.,
Meller J., Elber R., Alpert K., Tanksley S.D. (2000) fw2.2: A Quantitative
Trait Locus Key to the Evolution of Tomato Fruit Size. Science 289, 85.
5. Kuznetsova A., Meller J. et al., PNAS, submitted.
6. http://www.tc.cornell.edu/CBIO/loopp
7. http://sift.chmcc.org; manuscript under preparation
8. http://pressage.chmcc.org; manuscript under preparation
Levitt (P0016) - 350 predictions: 350 3D
The Levitt Group Comparative Modeling and Ab Initio
Methods for Protein Structure Prediction
E. Lindahl, P. Koehl, R. Kolodny, T.M. Raschke,
C.M. Summa, and M. Levitt
Department of Structural Biology, Stanford University School of Medicine,
Stanford, CA 94305 USA
michael.levitt@stanford.edu
All target sequences submitted to CASP5 were first screened using the results
from the CAFASP3 servers and comparative modeling was only attempted for
targets where at least one server showed intermediate or high scores. The
remaining sequences were considered ab initio targets. For comparative
modeling at CASP5, our group focused on improving sequence alignments and
on the prediction of sidechains and loop regions in proteins. For ab initio
targets, we generated decoys using fragment assembly followed by selection
based on energy functions and clustering.
Comparative Modeling. A consensus secondary structure prediction was
derived from all the servers available at the CAFASP3 website, giving
additional weight to the PsiPred [1] method. For all significant fold recognition
hits we extracted both the original structures and other structures in the same
SCOP superfamily [2] with good SPACI scores [3] to get high quality
templates. We computed a sequence profile based on the structural alignments
of these templates, derived from the FSSP database [4]. Position-dependent gap
penalties were introduced based on the experimental and predicted secondary
structures, FSSP fragments, and the distance between endpoints in the template
structures for deletions. We used both our alignments derived from the
structural profiles and automated alignments from CAFASP3 to create a set of
manually tweaked alignments for each target. The emphasis in this tuning
process was not mainly on matching features, but rather on manual
discrimination, correcting possible mismatches, and taking any additional
knowledge about the sequence/structure into account. For large insertions or
changes in secondary structure we first altered the backbone structure of the
template and applied energy minimization with SEGMOD [5] and Gromacs [6]
to get the structure to a reasonable state. Starting from the manual alignments, a
model backbone framework was built by removing two residues on each side of
insertions/deletions in the template. Candidate loop fragments were selected
from a set of geometrically compatible backbone fragments. Similar fragment
sets were generated for positions where there were PRO and GLY mutations
between the template and query sequences. This approach is limited to
insertions shorter than about 15 residues, and for a couple of cases we had to
apply manual modeling using the O program [7] to generate potential loops. In
the final modeling step, we select a set of rotamers for each sidechain, and use a
self-consistent mean-field approach [8-9] to simultaneously optimize sidechains
and the altered backbone fragments. Manual inspection and the energy of the
resulting all-atom models were used to select which predictions to submit.
Ab Initio Modeling. We applied the following ab initio prediction method to
target proteins that received low scores from the CAFASP3 comparative
modeling servers. Models were generated by assembling regularized backbone
segments of length 9 (derived from a 2000-protein library) using Monte Carlo
swap moves, as per Jones’ method used in CASP2[10] and Baker’s method in
CASP3 and CASP4 [11-12]. The energy function used for annealing consisted
of terms representing cooperative hydrogen bonds (as done by Keasar & Levitt
in CASP4), residue-based hydrophobic burial propensity, and residue-based
hydrophobic pair interactions. After 50,000 steps of annealing with the
segment replacement method, the models were annealed with “refinement
moves,” consisting of small 2° rotations of the backbone torsion angles, for
10,000 steps. This process was used to model the native sequence and
homologous sequences (where appropriate) using the predicted secondary
structures from several automated servers [1,13-14]. For some targets, the most
likely emitted sequence from a Hidden Markov Model built from the target
sequence family was also used [15]. A set of 1000 decoys was generated for
each sequence/secondary structure combination, and all models were combined
into one large dataset for selection. This dataset was pruned to 3000 members
using a “colony energy” score [16] that combined several energy functions
(atom cluster energy, electrostatic energy, RAPDF [17], and the energy from
the decoy generation procedure) with a measure of the structural similarities
between the decoys. The 3,000 best models were clustered with a hierarchical
clustering method using a Floyd distance metric (distance along the graph)
[18]. Decoys in the top 5 clusters were evaluated by manual inspection, and
typically one decoy from each of the top 5 clusters was submitted.
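The “distance along the graph” used for clustering can be illustrated with the construction from reference [18]: connect each decoy to its nearest neighbours and take all-pairs shortest paths with the Floyd-Warshall algorithm. The sketch below assumes a symmetric matrix of pairwise structural distances as input; the neighbour count `k` and the function name are hypothetical, and the abstract does not say how the graph was actually built.

```python
import math
from typing import List


def graph_distances(dist: List[List[float]], k: int) -> List[List[float]]:
    """Shortest-path distances on the k-nearest-neighbour graph of the decoys."""
    n = len(dist)
    g = [[math.inf] * n for _ in range(n)]
    for i in range(n):
        g[i][i] = 0.0
        neighbours = sorted((j for j in range(n) if j != i),
                            key=lambda j: dist[i][j])[:k]
        for j in neighbours:
            g[i][j] = g[j][i] = dist[i][j]  # undirected edge
    # Floyd-Warshall: allow paths through each intermediate decoy in turn.
    for mid in range(n):
        for i in range(n):
            for j in range(n):
                via = g[i][mid] + g[mid][j]
                if via < g[i][j]:
                    g[i][j] = via
    return g
```

Pairs left at infinity lie in disconnected parts of the graph. The cubic cost of Floyd-Warshall is one practical reason to prune a decoy set before this step.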
1. Jones D.T. (1999) Protein secondary structure prediction based on
position-specific scoring matrices. J. Mol. Biol. 292 (2), 195-202.
http://bioinf.cs.ucl.ac.uk/psipred/
2. Murzin A.G., Brenner S.E., Hubbard T., Chothia C. (1995) SCOP: a
structural classification of proteins database for the investigation of
sequences and structures. J. Mol. Biol. 247, 536-540.
3. Brenner S.E., Koehl P., Levitt M. (2000) The Astral compendium for
protein structure and sequence analysis. Nucleic Acids Res. 28, 254-256.
4. Holm L., Sander C. (1996) Mapping the protein universe. Science 273,
595-602.
5. Levitt M., Hirshberg M., Sharon R., Daggett V. (1995) Potential Energy
Functions and Parameters for Simulations of Molecular Dynamics of
Proteins and Nucleic Acids in Solution. Comp. Phys. Comm. 91, 215-231.
6. Lindahl E., Hess B., van der Spoel D. (2001) GROMACS 3.0: A package
for molecular simulation and trajectory analysis. J. Mol. Mod. 7 (8), 306.
http://www.gromacs.org
7. Jones T.A., Kjeldgaard M. (1998) Essential O, Software manual, Uppsala
University. http://xray.bmc.uu.se/alwyn/o_related.html
8. Koehl P., Delarue M. (1994) Application of a self-consistent mean field
theory to predict protein side-chains conformation and estimate their
conformational entropy. J. Mol. Biol. 239, 249-275.
9. Koehl P., Delarue M. (1995) A self consistent mean field approach to
simultaneous gap closure and side-chain positioning in homology
modeling. Nature Struct. Biol. 2, 163-170.
10. Jones D.T. (1997) Successful ab initio prediction of the tertiary structure of
NK-lysin using multiple sequences and recognized supersecondary
structural motifs. Proteins: Struct. Funct. Genet. S1, 185-191.
11. Simons K.T., Bonneau R., Ruczinski I. and Baker D. (1999) Ab initio
protein structure prediction of CASP III targets using ROSETTA. Proteins:
Struct. Funct. Genet. S3, 171-176.
12. Bonneau R., et al. (2001) Rosetta in CASP4: Progress in ab initio protein
structure prediction. Proteins: Struct. Funct. Genet. S5, 119-126.
13. PHD, http://www.embl-heidelberg.de/predictprotein/predictprotein.html
14. SAM-T02-STRIDE,
http://www.cse.ucsc.edu/research/compbio/HMMapps/T02-query.html
15. Gough J. and Madera M. (2002) The next generation of structural genome
analysis. CASP5 Abstract.
16. Xiang Z.X., Soto C.S. and Honig B. (2002) Evaluating conformational free
energies: The colony energy and its application to the problem of loop
prediction. Proc. Natl. Acad. Sci. USA 99 (11), 7432-7437.
17. Samudrala R. and Moult J. (1998) An all-atom distance-dependent
conditional probability discriminatory function for protein structure
prediction. J. Mol. Biol. 275 (5), 895-916.
18. Tenenbaum J.B., de Silva V. and Langford J.C. (2000) A global geometric
framework for nonlinear dimensionality reduction. Science 290 (5500),
2319.
MPALIGN (P0135) - 327 predictions: 327 3D
MPALIGN: a Protein Threading Program
Using Multiple Profiles
T. Akutsu1, M. Fujita1, H. Saigo1, J.-P. Vert2 and K. Horimoto3
1 Bioinformatics Center, Kyoto Univ., 2 Ecole des Mines de Paris,
3 Human Genome Center, Univ. Tokyo
takutsu@kuicr.kyoto-u.ac.jp
MPALIGN combines various alignment methods. PSI-blast [1] is used for easy
targets. For the others, dynamic programming based sequence-to-profile
alignment is employed using profiles from (i) PSI-blast search from each
representative sequence, (ii) multiple profiles from sequences in the same fold,
(iii) combination of (i) and profiles from structurally similar fragments. In the
following, we briefly describe the methods for (i)-(iii).
(i). For each sequence in the ASTRAL database (with less than 40% sequence
identity) [2], PSI-blast search is performed against the nr database using the ‘-Q’
option, which outputs a PSSM (position specific score matrix) as a result. For
each PSSM, both local alignment and global alignment between the target
sequence and the PSSM are computed. ASTRAL sequences are scored based
on these alignment scores.
(ii). Multiple profiles from several sequences in different families but in the
same fold are used, where each profile is obtained as in (i). An alignment
between the target sequence and multiple profiles is computed for each fold
class, where a simple dynamic programming algorithm is employed for
alignment. Fold classes are scored based on these alignment scores.
(iii). In order to obtain profiles based on fragments of protein structures,
fragments are classified into several tens of groups based on structural
similarities using the UPGMA clustering method, where each fragment consists
of 9 consecutive C-alpha atoms [3]. For each group, we construct a profile
based on residue frequency in each position (among 9 positions). For each
protein structure in the ASTRAL database, we construct a PSSM by
concatenating these short profiles, where profiles obtained as in (i) are also
used for regions that do not correspond to any group of fragments. An
alignment between the target sequence and each PSSM is computed using a
simple dynamic programming algorithm. ASTRAL sequences are scored based
on these alignment scores.
Finally, each candidate sequence is ranked based on a weighted sum of the above
scores and the result of secondary structure prediction by PSIPRED [4], where
we only use information about the ratio of the number of residues in
alpha-helices to the number of residues in beta-strands.
1. Altschul S.F. et al. (1997) Gapped BLAST and PSI-BLAST: a new
generation of protein database search programs. Nucleic Acids Res. 25
(17), 3389-3402.
2. Chandonia J.M. et al. (2002) ASTRAL compendium enhancements.
Nucleic Acids Res. 30, 260-263.
3. Simons K.T. et al. (1997) Assembly of protein tertiary structure from
fragments with similar local sequences using simulated annealing and
Bayesian scoring functions. J. Mol. Biol. 268, 209-225.
4. Jones D.T. (1999) Protein secondary structure prediction based on
position-specific scoring matrices. J. Mol. Biol. 292, 195-202.
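For method (iii) of the MPALIGN abstract, the per-group profile "based on residue frequency in each position" can be sketched as below. The abstract does not say whether pseudocounts or log-odds scaling were applied, so this illustration returns raw frequencies only; the function name and the representation of a fragment as a 9-letter string are assumptions.

```python
from collections import Counter
from typing import Dict, List

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def fragment_profile(fragments: List[str], length: int = 9) -> List[Dict[str, float]]:
    """Residue frequencies at each position over the fragments of one group."""
    if not fragments:
        raise ValueError("a group needs at least one fragment")
    if any(len(f) != length for f in fragments):
        raise ValueError(f"every fragment must have {length} residues")
    total = len(fragments)
    profile = []
    for pos in range(length):
        counts = Counter(f[pos] for f in fragments)
        profile.append({aa: counts[aa] / total for aa in AMINO_ACIDS})
    return profile
```

Concatenating such 9-position profiles along a structure, one per matched fragment group, would then give the structure-derived PSSM that the target sequence is aligned against.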
Murzin (P0448) - 21 predictions: 21 3D
Knowledge-based Approach to Modelling of Homologous
Structures with Low Sequence Similarity and Other Tricks
A.G. Murzin
MRC Centre for Protein Engineering, Cambridge, UK
agm@mrc-lmb.cam.ac.uk
Our semi-manual approach to protein structure prediction is based on the
knowledge of all known structural and probable evolutionary relationships
among proteins of known structure classified in the SCOP database. In CASP2
and CASP4, we successfully applied this approach to the recognition of
probable distant homology, where it existed, between the target proteins and
proteins of known structure and significantly improved the quality of our
distant homology models between the two CASP experiments. In CASP5, I
have applied this approach to distant homology modelling, a new emerging
prediction subcategory.
Having been previously a subset of the fold recognition targets, the distant
homology targets are now becoming a subject of comparative modelling. The
coming of sequence profile-based methods, combined with the enlargement of
protein families resulting from the sequencing of many complete genomes, has
eased the detection of remote homology. In contrast, the modelling of distant
homology targets remains a challenging problem. It is different from the
classic comparative modelling where the similarity between the target and
parent structures is likely to be extensive. In distantly related proteins, the
amount of similar structure is generally smaller, about one half on the average.
Thus, apart from the identification and alignment of the regions of similar
structure, there is a problem of the prediction of remaining dissimilar regions.
In theory, some of these regions could be “copied” from alternative parent
structures, if they are available, that is, it may be possible to assemble a
composite “parent” structure from fragments of several distantly related
structures that would approximate the target structure better than any of the
currently known structures. This theory is sound as shown by our initial test of
composite models in previous CASP experiments. My main objective in
CASP5 was the further improvement of composite models. Ideally, such a
model should provide a structural explanation of every conserved feature in the
multiple sequence alignment of the target immediate family.
I have selected about 15 targets ranging across two prediction categories from
difficult comparative modelling targets to easy to medium fold recognition
targets. The selection criteria were the lack of a high sequence similarity to a
known structure and the availability of at least two probably related structures
with low sequence similarities to each other. Thus, each selected target has
been assigned into a true SCOP superfamily (that is, containing more than one
family of known structure) either by sequence similarity searches or by our
distant homology recognition techniques. The composite “parent” structures
have been assembled from manually selected fragments of the representative
structures of different constituent families. In a few cases, my earlier
predictions speeded up the modelling process. Having explored previously a
number of true SCOP superfamilies, I assigned to them many sequence families
of unknown structure and aligned the representative sequences of these families
with the sequences of known structures. The CASP5 targets T0130, T0132,
T0136, T0152 and T0169 appeared to be members of already assigned
sequence families, which enabled my use of the prepared alignments. An original,
post-CASP4 model of T0130, built in 2001 for a structural genomics project
has been submitted without further refinement. This model suggests a novel
topology not yet observed in the target superfamily. My other models are
expected to improve on the prediction of local details of particular interest
including variable irregular elements (alpha-helical caps, beta-bulges etc.) near
the putative active sites. It should be noted that my method does not deal with
the problems of classic comparative modelling, like the prediction of loop or
side chain conformations. Any credit for a correct prediction of these details
goes to MODELLER used to seal the gaps and fix the stereochemistry of the
joints in the composite models.
My other CASP5 exercise was the exploration of a loophole in the design of
the CASP experiment. Ideally, for a given target, the prediction should be
made before the target structure is known to anybody. In reality, most of the
CASP5 targets were submitted after their structures had been solved, so
they were known to at least their authors. Although officially
unpublished, these structures could have been presented elsewhere, so some
information on these structures could have “leaked” into the public domain.
Indeed, I have found previously available information, mainly on the Internet,
on the structures of several CASP5 targets in all prediction categories. It
contained general descriptions of overall protein fold or similarity to a known
structure. Such information probably has little or no effect on my distant
homology models, which aim at the prediction of fine details, but it can be crucial
for predictions of the protein fold. To evaluate its effect on the quality of
predictions, I have built and submitted the models utilising the collected
structural information as recorded in the REMARK field of the corresponding
predictions.
MZ-Brussels (P0246) - 54 predictions: 54 3D
Energy-based 3D Protein Structure Predictions
Koji Ogata1,2, Raphael Leplae2 and Shoshana J. Wodak2
1 – ZoeGene Corp., Japan,
2 – Service de Conformation de Macromolecule Biologique et Bioinformatique,
Université Libre de Bruxelles, CP 263, Blv. du Triomphe, Brussels, Belgium
mz@ucmb.ulb.ac.be
The ModzingerZ (MZ) package performs either homology modelling or ab initio
structure prediction, depending on whether template structures are present in
or absent from the PDB. Templates are identified in the PDB by a two-step
procedure using PSI-BLAST [1] with the default options. When suitable
templates are found, homology modelling is performed by combining a
profile-profile alignment with an energy-based loop modelling procedure. In
the absence of template structures in the PDB, a fragment-grafting method is applied.
The grafted fragments are selected from a library of non-redundant fragment
conformations, which is derived from known protein structures by structure
superposition and clustering. In both approaches, generated conformations are
scored using an approximate force field derived by averaging main chain and
side chain interactions in proteins, computed using the AMBER force field, with
each residue modelled by two interaction centres. Further details about these
procedures are given in the following paragraphs:
Homology modelling procedure:
To identify structural templates in the PDB, a two-step procedure was used.
First, the target sequence was aligned against a sequence database combining
sequences from GenBank and PDB-sub (PDB-sub containing sequences with
<90% sequence identity) by using PSI-BLAST with default parameters. Second,
individual PDB structures identified by this search were re-run against
PDB-sub to identify additional homologs with known 3D structure. All the identified
PDB structures were then structurally aligned. A profile was derived from these
structural alignments and the target sequence was aligned against this profile
[2]. In addition, a sequence profile was computed for each identified template
protein, by running PSI-BLAST against GenBank [3] and pruning so as to leave
highly similar sequences (with identity more than 50% and less than 100%). In
performing these alignments, gaps inside the secondary structure elements
(computed using DSSP [4]) were penalised.
Structurally conserved regions (SCRs) in the target sequence were then defined
as residues that aligned to those of the structural templates that display an
RMSD≤1.0Å in the corresponding multiple structural alignment. For residues
in the target corresponding to SCRs, the backbone was built using the main
chain coordinates of the template with the highest BLOSUM62 score to the
target. Side chain coordinates from the same template were also used whenever
the amino acid of the target and template were the same.
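The SCR rule above can be sketched in a few lines. This is only an illustration, assuming that a per-position RMSD from the multiple structural alignment is already available as a list; the function name `find_scrs` and the example values are hypothetical and are not part of the MZ package.

```python
from typing import List, Optional, Tuple

SCR_RMSD_CUTOFF = 1.0  # Angstrom, the threshold quoted in the text


def find_scrs(column_rmsd: List[Optional[float]],
              cutoff: float = SCR_RMSD_CUTOFF) -> List[Tuple[int, int]]:
    """Return inclusive (start, end) index ranges of consecutive aligned
    positions whose template RMSD is at or below the cutoff.

    None marks a target position with no aligned template residue.
    """
    regions = []
    start = None
    for i, rmsd in enumerate(column_rmsd):
        conserved = rmsd is not None and rmsd <= cutoff
        if conserved and start is None:
            start = i                      # a new conserved run begins
        elif not conserved and start is not None:
            regions.append((start, i - 1))  # the current run ends
            start = None
    if start is not None:
        regions.append((start, len(column_rmsd) - 1))
    return regions


# Hypothetical per-position RMSD values (Angstrom); None marks a gap.
example = [0.4, 0.7, 1.0, 1.8, None, 0.9, 0.5]
print(find_scrs(example))
```

Positions outside the returned ranges would correspond to the structurally variable regions described next.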
The remaining regions, called structurally variable regions (SVRs), were built
by using the main chain atom coordinates of template structures having the
highest BLOSUM62 score computed without insertion/deletion regions, with
different templates being used for different regions. For the insertions/deletions,
an energy-based loop modelling method [5] was used to find suitable loop
conformations. The force field used for evaluating conformations models each
residue by two interaction centres positioned at the Cα and Cβ atoms. The
pairwise interaction energies between these centres were derived by computing
the average of the potential energy of the AMBER force field [6] for main
chain and side chain interactions for specific residue pairs in the PDB. We
verified that this force field yields rather accurate predictions for individual
protein loops as well as several interacting loops [Ogata & Wodak, in
preparation]. However, the maximum loop length amenable to this procedure is 22
residues; longer loops were therefore not modelled. Side chains of residues without
side chain coordinates from a template structure were generated using a
Monte Carlo method with the AMBER force field. Models output by the above
procedure were examined, and the alignment was adjusted (either manually or
with alignment tools) whenever inconsistencies (at the sequence,
structure or biological level) were discovered. The new alignment was then
re-fed to the model building method described above.
Ab initio modelling approach:
When no suitable template was found by the search procedure described above,
a fragment-grafting method was used. This method uses a library of
eight-residue fragments with non-redundant conformations (conformations with
rmsd>1Å), derived from known protein structures by structure superposition
and clustering. The information on the amino acid sequence and the secondary
structure (computed by DSSP [4]) associated with each fragment cluster is also
stored in the library.
The protein main chain was generated by chaining together fragments starting
from the protein N-terminus in such a way that the first four residues of the
following fragment and the last four residues of the preceding fragment overlap.
To select fragments from the library, the rmsd of the overlapping portion was
required to be below 1Å. Since this still yielded a very large number of
overlapping fragments, information on secondary structure was used to further
reduce conformation space, as follows. The target secondary structure was
predicted using PHD [7] and fragments with secondary structure more than 80%
similar to the target were selected. The similarity of the secondary structures is
defined as the percentage of identity between secondary structure elements of
the same type (helix, strand and random coil). The remaining conformations
were evaluated using the force-field described above.
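The secondary structure filter described above can be illustrated as follows. This is a sketch under stated assumptions: secondary structure is encoded as one letter per residue (H, E, C), the fragment and the target window have equal length, and the names `ss_similarity` and `select_fragments` as well as the toy library are hypothetical.

```python
def ss_similarity(fragment_ss: str, target_ss: str) -> float:
    """Percentage of positions with identical secondary structure state.

    States are single letters: H (helix), E (strand), C (random coil).
    """
    if len(fragment_ss) != len(target_ss):
        raise ValueError("fragment and target windows must have equal length")
    matches = sum(a == b for a, b in zip(fragment_ss, target_ss))
    return 100.0 * matches / len(fragment_ss)


def select_fragments(library, target_window, min_similarity=80.0):
    """Keep library fragments whose secondary structure is more than
    min_similarity percent identical to the predicted target window."""
    return [name for name, ss in library
            if ss_similarity(ss, target_window) > min_similarity]


# Toy library of eight-residue fragments (names and states are made up).
library = [("frag_a", "HHHHHHHH"),
           ("frag_b", "HHHHHHCC"),
           ("frag_c", "EEEECCCC")]
target = "HHHHHHHC"  # stands in for a PHD prediction of one window
print(select_fragments(library, target))
```

For eight-residue fragments, a threshold of more than 80% means that at least seven of the eight positions must agree.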
Over one million main chain conformations were generated in this way and the
main chain conformation with the lowest energy was selected as the optimal
solution. Side chain conformations of the selected main chain were then
generated using a Monte Carlo procedure with the AMBER force field.
For the CASP5 predicted targets, we used two single-processor machines. Due
to the size of the conformation search space, each prediction was limited to
24 hours of CPU time.
1. Altschul S.F. et al. (1997) Gapped BLAST and PSI-BLAST: a new
generation of protein database search programs. Nucleic Acids Res. 25
(17), 3389-3402.
2. Rychlewski L., Jaroszewski L., Li W., Godzik A. (2000) Comparison of
sequence profiles. Strategies for structural predictions using sequence
information. Protein Sci. 9 (2), 232-241.
3. Benson D.A., Karsch-Mizrachi I., Lipman D.J., Ostell J., Rapp B.A.,
Wheeler D.L. (2002) GenBank. Nucleic Acids Res. 30, 17-20.
4. Kabsch W. and Sander C. (1983) Dictionary of protein secondary
structure: Pattern recognition of hydrogen-bonded and geometrical
features. Biopolymers 22, 2577-2637.
5. Ogata K., Leplae R., Wodak S.J. An Energy Based Predictions for
Multi-loops of Proteins, in preparation.
6. Weiner S.J., Kollman P.A., Nguyen D.T. and Case D.A. (1986) An
all atom force field for simulations of proteins and nucleic acids. J. Comput.
Chem. 7, 230-252.
7. Rost B. and Sander C. (1994) Combining evolutionary information and
neural networks to predict protein secondary structure. Proteins 19, 55-72.
nexxus-delrio (P0370) - 7 predictions: 7 3D
Protein Structure Assessment by Matching Residues Function
and Centrality
G. del Rio1, A. Garciarrubio2 and D.E. Bredesen1
1 – Buck Institute, 2 – Biotechnology Institute (UNAM)
gdelrio@buckinstitute.org
Biological systems can be represented by their elements and their interactions
in a graph or network. Graph theory analyzes systems represented by vertices
(i.e. elements) and edges (i.e. interactions). From these graphs, central vertices
or edges (nexuses) can be detected based on diverse criteria, including
connectivity. Nexuses defined in terms of connectivity are those vertices upon
which the connectivity relies. Since graphs are models of biological systems,
nexuses represent essential elements of those systems [1, 2].
Several biological systems have been represented as graphs, including
metabolism [3] and protein-protein interactions [2]. In these examples, the
connectivity distribution presents a tail following a power-law distribution, that
is, there are few vertices having many interactions while most vertices have few
interactions. It has been proposed that in such a distribution the most connected
vertices play an essential function for the system being modeled [2].
However, none of the systems studied in this way are completely characterized
in terms of vertices and edges. On the other hand, proteins (i.e. protein
structures) can be modeled as graphs where the vertices are amino acid residues
and edges are spatial-distance relationships. In this case, the system is better
defined and can be used as a model to readily test mathematical methods to
identify central/essential elements.
We have found that the connectivity distribution in protein structures does not
follow a power-law distribution but an exponential one [4]. We developed a
method, referred to as Minimum Interacting Networks (MIN), to identify nexuses in
power-law or exponential distributions [1, 4]. MINs are obtained by
tracing the shortest path connecting all the vertices in a graph. From these
paths, the most traversed vertices are identified as nexuses, as opposed to those
vertices having the most interactions.
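The difference between the most traversed and the most connected vertices can be shown on a toy graph. The sketch below is not the MIN algorithm itself, whose details are not given here; as a simplified stand-in, it finds one breadth-first shortest path for every pair of vertices and counts how often each vertex lies inside such a path. The graph and the function names are hypothetical.

```python
from collections import deque
from itertools import combinations


def shortest_path(adj, source, target):
    """Return one shortest path (list of vertices) found by breadth-first search."""
    parent = {source: None}
    queue = deque([source])
    while queue:
        v = queue.popleft()
        if v == target:
            break
        for w in adj[v]:
            if w not in parent:
                parent[w] = v
                queue.append(w)
    if target not in parent:
        return []  # target not reachable from source
    path = []
    v = target
    while v is not None:
        path.append(v)
        v = parent[v]
    return path[::-1]


def traversal_counts(adj):
    """Count how often each vertex lies strictly inside a shortest path
    between a pair of other vertices."""
    counts = {v: 0 for v in adj}
    for a, b in combinations(sorted(adj), 2):
        for v in shortest_path(adj, a, b)[1:-1]:
            counts[v] += 1
    return counts


# Toy graph: two fully connected groups (1-4 and 6-9) joined through vertex 5.
adj = {
    1: [2, 3, 4], 2: [1, 3, 4], 3: [1, 2, 4], 4: [1, 2, 3, 5],
    5: [4, 6],
    6: [5, 7, 8, 9], 7: [6, 8, 9], 8: [6, 7, 9], 9: [6, 7, 8],
}
counts = traversal_counts(adj)
nexus = max(counts, key=counts.get)
print("traversals:", counts)
print("most traversed vertex:", nexus, "with degree", len(adj[nexus]))
```

In this example the bridging vertex has only two neighbours, fewer than any other vertex, yet every path between the two groups passes through it.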
In proteins, essential residues tend to be evolutionarily conserved, and because
of this feature, they have been used to classify and identify proteins by function
and structure. In modeling protein structures, refinement and assessment of the
models is systematically done by methods accounting for geometry and energy
criteria. These models are then subjected to other approaches to identify essential
residues (e.g. residues in the active site) as a way to test them functionally. We
present a method to assess protein structure models that incorporates functional
validation. Our strategy is represented in figure 1:
Figure 1. The 3D structure of a protein sequence (line) is modeled by structural
homology with a known 3D structure (cylinder). Multiple conformers
(cylinders, cones) can be generated where geometry and energy criteria are
satisfied. Nexuses can be identified and matched with known functionally
essential residues. The models where the matching is the highest are identified
as functionally/structurally correct.
As with sequence motifs, nexuses are proposed to serve as a signature to identify a
particular structure. In the figure, the central regions of a cone are not the same
as those of a cylinder; hence, cylinders can be filtered out from cones using centrality
as a criterion. This appears to be true for proteins since these do not present all
the possible different shapes. Also, proteins with similar folds may have
distinct essential residues; hence these may be used to identify the
corresponding protein structures.
We tested our method with three proteins (beta-lactamase (BL), HIV-1 protease
(HIV1P) and T4 lysozyme (T4Lz)) for which the 3D structure is available and
extensive mutagenesis studies have been conducted to identify essential
residues for the wild-type activity. We found that in all three cases, we were
able to identify the closest models to the crystal structure by matching nexuses
with essential residues (matching ratio) (see figure 2).
We also used our method in fold recognition. That is, sequences with different
folds may be identified as templates for a target sequence. In order to identify
the best template for the target sequence, 3D models are built to detect nexuses
and match these with essential residues. We tested this procedure with the death
domain of the low affinity neurotrophin receptor (NTRP75) and PROSPECT as
the fold recognition approach. The 5 best sequence alignments identified by
PROSPECT were used to test our approach. As seen in Table 1, the fold with
highest matching ratio was the correct fold.
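The template selection can be reproduced from the CN and NCN counts reported in Table 1. The snippet below is a small worked example; the variable and function names are chosen here for illustration only.

```python
# (CN, NCN) counts per template, as reported in Table 1.
table1 = {
    "1B3U (A)": (19, 39),
    "1D2Z (A)": (19, 35),
    "1NGR (1)": (18, 22),
    "1QPZ (A)": (19, 47),
    "1R69": (17, 45),
}


def matching_ratio(cn, ncn):
    """CN/NCN expressed as a percentage."""
    return 100.0 * cn / ncn


ratios = {t: matching_ratio(cn, ncn) for t, (cn, ncn) in table1.items()}
best = max(ratios, key=ratios.get)

for template, ratio in sorted(ratios.items(), key=lambda kv: -kv[1]):
    print(f"{template:10s} {ratio:6.2f}")
print("best template:", best)
```

The printed percentages are rounded and may differ in the last digit from the tabulated values, which appear to be truncated to two decimals.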
Table 1. Fold recognition scores based on matching ratio (column CN/NCN)
for NTRP75. Essential residues here are considered evolutionarily conserved
ones. CN: Nexuses found to be conserved residues; NCN: Nexuses found to be
non-conserved residues.
Template Structure   CN   NCN   CN/NCN (%)   Fold Classification
1B3U (A)             19   39    48.71        Alpha-alpha super helix
1D2Z (A)             19   35    54.28        DEATH domain
1NGR (1)             18   22    81.81        DEATH domain
1QPZ (A)             19   47    40.42        Periplasmic binding protein-like I
1R69                 17   45    37.77        Lambda repressor-like DNA-binding domains
For CASP5, we used PROSPECT to identify the closest sequence homologues
and, using our approach, selected the models with the largest
matching ratios (essential residues were taken to be the evolutionarily conserved
ones). The models were built using MODELLER.
Figure 2. RMSD of 40 models of HIV1P superimposed with the crystal
structure (1HIV) is plotted against the specificity of our method. The models
were built with MODELLER and the 1HIV structure as template. Specificity =
No. of essential residues predicted as nexuses/No. of essential residues.
Specificity is used here as the matching ratio.
1. del Rio G., et al. (2001) Mining DNA microarray data using a novel
approach based on graph theory. FEBS Lett. 509 (2), 230-234.
2. Jeong H., et al. (2001) Lethality and centrality in protein networks. Nature
411 (6833), 41-42.
3. Jeong H., et al. (2000) The large-scale organization of metabolic networks.
Nature 407 (6804), 651-654.
4. del Rio G. et al. Detecting central elements from protein derived networks
to predict essential function. In preparation.
ORNL-PROSPECT (P0012) - 330 predictions: 330 3D
Fold Recognition Using PROSPECT
D. Kim, D. Xu, J. Guo, S. Passovets, M. Shah,
K. Ellrott, and Y. Xu*
Protein Informatics Group, Oak Ridge National Laboratory
* xyn@ornl.gov
We have predicted the protein tertiary structures for all 67 targets. The
predictions have been mainly carried out using our newly improved fold
recognition program, PROSPECT (http://compbio.ornl.gov/PROSPECT/), and a
recently developed computational pipeline for automated protein structure
predictions (http://compbio.ornl.gov/PROSPECT/PROSPECT-pipeline), with
occasional human intervention.
PROSPECT uses both sequence and structural information to recognize
the correct sequence-structure relationship. Building on its previous unique features
for rigorously treating pairwise contact energy and protein-specific data as
threading constraints, the new version of PROSPECT has the following new
features: (1) the use of evolutionary information not only in a
profile-profile sequence alignment score but also in calculating the single and pairwise
energies; and (2) the use of a combined z-score for a sequence-structure
alignment for fold recognition, based on a z-score for the raw alignment score
and a z-score for the pairwise interaction energy. In addition, a prediction
confidence index for measuring prediction reliability was tabulated from an
objective statistical analysis of a large data set, in which 600 query
proteins were threaded against the whole FSSP database.
The tests on several benchmark sets indicate that the evolutionary information
and other new features in PROSPECT greatly improve its alignment accuracy.
We have also demonstrated that PROSPECT's performance on fold
recognition is significantly better than that of any other publicly available
method at all levels of sequence similarity. The improvement in the sensitivity of fold
recognition, especially at the superfamily and fold levels, makes PROSPECT a
reliable prediction tool for large-scale applications of protein structures. We
have developed a fully automated prediction pipeline for protein structures,
centered around PROSPECT.
The pipeline consists of five key components: (a) preprocessing for
identification of protein domains, identification and removal of signal peptides,
and protein secondary structure prediction (using our own in-house program);
(b) protein triage for classification of proteins into membrane proteins, soluble
proteins with and without significant BLAST hits; (c) protein fold recognition
and sequence-structure alignment using PROSPECT, with additional available
information as prediction constraints (such as secondary structure or known
disulfide bonds); (d) protein structure modeling, using MODELLER, based
on threading alignments of PROSPECT; and (e) post-processing for structural
quality check, using PROCHECK. A pipeline manager system was developed
to automatically determine the process and prediction pathway based on a user
specification, a set of preset conditions and related control flow, and the triage
result. The whole pipeline is implemented as a client/server system, with a web
interface. XML technology is used for data exchange between the web
interface, the pipeline manager and the tools. Currently the pipeline is running
on a 64-node Linux cluster at Oak Ridge National Laboratory.
A general procedure for the CASP5 structure predictions starts with running the
pipeline for each target. If both the prediction reliability score and the structure
quality assessment score from WHATIF are above some pre-determined
thresholds, a structure model generated by MODELLER was submitted as the
final prediction without any human intervention. Most of the homology
modeling targets belong to this case. For the cases with high reliability scores
but low WHATIF scores, several alternative sequence-structure alignments for
a chosen template were generated using different alignment schemes including
global, global-local, and local alignments and/or by using different sets of
weighting factors for each energy term. The majority of the fold recognition
targets belong to this case. For targets with low reliability scores, such as new
fold targets and some of the difficult fold recognition targets, additional
information such as predicted functions and consistency with the predicted
secondary structures was used to select the templates and adjust the alignments.
Overall, the predictions were made by making maximal use of two automated
programs, PROSPECT and the pipeline, while human intervention was kept to a
minimum.
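The three cases of the CASP5 procedure described above amount to a simple decision rule, sketched below. The thresholds are not given in the text, so the cutoff values and all names in this example (`Route`, `choose_route`) are hypothetical.

```python
from enum import Enum


class Route(Enum):
    SUBMIT_AUTOMATIC = "submit the MODELLER model without human intervention"
    ALTERNATIVE_ALIGNMENTS = "generate alternative alignments for the chosen template"
    EXTRA_INFORMATION = "use predicted function and secondary structure consistency"


def choose_route(reliability, quality, reliability_cutoff, quality_cutoff):
    """Map the prediction reliability score and the structure quality score
    onto one of the three cases described in the text."""
    if reliability > reliability_cutoff:
        if quality > quality_cutoff:
            return Route.SUBMIT_AUTOMATIC
        return Route.ALTERNATIVE_ALIGNMENTS
    return Route.EXTRA_INFORMATION


# Made-up scores and cutoffs, for illustration only.
RELIABILITY_CUTOFF, QUALITY_CUTOFF = 0.9, -1.0
for scores in [(0.97, -0.5), (0.95, -2.3), (0.40, -0.8)]:
    route = choose_route(*scores, RELIABILITY_CUTOFF, QUALITY_CUTOFF)
    print(scores, "->", route.value)
```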
Osgdj (P0292) - 100 predictions: 100 3D
The PROTSCAPE Protein Folding Algorithm
D.J. Osguthorpe
University of Bath in Swindon
djosg@mgu.bath.ac.uk
The PROTSCAPE protein folding algorithm is based on a simplified geometry
model of protein structure with a physics-based force field representing the
interactions between the pseudo-atoms.
The model used to represent the solvation aspects of protein folding physics is a
novel one unique to this algorithm. It is called the "Differential Dielectric
Model" as it describes what happens to electrostatic interactions when there is a
difference in dielectric between two regions. A consequence of this model is
that there is, for the first time, a direct physical basis for a cooperative effect in
protein structures which develops from the interaction between non-polar
groups and electrostatic interactions. Further, it allows another mechanism for
denaturation by solvents such as Guanidinium HCl which is not based on
disrupting "hydrophobicity" but on reducing the difference in dielectric of the
core and outer surface regions.
The Reduced Representation Model and Force Field
Simplified Geometry Model
The model involves representing the backbone of each residue by one sphere,
or 'atom', and the side chains by up to 4 'atoms'. The side chains of Ala, Val,
Ile, Ser, Thr and Pro are represented by 1 sphere, Leu, His, Asp, Glu, Asn, Gln,
Cys and Met by two spheres and Phe, Trp, Lys and Arg by three spheres and
Tyr by four spheres. The different number of spheres reflects the anisotropic
nature of the average shape of the corresponding side chains. It also enables
assigning different characteristics to parts of the side chain of a residue, for
example, the side chain of Arg includes a hydrophobic chain and a
polar/charged end. Although in this representation many residues have the same
number of atoms, they do not lose their unique identity since they have
different parameters.
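The mapping from residue type to number of pseudo-atoms can be written down directly from the list above. In this sketch the sphere counts are taken from the text, while the entry for Gly (no side chain sphere) is an assumption, since glycine is not listed; the function name is hypothetical.

```python
# Side chain spheres per residue type, as listed in the text.
SIDE_CHAIN_SPHERES = {
    **dict.fromkeys(["ALA", "VAL", "ILE", "SER", "THR", "PRO"], 1),
    **dict.fromkeys(["LEU", "HIS", "ASP", "GLU", "ASN", "GLN", "CYS", "MET"], 2),
    **dict.fromkeys(["PHE", "TRP", "LYS", "ARG"], 3),
    "TYR": 4,
    "GLY": 0,  # assumption: Gly is not listed in the text
}


def pseudo_atom_count(sequence):
    """Total number of pseudo-atoms: one backbone sphere per residue
    plus the side chain spheres of that residue type."""
    return sum(1 + SIDE_CHAIN_SPHERES[res] for res in sequence)


print(pseudo_atom_count(["ALA", "TYR", "GLY", "ARG"]))
```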
Simplified Potentials
The potentials required can be split into three major groups, the virtual internal
potentials which stabilise the geometry of the protein, secondary structure
stabilisation potentials and the global potentials, which deal with the effects of
the environment but do not require the environment to be modeled explicitly.
The potential energy function for the model is defined as:
E total = E Internal + E Secondary Structure + E van der Waals + E Global
Internal Potentials
The values of the parameters were derived by fitting observed distributions of
the corresponding internals in experimental structures and by emulating the
energy surface calculated using a full atom model.
The internal energy is defined in terms of virtual bonds, angles and torsions (or
out-of-plane terms). A number of functional forms are used: the standard full-atom
model harmonic terms, quadratic functions and Gaussian functions, plus
combinations of these terms. Additionally, an out-of-plane/virtual valence angle
cross-term is defined.
E Internal = E V. bond + E V. angle + E V. torsion + E V. oop + E V. oop X V. angle
Virtual angle - virtual angle - virtual torsion angle (theta-theta-phi) cross-terms
are defined for dealing with correlations between the two internal valence
angles of a torsion angle in the backbone. These are particularly important for
turn conformations.
Secondary structure energy/Backbone Hydrogen bonding Potentials
With the simplified geometry model only C alpha atoms exist for the backbone
and yet backbone hydrogen bonding is very important in the stabilisation of the
standard secondary structures. However, the standard secondary structures have
a fixed and specific set of distances between the C alpha atoms. Hence the basic
approach was to determine the equilibrium distances between C alpha atoms in
3-10 helices, alpha-helices, and parallel and anti-parallel beta-sheets, and to use
Gaussian functions to stabilise these distances.
E Secondary Structure = E Helix + E Sheet
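A Gaussian function that stabilises a given C alpha to C alpha distance can be sketched as below. The functional form is one common choice and the numerical values are purely illustrative; neither is taken from the PROTSCAPE parameter set.

```python
import math


def gaussian_well(distance, d0, depth, width):
    """Attractive Gaussian well centred on the equilibrium distance d0.

    Returns -depth at distance == d0 and tends to 0 far away from d0,
    so only distances close to d0 are stabilised.
    """
    return -depth * math.exp(-((distance - d0) ** 2) / (2.0 * width ** 2))


# Illustrative values only (Angstrom and arbitrary energy units).
D0, DEPTH, WIDTH = 6.2, 1.0, 0.5
for d in (6.2, 6.7, 8.0):
    print(d, round(gaussian_well(d, D0, DEPTH, WIDTH), 4))
```

Because the well vanishes away from d0, such a term rewards a secondary structure geometry once it is found without forcing the chain into it.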
For the beta-sheets it was also necessary to include some vector terms
to ensure that the potential was strong only when the two strands were aligned.
Further improvements to the sheet potentials were necessary, as trial
folding runs made clear that additions were needed to remove
conformations that are never seen in real proteins.
It should be noted that in all cases the secondary structure potentials merely
stabilise distances that are found; this is not a pre-imposition of secondary
structure. The beta-sheet potentials do a full search of all residue pairs to find
any that are close enough to form sheets in each energy calculation.
Secondary structure "preference" energy
This term accounts for the observation that certain residues prefer a particular
secondary structure. This is required in this model as this preference is due to
local interactions between side chain atoms and the backbone atoms which are
missing in this model. Ala, Lys, Arg, Glu, Gln, Leu and Met are assigned a
helix preference, while Val, Ile, Thr, His, Phe, Tyr, Trp and Cys are assigned a
strand preference. An overall preference for any residue has been added by
stabilising virtual torsions and angles using i-i+2, i+1-i+3, and i-i+3 distances
and Gaussian functions for both the helical and strand conformations. As
individual residue conformations only affect the virtual valence angle, the
overall preference is specifically increased only for contiguous pairs of residues
which both prefer the helical conformation or both prefer the strand
conformation. That is, the two central C alphas of a backbone virtual torsion
must both prefer the helical or strand conformation to increase the secondary
structure "preference" potential of the virtual torsion.
E Secondary Structure Prediction = E Turn + E Strand
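The pairing rule for the preference term can be illustrated as follows. The two residue sets are taken from the text; the function name and the example sequence are hypothetical, and the sketch only labels each virtual torsion without computing an energy.

```python
HELIX_PREF = {"ALA", "LYS", "ARG", "GLU", "GLN", "LEU", "MET"}
STRAND_PREF = {"VAL", "ILE", "THR", "HIS", "PHE", "TYR", "TRP", "CYS"}


def torsion_preferences(sequence):
    """For each backbone virtual torsion (residues i, i+1, i+2, i+3), return
    'helix' or 'strand' if the two central residues (i+1 and i+2) share that
    preference, and None otherwise."""
    prefs = []
    for i in range(len(sequence) - 3):
        b, c = sequence[i + 1], sequence[i + 2]
        if b in HELIX_PREF and c in HELIX_PREF:
            prefs.append("helix")
        elif b in STRAND_PREF and c in STRAND_PREF:
            prefs.append("strand")
        else:
            prefs.append(None)
    return prefs


seq = ["GLY", "ALA", "LEU", "VAL", "ILE", "SER"]
print(torsion_preferences(seq))
```

Only the torsions labelled 'helix' or 'strand' would receive the increased preference potential.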
Global/Solvation Potentials
The remaining potentials are used to represent the non-bonded interactions of
the residues with each other and the interactions with solvent. The fundamental
idea behind the solvation potentials was to use fast approximations to the
physical forces involved in real protein structures. Also, as molecular dynamics
was seen as one of the primary tools to be used in the parameterisation
procedure and for first attempts at protein folding, the potentials had to have
analytical derivatives for speed.
Physical Model Solvation Potentials.
In this potential model the physical forces of solvation were included using
simple potential models. The main idea was that most protein atoms should not
have an attractive interaction with other protein atoms, reflecting the fact that
the real interactions with protein atoms would be replaced by solvent
interactions if the atom became exposed, hence its overall energy would not
change depending on whether it was buried or exposed. However, the atoms
should still have excluded volume so a repulsion potential is required at short
distances. An offset Lennard-Jones potential is used, where the well-depth is
offset to 0 at the Lennard-Jones radius and the energy is set to 0 for distances
between atoms greater than the Lennard-Jones radius. This potential is used for
most atoms, in particular the C alpha backbone atoms and any atom which does
not have a specific Lennard-Jones potential.
E van der Waals = E Offset Lennard-Jones
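The offset Lennard-Jones potential can be sketched as follows. This assumes that the "Lennard-Jones radius" is the distance of the potential minimum (written `r_min` here) and uses a standard 12-6 form; the function name and numerical values are illustrative and not taken from PROTSCAPE.

```python
def offset_lennard_jones(r, r_min, epsilon):
    """Purely repulsive 12-6 Lennard-Jones potential.

    The curve is shifted up by the well depth epsilon so that the energy is
    0 at r_min, and it is set to 0 for all distances r >= r_min.
    """
    if r <= 0:
        raise ValueError("distance must be positive")
    if r >= r_min:
        return 0.0
    ratio6 = (r_min / r) ** 6
    return epsilon * (ratio6 * ratio6 - 2.0 * ratio6) + epsilon


# Illustrative values only (Angstrom, kcal/mol).
for r in (3.0, 3.6, 4.0, 5.0):
    print(r, round(offset_lennard_jones(r, r_min=4.0, epsilon=0.2), 4))
```

With this form the potential keeps the excluded volume of each pseudo-atom while giving no net attraction, and it remains continuous at `r_min`.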
Physical Model Solvation Potentials - Hydrophobicity
The next effect to consider is the "hydrophobic" effect. This is separated into
two parts, the van der Waals potential between atoms (which is attractive) and
effects due to interactions with water. Side chain atoms of hydrophobic groups
were given a standard Lennard-Jones potential with an initial energy
assignment for interactions between the same atom close to the enthalpy of
vapourisation of the most similar hydrocarbon. This would reproduce the
energy of the hydrophobic core when hydrophobic side chains are buried.
This determined the potential between the same side chain atom types. For
dissimilar side chain atom types an analysis of the distribution of side chain
atoms around an atom in known protein structures showed to a first
approximation little difference in preference between the atoms. This
distribution is not that which is created by rules such as the geometric mean
rules. A function was created which would give such a distribution and this was
used to generate the mixed terms for the Lennard-Jones parameters of
hydrophobic side chain atoms. Unlike in previous CASP predictions, no
additional hydrophobicity term was included in CASP5.
A final adjustment to the "hydrophobicity" potential was to give certain groups
in residues not normally considered hydrophobic a non-zero Lennard-Jones
function so that an interaction existed between them and hydrophobic groups.
Such groups were the Ala C beta, the Thr C beta (because of the methyl group),
the C beta of the charged amino-acids Asp, Glu, Lys and Arg and Asn and Gln.
It also included the C gamma atom of Lys and Arg. Observations of
experimental structures and surface accessibility calculations show that these
groups are as buried as any of the atoms in the classic hydrophobic side chains.
E Global = E van der Waals + E hydrophobic sigmoid
Physical Model Solvation Potentials - Electrostatics
In these calculations an inverse Kirkwood-Tanford model is used, in which the
electrostatic interactions between the charged groups are varied according to
their local dielectric environment, defined by counting how many non-polar
groups are surrounding them, using a sigmoid function. To take account of
ionic strength effects, which are assumed to have an effect at large distances
between charges but not at short range (as the Debye-Hückel theory on which
this aspect is based assumes an averaged ionic atmosphere around each charge
which is certainly not true for charges on the surface of a folded protein), a
distance dependent dielectric of the distance squared was used. Electrostatic
interactions were computed using a distance cubed term and in addition the
same term scaled by the sigmoid function of surrounding non-polar groups.
Note that the scaled term accounts for salt-bridges automatically, as interactions
between charged pairs not surrounded by non-polars (high dielectric) will be
weak but strong when surrounded by non-polars (low dielectric). The distance
cubed term was chosen such that at 4.0 angstrom the charge effect was
approximately equivalent to two unit charges with dielectric 80 or so (approx. 1
kcal) whereas at 10-12 angstrom the charge interaction was reduced to less than
0.1 kcal.
E Global += E Electrostatic + E scaled Electrostatic
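The numbers quoted for the distance cubed term can be checked with a short calculation. With a dielectric proportional to the distance squared, a Coulomb interaction falls off as the inverse cube of the distance. The sketch below calibrates such a term against two unit charges in a dielectric of 80 at 4.0 angstrom, using the standard conversion constant of about 332 kcal·Å/(mol·e²); the function names and the calibration step are assumptions made for this example, and only magnitudes are considered.

```python
COULOMB_KCAL = 332.06  # kcal*Angstrom/(mol*e^2), standard conversion constant


def coulomb(r, dielectric, q1=1.0, q2=1.0):
    """Coulomb interaction magnitude (kcal/mol) for a constant dielectric."""
    return COULOMB_KCAL * q1 * q2 / (dielectric * r)


def cubed_term(r, k):
    """Interaction magnitude falling off as 1/r^3."""
    return k / r ** 3


# Calibrate k so that the 1/r^3 term equals the dielectric-80 value at 4 A.
e_ref = coulomb(4.0, 80.0)
k = e_ref * 4.0 ** 3

print("reference at 4 A:", round(e_ref, 3), "kcal/mol")
for r in (4.0, 10.0, 12.0):
    print(r, round(cubed_term(r, k), 3))
```

The reference value is close to 1 kcal/mol, and the calibrated term falls below 0.1 kcal/mol at 10 to 12 angstrom, in line with the figures given in the text.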
The other feature of electrostatics that needs to be covered is the difficulty of
burying charges, the self-energy. It is actually a much stronger rule of proteins
that the charged group of charged residues is exposed than that the side chains
of hydrophobic residues are buried. Charged groups are only buried if in a salt
bridge or extensively hydrogen bonded. The simple electrostatic explanation
for this is the self-energy of a charge which says it requires a lot of energy to
move a charge from a high dielectric region into a low dielectric region.
As there is a big difference in surface accessibility between the four charged
residues (Lys, Glu, Asp and Arg), independent potentials are used for Lys,
Asp/Glu and Arg. The charged end group of Lys is the most solvent-exposed
group in proteins, with an average relative accessible surface area greater than
50%. Glu is next, followed by Asp, both in the 45% region, and Arg is the least
exposed at around 35%. This is what would be expected from charge density
considerations, the self-energy being much greater for a charge field that is
small and highly charged. The amine group of Lys is the smallest charged
group, with only one heavy atom, the carboxyl spreads the charge further while
the guanidinium group charge is spread over a very large area (four heavy
atoms). The same sigmoid function counting the number of non-polar groups
surrounding a charge is used as before, scaled by a potential constant which
gives a positive energy for burying a charge.
$E_{\text{Global}} \mathrel{+}= E_{\text{charge-non-polar sigmoid}}$
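A minimal sketch of such a burial term is given below, assuming a logistic form
for the sigmoid. The midpoint, steepness and per-residue potential constants are
hypothetical values chosen only to illustrate the ordering argued in the text
(Lys penalised most, Arg least); they are not the parameters actually used.

```python
import math


def burial_sigmoid(n_nonpolar, midpoint=6.0, steepness=1.0):
    """Smooth 0-to-1 switch on the number of surrounding non-polar groups."""
    return 1.0 / (1.0 + math.exp(-steepness * (n_nonpolar - midpoint)))


# Hypothetical potential constants (kcal/mol); positive means burial costs energy.
POTENTIAL_CONSTANT = {"LYS": 3.0, "ASP": 2.5, "GLU": 2.5, "ARG": 2.0}


def charge_burial_energy(residue, n_nonpolar):
    """Positive self-energy penalty for a charge surrounded by non-polar groups."""
    return POTENTIAL_CONSTANT[residue] * burial_sigmoid(n_nonpolar)
```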
In CASP5 this treatment was extended to the polar side chains, Ser, Thr, Tyr,
Trp, His, Asn and Gln. The same argument that applies to charges also applies
to atoms with a partial charge; however, the self-energy is much weaker because
overall the group is neutral. Recent test folding simulations have suggested that
these groups are too easily buried with a zero Lennard-Jones potential, so an
additional self-energy term has been added for them.
$E_{\text{Global}} \mathrel{+}= E_{\text{polar-non-polar sigmoid}}$
Another addition in CASP5 was a set of Lennard-Jones terms between polar side
chains (Asn, Gln, Tyr, Ser) and the backbone CA, and between charged residues
(Asp, Glu, Lys, Arg) and the backbone CA. These attempt to represent side
chain-backbone hydrogen bonding interactions. Further, Lennard-Jones terms
between polar and charged side chains (Tyr and Arg, Asp, Glu, Lys) were
introduced. These attempt to represent polar-charged side chain hydrogen
bonding interactions, although relatively weakly at present.
$E_{\text{Global}} \mathrel{+}= E_{\text{polar-backbone}} + E_{\text{charged-backbone}} + E_{\text{polar-charged}}$
Physical Model Solvation Potentials - Differential Dielectric Model
In the low dielectric environment of the folded protein the stability of the
backbone-backbone hydrogen bonds is significantly enhanced, as these
hydrogen bonds are excluded from solvent and a hydrogen bond is essentially
an electrostatic interaction. In the unfolded protein the stability of
backbone-backbone hydrogen bonds is likely to be similar to that of
backbone-water hydrogen bonds, hence there should be no energy stabilising
backbone hydrogen bonds. This effect has been included by scaling the backbone
hydrogen bond energy terms (E Helix and E Sheet) by a sigmoid function
counting the number of surrounding non-polar groups.
Folding Simulations - Simulated Annealing procedure
The starting conformation is an all-extended structure built with a rigid
geometry procedure based on a standard geometry for the RR model. Initial
velocities are assigned from a random Maxwell-Boltzmann distribution. The
initial temperature is set such that the average temperature is initially
380 K. The annealing protocol was first to reduce the total energy by the energy
equivalent to 25 K in 84000 steps, followed by 84000 steps at constant total
energy. This was repeated three times. A further 84000 steps at constant total
energy followed, and then the energy was reduced by 12.5 K in 84000 steps. The
total energy was then continuously reduced by 3.125 K in 84000 steps until the
average temperature was around 175 K. This was followed by 5 runs at a constant
temperature of 175 K. Final annealing to close to 0 K was done in 4 runs of
84000 steps cooling by 25 K each run, followed by a final 110 K cooling run of
84000 steps.
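As an illustration of the velocity assignment step only, the sketch below draws
each Cartesian velocity component from a Gaussian of variance kT/m, which is
the Maxwell-Boltzmann distribution for that component. The particle masses, SI
units and function names are assumptions made for the example; the RR model's
own units and particle definitions are not given here.

```python
import math
import random

K_B = 1.380649e-23    # Boltzmann constant, J/K
AMU = 1.66053907e-27  # atomic mass unit, kg


def maxwell_boltzmann_velocities(masses_amu, temperature, rng):
    """Draw one (vx, vy, vz) tuple in m/s per particle at the given temperature."""
    velocities = []
    for mass in masses_amu:
        sigma = math.sqrt(K_B * temperature / (mass * AMU))
        velocities.append(tuple(rng.gauss(0.0, sigma) for _ in range(3)))
    return velocities


def kinetic_temperature(masses_amu, velocities):
    """Instantaneous temperature from the kinetic energy (3 degrees of freedom each)."""
    twice_ke = sum(
        mass * AMU * (vx * vx + vy * vy + vz * vz)
        for mass, (vx, vy, vz) in zip(masses_amu, velocities)
    )
    return twice_ke / (3 * len(masses_amu) * K_B)
```

For a finite system the instantaneous temperature only fluctuates around the
target value, which is why the text speaks of the average temperature being
380 K.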
The average energies of the final run at the constant temperature of 175 K were
used to determine which structures to select. The 5 lowest energy structures
were submitted in energy order, i.e. model 1 is the lowest energy structure.
Pan (P0032) - 164 predictions: 99 3D, 65 SS
Secondary Structure Prediction
Yang Han and Xian-Ming Pan
Institute of Biophysics, CAS
xmpan@sun5.ibp.ac.cn
Protein secondary structure predictions have been performed by various
methods [1-17]. We have employed a multiple linear regression method [18]
developed in our previous work to predict protein secondary structures of the
targets in CASP5.
Algorithms
For each of the CASP5 sequences, a multiple sequence alignment was first
constructed using PSIBLAST. The protein sequence databases searched by
PSIBLAST were SWISS-PROT 40, TrEMBL 21 and PDB, with three iterations.
Each resulting group of sequences was then screened in order to exclude the
sequences whose homology to the query was too high or too low. Moreover, the
gaps in the query sequences were deleted to improve the prediction. CLUSTALW
was then executed with default parameters to generate multiple sequence
alignments.
We took four profiles as inputs for the multiple linear regression algorithm,
which were extracted from both alignments mentioned above:
1. Generating the FASTA format of multiple alignment files from the
results of CLUSTALW.
2. Using HMMER2 [23] package to generate position specific profiles
from alignments of CLUSTALW.
3. The simple frequency counts for each amino acid in the PSIBLAST
alignment expressed as percentage of the total for a given column.
4. Each amino acid residue in PSIBLAST alignment was scored by its
corresponding BLOSUM62 matrix score. The scores were averaged
based on the number of sequences in that column.
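The frequency profile (the third input) can be sketched as follows. This is a
minimal illustration that assumes the alignment is given as equal-length strings
and that gap characters are left out of the column total; the function name and
the gap handling are assumptions, not details taken from the method.

```python
from collections import Counter


def column_frequencies(alignment, gap="-"):
    """Per-column amino acid frequencies as percentages of the non-gap total."""
    n_cols = len(alignment[0])
    profile = []
    for col in range(n_cols):
        counts = Counter(seq[col] for seq in alignment if seq[col] != gap)
        total = sum(counts.values())
        if total == 0:
            profile.append({})  # column contains only gaps
        else:
            profile.append({aa: 100.0 * n / total for aa, n in counts.items()})
    return profile
```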
Then, the prediction was executed with these four profiles as inputs. The
outputs were assessed for consensus at each position. Positions where there
was full agreement in the predicted state were taken as the final prediction.
For positions where the predictions did not coincide, the most popular
prediction was taken as the final prediction.
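The consensus step amounts to a per-position majority vote, sketched below. The
one-letter state strings and the tie-breaking rule (the state seen first among
the tied ones wins) are assumptions made for the example; the text does not say
how ties between the four predictions were resolved.

```python
from collections import Counter


def consensus(predictions):
    """Majority vote over equal-length secondary structure strings."""
    length = len(predictions[0])
    if any(len(p) != length for p in predictions):
        raise ValueError("all predictions must have the same length")
    result = []
    for column in zip(*predictions):
        counts = Counter(column)
        top = max(counts.values())
        # Assumed tie-break: keep the first state that reaches the top count.
        result.append(next(state for state in column if counts[state] == top))
    return "".join(result)


print(consensus(["HHEC", "HHEC", "HCEC", "HHCC"]))  # prints HHEC
```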
It is known that protein secondary structure prediction can be improved by
exploiting the evolutionary information from protein families [17,19-22]. With
the application of multiple sequence alignment profiles, we also hope to
improve our prediction results.
1. Maggio E.T. and Ramnarayan K. (2001). Recent developments in
computational proteomics. Trends in Biotech. 19, 266-272.
2. Cuff J.A. and Barton G.J. (1999). Evaluation and improvement of multiple
sequence methods for protein secondary structure prediction. Proteins 34,
508-519.
3. Rost B. (1996). PHD: predicting one-dimensional protein structure by
profile based neural networks. Methods Enzymol. 266, 525-539.
4. Przybylski D. and Rost B. (2002). Alignments grow, secondary structure
prediction improves. Proteins 46, 197-205.
5. Ouali M. and King R.D. (2000). Cascaded multiple classifiers for
secondary structure prediction. Protein Sci. 9, 1162-1176.
6. Jones D.T. (1999). Protein secondary structure prediction based on
position-specific scoring matrices. J. Mol. Biol. 292, 195-202.
7. Karplus K., Barrett C., Cline M., Diekhans M., Grate L. and Hughey R.
(1999). Predicting protein structure using only sequence information.
Proteins Suppl. 3, 121-125.
8. Baldi P., Brunak S., Frasconi P., Soda G. and Pollastri G. (1999).
Exploiting the past and the future in protein secondary structure prediction.
Bioinformatics 15, 937-946.
9. Rost B. and Sander C. (1996). Bridging the protein sequence-structure gap
by structure predictions. Annu. Rev. Biophys. Biomol. Struct. 25, 113-136.
10. Rost B. and O’Donoghue S.I. (1997). Sisyphus and prediction of protein
structure. CABIOS 13, 345-356.
11. Szent-Györgyi A.G. and Cohen C. (1957). Role of proline in polypeptide
chain configuration of proteins. Science 126, 697.
12. Rost B. and Sander C. (1995). Progress of 1D protein structure prediction
at last. Proteins 23, 295-300.
13. Rost B. (1997). Better 1D predictions by experts with machines. Proteins
Suppl. 1, 192-197.
14. Eyrich V.A., Marti-Renom M.A., Przybylski D., Madhusudhan M.S., Fiser
A., Pazos F., Valencia A., Sali A. and Rost B. (2001). EVA: continuous
automatic evaluation of protein structure prediction servers. Bioinformatics
17, 1242-1243.
15. Cuff J.A., Clamp M.E., Siddiqui A.S., Finlay M. and Barton G.J. (1998).
JPred: a consensus secondary structure prediction server. Bioinformatics
14, 892-893.
16. McGuffin L.J., Bryson K. and Jones D.T. (2000). The PSIPRED protein
structure prediction server. Bioinformatics 16, 404-405.
17. Rost B. and Sander C. (1993). Prediction of protein secondary structure at
better than 70% accuracy. J. Mol. Biol. 232, 584-599.
18. Pan X.M., (2001). Multiple linear regression for secondary structure
prediction. Proteins 43, 256-259.
19. Zvelebil M.J.J.M., Barton G.J., Taylor W.R. and Sternberg M.J.E. (1987).
Prediction of protein secondary structure and active sites using alignment of
homologous sequences. J. Mol. Biol. 195, 957-961.
20. King R.D. and Sternberg M.J.E. (1996) Identification and application of
the concepts important for accurate and reliable protein secondary structure
prediction. Protein Sci. 5, 2298-2310.
21. Salamov A.A. and Solovyev V.V. (1995). Prediction of protein secondary
structure by combining nearest-neighbor algorithms and multiple sequence
alignments. J. Mol. Biol. 247, 11-15.
22. Frishman D. and Argos P. (1996). Incorporation of non-local interactions in
protein secondary structure prediction from the amino acid sequence.
Protein Eng. 9, 133-142.
23. Eddy S.R. (1999) HMMer2. http://hummer.wustl.edu/.
Pas (P0513) - 73 predictions: 73 3D
Multimethod Protein Structure Prediction
J. Pas
Independent predictor
kuba@bioinfo.pl
To determine whether the structure of a target protein could be predicted by
homology modeling, a PSI-BLAST [1] search was carried out against the
non-redundant protein sequence database. PSI-BLAST iterations were performed
using a manual inclusion/exclusion procedure.
After that, a multiple sequence alignment was built with the clustalw [2]
program from proteins selected from the PSI-BLAST profile. All alignments were
manually inspected.
Selection of the template was confirmed using the structure prediction
METASERVER [3]. METASERVER was also used to choose a template when
no significant hits were found in PSI-BLAST searches.
In addition, other available information was used in an attempt to link the
target with a protein of known structure: mainly literature searches, known
metabolic pathways, gene expression data, position on the chromosome,
distribution of folds in the organism and secondary structure prediction.
Selected target–template structural alignments were visually inspected in
SWISS PDB Viewer and modified if necessary. Molecular 3D models were
then built using both the SWISS-MODEL [4] and MODELLER [6] programs.
Initial models were subjected to detailed evaluation, mainly by additional
visual inspection of structural consistency and with the Verify 3D program [5].
The same evaluation procedure was performed for the final models.
More than one template protein was used where possible, after superimposition
of their molecular structures with the 3d hit program [Plewczynski in press].
During the modeling procedure, superimposition of initial models was used to
find the best possible backbone conformation.
The overall quality of each modeled structure was evaluated in detail with the
Verify 3D program.
1. Altschul S.F. et al. (1997) Gapped BLAST and PSI-BLAST: a new
generation of protein database search programs. Nucleic Acids Res. 25
(17), 3389-3402.
2. Thompson J.D. et al. (1994) CLUSTAL W: improving the sensitivity of
progressive multiple sequence alignment through sequence weighting.
Nucleic Acids Res. 22, 4673-4680.
3. Bujnicki J.M., Elofsson A., Fischer D., Rychlewski L. (2001) Structure
prediction meta server. Bioinformatics 17 (8), 750-751.
4. Guex N., Peitsch M.C. (1997) SWISS-MODEL and the Swiss-PdbViewer:
an environment for comparative protein modeling. Electrophoresis 18 (15),
2714-2723.
5. Luthy R., Bowie J.U., Eisenberg D. (1992) Assessment of protein models
with three-dimensional profiles. Nature 356, 83-85.
6. Sali A., Blundell T.L. (1993) Comparative protein modeling by satisfaction
of spatial restraints. J. Mol. Biol. 234, 779-815.
PILOT (P0378) - 146 predictions: 146 3D
PILOT: a Fold Recognition Server Based on PSI-BLAST,
IMPALA and Libra-Rotamer
K. Tomii1, M. Ota2, T. Noguchi1, and Y. Akiyama1
1 – Computational Biology Research Center,
National Institute of Advanced Industrial Science and Technology
2 – Global Scientific Information and Computing Center,
Tokyo Institute of Technology
k-tomii@aist.go.jp
We have constructed the automated fold recognition server, PILOT, that
integrates the prediction results of PSI-BLAST [1], IMPALA [2], and
Libra-rotamer [3]. To estimate the prediction accuracy of this server we have
submitted the prediction results for CASP5 targets.
The PILOT server has the following two characteristic features. (1) 1D-PSSMs
and 3D-PSSMs are combined in IMPALA. (2) Candidates are re-evaluated by
using the threading potentials through the ‘remount’ procedure [4]. By using
these methods, we have slightly improved the recognition rate of protein
structural similarity. The template sequences whose E-values estimated by
PSI-BLAST and IMPALA are less than or equal to 10 are selected as candidates.
The sequences of templates are derived from the SCOP [5] database of release
1.59 (15 May 2002). These templates have no more than 50% sequence identity
with one another, a length of at least 40 residues and no chain break in the
structure. These templates are selected according to the priority specified
with our automatic selection system “PDB-REPRDB” [6].
From the preliminary results we have assigned seven regions in the
E-value-Z-score plane according to their confidence level. Here, the E-value is
estimated by PSI-BLAST and/or IMPALA. The Z-score is calculated through the
‘remount’ process. Final results are decided and reported by using this plane.
The server is available at http://www.cbrc.jp/pilot/.
1. Altschul S.F. et al. (1997) Gapped BLAST and PSI-BLAST: a new
generation of protein database search programs. Nucleic Acids Res. 25
(17), 3389-3402.
2. Schäffer A.A. et al. (1999) IMPALA: matching a protein sequence against
a collection of PSI-BLAST-constructed position-specific score matrices.
Bioinformatics 15 (12), 1000-1011.
3. Ota M. et al. (2001) Knowledge-based potential defined for a rotamer
library to design protein sequences. Protein Eng. 14 (8), 557-564.
4. Ota M. et al. (1999) Feasibility in the inverse protein folding protocol.
Protein Science 8 (5), 1001-1009.
5. Lo Conte L. et al. (2002) SCOP database in 2002: refinements
accommodate structural genomics. Nucleic Acids Res. 30 (1), 264-267.
6. Noguchi T. et al. (2001) PDB-REPRDB: a database of representative
protein chains from the Protein Data Bank (PDB). Nucleic Acids Res.
29 (1), 219-220.
POMI (P0465) - 46 predictions: 46 3D
Building Full Atom Models by Using Alignment from Fold
Recognition Methods
Rubinstein Rotem
Ben-Gurion University
rotemr@cs.bgu.ac.il
Full atom models were built using the CAFASP fold-recognition servers'
results and SwissPdb Viewer. The goal was to test the accuracy of a modeling
strategy based on alignments taken from the CAFASP servers. The focus was
on those targets that cannot be automatically modeled by Swiss-Model because
of their low sequence similarity to known structures, but that had relatively
confident scores reported by the fold-recognition servers.
Comparative modeling methods give us models for less than 20% of protein
sequences. The fully automatic Swiss Model server supplied results for twenty
targets in CASP5. Therefore it is important to use other methods, such as fold
recognition and ab initio. These methods enable us to broaden the number of
targets for which we get results. However, many of these methods do not supply
us with full atom models. I tried to address this problem by modeling with
alignments from fold recognition servers. Alignments were selected by
comparing the scores of the servers’ models with the confidence thresholds
compiled from LiveBench4. The modeling was carried out with the SwissPdb
Viewer.
1. Guex N., and Peitsch M.C. (1997) SWISS-MODEL and the
Swiss-PdbViewer: An environment for comparative protein modeling.
Electrophoresis 18, 2714-2723.
2. http://www.expasy.org/spdbv/
3. http://www.cs.bgu.ac.il/~dfischer/CAFASP3/summaries/thresholds.html
Preissner (P0488) - 20 predictions: 20 3D
Loop Modeling with the Aid of LIP
E. Michalsky, A. Goede and R. Preissner
Berlin Center of Genome Based Bioinformatics, Charité, Medical Faculty of
the Humboldt University Berlin, Germany
elke.michalsky@charite.de
We participated in this year’s CASP experiment to evaluate the applicability
and quality of our new loop construction procedure. Here we focus on this
algorithm and sketch the remainder of the conventional homology modeling
procedure.
One of the most important and challenging parts of protein modeling is the
prediction of loops, as can be seen from the large variety of existing
approaches. Van Vlijmen et al. [1], e.g., present a knowledge based approach,
where a set of loops is selected from a database, followed by a constrained
optimization of the loop orientation and ranking by means of an energy
function. Another approach is the class of so-called ab initio methods. Fiser
et al. [2] optimize the positions of all nonhydrogen atoms with respect to a
pseudo energy function, supplemented with statistical preferences for dihedral
angles and for nonbonded atomic contacts. The algorithm of Tosatto et al. [3]
is based on a divide and conquer approach where the target loop is recursively
decomposed until the conformations of the resulting segments can be compiled
analytically, and uses a database of possible conformations for loop segments.
The conformations were anticipated using a list of (phi,psi)-angle pairs
extracted from the PDB. CODA, an algorithm presented by Deane et al. [4],
combines a knowledge based and an ab initio method by clustering the
predictions of the two algorithms and making a consensus prediction using a
set of filters.
To handle gaps, i.e. insertions as well as deletions in the alignment, we used the
tool LIP (Loops In Proteins) [5]. The program LIP is based on a
comprehensive compilation of approximately 10^8 backbone conformations
from a recent version of the Protein Data Bank [6]. In the first step, protein
segments are selected that fit approximately into the gap in the protein structure
and that have the required number of amino acids. In order to evaluate the
fit, a goodness is calculated for each segment. The goodness is defined as
the RMSD between a loop candidate and the gap in the protein structure with
respect to the distance between the stem residues and several dihedral
angles. This extraction procedure takes at most 15 seconds on a standard PC.
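The first selection step can be pictured with the simplified sketch below. It
keeps database segments of the required length whose end-to-end distance
matches the distance between the stem residues, and ranks them by the mismatch.
The data layout, the tolerance and the use of the stem distance alone (without
the dihedral angle terms of the goodness) are assumptions for illustration and
do not reproduce LIP itself.

```python
import math


def select_loop_candidates(segments, gap_length, stem_start, stem_end, tolerance=1.0):
    """Return sorted (mismatch, segment_id) pairs for segments that roughly fit.

    segments: iterable of (segment_id, ca_coords), where ca_coords lists the
    C-alpha positions of the loop residues plus one stem residue at each end.
    """
    target = math.dist(stem_start, stem_end)
    hits = []
    for segment_id, ca_coords in segments:
        if len(ca_coords) != gap_length + 2:
            continue  # wrong number of amino acids for this gap
        mismatch = abs(math.dist(ca_coords[0], ca_coords[-1]) - target)
        if mismatch <= tolerance:
            hits.append((mismatch, segment_id))
    return sorted(hits)
```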
Thereafter, the selected protein segments are evaluated using an optimized
scoring function. Besides the goodness, it includes additional values, i.e. the
RMSD between the stem residues as well as a sequence alignment score based
on a modified BLOSUM mutation matrix. In particular, exchanges of glycine
and proline with other amino acids are treated individually. Clashes of the new
loop with the core of the protein are avoided. The best-ranked segment is
inserted into the gap between adjacent secondary structures, followed by a local
geometry optimization. For the homology modeling approach LIP is combined
with several publicly available tools.
The first step in homology modeling is to identify suitable templates. For this
purpose, we performed searches of the Protein Data Bank with the alignment
search tools BLAST and PSI-BLAST [6-7]. A proper alignment
between the target and template amino acid sequences is one of the major
components in a structure constructed by comparative modeling. To obtain
reasonable alignments using the entire available protein family information, we
used STRAP, a tool for generating multiple structure based
alignments, developed in our research group [8]. Mutations, side chain rotamer
selection and energy minimizations were performed by means of the protein
visualization and modeling tool Swiss-PdbViewer, version 3.7b [9].
LIP is embedded in a graphical user interface and will be available after
publication. A demo version for Windows can be downloaded from
http://www.protein-design.com.
1. van Vlijmen H.W.T. et al. (1997) PDB-based Protein Loop Prediction:
Parameters for Selection and Methods for Optimization. J. Mol. Biol. 267
(4), 975-1001.
2. Fiser A. et al. (2000) Modeling of loops in protein structures. Protein Sci. 9
(9), 1753-1773.
3. Tosatto S.C.E. et al. (2002) A divide and conquer approach to fast loop
modeling. Protein Eng. 15 (4), 279-286.
4. Deane C.M. et al. (2001) CODA: A combined algorithm for predicting the
structurally variable regions of protein models. Protein Sci. 10 (3),
599-612.
5. Michalsky E. et al. (in preparation).
6. Berman H.M. et al. (2000) The Protein Data Bank. Nucleic Acids Res. 28,
235-242.
7. Altschul S.F. et al. (1997) Gapped BLAST and PSI-BLAST: a new
generation of protein database search programs. Nucleic Acids Res. 25
(17), 3389-3402.
8. Gille C. et al. (2001) STRAP: editor for STRuctural Alignments of
Proteins. Bioinformatics 17 (4), 377-378.
9. Guex N. et al. (1997) SWISS-MODEL and the Swiss-PdbViewer: An
environment for comparative protein modeling. Electrophoresis 18,
2714-2723.
Protfinder (P0282) - 222 predictions: 222 3D
Sequence-Structure Alignments With the Protfinder
Algorithm
U. Bastolla
Centro de Astrobiologia (INTA_CSIC), Madrid, Spain
bastollau@inta.es
The Protfinder algorithm predicts protein structures by aligning the query
sequence to candidate structures in the PDB. Alignments are evaluated through
a minimal model of protein folding, which reproduces approximately some key
features of protein thermodynamics and is very convenient for rapid
computation.
Information on sequence homology is not used in the scoring function.
Nevertheless, when sequence homologs are present in the structure database,
they are in almost all cases predicted as the best scoring alignment.
Protein structures are represented as contact maps and their effective
intramolecular interactions are modeled as a sum of contact interactions. We
use the contact energy function optimized in Ref. [1], which assigns lowest
energy to the experimentally known native structure for almost every sequence
of monomeric protein whose structure has been determined by X-ray
crystallography, except small fragments and chains with large cofactors.
Moreover, it generates well-correlated energy landscapes, in the sense that
structures very dissimilar from the native one have energies much higher than
the native energy. This property is crucial for protein structure prediction. The
effective energy function is also able to estimate the folding free energies of a
set of small proteins folding with two-state thermodynamics, with reasonable
agreement with experimental data [2].
The scoring function consists of three elements: the effective energy function
described above, a chain entropy term estimated in Ref. [2] and a term
penalizing gaps in the alignment. Gaps in secondary structure elements are
strictly forbidden. Gaps in the structure are allowed only if the two residues
that are shortcut are close in space and the angles characterizing their
pseudo-peptidic bond lie within a predefined range. Gaps in the sequence are
allowed only on the surface of the protein, which is identified by the fact
that the number of contacts per residue is smaller than a threshold. Allowed
gaps receive an energetic penalty G0 plus a penalty G1 for each residue in the
gap.
To speed up the computation, each structure in the PDBSELECT [3]
non-redundant subset of the PDB was preprocessed to produce its contact map and
the list of allowed shortcuts in the structure. Secondary structure was obtained
from the DSSP file [4] when available, otherwise from the PDB file. The few
structures for which no secondary structure assignment could be obtained were
discarded. Preprocessing, together with the fact that the code uses mostly
integer arithmetic, considerably speeds up the computation.
To search for the optimal alignment, we use a stochastic version of the
deterministic Build-up algorithm developed by Park and Levitt to look for low
energy configurations of discrete protein models [5]. The algorithm is very
efficient at finding high-scoring alignments, although it is not guaranteed to
find the global optimum.
The algorithm starts by generating all possible gapless alignments of length l
between the query sequence and the test structure and stores the M alignments
with maximum score. At each subsequent step, an attempt is made to add the
residue at position k in the sequence to the M alignments. There are three
possibilities: either the residue is aligned to the next structural position, or it is
aligned introducing a gap in the structure (if allowed), or the residue is not
aligned, initiating a gap in the sequence. All possible continuations are
generated, and the M best scoring alignments are stored in memory and used as
seeds for the next step. The algorithm is iterated until no other residue can be
added.
Some tricks are used to improve the efficiency of the algorithm: 1) The
algorithm is first applied using a small value M=50 to scan rapidly the whole
database. The 200 proteins with the best alignments are then stored in memory
and used for a second more accurate search with M=800. 2) Instead of using the
deterministic algorithm described above, we select the M alignments at each
step based on the sum of their score plus a random number. The relative
importance of the randomness is large in the first steps, allowing the algorithm
to visit a larger fraction of the alignment space instead of constructing very
similar alignments. Then the randomness decreases as the alignments get
longer, so that the choice of the complete alignment is made on the basis of the
deterministic score. 3) Since the construction of the starting fragment is the
most delicate step, the algorithm is applied using two or three different values
of l.
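The stochastic selection of the M seed alignments can be sketched as below. The
linear decay of the random term with the step number, the uniform random
numbers and all names are assumptions made for the example; the text only
states that the randomness is large at first and decreases as alignments grow.

```python
import random


def noise_scale(step, total_steps, initial):
    """Hypothetical linear decay of the random term from `initial` down to zero."""
    return initial * (1.0 - step / total_steps)


def select_seeds(candidates, m, noise, rng):
    """Keep the m candidates with the highest (score + random term).

    candidates: list of (score, alignment) pairs; a higher score is better.
    With noise == 0 this reduces to the deterministic Build-up selection.
    """
    noisy = [(score + noise * rng.random(), score, aln) for score, aln in candidates]
    noisy.sort(key=lambda item: item[0], reverse=True)
    return [(score, aln) for _, score, aln in noisy[:m]]
```

Each stored pair keeps its deterministic score, so the final choice among
complete alignments can still be made on that score alone.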
Each candidate structure receives the score of its best alignment. The best
scoring structure is used as the prediction. The goodness of the prediction is
estimated through the normalized energy gap, a parameter measuring the
difference between the best score and the score of an alternative structure in
units of the best score, divided by the structural distance between the best
scoring structure and the alternative structure. If the minimal value of the
normalized energy gap over all alternative structures is large (larger than 0.2),
the prediction is reliable. If it is small, alignments with very different
structures have scores quite similar to the best one and reliability is very low.
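The reliability estimate can be written out as in the sketch below, assuming
that a higher score is better and that the structural distance is a positive
number supplied by the caller. The function names and the input layout are
hypothetical; only the definition of the gap and the 0.2 threshold come from
the text.

```python
def normalized_energy_gap(best_score, alternatives):
    """Minimum over alternatives of ((best - alt) / |best|) / structural distance.

    alternatives: iterable of (score, structural_distance) pairs.
    """
    gaps = [
        ((best_score - score) / abs(best_score)) / distance
        for score, distance in alternatives
        if distance > 0
    ]
    return min(gaps)


def is_reliable(gap, threshold=0.2):
    """A prediction is taken as reliable when the minimal gap exceeds 0.2."""
    return gap > threshold
```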
1. Bastolla U. et al. (2000) A statistical mechanical method to optimize
energy functions for protein folding. Proc. Natl. Acad. Sci. USA 97,
3977-3981
2. Bastolla U. Testing the thermodynamics of a minimal model of protein
folding, in preparation
3. Hobohm U. and Sander C. (1994) Enlarged representative set of protein
structures. Protein Sci. 3, 522-524
4. Kabsch W. and Sander C. (1983) Dictionary of protein secondary
structure: pattern recognition of hydrogen-bonded and geometrical
features. Biopolymers 22 (12), 2577-2637
5. Park B.H. and Levitt M. (1995) The complexity and accuracy of discrete
state models of protein structure. J. Mol. Biol. 249, 493-507
PROTINFO-AB (P0140) - 260 predictions: 260 3D
An Automated Approach for De Novo Structure Prediction
Ram Samudrala, Shing-Chung Ngan
University of Washington
ram@compbio.washington.edu, ngan@compbio.washington.edu
Our general paradigm for predicting structure involves sampling the
conformational space (or generating "decoys") such that native-like
conformations are observed, and then selecting them using a hierarchical
filtering technique using many different scoring functions. Our goal was to
devise a method that would combine the best aspects of the more successful ab
initio methods at the previous CASP experiments. There are three stages to our
approach:
1. SECONDARY STRUCTURE PREDICTION: The consensus of the
secondary structure predictions from the various servers at the CAFASP
meta-server was used as the secondary structure prediction.
2. SEARCHING PROTEIN CONFORMATIONAL SPACE: We initially start
with an all-atom conformation where residues predicted to be in helix/sheet by
the consensus secondary structure prediction are set to idealised helix and sheet
values. The remaining φ/ψ values are set in an extended conformation. Side
chain conformations are predicted by simply using the most frequently
observed rotamer in a database of protein structures [1]. New conformations
are generated by perturbing the existing conformation at a random residue
position using either a value from a 14-state φ/ψ model, or replacing three φ/ψ
values for three residues with identical sequence which are obtained from a
database of known structures. The optimization function used is primarily a
combination of an all-atom distance-dependent conditional probability
discriminatory function (rapdf) and a hydrophobic compactness function (hcf)
[2,3]. The fitness of the conformations was optimised by using two different
protocols: a straightforward Monte Carlo/simulated annealing approach [4]
combined with a Genetic Algorithm strategy, and a conformational space
annealing approach [5]. A combination of minimisation parameters and
scoring functions was used to generate a large pool of conformations.
3. SELECTION OF NATIVE-LIKE CONFORMATIONS: The conformations
generated were minimised using ENCAD [6] and scored using a combination
of scoring functions that hierarchically reduces the total number of
conformations produced to five which are used for the final submissions. The
scoring functions used for the final filtering include a simple residue-residue
contact function (Shell), a density-scoring function that is based on the distance
of a conformation to all its relatives in the conformation pool, a secondary
structure based scoring function that evaluates the match between the predicted
structure and the secondary structure of a final energy-minimised conformation,
and standard physics-based electrostatics and Van der Waals terms.
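The hierarchical reduction of the conformation pool can be sketched as a chain
of filters, each keeping the best-scoring fraction of what the previous one let
through. The stage sizes, the convention that lower scores are better and the
function names are assumptions for illustration; the actual scoring functions
and cut-offs are those described in the text and are not reproduced here.

```python
def hierarchical_filter(pool, stages):
    """Apply scoring functions in turn, keeping the `keep` best after each.

    pool: iterable of conformations.
    stages: list of (score_fn, keep) pairs; a lower score is assumed better.
    """
    survivors = list(pool)
    for score_fn, keep in stages:
        survivors = sorted(survivors, key=score_fn)[:keep]
    return survivors
```

A final stage with keep set to 5 would leave the five conformations used for
submission.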
This work is an attempt at combining the best de novo prediction methods from
the previous CASP experiments [2-5]. In addition, there are components that
are unique to this approach primarily in the form of the hierarchical filtering
methodology employed, the density scoring function, and in subtle variations of
each of the search methods.
1. Samudrala R., Huang E.S., Koehl P., Levitt M. (2000) Side chain
construction on near-native main chains for ab initio protein structure
prediction. Prot Eng 7: 453-457.
2. Xia Y., Huang E.S., Levitt M., Samudrala R. (2000) Ab initio construction
of protein tertiary structures using a hierarchical approach. J Mol Biol 300:
171-185.
3. Samudrala R., Levitt M. (2002) A comprehensive analysis of 40 blind
protein structure predictions. BMC Structural Biology 2: 3-18.
4. Simons K.T., Kooperberg C., Huang E., Baker D. (1997) Assembly of
protein tertiary structures from fragments with similar local sequences
using simulated annealing and bayesian scoring functions. J Mol Biol 268:
209-225.
5. Lee J., Liwo A., Scheraga H.A. (1999) Energy-based de novo protein
folding by conformational space annealing and an off-lattice united-residue
force field: application to the 10-55 fragment of staphylococcal protein A
and to apo calbindin D9K. Proc Natl Acad Sci USA 96: 2025-2030.
6. Levitt M., Hirshberg M., Sharon R., Daggett V. (1995) Potential energy
function and parameters for simulations of the molecular dynamics of
proteins and nucleic acids in solution. Comp Phys Comm 91: 215-231.
PROTINFO-CM (P0138) - 251 predictions: 251 3D
An Automated Approach for the Comparative Modeling of
Protein Structure
Ram Samudrala
University of Washington
ram@compbio.washington.edu
The interconnected nature of interactions in protein structures, thorough
sampling of side chain and main chain conformations, and devising a
discriminatory function that can distinguish between correct and incorrect
conformations are the major hurdles preventing the construction of accurate
homology models. We present an algorithm that uses graph theory to handle the
problem of interconnectedness. Sampling of side chain and main chain
conformations is accomplished by exhaustively enumerating all possible
choices using a discrete state model, including fragments from a database of
protein structures. The optimal combination of these possibilities is selected
using an all-atom scoring function aided by the graph-theoretic approach.
Following is a brief description of the components and steps of this method,
which can be divided into: discriminatory function, identification of template
and generation of alignment, initial model building, construction of variable
main chain and side chain regions, and moving models closer to the native
conformation.
1. DISCRIMINATORY FUNCTION: the function used throughout generally is
an all-atom distance-dependent conditional probability discriminatory function
based on a statistical analysis of known protein structures. The negative log of
the conditional probability of observing two atoms interact given a particular
distance is used as a "pseudo-energy" term [1].
2. IDENTIFICATION OF TEMPLATE AND GENERATION OF
ALIGNMENT: The CAFASP meta-server data (http://bioinfo.pl/cafasp) were
used to identify the template proteins that a given target sequence was related to
(based on a consensus of all the hits produced by the different servers). The
templates were then fed into a multiple sequence alignment method
(CLUSTALW [2]) and the pairwise alignments between the target and each of
the templates were used to construct initial models. The initial models were
then ranked by our discriminatory function and the models that ranked highest
were used for further model-building. In addition to these initial models, a
model based on the alignment derived from a structure comparison of the best
scoring model output from our de novo fold generation method (see the abstract
for PROTINFO-FR) to the corresponding template structure was also used.
3. INITIAL MODEL BUILDING: Following the sequence alignment, for each
parent structure, an initial model was generated by copying atomic coordinates
for the main chain (excluding any insertions) and for the side chains of residues
that are identical in the target and parent structures. Residues that differ in type
were constructed using a minimum perturbation technique. The MP method
changes a given amino acid to the target amino acid preserving the values of
equivalent chi angles between the two side chains, where available. The other
chi angles are constructed by the MP method using an internally developed
library based on residue type.
4. CONSTRUCTION OF VARIABLE MAIN CHAIN AND SIDE CHAIN
REGIONS: Main chain sampling is performed using an exhaustive enumeration
technique based on discrete states of φ/ψ angles. For longer main chain regions,
we use fragments (3-tuples) from a database of protein structures to generate
the discrete φ/ψ angles.
Side chain possibilities are generated by selecting the most probable side chain
rotamers based on the interactions of a given rotamer with the local main chain
(evaluated using the discriminatory function above) [3]. Side chain
possibilities were also constructed using the program SCWRL [4].
We then use a graph-theoretic approach to assemble the sampled side chain and
main chain conformations together in a consistent manner. Each possible
conformation of a residue is represented using the notion of a node in a graph.
Each node is given a weight based on the degree of the interaction between its
side chain atoms and the local main chain atoms. The weight is computed
using an all-atom conditional probability discriminatory function. Edges are then
drawn between pairs of residues/nodes that are consistent with each other (i.e.,
clash-free and satisfying geometrical constraints). The edges are also weighted
according to the probability of the interaction between atoms in the two
residues. Once the entire graph is constructed, all the maximal sets of
completely connected nodes (cliques) are found using a clique-finding
algorithm. The cliques with the best probabilities represent the optimal
combinations of mixing and matching between the various possibilities, taking
the respective environments into account [5]. Clique-finding is accomplished
using the Bron and Kerbosch algorithm [6]. All models used were refined
using ENCAD [7].
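The clique-finding step can be illustrated with a short sketch. This is a minimal version of the basic Bron and Kerbosch recursion [6] without pivoting, not the authors' implementation. The toy graph, the node labels (a letter for the residue, a digit for the candidate conformation) and the function name bron_kerbosch are assumptions made for illustration, and the node and edge weights used to rank cliques are omitted.

```python
def bron_kerbosch(r, p, x, adj, out):
    """Append every maximal clique of the graph `adj` to `out`.

    r: nodes in the clique being grown, p: candidates that can still
    extend it, x: nodes already explored (used to reject non-maximal sets).
    """
    if not p and not x:
        out.append(frozenset(r))
        return
    for v in list(p):
        bron_kerbosch(r | {v}, p & adj[v], x & adj[v], adj, out)
        p = p - {v}
        x = x | {v}


# Hypothetical compatibility graph: an edge means two conformations are
# clash-free with each other. Two conformations of the same residue
# (A1 and A2, B1 and B2) are never connected.
adj = {
    "A1": {"B1", "C1"},
    "A2": {"B2"},
    "B1": {"A1", "C1"},
    "B2": {"A2", "C1"},
    "C1": {"A1", "B1", "B2"},
}

cliques = []
bron_kerbosch(set(), set(adj), set(), adj, cliques)
for clique in sorted(cliques, key=len, reverse=True):
    print(sorted(clique))
```

In this toy graph only one clique contains a conformation for each of the three residues, so it is the only one that would correspond to a complete model.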
5. MOVING MODELS CLOSER TO THE NATIVE CONFORMATION:
Once we had generated a final model for each parent, we used an off-lattice
14-state φ/ψ model and a sequential build-up algorithm to generate structures
around the conformational space of the final model. We then used our scoring
function to select the best ranking ones. The goal here is that some of the
conformations sampled would actually be closer to the native conformation and
that our scoring function will be able to select them.
We test how the above approach works in a comparative modelling scenario
and assess the predictive power of this method by applying it to properly
controlled blind tests as part of the fifth meeting on the Critical Assessment of
protein Structure Prediction methods (CASP5). Compared to CASP2-4, where
a similar approach was used [8], we have improved the method used to sample
main chains and have made minor enhancements to the other components of
this approach including the scoring function. It remains to be seen how the
improvements in methodology correlate with model accuracy.
1. Samudrala R., Moult J. (1998) An all-atom distance dependent conditional
probability discriminatory function for protein structure prediction. J Mol
Biol 275: 893-914.
2. Thompson J.D., Higgins D.G., Gibson T.J. (1994) CLUSTALW:
Improving the sensitivity of progressive multiple sequence alignment
through sequence weighting, position-specific gap penalties and weight
matrix choice. Nucleic Acids Res 22: 4673-4680.
3. Samudrala R., Moult J. (1998) Determinants of side chain conformational
preferences in protein structures. Prot Eng 11: 991-997.
4. Bower M.J., Cohen F.E., Dunbrack R.L. (1997) Prediction of side-chain
orientations from a backbone-dependent rotamer library: A new homology
modelling tool. J Mol Biol 267: 1268-1282.
5. Samudrala R., Moult J. (1998) A graph-theoretic algorithm for
comparative modelling of protein structure. J Mol Biol 279: 287-302.
6. Bron C., Kerbosch J. (1973) Algorithm 457: Finding all cliques of an
undirected graph. Communications of the ACM 16: 575-577.
7. Levitt M., Hirshberg M., Sharon R., Daggett V. (1995) Potential energy
function and parameters for simulations of the molecular dynamics of
proteins and nucleic acids in solution. Comp Phys Comm 91: 215-231.
8. Samudrala R., Levitt M. (2002) A comprehensive analysis of 40 blind
protein structure predictions. BMC Structural Biology 2: 3-18.
PROTINFO-FR (P0139) - 325 predictions: 325 3D
An Automated Approach for De Novo Fold Generation
Ram Samudrala
University of Washington
ram@compbio.washington.edu
This is a completely novel and automated approach, based on the idea of
driving a particular protein folding simulation towards a particular fold. The
idea was derived from the observation that among distant homology
recognition programs, at least one could identify the correct template for every
CASP target, even if the alignment was incorrect. The logic here is that if a
template fold could be identified, we can use our de novo simulation approach
to guide the conformation towards the fold, in conjunction with our scoring
functions.
The advantage of this approach is that it completely does away with the issues
of alignment, the building of non-conserved side chains and main chains, and
the use of a fixed template to construct a model. This enables us to circumvent
explicit bias to the homologous parent structure (usually a problem in
comparative modelling/fold recognition methods since there is no easy
approach to move a model based on a template closer to its native structure).
We expect that this approach will perform well on cases where the sequence
similarity between two proteins is not very high, in terms of improving the
alignment, as well as obtaining a better conformation for the global fold.
Our general paradigm for predicting a fold involves sampling the
conformational space (or generating "decoys") such that native-like
conformations are observed, and then selecting them using a hierarchical
filtering technique using many different scoring functions. There are four
stages to our approach:
1. IDENTIFICATION OF THE TEMPLATE: The CAFASP meta-server data
were used to identify the template proteins that a given target sequence was
related to (based on a consensus of all the hits produced by the different
servers). The templates were then fed into a multiple sequence alignment
method (CLUSTALW [1]) and the pairwise alignments between the target and
each of the templates were used to construct initial models. The initial models
were then ranked by our discriminatory function and the models that ranked
highest were considered candidates for the template structure.
2. SECONDARY STRUCTURE PREDICTION: The consensus of the
secondary structure predictions from the various servers at the CAFASP
meta-server was used as the secondary structure prediction.
3. FITTING THE TARGET SEQUENCE TO THE TEMPLATE FOLD: We
initially start with an all-atom conformation where residues predicted to be in
helix/sheet by the consensus secondary structure prediction are set to idealised
helix and sheet values. The remaining φ/ψ values are set in an extended
conformation. Side chain conformations are predicted by simply using the most
frequently observed rotamer in a database of protein structures [2]. New
conformations are generated by perturbing the existing conformation at a
random residue position using either a value from a 14-state φ/ψ model, or
replacing three φ/ψ values for three residues with identical sequence which are
obtained from a database of known structures. The optimization function used
is primarily the CA RMSD between the conformation being generated and the
template structure, along with a combination of an all-atom distance-dependent
conditional probability discriminatory function (rapdf) and a hydrophobic
compactness function (hcf) [3,4]. The fitness of the conformations was
optimised by using two different protocols: a straightforward Monte
Carlo/simulated annealing approach [5] combined with a Genetic Algorithm
strategy, and a conformational space annealing approach [6]. A combination of
minimization parameters and scoring functions was used to generate a large
pool of conformations.
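The Monte Carlo/simulated annealing protocol mentioned above can be sketched in a few lines. This is a generic Metropolis loop under stated assumptions, not the authors' code: the conformation is reduced to a list of discrete state indices (14 per position, echoing the 14-state model), the objective is a toy mismatch count standing in for the CA RMSD and scoring terms, and the geometric cooling schedule, the function names and all parameter values are hypothetical.

```python
import math
import random


def metropolis_accept(delta, temperature, rng):
    """Always accept downhill moves; accept uphill moves with Boltzmann probability."""
    if delta <= 0:
        return True
    return rng.random() < math.exp(-delta / temperature)


def anneal(score, state, n_states, steps=5000, t_start=2.0, t_end=0.01, seed=0):
    """Minimise `score` by single-position perturbations under a cooling schedule."""
    rng = random.Random(seed)
    current = list(state)
    current_score = score(current)
    best, best_score = list(current), current_score
    for step in range(steps):
        # Geometric cooling from t_start to t_end (an assumption of this sketch).
        t = t_start * (t_end / t_start) ** (step / (steps - 1))
        trial = list(current)
        # Perturb one randomly chosen position with a randomly chosen state.
        trial[rng.randrange(len(trial))] = rng.randrange(n_states)
        trial_score = score(trial)
        if metropolis_accept(trial_score - current_score, t, rng):
            current, current_score = trial, trial_score
            if current_score < best_score:
                best, best_score = list(current), current_score
    return best, best_score


# Toy objective: number of positions that differ from a hypothetical target.
target = [3, 7, 1, 12, 0, 5, 9, 2]


def mismatch(states):
    return sum(a != b for a, b in zip(states, target))


best, best_score = anneal(mismatch, [0] * len(target), n_states=14)
print(best, best_score)
```

The high starting temperature lets the search accept unfavourable perturbations early on, while the late, cold phase behaves almost greedily.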
4. SELECTION OF NATIVE-LIKE CONFORMATIONS: The conformations
generated were minimised using ENCAD [7] and scored using a combination
of scoring functions that hierarchically reduces the total number of
conformations produced to five which are used for the final submissions. The
scoring functions used for the final filtering include a simple residue-residue
contact function (Shell), a density-scoring function that is based on the distance
of a conformation to all its relatives in the conformation pool, a secondary
structure based scoring function that evaluates the match between the predicted
structure and the secondary structure of a final energy-minimised conformation,
and standard physics-based electrostatics and Van der Waals terms.
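As an illustration of the density-scoring idea, the sketch below assigns each conformation the mean of its distances to all other members of the pool and ranks conformations by that value. The abstract does not give the exact functional form, so the use of a plain mean, the toy distance matrix and the name density_scores are assumptions.

```python
def density_scores(distances):
    """Mean distance from each conformation to every other one in the pool."""
    n = len(distances)
    return [
        sum(row[j] for j in range(n) if j != i) / (n - 1)
        for i, row in enumerate(distances)
    ]


# Hypothetical symmetric matrix of pairwise distances (for example RMSD in
# angstroms) between four conformations in the pool.
pool = [
    [0.0, 2.0, 3.0, 9.0],
    [2.0, 0.0, 2.5, 8.0],
    [3.0, 2.5, 0.0, 8.5],
    [9.0, 8.0, 8.5, 0.0],
]

scores = density_scores(pool)
# A lower score means the conformation sits in a denser region of the pool.
ranked = sorted(range(len(pool)), key=scores.__getitem__)
print(scores, ranked)
```

Conformation 3 lies far from the other three and is ranked last, which is the behaviour a density filter is meant to produce.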
As we note above, this is a completely novel approach that combines aspects of
all three major modelling approaches (comparative modelling, fold recognition,
de novo prediction) to handle the most difficult targets. This method can also
be used to generate alignments based on a structure comparison between the
final models and the template structures, which we can feed into a more
traditional comparative modelling procedure (see abstract for PROTINFO-CM).
We expect that this approach will perform best on proteins where the
evolutionary relationship between two proteins is not apparent from sequence
comparison methods.
1. Thompson J.D., Higgins D.G., Gibson T.J. (1994) CLUSTALW:
Improving the sensitivity of progressive multiple sequence alignment
through sequence weighting, position-specific gap penalties and weight
matrix choice. Nucleic Acids Res 22: 4673-4680.
2. Samudrala R., Huang E.S., Koehl P., Levitt M. (2000) Side chain
construction on near-native main chains for ab initio protein structure
prediction. Prot Eng 7: 453-457.
3. Xia Y., Huang E.S., Levitt M., Samudrala R. (2000) Ab initio construction
of protein tertiary structures using a hierarchical approach. J Mol Biol 300:
171-185.
4. Samudrala R., Levitt M. (2002) A comprehensive analysis of 40 blind
protein structure predictions. BMC Structural Biology 2: 3-18.
5. Simons K.T., Kooperberg C., Huang E., Baker D. (1997) Assembly of
protein tertiary structures from fragments with similar local sequences
using simulated annealing and bayesian scoring functions. J Mol Biol 268:
209-225.
6. Lee J., Liwo A., Scheraga H.A. (1999) Energy-based de novo protein
folding by conformational space annealing and an off-lattice united-residue
force field: application to the 10-55 fragment of staphylococcal protein A
and to apo calbindin D9K. Proc Natl Acad Sci USA 96: 2025-2030.
7. Levitt M., Hirshberg M., Sharon R., Daggett V. (1995) Potential energy
function and parameters for simulations of the molecular dynamics of
proteins and nucleic acids in solution. Comp Phys Comm 91: 215-231.
A-130
Pushchino (P0203) - 263 predictions: 263 3D
Threading Using Multiple Homology and Secondary
Structure Prediction Information
M.Yu.Lobanov1, I.Litvinov2, N.S.Bogatyreva1,
O.V.Galzitskaya1, M.S.Kondratova1, S.A.Garbuzynskiy1,
D.N.Ivankov1, M.A.Roytberg2, A.V.Finkelstein1
1 - Institute of Protein Research, Russian Academy of Sciences, 142290,
Pushchino, Moscow Region, Russia,
2 - Institute of Mathematical Problems of Biology, Russian Academy of
Sciences, 142290, Pushchino, Moscow Region, Russia
afinkel@vega.protres.ru
For dividing long targets into domains we used our new program PROFILE (to
be published) and the results of PSI-BLAST [1].
To predict secondary structure of targets we used programs JPRED [2],
PSIPRED [3] and ALB [4].
To obtain initial information on the possible target's fold we used standard
programs PSI-BLAST, HMMer (on HMM profile libraries PFAM [5] and
SUPERFAMILY [6]) and PROSITE [7].
Threading (of bunches of reliably homologous sequences onto 3D templates)
was done by our program SCF_THREADER [8] with the scoring function that
takes into account: homology of sequences, homology of predicted (for target)
and real (for template) secondary structures, 3D-structure dependent gap
penalties, 3D constraints on gaps in sequences threaded onto a template.
When the target-template homology was unambiguously detected by
PSI-BLAST or HMMer, SCF_THREADER usually gave a confident prediction
of the same template. In such cases we generated the final model from all
available alignments. When PSI-BLAST and HMMer gave nothing, we
checked the best SCF_THREADER models for compact structures having
hydrophobic cores.
Finally, visual inspection of the best results presented by programs and
selection of the most reasonable ones was done. This inspection sometimes
gave us also a possibility to merge several good predictions and make a
common model. We preferred to take as templates the representatives of the
most frequent fold in the programs' outputs, i.e. we did a kind of clustering of
obtained predictions. In all cases we corrected the final alignments manually to
get the resulting set of models.
The authors acknowledge support of the Howard Hughes Medical Institute, the
Russian Foundation for Basic Research, INTAS, and the Netherlands
Organization for Scientific Research.
1. Altschul S.F. et al. (1997) Gapped BLAST and PSI-BLAST: a new
generation of protein database search programs. Nucleic Acids Res. 25
(17), 3389-3402.
2. Cuff J.A. et al. (1998) JPred: a consensus secondary structure prediction
server. Bioinformatics. 14 (10), 892-893.
3. Jones D.T. (1999) Protein secondary structure prediction based on
position-specific scoring matrices. J. Mol. Biol. 292 (2), 195-202.
4. Ptitsyn O.B., Finkelstein A.V. (1983) Theory of protein secondary structure
and algorithm of its prediction. Biopolymers. 22 (1), 15-25.
5. Sonnhammer E.L.L., Eddy S.R., Durbin R. (1997) Pfam: a comprehensive
database of protein domain families based on seed alignments. Proteins:
Structure, Function and Genetics. 28 (3), 405-420.
6. Gough J., Karplus K., Hughey R., Chothia C. (2001) Assignment of
homology to genome sequences using a library of hidden markov models
that represent all proteins of known structure. J. Mol. Biol. 313 (4), 903-919.
7. Falquet L. et al. (2002) The PROSITE database, its status in 2002. Nucleic
Acids Res. 30 (1), 235-238.
8. Rykunov D.S. (2000) Search for the most stable folds of protein chains: III
improvement in fold recognition by averaging over homologous sequences
and 3D structures. Proteins. 40 (3), 494-501.
Raghava-Gajendara (P0054) - 482 predictions: 224 3D, 258 SS
RPFOLD: Recognition of Protein Fold from Sequence and
Predicted Secondary Structure Using Standard
Sequence Searching Methods
G. P. S. Raghava
Institute of Microbial Technology, Chandigarh, INDIA
raghava@imtech.res.in
RPFOLD uses the following steps for fold recognition: (1) Secondary structure
of query protein is predicted using PSIPRED; (2) predicted secondary structure
was searched using FASTA against database of secondary structure generated
from PDB by DSSP; (3) query protein sequence was searched against
non-redundant database using 3 iterations of PSIBLAST and profile was generated;
(4) the profile was used to search similar sequences in PDB using PSIBLAST;
(5) SSEARCH was used to search query sequence against PDB; (6) All the hits
obtained from above were ranked based on score and weightage; (7) Clustal-W
was used to align query sequence with predicted secondary structure information and
target protein in PDB with assigned secondary structure information to get final
alignment and re-ranking of hits.
APSSP/Raghava-Gajendra (P0137) - 65 predictions: 65 SS
APSSP: Automatic Method for Protein
Secondary Structure Prediction
G. P. S. Raghava
Institute of Microbial Technology, Chandigarh, INDIA
raghava@imtech.res.in
This method is similar to APSSP2 in that it uses three steps for protein
secondary structure prediction. First, JNET was used for predicting the
secondary structure of proteins. In the second step it predicts the secondary
structure of proteins using a modified example based learning (EBL) technique.
Modification of standard EBL is a major step in this study. In the third step
the secondary structures predicted by the above two steps are combined in
order to predict the final structure. The combination of the two is based on
a reliability score. The modified EBL approach is the same as in
APSSP2 except that it uses only one distance matrix instead of the 8000 distance
matrices used in APSSP2.
APSSP2/Raghava-Gajendra (P0055) - 65 predictions: 65 SS
APSSP2: A Combination Method for Protein Secondary
Structure Prediction Based on Neural Network and
Example Based Learning
G. P. S. Raghava
Institute of Microbial Technology, Chandigarh, INDIA
raghava@imtech.res.in
This method uses three steps for protein secondary structure prediction. First,
it uses a standard neural network and a multiple sequence alignment generated by
PSIBLAST instead of a single sequence. In the second step it predicts the secondary
structure of proteins using a modified example based learning (EBL) technique.
Modification of standard EBL is a major step in this study. In the third step
the secondary structures predicted by the above two steps are combined in order to
predict the final structure. The combination of the two is based on a reliability score.
In order to implement EBL, we first select all proteins in PDB that have resolution
better than 2.8A and a minimum length of 50. Then we assigned
the secondary structure of each protein using DSSP and generated patterns of length
17 residues with the secondary structure state of the central residue. Thus, we got more
than 6,000,000 patterns, of which more than 1,300,000 are non-redundant.
We trained our EBL method on these 1,300,000 unique patterns for
prediction of secondary structure. One of the major limitations of the standard EBL
method is its speed, because one needs to compare the query pattern against all
1,300,000 patterns. For example, in order to predict the secondary structure of a
protein of 200 amino acids one needs nearly 200 x 1,300,000 comparisons at the
pattern level and 17 x 200 x 1,300,000 at the residue level (considering pattern length
17). Thus it takes hours on a reasonably powerful machine, so it is not practical
to use the standard EBL method in fully automatic servers such as
APSSP2.
We divide the patterns for training and prediction into 8000 sets based on the three
central residues (the central residue and its left and right neighbours). This rehashing is
similar to BLAST. Then we create a distance matrix for each of the 8000 sets,
thus improving the speed 8000 times. This may affect the performance of the method,
which was compensated because we used 8000 matrices instead of one matrix.
The author believes that in future we will have no alternative except to combine
generalized methods (such as neural networks) with homology based methods
(such as EBL), because in future we will have more and more known examples.
In that case the EBL method is more accurate in comparison to generalized
methods. It has been observed in EVA that for a number of proteins APSSP2
was able to predict secondary structure with 100% accuracy where other
methods fail to do so.
RAPTOR (P0144) - 227 predictions: 227 3D
Protein Threading by Integer Programming
J. Xu and M. Li
Department of Computer Science, University of Waterloo
j3xu@uwaterloo.ca, mli@uwaterloo.ca
Protein three-dimensional structure prediction through threading method has
been extensively studied and various models and algorithms have been
proposed. In order to further explore ways to improve accuracy and efficiency
of the threading process, our program RAPTOR investigates the effectiveness
of a new method: protein threading via integer programming. RAPTOR
minimizes the energy function (i.e. seeks the optimal alignment between
sequence and template) by integer programming method. The energy function
used by RAPTOR rigorously takes the pair-wise contact potential into account.
Based on the contact map models of protein 3D structural templates, we
formulate the threading problem as a large scale integer programming problem,
then relax to a linear programming problem, and finally solve the integer
program by the branch-and-cut method. In solving the linear programs, 99% of
real data instances yield an integral optimal solution directly without
branching, which means that 99% of instances could be solved in polynomial time
although the problem itself is NP-hard. The final solution is optimal with
respect to energy functions incorporating pair-wise interaction potential and
allowing variable gaps. After optimal alignment, a raw z score is calculated by
randomly shuffling the query sequence. Fourteen features including the raw z
score are extracted from the optimal alignment. A Support Vector Machine
(SVM) method is employed to do fold recognition using these features. The SVM
method could recognize many more folds than the raw z score according to
the experimental tests. The algorithm has been implemented as software
package RAPTOR (Rapid Protein Threading predictOR). Experimental results
for fold recognition show that RAPTOR significantly outperforms other
programs at the fold similarity level. The RAPTOR server is available at
http://www.cs.uwaterloo.ca/~j3xu/RAPTOR_form.htm. For more detailed
description, please refer to our paper [1-2]. No manual intervention is used to
adjust the final models. All submissions are generated directly from RAPTOR.
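The raw z score obtained by shuffling the query sequence can be sketched as follows. This is an illustration under stated assumptions rather than RAPTOR's code: the energy function is a toy identity count against a hypothetical template string (in RAPTOR it would be the optimal threading energy), the number of shuffles is arbitrary, and the sign convention, in which a more negative z means the native alignment is better than random, is chosen here for convenience.

```python
import random
import statistics


def shuffled_z_score(query, energy_fn, n_shuffles=200, seed=0):
    """Z score of the query's energy against composition-preserving shuffles."""
    rng = random.Random(seed)
    native = energy_fn(query)
    residues = list(query)
    background = []
    for _ in range(n_shuffles):
        rng.shuffle(residues)
        background.append(energy_fn("".join(residues)))
    mean = statistics.mean(background)
    sd = statistics.pstdev(background)
    if sd == 0:
        raise ValueError("shuffled energies have zero variance")
    return (native - mean) / sd


# Hypothetical stand-in for a threading energy: minus the number of
# positions at which the query agrees with a fixed template string.
TEMPLATE = "ACDEFGHIKLMNPQRSTVWY"


def toy_energy(sequence):
    return -sum(a == b for a, b in zip(sequence, TEMPLATE))


z = shuffled_z_score("ACDEFGHIKLMNPQRSTVYW", toy_energy)
print(z)
```

Because shuffling preserves amino acid composition, the background distribution controls for composition effects and isolates the contribution of residue order.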
1. Jinbo Xu, Ming Li, Ying Xu et al. (2003) Protein Threading By Linear
Programming. PSB2003.
2. Jinbo Xu, Ming Li, Ying Xu. (2003) On the Power of Integer
Programming Approach to Protein Threading. Submitted to
RECOMB2003.
Rokko (P0327) - 109 predictions: 109 3D
De Novo Protein Structure Prediction Using Simfold;
Physico-Chemical Approach
Yoshimi Fujitsuka1, George Chikenji1, Nobuyasu Koga1,
Akira R. Kinjo2, and Shoji Takada1,2
1 Kobe University, 2 Japan Science and Technology Corporation
stakada@kobe-u.ac.jp
For CASP5, we use SimFold, a protein simulation program that we have been
developing recently [1,2]. We briefly describe a) the energy function, b) the
sampling method in SimFold, and c) how we did in CASP5.
a) SimFold uses a coarse-grained protein model that has explicit backbone
atoms and a sphere at the center of mass of sidechain. Each sidechain can take
one of several rotamer states. The energy function is based on physico-chemical
consideration and consists of many terms such as hydrophobic interaction,
hydrogen bonds, vdW interactions, and so on. In particular, hydrogen bond
interactions include dependence on local dielectric constant and correlation in
two neighboring bonds in beta sheets. Many of the length parameters are
determined from a database survey. For the energetic parameters that need to be
accurate, we optimized them on the basis of the energy landscape theory. For
each protein in a training set of 40 structures, we maximize the |Z| score, the
normalized difference between the native energy and the average energy of decoy
structures.
b) For conformational sampling, SimFold uses either the fragment assembly
(FA) method or the replica exchange MD method. We emphasize that, very
uniquely, both FA and MD methods are available in a single program SimFold.
Our FA is different from what has been developed by Baker's group in two
respects. First, for most of calculation, we only use three-residue-fragments,
instead of nine residue ones. Second, we have developed an algorithm of
"reversible FA method" (Chikenji, Fujitsuka, & Takada unpublished). We note
that the typical FA protocol does not obey detailed balance, but our
algorithm does. Thanks to this property, we could even combine the FA with the
multi-canonical ensemble approach, which is indeed used in CASP5 and helps
conformational sampling very significantly. Structures either with the lowest
energy or at the center of large clusters are chosen as predicted models. We also
perform MD-based replica exchange simulation, where each replica has the
same protein with different temperature and exchanges of replicas are tried at a
certain frequency. The lowest energy structure is searched in the replica at the
lowest temperature, while high temperature replica is useful for escaping from
misfolded traps.
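The replica exchange step can be illustrated with a short sketch. The abstract does not state the acceptance rule, so this assumes the standard Metropolis-type exchange criterion, in which a swap between replicas i and j is accepted with probability min(1, exp[(1/kT_i - 1/kT_j)(E_i - E_j)]). The function names, reduced units (k_b = 1) and example values are hypothetical, and the velocity rescaling needed after a swap in an MD setting is omitted.

```python
import math
import random


def swap_probability(e_i, e_j, t_i, t_j, k_b=1.0):
    """Metropolis acceptance probability for exchanging two replicas."""
    beta_i = 1.0 / (k_b * t_i)
    beta_j = 1.0 / (k_b * t_j)
    delta = (beta_i - beta_j) * (e_i - e_j)
    return 1.0 if delta >= 0 else math.exp(delta)


def attempt_swaps(energies, temperatures, rng):
    """Try to exchange each neighbouring pair of replicas once.

    `energies[i]` stands for the configuration currently held at
    `temperatures[i]`; the returned list gives the assignment afterwards.
    """
    energies = list(energies)
    for i in range(len(energies) - 1):
        p = swap_probability(
            energies[i], energies[i + 1], temperatures[i], temperatures[i + 1]
        )
        if rng.random() < p:
            energies[i], energies[i + 1] = energies[i + 1], energies[i]
    return energies


temperatures = [1.0, 1.5, 2.0]   # hypothetical ladder, lowest first
energies = [-10.0, -14.0, -9.0]  # current energy of each replica
after = attempt_swaps(energies, temperatures, random.Random(0))
print(after)
```

A low-energy structure found at a higher temperature is always passed down to the colder neighbour, which is how the lowest-temperature replica ends up collecting the lowest-energy structures.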
c) In CASP5, for all targets that have no homologous sequences of known
structures, we submitted structures predicted by SimFold. For chains shorter
than ~120, starting from random structures, we performed FA sampling either
with multi-canonical ensemble method or with simulated annealing. We chose
either structures with low energies or those at the center of large clusters. For
longer sequences, we started from models in CAFASP server and performed
replica exchange MD for sampling and chose structures in the lowest
temperature replica. For some targets, other information such as annotation was
used too.
1. Takada S. (2001) Protein Folding Simulation With Solvent-Induced Force
Field: Folding Pathway Ensemble of Three-Helix-Bundle Proteins,
Proteins 42, 85-98.
2. Fujitsuka Y., Takada S., Luthey-Schulten Z.A., & Wolynes, P.G. (2002)
Optimizing Physical Energy Functions for Protein Folding submitted.
Ron-Elber (P0300) - 259 predictions: 259 3D
Protein Structure Prediction With Threading Using the
LOOPP2 Algorithm
T. Galor1, C. Lowe1, J. Meller2, J. Pillardy3,
O. Teodorescu1 and R. Elber1
1 - Department of Computer Science, Cornell University, Ithaca, N.Y., 14853;
2 - Cincinnati Children’s Medical Center, Pediatric Informatics, 3333 Burnet
Avenue, Cincinnati, OH 4522; 3 - Computational Biology Service Unit,
Cornell University, Ithaca, N.Y., 14853
loopp@tc.cornell.edu
The structures of the target proteins were predicted using a new version of our
LOOPP (Learning, Observing and Outputting Protein Patterns) threading
algorithm [1]. The algorithm is centered on threading, while sequence
similarity and secondary structure prediction were used in order to improve
search scope and accuracy.
Compared to the earlier version of LOOPP that participated in CASP4 the
following enhancements were introduced: (i) deeper Z-score searches, (ii) use
of multiple sequence alignment, and (iii) secondary structure filtering.
Calculations of the Z score (for global and local alignments) are expensive, and
in LOOPP we limited our Z score calculations to the first 50 global and 250
local lowest energy scoring sequences; only the sequences occurring in both
lists were considered for final scoring. Here we extend the depth of the search
to include all the sequences occurring in any (local or global) of the alignments.
We use BLAST [2] to detect other sequences of significant similarity (between
40 and 80 percent sequence identity). At most 10 homologous sequences are used
and compared against the database of annotated protein sequences and
structures. Every hit of the related sequences counts as a prediction.
We use the secondary structure predictor JNET [3] to eliminate false positives by
removing predictions with negative correlation to the JNET prediction.
A-134
An extensive test of 68 probe sequences against a threading database will be
presented. The threading database includes a list of 132 homologous proteins
to probe sequences and 692 decoys. On average there are 2 homolog sequences
for each probe sequence in the database. The set was prepared using the CE
program [4] ensuring structural similarity of homologous pair and structural
dissimilarity of decoys. It is used to compare the performance of the old
LOOPP with the new fully automated protocol of LOOPP 2. In brief, the old
LOOPP found 30 sequences in the top 5 while LOOPP 2 found 57, which is an
increase of 90 percent. When LOOPP2 is used with only one sequence, 54 pairs
are found at the top five. Hence, the multiple sequence approach gave an
enhancement of about 6 percent. If the top 10 hits are considered, the single
sequence version of LOOPP2 found 70 pairs while the multi sequence version
found 83, an increase of 18.6 percent.
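The percentages quoted above follow directly from the reported counts, as the short check below shows. The helper name percent_increase is introduced only for this illustration.

```python
def percent_increase(old, new):
    """Relative increase of `new` over `old`, in percent."""
    return 100.0 * (new - old) / old


# Counts taken from the text above.
top5_old_vs_new = percent_increase(30, 57)        # old LOOPP vs LOOPP 2
top5_single_vs_multi = percent_increase(54, 57)   # single vs multiple sequences
top10_single_vs_multi = percent_increase(70, 83)  # same comparison, top 10 hits

print(round(top5_old_vs_new, 1),
      round(top5_single_vs_multi, 1),
      round(top10_single_vs_multi, 1))
```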
1. Elber R, Meller J., (2001) Linear programming optimization and a double
statistical filter for protein threading protocol. Proteins: structure, function
and genetics 45 (3) 241-261.
2. Altschul S.F. et al. (1997) Gapped BLAST and PSI-BLAST: a new
generation of protein database search programs. Nucleic Acids Res. 25
(17), 3389-3402
3. Cuff J.A., Barton G.J. (2000) Application of multiple sequence alignment
profiles to improve protein secondary structure prediction. Proteins:
structure, function and genetics 40 (3) 502-511.
4. Shindyalov I.N., Bourne P.E. (1998) Protein structure alignment by
incremental combinatorial extension (CE) of the optimal path. Protein
Engineering 11 (9) 739-747.
A-135
Rykunov-Reva-Tarakanov (P0529) - 198 predictions: 198 3D
Fold Recognition by Threading with Energy Averaging over
Homologous Sequences and 3D Structures
D. Rykunov1, B. Reva2, and A. Tarakanov1
2 Institute of Mathematical Problems of Biology, Russian Academy of Sciences,
Puschino, Moscow region, 142292; 1 Institute of Theoretical and Experimental
Biophysics, Russian Academy of Sciences, Puschino Moscow region, 142292
rykunov@vega.protres.ru
In fold predictions, we tried to implement the results of recent publications on
developing the threading method [4-6]. For the threading database, we determined a
diverse set of protein domain structures for available protein families and
super-families based on the SCOP [1] classification. Each of the SCOP families
was represented by sequences with pairwise similarity less than 80%. This
analysis resulted in 5887 individual protein structures representing 1790 protein
families. We used PSI-BLAST [2] for determining homologs of target
sequences; only those of sequence similarity 80-50% were retained for
computations. Each of the target sequences and the corresponding homologs
were threaded over 5887 individual protein backbones using the threading
model of [6] with distance-dependent phenomenological potentials of [3].
Sequence-to-structure alignments were computed in approximation of “external
field” [6], i.e. interactions between residues remote along a chain were
substituted by interactions with a template protein; short-range interactions
between neighbor residues were taken into account explicitly. For obtained
sequence-to-structure alignments the actual energies were computed [6]. These
energies were averaged both for target homologs and for templates within
structural families [4,5]. The sequence-to-structure alignments for top ranking
families were selected for human expertise that included comparison of
secondary structure prediction given by threading with the one obtained from
PredictProtein [7] and visual inspection of computed 3D structures. The less
contradictive models were chosen for submissions.
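The averaging of energies over homologs and over templates within a family can be sketched as follows. This is only a minimal illustration, assuming the threading energies are already available as a table keyed by (sequence, template); the function names `average_family_energies` and `rank_families` and the toy numbers are hypothetical and are not taken from the authors' software.

```python
from collections import defaultdict


def average_family_energies(energies, template_family):
    """Average threading energies per structural family.

    energies: dict mapping (sequence_id, template_id) -> energy
    template_family: dict mapping template_id -> family_id
    The mean is taken over all (sequence, template) pairs of a family, which
    equals averaging over homologs and then over templates when every
    sequence has been threaded onto every template.
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    for (_seq_id, tmpl_id), energy in energies.items():
        family = template_family[tmpl_id]
        sums[family] += energy
        counts[family] += 1
    return {family: sums[family] / counts[family] for family in sums}


def rank_families(avg_energy):
    """Return family ids ordered from lowest (best) to highest mean energy."""
    return sorted(avg_energy, key=avg_energy.get)


# Toy data (assumed): one target, one homolog, two families.
template_family = {"a1": "famA", "a2": "famA", "b1": "famB"}
energies = {
    ("target", "a1"): -10.0, ("homolog", "a1"): -14.0,
    ("target", "a2"): -8.0, ("homolog", "a2"): -12.0,
    ("target", "b1"): -20.0, ("homolog", "b1"): 2.0,
}
avg = average_family_energies(energies, template_family)
ranking = rank_families(avg)
```

In this toy table the single lowest energy belongs to template b1, but it is not supported by the homolog, so the averaged ranking prefers famA. That is the kind of noise reduction the averaging is meant to provide.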
1. Murzin A. et al. (1995) SCOP: a structural classification of proteins
   database for the investigation of sequences and structures. J. Mol. Biol.
   247, 536-540.
2. Altschul S.F. et al. (1997) Gapped BLAST and PSI-BLAST: a new
   generation of protein database search programs. Nucleic Acids Res. 25
   (17), 3389-3402.
3. Reva B.A. et al. (1997) Residue-residue mean-force potentials for protein
   structure recognition. Protein Engineering 10 (8), 865-876.
4. Reva B.A. et al. (1999) Averaging interaction energies over homologs
   improves protein fold recognition in gapless threading. Proteins 35, 353-359.
5. Rykunov D.S. et al. (2000) Search for the most stable folds of protein
   chains. III. Improvement in fold recognition by averaging over
   homologous sequences and 3D structures. Proteins 40, 494-501.
6. Reva B.A. et al. (2002) Threading with Chemostructural Restrictions
   Method for Predicting Fold and Functionally Significant Residues:
   Application to Dipeptidylpeptidase IV (DPP-IV). Proteins 47, 180-193.
7. Rost B. (1996) Methods in Enzymology 266, 525-539.
SAM-T02-human (P0001) - 203 predictions: 138 3D, 65 SS
SAM-T02: Protein Structure Prediction with Neural Nets,
Hidden Markov Models, and Fragment Packing
Kevin Karplus, Rachel Karchin, Richard Hughey,
Jenny Draper, Yael Mandel-Gutfreund, Jonathan Casper,
and Mark Diekhans
Center for Biomolecular Science and Engineering,
University of California, Santa Cruz
karplus@soe.ucsc.edu
The SAM-T02 human predictions start with the same method as the SAM-T02
server:
Use the SAM-T2K method for finding homologs of the target and aligning
them.
Make local structure predictions using neural nets and the multiple alignment.
We currently have 5 local-structure alphabets:
DSSP
STRIDE
STR: an extended version of DSSP that splits the beta strands
into multiple classes (parallel/antiparallel/mixed, edge/center)
ALPHA: a discretization of the alpha torsion angle: CA(i-1), CA(i), CA(i+1),
CA(i+2)
DSSP_EHL2: CASP's collapse of the DSSP alphabet
DSSP_EHL2 is not predicted directly by a neural net, but is computed as a
weighted average of the other 4 networks (each probability vector output is
multiplied by conditional probability matrix P(E|letter) P(H|letter) P(L|letter)).
The weights for the averaging are the mutual information between the local
structure alphabet and the DSSP_EHL2 alphabet in a large training set.
We make four 2-track HMMs (1.0 amino acid + 0.3 local structure) and use
them to score a template library of about 6200 templates. We also used a
single-track HMM to score not just the template library, but a non-redundant
copy of the entire PDB. [Difference from server: the web server did not
include the ALPHA alphabet in either the DSSP_EHL2 computation or the
2-track HMMs.]
One-track HMMs built from the template library multiple alignments were used
to score the target sequence.
All the logs of e-values were combined in a weighted average (with rather
arbitrary weights, since we did not have time to optimize them), and the best
templates ranked.
Alignments of the target to the top templates were made using several different
alignment methods (all using the SAM hmmscore program).
After the large set of alignments was made, the "human" methods and
the server diverge significantly. The server just picks the best-scoring
templates (after removing redundancy) and reports the local posterior-decoding
alignments made with the 2-track AA+STR target HMM.
The hand method used SAM's "fragfinder" program and the 2-track AA+STR
HMM to find short fragments (9 residues long) for each position in the
sequence (6 fragments were kept for each position).
Then the "undertaker" program (named because it optimizes burial) is used to
try to combine the alignments and the fragments into a consistent 3D model.
No single alignment or parent template was used, though in many cases one
had much more influence than the others. The alignment scores were not
passed to undertaker, but were used only to pick the set of alignments and
fragments that undertaker would see.
For multiple-domain models, we generally broke the sequence into chunks
(often somewhat arbitrary overlapping chunks), and did the full SAM-T02
method for each subchain. The alignments found were all tossed into the
undertaker conformation search. In some cases, we performed undertaker runs
for the subchains, and cut-and-pasted the pieces into one PDB file (with bad
breaks) and let undertaker try to assemble the pieces.
A genetic algorithm with about 16 different operators was used to optimize a
score function. The score function was hand-tweaked for each target (mainly
by adding constraints to keep beta sheets together, but also by adjusting what
terms were included in the score function and what weights were used).
Undertaker was undergoing extensive modification during CASP season, so
may have had quite different features available for different targets.
Bower and Dunbrack's SCWRL was run on some of the intermediate
conformations generated by undertaker, but the final conformation was chosen
entirely by the undertaker score function.
Optimization was generally done in many passes, with hand inspection of the
best conformation after each pass, followed (often) by tweaking the score
function to move the conformation in a direction we desired. In a few cases,
when we started getting a decent structure that did not correspond well to our
input alignments, we submitted the structure to VAST to get structure-structure
alignments, to try to find some other possible templates to use as a base. In
some cases, when several conformations had good parts, different
conformations were manually cut-and-pasted, with undertaker run to try to
smooth out the transitions.
Because undertaker does not (yet) handle multimers, we often added
"scaffolding" constraints by hand to try to retain structure in dimerization
interfaces. This is a crude hack that we hope to get rid of when we have
multimers implemented. Because undertaker does not (yet) have a
hydrogen-bond scoring function, we often had to add constraints to hold beta sheets
together. In some cases where the register was not obvious, we had to guess or
try several different registers. In some cases, when we got desperate for initial
starting points, we threw the Robetta ab-initio models into the undertaker pool,
and optimized from them as well as the ones undertaker started with.
SAM-T02-server (P0189) - 221 predictions: 221 3D
SAM-T02 Protein Structure Prediction Webserver
Kevin Karplus, Rachel Karchin, and Richard Hughey
Center for Biomolecular Science and Engineering,
University of California, Santa Cruz
karplus@soe.ucsc.edu
SAM-T02 predicts the fold and secondary structure of a target protein sequence
using multi-track hidden Markov models and neural nets trained on multiple
alignments generated by the SAM-T2K iterated search procedure [1-4].
As a first step, we build a multiple alignment of homologs to the target
sequence. Next, neural nets and the multiple alignment are used to make local
structure predictions. SAM-T02 is currently using these local-structure
alphabets: DSSP [5], STRIDE [6], STR [1], an extended version of DSSP that
splits the beta strands into multiple classes
(parallel/antiparallel/mixed/edge/center), and DSSP-EHL2, CASP’s collapse of
the DSSP alphabet into three states. DSSP-EHL2 is not predicted directly by a
neural net, but is computed as a weighted average of the other 3 networks (each
probability vector output is multiplied by conditional probability matrix
P(E|letter) P(H|letter) P(L|letter)). The weights for the averaging are the mutual
information between each local structure alphabet (DSSP, STRIDE, STR) and
the DSSP-EHL2 alphabet in a large training set.
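The weighted averaging that produces the three-state prediction can be sketched as follows. This is a minimal illustration under stated assumptions: the alphabets `toy2` and `toy3`, their conditional matrices, and the weights are invented for the example (the real weights are mutual-information values from a training set), and `combine_to_ehl` is a hypothetical name, not a SAM function.

```python
def combine_to_ehl(predictions, cond_matrices, weights):
    """Collapse several local-structure predictions into one (E, H, L) vector.

    predictions:   dict alphabet -> list of probabilities, one per letter
    cond_matrices: dict alphabet -> list of rows, one per letter, each row
                   being [P(E|letter), P(H|letter), P(L|letter)]
    weights:       dict alphabet -> non-negative weight (for example the
                   mutual information with the three-state alphabet)
    """
    total_weight = sum(weights[name] for name in predictions)
    ehl = [0.0, 0.0, 0.0]
    for name, probs in predictions.items():
        for p_letter, row in zip(probs, cond_matrices[name]):
            for k in range(3):
                # Project the letter probability onto E/H/L, then weight it.
                ehl[k] += weights[name] * p_letter * row[k] / total_weight
    return ehl


# Toy inputs (assumed) for a single residue position.
predictions = {"toy2": [0.5, 0.5], "toy3": [0.25, 0.25, 0.5]}
cond_matrices = {
    "toy2": [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]],
    "toy3": [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]],
}
weights = {"toy2": 1.0, "toy3": 3.0}
ehl = combine_to_ehl(predictions, cond_matrices, weights)
```

Because every row of a conditional matrix sums to one and the weights are normalised, the result is again a probability vector over E, H, and L.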
We make four 2-track HMMs (1.0 amino acid + 0.3 local structure) and use
them to score a template library of about 6200 templates. We also use a
single-track HMM to score not just the template library, but a non-redundant copy of
the entire PDB. One-track HMMs built from the template library multiple
alignments are also used to score the target sequence.
The HMM scores for each sequence, with respect to the four 2-track HMMs
and one template HMM, are converted to e-values, and the logs of all
e-values are combined in a weighted average (with rather arbitrary weights,
since we did not have time to optimize them). The combined scores are used to
rank the best templates.
Alignments of the target to the top templates are made using several different
alignment methods (all using the SAM hmmscore program [8]).
After the large set of alignments is made, the server picks the best-scoring
templates (after removing redundancy) and reports the local posterior-decoding
alignments made with the 2-track AA+STR target HMM. The AA+STR
HMM has produced the best quality alignments in our tests on sets of
structurally similar protein pairs with low to moderate sequence identity [1].
The server also provides secondary structure predictions for the target sequence
in a variety of formats and sequence logos [7] for the multiple alignment and
secondary-structure predictions.
SAM-T02 is available at:
http://www.soe.ucsc.edu/research/compbio/HMM-apps/T02-query.html
Stand-alone SAM programs (free for academics, government labs and non-profits) are available at:
http://www.soe.ucsc.edu/research/compbio/sam.html
1. Karchin R. et al. (2002) Hidden Markov models that use predicted local
   structure for fold recognition: alphabets of backbone geometry. Submitted
   to Proteins: Structure, Function, and Genetics.
2. Karplus K. et al. (2001) What is the value added by human intervention in
   protein structure prediction? Proteins: Structure, Function, and Genetics,
   45 (S5), 86-91.
3. Karplus K. et al. (1998) Hidden Markov models for detecting remote
   protein homologies. Bioinformatics, 14 (10), 846-856.
4. Park J. et al. (1998) Sequence comparisons using multiple sequences
   detect three times as many remote homologues as pairwise methods.
   J. Mol. Biol., 284 (4), 1201-1210.
5. Kabsch W. and Sander C. (1983) Dictionary of protein secondary
   structure: pattern recognition of hydrogen-bonded and geometrical
   features. Biopolymers, 22 (12), 2577-2637.
6. Frishman D. and Argos P. (1995) Knowledge-based Protein Secondary
   Structure Assignment. Proteins: Structure, Function, and Genetics, 23 (4),
   566-579.
7. Schneider T.D. and Stephens R.M. (1990) Sequence logos: a new way to
   display consensus sequences. Nucleic Acids Res, 18 (10), 6097-100.
8. Hughey R. and Krogh A. (1996) Hidden Markov models for sequence
   analysis: extension and analysis of the basic method. CABIOS, 12 (2), 95-107.
SAMUDRALA-COMPARATIVE-MODELLING (P0053) - 248 predictions: 248 3D
An Automated Approach for the Comparative Modeling of
Protein Structure
Ram Samudrala
University of Washington
ram@compbio.washington.edu
The interconnected nature of interactions in protein structures, thorough
sampling of side chain and main chain conformations, and devising a
discriminatory function that can distinguish between correct and incorrect
conformations are the major hurdles preventing the construction of accurate
homology models. We present an algorithm that uses graph theory to handle the
problem of interconnectedness. Sampling of side chain and main chain
conformations is accomplished by exhaustively enumerating all possible
choices using a discrete state model, including fragments from a database of
protein structures. The optimal combination of these possibilities is selected
using an all-atom scoring function aided by the graph-theoretic approach.
Following is a brief description of the components and steps of this method,
which can be divided into: discriminatory function, identification of template
and generation of alignment, initial model building, construction of variable
main chain and side chain regions, and moving models closer to the native
conformation.
1. DISCRIMINATORY FUNCTION: the function used throughout generally is
an all-atom distance-dependent conditional probability discriminatory function
based on a statistical analysis of known protein structures. The negative log of
the conditional probability of observing two atoms interact given a particular
distance is used as a ``pseudo-energy'' term [1].
2. IDENTIFICATION OF TEMPLATE AND GENERATION OF
ALIGNMENT: The CAFASP meta-server data (http://bioinfo.pl/cafasp) were
used to identify the template proteins that a given target sequence was related to
(based on a consensus of all the hits produced by the different servers). The
templates were then fed into a multiple sequence alignment method
(CLUSTALW [2]) and the pairwise alignments between the target and each of
the templates were used to construct initial models. The initial models were
then ranked by our discriminatory function and the models that ranked highest
were used for further model-building. In addition to these initial models, a
model based on the alignment derived from a structure comparison of the best
scoring model output from our de novo fold generation method (see the abstract
for SAMUDRALA-FOLD-RECOGNITION) to the corresponding template
structure was also used.
3. INITIAL MODEL BUILDING: Following the sequence alignment, for each
parent structure, an initial model was generated by copying atomic coordinates
for the main chain (excluding any insertions) and for the side chains of residues
that are identical in the target and parent structures. Residues that differ in type
were constructed using a minimum perturbation technique. The MP method
changes a given amino acid to the target amino acid preserving the values of
equivalent chi angles between the two side chains, where available. The other
chi angles are constructed by the MP method using an internally developed
library based on residue type.
4. CONSTRUCTION OF VARIABLE MAIN CHAIN AND SIDE CHAIN
REGIONS: Main chain sampling is performed using an exhaustive
enumeration technique based on discrete states of φ/ψ angles. For longer main
chain regions, we use fragments (3-tuples) from a database of protein structures
to generate the discrete φ/ψ angles.
Side chain possibilities are generated by selecting the most probable side chain
rotamers based on the interactions of a given rotamer with the local main chain
(evaluated using the discriminatory function above) [3]. Side chain
possibilities were also constructed using the program SCWRL [4].
We then use a graph-theoretic approach to assemble the sampled side chain and
main chain conformations together in a consistent manner. Each possible
conformation of a residue is represented using the notion of a node in a graph.
Each node is given a weight based on the degree of the interaction between its
side chain atoms and the local main chain atoms. The weight is computed
using an all-atom conditional probability discriminatory function. Edges are then
drawn between pairs of residues/nodes that are consistent with each other (i.e.,
clash-free and satisfying geometrical constraints). The edges are also weighted
according to the probability of the interaction between atoms in the two
residues. Once the entire graph is constructed, all the maximal sets of
completely connected nodes (cliques) are found using a clique-finding
algorithm. The cliques with the best probabilities represent the optimal
combinations of mixing and matching between the various possibilities, taking
the respective environments into account [5]. Clique-finding is accomplished
using the Bron and Kerbosch algorithm [6]. All models used were refined
using ENCAD [7].
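The clique-based assembly described above can be sketched as follows. This is a minimal illustration, not the author's implementation: it uses the basic Bron and Kerbosch recursion without pivoting, and the node labels, compatibility edges, and weights are toy values assumed for the example (lower weight meaning a more favourable pseudo-energy).

```python
from itertools import combinations


def bron_kerbosch(adj):
    """Return all maximal cliques of an undirected graph {node: set(neighbours)}."""
    cliques = []

    def expand(r, p, x):
        if not p and not x:
            cliques.append(sorted(r))  # r cannot be extended: maximal clique
            return
        for v in list(p):
            expand(r | {v}, p & adj[v], x & adj[v])
            p = p - {v}  # v has been fully explored
            x = x | {v}  # exclude v from later cliques in this branch

    expand(set(), set(adj), set())
    return cliques


def clique_score(clique, node_w, edge_w):
    """Sum node weights and pairwise edge weights of a clique."""
    score = sum(node_w[v] for v in clique)
    score += sum(edge_w[frozenset(pair)] for pair in combinations(clique, 2))
    return score


# Toy graph (assumed): a node "1a" means residue 1 in candidate conformation a.
edges = {
    ("1a", "2a"): -0.5, ("1a", "3a"): -0.5, ("2a", "3a"): -0.5,
    ("1b", "2b"): -0.25, ("1b", "3a"): -0.25,
}
node_w = {"1a": -1.0, "1b": -2.0, "2a": -1.0, "2b": -0.5, "3a": -1.5}
edge_w = {frozenset(pair): w for pair, w in edges.items()}
adj = {node: set() for node in node_w}
for u, v in edges:
    adj[u].add(v)
    adj[v].add(u)

cliques = sorted(bron_kerbosch(adj))
best = min(cliques, key=lambda c: clique_score(c, node_w, edge_w))
best_score = clique_score(best, node_w, edge_w)
```

Two conformations of the same residue are never joined by an edge, so each clique contains at most one conformation per residue. In the toy graph only one clique covers all three residues, and it also has the lowest total weight.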
5. MOVING MODELS CLOSER TO THE NATIVE CONFORMATION:
Once we had generated a final model for each parent, we used an off-lattice
14-state φ/ψ model and a sequential build-up algorithm to generate structures
around the conformational space of the final model. We then used our scoring
function to select the best ranking ones. The goal here is that some of the
conformations sampled would actually be closer to the native conformation and
that our scoring function will be able to select them.
We test how the above approach works in a comparative modelling scenario
and assess the predictive power of this method by applying it to properly
controlled blind tests as part of the fifth meeting on the Critical Assessment of
protein Structure Prediction methods (CASP5). Compared to CASP2-4, where
a similar approach was used [8], we have improved the method used to sample
main chains and have made minor enhancements to the other components of
this approach including the scoring function. It remains to be seen how the
improvements in methodology correlate with model accuracy.
Note: This method is completely automated and the models are generated using
the same process as the corresponding PROTINFO server
(http://protinfo.compbio.washington.edu). The difference between the
predictions submitted as part of the server registration and the ones submitted
under this group code is that, without the time limit that we had for
CAFASP (48 hours), we can make more predictions. Also, in cases where we noticed
clearly egregious output, we re-ran the automated methods with different
parameter weights (i.e., there was an additional step involving interactive
observation for a small number of the targets).
1. Samudrala R., Moult J. (1998) An all-atom distance dependent conditional
   probability discriminatory function for protein structure prediction. J Mol
   Biol 275: 893-914.
2. Thompson J.D., Higgins D.G., Gibson T.J. (1994) CLUSTALW:
   Improving the sensitivity of progressive multiple sequence alignment
   through sequence weighting, position-specific gap penalties and weight
   matrix choice. Nucleic Acids Res 22: 4673-4680.
3. Samudrala R., Moult J. (1998) Determinants of side chain conformational
   preferences in protein structures. Prot Eng 11: 991-997.
4. Bower M.J., Cohen F.E., Dunbrack R.L. (1997) Prediction of side-chain
   orientations from a backbone-dependent rotamer library: A new homology
   modelling tool. J Mol Biol 267: 1268-1282.
5. Samudrala R., Moult J. (1998) A graph-theoretic algorithm for
   comparative modelling of protein structure. J Mol Biol 279: 287-302.
6. Bron C., Kerbosch J. (1973) Algorithm 457: Finding all cliques of
   an undirected graph. Communications of the ACM 16: 575-577.
7. Levitt M., Hirshberg M., Sharon R., Daggett V. (1995) Potential energy
   function and parameters for simulations of the molecular dynamics of
   proteins and nucleic acids in solution. Comp Phys Comm 91: 215-231.
8. Samudrala R., Levitt M. (2002) A comprehensive analysis of 40 blind
   protein structure predictions. BMC Structural Biology 2: 3-18.
SAMUDRALA-FOLD-RECOGNITION (P0052) - 315 predictions: 315 3D
An Automated Approach for De Novo Fold Generation
Ram Samudrala
University of Washington
ram@compbio.washington.edu
This is a completely novel and automated approach, based on the idea of
driving a protein folding simulation towards a particular fold. The
idea was derived from the observation that, among distant homology
recognition programs, at least one could identify the correct template for every
CASP target, even if the alignment was incorrect. The logic here is that if a
template fold can be identified, we can use our de novo simulation approach
to guide the conformation towards the fold, in conjunction with our scoring
functions.
The advantage of this approach is that it completely does away with the issues
of alignment, the building of non-conserved side chains and main chains, and
the use of a fixed template to construct a model. This enables us to circumvent
explicit bias to the homologous parent structure (usually a problem in
comparative modelling/fold recognition methods since there is no easy
approach to move a model based on a template closer to its native structure).
We expect that this approach will perform well on cases where the sequence
similarity between two proteins is not very high, in terms of improving the
alignment, as well as obtaining a better conformation for the global fold.
Our general paradigm for predicting a fold involves sampling the
conformational space (or generating "decoys") such that native-like
conformations are observed, and then selecting them using a hierarchical
filtering technique using many different scoring functions. There are four
stages to our approach:
1. IDENTIFICATION OF THE TEMPLATE: The CAFASP meta-server data
were used to identify the template proteins that a given target sequence was
related to (based on a consensus of all the hits produced by the different
servers). The templates were then fed into a multiple sequence alignment
method (CLUSTALW; [1]) and the pairwise alignments between the target and
each of the templates were used to construct initial models. The initial models
were then ranked by our discriminatory function and the models that ranked
highest were considered candidates for the template structure.
2. SECONDARY STRUCTURE PREDICTION: The consensus of the
secondary structure predictions from the various servers at the CAFASP
meta-server was used as the secondary structure prediction.
3. FITTING THE TARGET SEQUENCE TO THE TEMPLATE FOLD: We
initially start with an all-atom conformation where residues predicted to be in
helix/sheet by the consensus secondary structure prediction are set to idealised
helix and sheet values. The remaining φ/ψ values are set in an extended
conformation. Side chain conformations are predicted by simply using the most
frequently observed rotamer in a database of protein structures [2]. New
conformations are generated by perturbing the existing conformation at a
random residue position, either using a value from a 14-state φ/ψ model or
replacing the φ/ψ values of three residues with those of a fragment of identical
sequence obtained from a database of known structures. The optimization function used
is primarily the CA RMSD between the conformation being generated and the
template structure, along with a combination of an all-atom distance-dependent
conditional probability discriminatory function (rapdf) and a hydrophobic
compactness function (hcf) [3,4]. The fitness of the conformations was
optimised using two different protocols: a straightforward Monte
Carlo/simulated annealing approach [5] combined with a Genetic Algorithm
strategy, and a conformational space annealing approach [6]. A combination of
minimisation parameters and scoring functions was used to generate a large
pool of conformations.
4. SELECTION OF NATIVE-LIKE CONFORMATIONS: The conformations
generated were minimised using ENCAD [7] and scored using a combination
of scoring functions that hierarchically reduces the total number of
conformations produced to five which are used for the final submissions. The
scoring functions used for the final filtering include a simple residue-residue
contact function (Shell), a density-scoring function that is based on the distance
of a conformation to all its relatives in the conformation pool, a secondary
structure based scoring function that evaluates the match between the predicted
structure and the secondary structure of a final energy-minimised conformation,
and standard physics-based electrostatics and Van der Waals terms.
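The Monte Carlo/simulated annealing search used in the fitting stage can be sketched as follows. This is a generic Metropolis sketch under loudly stated assumptions, not the PROTINFO code: conformations are reduced to a list of discrete state indices (0 to 13, standing in for a 14-state model whose angle values are not given here), and `toy_score` is a hypothetical stand-in in which a mismatch count plays the role of the CA RMSD to the template and a neighbour term plays the role of the knowledge-based functions.

```python
import math
import random


def simulated_annealing(start, score, n_states, steps=2000,
                        t_start=2.0, t_end=0.05, seed=0):
    """Minimise score(state) by single-position moves with Metropolis acceptance."""
    rng = random.Random(seed)
    current = list(start)
    cur_score = score(current)
    best, best_score = list(current), cur_score
    for step in range(steps):
        # Geometric cooling from t_start down to t_end.
        t = t_start * (t_end / t_start) ** (step / max(1, steps - 1))
        pos = rng.randrange(len(current))
        old = current[pos]
        current[pos] = rng.randrange(n_states)  # perturb one random position
        new_score = score(current)
        delta = new_score - cur_score
        if delta <= 0 or rng.random() < math.exp(-delta / t):
            cur_score = new_score  # accept the move
            if cur_score < best_score:
                best, best_score = list(current), cur_score
        else:
            current[pos] = old  # reject the move and restore the state
    return best, best_score


# Hypothetical template expressed as a sequence of state indices.
TEMPLATE = [3, 3, 3, 7, 7, 1, 1, 1, 12, 12]


def toy_score(state):
    mismatch = sum(a != b for a, b in zip(state, TEMPLATE))   # template bias
    roughness = sum(a != b for a, b in zip(state, state[1:]))  # toy local term
    return mismatch + 0.1 * roughness


start = [0] * len(TEMPLATE)
start_score = toy_score(start)
best, best_score = simulated_annealing(start, toy_score, n_states=14)
```

The returned conformation is the best one seen during the run, so its score can never be worse than that of the starting conformation. A real application would replace `toy_score` with the combination of template RMSD and scoring functions described above.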
As we note above, this is a completely novel approach that combines aspects of
all three major modelling approaches (comparative modelling, fold recognition,
de novo prediction) to handle the most difficult targets. This method can also
be used to generate alignments based on a structure comparison between the
final models and the template structures, which we can feed into a more
traditional comparative modelling procedure (see abstract for
SAMUDRALA-COMPARATIVE-MODELLING). We expect that this approach will perform
best on proteins where the evolutionary relationship between two proteins is not
apparent from sequence comparison methods.
Note: This method is completely automated and the models are generated using
the same process as the corresponding PROTINFO server
(http://protinfo.compbio.washington.edu). The difference between the
predictions submitted as part of the server registration and the ones submitted
under this group code is that, without the time limit that we had for
CAFASP (48 hours), we can make more predictions. Also, in cases where we noticed
clearly egregious output, we re-ran the automated methods with different
parameter weights (i.e., there was an additional step involving interactive
observation for a small number of the targets).
1. Thompson J.D., Higgins D.G., Gibson T.J. (1994) CLUSTALW:
   Improving the sensitivity of progressive multiple sequence alignment
   through sequence weighting, position-specific gap penalties and weight
   matrix choice. Nucleic Acids Res 22: 4673-4680.
2. Samudrala R., Huang E.S., Koehl P., Levitt M. (2000) Side chain
   construction on near-native main chains for ab initio protein structure
   prediction. Prot Eng 7: 453-457.
3. Xia Y., Huang E.S., Levitt M., Samudrala R. (2000) Ab initio construction
   of protein tertiary structures using a hierarchical approach. J Mol Biol 300:
   171-185.
4. Samudrala R., Levitt M. (2002) A comprehensive analysis of 40 blind
   protein structure predictions. BMC Structural Biology 2: 3-18.
5. Simons K.T., Kooperberg C., Huang E., Baker D. (1997) Assembly of
   protein tertiary structures from fragments with similar local sequences
   using simulated annealing and bayesian scoring functions. J Mol Biol 268:
   209-225.
6. Lee J., Liwo A., Scheraga H.A. (1999) Energy-based de novo protein
   folding by conformational space annealing and an off-lattice united-residue
   force field: application to the 10-55 fragment of staphylococcal protein A
   and to apo calbindin D9K. Proc Natl Acad Sci USA 96: 2025-2030.
7. Levitt M., Hirshberg M., Sharon R., Daggett V. (1995) Potential energy
   function and parameters for simulations of the molecular dynamics of
   proteins and nucleic acids in solution. Comp Phys Comm 91: 215-231.
SAMUDRALA-NEWFOLD (P0051) - 410 predictions: 410 3D
An Automated Approach for De Novo Structure Prediction
Ram Samudrala, Shing-Chung Ngan
University of Washington
ram@compbio.washington.edu, ngan@compbio.washington.edu
Our general paradigm for predicting structure involves sampling the
conformational space (or generating "decoys") such that native-like
conformations are observed, and then selecting them using a hierarchical
filtering technique using many different scoring functions. Our goal was to
devise a method that would combine the best aspects of the more successful ab
initio methods at the previous CASP experiments. There are three stages to our
approach:
1. SECONDARY STRUCTURE PREDICTION: The consensus of the
secondary structure predictions from the various servers at the CAFASP
meta-server was used as the secondary structure prediction.
2. SEARCHING PROTEIN CONFORMATIONAL SPACE: We initially start
with an all-atom conformation where residues predicted to be in helix/sheet by
the consensus secondary structure prediction are set to idealised helix and sheet
values. The remaining φ/ψ values are set in an extended conformation. Side
chain conformations are predicted by simply using the most frequently
observed rotamer in a database of protein structures [1]. New conformations
are generated by perturbing the existing conformation at a random residue
position, either using a value from a 14-state φ/ψ model or replacing the φ/ψ
values of three residues with those of a fragment of identical sequence obtained from a
database of known structures. The optimization function used is primarily a
combination of an all-atom distance-dependent conditional probability
discriminatory function (rapdf) and a hydrophobic compactness function (hcf)
[2,3]. The fitness of the conformations was optimised using two different
protocols: a straightforward Monte Carlo/simulated annealing approach [4]
combined with a Genetic Algorithm strategy, and a conformational space
annealing approach [5]. A combination of minimisation parameters and
scoring functions was used to generate a large pool of conformations.
3. SELECTION OF NATIVE-LIKE CONFORMATIONS: The conformations
generated were minimised using ENCAD [6] and scored using a combination
of scoring functions that hierarchically reduces the total number of
conformations produced to five which are used for the final submissions. The
scoring functions used for the final filtering include a simple residue-residue
contact function (Shell), a density-scoring function that is based on the distance
of a conformation to all its relatives in the conformation pool, a secondary
structure based scoring function that evaluates the match between the predicted
structure and the secondary structure of a final energy-minimised conformation,
and standard physics-based electrostatics and Van der Waals terms.
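The hierarchical filtering and the density score in the selection stage can be sketched as follows. This is a minimal illustration under stated assumptions: conformations are represented by single numbers so that the distance between two of them is just an absolute difference, the second filter is an arbitrary stand-in for the other scoring functions, and the names `density_scores` and `hierarchical_filter` are hypothetical.

```python
def density_scores(dist):
    """Mean distance from each conformation to all others in the pool.

    dist is a symmetric matrix (list of lists); a low score means the
    conformation sits in a densely populated region of the pool.
    """
    n = len(dist)
    return [sum(dist[i][j] for j in range(n) if j != i) / (n - 1)
            for i in range(n)]


def hierarchical_filter(candidates, filters):
    """Apply (score_fn, keep) filters in order, keeping the lowest scores."""
    pool = list(candidates)
    for score_fn, keep in filters:
        pool = sorted(pool, key=score_fn)[:keep]
    return pool


# Toy pool (assumed): six clustered conformations and two outliers.
coords = [0.0, 0.1, 0.2, 0.3, 0.4, 5.0, 9.0, 0.25]
dist = [[abs(a - b) for b in coords] for a in coords]
density = density_scores(dist)

final = hierarchical_filter(
    range(len(coords)),
    [
        (lambda i: density[i], 6),             # drop the least dense members
        (lambda i: abs(coords[i] - 0.2), 5),   # stand-in for a second score
    ],
)
```

Each filter only sees the survivors of the previous one, which is what allows cheap scores to prune a large pool before more expensive ones are evaluated on the remainder.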
This work is an attempt at combining the best de novo prediction methods from
the previous CASP experiments [2-5]. In addition, there are components that
are unique to this approach primarily in the form of the hierarchical filtering
methodology employed, the density scoring function, and in subtle variations of
each of the search methods.
Note: This method is completely automated and the models are generated using
the same process as the corresponding PROTINFO server
(http://protinfo.compbio.washington.edu). The difference between the
predictions submitted as part of the server registration and the ones submitted
under this group code is that, without the time limit that we had for
CAFASP (48 hours), we can make more predictions. Also, in cases where we noticed
clearly egregious output, we re-ran the automated methods with different
parameter weights (i.e., there was an additional step involving interactive
observation for a small number of the targets).
1. Samudrala R., Huang E.S., Koehl P., Levitt M. (2000) Side chain
   construction on near-native main chains for ab initio protein structure
   prediction. Prot Eng 7: 453-457.
2. Xia Y., Huang E.S., Levitt M., Samudrala R. (2000) Ab initio construction
   of protein tertiary structures using a hierarchical approach. J Mol Biol 300:
   171-185.
3. Samudrala R., Levitt M. (2002) A comprehensive analysis of 40 blind
   protein structure predictions. BMC Structural Biology 2: 3-18.
4. Simons K.T., Kooperberg C., Huang E., Baker D. (1997) Assembly of
   protein tertiary structures from fragments with similar local sequences
   using simulated annealing and bayesian scoring functions. J Mol Biol 268:
   209-225.
5. Lee J., Liwo A., Scheraga H.A. (1999) Energy-based de novo protein
   folding by conformational space annealing and an off-lattice united-residue
   force field: application to the 10-55 fragment of staphylococcal protein A
   and to apo calbindin D9K. Proc Natl Acad Sci USA 96: 2025-2030.
6. Levitt M., Hirshberg M., Sharon R., Daggett V. (1995) Potential energy
   function and parameters for simulations of the molecular dynamics of
   proteins and nucleic acids in solution. Comp Phys Comm 91: 215-231.
Sasson-Iris (P0265) - 66 predictions: 66 3D
Full-Atom Modeling using INBGU and 3D-SHOTGUN
I. Sasson
Ben-Gurion University
sassonir@cs.bgu.ac.il
Full atom models were generated using initial alignments from the INBGU
server as multiple template input to Modeller.
The goal was to test the power of the 3D-SHOTGUN selection procedure
carried out by the SHGU method and at the same time to generate protein-like,
full-atom models, without the fragmentation and collisions present in the
C-alpha-only SHGU models.
Most of the models were generated in a fully automated manner.
We describe the preliminary approach used. We are in the process of
incorporating a fully automated procedure into the INBGU server.
SBC (P0084) - 94 predictions: 94 3D
The Pcons and Pmodeller Consensus Fold Recognition Servers
Björn Wallner, Fang Huisheng and Arne Elofsson
Stockholm Bioinformatics Center, Stockholm University,
106 91 Stockholm, Sweden
arne@sbc.su.se
In the CASP and CAFASP processes it has been shown that manual experts are
better at predicting the fold of an unknown protein than fully automated methods.
The best manual predictions seem to be made by authors using a wide
range of different methods, and the most obvious similarity between them is
that they have worked on fold recognition for years. Several of these experts
also develop methods; however, these methods do not perform as well as the
experts themselves. What are the secrets that the manual experts possess, but are
not able to put into a computer?
We have recently shown that one such secret is the use of a "consensus"
approach in fold recognition. By using several different methods, the same
method with different parameters, or searches with several homologous
sequences, a "consensus" prediction can be made. The consensus analysis can
also be done using only a single sequence and a single method, by searching for
similar hits among the top-scoring hits. In contrast, most automatic methods
use only a single sequence and a single set of parameters, and do not use the
top-scoring hits to search for "consensus" predictions. We will describe a new
method for fold recognition, Pcons [1], that utilizes this "consensus analysis" to
improve automatic fold recognition.
Further, the ability to separate correct models of protein structures from less
correct models is of the greatest importance for protein structure prediction
methods. Several studies have examined the ability of different types of energy
function to detect the native, or native-like, protein structure from a large set of
decoys. In contrast to earlier studies we examine here the ability to detect
models that only show some structural similarity to the native structure. These
correct models are defined by the existence of a fragment that shows significant
similarity between the model and the native structure. Further, it has been
shown that the existence of such fragments is useful for comparing the
performance of different fold recognition methods and that this
performance correlates well with performance in fold recognition.
We developed a neural network based method to predict the quality of a protein
model (ProQ). ProQ extracts structural features, such as the frequency of
atom-atom contacts, and predicts the quality of a model, as measured either by
LGscore or MaxSub. We show that ProQ performs at least as well as other
measures when identifying the native structure and better at the detection of
correct models. This performance is maintained over several different test sets.
ProQ [2] can also be combined with the Pcons [1] fold recognition predictor to
increase its performance. However, the improvement is quite marginal, with the
main advantage being the elimination of a few high-scoring false positive
models.
ProQ is freely available as a standalone web server at
http://www.sbc.su.se/~bjorn/ProQ/, and is incorporated into the Pcons consensus
server, available at http://www.sbc.su.se/~arne/pcons/, as Pmodeller. Current
results in LiveBench indicate that Pmodeller performs significantly better than
Pcons.
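As an illustration of the single-method consensus idea in this abstract (looking
for similar hits among the top-scoring hits), here is a minimal Python sketch. It
is not the Pcons neural network: the similarity matrix, the function name
consensus_scores and the toy values are assumptions made for illustration only.

```python
from typing import List


def consensus_scores(similarity: List[List[float]]) -> List[float]:
    """Average similarity of each model to all other models.

    similarity[i][j] is a hypothetical structural-similarity score between
    model i and model j (symmetric, higher means more similar).
    """
    n = len(similarity)
    if n < 2:
        return [0.0] * n
    scores = []
    for i in range(n):
        others = [similarity[i][j] for j in range(n) if j != i]
        scores.append(sum(others) / len(others))
    return scores


# Toy example: models 0 and 1 agree with each other, model 2 is an outlier.
toy = [
    [1.0, 0.8, 0.1],
    [0.8, 1.0, 0.2],
    [0.1, 0.2, 1.0],
]
scores = consensus_scores(toy)
# Index of the model that is, on average, most similar to the others.
best = max(range(len(scores)), key=scores.__getitem__)
```

A model supported by many similar hits gets a high score even if its own raw
score is not the best, which is the intuition behind consensus selection.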
1. Lundström et al. (2001) Pcons: A neural network based consensus
predictor that improves fold recognition. Protein Science 10(11):2354-6
2. Wallner and Elofsson (2002) Can correct protein models be identified?
submitted
Scheraga-Harold (P0314) - 135 predictions: 135 3D
Physics-Based Protein-Structure Prediction Using the UNRES
and ECEPP/3 Force Fields - Test on CASP5 Targets
C. Czaplewski1,2, D.R. Ripoll1,3, St. Ołdziej1,2,
R. Kaźmierkiewicz1,2, J.A. Vila1,4, A. Liwo1, J. Pillardy1,3,
J.A. Saunders1, M. Chinchio1, M. Nanias1, M. Khalili1,
Y.A. Arnautova1, A. Jagielska1, Y. K. Kang1,5, K.D. Gibson1
and H.A. Scheraga1*
1 – Baker Laboratory of Chemistry and Chemical Biology, Cornell University,
Ithaca, NY, 14853-1301, 2 – Faculty of Chemistry, University of Gdańsk, ul.
Sobieskiego 18, 80-952 Gdańsk, Poland, 3 – Cornell Theory Center, Cornell
University, Ithaca, NY, 14853-1301, 4 – Escuela de Fisica, Facultad de Ciencias
Fisico Matematicas y Naturales, Universidad Nacional de San Luis, Ejercito de
los Andes 950, 5700 San Luis, Argentina, 5 – Department of Chemistry,
Chungbuk National University, Cheongju, Chungbuk 361-763, Korea
* has5@cornell.edu
The structures of the target proteins were predicted using a hierarchical
algorithm consisting of three major stages, in which the tertiary structure is
predicted at low resolution and then refined.
In stage 1, the protein is represented by a simplified low-resolution united
residue (UNRES) model, in which the atoms of the peptide group and side
chain of each amino-acid residue are replaced with two centers of interaction:
the united peptide group (p), located midway between two consecutive
Cα atoms, and the united side chain (SC). The lengths of the virtual Cα…Cα
and Cα…SC bonds are held fixed, but the virtual-bond angles and the
orientations of the Cα…SC virtual bonds are variable. The interactions in this
simplified model are described by the UNRES potential derived from the
generalized cumulant expansion of the restricted free energy (RFE) function of
polypeptide chains [1]. The cumulant expansion enabled us to determine the
functional forms of the multibody terms in the UNRES potential.
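The geometric part of this representation, placing each united peptide group
midway between consecutive alpha-carbon atoms, can be sketched in a few lines of
Python. This is only an illustration of the midpoint construction; the function
name and the toy coordinates are assumptions and are not taken from the UNRES
code.

```python
from typing import List, Tuple

Point = Tuple[float, ...]


def peptide_group_centers(ca: List[Point]) -> List[Point]:
    """Return the midpoints between consecutive C-alpha positions."""
    return [
        tuple((a + b) / 2.0 for a, b in zip(p, q))
        for p, q in zip(ca, ca[1:])
    ]


# Toy C-alpha trace with a virtual-bond length of 3.8 (illustrative values).
ca_trace = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (3.8, 3.8, 0.0)]
centers = peptide_group_centers(ca_trace)
```

A chain of N residues yields N-1 peptide-group centers, one per virtual bond.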
The UNRES potential was parameterized using RFE surfaces of systems
modeling interacting fragments of polypeptide chains calculated at the
quantum-mechanical ab initio level [using Møller-Plesset perturbation theory up to
the second order (MP2) with the 6-31G* basis set], as well as correlation and
distribution functions determined from the Protein Data Bank (PDB). The
folding property of the potential function was achieved by applying our novel
hierarchical optimization method targeted at decreasing the energy while
increasing the native-likeness of a structure of benchmark protein(s) [2].
Our conformational space annealing (CSA) method [3] was used to search for
the lowest-energy families of UNRES conformations. To speed up the search in
the case of larger proteins, information from secondary structure prediction was
used in the generation of the initial structures and/or to restrict the
conformational search. However, unrestricted search was also performed in
most of the cases. The five families with the lowest UNRES energy were
chosen as models 1-5; the structures of these models were then refined in stages
2 and 3, as described below.
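The selection step described above, taking the lowest-energy families as the
submitted models, can be sketched as follows. This is a minimal illustration and
not the CSA code: representing a family by the energy of its lowest member, and
the names and energies used, are assumptions.

```python
from typing import Dict, List


def pick_models(family_energies: Dict[str, List[float]],
                n_models: int = 5) -> List[str]:
    """Rank families by their lowest member energy and keep the first n_models."""
    ranked = sorted(family_energies,
                    key=lambda name: min(family_energies[name]))
    return ranked[:n_models]


# Hypothetical families with made-up energies.
families = {
    "fam_a": [-120.5, -118.0],
    "fam_b": [-131.2],
    "fam_c": [-110.0, -125.7, -109.3],
}
models = pick_models(families, n_models=2)
```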
In stage 2, the low-resolution UNRES conformations of a target protein were
converted to all-atom models by using our energy-based method for the
reconstruction of an all-atom polypeptide chain from its Cα-trace and
side-chain-centroid coordinates [4,5]. Finally, in stage 3, the all-atom structures
were refined by minimizing their energies with the all-atom ECEPP/3 force
field [6] subject to Cα-distance constraints of the parent UNRES models.
1. Liwo A. et al. (2001) Cumulant-based expressions for the multibody terms
for the correlation between local and electrostatic interactions in the
united-residue force field. J. Chem. Phys. 115 (5), 2323-2347.
2. Liwo A. et al. (2002) A method for optimizing potential-energy functions
by a hierarchical design of the potential-energy landscape: Application to
the UNRES force field. Proc. Natl. Acad. Sci. USA. 99 (4), 1937-1942.
3. Lee J. et al. (1999) Conformational space annealing and an off-lattice
united-residue force field: application to the 10-55 fragment of
staphylococcal protein A and to apo calbindin D9K. Proc. Natl. Acad. Sci.
USA. 96 (5), 2025-2030.
4. Kaźmierkiewicz R. et al. (2001) Energy-based reconstruction of a protein
backbone from its α-carbon trace by a Monte Carlo method. J. Comput.
Chem. 23 (7) 715-723.
5. Kaźmierkiewicz R. et al. (2002) Addition of side chains to a known
backbone with defined side-chain centroids. Biophys. Chem. in press.
6. Nemethy G. et al. Energy parameters in polypeptides. 10. Improved
geometrical parameters and nonbonded interactions for use in the ECEPP/3
algorithm with application to proline-containing peptides. J. Phys. Chem.
96 (15) 6472-6484.
Schulten-Wolynes (P0093) - 118 predictions: 118 3D
Bioinformatics Based Threading for Protein Structure
Prediction
P. O’Donoghue1, F. Autenrieth1, R. Amaro1, M. Januszyk1, T.
Pogorelov1, P. G. Wolynes2, and Z. Luthey-Schulten1
1 - University of Illinois at Urbana-Champaign
2 - University of California, San Diego
zan@uiuc.edu
We used a combination of methods to select scaffolds for the target sequences.
These methods included: 1) using our in-house sequence-structure threading
alignment algorithm (termed the local Hamiltonian) [4] to thread the target
sequence against a PDB select database, PDB25 and/or PDB90 [2]; 2) a
PSI-BLAST simultaneous search against the PDB database and sequence databases
from several organisms from NCBI or the Biology Workbench [1,9]; 3) a
PubMed literature search [12] for proteins functionally related to the target
sequence. If CAFASP3 [11] reported scaffolds not related to those found by
the first three search methods, then these scaffolds were included in our
analysis. Large proteins were subdivided into putative domains using a variety
of methods including analysis of multiple sequence alignments and exon
prediction algorithms.
Three procedures were used to generate sequence alignments of the target
sequence to the scaffold: our in-house sequence-structure threading alignment
algorithm [4], a sequence to structure profile-profile alignment procedure as
described in [5], and a hybrid method that uses the threading alignment over
some regions and the profile-profile alignment over other regions of the target
sequence. This hybrid method is also described in [5]. Sequence profiles were
generated using Clustal-W [8] on sequences obtained from a PSI-BLAST
search of the Swiss-Prot or Non-Redundant sequence databases. Structure
profiles were generated using the CE algorithm for structural alignment [7].
We manually checked complete alignments for agreement between the PsiPred
[3] secondary structure prediction for the target and the secondary structure of
the scaffold. We also checked for correct alignment of homologous functional
sites.
The complete 3-dimensional models of the target proteins with side chains were
made using the Modeller package as implemented in Insight II [6,10]. Starting
from the alignments mentioned above, three models were generated using the
highest level of optimization.
Our in-house sequence-structure threading alignment algorithm [4] was used to
rank the model structures constructed for a given target sequence. After
threading the target sequence onto the model structures of the target, the model
with the best local Hamiltonian energy was chosen as our “MODEL 1”.
Additional models followed this energy ranking.
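The manual check of alignments against secondary structure mentioned in this
abstract could be supported by a simple agreement score. The Python sketch below
is an assumption-laden illustration, not the authors' procedure: the function
name, the three-state H/E/C strings and the toy alignment are all hypothetical.

```python
def ss_agreement(target_aln: str, scaffold_aln: str,
                 target_ss: str, scaffold_ss: str) -> float:
    """Fraction of aligned (non-gap) columns whose secondary structure agrees.

    target_aln / scaffold_aln: aligned sequences of equal length, '-' is a gap.
    target_ss / scaffold_ss: per-residue states for the ungapped sequences.
    """
    if len(target_aln) != len(scaffold_aln):
        raise ValueError("aligned sequences must have equal length")
    ti = si = 0            # residue indices into the ungapped sequences
    aligned = matches = 0
    for t, s in zip(target_aln, scaffold_aln):
        if t != '-' and s != '-':
            aligned += 1
            if target_ss[ti] == scaffold_ss[si]:
                matches += 1
        if t != '-':
            ti += 1
        if s != '-':
            si += 1
    return matches / aligned if aligned else 0.0


# Toy alignment: four aligned columns, three of which agree.
score = ss_agreement("AC-DEF", "ACGD-F", "HHHCC", "HHECC")
```

A low score would flag an alignment for closer manual inspection; any threshold
is left to the user.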
1. Altschul S.F. et al. (1997) Gapped BLAST and PSI-BLAST: a new
generation of protein database search programs. Nucleic Acids Res. 25,
3389-3402.
2. Hobohm U. et al. (1992) Selection of representative protein data sets. Prot.
Sci. 1, 409-417.
3. Jones D.T. (1999) Protein secondary structure prediction based on
position-specific scoring matrices. J. Mol. Biol. 292 (2), 195-202.
4. Koretke K.K. et al. (1996) Self-consistently optimized statistical
mechanical energy functions for sequence structure alignment. Prot. Sci. 5,
1043-1059.
5. O'Donoghue P. et al. (2001) On the Structure of hisH: Protein Structure
Prediction in the Context of Structural and Functional Genomics. J. Struct.
Biol. 134, 257-268.
6. Sali A. et al. (1993) J. Mol. Biol. 234, 779-815.
7. Shindyalov I.N. et al. (1998) Protein structure alignment by incremental
combinatorial extension (CE) of the optimal path. Prot. Eng. 11, 739-747.
8. Thompson J.D. et al. (1994) CLUSTAL W: improving the sensitivity of
progressive multiple sequence alignment through sequence weighting,
position-specific gap penalties and weight matrix choice. Nucleic Acids
Res. 22, 4673-4680.
9. http://workbench.sdsc.edu
10. http://www.accelrys.com
11. http://www.cs.bgu.ac.il/~dfiischer/CAFASP3/
12. http://www.ncbi.nlm.nih.gov/entrez/query.fcgi
SDSC2:Reddy-Bourne (P0347) - 54 predictions: 54 3D
Using Combination of PSI-BLAST, Expdb, 3D-PSSM, SAMT02, FUGUE, Multalin, Swissmodel and Swiss-PDB Viewer
Web Servers with Human Input to Model Protein Structures.
V. Boojala B. Reddy1 and Philip E. Bourne1,2
1 San Diego Supercomputer Center,
2 Department of Pharmacology,
University of California, San Diego, CA 92093 - 0537
Template Selection:
Target sequences were taken from the CASP5 target site in FASTA format
and submitted for a PSI-BLAST [1] search at the NCBI BLAST site, using the pdb
sequence database for template identification. If one or more templates were
identified with an expectation score less than 10^-5, the best-resolved structure
with good sequence coverage was selected as the basis structure to model the
target sequence. The ExPDB [6] template search was used for this purpose.
If no template was identified among the NCBI pdb sequences, we submitted the
sequence to the fold recognition servers 3D-PSSM [5], SAM-T02 [4] and FUGUE
[7]. We used the top five suggested templates from each of these servers
for further consideration as possible templates. We also considered their
structural homologues from FSSP as a possible template set. We submitted
the target sequence for a PSI-BLAST [1] search against the NR sequence database to
identify its available sequence homologues. Functional information from the
NR homologue sequences was matched with the possible templates from the
above fold recognition servers to decide on a single basis structure to be used to
model the target structure.
Template Target Alignment:
We randomly chose a few (at least 4 each) distant homologues (e-value 10^-20
or more) of the template and target sequences (a total of 8 including the template
and the target sequences). All these sequences were multiply aligned using the
MultAlin [2] server. The resulting alignment between the template and target
sequences, along with their homologues, was used as the final alignment between
the template and target sequences.
Model Building and Visualization:
We used Swiss-PDB Viewer (Deep View) [3] to load the template
structure and the target sequence. The two sequences were aligned as per the
MultAlin [2] alignment, and the resulting alignment was submitted to Swiss-Model
through the web submission form. The model built by Swiss-Model was viewed and the
coordinate file was edited as required by the CASP5 TS submission
format. Only one model was submitted for each target. In all our models we
used only the single best possible template to model a target structure.
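The template-selection rule in this abstract (accept PSI-BLAST hits below an
expectation-value cutoff, then prefer the best-resolved structure with good
coverage) can be sketched as below. This is a hedged illustration only: the Hit
record, the min_coverage value of 0.7 and the identifiers are assumptions; only
the 10^-5 cutoff comes from the text.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Hit:
    pdb_id: str        # hypothetical identifier
    e_value: float     # PSI-BLAST expectation value
    resolution: float  # in angstroms, lower is better
    coverage: float    # fraction of the target covered, 0..1


def select_template(hits: List[Hit], e_cutoff: float = 1e-5,
                    min_coverage: float = 0.7) -> Optional[Hit]:
    """Return the best-resolved hit passing both cutoffs, or None."""
    candidates = [h for h in hits
                  if h.e_value < e_cutoff and h.coverage >= min_coverage]
    if not candidates:
        # The caller would fall back to the fold-recognition servers here.
        return None
    return min(candidates, key=lambda h: h.resolution)


# Made-up hits for illustration.
hits = [
    Hit("tmplA", 1e-30, 2.5, 0.95),
    Hit("tmplB", 1e-12, 1.8, 0.90),
    Hit("tmplC", 1e-40, 1.5, 0.40),  # well resolved but poor coverage
    Hit("tmplD", 1e-2, 1.2, 0.99),   # not significant
]
chosen = select_template(hits)
```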
1. Altschul S.F. et al. (1997) Gapped BLAST and PSI-BLAST: a new
generation of protein database search programs. Nucleic Acids Res. 25
(17), 3389 - 3402.
2. Corpet F. (1988) Multiple sequence alignment with hierarchical clustering.
Nucleic Acids Res. 16 (22), 10881 - 10890.
3. Guex N., Peitsch M.C. (1997) SWISS-MODEL and the Swiss-PdbViewer:
an environment for comparative protein modeling. Electrophoresis
18 (15): 2714 - 2723.
4. Karplus K., Karchin R., Barrett C., Tu S., Cline M., Diekhans M., Grate L.,
Casper J., Hughey R. (2001) What is the value added by human
intervention in protein structure prediction? Proteins. Suppl 5: 86 - 91.
5. Kelley L.A., MacCallum R.M., Sternberg M.J. (2000) Enhanced genome
annotation using structural profiles in the program 3D-PSSM. J Mol Biol.
299 (2): 499 - 520.
6. Peitsch M.C., Schwede T. and Guex N. (2000) Automated protein modeling
– the proteome in 3D. Pharmacogenomics 1: 257 – 266.
7. Shi J., Blundell T.L., Mizuguchi K. FUGUE: sequence-structure homology
recognition using environment-specific substitution tables and
structure-dependent gap penalties. J Mol Biol. 310 (1): 243 - 257.
Shakhnovich-Eugene (P0459) - 26 predictions: 26 3D
Structure Prediction: a Synthesis of Threading and Folding
W. Chen1, E. Kussell1, F. Zhang2, B. Shakhnovich3,
I. Hubner2, E.I. Shakhnovich2
1 - Dept of Biophysics, Harvard University,
2 - Dept of Chemistry and Chemical Biology, Harvard University,
3 - Bioinformatics Program, Boston University
eugene@belok.harvard.edu
We present a method for protein structure prediction that is a synthesis of
bioinformatic and all-atom physical approaches. Our method is a combination
of threading, model-building, specific potential derivation, and full-atom
folding. This novel structure prediction protocol is flexible in that it provides
for varying levels of detail in both input and output. At the same time, the
method is efficiently automated and human intervention is controllable at each
step. The combination of informatic and physical concepts in this method also
means that it draws upon strengths inherent to each approach.
We begin with a query sequence, the structure of which is unknown. Fold
recognition begins with either of two methods: threading or ELISA query [1].
The ELISA database is a database of protein domains sorted by structural and
sequence similarity, and further annotated by function. Each set of domains is
connected to others in a graph structure on the basis of a threshold Z-score.
We submitted the query sequence to ELISA, which yielded domain hits and a
set of graph neighbors, the number of which can be selected by adjusting the
threshold Z-score. This small set of structures was then subjected to threading
for fold recognition refinement, or to direct modeling if the initial ELISA hit was
significant enough.
Threading was also used for fold recognition, or for generating an alignment if
a significant ELISA hit was retrieved. The query sequence is subjected to
Monte Carlo threading through a very large set of templates [2]. Appropriate
bioinformatic constraints in the form of structural profiles or alignment
fragment number restrictions give accuracy and significance to threading hits.
The most significant template hit and alignment is judged by the magnitude of
the  -parameter, and is selected as a template for model building [3-4].
Using the template and threading alignment, a full-atom backbone is generated
with appropriate proline geometries. The backbone secondary structure is
obtained directly from the alignment and the template. Loop regions
(non-aligned regions) are of appropriate length and are random in conformation. The
backbone is minimized with a dRMS function to reflect the tertiary structure of
the template. Sidechains with random rotameric states are built onto the final
dRMS-minimized model.
To refine the model in both rotamer and backbone states, we derive a
family-specific full-atom potential for use in full-atom refinement. The fold family is
obtained from relatives of the threading template hits. These related proteins
are used for derivation of a full-atom potential. We use an atom-typing scheme
based on six residue types: aliphatic, aromatic, positively charged, negatively
charged, polar, and special. Atoms within each residue type then have unique
types based loosely on similarity of chemical connectivity. This results in 28
atom types for the 20 amino acids. The potential is derived using the potential
procedure [5], allowing for the parameters N_ab and Ñ_ab to be summed
across the family.
The dRMS-minimized model with sidechains is refined using the derived
potential and a generic geometric hydrogen bond potential. Refinement is done
by Monte Carlo annealing at low temperatures to bring the protein to low free
energy conformations [6]. For next-stage refinement, conformations can be
automatically selected on the basis of low free energy, or by hand after visual
examination with a molecule viewer. The penultimate structure is put through
a final round of refinement by fixing the backbone and allowing sampling of
sidechain rotamers from a rotamer library, using the derived potential. This
"repacking" of sidechains ensures both improved packing of the interior of the
model and also realistic rotamer conformations [7]. The final structure is
inspected using the Protein Health Facility in Quanta.
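The dRMS function used in this abstract to pull the backbone toward the template
is not spelled out; a common definition is the root-mean-square difference of
corresponding pairwise distances in two structures. The Python sketch below
assumes that definition and an arbitrary choice of atoms, so it is an
illustration and not the authors' implementation.

```python
import math
from itertools import combinations
from typing import Sequence, Tuple

Point = Tuple[float, float, float]


def drms(model: Sequence[Point], template: Sequence[Point]) -> float:
    """RMS difference between corresponding pairwise distances."""
    if len(model) != len(template):
        raise ValueError("structures must have the same number of atoms")
    pairs = list(combinations(range(len(model)), 2))
    if not pairs:
        return 0.0
    total = 0.0
    for i, j in pairs:
        d_model = math.dist(model[i], model[j])
        d_templ = math.dist(template[i], template[j])
        total += (d_model - d_templ) ** 2
    return math.sqrt(total / len(pairs))


# Toy check: one pair of atoms, distances 1.0 and 3.0.
value = drms([(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)],
             [(0.0, 0.0, 0.0), (3.0, 0.0, 0.0)])
```

Because only internal distances enter, the measure needs no superposition and is
unchanged by rigid translation or rotation of either structure.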
The method we have outlined draws its strengths from the combination of
bioinformatic and realistic full-atom approaches. Threading constrains the
query chain to a small subset of possible conformations, and full-atom folding
further refines the coarse results of threading. The method, being multi-stage
and open-ended to iterations, is flexible and can be adapted to other sorts of
input at each stage. Ab initio models can be generated by elimination of the
threading stage, or BLAST and ELISA and other fold-selection criteria can be
substituted for threading as the input to the model-building stage [8].
1. Shakhnovich B. et al. (2002) Functional fingerprints of folds: evidence for
correlated structure-function evolution. (JMB, submitted).
2. Mirny L.A. & Shakhnovich E.I. (1998) Protein structure prediction by
threading. Why it works and why it does not. J. Mol. Biol. 283 (2), 507-526.
3. Chen W. et al. (2002) Fold recognition with minimal gaps. (submitted).
4. Mirny L.A. et al. (2000) Statistical significance of protein structure
prediction by threading. Proc. Natl. Acad. Sci. 97 (18), 9978-9983.
5. Kussell E. et al. (2002) A structure-based method for derivation of
all-atom potentials for protein folding. Proc. Natl. Acad. Sci. 99 (8), 5343-5348.
6. Shimada J. et al. (2001) The folding thermodynamics and kinetics of
Crambin using an all-atom Monte Carlo simulation. J. Mol. Biol. 308 (1),
79-95.
7. Kussell E. et al. (2001) Excluded volume in protein side-chain packing.
J. Mol. Biol. 311 (1), 183-193.
8. Shakhnovich B. et al. (2002) In preparation.
SHESTOPALOV (P0044) - 159 predictions: 79 3D, 80 SS
Protein Fold Recognition and Secondary Structure Prediction
by Doublet Code Method in the CASP5 Experiment
B.V. Shestopalov, G.R. Mavropulo-Stolyarenko, A.M. Lebedev
Institute of Cytology, Russian Academy of Sciences
shest@mail.cytspb.rssi.ru
We present a new version of the doublet code model of protein secondary
structure and its application to fold recognition and secondary structure
prediction in the CASP5 experiment. The basis of the model is described in
[1-2]. This version allows more accurate prediction (by about 4-5%). A
prediction is obtained for 98% of the protein chain. On the basis of the model,
we formulate the hypothesis that the secondary structure predicted by the
doublet code method is the secondary structure of the unfolded state of the
protein chain. On the basis of this hypothesis it is possible to search for
parents of a target using the predicted secondary structures of the target and
of proteins from the Protein Data Bank [3]. If the secondary structures of the
unfolded states are similar, the secondary structures of the folded states and
the three-dimensional structures are also similar. To avoid mistakes in
prediction, the secondary structures of homologs of these proteins are used. For
easy targets, BLAST results [4-5] and the Conserved Domain Database and Search
Service v1.58 [6] are used. Secondary structure has been predicted for all 65
proteins and parents have been suggested for 48 proteins.
1. Shestopalov B.V. (1990) Prediction of protein secondary structure by
doublet code method. Mol. Biol. (Moscow), Engl. Transl. 24 (4) 900-907
2. Shestopalov B.V. (2000) Doublet Code of Protein Secondary Structure and
its application for Secondary Structure Prediction and Fold recognition.
Abstracts submitted to the CASP4 meeting
3. Berman H.M. et al. (2000) The Protein Data Bank. Nucleic Acids Res.
28(1), 235-242
4. Altschul S.F. et al. (1997) Gapped BLAST and PSI-BLAST: a new
generation of protein database search programs. Nucleic Acids Res. 25
(17), 3389-3402
5. Gish W. (1996-1999) http://blast.wustl.edu
6. Marchler-Bauer A. et al. (2002) CDD: a database of conserved domain
alignments with links to domain three-dimensional structure. Nucleic Acids
Research 30 (1), 281-283
Shortle (P0349) - 32 predictions: 32 3D
Protein Structure Prediction Using Fragment Ensembles with
Highly Favorable Ramachandran / Rotamer Propensities
Qiaojun Fang and David Shortle
Department of Biological Chemistry
The Johns Hopkins University School of Medicine
dshortl1@jhmi.edu
Three dimensional models of the backbone plus CB atom for targets in the fold
recognition and new fold categories were constructed in three steps: (i)
prediction of secondary structure; (ii) definition of turn directions between
elements of secondary structure; (iii) recombination of helix/strand – turn –
helix/strand fragments to generate longer pieces (60-120 amino acids) of the
target protein. At each step, extensive use was made of ensembles of fragments
from the PDB to identify the predominant low resolution patterns.
Secondary structure was predicted by threading overlapping segments of 6 to
12 residues from the target sequence through known protein structures,
selecting for conformations that optimize Ramachandran, rotamer, and solvent
exposure propensities [1]. Each position was assigned the most common
secondary structure after averaging over approximately 100 fragments. Final
decisions on ambiguous secondary structure, which could not be resolved by
analysis of a small number of homologues, were often arbitrated by the results
from PSIPRED and other CAFASP servers.
Segments of the target sequence corresponding to helix/strand – turn –
helix/strand elements were threaded through approximately 5000 PDB
structures, selecting for conformations that optimize Ramachandran, rotamer
and solvent exposure propensities and that also have negative energies as
assessed by empirical pair potentials. A small number of homologues (3 to 10)
were also threaded at this step, with clustering of conformations between
conformational sets from different homologous sequences to reduce noise from
individual sequences and to identify the most common turn features (i.e., those
with the highest entropy [2]). A final set of 20 to 40 conformations, recovered
from one or more homologues, was saved for recombination to form larger
fragments.
Conformation sets overlapping by one helix/strand were recombined, selecting
for relatively compact, relatively bump-free chains with low energies (empirical
pair potentials). Visual inspection of superposed sets of recombinants by both
investigators was used extensively to infer the most common topology of these
long chain fragments. In the last step, individual recombinant chains were
manually reworked to reduce bumps, achieve greater compactness, and enforce
protein-like patterns of tertiary interactions within the inferred topology.
1. Shortle D. (2002) Composites of local structural propensities: evidence for
local encoding of long range structure. Protein Science 11, 18-26.
2. Shortle D., Simons K.T., Baker D. (1998) Clustering of low energy
conformations near the native structures of small proteins. PNAS 95,
11158-11162.
sk-lab (P0403) - 2 predictions: 2 3D
Knowledge based consensus method for structure prediction
S. Krishnaswamy, Preeti Mehta, P.D. Kumar,
A.V.S.K. Mohan Katta
Bioinformatics Centre, School of Biotechnology, Madurai Kamaraj
University, Madurai 625 021, India
krishna@mrna.tn.nic.in
The modeling of three-dimensional structures of proteins is rendered difficult
due to the sequence-structure-function degeneracy. Thus the most successful
methods have relied upon comparative modeling techniques [1-2]. These
comparative modeling techniques rely on the convergence of structure based on
sequence identity [3]. The sequence to structure degeneracy has led to the use
of fold prediction or threading methods [4-5]. These methods rely on the idea
that known structures can be used as a knowledge base from which one extracts
information for modeling. There are many examples of groups of proteins with
a similar fold but with no sequence similarity [6]. The method that we have
adopted assumes that sub-structures can be stitched together to form larger
structures [7].
The method we refer to as the knowledge based consensus method was
developed [8] and refined [9] for the prediction of a protein called McrA from
E. coli, whose structure we are in the process of determining. McrA has less
than 20% identity to known proteins in the database. We have used this method
in the CASP5 experiment to predict the structure of two targets, T0192 and
T0176. The modules InsightII, Homology and Discover of the Biosym package
were used for the model building and energy minimization. The CVFF
forcefield available in the Biosym software was used for the energy
minimization. The method involves fragment selection based on searches
against the PDB. The matches are selected not only on the basis of the E-value
but also on predicted secondary structure matches, possible functional similarity
of the template protein and hydrophobic characteristics. The final decision took
into account the need to preserve contiguity of secondary structures and was
arrived at on the basis of a consensus of these selection criteria. Wherever
possible, preference was given to selection from the same set of templates in
order to minimize the number of template structures used. Thus this selection
has a certain amount of subjectivity. The longest region with the best
consensus was chosen as the starting point. A homology-based approach was then
used to assign coordinates to the sub-sequence (termed here a ‘pseudo SCR’) based
on the template structure. This process was continued with the selection of a
new region until all the possible pseudo SCRs were assigned coordinates. Each
time it was ensured that the coordinates of the previous and new fragments
were in the same reference frame. The template was discarded if there were
contact or topology problems and a new template was chosen at the next
consensus level. Once this process was complete, the intervening regions were
assigned coordinates using the loop search algorithm. The joint or splice
regions were repaired, side chain conformations were optimized based on
rotamer libraries and the model was subjected to cycles of energy minimization
using Steepest Descent and Conjugate Gradient methods. The resulting model
was examined on the graphics display for inconsistencies such as buried charged
residues and for its distribution in the allowed regions of the Ramachandran plot.
These were corrected, if possible, or the model was discarded, a different template
structure was chosen for the problem region, and the modeling process was
restarted from that position.
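The steepest-descent stage of the energy minimization mentioned above can be
illustrated on a toy energy function. The sketch below is not the CVFF force
field or the Discover implementation; the quadratic energy, the fixed step size
and all names are assumptions chosen so that the example is easy to verify.

```python
from typing import Callable, List


def steepest_descent(grad: Callable[[List[float]], List[float]],
                     x0: List[float], step: float = 0.1,
                     tol: float = 1e-8, max_iter: int = 10_000) -> List[float]:
    """Follow the negative gradient with a fixed step until it nearly vanishes."""
    x = list(x0)
    for _ in range(max_iter):
        g = grad(x)
        if sum(gi * gi for gi in g) ** 0.5 < tol:
            break
        x = [xi - step * gi for xi, gi in zip(x, g)]
    return x


# Toy energy E(x, y) = (x - 1)^2 + 2 * (y + 2)^2, with its minimum at (1, -2).
def toy_grad(p: List[float]) -> List[float]:
    x, y = p
    return [2.0 * (x - 1.0), 4.0 * (y + 2.0)]


minimum = steepest_descent(toy_grad, [5.0, 5.0])
```

In practice steepest descent is typically used for the first cycles to relieve
bad contacts, with conjugate gradients taking over closer to the minimum.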
1. Johnson N.S. et al (1994) Knowledge based protein modeling. Crit. Rev.
Biochem. Mol. Biol. 29, 1-68.
2. Tramontano A. (1998) Homology modeling with low sequence identity.
Methods: A companion to Methods in Enzymology. 14, 293-300.
3. Chothia C. and Lesk A.M. (1986) The relation between the divergence of
sequence and structure in proteins. EMBO J. 5, 823-826.
4. Jones D.T. et al (1992) A new approach to protein fold recognition. Nature
358, 86-89.
5. Godzik A. et al (1992) Topology fingerprint approach to inverse protein
folding problem. J. Mol. Biol. 227, 227-238.
6. Murzin A.G. et al (1995) SCOP: a structural classification of proteins
database for the investigation of sequences and structures. J. Mol. Biol.
247, 536-540.
7. Jones A.T. and Thirup S. (1986) Using known substructures in protein
model building and crystallography. EMBO J. 5, 819-822.
8. Krishnaswamy S. et al (1995) Knowledge based consensus approach to
molecular modeling of McrA. Protein Science 4 (suppl) 2, 86.
9. Deva T. (2000) Structural analysis of type II restriction endonucleases
and the atypical modified cytosine restriction endonuclease McrA from
E. coli. Ph.D. thesis submitted to Madurai Kamaraj University, Madurai,
India.
Skolnick-Kolinski (P0010) - 361 predictions: 361 3D
TOUCHSTONE: A Unified Approach To Protein Structure
Prediction
Y. Zhang1, A. Arakaki1, D. Kihara1, M. Boniecki2, A. Szilagyi1,
A. Kolinski1,2 and J. Skolnick1
1 – Center of Excellence in Bioinformatics, University at Buffalo,
2 – Faculty of Chemistry, Warsaw University, Poland
skolnick@buffalo.edu
We have applied the TOUCHSTONE [1] folding algorithm that spans the
range from homology modeling to ab initio folding to all the protein targets in
CASP5. Using our threading algorithm PROSPECTOR [2], one first threads
against a representative set of PDB templates. If a template gives a significant hit,
generalized comparative modeling is carried out using a number of variants. These
variants involve freezing the aligned regions either on lattice (the CABS model,
which represents each residue by its Cα, Cβ and side chain center of mass) or off
lattice, and relaxing the remaining structure to
accommodate insertions or deletions with respect to the template. Alternatively,
if multiple templates are identified, both local and long-range distance restraints
are extracted and used in the CABS lattice-based structure assembly algorithm.
The generalized comparative modeling component is designed to span the
range from targets closely related to the template to distantly related ones. If a
significant template is not identified, then consensus contacts from weakly
threading templates are pooled and incorporated into our ab initio folding
algorithm. In addition for both generalized comparative modeling and ab initio
cases, predicted secondary structure from PSIPRED [3] as well as consensus
local distance restraints from PROSPECTOR are used. For ab initio folding,
the CABS model is used exclusively. In all cases, conformational space is
sampled by replica exchange Monte Carlo [1,4-5]. The resulting structures are
clustered [6-7] and ranked according to cluster diversity, population and where
applicable functional considerations. On this basis, the top five candidates are
submitted to CASP.
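The replica exchange step mentioned above can be illustrated with a minimal sketch. It assumes the standard Metropolis exchange criterion between neighbouring temperatures, with the Boltzmann constant folded into the temperature units. The function names and the toy energies are hypothetical and are not taken from the TOUCHSTONE code.

```python
import math
import random


def swap_probability(energy_i, energy_j, temp_i, temp_j):
    """Acceptance probability for exchanging the configurations held at
    temperatures temp_i and temp_j: min(1, exp((1/T_i - 1/T_j) * (E_i - E_j)))."""
    delta = (1.0 / temp_i - 1.0 / temp_j) * (energy_i - energy_j)
    return 1.0 if delta >= 0 else math.exp(delta)


def attempt_swaps(energies, temperatures, rng):
    """One sweep of neighbour exchanges along the temperature ladder.

    energies[c] is the energy of configuration c; the returned list gives,
    for each temperature slot, the index of the configuration it now holds.
    """
    order = list(range(len(energies)))
    for k in range(len(order) - 1):
        a, b = order[k], order[k + 1]
        p = swap_probability(energies[a], energies[b],
                             temperatures[k], temperatures[k + 1])
        if rng.random() < p:
            order[k], order[k + 1] = b, a
    return order


if __name__ == "__main__":
    rng = random.Random(0)
    # Toy ladder of four replicas (hypothetical values).
    print(attempt_swaps([-50.0, -62.0, -40.0, -45.0], [1.0, 1.3, 1.7, 2.2], rng))
```

A swap that moves the lower-energy configuration to the colder replica is always accepted; the reverse move is accepted with a Boltzmann-weighted probability, which lets trapped low-temperature replicas escape via the high-temperature ones.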
1. Kihara D. et al. (2001) TOUCHSTONE: an ab initio protein structure
prediction method that uses threading-based tertiary restraints. Proc Natl
Acad Sci U S A 98(18), 10125-10130.
2. Skolnick J. and Kihara D. (2001) Defrosting the frozen approximation:
PROSPECTOR--a new approach to threading. Proteins 42(3), 319-331.
3. McGuffin L.J., Bryson K., and Jones D.T. (2000) The PSIPRED
protein structure prediction server. Bioinformatics 16(4), 404-405.
4. Swendsen R.H. and Wang J.S. (1986) Replica Monte Carlo
simulations. Phys. Rev. Lett. 57, 2607-2609.
5. Ferrenberg A.M. and Swendsen R.H. (1988) New Monte Carlo
technique for studying phase transitions. Phys. Rev. Lett. 61, 2635-2637.
6. Betancourt M.R. and Skolnick J. (2001) Universal similarity measure
for comparing protein structures. Biopolymers 59(5), 305-309.
7. Betancourt M.R. and Skolnick J. (2001) Finding the needle in a
haystack: Educing native folds from ambiguous ab initio protein structure
prediction. J Comput Chem. 22, 339-353.
SMD-CCS (P0249) - 4 predictions: 4 3D
Protein Modeling at CASP5
F. Giordanetto1, M. Saqi2, S. Jha1 and P.V. Coveney1
1 – Centre for Computational Sciences, Department of Chemistry, Queen Mary,
University of London, Mile End Road, E1 4NS, London,
2 – Bioinformatics, Dept. of Medical Microbiology, Barts and The London,
Queen Mary’s School of Medicine and Dentistry,
University of London, 32 Newark St., London E1 2AA
f.giordanetto@qmul.ac.uk
PSI-BLAST [1] and GenTHREADER [2] were employed in order to identify
possible three-dimensional templates for the target sequences. Search and
evaluation of structural neighbours and structural comparison was carried out
with DALI [3]. Multiple sequence alignments between the target and the
probable templates were performed using T-COFFEE [4] and CLUSTALW [5].
Secondary structure predictions were carried out using PSIpred V2.0 [6] and
PredictProtein [7]. Three-dimensional models of the targets were built using
MODELLER 6v2 [8]. Loop fragments or uncertain regions arising from the
alignment were sampled using the loop-searching routines implemented in
MODELLER [9].
All the molecular mechanics calculations have been performed employing the
Amber 98 force field [10], as previously ported to the Large-scale
Atomistic/Molecular Massively Parallel Simulator (LAMMPS) [11]. The final
homology-built structures were subjected to 1000 steps of energy minimization
in vacuo. Subsequently, the systems were neutralized by adding sodium
counter-ions and solvated by TIP3P water. Solvent and ions were
energy-minimized and then evolved using molecular dynamics for 20 ps holding the
protein atoms fixed. Both solvent and solute were energy-minimized again and
sampled for 300 ps with positional constraints on the main chain atoms of the
residues which displayed a “good” alignment with the template structures.
Structural evaluation of the overall three-dimensional structures was
accomplished using the package Procheck [12].
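As an illustration of restraining main chain atoms during sampling, the sketch below evaluates a harmonic positional restraint. The harmonic form, the force constant and the function name are assumptions made for this example; the abstract does not state how the constraints were implemented in LAMMPS.

```python
def restraint_energy_and_forces(positions, reference, restrained, k=1.0):
    """Harmonic positional restraint E = 0.5 * k * |r - r_ref|^2 per atom.

    positions, reference: lists of (x, y, z) tuples.
    restrained: indices of the atoms to restrain (e.g. main chain atoms
    in well-aligned residues). Returns (total energy, per-atom forces).
    """
    energy = 0.0
    forces = [(0.0, 0.0, 0.0)] * len(positions)
    for i in restrained:
        delta = [p - r for p, r in zip(positions[i], reference[i])]
        energy += 0.5 * k * sum(d * d for d in delta)
        # Force is the negative gradient: it pulls the atom back to r_ref.
        forces[i] = tuple(-k * d for d in delta)
    return energy, forces


if __name__ == "__main__":
    pos = [(1.0, 0.0, 0.0), (0.3, 0.2, 0.1)]
    ref = [(0.0, 0.0, 0.0), (0.0, 0.0, 0.0)]
    # Only atom 0 is restrained; atom 1 moves freely.
    print(restraint_energy_and_forces(pos, ref, restrained=[0], k=2.0))
```

Atoms outside the restrained set receive zero restraint force, so loops and poorly aligned regions remain free to relax.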
1. Altschul S.F. et al. (1997) Gapped BLAST and PSI-BLAST: a new
generation of protein database search programs. Nucleic Acids Res. 25
(17), 3389-3402.
2. Jones D.T. (1999) GenTHREADER: An efficient and reliable protein fold
recognition method for genomic sequences. J. Mol. Biol. 287, 797-815.
3. Holm L. et al. (1993) Protein structure comparison by alignment of
distance matrices. J. Mol. Biol. 233, 123-128.
4. Notredame C. et al. (2000) T-Coffee: A novel method for multiple
sequence alignments. J. Mol. Biol. 302, 205-217.
5. Higgins D. et al. (1994) Improving the sensitivity of progressive multiple
sequence alignment through sequence weighting, position-specific gap
penalties and weight matrix choice. Nucleic Acids Res. 22, 4673-4680.
6. Jones D.T. (1999) Protein secondary structure prediction based on
position-specific scoring matrices. J. Mol. Biol. 292 (2), 195-202.
7. Rost B. et al. (1993) Prediction of protein secondary structure at better than
70% accuracy. J. Mol. Biol. 232, 584-599.
8. Šali A. et al. (1993) Comparative protein modelling by satisfaction of
spatial restraints. J. Mol. Biol. 234, 779-815.
9. Fiser A. et al. (2000) Modeling of loops in protein structures. Protein Sci.
9, 1753-1773.
10. Cornell W.D. et al. (1995) A second generation force field for the
simulation of proteins, nucleic acids and organic molecules. J. Am. Chem.
Soc. 117, 5179-5197.
11. Plimpton S.J. et al. (1996) A New Parallel Method for Molecular
Dynamics Simulation of Macromolecular Systems. J. Comput. Chem. 17,
326-337.
12. Laskowski R.A. et al. (1993) PROCHECK: a program to check the
stereochemical quality of protein structures. J. Appl. Cryst. 26, 283-291.
Solovyev-Softberry (P0270) - 242 predictions: 177 3D, 65 SS
SoftPM: Softberry tools for protein structure modelling
V. Solovyev, D. Affonnikov, A. Bachinsky, I. Titov Ivanisenko
and Y. Vorobjev
Softberry Inc., 116 Radio Circle, Suite 400
Mount Kisco, NY 10549, USA
victor@softberry.com
SoftPM (Software for protein modeling and prediction of 3D protein
structure), a suite of new programs, has been developed recently by the Softberry
research team (www.softberry.com). It includes the Ffold, Getatoms, Hmod3Dmm,
Hmod3Dmd, Cover3D and Abini3D programs and an upgraded version of the
SSPALm program developed earlier. The programs were designed to cover all
aspects of analyzing new sequences and elucidating their 3D structures.
SSPALm (Secondary Structure Prediction by Alignments) is a new version of
the secondary structure prediction program SSPAL, which is based on the
nearest-neighbor approach. This m-version uses local alignments [1] with a
non-redundant database of ~4000 proteins of known tertiary structure and takes
a multiple alignment of the target sequence as input. The knowledge database of
secondary structure and environment parameters for known 3D structures was
computed by the Softberry SSENVID program. Ffold (Find fold) is a fold recognition
program that identifies a ranked list of structurally closest proteins with known
3D structure by aligning target sequence with a database of these proteins.
Alignment score is a combination of environment potential, secondary structure
(predicted by SSPALm) and amino acid sequence similarity. Getatoms is a
program for modeling atomic coordinates of a protein with unknown 3D
structure. It uses main chain coordinates from the 3D structure of a similar protein
whose sequence is aligned with that of the query protein. Restoration of
loops in alignment will be added later. The program has an option to provide
coordinates of H-atoms. Getatoms computes 3D coordinates of a query protein
and estimates quality of produced 3D structure using several scores. Initially,
Getatoms selects most typical conformations of side chains, and then the
conformations are optimized using a soft steric potential. Using Monte Carlo
generation of initial coordinates and a set of rules, Getatoms can restore loop
structures and adjust gap boundaries. Hmod3Dmm (Homology MODeling of
3D with Molecular Mechanics) finds the minimum-energy geometry of
a protein structure derived by Getatoms on the basis of a similar protein. It uses
an AMBER-like force field and the conjugate-gradient method of energy
optimization. The program is useful for removing large forces on atoms
before applying molecular dynamics optimization programs. The current version
of Hmod3Dmm includes a model of the water environment in the energy
computation. Hmod3Dmd (Homology MODeling of 3D with Molecular
Dynamics) does final refinement of the protein structure via the MD simulation
of the protein model structure in an implicit solvent with a simulated annealing
protocol. The AMBER force field [2] was used to calculate internal protein
energy, i.e. covalent bond/angle deformation torsion and improper torsion
energies, the van der Waals and electrostatic non-bonded interactions. The
water solvent has been modeled implicitly via the solvation energy density
model of Lazaridis and Karplus [3]. The final protein models have been ranked
according to total free energy in the implicit solvent, which has been calculated
with averaging taken for a series of snapshots. Cover3D uses Ffold results to
generate coverage of target sequence by similar protein fragments with known
3D structure. It outputs several variants of such coverage to be used in Abini3D
to compute a putative 3D model of target sequence. Abini3D finds optimal
conformation of a set of 3D-fragments representing target sequence. It uses
simplified model of amino acid residues and contact potentials derived from
statistics on known tertiary structures.
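The ranking of final models by snapshot-averaged energy described above for Hmod3Dmd can be sketched as follows. The data layout, function name and energy values are hypothetical; the sketch only illustrates averaging over snapshots and sorting, not how the free energies themselves are computed.

```python
from statistics import mean


def rank_models(snapshot_energies):
    """Rank models by their mean energy over MD snapshots, lowest first.

    snapshot_energies maps a model name to the list of total free energies
    (protein plus implicit solvent) evaluated on its snapshots.
    """
    averages = {name: mean(values) for name, values in snapshot_energies.items()}
    return sorted(averages, key=averages.get)


if __name__ == "__main__":
    # Hypothetical energies for three candidate models.
    energies = {
        "model_a": [-1510.2, -1498.7, -1505.1],
        "model_b": [-1532.9, -1527.4, -1530.0],
        "model_c": [-1489.3, -1493.8],
    }
    print(rank_models(energies))
```

Averaging over several snapshots reduces the sensitivity of the ranking to the thermal fluctuation of any single structure.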
Any CASP5 target sequence with similarity to a known 3D structure (found
by the Ffold program) was modeled with the Getatoms program. We then used
Hmod3Dmm and, in many cases, Hmod3Dmd to generate the submitted structure.
Other target sequences (for which no significant long similarity was found) were
analyzed with the Cover3D program. We then applied the Getatoms and,
in most cases, Hmod3Dmm programs to generate the submitted 3D coordinates.
1. Salamov A.A. and Solovyev V.V. (1997) Protein secondary structure
prediction using local alignments. J. Mol. Biol. 268 (1), 31-36.
2. Cornell W. et al. (1995) A second generation force field for the simulation
of proteins, nucleic acids and organic molecules. J. Am. Chem. Soc. 117,
5179-5197.
3. Lazaridis T. and Karplus M. (1999) Effective energy function for proteins in
solution. Proteins 35, 133-152.
4. Vorobjev Y.N., Almagro J.C. and Hermans J. (1998) Discrimination between
native and intentionally misfolded conformations of proteins. Proteins 32,
399-413.
SPAM1 (P0400) - 87 predictions: 87 3D
Protein Structure Prediction Using Multiple Methods in the
Advanced Selectivity/Sensitivity Benchmarking Protocol
S. Veretnik1, W. Li1, P.E. Bourne1,2 and I.N. Shindyalov1
1 – San Diego Supercomputer Center, UCSD, MC0537, 9500 Gilman Dr, La
Jolla, CA 92093-0537, 2 – Department of Pharmacology UCSD, MC0537,
9500 Gilman Dr, La Jolla, CA 92093-0537
shindyal@sdsc.edu
We present SPAM1 (“Systematic Protein Annotation and Modeling 1”), a new
approach to structure prediction. SPAM1 is a protein annotation pipeline
comprising a number of structure prediction methods and a rigorous
benchmarking technology [1]. SPAM1 was developed for use in genome
annotation. Predictions obtained from SPAM1 were further refined by human
experts on the basis of biological systematics and functional properties, and by
combining non-overlapping predictions (Fig 1).
The following methods were incorporated into the pipeline: WU-BLAST [2],
NCBI-BLAST [3], PSIBLAST [3], 123D [4], TMHMM [5], COILS [6],
SIGNALP [7].
Recognition of sequence similarity between an uncharacterized protein (target)
and a characterized protein (template) is the main principle of structure
prediction. The methods listed above characterize the reliability of a similarity
between target and template by estimating the probability that such a
similarity occurs by chance. The principal problem is that the statistical models
embedded in these methods do not adequately describe the actual statistics of the
relationship between the targets and templates involved in annotation. Thus,
so-called “recommended” thresholds, e.g. BLAST e-values, are often used.
Sometimes thresholds are obtained from a benchmark performed on some
“gold standard” that is unrelated to the real targets and templates that are
subsequently used. In [1] it was demonstrated how misleading reliability
estimates can be when they rely on these concepts. A new
approach was introduced in [1] for benchmarking the methods used in annotation,
providing substantially more accurate reliability estimates based on the
principle of prediction consistency evaluated for a given library of templates
and targets (typically the complete set of proteins from a given genome).
1. Alexandrov N.N. et al. Reliability of sequence comparison assessed by
functional, structural, and expression benchmarks, in preparation.
2. Gish W. and States D.T. (1993) Identification of protein coding regions by
database similarity search. Nature Genetics 3, 266-72.
3. Altschul S.F. et al. (1997) Gapped BLAST and PSI-BLAST: a new
generation of protein database search programs. Nucleic Acids Res. 25
(17), 3389-3402.
4. Alexandrov N.N. and Luethy R. (1998) Alignment algorithm for homology
modeling and threading. Protein Sci. 7(2), 254-258.
5. Sonnhammer E.L. et al. (1998) A hidden Markov model for predicting
transmembrane helices in protein sequences. Proc Int Conf Intell
Syst Mol Biol., 175-182.
6. Lupas A. et al. (1991) Predicting Coiled Coils from Protein Sequences.
Science 252, 1162-1164.
7. Nielsen H. et al. (1997) Identification of prokaryotic and eukaryotic signal
peptides and prediction of their cleavage sites. Protein Engineering 10, 1-6.
Figure 1. Predictions and their reliabilities
(A-0.999, B-0.99, C-0.9, D-0.5, E-0.1) by SPAM1 for CASP5 targets.
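The reliability classes named in the caption of Figure 1 can be read as a simple lookup, sketched below. The sketch assumes that each value in the caption is the lower bound of its class; the abstract does not say so explicitly, and the function name is hypothetical.

```python
# Class labels and values as listed in the Figure 1 caption.
# Assumption: each value is the minimum reliability for that class.
RELIABILITY_CLASSES = [("A", 0.999), ("B", 0.99), ("C", 0.9), ("D", 0.5), ("E", 0.1)]


def reliability_class(reliability):
    """Return the best class whose threshold the reliability reaches,
    or None if it falls below the lowest class."""
    for label, threshold in RELIABILITY_CLASSES:
        if reliability >= threshold:
            return label
    return None


if __name__ == "__main__":
    for value in (0.9995, 0.95, 0.5, 0.05):
        print(value, reliability_class(value))
```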
SRBI (P0331) - 109 predictions: 109 3D
Bayesian Fold Recognition
P. Cherukuri, G. McAllister, and J. Bienkowska
Serono Reproductive Biology Institute
One Technology Pl., Rockland, MA 02370
Jadwiga.Bienkowska@serono.com
Our method uses a set of structural Hidden Markov Models automatically
designed for protein domains present in SCOP. Models for all non-redundant
domains are built and grouped according to the fold classification. All models
representing a given fold constitute a fold model. Bayesian statistics is used for
solving the first problem of protein structure prediction: the recognition of the
correct fold for a given sequence. The probability of observing a given
structural model for a sequence is not associated with the lowest free energy.
According to Bayes, fold recognition is measured by an a posteriori probability
of a model given the query sequence. In our approach alternative models of
protein structure are regarded as generators of a protein sequence and for a
given model the a priori probability of generating a sequence is equal to a sum
over probabilities of all sequence-to-model alignments [1]. Fold recognition is
reported only if the top ranking fold has a posterior probability higher than 0.5.
The next 4 alternative folds are also reported if their probability is higher than
0.01. Once the fold model is identified for a sequence, the optimal alignment to
the sequence of the target functional domain is generated using the sequence
profile alignment software PIMA [2]. Sequence profiles for each functional
domain are generated automatically by selecting a set of diverse functional
homologs (profile defining set) and creating a profile using PIMA. We add the
query sequence to the profile defining set and generate the alignment. In case
this attempt at generating the alignment fails, we align just a pair of sequences
– the query sequence and the target sequence.
1. Bienkowska J., He H., Rogers R.G. Jr. and Yu L. (2002) Bayesian
Approach to Fold Recognition. Protein Structure Prediction:
Bioinformatics Approach, ed. I. Tsigielny. IUL
2. Das S. and Smith T.F. (2000) Identifying Nature’s Protein Lego Set in
Analysis of Amino Acid Sequences. Adv. Prot. Chem. 54, 159-183.
Sternberg (P0105) - 71 predictions: 71 3D
Fold Recognition Using 3D-PSSM and Human Intervention
and Its Application to Comparative Modeling
L.A. Kelley and M.J.E. Sternberg
Structural Bioinformatics Group, Imperial College of Science,
Technology and Medicine, London, United Kingdom
m.sternberg@ic.ac.uk
Our program 3D-PSSM [1-2] was developed for fold recognition and our main
objective at CASP5 was to test its performance both as a fully automated server
and in combination with human intervention. In addition, fold recognition
methods were observed at CASP4 to generate alignments of accuracy comparable
to those from comparative modeling methods; consequently
we used 3D-PSSM to generate models for comparative modeling
at CASP5. The methodologies we have used for each type of target overlap
significantly. Regardless of target type we initially run the target sequence
through our fold recognition system 3D-PSSM [1-2] which uses a weekly
updated representative fold library containing approximately 8000 structures or
domains at the time of writing.
The 3D-PSSM hits are examined for highly confident or high sequence identity
hits. If a structural template has been confidently found by the PSI-Blast [3]
component of 3D-PSSM it is treated as a comparative modeling target.
Otherwise it is treated as a fold recognition target. In addition, matches over
subsequences of the target are examined and the target is manually chopped
into separate domains if required. Each domain is subsequently treated and
modeled separately, with the exception of cases where a highly similar protein
with the same sequence of domains is present in the structure database. Often
such determination of domain boundaries is guided by either PSI-Blast multiple
alignments or highly confident hits from 3D-PSSM over a region of a target
sequence.
Comparative modeling targets: The specific template used for modeling is
chosen by manually evaluating the length and quality of the alignment and the
percentage sequence identity between template and target. Targets are also
scanned against the non-redundant sequence database to detect closer
homologues not present in the 3D-PSSM fold library. The alignment is adjusted
as described later in this abstract, and insertions and deletions are treated using
the Loopy [4] algorithm. Sidechains are automatically modeled using SCWRL
[5]. Generally, large insertions are not modeled as the accuracy of modeled
loops decays rapidly with length.
In the relatively few cases in which we were presented with more than one high
scoring, or otherwise viable template from the same fold or superfamily, we
would analyze and interactively adjust the alignment to each template, looking
for the template that fulfills as many of the above criteria as possible.
Fold recognition: In cases where no confident 3D-PSSM hit has been detected,
the top 20 highest scoring 3D-PSSM hits are manually examined, homologues
of the target are submitted to the server, and the results from other automatic
servers participating in the CAFASP experiment are investigated.
Importantly, we would make judgments about lower scoring matches from the
3D-PSSM top 20 based on the SAWTED [6] text score and on the keywords
shared between query and template. Although this feature is automatically
included in the server results, below threshold SAWTED scores were often
taken into account when choosing a fold or superfamily. Also, literature related
to the target sequence and potential structural templates was retrieved and
examined.
Once a fold had been chosen on the basis of the above analysis, the automatic
alignments produced by the 3D-PSSM server were often manually adjusted to
meet a variety of criteria:
1 Maintenance of a hydrophobic core based on three-dimensional models
generated from the alignments.
2 Equivalencing of known core residues (as precalculated by using a mutual
contact algorithm) with hydrophobic residues in the target.
3 Preservation of the continuity of secondary structure elements.
4 Maintenance of the spatial arrangements of residues suspected to form the
active site.
5 Alignment of known motifs (such as the Walker A and B motifs in P-loops,
or known conserved residue types in specific folds as determined by literature
searches).
6 Maintenance of the spatial distances between cysteine residues believed to
form disulfide bridges.
1. Kelley L.A. et al. (2000) Enhanced genome annotation using structural
profiles in the program 3D-PSSM. J. Mol. Biol. 299(2), 501-522.
2. Bates P.A. et al. (2001) Enhancement of protein modeling by human
intervention in applying the automatic programs 3D-Jigsaw and 3D-PSSM.
Proteins Suppl. 5, 39-46.
3. Altschul S.F. et al. (1997) Gapped BLAST and PSI-BLAST: a new
generation of protein database search programs. Nucleic Acids Res. 25
(17), 3389-3402.
4. Xiang Z. and Honig B. (2002) Evaluating configurational free energies: the
colony energy concept and its application to the problem of protein loop
prediction. Proc. Natl. Acad. Sci. USA. 99(11), 7432-7437.
5. Bower M. et al. (1997) Sidechain prediction from a backbone-dependent
rotamer library: A new tool for homology modeling. J. Mol. Biol. 267,
1268-1282.
6. MacCallum R.M. et al. (2000) SAWTED: Structure assignment with text
description - enhanced detection of remote homologues with automated
SWISS-PROT annotation comparisons. Bioinformatics 16(2), 125-129.
SUNDARAMS (P0381) – 0 predictions
Protein 3D Structure from Primary Sequence Data
(A Constrained Simulated Annealing Approach)
K. Sundaram1 and Shyam Sundaram2
1 – S.A. Engineering College, Chennai 600077, India,
2 – Bioinformatics Developer, Virginia, USA
ksundaram@vsnl.com
Our objective is to derive the full three-dimensional structure of a protein
including H atoms from sequence information. The distinctive feature of our
approach is the belief that the native 3D fold of a protein is largely determined
by the geometrical constraints imposed by chain connectivity, compactness,
and the avoidance of steric clashes, etc. This view is supported by a couple of
recent studies [1-2]. Our technique to nudge the protein to a compact 3D
fold in the refinement process would be to constrain it within a rigid enclosure
of appropriate size to be determined by trial and error. Water and/or CCl4
molecules will also be floated around to fill the gaps and simulate the polar and
non-polar environments in the native cellular environment.
As an alternative to large-scale simulation using supercomputer power or
distributed computing over the Internet, we have tried to first derive a sampling
of tangible structures that can be used as initial states in short sequence Monte
Carlo simulations. In this process we make full use of the renowned services
for the prediction of secondary structures, motifs, domains, etc. Typically, we
have first used the residue sequence to query PROF [3-4] to generate a
companion conformation sequence file consisting of letters E, H, or L to
represent the local conformation at each α carbon atom. As the two linear
sequences pass through (in tandem and residue by residue) specially designed
interactive modeling software, the φ,ψ angle pair chosen at each residue
position is presented on a Ramachandran plot template. Similarly, slider bars
appear for choosing the χ angles appropriate for the residue in question. The
initial φ,ψ pair is chosen randomly within the specified region (H, E, or L).
The chosen angles can be varied manually by clicking on the desired point or
by pressing one of several buttons that choose the best energy based position,
relaxing the side-chain alone, the residue alone, or the whole molecule. The
total non-bonded interaction energy and a 3D model are presented to aid in the
choice of a desired conformation. Among the initial structures generated as
described here, information derived from predictions of motifs [5] and domains
[6-7] as well as other intuitive deductions based on protein function, etc., will
be used to shortlist probable candidate structures.
The shortlisted structures will be subjected to Monte Carlo simulation within a
rigid spherical enclosure. In the simulation module sophisticated potential
functions including valence electronic polarizabilities are used, but some of the
energy components can be selectively turned off for computational efficiency
in the initial stages, or if they are found not to influence the protein fold
significantly.
We have been using this method to derive structures for CASP5 targets T0180,
T0188, and T0190 and TMW target 8.
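The random choice of initial backbone angles within a predicted region (H, E, or L) can be sketched as below. The rectangular region boundaries are rough illustrative values assumed for this example, and the function name is hypothetical; the abstract does not specify the regions used by the modeling software.

```python
import random

# Assumed, approximate (phi, psi) boxes in degrees for each state.
REGIONS = {
    "H": ((-90.0, -30.0), (-70.0, -10.0)),     # right-handed helix
    "E": ((-160.0, -60.0), (100.0, 170.0)),    # extended strand
    "L": ((-180.0, 180.0), (-180.0, 180.0)),   # loop: unrestricted here
}


def initial_torsions(conformation, rng):
    """Pick a random (phi, psi) pair per residue inside the box of its
    predicted state; conformation is a string over the letters H, E, L."""
    torsions = []
    for state in conformation:
        (phi_lo, phi_hi), (psi_lo, psi_hi) = REGIONS[state]
        torsions.append((rng.uniform(phi_lo, phi_hi),
                         rng.uniform(psi_lo, psi_hi)))
    return torsions


if __name__ == "__main__":
    rng = random.Random(1)
    for state, (phi, psi) in zip("LHHHHLEEEL", initial_torsions("LHHHHLEEEL", rng)):
        print(f"{state}  phi={phi:7.1f}  psi={psi:7.1f}")
```

Each call with a different seed gives a different starting structure consistent with the same predicted conformation string, which is one way to obtain a sampling of initial states.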
1. Banavar J.R. et al. (2002) Geometry and physics of proteins. Proteins 47,
315-322.
2. Hartl F.U. (1996) Molecular chaperones in cellular protein folding. Nature
381, 571-580.
3. Rost B. and Sander C. (1993) Prediction of protein secondary structure at
better than 70% accuracy. J Mol Biol. 232, 584-599.
4. Rost B. et al. (1996) Topology prediction for helical transmembrane
proteins at 86% accuracy. Prot Science, 7, 1704-1718.
5. Hofmann K. et al. (1999) The PROSITE database, its status in 1999.
Nucleic Acids Res. 27, 215-219.
6. Corpet F. et al. (1998) The ProDom database of protein domain families.
Nucleic Acids Res. 26, 323-326.
7. Corpet F. et al. (2000) ProDom and ProDom-CG: tools for protein domain
analysis and whole genome comparisons. Nucleic Acids Res. 28, 267-269.
SUPERFAMILY (P0065) - 925 predictions: 925 3D
Structural Domain Predictions For All Genomes
J. Gough
Structural Biology, School of Medicine,
Stanford University, CA94305-5126, U.S.A.
gough@stanford.edu
The SUPERFAMILY library of hidden Markov models (HMMs) [1-2] was
designed to provide structural assignments to protein sequences on a genomic
scale, and the extensive web site [http://supfam.org] aims to facilitate an
analysis of the results. The library has been applied to all completely
sequenced genomes and other large data sets, such as SwissProt + TrEMBL and
nrdb90.
An obvious by-product of the work is three-dimensional structure prediction,
which also offers the possibility of independent assessment and comparison to
other methods. The predictions submitted to CASP were obtained by a method
identical to that used for genome assignments, and are therefore indicative of
their quality. However, since only significant hits are used in genome
assignments, and since the assignments require an overall error rate of less than
1%, the CASP server was “forced” to produce several models for each target,
regardless of the confidence score.
The full details of how the model library was created are described elsewhere
[1], but it should be noted here that the procedure uses the latest public release
of the SAM package of profile HMM programs [3]. The library is generated
with expert intervention and offers advantages over the default SAM T99
procedure included in the release, but does not contain improvements due to the
more advanced but as yet unreleased versions of the software. The
SUPERFAMILY database is also based on the SCOP classification of proteins
[4] and so structures added to the PDB since the latest release (1.59) are not
included.
Because this server was designed for genome analysis, the underlying method
satisfies three criteria that go beyond what is required of other CASP methods:
it is fast and can be used on large datasets; it can deal with multi-domain
proteins, including domains that are non-contiguous in their gene sequence, and
is robust with respect to gene prediction errors; it has a reliable confidence
score. In addition, the website provides a platform for browsing sequence
alignments and comparing the distributions of protein families and their domain
combinations across genomes. The server has been used in the annotation of
several genomes and a number of other research projects of biological
importance.
The database and model library are available for download from the web site.
A development version of the next generation server was also submitted to
CASP as SUPERFAMILY profile-profile.
1. Gough J. et al. (2001) Assignment of genome sequences using a library of
hidden Markov models that represent all proteins of known structure. J.
Mol. Biol. 313 (4), 903-919.
2. Gough J. and Chothia C. (2002) SUPERFAMILY: HMMs representing all
proteins of known structure. SCOP sequence searches, alignments, and
genome assignments. Nucl. Acids Res. 30 (1), 268-272.
3. Karplus K. et al. (1998) Hidden Markov models for detecting remote
protein homologies. Bioinformatics. 14 (10), 846-856.
4. Murzin A. et al. (1995) SCOP: a structural classification of proteins
database for the investigation of sequences and structures. J. Mol. Biol. 247
(4), 536-540.
SUPFAM_PP (P0086) - 728 predictions: 728 3D
The Next Generation of Structural Genome Analysis
J. Gough1 and M. Madera2
1 - Structural Biology, School of Medicine, Stanford University, CA94305-5126,
U.S.A., 2 - Structural Studies, MRC Laboratory of Molecular Biology,
Cambridge, CB2 2QH, UK
gough@stanford.edu
This automatic server is a development version of the next-generation
replacement for SUPERFAMILY [1-2] (which was also submitted to CASP,
see the corresponding abstract). Although it is currently based on the same
library of profile hidden Markov models (HMMs), this server employs a
number of new techniques that make it fundamentally different from the
production server. However, since it is intended for genome analysis, the
modifications have been restricted to those techniques that can realistically be
applied on a genomic scale.
We are planning to incorporate two key enhancements into the next generation
of our server. Firstly, we intend to improve remote homology recognition by
comparing profiles to profiles (rather than profiles to sequences) in a manner
pioneered by [3], and secondly, we aim to provide additional biological
information via a family-level classification of the query sequence.
Development versions of both enhancements have been implemented in this
server.
Regarding the first enhancement, in the current production version of
SUPERFAMILY a query sequence is searched directly against a library of
profile HMMs that represent all proteins of known structure. This model
library is pre-generated using expert intervention, which is feasible because of
the limited number of structural superfamilies. By contrast, this CASP server
first uses an automated method to build a profile HMM from the query
sequence, and then compares the profile HMM (rather than the query
sequence) to all models in the library. Because it is built using an automated
method, the query profile may not be as good as models in the curated library,
but the extra information contained in the profile is enough to achieve a marked
improvement in performance.
As far as family-level classification is concerned, SUPERFAMILY was
originally designed to provide a structural classification of protein domains at
the SCOP [4] superfamily level. However, most large superfamilies are diverse
and contain a number of distinct families with different biological functions.
From the point of view of sequence annotations it would therefore be
exceedingly useful if we could determine the family of the query sequence, in
addition to its superfamily. Our approach to this problem was motivated by the
following question: “Given that a domain belongs to a particular superfamily,
which is the most similar structure?” The SUPERFAMILY profile HMMs are
poorly suited to answering this question because they aim to represent the entire
superfamily. On the other hand, pairwise sequence comparison methods are
often unable to detect distant similarities, including ones within more divergent
families. The method used here was therefore a hybrid of the two. It is based
on a direct comparison of a pair of sequences, but uses a profile HMM as a
guide. The profile provides an alignment of the two sequences, but the
alignment is scored using a conventional substitution matrix and gap penalties.
However, to capture more of the information contained in the profile, the scores
at each position are weighted by the degree of conservation shown by the
profile.
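The hybrid scoring described above can be illustrated with a minimal sketch. It assumes the profile has already placed both sequences in the same columns, so the two inputs are gapped strings of equal length. The function name, the toy match/mismatch scores, the gap penalty values and the choice to leave gap penalties unweighted are all assumptions made for illustration; the abstract does not specify them.

```python
from typing import Callable, Sequence


def toy_substitution(x: str, y: str) -> float:
    """Stand-in for a conventional substitution matrix (hypothetical values)."""
    return 2.0 if x == y else -1.0


def profile_guided_score(
    a: str,
    b: str,
    conservation: Sequence[float],
    subst: Callable[[str, str], float] = toy_substitution,
    gap_open: float = -4.0,
    gap_extend: float = -1.0,
) -> float:
    """Score a profile-induced pairwise alignment.

    a, b          gapped sequences aligned to the same profile columns
    conservation  per-column weight taken from the profile (assumed in [0, 1])
    Substitution scores are multiplied by the column weight; gap penalties
    are applied unweighted (an assumption of this sketch).
    """
    if not (len(a) == len(b) == len(conservation)):
        raise ValueError("sequences and weights must have equal length")
    score = 0.0
    in_gap = False
    for x, y, w in zip(a, b, conservation):
        if x == "-" and y == "-":
            continue  # column unused by both sequences
        if x == "-" or y == "-":
            score += gap_extend if in_gap else gap_open
            in_gap = True
        else:
            score += w * subst(x, y)
            in_gap = False
    return score
```

Under these assumptions, a family assignment would pick the family member of the matched superfamily that gives the highest score against the query.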
The novel tools used by this server are currently under development, but will be
made public and applied to genome analysis in the near future.
1. Gough J. et al. (2001) Assignment of genome sequences using a library of
hidden Markov models that represent all proteins of known structure. J.
Mol. Biol. 313 (4), 903-919.
2. Gough J. and Chothia C. (2002) SUPERFAMILY: HMMs representing all
proteins of known structure. SCOP sequence searches, alignments, and
genome assignments. Nucl. Acids Res. 30 (1), 268-272.
3. Rychlewski L. et al. (1999) Comparison of sequence profiles. Strategies
for structural prediction using sequence information. Protein Sci. 9, 232-241.
4. Murzin A. et al. (1995) SCOP: a structural classification of proteins
database for the investigation of sequences and structures. J. Mol. Biol. 247
(4), 536-540.
Szed-Asmat (P0515) - 6 predictions: 6 3D
Comparative Modeling of Six Randomly Selected Target
Proteins by MODELLER 6
A. Salim and S. Zarina
Department of Biochemistry, University of Karachi, Karachi 75270, Pakistan
szed4@yahoo.com
Three-dimensional structure predictions were made for six target proteins by
comparative modeling using the protein structure modeling program
MODELLER 6v2 (Windows version) [1], which constructs protein models by
satisfaction of spatial restraints. The targets for 3D modeling were (i)
hypothetical cytosolic protein yckF (T0167), (ii) hypothetical protein HP0162
(T0177), (iii) spermidine synthase homolog (T0179), (iv) TM1478 (T0182),
(v) TM1816 (T0188) and (vi) transthyretin-related protein (T0190). These
targets have sequence similarities of 30-42% with their respective
templates, which were identified by a BLAST [2] search against the Protein
Data Bank. Between 10 and 20 models were constructed for each target protein
with MODELLER, and the model that best satisfied the stereochemical criteria
was selected after evaluation with the program PROCHECK [3]. The Ramachandran
plots of these models showed no residues in the disallowed region except in
the case of protein TM1478 (T0182), where a single residue was located in the
disallowed region of the plot. The corresponding template of this target protein
also has the identical residue in the disallowed region. Structural superpositions
of the Calpha atoms between the models and the corresponding experimental
structures showed root mean square deviations (RMSD) between 0.2 and 1.5 Å,
suggesting that sequence similarity >30% produces models of greater accuracy.
1. Sali A., Blundell T.L. (1993) Comparative protein modelling by
satisfaction of spatial restraints. J Mol Biol 234, 779-815.
2. Altschul S.F., Gish W., Miller W., Myers E.W., Lipman D.J. (1990) Basic
local alignment search tool. J Mol Biol 215, 403-410.
3. Laskowski R.A., MacArthur M.W., Moss D.S., Thornton J.M. (1993)
PROCHECK: A program to check the stereochemical quality of protein
structures. J Appl Cryst 26, 283-291.
Taylor (P0423) - 113 predictions: 113 3D
CASP Modelling Methods
W.R. Taylor and K.X. Lin
NIMR, London NW7 1AA, UK
wtaylor@nimr.mrc.ac.uk
The target sequence was fed through an automatic databank search protocol
over the non-redundant protein sequence databank. This involved a pre-scan
through locally installed psiBLAST (4 cycles, p=0.001). The sequence
fragments hit by psiBLAST were then extracted and realigned by MULTAL
(automatically removing homologues closer than 90%) before being fed to the
databank search program QUEST.
QUEST is able to 'push the envelope' further than psiBLAST as it has a built-in
multiple alignment stage that reins in iterations that hit too many sequences.
It often found useful sequences where psiBLAST had none (or just trivial
variants). The final iteration reduced the selected sequences (if there were
enough) to a non-homologous set in which no pair had more than 60%
identity.
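The reduction to a non-homologous set can be sketched as a greedy filter over an existing multiple alignment. This is only an illustration under stated assumptions: the function names are hypothetical, identity is measured over columns where both sequences have a residue, and the greedy order is simply the input order. It is not a description of how QUEST itself selects the set.

```python
from typing import List


def percent_identity(a: str, b: str) -> float:
    """Percent identity over columns where neither sequence has a gap."""
    pairs = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if not pairs:
        return 0.0
    same = sum(1 for x, y in pairs if x == y)
    return 100.0 * same / len(pairs)


def non_homologous_subset(aligned: List[str], max_identity: float = 60.0) -> List[str]:
    """Greedily keep sequences so that no kept pair exceeds max_identity.

    aligned       rows of a multiple alignment (equal length, '-' for gaps)
    max_identity  cut-off in percent; 60 follows the text above
    """
    kept: List[str] = []
    for seq in aligned:
        if all(percent_identity(seq, k) <= max_identity for k in kept):
            kept.append(seq)
    return kept
```

A greedy pass depends on input order, so sorting the rows beforehand (for example, target first) decides which representative of a close group survives.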
This produced the target family for which the secondary structure was then
predicted by psiPRED. Rather than run each member of the family against the
databases again, a local BLAST directory was set up containing just the family
members and this was used by a locally installed psiPRED. The resulting
predictions, along with the sequences coloured by their amino acids, were
written as PostScript and displayed using "gv", which allows easy magnification
and browsing over large alignments.
The CAFASP results were taken as a prefilter for the selection of proteins to
model. The CAFASP summary results for each target were downloaded and
all sequence fragments extracted. These were fed to MULTAL, which
produced multiple (sub)alignments for each protein. As above, each alignment
was fed to QUEST and scanned against the NR sequence DB. QUEST has the
property that it anticipates a consensus domain size, so any small fragments or
over-long members become regularised. This results in a set of families, at
least one member of which has a known structure and has been seen by the
CAFASP methods. These families, called the PDBseq. families, were treated in
the same way as the target family to produce secondary structure predictions.
The known secondary structure was also calculated and stored.
The target family was aligned using MST (Multiple sequence Threading) with
each PDBseq. family in turn. MST uses a combination of 3D packing with
predicted/observed secondary structure matching and profile/profile alignment
to produce an alignment of the two families. This was then visualised in gv
(with predicted secondary structures plus motif colouring) alongside the model
of the target sequence on the known structure (also coloured by predicted
secondary structure).
When all went well (i.e. there were homologues for both sides of the alignment)
the full process was completely automatic and modelled structures could be
'flicked' through one after the other. If there was a reasonably clear match with
a number of known structures, then the structure with the top CAFASP jury
score was taken (as this should make comparison with other models more
direct).
If there were no homologues for the target, the QUEST search was rerun by hand,
allowing the envelope to be pushed a little further before the junk flooded
in. Occasionally some members were deleted from the search because of an
improbable connection based on functional keywords. If there were no
homologues for the template, then the structural neighbours of the template
were aligned using SAP (and fed back into MST). If the MST alignment was
only good in parts, then marker points were inserted to hold the good part while
the sequences were realigned. If there were still no homologues, then a novel
ab initio method was run. This uses a secondary structure lattice but also
wanders off-lattice. It generates thousands of models that are then filtered by
shape, fold and packing (Taylor, unpublished).
1. Altschul S.F. et al. (1997) Gapped BLAST and PSI-BLAST: a new
generation of protein database search programs. Nucleic Acids Res. 25
(17), 3389-3402.
2. Taylor W.R. (1998) Dynamic databank searching with templates and
multiple alignment. J. Mol. Biol. 280, 375-406.
3. Taylor W.R. (1997) Multiple sequence threading: an analysis of alignment
quality and stability. J. Mol. Biol. 269, 902-943.
TCS-Bioinformatics (P0404) - 40 predictions: 40 SS
Highly Trained Neural Network Predictors
Dilip Antony Joseph1, M. Vidyasagar2 and Sharmila Mande2
1 Indian Institute of Technology, Madras, 2 Tata Consultancy Services
dilip@peacock.iitm.ernet.in
Artificial Neural Network based predictors have proven to be very effective in
protein secondary structure prediction. The neural network is able to predict
the state of the central residue of a window of amino acids. These predictors
have obtained prediction accuracies of over 75%. The number of neurons in a
predictor is often more than 20000. To effectively train a network with such a
large number of changeable parameters requires a very large training set. The
predictor developed here attempts to use a large amount of the available
secondary structure data in training.
A standard feed forward neural network was used in the predictor. The input to
the neural network consisted of a window of 15 amino acids [3]. Each amino
acid in the window was represented by the 20 numbers obtained from the
PSIBLAST [1, 2] profile of the sequence. The network classifies the central
residue of the window as either in the Alpha Helix, Beta Strand or Coil state. A
second network was used to ‘clean up’ the secondary structure sequence
produced by the first network. The training set consisted of over 6500 protein
sequences from the PDB SELECT [4, 5] database, which gave 1436264 input-output
pairs for training the network. The effectiveness of the above training
set is diminished by the similarity (up to 90 %) between some of the sequences
in the set. However, this training set did give better prediction accuracies than
the networks trained on a smaller number of sequences. Nine neural networks
trained independently on the same training set (randomly shuffled) were
constituted into a jury. This also led to a small increase in prediction accuracy.
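The window input and the jury described above can be sketched as follows. The sketch assumes a profile given as one row of 20 numbers per residue, zero padding where the window runs past the chain termini, and a simple majority vote among the networks. The padding scheme, the voting rule, the tie-break order and both function names are assumptions; the abstract states only the window size, the profile encoding and the jury size.

```python
from collections import Counter
from typing import List, Sequence


def window_features(profile: Sequence[Sequence[float]], center: int,
                    half_width: int = 7) -> List[float]:
    """Flatten a 15-residue window (7 either side of `center`) into one vector.

    With 20 profile values per residue this yields 15 * 20 = 300 inputs.
    Positions beyond the termini are zero-padded (assumption of this sketch).
    """
    n_cols = len(profile[0])
    features: List[float] = []
    for pos in range(center - half_width, center + half_width + 1):
        if 0 <= pos < len(profile):
            features.extend(profile[pos])
        else:
            features.extend([0.0] * n_cols)
    return features


def jury_vote(predictions: Sequence[str], states: str = "HEC") -> str:
    """Combine per-residue state strings from several networks by majority.

    Ties are broken by the order of `states` (an arbitrary choice here).
    """
    consensus = []
    for column in zip(*predictions):
        counts = Counter(column)
        consensus.append(max(states, key=lambda s: counts[s]))
    return "".join(consensus)
```

Averaging the networks' raw output activations before choosing a state would be an equally plausible jury rule; the vote above is just the simplest one to show.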
It has been observed that the larger and more varied the training set, the better
the prediction accuracy. As more and more structural data becomes known
in the future, it is important to include those sequences in the training set.
However, retraining the whole network is a time consuming exercise. The
effectiveness of retraining the network with only the new sequences is studied.
A jury consisting of highly trained predictors along with specialized alpha and
beta strand predictors can effectively increase the prediction accuracies.
1. Altschul S.F. et al. (1997) Gapped BLAST and PSI-BLAST: a new
generation of protein database search programs. Nucleic Acids Res. 25
(17), 3389-3402.
2. Jones D.T. (1999) Protein secondary structure prediction based on
position-specific scoring matrices. J. Mol. Biol. 292 (2), 195-202.
3. Rost B. and Sander C. (1993) Improved prediction of protein secondary
structure by use of sequence profiles and neural networks. Proc. Natl.
Acad. Sci. USA, 90, 7558-7562.
4. Hobohm U., Scharf M., Schneider R., Sander C. (1992) Selection of a
representative set of structures from the Brookhaven Protein Data Bank.
Protein Science 1, 409-417.
5. Hobohm U. and Sander C. (1994) Enlarged representative set of protein
structures. Protein Science 3, 522.
THW-FR (P0377) - 241 predictions: 241 3D
Net Charge Center for Protein Fold Recognition
I. Y. Torshin1,2,3
1 – Chair of Physical Chemistry, Chem. Dept., Moscow State University,
2 – Comp. Sci. Dept., GSU, Atlanta, GA, 3 – Biol. Dept., GSU
biotiy@suez.cs.gsu.edu
Net charge center (NCC) is a novel physico-chemical model developed for
analysis of the relationship between protein structure and function. Various
“quantitative” models (often built, perhaps, as attempts to imitate the amazingly
accurate mathematical apparatus of modern physics [1]) may allow experimental
data to be fitted to calculations using a number of arbitrary empirical
parameters, but do not appear to have clear physical significance. The NCC
model does not include any empirical parameters whatsoever and is calculated
solely on the basis of the three-dimensional structure of the protein using an
extremely simple formula [unpublished]. The physical significance of the
model is that the NCC describes the spatial distribution of the charged/ionized
residues in a protein molecule under given physico-chemical conditions. The
biological significance of the NCC model is that the NCC is very often located in
the biologically important sites of protein molecules of different biochemistry
[unpublished data]. This fact allows the NCC model to be applied to the
problem of fold recognition by generating a library of template structures.
Sequences around positive and negative charge centers (PNCC) are likely to be
folding cores or folding intermediates [2-4]. The NCC model, as noted above,
is likely to determine the location of functional regions and sequences. These
two properties of a native protein were used to compile a template library for
fold recognition. The library was based on domain database GTDD (Gestalt
Theory Domain Database [unpublished]). Gestalt theory [5], though
proposed over 50 years ago, is still one of the best theories describing the
principles of perception. The theory explains a large number of experimental
facts pertaining to perception (human perception, in particular) without the
over-complicated or purely statistical explanations characteristic of modern
behaviorism. The gestalt principles can be computerized for many purposes and
in this study they were used to generate a database of domains using the
non-redundant PDB. In short, NCC + PNCC allows selection of potential templates
from the GTDD library of templates (that is, fold recognition).
The non-redundant GTDD for fold recognition was built using a non-redundant set
of PDB sequences selected at a BLAST E-value of 10e-7 [6]. The program FoldRec-CC
generated a set of models, which were then refined by energy minimization
using AMMP [7]. Several modifications of the method were used: 1. the full
FoldRec-CC method; 2. NCC only; 3. PNCC only; and 4. the full FoldRec-CC method
but using a multiple sequence alignment (BLAST [8]) for the target. Preliminary
data (as judged by proteins with definitely known templates such as T0137,
T0144 etc.) suggest that using the NCC model alone (as in modification 2 above)
often leads to correct fold recognition, though the full FoldRec-CC method
(modification 1 above) is more reliable.
This method was also applied during the TMW-1 (Ten Most Wanted)
experiment and, although structures of TMW proteins are not known, there is
circumstantial evidence that at least the correct fold was predicted for a
number of the TMW targets. Although the method is fully automated, visual
inspection of the final 10-20 models for a target, as well as the use of
secondary structure predictions [9], is likely to improve the results of fold
recognition.
1. Köhler W. (1947), Gestalt Psychology: An Introduction to New Concepts
in Modern Psychology, Liveright Publishing, NY, p.42.
2. Torshin I. et al (2002) Charge centers and formation of the protein folding
core. Proteins, 43:353-364.
3. Torshin I. et al (2002) Identification of protein folding cores and nuclei
using charge center model of protein structure. TheScientificWorld
Journal, 2:84-86.
4. Torshin I. et al. (2002) Protein folding: search for basic physical models,
submitted.
5. Köhler W. (1947), Gestalt Psychology, 136-279.
6. Madej T. et al (1995) Threading a database of protein cores. Protein Struct.
Funct. Genet. 23, 356-369.
7. Harrison R. et al (1995). Analysis of six protein structures predicted by
comparative modeling techniques. Proteins 23, S463-471.
8. Altschul S.F. et al. (1997) Gapped BLAST and PSI-BLAST: a new
generation of protein database search programs. Nucleic Acids Res. 25,
3389-3402.
9. Jones D.T. (1999) Protein secondary structure prediction based on
position-specific scoring matrices. J. Mol. Biol. 292, 195-202.
TOME (P0450) - 260 predictions: 260 3D
Evaluation of a New Protein Structure Modelling Pipeline
TOME
G. Labesse, V. Catherinot, J.-L. Pons, L. Martin and D. Douguet
1 - Centre de Biochimie Structurale (CNRS), Montpellier, France
labesse@cbs.cnrs.fr
The fold compatibility between the targets and PDB entries was analyzed using
our recently developed meta-server [1]. Query sequences are sent
automatically to six distinct fold recognition or protein structure prediction
servers: 3D-PSSM [2], PDB-BLAST
(http://bioinformatics.burnhaminst.org/pdb_blast/), FUGUE [3], GenTHREADER [4],
SAM-T99 [5] and JPRED2 [6], with default parameters except for PDB-BLAST
(10 iterations). A consensus ranking was estimated. Fold recognition searches
were resumed for multi-domain targets to reassess template ranking and scoring
or to highlight structural similarities. In a few cases several runs were
necessary for proper domain delimitation.
As most “threaders” use the “frozen approximation”, each structural alignment
was further evaluated using T.I.T.O. [7]. Sequence identity, threading scores
and ranking, as well as the percentage of target sequence and template structure
overlap, were taken into account for validation of the proposed fold. Presence of
hits with homologous structures (large family and/or easy target) or with related
folds (small family and/or difficult targets) was also checked. Template-related
structures were searched using FSSP [8].
For easy targets, models were built directly using MODELLER 6.0 [9]. Models
were evaluated using PROSA [10] and Verify3D [11]. Indel modelling was
carefully analyzed by visual inspection using XmMol [12]. Structural
alignments were manually refined locally. Side chain modelling in the common
core (as defined by target-template alignment) was also performed using
SCWRL 2.8 [13] and similarly evaluated but not further refined.
For difficult targets, additional evaluations of the fold compatibility were
performed through extensive modelling using a dozen distinct templates as
well as careful structural alignment refinement through T.I.T.O. Additional
restraints to be used in MODELLER 6.0 were deduced from template
secondary structure assignment using P-SEA [14] and mixed with predictions,
for example, from J-Pred.
At least three models were deposited for each target (except for a few likely to
be in the NEW_FOLD class): one built by MODELLER (the most complete),
another built using SCWRL (no indel building) and a third derived from
the template by T.I.T.O. (common core only, backbone of aligned residues +
side chains of conserved residues). Additional models were sometimes added
when distinct models were obtained (sub-optimal alignment or loop building
with equivalent validation score).
1. Douguet D. et al. (2001) Easier threading through web-based comparisons
and cross-validations. Bioinformatics 17, 752-753.
2. Kelley L.A. et al. (2000) Enhanced Genome Annotation using Structural
Profiles in the Program 3D-PSSM. J. Mol. Biol. 299, 501-522.
3. Shi J. et al. (2001) FUGUE: sequence-structure homology recognition
using environment-specific substitution tables and structure-dependent gap
penalties. J. Mol. Biol. 310, 243-257.
4. McGuffin L.J. et al. (2000) The PSIPRED protein structure prediction
server. Bioinformatics 16, 404-405.
5. Karplus K. et al. (1998) Hidden Markov models for detecting remote
protein homologies. Bioinformatics 14, 846-856.
6. Cuff J.A. et al. (1998) Jpred: A Consensus Secondary Structure Prediction
Server. Bioinformatics 14, 892-893.
7. Labesse G. et al. (1998) A Tool for Incremental Threading Optimization
(T.I.T.O.) to help alignment and modelling of remote homologs.
Bioinformatics 14, 206-211.
8. Holm L. et al. (1994) The FSSP database of structurally aligned protein
fold families. Nucleic Acids Res. 22 (17), 3600-3609.
9. Sali A. et al. (1993) Comparative protein modelling by satisfaction of
spatial restraints. J. Mol. Biol. 234, 779-815.
10. Sippl M.J. (1993) Recognition of errors in three-dimensional structures of
proteins. Proteins 17, 355-362.
11. Eisenberg D. et al. (1997) VERIFY3D: assessment of protein models with
three-dimensional profiles. Methods Enzymol 277, 396-404.
12. Tuffery P. (1995) XmMol: an X11 and motif program for macromolecular
visualization and modeling. J. Mol. Graph. 13, 67-72.
13. Dunbrack R.L. et al. (1993) Backbone-dependent rotamer library for
proteins. Application to side-chain prediction. J Mol Biol. 230, 543-574.
14. Labesse G. et al. (1997) P-SEA: a new efficient assignment of secondary
structure from Ca trace of proteins. CABIOS 13, 291-295.
UCLA-DOE (P0301) - 59 predictions: 59 3D
The Directional Atomic Solvation Energy: An Atom-Based
Empirical Potential for the Assignment of Protein Sequences
to Known Folds
Parag Mallick1, Charlotte Deane1,
Robert Weiss1 and David Eisenberg1
1 – UCLA-DOE Center for Genomics and Proteomics
& Howard Hughes Medical Institute
parag@mbi.ucla.edu
The Directional Atomic Solvation Energy (DASEY) is an atom-based
description of the environment of an amino acid position within a known 3D
protein structure. DASEY has been developed to align and score a probe
amino acid sequence to a library of template protein structures for fold
assignment. The DASEY is computed by summing the atomic solvation
parameters [1] of atoms falling within a tetrahedral sector, or petal, extending
16Å along each of the four bond axes of each alpha-carbon atom of the protein.
The DASEY is able to discriminate between pairs of structurally equivalent
positions and random pairs in protein structures sharing a fold but belonging
to different superfamilies, unlike some previous descriptors of protein
environments, such as area buried. Furthermore, DASEY values have
characteristic patterns of residue replacement, an essential feature of a
successful fold assignment method. Benchmarking fold assignment with
DASEY scoring achieves coverage of 56% of sequences with 90% accuracy
when probe sequences are matched to protein structural templates belonging to
the same fold but to a different superfamily, an improvement of greater than
200% over a previous method of sequence-derived properties.
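The petal summation that defines a DASEY environment can be sketched for a single alpha-carbon as follows. The sketch assumes that each neighbouring atom within 16 Å is assigned to the bond axis with which its direction makes the smallest angle, which partitions space into four sectors; the abstract does not give the exact sector geometry, so this partition, the function name and the input layout are assumptions. Atomic solvation parameters are passed in by the caller rather than hard-coded.

```python
import math
from typing import List, Sequence, Tuple

Vec = Tuple[float, float, float]


def _sub(a: Vec, b: Vec) -> Vec:
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])


def _dot(a: Vec, b: Vec) -> float:
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]


def _unit(a: Vec) -> Vec:
    length = math.sqrt(_dot(a, a))
    return (a[0] / length, a[1] / length, a[2] / length)


def petal_sums(ca: Vec, bonded: Sequence[Vec],
               atoms: Sequence[Tuple[Vec, float]],
               radius: float = 16.0) -> List[float]:
    """Sum solvation parameters per petal around one alpha-carbon.

    ca      alpha-carbon coordinates
    bonded  coordinates of the four atoms bonded to it (defines the axes)
    atoms   (coordinates, atomic solvation parameter) for surrounding atoms
    radius  petal length in Angstrom; 16 follows the text above
    """
    axes = [_unit(_sub(p, ca)) for p in bonded]
    sums = [0.0] * len(axes)
    for coord, asp in atoms:
        offset = _sub(coord, ca)
        dist = math.sqrt(_dot(offset, offset))
        if dist == 0.0 or dist > radius:
            continue  # skip the centre itself and atoms beyond the petal
        direction = _unit(offset)
        nearest = max(range(len(axes)), key=lambda i: _dot(axes[i], direction))
        sums[nearest] += asp
    return sums
```

Each residue position is then described by four numbers, one per bond axis, which is what makes the descriptor directional rather than a single burial value.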
For each CASP target, models were built by first identifying a candidate fold
family, refining the fold prediction to a family prediction, generating multiple
alignments to potential templates and by then building and refining molecular
models. PHD [2], PSI-PRED [3] and JPRED [4] were used to predict the
secondary structure of all CASP targets. Next, each prediction was used to
identify likely fold candidates by DASEY, the Method of Sequence Derived
Properties [5], PSI-BLAST [6] and a simple composition-based filter. Next,
DASEY was used to identify which superfamilies within a fold class were most
similar to the target and which templates were most likely. Alignments to the
selected templates were generated by DASEY and then visually inspected
within SeaView [6]. Whenever possible, profile-profile alignments and PFAM-A
[7] alignments were also generated for comparison. MODELLER [8] was
used to build and refine 10 models of each alignment. Models were evaluated
by MODELLER Energy, ERRAT [9], Verify3D [10] and the SwissPDBViewer
Threading Potential. In some cases, alternate alignments and templates were
used if no suitable candidate models were initially generated.
1. Eisenberg D., and McLachlan A.D. (1986). Solvation energy in protein
folding and binding. Nature 319, 199-203.
2. Rost B. (1996). PHD: predicting one-dimensional protein structure by
profile-based neural networks. Methods Enzymol 266, 525-539.
3. McGuffin L.J., Bryson K., and Jones D.T. (2000). The PSIPRED protein
structure prediction server. Bioinformatics 16, 404-405.
4. Cuff J.A., Clamp M.E., Siddiqui A.S., Finlay M., and Barton G.J. (1998).
JPred: a consensus secondary structure prediction server. Bioinformatics
14, 892-893.
5. Fischer D., and Eisenberg D. (1996). Protein fold recognition using
sequence-derived predictions. Protein Science 5, 947-955.
6. Altschul S.F., Madden T.L., Schäffer A.A., Zhang J., Zhang Z., Miller W.,
and Lipman D.J. (1997). Gapped BLAST and PSI-BLAST: a new
generation of protein database search programs. Nucleic Acids Research
25, 3389-3402.
7. Bateman A., Birney E., Cerruti L., Durbin R., Etwiller L., Eddy S.R.,
Griffiths-Jones S., Howe K.L., Marshall M., and Sonnhammer E.L. (2002).
The Pfam protein families database. Nucleic Acids Res 30, 276-280.
8. Sali A., and Blundell T.L. (1993). Comparative protein modelling by
satisfaction of spatial restraints. J Mol Biol 234, 779-815.
9. Colovos C., and Yeates T.O. (1993). Verification of protein structures:
patterns of nonbonded atomic interactions. Protein Sci 2, 1511-1519.
10. Eisenberg D., Luthy R., and Bowie J.U. (1997). VERIFY3D: assessment
of protein models with three-dimensional profiles. Methods Enzymol 277,
396-404.
VENCLOVAS (P0425) - 20 predictions: 20 3D
Comparative Modeling Based on a Combination of Sequence
Comparison and Assessment of Structural Fitness
C. Venclovas
Lawrence Livermore National Laboratory, Livermore, California
venclovas@llnl.gov
The comparative modeling approach used to build models for CASP5 is in many
respects similar to the one used at CASP4 and described in more detail in the
special Proteins issue [1].
Template selection
PDB templates were identified by running either BLAST or PSI-BLAST [2]
searches against the non-redundant NCBI sequence database. Usually more
than one template was used to build models.
Sequence-structure alignments
Sequence-structure alignments were generated and assessed at both the
sequence level and the 3D level. For high homology targets, where
structural template(s) were among closely related sequences, the alignment was
derived directly from PSI-BLAST results with some manual adjustments
around insertions/deletions. In the case of distant homology targets, results of
an initial PSI-BLAST search were used in an intermediate sequence search
procedure (PSI-BLAST-ISS) [1]. In this procedure, a set of sequences that
bridge sequence space between target sequence and template(s) were used as
additional probes for searching the non-redundant sequence database.
Target-template sequence alignments were extracted from resulting search data and
their consistency was analyzed. For regions where one dominant alignment
variant was produced, the alignment was considered reliable, while the regions
where the consistency of target-template alignment was lacking were deemed
unreliable. If unreliable regions were present in the alignment, multiple models
were built to explore alternative alignment variants. In some of these cases, to
increase selectivity, models for less-distant homologs of the target were also
built. Alignments for some regions that were expected to be structurally
conserved, but could not be aligned by PSI-BLAST, were derived manually
using PSIPRED [3] secondary structure predictions as a guide.
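The consistency analysis of the extracted target-template alignments can be sketched as a per-position vote. In this illustration each alignment is a list giving, for every target position, the template position it is paired with (or None for a gap). A position counts as reliable when one pairing dominates; the 0.7 dominance threshold and the function name are assumptions, since the text does not quantify what "dominant" means.

```python
from collections import Counter
from typing import List, Optional, Sequence, Tuple


def alignment_consistency(
    alignments: Sequence[Sequence[Optional[int]]],
    min_fraction: float = 0.7,
) -> List[Tuple[Optional[int], float, bool]]:
    """Find the dominant template partner for each target position.

    alignments    one list per alignment variant; entry i is the template
                  position paired with target position i, or None for a gap
    min_fraction  share of variants that must agree for "reliable"
                  (hypothetical threshold)
    Returns (dominant partner, fraction of variants agreeing, reliable flag).
    """
    n_positions = len(alignments[0])
    result = []
    for pos in range(n_positions):
        counts = Counter(aln[pos] for aln in alignments)
        partner, votes = counts.most_common(1)[0]
        fraction = votes / len(alignments)
        result.append((partner, fraction, fraction >= min_fraction))
    return result
```

Runs of positions flagged unreliable mark the regions for which alternative models would be built and compared by structural fitness.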
The final target-template alignment was selected by taking into account
structural fitness of each of the alternative alignments. Structural fitness
(quality) was assessed by several methods including visual inspection, ProsaII
[4] profiles and Z-scores and reports from the WHATIF [5] quality evaluation
module (Whatcheck).
Loop modeling
Most of the loops for distant homology targets were assigned automatically
during model-building. In other cases loops were modeled after suitable
fragments from PDB structures. Preference was given to evolutionarily related
protein structures. In their absence the conformation which was dominant in the
results of fragment searches was assigned to the targeted region.
Generating 3D structures
Models were generated with MODELLER [6]. In most cases side chains were
rebuilt using SCWRL [7]. Any strong side chain clashes after this step were
removed manually. No energy minimization procedures were used.
1. Venclovas C. (2001) Comparative modeling of CASP4 target proteins:
Combining results of sequence search with three-dimensional structure
assessment. Proteins, Suppl. 5, 47-54.
2. Altschul S.F., Madden T.L., Schaffer A.A., Zhang J., Zhang Z., Miller W.
and Lipman D.J. (1997) Gapped BLAST and PSI-BLAST: a new
generation of protein database search programs. Nucleic Acids Res, 25,
3389-3402.
3. Jones D.T. (1999) Protein secondary structure prediction based on
position-specific scoring matrices. J Mol Biol, 292, 195-202.
4. Sippl M.J. (1993) Recognition of errors in three-dimensional structures of
proteins. Proteins, 17, 355-362.
5. Vriend G. (1990) WHAT IF: a molecular modeling and drug design
program. J Mol Graph, 8, 52-56.
6. Sali A. and Blundell T.L. (1993) Comparative protein modelling by
satisfaction of spatial restraints. J Mol Biol, 234, 779-815.
7. Bower M.J., Cohen F.E. and Dunbrack R.L., Jr. (1997) Prediction of
protein side-chain rotamers from a backbone-dependent rotamer library: a
new homology modeling tool. J Mol Biol, 267, 1268-1282.
Wolynes-Schulten (P0294) - 42 predictions: 42 3D
Ab initio Structure Prediction with Associative Memory
Hamiltonians
Corey Hardin1, Michael Prentiss2, Michael P. Eastwood2, Zan
Luthey-Schulten1, and Peter Wolynes2
1 - University of Illinois - Urbana Champaign
2 - University of California - San Diego
pwolynes@chem.ucsd.edu
We initially selected sequences for ab initio prediction if there was no obvious
scaffold found by the automated comparative modeling servers for
threading/comparative modeling. For the selected sequences, we used an
Associative Memory Hamiltonian (AMH), with parameters chosen by
optimization. The optimization aims to produce an energy landscape of the
AMH that is as close to an ideal funnel as our reduced model allows without
using homology information. The AMH has been optimized separately for all-alpha
and alpha-beta proteins [1-3]. We averaged the AMH potential over
multiple sequence homologues when available. Information from secondary
structure prediction was included via a potential biasing the phi-psi angles to
the appropriate region of a Ramachandran plot. A sequence dependent
hydrogen bond term was used to improve beta sheet formation. Molecular
dynamics simulations were used to select low energy candidate structures.
Subsequently, a smaller subset of structures was selected for submission using
several filters. These include agreement with secondary structure predictions
and available biochemical information as well as the energy from a second
energy function designed for threading that includes pairwise contact,
structural profile, and backbone hydrogen bonding terms [4].
1. Eastwood M. P. et al. (2002) Statistical Mechanical Refinement of Protein
Structure Prediction Schemes: Cumulant Expansion Approach. J. Chem.
Phys. 117 (9), 4602-4615.
2. Hardin C. et al. (2000) Associative Memory Hamiltonians for Structure
Prediction Without Homology: Alpha-Helical Proteins. Proc. Nat. Acad.
Sci. U.S.A. 97 (26), 14235-14240.
3. Hardin C. et al. (2002) Associative Memory Hamiltonians for Structure
Prediction Without Homology: Alpha-Beta Proteins. Proc. Nat. Acad. Sci.
U.S.A. (accepted).
4. Koretke K.K. et al. Self-consistently Optimized Statistical Mechanical
Energy Functions For Sequence Structure Alignment. Protein Science 5,
1043-1059.
Yan-Research (P0069) - 60 predictions: 60 SS
The Prime Number Code: A Method of Protein Structure
Prediction Derived from the Genetic Code
Johnson F. Yan and Benjamin C. Yan
Yan Research
jfy@serv.net
A novel algorithm has been derived from number-theory principles that can
predict the secondary structures of proteins, given the primary structure (amino
acid sequence). Numerically, the amino acids are represented by 20 “z-numbers”
(mostly prime numbers) with which sequence patterns may be calculated [1].
Rather than being arbitrarily assigned, an amino acid’s z-number is derived
from the unique algebraic properties of the three deoxyribonucleotides in its
codons [2].
Whereas amino acids have assigned z-numbers, peptides and structural motifs
in proteins are characterized by z-sums. The z-sum of a peptide or a secondary
structural motif is the sum of the z-numbers of its constituent amino acids. If
the z-sum is a prime number, then the corresponding structural motif or peptide
tends to recur as a unit. The indivisibility of a peptide unit is therefore
analogous to the indivisibility of a prime number.
Secondary structures of proteins can be expressed in terms of their α-helices
and β-sheets. α-helices are expressed in units of heptapeptides, while β-sheet
strands are expressed in units of tripeptides [3].
Identification of α-helices (heptapeptides) and β-sheet strands (tripeptides) in a
protein is accomplished by scanning the amino acid sequence for prime z-sums.
A heptapeptide (α-helix) scan is performed by scanning the z-sums of every
seven residues for prime numbers. The scan is carried out for all seven reading
frames. Similarly, a tripeptide (β-strand) scan is performed by calculating the
z-sums of consecutive, non-overlapping tripeptides; the scan is repeated for all
three reading frames. Heptapeptides and tripeptides with prime z-sums (called
“prime heptapeptides” and “prime tripeptides”) tend to present as recurring
motifs.
This method is based upon intrinsic properties of the DNA sequence, which
prescribes the amino acid sequence (also intrinsic) through the genetic code.
Sequence analyses carried out using this method were supplemented with
statistical data compiled for individual amino acids to detail extrinsic
properties. The higher-order structure of a protein is therefore dependent on
both intrinsic (sequence) and extrinsic (environmental) factors.
The algorithm described above was applied to the amino acid sequences of the
Full Sequence Design protein of Dahiyat and Mayo [4], and of Arabidopsis
cellulose synthase [5].
1. Yan J.F. 1999, U.S. Patent No. 5,856,928.
2. Yan J.F. et al. (1991) Prime numbers and the amino acid code: analogy in
coding properties. J. Theor. Biol. 151, 333-351.
3. Yan B.C. and Yan J.F. (1999) Size and folding in globular proteins.
Internatl. J. Biol. Macromol. 24, 65- 67.
4. Dahiyat B.I. and Mayo S.L. (1997) De novo protein design: fully
automated sequence selection. Science 278, 82-87.
5. Arioli T. et al. (1998) Molecular analysis of cellulose biosynthesis in
Arabidopsis. Science 279, 717-720.
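The reading-frame scan described in this abstract can be illustrated with a short Python sketch. This is only one reading of the text, not the authors' implementation: the abstract does not list the 20 z-numbers, so the table `TOY_Z` below is a hypothetical three-letter placeholder, and the function names `is_prime` and `prime_windows` are invented for illustration. A width of 7 would correspond to the heptapeptide scan and a width of 3 to the tripeptide scan.

```python
from typing import Dict, List, Tuple


def is_prime(n: int) -> bool:
    """Trial-division primality test, adequate for small z-sums."""
    if n < 2:
        return False
    if n % 2 == 0:
        return n == 2
    d = 3
    while d * d <= n:
        if n % d == 0:
            return False
        d += 2
    return True


def prime_windows(seq: str, z: Dict[str, int], width: int) -> List[Tuple[int, int, str]]:
    """Scan consecutive, non-overlapping windows of `width` residues in each of
    the `width` reading frames and return (frame, start, peptide) for every
    window whose z-sum is prime."""
    hits = []
    for frame in range(width):
        for start in range(frame, len(seq) - width + 1, width):
            peptide = seq[start:start + width]
            if is_prime(sum(z[aa] for aa in peptide)):
                hits.append((frame, start, peptide))
    return hits


# HYPOTHETICAL z-numbers for a toy alphabet; the real assignments are
# given in refs [1,2] of the abstract, not here.
TOY_Z = {"A": 2, "G": 3, "L": 5}

# Tripeptide scan of a toy sequence over all three reading frames.
toy_hits = prime_windows("AAGAAL", TOY_Z, width=3)
```

With the toy table, the windows AAG, AGA and GAA each sum to 7 and are reported, while AAL sums to 9 and is not.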
Yasara-Pushchino (P0202) - 192 predictions: 192 3D
During CASP5, human expert alignments were additionally fed into the
pipeline. These were contributed by Dmitry N.Ivankov from Alexei
Finkelstein's group, and in one third of the cases, they gave higher scores than
Eliza's suggestions. Their method is described separately under group-name
"Puschino".
WHAT IF YASARA Folds a Protein?
E. Krieger1, D.N. Ivankov2, A. Finkelstein2 and G. Vriend1
1 - CMBI, Center for Molecular and Biomolecular Informatics,
University of Nijmegen, the Netherlands.
2 - Institute of Protein Research, Puschino, Russia
Elmar.Krieger@cmbi.kun.nl
Summary
Two thirds of the 48 submitted models were built fully automatically
combining three newly developed approaches to structure prediction:
1) Self-parameterizing force fields. To achieve maximum accuracy, force
field parameters were not derived from small molecules and then applied to
proteins. Instead, the force fields were allowed to parameterize themselves
while energy-minimizing high resolution X-ray structures[1]. Model refinement
was done with the new YAMBER II force field, which uses the same energy
function as AMBER[2], but different parameters optimized in crystal space.
2) Eliza, an "artificial modeling intelligence". Previous CASPs have shown that
humans still do better than automated servers; we therefore tried to teach Eliza
the human way of thinking when correcting an alignment.
3) Distributed computing with the Models@Home screensaver[3] (available
from www.yasara.com/models), which allowed us to run hundreds of parallel
molecular dynamics simulations of models built from various possible
alignments. The resulting trajectories were clustered[4,5] to avoid false
positives and to pick out truly improved models.
Abstract
The YASARA/WHAT IF modeling pipeline integrates functions provided by
both programs and a variety of fold recognition and secondary structure
prediction servers into a fully automatic method for protein structure prediction.
A-169
Initial alignments were collected from the following fold recognition servers as
summarized on the CAFASP website: SAM-T02[6], FORTE1
(www.cbrc.jp/htbin/forte1-cgi/forte1_form.pl), ORFeus, BasicC
(grdb.bioinfo.pl), 3D-PSSM[7], GenTHREADER[8], mGenTHREADER,
FUGUE2.1[9], INBGU (www.cs.bgu.ac.il/~bioinbgu), and MPALIGN
(sunflower.kuicr.kyoto-u.ac.jp/mpalign). A consensus secondary structure
prediction was obtained from PHD[10], JPRED[11], PSIPRED[12], and
SAM-T02[6].
All alignments were analyzed and potentially modified by Eliza. Then the loops
and structured N- and C-termini were added with YASARA's loop modeler, and
side chains were completed by WHAT IF[13]. The models obtained for the
various alignments were scored and the best one picked for further
optimization.
In the refinement stage, the conformational space available to the model was
sampled with Bert de Groot's CONCOORD program[14], then 100 parallel
all-atom molecular dynamics simulations in aqueous solution (Particle Mesh
Ewald electrostatics) were run with YASARA to 'home in' further on the target.
This was done with the new YAMBER II force field (Yet Another Model
Building and Energy Refinement force field), a second-generation
self-parameterizing force field optimized in crystal space. Models were ranked
based on a variety of WHAT IF quality checks[15] and clustered to avoid
isolated false positives with artificially high scores[4,5].
Due to the huge computational requirements, the entire procedure was run in
parallel using the Models@Home distributed computing system. Thanks to
everyone working here at the CMBI in Nijmegen, Netherlands, for choosing the
Models@Home screensaver. More information about the programs used is
available at www.yasara.com and www.cmbi.nl/whatif.
1. Krieger E., Koraimann G. & Vriend G. (2002). Increasing the precision of
comparative models with YASARA NOVA - a self-parameterizing force
field. Proteins 47, 393-402.
2. Wang J., Cieplak P. & Kollman P. A. (2000). How well does a restrained
electrostatic potential (RESP) model perform in calculating conformational
energies of organic and biological molecules? J. Comp. Chem. 21,
1049-1074.
3. Krieger E. & Vriend G. (2002). Models@Home: distributed computing in
bioinformatics using a screensaver based approach. Bioinformatics 18,
315-318.
4. Shortle D., Simons K. T. & Baker D. (1998). Clustering of low-energy
conformations near the native structures of small proteins. Proc. Natl.
Acad. Sci. USA 95, 11158-11162.
5. Xiang Z. & Honig B. (2002). Evaluating conformational free energies: the
colony energy and its application to the problem of loop prediction. Proc.
Natl. Acad. Sci. USA 99, 7432-7437.
6. Karplus K., Barrett C., Cline M., Diekhans M., Grate L. & Hughey R.
(1999). Predicting protein structure using only sequence information.
Proteins 37(S3), 121-125.
7. Kelley L.A., MacCallum R.M. & Sternberg M.J.E. (2000). Enhanced
genome annotation using structural profiles in the program 3D-PSSM. J.
Mol. Biol. 299, 499-520.
8. Jones D.T. (1999). GenTHREADER: an efficient and reliable protein fold
recognition method for genomic sequences. J. Mol. Biol. 287, 797-815.
9. Shi J., Blundell T.L. & Mizuguchi K. (2001). FUGUE: sequence-structure
homology recognition using environment-specific substitution tables and
structure-dependent gap penalties. J. Mol. Biol. 310, 243-257.
10. Rost B. (1996). PHD: predicting one-dimensional protein structure by
profile-based neural networks. Methods Enzymol. 266, 525-539.
11. Cuff J.A., Clamp M.E., Siddiqui A.S., Finlay M. & Barton G.J. (1998).
JPred: a consensus secondary structure prediction server. Bioinformatics
14, 892-893.
12. McGuffin L.J., Bryson K. & Jones D.T. (2000). The PSIPRED protein
structure prediction server. Bioinformatics 16, 404-405.
13. Chinea G., Padron G., Hooft R.W.W., Sander C. & Vriend G. (1995). The
use of position specific rotamers in model building by homology. Proteins
23, 415-421.
14. de Groot B. L., van Aalten D. M., Scheek R. M., Amadei A., Vriend G. &
Berendsen H.J. (1997). Prediction of protein conformational freedom from
distance constraints. Proteins 29, 240-251.
15. Hooft R.W.W., Vriend G., Sander C. & Abola E. E. (1996). Errors in
protein structures. Nature 381, 272-272.
Yoon (P0262) - 35 predictions: 35 3D
Simulation of the Protein Folding Structures
Jin Kak Lee, Taesung Moon and Chang No Yoon
Korea Institute of Science and Technology
ljk@kist.re.kr
To simulate the folding structures of a protein, we used a simple off-lattice
model with the unified-residue point, which represents the alpha carbon of each
amino acid in the protein model. This model has two angle variables, one for
the angle between two consecutive virtual bonds, residues i to j and j to k, the
other for the rotational angle of the virtual bonds consisting of residues i, j, k
and l. To generate the protein conformations, the Monte Carlo method
was used, starting from random coil conformations. During this
procedure the i-j-k angle was restricted to the range of 60 to 150 degrees.
Among the trajectory data obtained from the navigation through the potential
surface, about half of them were accepted and stored. The knowledge-based
potential was used to obtain the potential energy surface. It was derived from
the known protein structures. The total number of the accepted conformations
was about 10E3 and the total steps for one run were about 10E8. Finally, all the
conformations were clustered using the energy and cRMS between the alpha
carbon traces. Then the obtained representative conformations were minimized
with the potential energy.
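The sampling scheme described above can be sketched in Python as a Metropolis Monte Carlo walk over the two kinds of angle variables. This is a minimal sketch under stated assumptions, not the authors' code: the abstract does not give the knowledge-based potential, the move set, or the temperature, so `toy_energy`, the 10-degree step size, and `kT` are hypothetical placeholders. Clamping the bond angle at the 60 and 150 degree limits is a simplification chosen here for brevity.

```python
import math
import random
from typing import Callable, List, Tuple

THETA_MIN, THETA_MAX = 60.0, 150.0  # bond-angle limits quoted in the abstract (degrees)

# A conformation is (virtual bond angles, virtual dihedral angles), in degrees.
Conformation = Tuple[List[float], List[float]]


def random_coil(n_res: int, rng: random.Random) -> Conformation:
    """Random starting conformation for a chain of n_res alpha carbons (n_res >= 4)."""
    thetas = [rng.uniform(THETA_MIN, THETA_MAX) for _ in range(n_res - 2)]
    phis = [rng.uniform(-180.0, 180.0) for _ in range(n_res - 3)]
    return thetas, phis


def propose(conf: Conformation, rng: random.Random, step: float = 10.0) -> Conformation:
    """Perturb one randomly chosen angle; bond angles are clamped, dihedrals wrapped."""
    thetas, phis = list(conf[0]), list(conf[1])
    i = rng.randrange(len(thetas) + len(phis))
    if i < len(thetas):
        moved = thetas[i] + rng.uniform(-step, step)
        thetas[i] = min(THETA_MAX, max(THETA_MIN, moved))
    else:
        j = i - len(thetas)
        phis[j] = (phis[j] + rng.uniform(-step, step) + 180.0) % 360.0 - 180.0
    return thetas, phis


def metropolis(energy: Callable[[Conformation], float], n_res: int,
               n_steps: int, kT: float = 1.0, seed: int = 0) -> List[Conformation]:
    """Run Metropolis sampling and return the starting and all accepted conformations."""
    rng = random.Random(seed)
    conf = random_coil(n_res, rng)
    e = energy(conf)
    accepted = [conf]
    for _ in range(n_steps):
        trial = propose(conf, rng)
        e_trial = energy(trial)
        if e_trial <= e or rng.random() < math.exp(-(e_trial - e) / kT):
            conf, e = trial, e_trial
            accepted.append(conf)
    return accepted


def toy_energy(conf: Conformation) -> float:
    """HYPOTHETICAL stand-in for the knowledge-based potential."""
    return sum((t - 90.0) ** 2 for t in conf[0]) / 100.0


confs = metropolis(toy_energy, n_res=8, n_steps=200)
```

The stored conformations would then be clustered and minimized, as the abstract describes; those steps are omitted from this sketch.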
Yoon (P0262) - 35 predictions: 35 3D
Prediction of Protein Structure Using Homology Modeling
Technique
Taesung Moon, Jin Kak Lee and Chang No Yoon
Korea Institute of Science and Technology
iris@kist.re.kr
The homology modeling technique predicts the three-dimensional structure of a
given protein sequence (target) based on an alignment of the protein to one or
more homologous proteins (templates) of known structure. This technique
is becoming more and more important as the amount of structural information
from X-ray crystallography and NMR increases. In this study we carried out
conventional homology modeling. The target protein was aligned
with templates selected by a FASTA search against the PDB (Protein
Data Bank) database. Then, the coordinates of the template amino acids in the
aligned regions were transferred to the target. The coordinates of the unaligned
regions were assigned using a library of small amino acid fragments. If no
matching fragment was found, a conformational search was carried out.
Energy minimization and molecular dynamics simulation were performed
to refine the model structure.
Zhou-HX (P0056) - 134 predictions: 69 3D, 65 SS
Improving Fold Recognition and Query-Template Alignment
by Combining PSI-Blast and Sequence-Structure Threading
H. Chen1,2 and H.-X. Zhou1
1 - Florida State University, 2 - Drexel University
hxzhou@csit.fsu.edu
Both PSI-Blast [1] and sequence-structure threading have strengths and
weaknesses in structure prediction. Our COBLATH [2] program was designed
to exploit the complementarity of the two methodologies through judicious
combination. The powerful sequence-alignment algorithm of PSI-Blast can
generate a sequence profile that is highly informative, even when it cannot by
itself identify a structural template. In particular, this sequence profile can be
incorporated into sequence-structure threading to improve the success rate of
fold recognition and the accuracy of query-template alignment.
The COBLATH program has modules for predicting the secondary structure
and the solvent accessibility. Predictions for both structural features are based
on a neural network, with sequence profile as the input. The predicted
secondary structure and solvent accessibility in turn are used as part of the
fitness function for sequence-structure threading. In addition, for a given
query, the predicted secondary structure is used to screen a pool of proteins
(consisting of ~3000 chains in the FSSP library) to obtain a reduced set of 200
potential templates. Threading between the query and the potential templates is
carried out in both directions.
Of the 67 CASP5 targets, 40 had templates identified by PSI-Blast. The
threading module was used to identify templates for the other 27 targets. In
some cases (e.g., T0149), the PSI-Blast template leaves a significant portion of
the query sequence uncovered. This portion of the sequence is singled out for
further investigation by the threading module.
Regardless of how the template was identified, a round of threading specifically
designed for query-template alignment was used to obtain optimal alignment.
This was based on the recognition that the objectives in fold recognition and in
query-template alignment are different. In the former the objective is to
discriminate the true template against decoys. This does not necessarily match
the objective of finding the best alignment between the query and a particular
template. In particular, the gap penalty was reduced in the threading for
query-template alignment.
With the fully automated COBLATH program, the identified templates had
high confidence levels for all but five of the 67 CASP5 targets.
1. Altschul S.F. et al. (1997) Gapped BLAST and PSI-BLAST: a new
generation of protein database search programs. Nucleic Acids Res. 25
(17), 3389-3402.
2. Shan Y., Wang G., and Zhou H.-X. (2001) Fold recognition and accurate
query-template alignment by a combination of PSI-BLAST and threading.
Proteins 42 (1), 23-37.
A-172
CASP5 Poster Abstracts
A-173
A-174
Accelrys (P0210) - 24 predictions: 24 3D
Structural Prediction and Functional Annotation of Proteomic
Sequences using GeneAtlas™
Dana Haley-Vicente, Velin Spassov, Tina Yeh, Ken Butenhof,
Christoph Schneider, Lisa Yan
Accelrys Inc., 9685 Scranton Road, San Diego, CA 92121, USA
dhv@accelrys.com
We have used GeneAtlas™ to provide functional annotation of proteomic
sequence data including structural prediction. GeneAtlas is an automated,
high-throughput pipeline for the prediction of protein structure and function
using sequence similarity detection, homology modeling, and fold recognition
methods. Using template searching, GeneAtlas searches for relationships
between query sequences and known protein structures, motifs, and folds.
Subsequent inferences and assignment of the target protein’s function are based
on its homology to the experimentally derived template protein and the models
generated as part of the pipeline.
Using CASP5 targets as query sequences, we demonstrate that GeneAtlas
detects additional relationships, via its high-throughput modeling component,
compared with the sequence searching method PSI-BLAST alone.
Furthermore, functionally related proteins with sequence identity below the
twilight zone can be recognized correctly.
In addition, some targets were selected to test two new methods that we have
developed, ChiRotor and Looper, for side-chain and loop prediction. ChiRotor
is a fast algorithm that predicts the conformation of all or part of amino-acid
side chains with an average RMSD of about 1 Å for the core residues. The
loop-modeling program, Looper, produces a number of energy minimized loop
backbone conformations ranked according to force-field energy terms. Both
algorithms are a combination of a discrete search in dihedral angle space and
CHARMm energy minimization.
1. Kitson et al. (2002) Functional annotation of proteomic sequences based
on consensus of sequence and structural analysis. Briefings in
Bioinformatics 3(1), 1-13.
ALAX (P0234) - 39 predictions: 39 3D
Sequence Alignment Method for Automatic Homology
Modeling With Low Sequence Identity
Atsushi Hijikata1, Tosiyuki Noguti2 and Mitiko Go1
1 - Division of Biological Science, Graduate School of Science, Nagoya
University, 2 - Saga Medical School
alax@bio.nagoya-u.ac.jp
The quality of homology modeling depends on the accuracy of the sequence
alignment between the target and the template proteins. When the sequence
identity is low (below 30%), the Indel (insertion/deletion) regions in either
the target or the template sequence increase, and thus accurate alignment of
Indels is more critical for homology modeling than in pairs with high identity. It
was reported that amino acid residues with high solvent accessibility appear
more frequently in Indel regions than those with low solvent accessibility [1].
We re-analyzed this feature using recently accumulated data and
confirmed the previous result. To obtain correct assignment of Indels in
sequence alignment, we developed a new sequence alignment method using
a smaller gap penalty for surface residues and the Position Specific Scoring
Matrix (PSSM) of the PSI-BLAST program. We termed the method ALAX
(ALignment based on solvent ACCessibility). To evaluate the quality of
ALAX, we compared the alignment obtained by ALAX with that obtained by
PSI-BLAST [2], taking the superposition of the 3D structures of the target and
template as the correct alignment. We show that ALAX is 11% better than the
alignment obtained by PSI-BLAST. Furthermore, Indel regions of PSI-BLAST
alignments often lie in the interior of template 3D structures, whereas such
cases rarely occur with ALAX. These results indicate that ALAX is useful
for fully automatic homology modeling, particularly when the sequence identity
between target and template proteins is low.
1. Zhu Z.Y. et al. (1992) A variable gap penalty function and feature weights
for protein 3-D structure comparisons. Protein Eng. 5(1), 43-51.
2. Altschul S.F. et al. (1997) Gapped BLAST and PSI-BLAST: a new
generation of protein database search programs. Nucleic Acids Res. 25
(17), 3389-3402.
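The central idea of the ALAX abstract, a smaller gap penalty at solvent-exposed template positions, can be illustrated with a small global alignment sketch in Python. Several things here are assumptions rather than the published method: the scoring uses simple match/mismatch values instead of a PSI-BLAST PSSM, gaps are linear rather than affine, and the function name `align_variable_gap` and all numeric penalties are hypothetical. The template must be non-empty.

```python
from typing import List, Tuple


def align_variable_gap(query: str, template: str, exposed: List[bool],
                       match: float = 2.0, mismatch: float = -1.0,
                       gap_buried: float = 4.0, gap_surface: float = 1.0
                       ) -> Tuple[float, str, str]:
    """Needleman-Wunsch alignment in which the gap cost depends on whether the
    neighbouring template residue is solvent-exposed (cheap) or buried (costly)."""
    n, m = len(query), len(template)
    gap = [gap_surface if e else gap_buried for e in exposed]

    def g(j: int) -> float:
        # Gap cost next to template position j (1-based), clamped at the ends.
        return gap[min(max(j, 1), m) - 1]

    S = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        S[i][0] = S[i - 1][0] - g(0)
    for j in range(1, m + 1):
        S[0][j] = S[0][j - 1] - g(j)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if query[i - 1] == template[j - 1] else mismatch
            S[i][j] = max(S[i - 1][j - 1] + sub,   # align the two residues
                          S[i - 1][j] - g(j),      # insertion in the query
                          S[i][j - 1] - g(j))      # deletion of template residue j

    # Trace back from the bottom-right corner to recover one optimal alignment.
    q_row, t_row = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            sub = match if query[i - 1] == template[j - 1] else mismatch
            if S[i][j] == S[i - 1][j - 1] + sub:
                q_row.append(query[i - 1]); t_row.append(template[j - 1])
                i -= 1; j -= 1
                continue
        if i > 0 and S[i][j] == S[i - 1][j] - g(j):
            q_row.append(query[i - 1]); t_row.append("-")
            i -= 1
        else:
            q_row.append("-"); t_row.append(template[j - 1])
            j -= 1
    return S[n][m], "".join(reversed(q_row)), "".join(reversed(t_row))


# Toy example: the extra template residue W is marked as exposed.
score, q_aln, t_aln = align_variable_gap("ACDG", "ACWDG",
                                         [False, False, True, False, False])
```

In the toy example the single Indel falls opposite the exposed residue and costs 1 rather than 4, which is the behaviour the abstract argues keeps Indels out of the template interior.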
Aligners (P0064) - 31 predictions: 31 3D
Fold Recognition Using Only Boilerplate Methods of Database
Search and Multiple Sequence Alignment
Arcady Mushegian
Stowers Institute for Medical Research
arm@stowers-institute.org
See methods section
arby-scai (P0183) - 68 predictions: 68 3D
The Arby Automated Structure Prediction Server
Niklas von Öhsen2, Ingolf Sommer1
1 - Max-Planck-Institute for Informatics, 2 - Fraunhofer Institute for Scientific
Computing and Algorithms
sommer@mpi-sb.mpg.de
See methods abstract (Ingolf Sommer and Niklas von Öhsen)
BAKER (P0002) - 377 predictions: 377 3D
De Novo Structure Predictions Using Rosetta
P. Bradley1+, J. Meiler1+, K.M.S. Misura1+, W.R. Schief1+, J.
Schonbrun1+, W.J. Wedemeyer1+, O. Schueler-Furman1,
M. Kuhn1, P. Murphy1, C.E.M. Strauss2, and D. Baker1
1 - University of Washington, 2 - Los Alamos National Laboratory,
+ - authors contributed equally
dabaker@u.washington.edu
See methods section
BAKER (P0002) - 377 predictions: 377 3D
Comparative Modeling Using Rosetta
D. Chivian1+, C.A. Rohl1+, C.E.M. Strauss2, P. Murphy1, and
D. Baker1
1 - University of Washington, 2 - Los Alamos National Laboratory,
+ - authors contributed equally
dabaker@u.washington.edu
See methods section
Biogen (P0440) - 28 predictions: 28 3D
Consensus Scoring Approach to Fold Recognition of CASP5
Targets
A. Lugovskoy, D. Gottlieb, H. van Vlijmen, and J. Singh
Structural Informatics Group, Biogen Inc.
Herman_van_Vlijmen@Biogen.com
To increase the strength of our predictions we used a consensus scoring
approach to fold recognition of CASP5 targets. For 30 fold recognition targets
we combined predictions of several algorithms (Discrete State Model (DSM),
pattern-embedded DSM, Genefold, Seqfold, and Loopp) [1-4] and selected the
template folds found by multiple methods or by any of the methods if the score
was significantly higher than for other folds. A set of uniform non-redundant
fold libraries based on SCOP v1.59 classification [5] was constructed to ensure
equal representation of all structural families. We defined fold recognition
targets as single domain molecules that showed no more than 25% of sequence
identity to molecules in the PDB and yielded hits in the algorithms. For
subsequent homology modeling we performed partial manual realignments of
the target and the template sequences to maximize the continuity of secondary
structure elements. Homology models were built using MODELLER [6], and
minimized using CHARMM [7] with harmonic constraints on the backbones.
We believe that a consensus scoring approach lowers the rate of false positive
hits and increases the confidence in fold recognition solutions.
1. Bienkowska J.R. et al. (2000) Protein fold recognition by total alignment
probability. Proteins 40 (3), 451-62.
2. Jaroszewski L. et al. (1998) Fold predictions by a hierarchy of sequence,
threading and modeling methods. Prot. Sci. 7 (6), 1431-1440.
3. Olszewski K.A. et al. (1999) SeqFold - fully automated fold recognition
and modeling software -- validation and application. Theor. Chem. Acc. 11,
57-66.
4. Meller J. and Elber R. (2001) Linear programming optimization and a
double statistical filter for protein threading protocols. Proteins 45 (3),
241-261.
5. Murzin A. G. et al. (1995). SCOP: a structural classification of proteins
database for the investigation of sequences and structures. J. Mol. Biol.
247, 536-540.
6. Šali A. and Blundell T.L. (1993) Comparative protein modelling by
satisfaction of spatial restraints. J. Mol. Biol. 234, 779-815.
7. Brooks B.R. et al. (1983) CHARMM: A Program for Macromolecular
Energy, Minimization, and Dynamics Calculations. J. Comp. Chem. 4,
187-217.
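The selection rule in the Biogen abstract (take a fold found by multiple methods, or by a single method if its score stands out) can be sketched in Python as follows. This is a hedged illustration, not the group's procedure: the abstract does not say how scores were compared or what "significantly higher" meant, so the `min_votes` and `margin` thresholds, the assumption that each method's scores are on a comparable normalized scale, and the function name `consensus_fold` are all hypothetical.

```python
from collections import Counter
from typing import Dict, Optional


def consensus_fold(scores: Dict[str, Dict[str, float]],
                   min_votes: int = 2, margin: float = 3.0) -> Optional[str]:
    """Pick a template fold from per-method scores (method -> fold -> score).

    A fold that is the top hit of at least `min_votes` methods wins. Otherwise
    a single method's top hit is accepted only if it beats that method's
    runner-up by at least `margin`. Returns None when neither rule applies.
    """
    top_fold: Dict[str, str] = {}
    lead: Dict[str, float] = {}
    for method, by_fold in scores.items():
        ranked = sorted(by_fold.items(), key=lambda kv: kv[1], reverse=True)
        if not ranked:
            continue
        top_fold[method] = ranked[0][0]
        lead[method] = ranked[0][1] - ranked[1][1] if len(ranked) > 1 else float("inf")
    if not top_fold:
        return None
    fold, votes = Counter(top_fold.values()).most_common(1)[0]
    if votes >= min_votes:
        return fold
    best_method = max(lead, key=lead.get)
    return top_fold[best_method] if lead[best_method] >= margin else None


# Toy scores; method names follow the abstract, fold labels and values are made up.
agree = {"DSM": {"a.1": 5.0, "b.2": 4.0},
         "Seqfold": {"a.1": 3.0, "c.3": 2.5},
         "Loopp": {"d.4": 6.0, "a.1": 5.5}}
standout = {"DSM": {"a.1": 9.0, "b.2": 2.0},
            "Loopp": {"c.3": 4.0, "d.4": 3.8}}
unclear = {"DSM": {"a.1": 4.0, "b.2": 3.5},
           "Loopp": {"c.3": 4.0, "d.4": 3.8}}
```

Returning None when no rule fires mirrors the abstract's stated aim of lowering the rate of false positive hits rather than always naming a template.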
Bion (P0474) - 63 predictions: 63 SS
Secondary Structure Prediction with Shuffled Training by
SPAM
R. Shigeta and J.P. LeFlohic
Bion Bioinformatics Consulting
rtshigeta@yahoo.com
This instance of the Structure Prediction Application Metatool (SPAM) uses
two sequential neural networks. The first is a 15-75-3 sequence-to-structure
network which takes as input the actual residue and a PSIBLAST [2] position
specific sequence profile (PSSM). Similar to the JNET architecture [1], a
window of output from the first neural network is fed into a 15-55-3 structure-to-structure network. SPAM also feeds a copy of the original residue and the
PSSM probability data to the second network.
A non-redundant set of 504 protein sequences and structures from the Protein
Data Bank [3] was used as the training set, with a random 114 set aside as a
non-trained test set. Proteins without any identified secondary structure were
discarded. Upon loading, the sequences are broken into window-length training
patterns and shuffled such that the neural network is presented with each class
of secondary structure at each training step and with a similar number of
examples of each structure.
Training proceeds in epochs until all the errors from the neural networks in the
application cease to change more than an epsilon value which must be assigned
by hand, between 1e-3 and 1e-5. A training epoch is defined as the
presentation of 10,000 patterns, and so the training cycles do not contain
exactly the same data.
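One way to read the shuffled training scheme is sketched below in Python. It is an interpretation, not SPAM's code: the window width of 15 is inferred from the "15-75-3" network description, the padding symbol, the class letters H/E/C, and the choice to truncate every class to the size of the smallest one are assumptions, and the function names are invented for illustration.

```python
import random
from typing import List, Tuple

Pattern = Tuple[str, str]  # (residue window, secondary-structure label of its centre)


def windows(seq: str, ss: str, width: int = 15, pad: str = "X") -> List[Pattern]:
    """Cut a sequence into one window per residue, padding both ends."""
    half = width // 2
    padded = pad * half + seq + pad * half
    return [(padded[i:i + width], ss[i]) for i in range(len(seq))]


def balanced_stream(patterns: List[Pattern], rng: random.Random,
                    classes: str = "HEC") -> List[Pattern]:
    """Shuffle patterns within each class, then interleave the classes so that
    every consecutive group of len(classes) patterns contains each class once."""
    pools = {c: [p for p in patterns if p[1] == c] for c in classes}
    for pool in pools.values():
        rng.shuffle(pool)
    n = min(len(pool) for pool in pools.values())
    stream = []
    for k in range(n):
        for c in classes:
            stream.append(pools[c][k])
    return stream


# Toy sequence with a made-up secondary structure string.
toy_patterns = windows("ACDEFGHIKL", "HHHHEEECCC")
toy_stream = balanced_stream(toy_patterns, random.Random(0))
```

An epoch in the sense of the abstract would then draw 10,000 patterns from such a stream, reshuffling as needed, which is why successive epochs do not contain exactly the same data.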
The final prediction of beta, helix, or coil is selected by choosing the highest of
the three outputs for each residue. No weighting is applied to the outputs.
The confidence is calculated in the standard way as the difference between the
highest float output and the second highest one. CASP entries were then edited
by hand for improbable patterns in secondary structure.
1. Cuff J.A. and Barton G.J. (1999) Application of enhanced multiple
sequence alignment profiles to improve protein secondary structure
prediction. Proteins 40, 502-511.
2. Altschul S.F. et al. (1997) Gapped BLAST and PSI-BLAST: a new
generation of protein database search programs. Nucleic Acids Res. 25
(17), 3389-3402.
3. Berman H.M., Westbrook J., Feng Z., Gilliland G., Bhat T.N., Weissig H.,
Shindyalov I.N., Bourne P.E. (2000) The Protein Data Bank. Nucleic Acids
Research 28, 235-242.
Braun-Werner (P0024) - 65 predictions: 65 3D
Automated Generation of Property Based Motifs to Search for
Functional Neighbors and to Improve Sequence Alignments
Venkatarajan Mathura, Catherine H Schein, Numan Oezguen,
Ovidiu Ivanciuc, Yuan Xu and Werner Braun
Sealy Center for Structural Biology,
Department of Human Biological Chemistry and Genetics,
University of Texas Medical Branch, Galveston, TX 77555-1157
werner@newton.utmb.edu
We have developed a novel automatic method, based on patterns of
conservation of physical-chemical properties (PCPs) of amino acids in aligned
protein sequences, to find distantly related proteins with low sequence identity.
Conservation of PCPs among sequences of protein families can be conveniently
defined in terms of five descriptors, E1 to E5, which represent a large number
(237) of different physical-chemical properties [1]. PCP motifs, i.e., contiguous
residues that are conserved in E1-E5, are automatically generated by our
MASIA Web server [2,3].
The MASIA tool was used to identify 12 motifs, areas of significant sequence
conservation, in an alignment of 42 apurinic/apyrimidinic endonucleases
(APE's) [4]. APE's are part of the base excision repair pathway to replace
damaged sites in DNA resulting from ionizing radiation or oxidation. The
sequence motifs contain all the residues previously shown to be essential for
APE1 function, but we also detected new motifs distinctive for APEs that are
not directly involved in cleavage, but establish protein-DNA interactions 3’ to
the abasic site. These additional bonds enhance both specific binding to
damaged DNA and the processivity of APE1.
Five of the sequence motifs of the APE family are also structurally conserved in
DNase-1 and the IPP family. We call the structural segments corresponding to
the sequence motifs "molegos", molecular legos. Correcting the sequence
alignment to match the residues at the ends of two of the molegos that are
absolutely conserved in each of the three families greatly improved the local
structural alignment of APEs, DNase-1 and synaptojanin. The shared molegos
have a similar metal and DNA binding function in both APE and DNase-1 [4].
Large-scale data mining for APE motifs in the ASTRAL40 database was then
performed using a Bayesian scoring function to identify similar motifs in all
proteins of the database. All of the previously identified distantly related
members of the DNase-I superfamily scored highly. Other high scoring proteins
had no overall sequence or structural similarity to the APEs. However, all were
phosphatases and/or had a similar metal ion binding active site [3]. To test the
ability of our method to functionally annotate novel protein sequences, the
PCP-motif profiles of the APE family were then used to scan the Drosophila
genome. We anticipate that our sequence and structural decomposition of APE
related proteins from different genomes would help us to understand functional
and evolutionary aspects of this protein.
In CASP 5 we tested our method based on physical-chemical property motifs
for improving and ranking different alignments. For each target we prepared
multiple alignments of the target sequence with similar sequences from other
organisms as identified in BLAST/PSIBLAST.
Our MASIA program
generated motif profiles for each target, which were then used by our program
ALIGNSCORER to find high scoring templates and alignments from all fold
recognition servers that participated in CAFASP. For some of the targets we
combined several alignments with high scoring motifs from different fold
recognition servers and from different templates. The highest scoring sequences
were then modeled with the distance geometry based modeling suite MPACK.
1. Venkatarajan M.S. and Braun W. (2001) New quantitative descriptors of
amino acids based on multidimensional scaling of a large number of
physical-chemical properties. J. Mol. Modeling 7, 445-453.
2. Zhu H., Schein C.H. and Braun W. (2000) MASIA: recognition of common
patterns and properties in multiple aligned protein sequences.
Bioinformatics 16, 950-951.
3. Venkatarajan S.M., Schein C.H. and Braun W. (2002) Identifying Property
Based Sequence Motifs in Protein Families and Superfamilies: Application
to APE. Submitted.
4. Schein C.H., Oezguen N., Izumi T. and Braun W. (2002) Total sequence
decomposition distinguishes functional modules, “molegos” in
apurinic/apyrimidinic endonucleases. BMC Bioinformatics (in press).
A-179
Burnham (P0516) - 68 predictions: 68 3D
Automated Modeling Pipeline
M. Grotthuss1, L. Knizewski1, P. Szczesny1, L. Jaroszewski2 and
A. Godzik1
1 - The Burnham Institute, 2 - JCSG Bioinformatics, UCSD
adam@burnham.org
The semi-automated modeling used for CASP5 predictions was based on the
FFAS03 server (see the abstract "FFAS03: automated profile-profile distant
homology recognition server applied to fold recognition" by L. Jaroszewski
and A. Godzik in the same volume).
Target sequences were submitted to the FFAS03 fold prediction server and the
top 20 predictions were analyzed, as explained below. In cases when no high
reliability FFAS03[1],[2] predictions were available, predictions and
alignments from other servers included in the Metaserver[3] were included as
well.
Top predictions from the FFAS server (or from all servers in the Metaserver)
were clustered based on the structural similarity of the predicted templates,
based on the SCOP classification. For each cluster, all PDB structures with the
same SCOP superfamily classification were also aligned with the target. All
alignments were then compared and analyzed for consistency. The most
consistent core of the alignment was used for modeling with the program
NEST[4] from the Jackal package. Loops were added with the loop building
procedure LOOPY[5], also from the Jackal package. In most cases, several
alternative alignments were explored.
The entire group of several models based on alternative alignments with each
superfamily of potential templates was then evaluated using several energy
based methods. PSQS[6] server was used to calculate average energy of the
models. Models were analyzed to check if the function predicted from
homology, database annotation and genomic context analysis were compatible.
The final model was accepted based on a jury system, where various criteria
(energy, agreement with function prediction, completeness of the model etc.)
were weighted into a single scoring system.
1. Rychlewski L., Jaroszewski Ł., Li W. & Godzik A. (2000) Comparison of
sequence profiles. Strategies for structural predictions using sequence
information. Protein Science 9, 232-241.
2. Jaroszewski Ł., Rychlewski L. & Godzik A. (2000) Improving the quality
of twilight-zone alignments. Protein Science 9, 1487-1496.
3. Bujnicki J.M., Elofsson A., Fischer D., Rychlewski L. (2001) Structure
prediction meta server. Bioinformatics 17 (8), 750-751.
4. Xiang Z., Honig B. Homology model building with artificial evolution (in
preparation).
5. Xiang Z., Honig B. Evaluating configurational free energies: the
colony energy concept and its application to the problem of protein loop
prediction. Proc. Natl. Acad. Sci. USA 99, 7432-7437.
6. Protein Sequence Quality Score - model evaluation server,
http://www.jcsg.org/psqs/.
Bystroff (P0131) - 132 predictions: 45 3D, 40 SS, 45 RR, 2 DR
Contact Map Threading Using HMMSTR
Y. Shao and C. Bystroff
Department of Biology, Rensselaer Polytechnic Institute
shaoy@rpi.edu, bystrc@rpi.edu
See methods section
Camacho-Carlos (P0098) - 46 predictions: 46 3D
Automated Consensus Method of Alignment for Confident
Comparative Modeling
Jahnavi C. Prasad, Sandor Vajda, Carlos J. Camacho
Bioinformatics Program, Boston University, Boston, MA 02215
ccamacho@bu.edu
We have developed an algorithm that consistently gives a high quality
alignment for comparative modeling, and identifies the regions of this
alignment that are reliable and structurally similar between the template and
target. In order to identify a consistent way to get an accurate alignment, ten
popular alignment methods were tested against a set of 79 pairs of homologous
proteins for alignment accuracy in the context of comparative modeling. The
top five performing methods were selected and a method for generating a
consensus by combining the alignments from these five methods has been
subsequently developed. By building on the strength of the consensus
alignment, we have identified a set of criteria that remove alignment zones
corresponding to structurally dissimilar regions and poor alignment reliability.
When applied over an independent set of 49 homologous protein structure
pairs, the average RMS deviation of the structures obtained with this
consensus-based alignment is on the order of 2.5 Å, while the length of the
alignment is about 80% of that found by standard structural superposition
methods. While the selected top five methods had 20-40% of the alignments
that would yield predicted structures with RMS deviations of 6 Å or more from
the native structure, there were no such cases at all with our method. In our
tests, the method performs consistently over a range of target-template
sequence identity spanning 5-30%. The algorithm is currently available as a
server at http://structure.bu.edu/cgi-bin/consensus.cgi
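The abstract does not spell out how the five alignments are combined, so the
following is only a minimal sketch under a stated assumption: each alignment
is reduced to a set of (target position, template position) pairs, and a pair
is kept as reliable when at least min_votes of the methods agree on it. The
function name, the vote threshold and the toy alignments are hypothetical.

```python
from collections import Counter

def consensus_pairs(alignments, min_votes):
    """Keep aligned residue pairs supported by at least min_votes alignments.

    alignments: list of sets of (target_pos, template_pos) pairs,
    one set per alignment method.
    """
    votes = Counter(pair for aln in alignments for pair in set(aln))
    return sorted(pair for pair, n in votes.items() if n >= min_votes)

# Five toy alignments of a four-residue target (illustrative data only).
alignments = [
    {(0, 0), (1, 1), (2, 2), (3, 4)},
    {(0, 0), (1, 1), (2, 2), (3, 3)},
    {(0, 0), (1, 1), (2, 3), (3, 4)},
    {(0, 1), (1, 1), (2, 2), (3, 4)},
    {(0, 0), (1, 1), (2, 2), (3, 5)},
]
reliable = consensus_pairs(alignments, min_votes=4)
```

Target position 3 is dropped because the methods disagree on its template
partner, which mirrors the idea of removing alignment zones with poor
reliability.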
1. Prasad J.C., Comeau S.R., Vajda S., Camacho C.J. Confident Homology
   Modeling Based On Consensus Alignment. Submitted for publication.
CaspIta (P0108) - 133 predictions: 70 3D, 63 SS
The one with the lowest energy is selected as the most probable loop
conformation.
Fast Loop Modeling of Insertions and Deletions with
Integrated Side Chain Placement and Energy Minimization
S. C. E. Tosatto1, F. Fogolari2, A. Cestaro1 and G. Valle1
1 - CRIBI Biotechnology Centre, Universita' di Padova
2 - Science and Technology Dept., Universita' di Verona
silvio@cribi.unipd.it
An extended protocol of the fast divide and conquer loop modelling method of
Tosatto et al. [1] is used as a basis to construct insertions and deletions in
models built by homology. The initial target to template alignment is modified
to optimize the distances between the flanking regions of insertions and
deletions. For insertions, the flanking regions are preferably distant and
exposed to the solvent. Deletions are shifted to minimize the distance between
the flanking regions. In both cases a number of residues on either flank are
selected to be modeled together with the insertion or deletion. Regular
secondary structure elements were generally chosen as boundaries for the loops
to be modelled.
Segments of the amino acid backbone chosen in this way are first generated and
ranked using the divide and conquer method [1]. This method uses a series of
artificial fragment databases generated from a Ramachandran plot distribution
of (phi,psi) torsion angles found in loops to generate different loop
conformations. Conformations showing strong steric clashes or amino acids in
disallowed regions of the Ramachandran map (e.g. Proline) are eliminated. The
remaining conformations are ranked according to a combination of geometric
fit to the flanking regions and a knowledge-based potential.
Each of the top twenty solutions is then subjected to the following steps. Side
chains are placed for the entire protein using SCWRL [2] to account for
changes in side chain rotamers induced by different loop conformations. The
CHARMM force field [3] without electrostatics is then used to minimize the
loop in the context of the protein. One hundred steps of steepest descent and five
hundred steps of conjugate gradient minimization are performed to relax the
initial model. The final models are ranked according to the CHARMM energy.
1. Tosatto S.C.E. et al. (2002) A divide and conquer approach to fast loop
   modeling. Protein Eng. 15(4), 279-286.
2. Bower M.J. et al. (1997) Prediction of protein side-chain rotamers from a
   backbone-dependent rotamer library: A new homology modeling tool. J.
   Mol. Biol. 267, 1268-1282.
3. MacKerell A.D. Jr. et al. (1998) All-hydrogen empirical potential for
   molecular modeling and dynamics studies of proteins using the
   CHARMM22 force field. J. Phys. Chem. B 102, 3586-3616.
CBC-FOLD (P0008) - 151 predictions: 151 3D
What’s so Good About Real Proteins?
Ajay K. Royyuru, Ruhong Zhou, Prasanna Athma, B. David
Silverman, Gelonia Dent and Rosalia Tungaraza
Computational Biology Center, IBM Thomas J. Watson Research Center,
Yorktown Heights, NY 10598, USA
ajayr@us.ibm.com
See methods section
CHIMERA (P0153) - 94 predictions: 94 3D
Comparative Modeling Using CHIMERA Modeling System
Mayuko Takeda-Shitaka, Chieko Chiba, Hirokazu Tanaka,
Daisuke Takaya and Hideaki Umeyama
Kitasato University
shitakam@pharm.kitasato-u.ac.jp
See methods section
CHIMERAX (P0170) - 74 predictions: 74 3D
Full Length Protein Modeling Using CHIMERA eXtending
Procedure
Genki Terashi, Ryota Yamatsu, Youji Kurihara, Mayuko Takeda-Shitaka,
Mitsuo Iwadate and Hideaki Umeyama
Kitasato University
kuriharay@pharm.kitasato-u.ac.jp
See methods section
CIRB (P0397) - 263 predictions: 200 3D, 63 RR
Detecting High Quality Profile-Profile Alignments Using
Shannon Entropy.
E. Capriotti1,3, P. Fariselli2, I. Rossi2,3 and R. Casadio2
1 - Dept. of Physics/CIRB, University of Bologna, Italy, 2 - Dept. of
Biology/CIRB, University of Bologna, Italy, 3 - BioDec srl, Bologna, Italy
ivan@biocomp.unibo.it, casadio@alma.unibo.it
We analyze the quality of the alignment generated by the profile-profile
alignment comparison algorithm known as BASIC [1] and compare the results
with those obtained with a structural alignment code. By this we compute that a
Shannon entropy value > 0.5 gives a sequence to sequence alignment of the
target/template couple comparable to that obtained with the structural
alignment performed with CE.
In our fold recognition/threading code Tangram, the BASIC profile-profile
alignment is implemented as follows:
1) The composition profiles $P_A$ and $P_B$ for the target and template are
generated by multiple alignment of the sequences obtained from a
three-iteration PSI-BLAST [2] search on the Non-Redundant database (the
inclusion threshold is $E = 10^{-3}$).
2) The dot matrix $D$ for the profile comparison of two protein sequences,
$D = P_A^{T} S P_B$ (with $S$ the BLOSUM62 [3] substitution matrix), is
computed using linear algebra routines.
3) The $D$ matrix is searched for high-scoring alignments by means of the
local Smith-Waterman dynamic programming algorithm [4].
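Steps 2 and 3 can be sketched in a few lines of NumPy. This is not the
Tangram code: the function names are hypothetical, a toy three-letter alphabet
with a match/mismatch matrix stands in for the 20 amino acids and BLOSUM62,
and a linear gap penalty of 1 is an assumption, since the abstract does not
state the gap model.

```python
import numpy as np

def profile_dot_matrix(PA, PB, S):
    """D = PA^T S PB; PA and PB hold one profile column per sequence position."""
    return PA.T @ S @ PB

def smith_waterman_score(D, gap=1.0):
    """Best local alignment score over D with a linear gap penalty (assumed)."""
    n, m = D.shape
    H = np.zeros((n + 1, m + 1))
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            H[i, j] = max(
                0.0,                               # start a new local alignment
                H[i - 1, j - 1] + D[i - 1, j - 1], # align column i with column j
                H[i - 1, j] - gap,                 # gap in the second profile
                H[i, j - 1] - gap,                 # gap in the first profile
            )
    return float(H.max())

# Toy data: 3 residue types, score +1 for a match and -1 for a mismatch.
S = 2.0 * np.eye(3) - 1.0
PA = np.eye(3)[:, [0, 1, 2]]   # one-hot "profile" for the sequence 0,1,2
PB = np.eye(3)[:, [1, 2]]      # one-hot "profile" for the sequence 1,2
D = profile_dot_matrix(PA, PB, S)
score = smith_waterman_score(D)
```

With real profiles each column is a probability vector over 20 residue types,
so each entry of D is the expected substitution score between two columns.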
The test set used for the evaluation is composed of 185 template/target couples
of PDB structures that share the same SCOP label, but have less than 30%
sequence identity.
When the top-scoring alignment for each target protein in the test set is
considered, our BASIC implementation detects the full SCOP label for 125
couples (68%) and generates 114 (62%) alignments with a MaxSub [5] score
>=1.
Interestingly, it is found that nearly all of the high-quality alignments share a
common feature: the average Shannon entropy for the profile sections aligned
together is greater than 0.5 for both the template and the target.
If only the top scoring alignments for which this condition holds are
considered, a subset of 119 alignments is selected, and for 116 of them (97%)
the full SCOP label can be assigned to the target, while 108 (91%) get a
nonzero MaxSub score, with an average MaxSub score of 4.6 on the subset.
On the same 119 couples, the structural alignment program CE [6] computes a
nonzero MaxSub score for 116 of them, with an average of 5.7 points.
These results indicate that the Shannon entropy value can be used to
discriminate a subset of sequence profile-profile alignments of quality
comparable to that obtained by means of a structural alignment program.
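The entropy criterion can be expressed as a simple filter on the aligned
profile columns. The sketch below is illustrative only: the function names
are hypothetical, and the use of the natural logarithm is an assumption,
because the abstract gives the 0.5 threshold without stating the log base.

```python
import numpy as np

def column_entropy(profile):
    """Shannon entropy of each column; profile is (n_types, length), columns sum to 1."""
    safe = np.clip(profile, 1e-12, 1.0)   # avoid log(0); 0 * log(.) contributes 0
    return -(profile * np.log(safe)).sum(axis=0)

def passes_entropy_filter(PA, PB, aligned_pairs, threshold=0.5):
    """True when the mean entropy of the aligned columns exceeds the
    threshold for both the target and the template profile."""
    ia = [i for i, _ in aligned_pairs]
    ib = [j for _, j in aligned_pairs]
    mean_a = column_entropy(PA)[ia].mean()
    mean_b = column_entropy(PB)[ib].mean()
    return bool(mean_a > threshold and mean_b > threshold)

# Toy profiles over 4 residue types: fully variable vs. fully conserved columns.
uniform = np.full((4, 3), 0.25)
one_hot = np.eye(4)[:, [0, 1, 2]]
pairs = [(0, 0), (1, 1), (2, 2)]
```

A fully conserved (one-hot) profile carries zero entropy and is rejected,
while a uniform profile has entropy ln 4 per column and passes.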
1. Rychlewski J. et al. (1998) Fold and function predictions for Mycoplasma
   genitalium proteins. Fold. Des. 3, 229-238.
2. Altschul S.F. et al. (1997) Gapped BLAST and PSI-BLAST: a new
   generation of protein database search programs. Nucleic Acids Res. 25
   (17), 3389-3402.
3. Henikoff S. et al. (1998) Superior performance in protein homology
   detection with the BLOCKS database server. Nucleic Acids Res. 26,
   309-312.
4. Smith T. S. and Waterman M. S. (1981) Identification of common
   molecular subsequences. J. Mol. Biol. 147, 145-147.
5. Siew N. et al. (2000) MaxSub: an automated measure for the assessment of
   protein structure prediction quality. Bioinformatics 16(9), 776-785.
6. Shindyalov I. N. and Bourne P. E. (1998) Protein structure alignment by
   incremental combinatorial extension (CE) of the optimal path. Prot. Eng.
   11(9) 739-74
DelCLAB (P0050) - 310 predictions: 310 3D
Protein Folding Prediction by Spectral Analysis Methods
Carlos A. Del Carpio-Muñoz, Hideto Shirasawa, and Kensuke
Hagino
Lab. for BioInformatics. Dept. of Ecological Eng. Toyohashi University of
Technology.
Tempaku. Toyohashi. 441-8580
carlos@translell.eco.tut.ac.jp
A novel technique for protein folding recognition is presented here, which
consists in applying a well known technique of front-end processing in robust
automatic speech recognition (ASR) to the problem of protein fold recognition.
This analysis-synthesis technique is based on the transformation of a signal into
its cepstrum which is a measure of the periodic wiggliness of a frequency
response plot. The cepstrum is calculated as the logarithm of the power
spectrum of a signal, which is the expression of the primary structure of a
protein using the physicochemical characteristics of the constituting amino
acids. This leads to a logarithmic periodogram for which the spectral envelope is
obtained as a smooth curve depicted by connecting the main local peaks of the
minute structure of the frequency spectrum [1-2].
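A minimal sketch of the signal-processing step follows. Several choices here
are assumptions, not the authors' settings: the sequence is encoded with the
Kyte-Doolittle hydropathy scale (the abstract does not name the
physicochemical scale), and the envelope is obtained by low-quefrency
liftering of the cepstrum, a standard smoothing used in speech processing,
rather than by connecting local peaks as described above.

```python
import numpy as np

# Kyte-Doolittle hydropathy values, used here only as an example scale.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5, "Q": -3.5, "E": -3.5,
    "G": -0.4, "H": -3.2, "I": 4.5, "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8,
    "P": -1.6, "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}

def encode(sequence, scale=KYTE_DOOLITTLE):
    """Turn an amino acid sequence into a numeric signal."""
    return np.array([scale[aa] for aa in sequence], dtype=float)

def log_power_spectrum(signal):
    power = np.abs(np.fft.rfft(signal)) ** 2
    return np.log(power + 1e-12)          # small constant guards against log(0)

def real_cepstrum(signal):
    """Inverse transform of the log power spectrum."""
    return np.fft.irfft(log_power_spectrum(signal), n=len(signal))

def spectral_envelope(signal, n_keep=4):
    """Smooth log spectrum: keep only the n_keep lowest cepstral coefficients."""
    c = real_cepstrum(signal)
    lifter = np.zeros_like(c)
    lifter[:n_keep] = 1.0
    if n_keep > 1:
        lifter[-(n_keep - 1):] = 1.0      # mirrored half of the symmetric cepstrum
    return np.fft.rfft(c * lifter).real

signal = encode("MKTAYIAKQRQISFVKSHFSRQ")   # arbitrary example sequence
envelope = spectral_envelope(signal)
```

Keeping fewer cepstral coefficients gives a smoother envelope; keeping all of
them simply returns the original log power spectrum.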
The technique applied to the analysis of the profile of physicochemical features
of the amino acid sequence of the protein allows extraction of information in
the form of the spectral envelope, which is used to model the relationship
between the primary and tertiary structures of a protein.
Spectra are aligned using dynamic programming algorithms, and amino acid
sequences are expressed using a set of dominant physicochemical parameters
that are able to model a super-family in the SCOP database [3].
The fold recognition technique is complemented with an analysis of the
secondary structure of the target, aligning it with the secondary structure
obtained as consensus of several secondary structure predicting methods.
The threading of the target sequence on the most plausible template is
performed by a genetic algorithm that has as penalty function the calculation of
the deviation derived by cutting and inserting fragments of structure into the
template structure.
The methodology presented here introduces several new concepts which can be
directly related to the function of the molecule. In recognizing a particular
fold by identifying a characteristic spectrum representing the entire
super-family the target may belong to, the methodology works under the
assumption that the function and not only the structure (since the sequence
homology worked with here lies approximately in the twilight zone) has been
encoded as a spectrum, from which not only structural homology but also
protein function may be read.
1. Del Carpio C.A. and Yoshimori A. Fully Automated Protein Tertiary
   Structure Prediction Using Fourier Transform Spectral Methods. Protein
   Structure Prediction: Bioinformatic Approach. Edited by: Igor Tsigelny.
   University of California. International University Line Inc. 173-197
   (2002).
2. Del Carpio-Muñoz C.A. Folding Pattern Recognition in Proteins Using
   Spectral Analysis Methods. Genome Informatics. In Press (2002).
3. SCOP (http://scop.mrc-lmb.cam.ac.uk/scop/)
based on calculating the Jacobian such as “random tweak” [1] are slow and
sometimes do not converge. They also require a matrix inversion, which may
sometimes lead to singularities. One algorithm used in robotics that is flexible
in allowing constraints to be placed at each step, easy to program, conceptually
simple and elegant, and computationally fast is “cyclic coordinate descent”
(CCD). This algorithm was originally developed by Li-Chun Tommy Wang et
al.
al in 1991 [2] as an improved method for solving inverse kinematics problems
in robotics. It involves adjusting one degree of freedom at a time to move the
end effector toward the target object. This results in one equation in one
unknown for each degree of freedom, and hence is analytically very simple and
computationally fast. The method is free of singularities and it does not include
matrix inversion. It proceeds in iterative fashion along a chain of degrees of
freedom, modifying each joint so that the end effector gets as close as possible
to the desired position. The equations are able to provide both an optimum
setting for the variable and the first and second derivative of the change at the
current position so that small increments can be made in preference to large
changes, if desired. Given that the calculation of a parameter in one joint does
not depend on parameters of the other joints, one can also place constraints on
any degree of freedom, choosing to restrict their allowed values or place
probability distributions on them.
We show that CCD can close loops from nearly any starting configuration as
long as the chain is long enough to reach from N-terminus residue anchor to the
C-terminus residue anchor. In tests of over 250,000 random conformations,
CCD was able to close 99.95% of them. It fails only on a few very short,
extended loop conformations. In this case, a Monte Carlo step that moves the
end effector away from the anchor can be implemented. We have also explored
the use of Ramachandran probability maps as constraints in the CCD closure
procedure, and show that they do not affect the success rate of loop closure by
CCD.
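The idea of adjusting one degree of freedom at a time can be shown with a
planar toy chain. This is a simplified analogue, not the authors' loop-closure
code: real loops rotate backbone dihedral angles in three dimensions and
match several anchor atoms at once, whereas here each joint is a 2D hinge and
the target is a single point. The function name and parameters are
hypothetical.

```python
import math

def ccd_close(points, target, tol=1e-3, max_iter=500):
    """Cyclic coordinate descent on a planar chain.

    points: list of (x, y) joint positions; points[0] is the fixed anchor
    and points[-1] is the end effector. Returns the adjusted positions.
    """
    pts = [list(p) for p in points]
    for _ in range(max_iter):
        # Sweep from the joint nearest the end effector back to the anchor.
        for k in range(len(pts) - 2, -1, -1):
            jx, jy = pts[k]
            ex, ey = pts[-1]
            # One equation in one unknown: the rotation about joint k that
            # points the end effector straight at the target.
            rot = (math.atan2(target[1] - jy, target[0] - jx)
                   - math.atan2(ey - jy, ex - jx))
            c, s = math.cos(rot), math.sin(rot)
            for m in range(k + 1, len(pts)):
                dx, dy = pts[m][0] - jx, pts[m][1] - jy
                pts[m] = [jx + c * dx - s * dy, jy + s * dx + c * dy]
        if math.dist(pts[-1], target) < tol:
            break
    return pts

# A straight four-link chain of unit segments, closed onto a reachable target.
chain = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (3.0, 0.0), (4.0, 0.0)]
closed = ccd_close(chain, target=(2.0, 2.0))
```

No matrix is built or inverted at any point, and a per-joint constraint (for
example a restricted angle range) could be applied right after `rot` is
computed.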
Dunbrack (P0329) - 46 predictions: 46 3D
New Algorithms for Loop And Side-Chain Prediction
A. A. Canutescu, A. A. Shelenkov, and R. L. Dunbrack, Jr.
Fox Chase Cancer Center, Philadelphia PA USA
RL_Dunbrack@fccc.edu
We present two new algorithms that can be used in comparative modeling of
protein structures. The first is a new method to solve the “loop closure
problem”. In many methods of loop prediction, random loop conformations are
generated and must be adjusted to connect N and C-terminal anchors in the
secondary structures neighboring the loop to be predicted. Current algorithms
We also present a new algorithm for our side-chain prediction program
SCWRL [3]. SCWRL relies on the fact that sidechains in proteins are "sparsely
connected." If we represent the residues of a protein as the vertices in a graph
and an edge between two residues as a potential steric clash for some pair of
rotamers, then the number of edges per residue is much smaller than the
number of vertices (residues) in the graph. In practice, this graph is rarely
connected. It will consist of several clusters of interacting residues with no
connecting edges between the clusters. If the clusters are not large, they are
searched combinatorially in a branch-and-bound procedure. Frequently, the
clusters become too large to handle combinatorially. In this case, SCWRL
searches for a single residue (the "keystone") whose removal from the cluster
will break up the cluster into two graphs which are not interconnected. If such a
residue can be found, then each subgraph can be solved once for each rotamer
of the "keystone residue."
We propose a new algorithm for SCWRL that breaks up clusters of interacting
sidechains into the biconnected components of an undirected graph.
Biconnected graphs are those that can not be broken apart by removal of a
single vertex. In practice, residue clusters in the proteins are broken up entirely
into clusters of size less than 10. These clusters can be solved in a second or
less. By contrast in the "old" SCWRL, clusters of size 15-20 occur in some
proteins (and more frequently in homology modeling situations), and these can
take minutes and occasionally hours to solve.
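The decomposition into biconnected components can be illustrated with the
classic depth-first search that tracks low-link values and an edge stack.
This is a generic textbook sketch, not the SCWRL implementation; the function
name and the toy interaction graph are hypothetical, and the recursive form
assumes clusters small enough not to hit Python's recursion limit.

```python
def biconnected_components(graph):
    """Return the vertex sets of the biconnected components.

    graph: dict mapping each vertex to an iterable of its neighbours
    (undirected, no self-loops or parallel edges). Vertices without any
    edge do not appear in the result.
    """
    index, low = {}, {}
    edge_stack, components = [], []
    counter = [0]

    def dfs(v, parent):
        index[v] = low[v] = counter[0]
        counter[0] += 1
        for w in graph[v]:
            if w == parent:
                continue
            if w not in index:
                edge_stack.append((v, w))
                dfs(w, v)
                low[v] = min(low[v], low[w])
                if low[w] >= index[v]:
                    # v separates the subtree under w: pop one component.
                    comp = set()
                    while True:
                        a, b = edge_stack.pop()
                        comp.update((a, b))
                        if (a, b) == (v, w):
                            break
                    components.append(comp)
            elif index[w] < index[v]:
                edge_stack.append((v, w))   # back edge to an ancestor
                low[v] = min(low[v], index[w])

    for v in graph:
        if v not in index:
            dfs(v, None)
    return components

# Two triangles of clashing residues that share the single residue "C".
clashes = {
    "A": ["B", "C"], "B": ["A", "C"], "C": ["A", "B", "D", "E"],
    "D": ["C", "E"], "E": ["C", "D"],
}
parts = biconnected_components(clashes)
shared = {v for v in clashes if sum(v in comp for comp in parts) > 1}
```

Residues that belong to more than one component (here only "C") play the
role of the "keystone" residue: once a rotamer is fixed for them, the
components on either side can be solved separately.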
1. Shenkin P.S. et al. (1987) Predicting antibody hypervariable loop
   conformation. I. Ensembles of random conformations for ringlike
   structures. Biopolymers 26 (12), 2053-2085.
2. Wang L. T. and Chen C. C. A combined optimization method for solving
   the inverse kinematics problem of mechanical manipulators. IEEE Trans.
   Robotics and Automation. 7 (4), 489-499.
3. Bower M.J., Cohen F.E., and Dunbrack R.L. Jr. Prediction of protein
   side-chain rotamers from a backbone-dependent rotamer library: a new
   homology modeling tool. J. Mol. Biol. 267 (5), 1268-1282.
Dunbrack (P0329) - 46 predictions: 46 3D
Comparative Modeling of CASP5 Targets
G. Wang and R. L. Dunbrack, Jr.
Fox Chase Cancer Center, Philadelphia PA USA
RL_Dunbrack@fccc.edu
We have developed two new scoring mechanisms for profile-profile
alignments. The first is a Dirichlet mixture substitution matrix (DIMSUM)
analogous to ordinary amino acid substitution matrices, but in which the scores
represent probabilities of substituting profile columns for one another. The
columns in the profiles are represented as components of a Dirichlet mixture
developed from multiple sequence alignments and structural characteristics
(secondary structure and surface exposure). The DIMSUM matrices were
developed from structure alignments of homologous proteins using the CE
program [1] in a manner similar to the BLOSUM matrices [2]. The
profile-profile alignments are performed with a standard local-alignment
dynamic programming algorithm.
The second scoring method is a combination of an amino acid substitution
matrix and a matrix that represents the probability of predicted secondary
structure in one profile (the CASP target) aligning to known secondary
structure in the PDB entry. This matrix (SSAAC) was also developed from
structure alignments by determining the substitution rates of predicted
secondary structure in one protein in each structural alignment versus known
secondary structure in the other protein. We combined both DIMSUM and
SSAAC with a structure-derived amino acid substitution matrix (SDM) [3],
applied to the two profile columns, such that the score is the sum over all i,j of
$p_i q_j S_{ij}$, where $p_i$ and $q_j$ are the probabilities of amino acid types
i and j in the two columns and $S_{ij}$ is the element from the substitution
matrix. We use a gap
penalty scheme that is dependent on the evolutionary distance of the two
profiles. The scoring schemes were optimized at 50% SDM/50%DIMSUM for
the DIMSUM method and 65% SDM/35% SSAAC for the SSAAC method.
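The column-column term is a bilinear form and can be written in one line of
NumPy. The sketch below uses a toy three-letter substitution matrix in place
of SDM, and it assumes that the quoted percentages act as weights in a linear
mix of the two scores; the abstract does not state the exact combination
rule, so `combined_score` is hypothetical.

```python
import numpy as np

def column_score(p, q, S):
    """Sum over i, j of p_i * q_j * S_ij for two profile columns p and q."""
    return float(p @ S @ q)

def combined_score(sdm_score, other_score, w_sdm):
    """Linear mix of the SDM term with a second term (DIMSUM or SSAAC)."""
    return w_sdm * sdm_score + (1.0 - w_sdm) * other_score

# Toy symmetric substitution matrix over 3 residue types (illustrative only).
S = np.array([[2.0, -1.0, 0.0],
              [-1.0, 3.0, -2.0],
              [0.0, -2.0, 4.0]])
p = np.array([0.5, 0.5, 0.0])   # column from the first profile
q = np.array([1.0, 0.0, 0.0])   # column from the second profile

sdm = column_score(p, q, S)
mixed = combined_score(sdm, other_score=1.5, w_sdm=0.5)   # 50% SDM / 50% DIMSUM
```

For the SSAAC variant the same call would be used with w_sdm=0.65.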
We show that both methods compare very well with other profile-profile
alignment schemes published by other groups in terms of alignment accuracy
and search sensitivity.
1. Shindyalov I.N. and Bourne P.E. (1998) Protein structure alignment by
   incremental combinatorial extension (CE) of the optimal path. Protein
   Eng. 11 (9), 739-747.
2. Henikoff S. and Henikoff J.G. (1993) Performance evaluation of amino
   acid substitution matrices. Proteins 17 (1), 49-61.
3. Prlic A., Domingues F.S. and Sippl M.J. Structure-derived substitution
   matrices for alignment of distantly related sequences. Protein Eng. 13 (8)
   545-550.
evolutionaries (P0180) - 99 predictions: 99 3D
A Phylogenomic Approach to Fold Prediction
Kimmen Sjölander1, Emma Hill1, David Konerding1,
Steven Brenner1, Andrej Sali2 and Andras Fiser2
1 – UC Berkeley, 2 – Rockefeller University
kimmen@uclink.berkeley.edu
See methods section
GeneSilico.PL-servers-only (P0242) - 68 predictions: 66 3D, 2 SS
Bujnicki-Janusz (P0020) - 215 predictions: 67 3D, 58 SS, 49 RR, 41 DR
GeneSilico (P0517) - 195 predictions: 86 3D, 64 SS, 45 RR
From Automated Models, to Refinement by a Human Expert,
to Combination of Alternative Solutions Obtained by
Independent Predictors
M. Feder, I. Cymerman, J. Kosinski, J. Sasin, M. Kurowski,
J.M. Bujnicki
International Institute of Molecular and Cell Biology (IIMCB) in Warsaw.
Trojdena 4, 01-109 Warsaw, Poland
iamb@genesilico.pl
The results of the last two CASP and CAFASP assessments in the
fold-recognition (FR) category revealed that most of the top groups use the fully
automated predictions generated by their own servers and/or other CAFASP
servers as the starting point for protein model building and refinement.
Interestingly, the performance difference between the human experts and
computer predictors continues to narrow, which suggests that most of the
refinement procedures used by humans can be fully automated. Several
attempts have been made to quantify and monitor the “gap” between the quality
of automated predictions and the models refined by humans, but to date no
comparison has been made on a “case-to-case” basis.
In CASP4, J.M. Bujnicki participated as a member of the BioInfo duumvirate, as
well as one of four experts of the CAFASP-consensus group. Within
BioInfo, he was responsible for building and refinement of all targets in the HM
and FR categories, while in CAFASP-consensus, he participated in
identification of the best automated models and in inference of a rational
consensus between them. While the unrefined predictions gave
CAFASP-consensus the overall ranking of 7, the refined predictions gave BioInfo
an even better score. However, it was not always clear if the improvement stemmed
from the refinement or from application of different criteria for selection of the
best automated models by the two groups.
In CASP5, we attempted to assess, in a systematic way, the value added to the
automated model by the refinement carried out by a single human expert, as
well as the (possible) value of additional input from less experienced
predictors. Therefore, the GeneSilico team of the Bioinformatics Laboratory at
the International Institute of Molecular Biology in Warsaw submitted
predictions as three independent groups: i) GeneSilico-servers-only (selection
of unrefined FR models), ii) Bujnicki-Janusz (models refined by a single
experienced predictor), and iii) GeneSilico (consensus obtained after careful
evaluation and comparison of models generated independently by all members
of the team).
Selection of the best automated model by GeneSilico-servers-only in CASP5
was carried out in a similar manner to that of CAFASP-consensus in CASP4,
only in a more disciplined way. Both groups relied on automatic models
generated by the CAFASP servers. In CAFASP-consensus, modeling involved
(at least in some cases) shifting of insertions and deletions to the
surface-exposed regions and limited refinement of loops, without any changes of the
target-template alignment in the core regions. On the other hand, the human
intervention of GeneSilico-servers-only in CASP5 involved only selection of
one of the FR alignments or one of the atomic models generated by HM or ab
initio servers. Automated FR models were based on single templates and were
preferably submitted in the AL format without explicit modeling of insertions
or deletions to avoid the distortion of the raw data that automatic
homology modeling introduces in some cases, such as disruption of the protein
core.
Bujnicki-Janusz used the selected FR models as a starting point to generate
refined models. Here, no limits were placed on modification of the original
alignment and inclusion of additional templates. In several cases, large parts of
the models (>10 aa) were (re)built by hand. The major constraint on the level of
refinement was the limited amount of time available for each model, given the
large number of targets in CASP5. Five additional members of the GeneSilico
team explored alternative ways of refinement. In many cases, alternative
models, different from that submitted by Bujnicki-Janusz, could be obtained.
After evaluation of all models by knowledge-based potentials, the best model
or the hybrid comprising best fragments of several models was submitted.
Bujnicki-Janusz was rather stringent in rejecting uncertain predictions, resulting
in submission of models for only a fraction of the CASP5 targets. On the other
hand, GeneSilico attempted to submit models for all targets for which at least a
plausible prediction could be made. In addition to comparison of automated and
refined models, evaluation of the relative performance of a single expert
(Bujnicki-Janusz) and the expert aided by a team of less experienced predictors
(GeneSilico) will allow us to assess the influence of the time constraints
(person-hours/model) as well as of additional, though somewhat naïve sampling
of the alignment/model space, on the quality of the final prediction.
GERLOFF (P0240) - 9 predictions: 9 3D
Incorporation of Constraints Derived
from Active/Functional Site Predictions in
Protein Tertiary Structure Assembly
R. Schmid, D. C. Soares, Z. A. M. Hussein, B. J. Mitchell,
R. S. Hamilton and D. L. Gerloff
Biocomputing Research Unit & Structural Biochemistry Group,
Institute of Cell and Molecular Biology, University of Edinburgh, UK
d.gerloff@ed.ac.uk
We submitted tertiary structure predictions for five CASP5 target proteins in
order to investigate the potential of knowledge and/or predictions about
functional sites in these proteins for being used in combination with established
structure prediction methods. The degrees of difficulty assigned to the
prediction targets, and the categories in which our predictions are considered,
vary - the monomer of T0132 is clearly homologous to its template whereas we
could not find any suitable template structures for T0129*. Similarly, the way
in which functional site information is used, and its impact on the final model
varies slightly from target to target.
Our primary postulates are that:
(a), the interchange between structure and function prediction (or knowledge)
leads to improvement at both ends; (b), formulation/adaptation of systematic
fold-specific heuristics and function-specific heuristics is possible, at least for
certain folds and functions; (c), prediction of structure/ function can go beyond
trying to find re-occurrences of known cases.
While we found little opportunity within the set of CASP5 targets to
demonstrate and/or test postulate (b) (CASP4 T0100 was a good example), we
attempted to use function prediction/knowledge in all predictions we submitted.
Primarily, we used predicted key residues in proteins presumed to function as
enzymes to “anchor” threading alignments (in T0130, T0173, and to an extent
in T0136 and T0132) so that their arrangement in the model would allow
catalysis. We could not find a suitable fold template for T0129 and used the
presumed proximity of presumed functional residues to guide the assembly of
helices ab initio.
Prediction of key residues from multiple sequence alignments was generally
based on complete, or high, conservation of functional type amino acids,
sometimes taking into consideration patterns of conservation similar to those
described in [1]. The choice of template structures used in our predictions was
often influenced by the publicly available CAFASP2 predictions by automated
servers, albeit not exclusively. Here again, the compatibility between the folds
and biologically sensible arrangements of predicted key residues was our
primary criterion in non-obvious cases. Secondary structure predictions by
CAFASP2-servers were used by default but often refined according to [1] and
in the course of modeling.
On our poster, we are discussing the value of our predictions in light of the
experimental structures. Particularly interesting besides the structural
discussion will be to re-assess the speculative functional roles of individual
predicted key residues that we attempted to assign in most of our submissions.
These blind predictions of functional aspects are influenced by the structure
predictions as much as vice versa.
While the “manual component” in our CASP-predictions is obviously
significant, our goal is to identify systematic aspects in the way biochemists’
knowledge influences (and quite often improves) tertiary structure predictions,
with the goal of providing “refinement modules” for existing automated
methods. Besides functional site assembly, consideration of the usually
observed pseudo-symmetry in protein quaternary structures is under-explored
in our field, and we believe that the prediction of (non-transient) quaternary
structure besides tertiary structure would be a highly relevant addition to future
CASPs. Interesting quaternary structure cases in the targets we considered were
T0132 and T0136. Again, further developing efforts in this
direction could benefit both tertiary and quaternary structure
prediction.
(* Please note that this abstract was written before we had the chance to see the
experimental structures in order to comply with the deadlines.)
1. Benner S.A., Cannarozzi G.M., Gerloff D., Turcotte M. and
   Chelvanayagam G. (1997) Bona fide predictions of protein secondary
   structure using transparent analyses of multiple sequence alignments.
   Chem. Reviews 97, 2725-2843.
Ho-Kai-Ming (P0437) - 129 predictions: 129 3D
Three Dimensional Threading Approach to Protein Structure
Recognition
Kai-Ming Ho, Haibo Cao, Yungok Ihm, Zhong Gao, Cai-Zhuang
Wang and Drena Dobbs
Iowa State University
kmh@ameslab.gov
See methods section
HOGUE-SLRI (P0267) - 254 predictions: 254 3D
virtual Ca angles and three virtual Ca dihedrals. Three atoms from each side of
the gap were placed in space, according to the takeoff angles. Alpha carbons
required to fill the gap were given arbitrary starting co-ordinates within the gap
region, and a steepest descent energy minimization consisting of virtual Ca
bond length restraints, virtual Ca angles restraints, and a van der Waals term
was carried out. The three anchoring atoms on either side of the gap were held
fixed during the minimization. Finally, the resulting loop was incorporated as a
fragment using its own Ca trace.
Semi-Automated Homology Modeling of 38 CASP5 Targets
Using a Modified TRADES Algorithm
M. Dumontier12, H. J.Feldman12 and C. W.V. Hogue12
1
Samuel Lunenfeld Research Institute, 600 University Ave. Toronto, Ontario,
Canada M5G 1X5, 2 Department of Biochemistry, University of Toronto
micheld@mshri.on.ca
Homology modeling is a powerful method for predicting the three dimensional
structure of biological macromolecules from their primary sequence given even
weak sequence similarity to a biomolecule with an experimentally determined
structure. High-quality models can provide important information regarding
the function and mechanism of a biomolecule and could be used for
rationalizing experimental data or guiding the design of new experiments.
Here, we present a modified version of the TRADES algorithm1, used in the
blind prediction of 38 protein structures from sequence for the Critical
Assessment of techniques for protein Structure Prediction (CASP) competition.
Template protein structures for homology modeling of CASP targets were
identified using BLAST against the protein structure database (PDB) and the
conserved domain database (CDD). Templates with significant sequence
similarity across the longest segment with the fewest indels and closest
functional annotation were favorably considered. In the case of multi-domain
proteins, the best hit for each domain was used as template. Where possible,
alignments were modified to ensure that indels fell on loop regions rather than
across elements of secondary structure.
Next, a new target trajectory distribution was built from the template backbone
Cα trajectory using a modification of the TRADES algorithm. A slightly
flexible single fragment from the recorded trace replaced each structurally
conserved (gapless) region of alignment. Gap-spanning fragments for variable
regions were created from 'takeoff angles' starting from one residue prior to the
gap and ending one residue following the gap. These fragments consisted of
six degrees of freedom - the distance between the start and end of the gap, two
Roughly 1000 structures were generated using the fragments obtained from the
previous steps and our Foldtraj software, with bump checking disabled. Using a
modified version of a statistical residue-based potential [2], which we have termed
'crease energy', the best five structures were chosen. These were then refined
with a steepest-descent minimization using the CHARMM EEF1 force field to
resolve steric clashes but without significantly changing the structure (typically
1 Å RMSD between the refined and unrefined structures).
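As an illustration of the refinement check described above, the following minimal Python sketch computes the coordinate RMSD between two conformations of the same atoms. It assumes the two structures are already superimposed and list their atoms in the same order; the function name `rmsd` and the toy coordinates are illustrative only and are not part of the TRADES or Foldtraj software.

```python
import math

def rmsd(coords_a, coords_b):
    """Root-mean-square deviation between two equally ordered coordinate lists."""
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("coordinate lists must be non-empty and of equal length")
    total = 0.0
    for (xa, ya, za), (xb, yb, zb) in zip(coords_a, coords_b):
        total += (xa - xb) ** 2 + (ya - yb) ** 2 + (za - zb) ** 2
    return math.sqrt(total / len(coords_a))

# Toy example: every atom is displaced by exactly 1 Angstrom
unrefined = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0)]
refined = [(0.0, 1.0, 0.0), (3.8, 0.0, 1.0), (7.6, -1.0, 0.0)]
print(round(rmsd(unrefined, refined), 3))
```

In practice an optimal superposition would normally be applied before the deviation is measured; that step is omitted here for brevity.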
The modified TRADES algorithm generates realistic, all-atom protein structure
homology models of non-idealized geometry as it incorporates side chains from
a backbone dependent rotamer library and produces reasonable bond lengths,
bond angles, torsion angles, as well as minimized electrostatics and van der
Waals forces. Moreover, this method models loops for insertions and deletions
and compensates for missing template atoms.
1. Feldman H.J. and Hogue C.W.V. (2000). A fast method to sample real
protein conformational space. Proteins. 39 (2), 112-31.
2. Bryant S.H. and Lawrence C.E. (1993) An empirical energy function for
threading protein sequence through the folding motif. Proteins. 16 (1), 92-112.
Huber-Torda (P0351) - 83 predictions: 83 3D
Fold Recognition and Sequence to Structure Alignment:
Brobdingnagian Approximations and Lilliputian Success
T. Huber1, B. J.B. Procter2 and A.E. Torda2
1 - Department of Mathematics, The University of Queensland, Australia,
2 - Zentrum fuer Bioinformatik, University of Hamburg, Germany
huber@maths.uq.edu.au, procter@zbh.uni-hamburg.de,
torda@zbh.uni-hamburg.de
Protein fold recognition can be regarded as a search for the most compatible
known structure with a sequence of interest. Common compatibility measures
may be based on “knowledge-based” force fields, sequence profiles or hidden
Markov models, but this usually implies simple interaction models and
parameters reverse-engineered from collections of known structures.
Unfortunately, finding good models and parameters to describe protein
properties at low resolution often relies on weak assumptions and gross
approximations.
We applied data mining methods, without prior assumptions, to extract the
information from a large set of proteins down to a most parsimonious set of
protein fragments. We began with a large number of protein fragments in a
high dimensional space which was then modeled as a mixture of conditionally
independent classes. The result is a collection of sequence and structure
probability distributions. These could then be used as a score function for
sequence to structure alignments.
Gap penalties were adjusted by a numerical optimization process using a
penalty function which measured the structural quality of models. Finally,
structures were ranked after including a contribution from a z-score optimized,
low resolution quasi-energy function.
Models were built for all sequences ranging from the highly homologous to the
totally exotic and the results are assessed in terms of model and fold
recognition quality.
jive (P0506) - 37 predictions: 37 3D
JIVE: Protein structure prediction by the assembly of local
supersecondary structural motifs
David F. Burke and Tom L. Blundell
Department of Biochemistry, University of Cambridge, 80 Tennis Court Road,
Cambridge, CB2 1GA, United Kingdom
dave@cryst.bioc.cam.ac.uk
In the CASP 5 experiment, models of proteins which had low confidence
values across the CAFASP3 servers were selected to be modelled by JIVE.
JIVE predicts the structure of small continuous domains of proteins by the
assembly of fragments of local supersecondary motifs. Initially, homologous
sequences were identified using PSI-BLAST[1] and secondary structure
prediction was performed using PHD[2]. The conformational classes of the
supersecondary fragments were predicted using SLOOP[3-5]. SLOOP uses
sequence/structure profiles derived from a database of loops clustered on the
conformation of the loop and surrounding secondary structures. These
fragments were then assembled using a Monte Carlo simulation. Unsuitable
models were rejected based on excluded volume and a distance-dependent
conditional probability function [6].
The generated structures were then searched against protein structures from
both the HOMSTRAD[7] database of homologous families and the
CAMPASS[8] database using the program SEA[9]. Potential hits were then
analysed further for validation. In all, 17 targets were submitted.
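The excluded-volume rejection step mentioned in the JIVE abstract can be illustrated with a small Python sketch. It assumes a model is reduced to a list of Cα coordinates and that any two residues at least three positions apart in sequence must stay at least 3.0 Å apart; the threshold, the separation and the function name `violates_excluded_volume` are assumptions chosen for illustration, not values taken from JIVE.

```python
import math

def violates_excluded_volume(ca_coords, min_dist=3.0, min_separation=3):
    """Return True if any two non-neighbouring C-alpha atoms are closer than min_dist."""
    n = len(ca_coords)
    for i in range(n):
        for j in range(i + min_separation, n):
            if math.dist(ca_coords[i], ca_coords[j]) < min_dist:
                return True
    return False

# Toy models: a straight chain, and the same chain with its last residue
# folded back onto the first one
extended = [(3.8 * i, 0.0, 0.0) for i in range(6)]
clashing = extended[:5] + [(0.5, 0.5, 0.0)]
models = [extended, clashing]
accepted = [m for m in models if not violates_excluded_volume(m)]
print(len(accepted))
```

A filter of this kind is cheap, so it would typically be applied before any more expensive scoring function.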
1. Altschul S.F. et al. (1997) Gapped BLAST and PSI-BLAST: a new
generation of protein database search programs. Nucleic Acids Res.
25(17):3389-402.
2. Rost B., et al. (1994) PHD-an automatic mail server for protein secondary
structure prediction. Comput Appl Biosci. 10(1):53-60
3. Donate L.E., et al. (1996) Conformational analysis and clustering of short
and medium size loops connecting regular secondary structures: a database
for modeling and prediction. Protein Sci. 5(12):2600-16
4. Rufino S.D. et al. (1997) Predicting the conformational class of short and
medium size loops connecting regular secondary structures: application to
comparative modelling. J Mol Biol. 267(2):352-67.
5. Burke D.F. et al. (2001) Improved Loop prediction from sequence alone.
Protein Engineering 14 (7) 473-478
6. Samudrala R. et al. (1998) An all-atom distance-dependent conditional
probability discriminatory function for protein structure prediction. J Mol
Biol. 275(5):895-916
7. Mizuguchi K., et al. (1998) HOMSTRAD: a database of protein structure
alignments for homologous families. Protein Science 7 2469-2471.
8. Sowdhamini R., et al. (1996) A database of globular protein structural
domains: clustering of representative family members into similar folds.
Fold Des 1 (3):209-20
9. Rufino S.D. et al. (1994) Structure-based identification and clustering of
protein families and superfamilies. J Comput Aided Mol Des 8(1):5-27
Jones (P0067) - 121 predictions: 68 3D, 53 SS
Assessing the Reliability of Transmembrane Protein Topology
Assignments by Homology
M. Pellegrini-Calace and D.T. Jones
Bioinformatics Unit, Department of Computer Science
University College London, Gower St, WC1E 6BT, London (UK)
m.pellegrini-calace@cs.ucl.ac.uk
Membrane proteins make up a wide and important class of biological
macromolecules and are interesting targets for medicinal chemistry. Moreover,
helical membrane proteins represent 20%-25% of the proteins in a
typical genome, and the key role they play in cells makes it crucial to improve
the detection of homologous membrane proteins as a quick way to
understand their functional features. [1]
PSI-BLAST (Position-Specific Iterated BLAST) is one of the most popular and
powerful homology search programs currently available and has been shown to
be more effective than most other methods in the detection of distantly related
globular proteins. However, because unrelated transmembrane (TM) segments
are more similar to each other than unrelated globular regions, PSI-BLAST has
not shown a comparable effectiveness when applied to membrane proteins. [1,2]
To minimize the number of false hits obtained after membrane protein
homology searches by PSI-BLAST, we performed a systematic optimization of
the E-value to calculate a restrictive cut-off value for the inclusion of proteins
in the iterative BLAST search underlying the PSI-BLAST method.
Calculations were performed on a data set built by comparing three membrane
protein databases: the MPtopo database (92 α-helical proteins) [3]; the TMPDB
database (189 α-helical proteins) [4]; and the membrane protein database from
Moeller et al. available at the EBI ftp site (148 non-redundant α-helical
membrane proteins) [5]. The databases were compared by default BLAST
searches, at an E-value cut-off of 10^-3. Entries from each of the three databases
were used as query sequences for 2 different runs, in which the 2 remaining
databases were searched (the total number of BLAST searches was therefore
6). Topologies of TM helices (TMHs) from the homologues found were
analyzed and compared with the topologies of TMHs from the query sequences.
Only 48 TMPDB entries had at least one homologue with agreeing
TMH topology in both of the other two databases, probably
because of the small size of the MPtopo database. Therefore, entries from
TMPDB showing at least one homologue with agreeing TMH topology either
in the MPtopo or in the EBI database were chosen as the benchmark data set (149
proteins, 94 from prokaryotes, 48 from eukaryotes and 7 from viruses).
Two BLAST calculations with default parameters were performed on the data
set, scanning the non-redundant sequence database, which also contained the 149
sequences, in both filtered and unfiltered forms. The number of true hits (i.e.
hits among the 149 sequences having agreeing topology and percentage of
identity higher than 30%) was calculated at each E-value and the E-value with
the lowest percentage of error was chosen as the restrictive cut-off value for
PSI-BLAST calculations.
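The cut-off selection described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions: the error at a cut-off is taken to be the fraction of accepted hits that are false, ties are broken in favour of the most permissive cut-off, and the function name `best_evalue_cutoff` and the hit list are invented toy values rather than benchmark data from this work.

```python
def best_evalue_cutoff(hits, cutoffs):
    """Pick the cutoff whose accepted hits contain the smallest fraction of false hits.

    hits    : list of (e_value, is_true_hit) pairs
    cutoffs : candidate E-value thresholds
    Ties are broken in favour of the most permissive (largest) cutoff.
    """
    best = None
    for cutoff in sorted(cutoffs):
        accepted = [is_true for e_value, is_true in hits if e_value <= cutoff]
        if not accepted:
            continue  # nothing passes this cutoff, so no error rate is defined
        error = accepted.count(False) / len(accepted)
        if best is None or error <= best[1]:
            best = (cutoff, error)
    return best

# Invented toy hits: (E-value, agrees with the query topology and identity > 30%)
hits = [(1e-40, True), (1e-20, True), (1e-15, True),
        (1e-10, False), (1e-5, True), (1e-3, False)]
cutoffs = [1e-30, 1e-14, 1e-8, 1e-3]
print(best_evalue_cutoff(hits, cutoffs))
```

Breaking ties toward the larger cut-off keeps as many true homologues as possible at a given error level; a stricter policy would be equally easy to encode.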
Finally, it was shown that the most reliable PSI-BLAST searches for membrane
protein query sequences are obtained by setting the number of iterations to 2 and
the cut-off E-value for inclusion in the profile to 10^-14.
1. Hedman M. et al. (2002) Improved detection of homologous membrane
proteins by inclusion of information from topology predictions. Protein
Sci. 11(3), 652-658.
2. Altschul S.F. et al. (1997) Gapped BLAST and PSI-BLAST: a new
generation of protein database search programs. Nucleic Acids Res. 25(17),
3389-3402.
3. Jayasinghe S. et al. (2001). MPtopo: A database of membrane protein
topology. Protein Sci. 10, 455-458.
4. Ikeda M. et al. (2002). Transmembrane topology prediction methods: a
reassessment and improvement by a consensus method using a dataset of
experimentally-characterized transmembrane topologies. In Silico Biol., 2,
19-33.
5. Moeller S. et al. (2000) A collection of well characterised integral
membrane proteins. Bioinformatics, 16, 1159-1160.
Jones (P0067) - 121 predictions: 68 3D, 53 SS
A Distributed Pipeline for Structure-based Proteome
Annotation Using Grid Technology
L. McGuffin1, S. Sorensen1, C. Orengo2, D. Jones1, J. Cuff3,
E. Birney4, A. Robinson4, J. Thornton4, K. Fleming5, A. Mueller5,
L. Kelley5, S. Newhouse6, J. Darlington6, M. Sternberg5
1 - Department of Computer Science, University College London,
2 - Department of Biochemistry, University College London,
3 - Sanger Centre, Cambridge,
4 - European Bioinformatics Institute, Cambridge,
5 - Department of Biological Sciences, Imperial College, London,
6 - Department of Computer Science, Imperial College, London
l.mcguffin@cs.ucl.ac.uk
In order to benefit from the wealth of information contained in recently
sequenced genomes it is essential that we have structure based annotation of the
proteins in terms of their 3-D conformations and their functions. This project
aims to provide a structure-based annotation of the proteins encoded by the
major genomes by linking resources at University College London (UCL),
Imperial College London (IC) and the European Bioinformatics Institute (EBI)
in Cambridge using Grid technology.
The objectives are: i - to establish local databases with structural and functional
annotations, ii - to disseminate to the biological community our proteome
annotation via a single web-based distributed annotation [1], iii - to share
computing power transparently between sites using GLOBUS, iv - to use the
developed system for comparison of alternative approaches for annotation and
thereby identify methodological improvements, v - to establish a pre-prototype
at 6 months for demonstration purposes, vi - to provide a working system after
two years, and vii - to link to relevant bioinformatics and Grid resources that
will be integrated into this project.
At IC, the approach will be to use PSI-BLAST [2] to detect homologies
between the proteome and other sequences and protein structures (characterised
domains in SCOP [3]). This is followed by fold recognition using 3D-PSSM
[4] to recognise remote homologies missed by PSI-BLAST. At UCL,
GenTHREADER [5] will directly analyse the sequences to detect both obvious
and remote homologies. In addition, sequence motifs encoding the CATH
structural domains [6] will be scanned against the proteomes using IMPALA
[7], Hidden Markov Models [8] and Gene-3D [9]. To maintain links with the
widely-used sequence-based annotation methods, the proteomes will also be
scanned against INTERPRO.
The results of the above analyses will identify those regions (i.e. domains) of
the proteomes for which there is a functional annotation [10]. For regions with
sequence identity to known structures of >30%, three-dimensional models will
be constructed using 3D-JIGSAW [11] and the co-ordinate construction
package being developed within GenTHREADER.
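The >30% identity rule for model construction can be illustrated with a short Python sketch. It assumes identity is measured over alignment columns in which both sequences have a residue, which is only one of several common conventions; the function names and the toy alignment are hypothetical and are not part of 3D-JIGSAW or GenTHREADER.

```python
def percent_identity(aligned_a, aligned_b):
    """Percentage of identical residues over columns where both sequences have a residue."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    compared = identical = 0
    for a, b in zip(aligned_a, aligned_b):
        if a == "-" or b == "-":
            continue  # skip gap columns
        compared += 1
        if a == b:
            identical += 1
    return 100.0 * identical / compared if compared else 0.0

def should_build_model(aligned_query, aligned_template, threshold=30.0):
    """Apply the >30% identity rule stated in the text."""
    return percent_identity(aligned_query, aligned_template) > threshold

# Toy pairwise alignment of a proteome region against a known structure
print(should_build_model("MKT-AYIAKQR", "MKSGAYLAEQR"))
```

Other denominators (alignment length, or the length of the shorter sequence) would change borderline decisions, so the convention should be fixed before the rule is applied across a proteome.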
We intend to disseminate the results of this project to both the biological
community interested in proteome annotation and the scientific community
interested in Grid technology. Within the organisation of our project, we will
explore including other groups either with relevant biological databases or with
appropriate developed Grid technology. Industrial connections will be made
through the EBI's Industry Programme which will allow the work to be
presented to bioinformatics representatives from the pharmaceutical and
biotech industry.
1. Hubbard T. & Birney E. (2000) Open annotation offers a democratic
solution to genome sequencing. Nature 403, 825
2. Altschul S.F. et al. (1997) Gapped BLAST and PSI-BLAST: a new
generation of protein database search programs. Nucleic Acids Res. 25
(17), 3389-3402
3. Conte L.L., et al. (2000) SCOP: a structural classification of proteins
database. Nucleic Acids Res. 28, 257-259
4. Kelley L.A., et al. (2000) Enhanced genome annotation using structural
profiles in the program 3D-PSSM. J. Mol. Biol. 299, 501-522
5. Jones D.T. (1999) An efficient and reliable protein fold recognition method
for genomic sequences. J. Mol. Biol. 287, 797-815
6. Orengo C.A., et al. (1997) CATH- a hierarchic classification of protein
sequences. Structure, 5, 1093-1108
7. Schaffer A.A., et al. (1999) IMPALA: matching a protein sequence against
a collection of PSI-BLAST-constructed position-specific score matrices.
Bioinformatics 15, 1000-11
8. Hughey R. & Krogh A. (1996) Hidden Markov models for sequence
analysis. CABIOS 12, 95-107
9. Buchan D.W.A., et al. (2002) Gene3D: Structural assignment for whole
genes and genomes using the CATH domain structure database. Genome
Res. 12, 503-514
10. Thornton J.M., et al. (2000) From structure to function: Approaches and
limitations. Nature Struct. Biol. Suppl: 991-994
KIAS (P0531) - 479 predictions: 176 3D, 303 SS
Prediction of Protein Tertiary Structure using PROFESY,
a Novel Method based on Pattern Matching and Fragment
Assembly
Julian Lee, Seung-Yeon Kim, Keehyung Joo,
Ilsoo Kim, Saejoon Kim, and Jooyoung Lee
School of Computational Sciences, Korea Institute for Advanced Study
jlee@kias.re.kr
We introduce a novel method for the tertiary structure prediction, PROFESY
(PROFile Enumerating SYstem). This method utilizes secondary structure
prediction information and fragment assembly. The secondary structure
prediction is first performed using the method PREDICT (PRofile Enumeration
DICTionary) recently developed by our group, which uses a concept of
distance between patterns. For a given protein sequence, this method uses
PSI-BLAST to generate profiles, which define patterns for amino acid residues.
Each pattern is compared with those in the pattern database generated from
PDB, and the patterns close to the query pattern are selected to determine the
secondary structure of the query residue. In order to construct the tertiary
structure, we also collect the backbone dihedral angles along with these
patterns. These constitute a library of the fragments for a given protein
sequence.
By construction, the secondary structure of the tertiary structure obtained from
PROFESY agrees with that predicted by PREDICT. In order to obtain
the optimal tertiary packing of these secondary structure elements, we define a
score function based on the number of long-range hydrogen bonds, burial of
hydrophobic residues and exposure of hydrophilic residues, the radius of
gyration, and the inter-residue Lennard-Jones interactions to avoid steric
clashes. Fragments are replaced by ones from the library so
that the score function is minimized. The score function minimization is
performed by a powerful global optimization method, conformational space
annealing (CSA) [1]. This method enables one to sample diverse low-lying
minima of the score function.
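Two of the score terms named above, the radius of gyration and the Lennard-Jones interaction, can be written down in a few lines of Python. This is a sketch under stated assumptions: residues are treated as equal-mass points, a plain 12-6 form with illustrative `epsilon` and `sigma` values is used, the hydrogen-bond and burial terms are omitted, and the weights and function names are hypothetical rather than those of PROFESY.

```python
import math

def radius_of_gyration(coords):
    """Root-mean-square distance of points from their centroid (equal masses assumed)."""
    n = len(coords)
    centroid = [sum(c[k] for c in coords) / n for k in range(3)]
    sq = sum(sum((c[k] - centroid[k]) ** 2 for k in range(3)) for c in coords)
    return math.sqrt(sq / n)

def lennard_jones(r, epsilon=1.0, sigma=4.0):
    """12-6 Lennard-Jones energy for one pair at distance r."""
    ratio6 = (sigma / r) ** 6
    return 4.0 * epsilon * (ratio6 * ratio6 - ratio6)

def toy_score(coords, w_rg=1.0, w_lj=1.0):
    """Weighted sum of a compactness term and a steric term; lower is better."""
    lj = 0.0
    for i in range(len(coords)):
        for j in range(i + 2, len(coords)):  # skip directly bonded neighbours
            lj += lennard_jones(math.dist(coords[i], coords[j]))
    return w_rg * radius_of_gyration(coords) + w_lj * lj

# Four points on the corners of a square with side 2
square = [(1.0, 1.0, 0.0), (1.0, -1.0, 0.0), (-1.0, -1.0, 0.0), (-1.0, 1.0, 0.0)]
print(radius_of_gyration(square))
```

The repulsive part of the Lennard-Jones term penalizes overlapping residues, while the radius of gyration rewards compact packing, so the two terms pull in opposite directions and their relative weights matter.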
1. Lee J. et al. (1997) New optimization method for conformational energy
calculations on polypeptides: Conformational Space Annealing. J. Comp.
Chem. 18 (9), 1222-1232
2. Lee J. et al. (1998) Conformational analysis of the 20-residue
membrane-bound portion of Melittin by Conformational Space Annealing.
Biopolymers. 46, 103-115
3. Lee J. et al. (1999) Conformational Space Annealing by parallel
computations: extensive conformational search of Met-enkephalin and the
20-residue membrane-bound portion of Melittin. Int. J. Quant. Chem. 75,
255-265
4. Lee J. et al. (1999) Energy-based de novo protein folding by
conformational space annealing and an off-lattice united-residue force
field: Application to the 10-55 fragment of staphylococcal protein A and to
apo calbindin D9K. Proc. Natl. Acad. Sci. USA 96, 2025-2030
5. Liwo A. et al. (1999) Protein structure prediction by global optimization of
a potential energy function. Proc. Natl. Acad. Sci. USA 96, 5482-5485
6. Lee J. et al. (1999) Calculation of protein conformation by global
optimization of a potential energy function. PROTEINS: Structure,
Function, and Genetics 3:204-208
7. Lee J. et al. (2000) Hierarchical energy-based approach to protein-structure
prediction: Blind-test evaluation with CASP3 targets. Int. J. Quant.
Chem. 77, 90-117
KIAS (P0531) - 479 predictions: 176 3D, 303 SS
Prediction of Protein Secondary Structure Using PREDICT, a
Novel Method Based on Pattern Matching
Keehyung Joo1, Ilsoo Kim1, Julian Lee1,
Seung-Yeon Kim1, Sung Jong Lee1,2, and Jooyoung Lee1
1 School of Computational Sciences, Korea Institute for Advanced Study
2 Department of Physics, Suwon University
jlee@kias.re.kr
We introduce a novel method for secondary structure prediction, PREDICT
(PRofile Enumeration DICTionary). This method uses a concept of distance
between patterns. For a given protein sequence, this method uses PSI-BLAST
(Position Specific Iterative Basic Local Alignment Search Tool) to generate
profiles, which define patterns for amino acid residues. Each pattern is
compared with those in the pattern database generated from PDB (Protein Data
Bank), and the patterns close to the query pattern are selected to determine the
secondary structure of the query residue. This method combines the idea of the
nearest-neighbor method of Yi and Lander [1] with the profile generating
technology of PSI-BLAST [2]. We tested the method on the set of 513
non-homologous proteins CB513, and applied it to the CASP5 targets as a blind test.
In preliminary results, the Q3 value of the secondary structure prediction on the
CB513 set, using a 7777-protein set as the database, is about 80%.
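The nearest-neighbour idea and the Q3 measure can be sketched in Python as follows. The sketch assumes each pattern is a flattened numeric profile window compared by Euclidean distance and that the state is decided by a majority vote among the k closest database patterns; the abstract does not specify PREDICT's actual distance measure, so these choices, the function names and the tiny two-number patterns are illustrative assumptions.

```python
from collections import Counter

def pattern_distance(p, q):
    """Euclidean distance between two flattened profile windows."""
    return sum((a - b) ** 2 for a, b in zip(p, q)) ** 0.5

def predict_state(query, database, k=3):
    """Majority vote among the k database patterns closest to the query pattern."""
    nearest = sorted(database, key=lambda entry: pattern_distance(query, entry[0]))[:k]
    votes = Counter(state for _, state in nearest)
    return votes.most_common(1)[0][0]

def q3(predicted, observed):
    """Percentage of residues whose three-state label (H, E or C) is predicted correctly."""
    matches = sum(p == o for p, o in zip(predicted, observed))
    return 100.0 * matches / len(observed)

# Toy pattern database: (pattern, observed secondary structure state)
database = [([0.9, 0.1], "H"), ([0.8, 0.2], "H"), ([0.1, 0.9], "E"),
            ([0.2, 0.8], "E"), ([0.5, 0.5], "C"), ([0.45, 0.55], "C")]
queries = [[0.85, 0.15], [0.15, 0.85], [0.5, 0.45]]
predicted = "".join(predict_state(q, database) for q in queries)
print(predicted, q3(predicted, "HEC"))
```

Real profile windows span many residues and twenty amino acid columns each, so an efficient neighbour search becomes the main practical concern at database scale.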
1. Yi T. et al. (1993) Protein Secondary Structure Prediction using
Nearest-Neighbor Methods, J. Mol. Biol. 232, 1117-1129
2. Altschul S.F. et al. (1997) Gapped BLAST and PSI-BLAST: a new
generation of protein database search programs. Nucleic Acids Res. 25
(17), 3389-3402.
LAMBERT-Christophe (P0035) - 131 predictions: 131 3D
Improving Target-Template Alignment Using Neural
Networks
C. Lambert, E. Depiereux
Unité de Recherche en Biologie Moléculaire, Facultés Universitaires Notre-Dame de la Paix, rue de Bruxelles 61, 5000 Namur, Belgium
christophe.lambert@fundp.ac.be
ESyPred3D web site: http://www.fundp.ac.be/urbm/bioinfo/esypred
The aim of our work is to propose a reliable automatic method for homology
modeling (ESyPred3D [1]), especially when the protein of interest shares a low
percentage of identities (20-30%) with the chosen template.
Our strategy consists of the usual steps for homology modeling: search for the
template in databanks, target-template alignment and modeling. Currently, our
method does not provide any assessment of the model.
To search for a template, we used four iterations of PSI-BLAST [2] on the
non-redundant protein database (nr) of the NCBI. All
sequences having an expected value lower than 0.001 are included in the profile
building. The template is chosen as the sequence of known structure (PDB) that
has the lowest expected value. The search in the nr databank also gives us a
large number of similar sequences.
As far as possible, two sets of sequences are built. The first one contains the 50
best hits below the expected value cutoff of 0.001. The second one contains a
subset of these sequences, after dropping overly redundant ones. This method aims
at creating different conditions to run multiple alignment programs and
extracting different consensuses in order to raise the confidence of the
sequence-structure alignment.
The two sets are then submitted to five alignment programs: ClustalW [3],
Dialign2 [4], Match-Box [5], Multalin [6] and T-Coffee [7]. A pairwise alignment
between the target and template sequences is extracted from each multiple
alignment. All the pairwise alignments, including the one provided by
PSI-BLAST, are used to generate a database of aligned positions (boxes). A neural
network is used to assign a score to each box. The most confident boxes are taken
as anchor points for the building of the final sequence-structure alignment. A
three-dimensional model is built using MODELLER [8] on this final alignment.
We tested the performance of our alignment step by evaluating several
alignment programs and comparing them to our alignment
procedure. Results show an improvement of the multiple alignment quality of
30% on average, especially for the cases where target and template sequences
share a low rate of identities.
1. Lambert C. et al. (2002) ESyPred3D: Prediction of proteins 3D structures.
Bioinformatics. 18 (9), 1250-1256
2. Altschul S.F. et al. (1997) Gapped BLAST and PSI-BLAST: a new
generation of protein database search programs. Nucleic Acids Res. 25
(17), 3389-3402
3. Thompson J.D. et al. (1994) CLUSTAL W: improving the sensitivity of
progressive multiple sequence alignment through sequence weighting,
position-specific gap penalties and weight matrix choice, Nucleic Acids
Res. 22, 4673-4680
4. Morgenstern B. et al. (1998) DIALIGN: Finding local similarities by
multiple sequence alignment. Bioinformatics 14, 290-294
5. Depiereux E. et al. (1997) Match-Box server: a multiple sequence
alignment tool placing emphasis on reliability. Comput. Appl. Biosci. 13,
249-256
6. Corpet F. (1988) Multiple sequence alignment with hierarchical clustering.
Nucleic Acids Res. 16, 10881-10890
7. Notredame C. et al. (2000) T-Coffee: A novel method for fast and accurate
multiple sequence alignment. J. Mol. Biol. 302(1), 205-217
8. Sali A. et al. (1993) Comparative protein modelling by satisfaction of
spatial restraints. J. Mol. Biol. 234(3), 779-815.
Lomize-Andrei (P0288) - 76 predictions: 76 3D
New Energy Functions For Protein Modeling Derived From
ΔΔG Values
A.L. Lomize, M.Y. Reibarkh, and I.D. Pogozheva
College of Pharmacy, University of Michigan, Ann Arbor, MI
almz@umich.edu
Efficient methods for protein structure prediction require energy optimization.
An especially important goal here is the correct evaluation of free energy
differences, not enthalpy in vacuum that is usually calculated with molecular
mechanics potentials. The required energy functions must take into account
conformational entropy, solvation free energy, and dependence of interatomic
interactions on the environment. They must also be tested against experimental
thermodynamic stabilities of proteins or protein-ligand complexes. We have
determined van der Waals (vdW) interaction energies between different atom
types, energies of hydrogen bonds, and atomic solvation parameters from the
published free-energy differences for 106 mutants with replacements of buried
uncharged residues and available crystal structures [1]. The obtained energies
of interatomic interactions differ from those in molecular mechanics in
three important aspects: (1) they describe interactions in the protein interior
rather than in vacuum; (2) they are generally weaker and follow the "like dissolves
like" rule; (3) they are related to enthalpy of melting rather than to enthalpy of
sublimation. The developed potentials can be applied for side-chain packing,
fold recognition, computational de novo design, estimation of ligand-binding
constants, and modeling of nonregular loops.
1. Lomize A.L., Reibarkh M.Y., and Pogozheva I.D. (2002). Interatomic
potentials and solvation parameters from protein engineering data for
buried residues. Protein Sci. 11 (8), 1984-2000.
Lund-Ole (P0391) - 39 predictions: 39 3D
X3M – a Computer Program to Extract 3D Models
O. Lund, M. Nielsen, C. Lundegaard and P. Worning.
Center for Biological Sequence Analysis, Biocentrum-DTU, Building 208,
Technical University of Denmark, DK-2800 Lyngby, Denmark
lund@cbs.dtu.dk
See methods section
Levitt (P0016) - 350 predictions: 350 3D
Ab Initio Structure Prediction of Target Proteins in CASP5
T.M. Raschke*, C.M. Summa*, R. Kolodny and M. Levitt
Stanford University, Department of Structural Biology,
Fairchild Building, Room D-109, Stanford, CA 94305
michael.levitt@stanford.edu
We applied the following ab initio prediction method to target proteins that
received low scores from the CAFASP3 comparative modeling servers.
Models were generated by assembling regularized backbone segments of length
9 (derived from a 2000-protein library) using Monte Carlo swap moves, as per
Jones’ method used in CASP2[1] and Baker’s method in CASP3 and CASP4
[2-3]. The energy function used for annealing consisted of terms representing
cooperative hydrogen bonds (as done by Keasar & Levitt in CASP4),
residue-based hydrophobic burial propensity, and residue-based hydrophobic pair
interactions. After 50,000 steps of annealing with the segment replacement
method, the models were annealed with “refinement moves,” consisting of
small 2° rotations of the backbone torsion angles, for 10,000 steps. This
process was used to model the native sequence and homologous sequences
(where appropriate) using the predicted secondary structures from several
automated servers [4-6]. For some targets, the most likely emitted sequence
from a Hidden Markov Model built from the target sequence family was also
used [7]. A set of 1000 decoys was generated for each sequence/secondary
structure combination, and all models were combined into one large dataset for
selection. This dataset was pruned to 3000 members using a “colony energy”
score [8] that combined several energy functions (atom cluster energy,
electrostatic energy, RAPDF [9], and the energy from the decoy generation
procedure) with a measure of the structural similarities between the decoys.
The 3,000 best models were clustered with a hierarchical clustering method
using a Floyd distance metric (distance along the graph) [10]. Decoys in the
top 5 clusters were evaluated by manual inspection, and typically one decoy
from each of the top 5 clusters was submitted.
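The "distance along the graph" used for clustering can be illustrated with a short Python sketch based on the Floyd-Warshall all-pairs shortest-path algorithm. It assumes that each decoy is linked to its k nearest neighbours under some pairwise distance (for example RMSD) and that the graph is made symmetric; the neighbour rule, the function name `graph_distances` and the toy matrix are assumptions for illustration, not details given in the abstract.

```python
import math

def graph_distances(dist, k=2):
    """All-pairs shortest-path ("distance along the graph") via Floyd-Warshall.

    dist : symmetric matrix of pairwise decoy distances (e.g. RMSD)
    k    : each decoy is linked to its k nearest neighbours (an assumed graph rule)
    """
    n = len(dist)
    graph = [[math.inf] * n for _ in range(n)]
    for i in range(n):
        graph[i][i] = 0.0
        neighbours = sorted((j for j in range(n) if j != i), key=lambda j: dist[i][j])[:k]
        for j in neighbours:
            graph[i][j] = graph[j][i] = dist[i][j]  # keep the graph symmetric
    for m in range(n):
        for i in range(n):
            for j in range(n):
                if graph[i][m] + graph[m][j] < graph[i][j]:
                    graph[i][j] = graph[i][m] + graph[m][j]
    return graph

# Toy matrix for four decoys lying along a curved path in conformation space
direct = [[0.0, 1.0, 2.0, 2.5],
          [1.0, 0.0, 1.0, 2.0],
          [2.0, 1.0, 0.0, 1.0],
          [2.5, 2.0, 1.0, 0.0]]
geodesic = graph_distances(direct, k=1)
print(direct[0][3], geodesic[0][3])
```

The graph distance between the two end decoys is larger than their direct distance because it follows the chain of neighbours, which is the property that makes it useful for separating clusters connected only through intermediate structures. The cubic cost of Floyd-Warshall is acceptable for a few thousand decoys but would not scale much further.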
* These authors contributed equally to this work.
1. Jones D.T. (1997) Successful ab initio prediction of the tertiary structure of
NK-lysin using multiple sequences and recognized supersecondary
structural motifs. Proteins: Struct. Funct. Genet. S1, 185-191.
2. Simons K.T., Bonneau R., Ruczinski I. and Baker D. (1999) Ab initio
protein structure prediction of CASP III targets using ROSETTA. Proteins:
Struct. Funct. Genet. S3, 171-176.
3. Bonneau R., et. al. (2001) Rosetta in CASP4: Progress in ab initio protein
structure prediction. Proteins: Struct. Funct. Genet. S5, 119-126.
4. PHD, http://www.embl-heidelberg.de/predictprotein/predictprotein.html
5. PSIPRED, http://bioinf.cs.ucl.ac.uk/psipred/
6. SAM-T02-STRIDE,
http://www.cse.ucsc.edu/research/compbio/HMMapps/T02-query.html
7. Gough J. and Madera M. (2002) The next generation of structural genome
analysis. CASP5 Abstract.
8. Xiang Z.X., Soto C.S. and Honig B. (2002) Evaluating conformational free
energies: The colony energy and its application to the problem of loop
prediction. Proc. Natl. Acad. Sci. USA 99 (11), 7432-7437.
9. Samudrala R. and Moult J. (1998) An all-atom distance-dependent
conditional probability discriminatory function for protein structure
prediction. J. Mol. Biol. 275 (5), 895-916
10. Tenenbaum J.B., de Silva V. and Langford J.C. (2000) A global geometric
framework for nonlinear dimensionality reduction. Science. 290 (5500),
2319.
Levitt (P0016) - 350 predictions: 350 3D
Comparative Modeling Using Structural Alignments and
Self-Consistent Mean-Field for Sidechain/Loop Prediction
E. Lindahl, M. Levitt, and P. Koehl
Department of Structural Biology, Stanford University School of Medicine,
Stanford, CA 94305 USA
koehl@csb.stanford.edu
For comparative modeling at CASP5, our group has focused on improving
sequence alignments and on the prediction of sidechains and loop regions in
proteins. All target sequences submitted to CASP5 were first screened using the
results from the CAFASP3 servers and comparative modeling was only
attempted for targets where at least one server showed intermediate or high
scores. The remaining sequences were considered ab initio targets, for which a
different method was used [1].
A consensus secondary structure prediction was derived from all the servers
available at the CAFASP3 website, giving additional weight to the PsiPred [2]
method. For all significant fold recognition hits we extracted both the original
structures and other structures in the same SCOP superfamily [3] with good
SPACI scores [4] to get high quality templates. We computed a sequence
profile based on the structural alignments of these templates, derived from the
FSSP database [5]. Position-dependent gap penalties were introduced based on
the experimental and predicted secondary structures, FSSP fragments, and the
distance between endpoints in the template structures for deletions.
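The weighted consensus over secondary structure predictions can be sketched in Python as a per-residue vote. This is a minimal illustration: the server names, the weight of 2.5 given to PsiPred and the alphabetical tie-break are assumptions, since the abstract only states that PsiPred received additional weight.

```python
from collections import defaultdict

def consensus_ss(predictions, weights):
    """Per-residue weighted vote over secondary-structure strings (H, E, C).

    predictions : dict mapping server name -> predicted string (all of equal length)
    weights     : dict mapping server name -> vote weight (default 1.0)
    """
    length = len(next(iter(predictions.values())))
    consensus = []
    for pos in range(length):
        tally = defaultdict(float)
        for server, states in predictions.items():
            tally[states[pos]] += weights.get(server, 1.0)
        # sort keys so ties resolve deterministically (alphabetically first state wins)
        consensus.append(max(sorted(tally), key=tally.get))
    return "".join(consensus)

# Hypothetical predictions from three servers for an eight-residue stretch
predictions = {"psipred": "HHHCCEEE", "serverA": "HHCCCEEC", "serverB": "CHCCEEEC"}
print(consensus_ss(predictions, {"psipred": 2.5}))
print(consensus_ss(predictions, {}))
```

With the extra weight, PsiPred overrides the other two servers wherever they both disagree with it; with equal weights, the simple majority decides.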
We used both our alignments derived from the structural profiles and
automated alignments from CAFASP3 to create a set of manually tweaked
alignments for each target. The emphasis in this tuning process was not mainly
on matching features, but rather on manual discrimination, correcting possible
mismatches, and taking any additional knowledge about the sequence/structure
into account. For large insertions or changes in secondary structure we first
altered the backbone structure of the template and applied energy minimization
with ENCAD/SegMod [6] and Gromacs [7] to get the structure to a reasonable
state.
Starting from the manual alignments, a model backbone framework was built
by removing two residues on each side of insertions/deletions in the template.
Candidate loop fragments were selected from a set of geometrically compatible
backbone fragments. Similar fragment sets were generated for positions where
there were PRO and GLY mutations between the template and query
sequences. This approach is limited to insertions shorter than about 15 residues,
and for a couple of cases we had to apply manual modeling using the O
program [8] to generate potential loops.
In the final modeling step, we selected a set of rotamers for each sidechain and
used a self-consistent mean-field approach [9-10] to simultaneously optimize
sidechains and the altered backbone fragments. Manual inspection and the
energy of the resulting all-atom models were used to select which predictions to
submit.
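The mean-field idea in the final modeling step can be illustrated with a small sketch. This is not the authors' code: it treats only sidechain rotamers (the abstract also optimizes backbone fragments), and the energy tables, the temperature kT, the damping factor and the function names are assumptions chosen for illustration. Each residue keeps a probability for every rotamer, and each probability is updated from the average energy the rotamer feels in the field of all other residues.

```python
import math


def symmetric_pairs(pairs):
    """Store each pair energy in both directions so lookups are symmetric."""
    table = {}
    for (i, a, j, b), energy in pairs.items():
        table[(i, a, j, b)] = energy
        table[(j, b, i, a)] = energy
    return table


def mean_field_rotamers(self_e, pair_e, kT=0.6, n_iter=200, damping=0.5):
    """Iterate rotamer probabilities to self-consistency.

    self_e[i][a] is the (hypothetical) self energy of rotamer a at residue i;
    pair_e[(i, a, j, b)] is the interaction energy between two rotamers.
    """
    n = len(self_e)
    probs = [[1.0 / len(e)] * len(e) for e in self_e]  # uniform start
    for _ in range(n_iter):
        updated = []
        for i in range(n):
            effective = []
            for a in range(len(self_e[i])):
                energy = self_e[i][a]
                for j in range(n):
                    if j == i:
                        continue
                    for b in range(len(self_e[j])):
                        energy += probs[j][b] * pair_e.get((i, a, j, b), 0.0)
                effective.append(energy)
            lowest = min(effective)  # shift for numerical stability
            weights = [math.exp(-(e - lowest) / kT) for e in effective]
            total = sum(weights)
            updated.append([w / total for w in weights])
        # Damped update: mix old and new probabilities to avoid oscillation.
        probs = [
            [damping * old + (1.0 - damping) * new for old, new in zip(p_old, p_new)]
            for p_old, p_new in zip(probs, updated)
        ]
    return probs


# Toy system: two residues with two rotamers each; rotamer 0 of both clash.
self_e = [[0.0, 1.0], [0.0, 0.2]]
pair_e = symmetric_pairs({(0, 0, 1, 0): 5.0})
probs = mean_field_rotamers(self_e, pair_e)
best = [max(range(len(p)), key=p.__getitem__) for p in probs]
```

In this toy case the clash is resolved by moving the residue whose alternative rotamer is cheaper, so `best` selects rotamer 0 for the first residue and rotamer 1 for the second.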
1. Raschke T., Summa C., Levitt M. (2002) Ab Initio Structure Prediction of
Targets in CASP5. CASP5 Abstract
2. Jones D.T. (1999) Protein secondary structure prediction based on
position-specific scoring matrices. J. Mol. Biol. 292 (2), 195-202.
3. Murzin A.G., Brenner S.E., Hubbard T., Chothia C. (1995) SCOP: a
structural classification of proteins database for the investigation of
sequences and structures. J. Mol. Biol. 247, 536-540
4. Brenner S.E., Koehl P., Levitt M. (2000) The Astral compendium for
protein structure and sequence analysis. Nucleic Acids Res., 28, 254-256
5. Holm L., Sander C. (1996) Mapping the protein universe. Science 273,
595-602
6. Levitt M., Hirshberg M., Sharon R., Daggett V. (1995) Potential Energy
Functions and Parameters for Simulations of Molecular Dynamics of
Proteins and Nucleic Acids in Solution. Comp. Phys. Comm. 91, 215-231
7. Lindahl E., Hess B., van der Spoel D. (2001) GROMACS 3.0: A package
for molecular simulation and trajectory analysis. J. Mol. Mod. 7(8), 306
http://www.gromacs.org
8. Jones T. A, Kjeldgard M. (1998) Essential O, Software manual, Uppsala
University. http://xray.bmc.uu.se/alwyn/o_related.html
9. Koehl P., Delarue M. (1994) Application of a self-consistent mean field
theory to predict protein side-chains conformation and estimate their
conformational entropy. J. Mol. Biol., 239, 249-275
10. Koehl P., Delarue M. (1995) A self consistent mean field approach to
simultaneous gap closure and side-chain positioning in homology
modeling. Nature Struct. Biol., 2, 163-170
nexxus-delrio (P0370) - 7 predictions: 7 3D
Protein Structure Assessment by Matching Residues Function
and Centrality
G. del Rio1, A. Garciarrubio2 and D.E. Bredesen1
1 – Buck Institute, 2 – Biotechnology Institute (UNAM)
gdelrio@buckinstitute.org
See methods section
ORNL-PROSPECT (P0012) - 330 predictions: 330 3D
Protein Domain Decomposition Using Network
Flow Algorithms and Neural Networks
J. Guo, D. Xu, D. Kim, and Y. Xu
Oak Ridge National Laboratory
xyn@ornl.gov
Structural domains are considered as the basic units of protein folding,
function, evolution, and design. Automatic decomposition of protein structures
into structural domains remains, even after many years of investigation, a
challenging and unsolved problem. Manual inspection still plays a large part in
the domain decomposition of protein structures when constructing domain databases.
We have previously developed a computer program, DomainParser, which uses
network flow algorithms for protein domain decomposition. The algorithm
partitions a protein structure into domains accurately when the number of
domains to be partitioned is known. However, its performance drops when this
number is unclear. By using various types of structural information,
including the hydrophobic moment profile, we have developed an effective method
for assessing the most probable number of domains a structure may have. The
core of this method is a neural network, which is trained to rank different
possible domain decompositions. By combining this neural network with our
previous network flow algorithms, our new algorithm achieves 82%
decomposition accuracy on a data set of 1317 protein chains while the old one
has an accuracy of 75%, when compared to the manual decomposition results
given in SCOP database.
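To make the network flow idea concrete, the following is a minimal sketch of a two-way split by minimum cut. It is not DomainParser itself: the contact graph, the edge capacities, the choice of source and sink residues and the function name are all assumptions for illustration, and the neural network ranking step described above is not modeled. Residues are nodes, capacities stand for contact strength, and the minimum cut between a source and a sink residue separates the structure where the contacts are weakest.

```python
from collections import deque


def min_cut_partition(capacity, source, sink):
    """Edmonds-Karp max flow; returns (cut value, source side, sink side).

    capacity[u][v] is a hypothetical contact strength between residues u and v.
    """
    residual = {u: dict(nbrs) for u, nbrs in capacity.items()}
    for u, nbrs in capacity.items():
        for v in nbrs:
            residual.setdefault(v, {}).setdefault(u, 0)

    def reachable_with_parents():
        parent = {source: None}
        queue = deque([source])
        while queue:
            u = queue.popleft()
            for v, cap in residual[u].items():
                if cap > 0 and v not in parent:
                    parent[v] = u
                    queue.append(v)
        return parent

    flow = 0
    while True:
        parent = reachable_with_parents()
        if sink not in parent:
            break  # no augmenting path left: the flow is maximal
        bottleneck = float("inf")
        v = sink
        while parent[v] is not None:
            bottleneck = min(bottleneck, residual[parent[v]][v])
            v = parent[v]
        v = sink
        while parent[v] is not None:
            u = parent[v]
            residual[u][v] -= bottleneck
            residual[v][u] += bottleneck
            v = u
        flow += bottleneck

    source_side = set(parent)  # nodes still reachable from the source
    sink_side = set(residual) - source_side
    return flow, source_side, sink_side


def undirected(edges):
    """Build a symmetric capacity table from (u, v, strength) triples."""
    capacity = {}
    for u, v, strength in edges:
        capacity.setdefault(u, {})[v] = strength
        capacity.setdefault(v, {})[u] = strength
    return capacity


# Two tightly packed groups of residues joined by one weak contact (2-3).
contacts = undirected([
    (0, 1, 5), (0, 2, 5), (1, 2, 5),
    (3, 4, 5), (3, 5, 5), (4, 5, 5),
    (2, 3, 1),
])
cut_value, domain_a, domain_b = min_cut_partition(contacts, source=0, sink=5)
```

Here the cut falls on the single weak contact, so the two residue groups come out as separate candidate domains. Deciding how many such cuts to make is the part the abstract assigns to the neural network.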
ORNL-PROSPECT (P0012) - 330 predictions: 330 3D
A Computational Pipeline for
Large-Scale Protein Structure Predictions
Manesh Shah1, Sergei Passovets1, Li Wang1, Dongsup Kim1,
Kyle Ellrott1,3, Dawei Lin4, Bi-Cheng Wang4,
Dong Xu1,3, Ying Xu1,2,3
1 Life Sciences Division, 2 Computer Science and Mathematics Division,
Oak Ridge National Laboratory, Oak Ridge, TN 37831
3 UT-ORNL Graduate School of Genome Science and Technology,
Oak Ridge, TN 37831
4 Department of Biochemistry & Molecular Biology, University of Georgia,
Athens, GA 30602
xyn@ornl.gov
We have recently developed a computational pipeline for automated protein
structure predictions. It can be used for high-throughput protein structure
prediction, including genome-scale prediction. The pipeline has capacities in
both homology modeling and threading-based protein fold recognition.
The main components of the pipeline are: (a) a toolkit consisting of essential
protein analysis tools, (b) a client/server system which provides access to the
tools, (c) a pipeline manager which coordinates the processing tasks for a given
analysis request, and (d) a web interface for query submission. The pipeline can
be used through the command line or a Web interface. The major computation of
the pipeline is carried out on a 64-node supercomputer at Oak Ridge National
Laboratory.
The pipeline operations can be categorized into three distinct phases: (1)
protein triage, (2) threading-based structure prediction and (3) sequence-based
function determination. The protein triage phase uses PRODOM (for domain
parsing), SOSUI (for classification into globular or membrane protein), SignalP
(for identifying signal peptide cleavage sites) and PSI-BLAST (for sequence
homology in PDB, Swissprot and other databases). The structure prediction phase
uses SSP (a secondary structure prediction tool that we developed),
PROSPECT (a protein fold recognition tool that we developed), MODELLER (for
atomic model construction) and WHATIF (for structure quality assessment).
The sequence-based function determination phase (not yet implemented) will use
the protein family classification tools Pfam, Motif and PRINTS. The pipeline
manager invokes different tools depending on the user input and the logic of the
prediction process, and controls the data and analysis flow of the pipeline. XML
technology is used for data exchange between the web interface, the pipeline
manager and the tools.
We have used this pipeline for the CAFASP2 predictions. The results also
helped us with the CASP5 predictions. We have applied the pipeline to
Rhodopseudomonas palustris, where 799 soluble proteins and 281 membrane
proteins were classified and predicted. It took only about a day on the 64-node
supercomputer at Oak Ridge National Laboratory. In addition, we predicted
structures for more than 2500 proteins in Pyrococcus furiosus for the SouthEast
Collaboratory for Structural Genomics (SECSG). The predicted structures are
used for target selection and as initial models for structure determination.
Protfinder (P0282) - 222 predictions: 222 3D
POSTER: Sequence-structure Alignments with the Protfinder
Algorithm
U. Bastolla
Centro de Astrobiologia (INTA_CSIC), Madrid, Spain
bastollau@inta.es
The Protfinder algorithm predicts protein structures by aligning the query
sequence to candidate structures in the PDB. Alignments are evaluated through
a minimal model of protein folding, which reproduces approximately some key
features of protein thermodynamics and is very convenient for rapid
computation. Information on sequence homology is not used in the scoring
function.
Protein structures are represented as contact maps and their effective
intramolecular interactions are modeled as a sum of contact interactions. We
use the contact energy function optimized in Ref. [1], which assigns lowest
energy to the experimentally known native structure for almost every sequence
of monomeric protein whose structure has been determined by X-ray
crystallography, except small fragments and chains with large cofactors.
Moreover, it generates well-correlated energy landscapes, in the sense that
structures very dissimilar from the native one have energies much higher than
the native energy. This property is crucial for protein structure prediction. The
effective energy function is also able to estimate the folding free energies of a
set of small proteins folding with two-state thermodynamics, with reasonable
agreement with experimental data [2].
The scoring function consists of three elements: the effective energy function
described above, a chain entropy term estimated in Ref. [2] and a term
penalizing gaps in the alignment. Gaps in secondary structure elements are
strictly forbidden. Gaps in the structure are allowed only if the two residues that
are shortcut are close in space and the angles characterizing their
pseudo-peptidic bond lie within a predefined range. Gaps in the sequence are allowed
only on the surface of the protein, which is identified by the fact that the
number of contacts per residue is smaller than a threshold. Allowed gaps
receive an energetic penalty G0 plus a penalty G1 for each residue in the gap.
To speed up the computation, each structure in the PDBSELECT [3]
non-redundant subset of the PDB was preprocessed to produce its contact map and
the list of allowed shortcuts in the structure. Secondary structure was obtained
from the DSSP file [4] when available, otherwise from the PDB file. The few
structures for which no secondary structure assignment could be obtained were
discarded. Preprocessing, together with the fact that the code uses mostly
integer arithmetic, speeds up the computation considerably.
To search for the optimal alignment, we use a stochastic version of the
deterministic Build-up algorithm developed by Park and Levitt to look for low
energy configurations of discrete protein models [5]. The algorithm is very
efficient at finding high-scoring alignments, although it is not guaranteed to
find the best optimum.
The algorithm starts by generating all possible gapless alignments of length l
between the query sequence and the test structure and stores the M alignments
with maximum score. At each subsequent step, an attempt is made to add a new
residue to each alignment. There are three possibilities: either the residue is
aligned to the next structural position, or it is aligned introducing a gap in the
structure (if allowed), or the residue is not aligned, initiating a gap in the
sequence. All possible continuations are generated, and the M best scoring
alignments are stored in memory and used as seeds for the next step. The
algorithm is iterated until no more residues can be added.
To improve the efficiency, instead of using the deterministic algorithm
described above, we select the M alignments at each step based on the sum of
their score plus a random number. The relative importance of the randomness is
large in the first steps, allowing the algorithm to visit a larger fraction of the
alignment space. The randomness decreases as the alignments get longer, so
that the complete alignment is chosen on the basis of the deterministic score.
The algorithm is first applied using a small value M=50 to scan rapidly the
whole database. The 200 proteins with the best alignments are then stored in
memory and used for a second more accurate search with M=800.
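The stochastic build-up search described above can be sketched as a beam search with decaying noise. This is a simplified illustration, not the Protfinder code: the pair scoring function, the noise schedule, the single-position structure gaps and all names are assumptions, seeds are single aligned residues rather than gapless alignments of length l, and the rules restricting where gaps may occur are left out.

```python
import random


def build_up_align(query, structure, pair_score, M=50, g0=1.0, g1=0.5,
                   noise0=1.0, rng=None):
    """Grow alignments residue by residue, keeping the M best at each step.

    Each beam entry is (score, next structure position, in sequence gap, pairs).
    Selection uses score plus random noise whose amplitude shrinks to zero at
    the last step, so the final ranking is deterministic.
    """
    rng = rng or random.Random(0)
    n = len(query)
    # Seeds: the first query residue placed on every structural position.
    beam = [(pair_score(query[0], structure[j]), j + 1, False, [(0, j)])
            for j in range(len(structure))]
    beam.sort(key=lambda c: c[0], reverse=True)
    beam = beam[:M]
    for i in range(1, n):
        candidates = []
        for score, j, in_gap, pairs in beam:
            if j < len(structure):
                # Align residue i to the next structural position.
                candidates.append((score + pair_score(query[i], structure[j]),
                                   j + 1, False, pairs + [(i, j)]))
            if j + 1 < len(structure):
                # Skip one structural position (gap in the structure).
                gapped = score + pair_score(query[i], structure[j + 1]) - (g0 + g1)
                candidates.append((gapped, j + 2, False, pairs + [(i, j + 1)]))
            # Leave residue i unaligned (gap in the sequence): G0 once, G1 each.
            penalty = g1 if in_gap else g0 + g1
            candidates.append((score - penalty, j, True, pairs))
        scale = noise0 * (1.0 - i / (n - 1))  # zero at the last step
        candidates.sort(key=lambda c: c[0] + scale * rng.random(), reverse=True)
        beam = candidates[:M]
    best = max(beam, key=lambda c: c[0])
    return best[0], best[3]


def identity(a, b):
    """Hypothetical toy score: +1 for identical symbols, -1 otherwise."""
    return 1.0 if a == b else -1.0


score, pairs = build_up_align("ABC", "XABCX", identity, M=50, noise0=0.0)
```

With the noise switched off (`noise0=0.0`) the search reduces to the deterministic build-up and recovers the gapless placement of "ABC" at offset 1. A two-pass use, as in the text, would call the function with a small M on every structure and again with a larger M on the best-scoring ones.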
Each candidate structure receives the score of its best alignment. The best
scoring structure is used as the prediction. The reliability of the prediction is
estimated through the normalized energy gap, which measures the difference
between the best score and the score of an alternative structure in units of the
best score, divided by the structural distance between the best scoring structure
and the alternative structure. If the minimal value of the normalized energy gap
over all alternative structures is large, the prediction is considered reliable. If it
is small, alignments with very different structures have scores close to the
best one and the reliability is very low.
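The reliability measure can be written down directly from this description. The sketch below is one reading of the text, not the author's implementation: the function name, the input layout and the treatment of alternatives at zero distance are assumptions.

```python
def normalized_energy_gap(best_score, alternatives):
    """Minimum over alternatives of (score difference / |best|) / distance.

    alternatives is a list of (score, structural distance to the best
    structure) pairs; higher scores are better, as in the search above.
    Alternatives at zero distance are skipped (an assumption, to avoid
    dividing by zero for structures identical to the best one).
    """
    gaps = []
    for score, distance in alternatives:
        if distance <= 0:
            continue
        gaps.append((best_score - score) / abs(best_score) / distance)
    if not gaps:
        raise ValueError("no alternative structure at non-zero distance")
    return min(gaps)


# Hypothetical scores: best = 10; a close alternative and a distant one.
gap = normalized_energy_gap(10.0, [(8.0, 0.5), (5.0, 1.0)])
```

In this example the close alternative gives (2/10)/0.5 = 0.4 and the distant one gives (5/10)/1.0 = 0.5, so the reported gap is 0.4. Taking the minimum means that a single competitive but structurally different alternative is enough to lower the reliability.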
1. Bastolla U. et al. (2000) A statistical mechanical method to optimize
energy functions for protein folding. Proc. Natl. Acad. Sci. USA 97, 3977-3981
2. Bastolla U. Testing the thermodynamics of a minimal model of protein
folding, in preparation
3. Hobohm U. and Sander C. (1994) Enlarged representative set of protein
structures. Protein Sci. 3, 522-524
4. Kabsch W. and Sander C. (1983) Dictionary of protein secondary
structure: pattern recognition of hydrogen-bonded and geometrical
features. Biopolymers 22 (12), 2577-2637
5. Park B.H. and Levitt M. (1995) The complexity and accuracy of discrete
state models of protein structure. J. Mol. Biol. 249, 493-507
Pushchino (P0203) - 263 predictions: 263 3D
Cunning Simplicity of a Hierarchical Folding and of Protein
Folding Funnels
A.V. Finkelstein
Institute of Protein Research, Russian Academy of Sciences, 142290,
Pushchino, Moscow Region, Russia
afinkel@vega.protres.ru
A hierarchic scheme of protein folding, as well as simple funnel models of
protein folding, do not solve the Levinthal paradox, since they cannot provide a
simultaneous explanation for the major features observed for protein folding: (i)
folding within non-astronomical time, (ii) independence of the native structure
from large variations in the folding rates of a given protein under different
conditions, and (iii) co-existence, in a visible quantity, of only the native and
the unfolded molecules during folding of moderate size (single-domain)
proteins. On the contrary, a nucleation mechanism of folding can account for
all these major features simultaneously and resolves the Levinthal paradox.
The author is grateful to N.S. Bogatyreva for discussions and assistance, and
acknowledges the support of an International Research Scholar's Award from the
Howard Hughes Medical Institute and of the Russian Foundation for Basic
Research.
Pushchino (P0203) - 263 predictions: 263 3D
Common Features in Structures and Sequences of Sandwich-Like Proteins
A.V. Finkelstein1, A.E. Kister2 and I.M. Gelfand2
1 - Institute of Protein Research, Russian Academy of Sciences, 142290,
Pushchino, Moscow Region, Russia, 2 - Department of Mathematics, Rutgers
University, Piscataway, NJ, 08854, USA
afinkel@vega.protres.ru
The goal of this work is to define the structural and sequence features common
to sandwich-like proteins (SP) – a group of very different proteins comprising
now 69 superfamilies in 38 protein folds. Analysis of the arrangements of
strands within main sandwich sheets revealed a rigorously defined constraint on
the supersecondary substructure that holds true for 94% of known SP
structures. The invariant substructure consists of two interlocked pairs of
neighboring β-strands. It is even more typical for centers of SP than the
well-known ‘Greek key’ strands arrangement [1] for their edges.
As homology among these proteins is usually not detectable even with the most
powerful sequence-comparing algorithms, we employ a structure-based
approach to sequence alignment. Within the interlocked strands we found 12
positions with fixed structural roles in SP. A residue at any of these positions
possesses similar structural properties to residues in the same position of
other SP. The 12 positions lie at the center of the interface between the β-sheets
and form the common geometrical core of SP. Of the 12 positions, 8 are
occupied by only four hydrophobic residues in 80% of all SP.
Authors are grateful to C. Chothia, P. Ehrlich, M. Goldman and Yu. Vasiliev
for stimulating discussions, and to L. Pogost and N.S. Bogatyreva for assistance.
A.V.F. acknowledges the support of an International Research Scholar's Award
from the Howard Hughes Medical Institute and of the Russian Foundation for
Basic Research.
1. Richardson J.S. (1981) The Anatomy and Taxonomy of Protein Structure.
Adv. Prot. Chem, 34, 167-339.
Pushchino (P0203) - 263 predictions: 263 3D
Protein Folding: Theoretical and Experimental Study
A.V. Finkelstein1, O.V. Galzitskaya1, D.N. Ivankov1,
N.S. Bogatyreva1, S.A. Garbuzinskii1, M.Yu. Lobanov1,
D.A. Dolgikh2 and M. Oliveberg3
1 - Institute of Protein Research, Russian Academy of Sciences, 142290,
Pushchino, Moscow Region, Russia, 2 - Shemyakin and Ovchinnikov Institute of
Bioorganic Chemistry, Russian Academy of Sciences, 117871, Moscow, Russia,
3 - Department of Biochemistry, Umeå University, S-901 87 Umeå, Sweden
afinkel@vega.protres.ru
We present a theory for calculating refolding and unfolding rates and for
finding the folding nuclei of globular proteins from their 3D structures and
stabilities.
On this basis, we predicted the folding and unfolding rates for protein S6 and
two of its engineered circular permutants which have been designed so as to
have increased rates of transitions between the folded and unfolded forms. The
experimental study of these proteins confirmed the predictions.
The method is based on solution of kinetic equations for networks of
folding-unfolding pathways. The theoretical results obtained for a large set of small and
middle-size proteins under various conditions are in good correlation with the
available experimental observations.
The obtained results emphasize a combined action of protein topology and
stability in controlling the rate of protein folding.
The work was supported by the Russian Foundation for Basic Research and by
an International Research Scholar’s Award from the Howard Hughes Medical
Institute.
Rokko (P0327) - 109 predictions: 109 3D
Method of Team Rokko: Multicanonical Ensemble Reversible
Fragment Assembly and Physico-chemical Energy Function
Yoshimi Fujitsuka1, George Chikenji1, Nobuyasu Koga1,
Akira R. Kinjo2, and Shoji Takada1,2
1 Kobe University, 2 Japan Science and Technology Corporation
stakada@kobe-u.ac.jp
For CASP5, we use SimFold, a protein simulation program that we have been
developing recently [1,2]. We briefly describe a) the energy function, b) the
sampling method in SimFold, and c) how we did in CASP5.
a) SimFold uses a coarse-grained protein model that has explicit backbone
atoms and a sphere at the center of mass of each sidechain. Each sidechain can take
one of several rotamer states. The energy function is based on physico-chemical
considerations and consists of many terms such as hydrophobic interactions,
hydrogen bonds, vdW interactions, and so on. In particular, hydrogen bond
interactions include a dependence on the local dielectric constant and a correlation
between two neighboring bonds in a beta sheet. Many of the length parameters are
determined from a database survey. The energetic parameters that need to be
accurate were optimized on the basis of the energy landscape theory. For
each structure in a training set of 40 proteins, we maximize the |Z| score, the
normalized difference between the native energy and the average energy of decoy
structures.
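As a small illustration of the quantity being maximized, the Z score of a native structure against a decoy set can be computed as below. The abstract does not state the normalization, so dividing by the population standard deviation of the decoy energies is an assumption, as are the function name and the example energies.

```python
import statistics


def z_score(native_energy, decoy_energies):
    """(native energy - mean decoy energy) / spread of the decoy energies.

    The spread is taken as the population standard deviation, which is an
    assumption; the abstract only says the difference is "normalized".
    """
    mean = statistics.mean(decoy_energies)
    spread = statistics.pstdev(decoy_energies)
    if spread == 0:
        raise ValueError("decoy energies have zero spread")
    return (native_energy - mean) / spread


# Hypothetical energies: the native lies well below the decoy distribution.
z = z_score(-10.0, [-2.0, 0.0, 2.0])
```

A strongly negative Z (large |Z|) means the native energy sits far below the bulk of the decoys, which is the property the parameter optimization aims for.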
b) For conformational sampling, SimFold uses either the fragment assembly
(FA) method or the replica exchange MD method. We emphasize that, unusually,
both FA and MD methods are available in a single program, SimFold.
Our FA differs from what has been developed by Baker's group in two
respects. First, we use only three-residue fragments instead of nine-residue
ones. Second, we have developed a "reversible FA method"
(Chikenji, Fujitsuka, & Takada, unpublished). We note that the typical FA
protocol does not obey detailed balance, but our algorithm does. In the
reversible FA method, we prepare new fragment libraries which contain
original fragment library structures and hybrid ones. Hybrid fragment structures
of the (i-1, i, i+1) residue segment consist of the latter half of the fragment structures of
the (i-2, i-1, i) segment and the first half of those of the (i, i+1, i+2) segment. Because reversible FA
fulfills the detailed balance condition, we could combine reversible FA with the
multicanonical ensemble Monte Carlo method, which is known to be a highly
powerful conformational sampling method for protein systems. Indeed, this
approach was used in CASP5 and helped conformational sampling very
significantly. Structures either with the lowest energy or at the center of large
clusters are chosen as predicted models. We also perform MD-based replica
exchange simulation, where each replica holds the same protein at a different
temperature and exchanges of replicas are attempted at a certain frequency. The
lowest energy structure is searched for in the replica at the lowest temperature,
while the high temperature replicas are useful for escaping from misfolded traps.
c) In CASP5, for all targets that have no homologous sequences of known
structure, we submitted structures predicted by SimFold. For chains shorter
than ~120 residues, starting from random structures, we performed FA sampling either
with the multicanonical ensemble method or with simulated annealing. We chose
either structures with low energies or those at the center of large clusters. For
longer sequences, we started from models from the CAFASP servers, performed
replica exchange MD for sampling and chose structures in the lowest
temperature replica. For some targets, other information such as annotation was
used too.
1. Takada S. (2001) Protein Folding Simulation With Solvent-Induced Force
Field: Folding Pathway Ensemble of Three-Helix-Bundle Proteins,
Proteins 42, 85-98.
2. Fujitsuka Y., Takada S., Luthey-Schulten Z.A., & Wolynes P.G. (2002)
Optimizing Physical Energy Functions for Protein Folding, submitted.
Ron-Elber (P0300) - 259 predictions: 259 3D
Protein Structure Prediction With Threading Using the
LOOPP2 Algorithm
T. Galor1, C. Lowe1, J. Meller2, J. Pillardy3, O. Teodorescu1 and
R. Elber1
1 – Department of Computer Science, Cornell University, Ithaca, N.Y., 14853;
2 – Cincinnati Children’s Medical Center, Pediatric Informatics, 3333 Burnet
Avenue, Cincinnati, OH 4522; 3 – Computational Biology Service Unit,
Cornell University, Ithaca, N.Y., 14853
loopp@tc.cornell.edu
See methods section
SAM-T02-server (P0189) - 221 predictions: 221 3D
SAM-T02 Protein Structure Prediction Webserver
Kevin Karplus, Rachel Karchin, and Richard Hughey
Center for Biomolecular Science and Engineering,
University of California, Santa Cruz
karplus@soe.ucsc.edu
See methods section
SAMUDRALA-NEWFOLD (P0051) - 410 predictions: 410 3D
A Hybrid Method Combining Sequence and Chemical Shift
Data to Predict Secondary Structure
L-H. Hung and R. Samudrala
Dept of Microbiology, University of Washington
lhhung@compbio.washington.edu
The recent increase in the amount of available experimental structural data has
been of considerable help to the structure prediction field. Strangely, the
converse is not true - sequence and homology based methods have had
relatively little impact on experimental methodologies. We are in the process of
developing hybrid methods using de novo techniques to facilitate and automate
NMR protein structure determinations.
As a first step towards this goal, we describe a new method for assigning
secondary structure by using neural networks to combine sequence-based
prediction (Psipred [1]) with chemical shift information. The resulting hybrid
method (PsiCSI) achieves a Q3 accuracy of 89%, an increase of 5.5% and 6.2%
(or equivalently, 33% and 36% fewer errors) over methods that use sequence
information (Psipred) or chemical shifts (CSI [2]) alone. In addition, errors
made by PsiCSI almost exclusively involve the interchange of helix or strand
with coil and not of helix with strand. The increased accuracy and automation
of PsiCSI will be of use in NMR experiments where assignment of secondary
structure is a useful intermediate step to the final determination of the tertiary
structure. The increased accuracy and the elimination of gross errors should
also be of use in protein structure predictions.
1. Jones D.T. (1999) Protein secondary structure prediction based on
position-specific scoring matrices. J. Mol. Biol. 292 (2), 195-202.
2. Wishart D. S., et al. (1992) The chemical shift index: A fast and simple
method for the assignment of protein secondary structure through NMR
spectroscopy. Biochemistry 31, 1647-1651
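The "fewer errors" figures quoted in the PsiCSI abstract follow from the Q3 values by simple arithmetic, sketched here as a check. The function name is hypothetical and the baseline accuracies are inferred from the stated gains rather than reported directly.

```python
def error_reduction(hybrid_q3, gain):
    """Fraction of baseline errors removed when Q3 rises by `gain` points."""
    baseline_q3 = hybrid_q3 - gain        # accuracy of the single-source method
    baseline_error = 100.0 - baseline_q3  # its error rate in percent
    return gain / baseline_error


vs_psipred = error_reduction(89.0, 5.5)  # 5.5 / 16.5
vs_csi = error_reduction(89.0, 6.2)      # 6.2 / 17.2
```

These evaluate to about 0.33 and 0.36, matching the 33% and 36% error reductions stated for Psipred and CSI respectively.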
SAMUDRALA-NEWFOLD (P0051) - 410 predictions: 410 3D
An Extension to the All-Atom Distance-Dependent Potential
For Ab Initio Protein Structure Prediction Based on Local
Sequence Similarity
Shing-Chung Ngan and Ram Samudrala
Dept of Microbiology, University of Washington
ngan@compbio.washington.edu
Knowledge-based statistical potentials based on pairwise distances between
residues (e.g. [1-2]) and atoms (e.g. [3-4]) have been widely used in protein
structure prediction. The determination of parameters for these potentials
involves extracting the distance distributions for pairs of residue-types (or
atom-types) from a set of proteins with known structures. A common drawback
in the construction of the potentials is that the connectivity of residues in the
protein chains is usually ignored. Hence, the influences of residues not local to
the residue (or atom) pair under consideration are not fully captured by the
resulting statistical model.
To provide a partial remedy to this shortcoming, the all-atom
distance-dependent potential as described in [4] is extended in a manner analogous to the
procedure described in [5], where a residue-residue distance dependent
potential was augmented. Essentially, in determining the distance distribution
of a pair of atom-types that is present in a given protein sequence whose
structure is to be predicted, a window of amino acid sequence surrounding each
of the two atoms is noted. Among the same pairs of atom types that are present
in the set of proteins with known structures, only those pairs with local amino
acid sequences similar to the noted amino acid sequences are to be used in
forming the distance distribution. The similarity measure is defined through the
BLOSUM 62 substitution matrix.
To evaluate the utility of the extended all-atom potential, it is tested on decoy
sets from the Decoys 'R' Us database [6]. Performance of the potential is
measured using the standard receiver-operating characteristic (ROC) analysis.
We observe that the new potential outperforms the all-atom potential in most of
the decoy sets. Further analysis of the extended potential on more decoy sets is
currently under way.
1. Wodak S. and Rooman M. (1993). Generating and testing protein folds.
Curr. Opin. Struct. Biol. 3, 247-259.
2. Sippl M. (1995). Knowledge based potentials for proteins. Curr. Opin.
Struct. Biol. 5, 229-235.
3. Subramaniam S., Tcheng D.K. and Fenton J.M. Knowledge-based methods
for protein structure refinement and prediction. In Proceedings of the
Fourth International Conference on Intelligent Systems in Molecular
Biology, St. Louis, 1996, Ed. David States et al., AAAI Press, California.
p. 218-229, 1996.
4. Samudrala R. and Moult J. (1998). An all-atom distance-dependent
conditional probability discriminatory function for protein structure
prediction. J. Mol. Biol. 275, 895-916.
5. Skolnick J., Kolinski A. and Ortiz A. (2000). Derivation of protein-specific
pair potentials based on weak sequence fragment similarity. Proteins 38:3-16.
6. Samudrala R, Levitt M. (2000). Decoys 'R' Us: A database of incorrect
protein conformations for evaluating scoring functions. Protein Science, 9:
1399-1401.
SAMUDRALA-NEWFOLD (P0051) - 410 predictions: 410 3D
The Bioverse: a Framework for Exploring the Relationships
among the Molecular and Organismal Worlds
Jason McDermott and Ram Samudrala
University of Washington, Department of Microbiology
mcdermottj@compbio.washington.edu
The large number of sequencing efforts underway has driven the need for ways
to better organize, visualize and use the vast amounts of genomic data being
generated. The Bioverse is an extensible framework for representing the
structural and functional data pertaining to single protein sequences in a
genome and the relationships between these proteins in inter- and
intra-genomic contexts. Predictions in the Bioverse are assigned confidence values
that can be used to combine information from different sources using neural
network-based approaches. For example, secondary structure in the Bioverse is
predicted by combining standard methods such as Psipred [1], sequence
similarity with proteins of known structure, and transmembrane region
prediction methods, then using a neural network to derive the final prediction.
The framework allows functional annotation of proteins using standard
sequence similarity methods (BLAST [2], HMMer [3], PROSITE [4]) as well
as through protein-protein interaction and evolutionary network context.
Prediction of protein structure using comparative modeling and/or ab initio
methods and structural comparison with databases of known structures provides
another powerful tool. In this way we are able to provide information about
proteins that show little or no sequence similarity with proteins of known
function. The Bioverse currently includes sequence, structure and function
information and protein-protein interaction and evolutionary network
representations of 12 genomes at http://bioverse.compbio.washington.edu.
1. Jones D.T. (1999) Protein secondary structure prediction based on
position-specific scoring matrices. J. Mol. Biol. 292 (2), 195-202.
2. Altschul S.F. et al. (1997) Gapped BLAST and PSI-BLAST: a new
generation of protein database search programs. Nucleic Acids Res. 25
(17), 3389-3402
3. Sonnhammer E.L.L. et al. (1997) Pfam: A comprehensive database of
protein domain families based on seed alignments. PROTEINS: Structure,
Function and Genetics 28: 405-420
4. Hoffman K. et al. (1999) The PROSITE database, its status in 1999.
Nucleic Acids Res. 27: 215-219
SBC (P0084) - 94 predictions: 94 3D
The Pcons And Pmodeller Consensus Fold Recognition
Servers
Björn Wallner, Fang Huisheng and Arne Elofsson
Stockholm Bioinformatics Center, Stockholm University,
106 91 Stockholm, Sweden
arne@sbc.su.se
See methods section
Solovyev-Softberry (P0270) - 242 predictions: 177 3D, 65 SS
SoftPM: Softberry tools for protein structure modelling
V. Solovyev, D. Affonnikov, A. Bachinsky, I. Titov Ivanisenko
and Y. Vorobjev
Softberry Inc., 116 Radio Circle, Suite 400
Mount Kisco, NY 10549, USA
victor@softberry.com
See methods section
SUPERFAMILY (P0065) - 925 predictions: 925 3D
Structural domain predictions for all genomes
J. Gough
Structural Biology, School of Medicine,
Stanford University, CA94305-5126, U.S.A.
gough@stanford.edu
See methods section
THW-FR (P0377) - 241 predictions: 241 3D
Net Charge Center for Protein Fold Recognition: the Large-Scale
Fold Recognition in the Framework of CASP-5.
I. Torshin1,2,3, R. Harrison2 and I. Weber3
1 – Chair of Physical Chemistry, Chem. Dept., Moscow State University, 2 –
Comp. Sci. Dept., GSU, Atlanta, GA, 3 – Biol. Dept., GSU
biotiy@suez.cs.gsu.edu
Net charge center (NCC) is a novel physico-chemical model developed for
analysis of the relationship between protein structure and function
[unpublished] and is likely to determine the location of functional regions and
sequences if the spatial structure of a protein is known. Sequences around positive
and negative charge centers (PNCC) are likely to be folding cores or folding
intermediates [1-3]. These two properties of a native protein can be used for
fold recognition using a pre-compiled non-redundant “library” of templates. As
the template library we have used the non-redundant domain database GTDD
(Gestalt Theory Domain Database [unpublished]). Gestalt theory [4], though
proposed over 50 years ago, is still one of the best theories that describe
principles of perception. The gestalt principles can be computerized and were
applied to construct a database of domains. In short, using NCC + PNCC
allows the selection of potential templates from the GTDD library of templates (that is,
fold recognition).
Complete 3D models for all of the CASP-5 targets (T0129-T0195, 67 proteins)
were prepared, and 2-5 models for each target were submitted. Although structural
data were not available at the time of preparation of this abstract, some
preliminary conclusions can still be made. Many targets had distinct sequence
identities or otherwise apparent similarities to a known structure: T0137 T0143
T0144 T0150 T0151 T0153 T0154 T0155 T0158 T0160 T0165 T0168 T0169
T0171 T0175-T0179 T0183 T0188 T0189 T0193 (24 proteins) and thus our
submitted fold predictions for those targets are likely to be reliable. The 24/67
ratio gives 37% as an estimate of the minimal reliability of the FoldRec-CC
method. This minimal 37% reliability is, at least, comparable to the CASP-4
results. Analysis of the preliminary results for these 24 proteins also suggests
that application of the NCC model alone can predict the correct fold for at least 15
of these 24 proteins. An analysis of the same set of proteins also suggests that,
unexpectedly, using a multiple sequence alignment of the target protein for fold
recognition does not significantly improve these preliminary results. Some of
the targets were recognized as being very likely to belong to a new fold: T0129,
T0148, T0161, T0184, T0186 (5 proteins). Although the results for the rest of
the targets cannot be assessed at this moment, we have preferred, in general, to
submit at least some of the prepared models rather than to increase the number
of “new fold” (or, in other words, empty) submissions. The method is fully
automated, though, of course, visual inspection of the final 10-20 models for
each target as well as the use of secondary structure predictions [5] can improve the
results of fold recognition.
1. Torshin I. et al. (2002) Charge centers and formation of the protein folding
core. Proteins, 43:353-364.
2. Torshin I. et al. (2002) Identification of protein folding cores and nuclei
using charge center model of protein structure. TheScientificWorld
Journal, 2:84-86.
3. Torshin I. et al. (2002) Protein folding: search for basic physical models,
submitted.
4. Köhler W. (1947) Gestalt Psychology, 136-279.
5. Jones D.T. (1999) Protein secondary structure prediction based on
position-specific scoring matrices. J. Mol. Biol. 292, 195-202.
Tsai (P0061) - 105 predictions: 105 3D
motif database. Similarity to the database structures conferred a better score,
and we chose the structure with the highest tertiary motif score as our primary
submission.
Using a Clustered 9mer Fragment Library and Evaluating
Tertiary Interactions in a De Novo, Fragment Based
Prediction Method
J. B. Holmes, H. C. Hodges, R. Swanson, D. Schell, R. Bliss, and
J. Tsai
Texas A&M University, Department of Biochemistry & Biophysics
JerryTsai@tamu.edu
Our approach to de novo structure prediction was naturally influenced by the
successful procedures of the Rosetta algorithm created in the Baker lab;
however, we developed our own method named Mosaix. Our methods differed
from Rosetta in using 1) only a clustered, 9mer move set, and 2) tertiary motif
scoring both internally, in the evaluation of moves, and externally, in the final
filtering of the decoys.
Fragment library: Our fragment library was created starting with the culled
pdb structure list (now PISCES) [1]. Each pdb file in the list was split into all
of its possible overlapping 9mers (9 adjacent residues) and the 9mers were
grouped initially into super-clusters based on a ProMotif determination of 2°
structure [2]. We considered three types of 2° structure: helix, sheet, and other
(coil & turn). This created 135,298 9mers in 1,982 super-clusters, and in all,
clustered into 34,952 clusters.
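The splitting and initial grouping steps described above can be sketched as follows. This is a minimal illustration, not the Mosaix code: it assumes the secondary structure is supplied as a per-residue string, that the super-cluster key is simply the 9-residue pattern reduced to three states (H, E, or O for other), and the function names are hypothetical.

```python
from collections import defaultdict


def overlapping_kmers(seq, k=9):
    """Return every window of k adjacent residues, in chain order."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def reduce_ss(ss):
    """Collapse a per-residue secondary-structure string to H, E, or O (other)."""
    return "".join(c if c in "HE" else "O" for c in ss)


def super_clusters(seq, ss, k=9):
    """Group all overlapping k-mers by their reduced secondary-structure pattern.

    Assumption: one super-cluster per distinct three-state pattern.
    """
    if len(seq) != len(ss):
        raise ValueError("sequence and secondary structure must have equal length")
    reduced = reduce_ss(ss)
    groups = defaultdict(list)
    for i in range(len(seq) - k + 1):
        groups[reduced[i:i + k]].append(seq[i:i + k])
    return dict(groups)
```

A chain of n residues yields n - 8 overlapping 9mers, so the fragment count grows roughly linearly with the total number of residues in the culled structure list.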
Tertiary motifs: We have constructed a library of tertiary motifs (TerMo) and
have used them in two ways. For quick analysis during construction, we
developed a gross TerMo score. Based on statistics of mean distances and
vector torsion angles from the TerMo library, we assumed a normal distribution
for both of these measures and installed a Gaussian scoring function to score
the tertiary structure around contacts for similarity to a database entry for each
newly built structure. In post-filtering of structures, we compared directly to
the TerMo Library for a refined TerMo score. The peptide was cleaved
between 2° structure components, and contacts were determined. These contact
pairs were then compared via RMSD to the contact pairs in our local tertiary
Heuristical Approach: We took 2° structure predictions from the CAFASP
results (PHD [3], PSIPRED [4], Sam-T99-2d [5]), and we split long sequences
into smaller pieces because of a power-law dependence of calculation time on
sequence length. Long sequences were split within regions that lacked strong
helical or strand propensities. In deciding how to split sequences we also
looked at the CAFASP results (Fischer, 2001) and some PSI-Blast results
[6]. Tertiary structures were re-assembled from the split sequences manually
(see below). While Mosaix is a derivative of the Rosetta method, we used only
the clustered 9mer library described above. Based on the 2° structure
predictions, we chose 150 fragments (50/prediction) and added fifty more
fragments that represented all types of 2° structure elements for wild-card
rescue from possible error in 2° structure prediction. For each prediction target,
Mosaix was run 1000 or more times for ne/e random fragment insertions
(n=sequence length). Each new decoy created was kept or thrown out based on
a Boltzmann-like Monte Carlo system. The potential function used
incorporated the Rosetta environment, pair and bump check, along with the
gross TerMo score for 2° structure pairing. After every 1% of the total
insertion iterations, the current, best-scored structure was output in pdb format,
as long as it met fairly relaxed contact order requirements, yielding about 50
decoys per run. In total, ~50K decoys were generated for each target, which
were clustered by a multi-centered clustering method [7]. Linked lists and
intensive use of memory allowed the clustering algorithm to process 84,000
decoys of 109 residues in 5.5 hours. The 30 cluster centers with the most
members were minimized using ENCAD [8] and scored with the refined
TerMo score. The top-scoring model was our primary submission, and the
remaining were reviewed and scored manually by intuition. The final four
submissions were based on a combination of the intuition-based ranking and
the refined TerMo score. For sequences that were split into domains, the
structures were then joined by first calculating the phi-psi angles and then
building the protein in phi-psi space according to these angles.
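The keep-or-discard step of the "Boltzmann-like Monte Carlo system" described above can be illustrated with a standard Metropolis criterion. This is a sketch under stated assumptions, not the Mosaix implementation: lower scores are taken to be better, and the temperature parameter, the `propose` and `score` callables, and the snapshot of the current structure after every 1% of the steps are all hypothetical stand-ins for the fragment insertion, the combined potential, and the decoy output.

```python
import math
import random


def accept_move(old_score, new_score, temperature, rng=random):
    """Metropolis rule: always accept improvements, sometimes accept worse moves."""
    if new_score <= old_score:
        return True
    return rng.random() < math.exp(-(new_score - old_score) / temperature)


def monte_carlo(initial, propose, score, n_steps, temperature, seed=0):
    """Run n_steps proposals and record the current state after every 1% of them."""
    rng = random.Random(seed)
    current, current_score = initial, score(initial)
    checkpoint = max(1, n_steps // 100)
    snapshots = []
    for step in range(1, n_steps + 1):
        candidate = propose(current, rng)      # e.g. one random fragment insertion
        candidate_score = score(candidate)
        if accept_move(current_score, candidate_score, temperature, rng):
            current, current_score = candidate, candidate_score
        if step % checkpoint == 0:
            snapshots.append((current_score, current))
    return current, snapshots
```

In the procedure described in the abstract, a snapshot would additionally have to pass the contact order requirement before being written out, which is why fewer than 100 decoys are obtained per run.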
1. Wang G. and Dunbrack R.L. Jr. (2002) PISCES: a protein sequence
culling server. Bioinformatics (submitted).
2. Hutchinson E.G. and Thornton J.M. (1996) PROMOTIF--a program to
identify and analyze structural motifs in proteins. Protein Sci. 5(2): p. 212-20.
3. Przybylski D. and Rost B. (2002) Alignments grow, secondary structure
prediction improves. Proteins 46(2): p. 197-205.
4. McGuffin L.J., Bryson K., and Jones D.T. (2000) The PSIPRED protein
structure prediction server. Bioinformatics 16(4): p. 404-5.
5. Karplus K. and Hu B. (2001) Evaluation of protein multiple alignments
by SAM-T99 using the BAliBASE multiple alignment test set.
Bioinformatics 17(8): p. 713-20.
6. Altschul S.F. and Koonin E.V. (1998) Iterated profile searches with
PSI-BLAST--a tool for discovery in protein databases. Trends Biochem Sci
23(11): p. 444-7.
7. Shortle D., Simons K.T., and Baker D. (1998) Clustering of low-energy
conformations near the native structures of small proteins. Proc Natl Acad
Sci U S A 95(19): p. 11158-62.
8. Levitt M. and Lifson S. (1969) Refinement of protein conformations using
a macromolecular energy minimization procedure. J Mol Biol 46(2): p.
269-79.
Zhou-HX (P0056) - 134 predictions: 69 3D, 65 SS
Improving Fold Recognition and Query-Template Alignment
by Combining PSI-Blast and Sequence-Structure Threading
H. Chen1, 2 and H.-X. Zhou1
1 – Florida State University, 2 – Drexel University
hxzhou@csit.fsu.edu
See methods section
CASP5 Software Demonstration Abstracts
harrison (P0188) - 43 predictions: 43 3D
Robust Molecular Modeling
John Petock1, Ping Liu1, Irene T. Weber1, and Robert W. Harrison2,1
1 – Department of Biology, 2 – Department of Computer Science, Georgia State
University
rharrison@cs.gsu.edu
Robust molecular modeling is the problem of ensuring both that a molecular model
can be built that satisfies a minimal set of input data and
that the model explores the range of possible structures which meet those data.
Input data typically consist of interatomic distances and partial structures in
homology modeling, but can consist solely of distance data in NMR structure
determination and ab initio modeling.
Two randomized algorithms for robust modeling are implemented in the
computer program AMMP: a self-assembling neural
network [1] and simulated annealing distance geometry. The self-assembling
neural network uses a Kohonen neural network to mimic the natural
self-assembly of polymers. It is quite capable of taking a limited description of a
polymer and generating sets of models that satisfy those data. The other
algorithm uses Floyd's algorithm (iterated triangle inequality) to fill in the full
distance matrix for distance geometry. Standard distance geometry algorithms
perform this calculation to estimate interatomic distances. However, unlike
standard distance geometry algorithms, this algorithm can treat the distances
derived via Floyd's algorithm as strict upper bounds rather than distance
estimates. The distance geometry equations are solved by a straightforward
Metropolis simulated annealing algorithm.
1. Harrison R.W. (1999) A self-assembling neural network for modeling
polymer structure. J. Math. Chem. 26, 125-137.
Head-Gordon (P0271) - 93 predictions: 93 3D
ProtoShop: Interactive Design of Protein Structures
O. Kreylos1, N. Max1 and S. Crivelli2
1 – Department of Computer Science, University of California, Davis, 2 – NERSC,
Lawrence Berkeley National Laboratory
SNCrivelli@lbl.gov
We demonstrate ProtoShop, a software tool that geometrically creates protein
structures from amino acid sequence and secondary structure prediction files
and allows interactive visualization and manipulation of those structures to
design protein configurations. The program has two major stages: In the first
stage, an initial protein structure is created from an input file; in the second
stage, protein structures are visualized and can be manipulated interactively.
Input to the program is either a PDB file, or a "prediction file" in FASTA
format. Prediction files contain the amino acid sequence for a given protein,
and specify each residue's secondary structure type as one of α-helix, β-strand
or coil. When reading a prediction file, protein structures are created one amino
acid residue at a time. Each residue's type is read from the input file, and atom
positions are read from residue template files. As the protein is assembled, the
program sets the dihedral angles of each added residue according to its
specified secondary structure type, and attaches the created residue to the end
of the existing protein. This way, proteins are created with secondary structures
already assembled. Typically, creating a protein is instantaneous.
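The per-residue assignment of dihedral angles from a prediction file can be sketched as below. This is only an illustration of the idea: the abstract does not give ProtoShop's file syntax or the angles it uses, so the one-letter type codes (H, E, C) and the phi/psi values in the table are assumptions (commonly quoted idealized values for helix and strand, and an arbitrary placeholder for coil).

```python
# Assumed idealized backbone dihedrals in degrees; not taken from ProtoShop.
IDEAL_DIHEDRALS = {
    "H": (-57.0, -47.0),    # alpha-helix
    "E": (-139.0, 135.0),   # beta-strand
    "C": (-60.0, 140.0),    # coil: placeholder values only
}


def assign_dihedrals(sequence, ss_types):
    """Pair each residue with its secondary-structure type and (phi, psi) angles.

    Residues are processed one at a time in chain order, mirroring the
    residue-by-residue construction described in the text.
    """
    if len(sequence) != len(ss_types):
        raise ValueError("sequence and structure assignment must have equal length")
    chain = []
    for residue, ss in zip(sequence, ss_types):
        phi, psi = IDEAL_DIHEDRALS[ss]
        chain.append((residue, ss, phi, psi))
    return chain
```

A real builder would then place each residue's template atoms using these angles; that geometric step is omitted here.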
Once a protein structure has been created, the program can visualize it using
several rendering styles. The main purpose for visualization is to aid a user in
the manipulation of the protein structure. Therefore, visualization includes
manipulation guides that are not part of the protein itself, such as indicators for
hydrogen bonds. Also, since collisions between atoms are not prohibited during
manipulation, they are visualized to call them to the user's attention.
Protein structures can be manipulated in two main ways: First, structure types
can be changed for individual residues on-the-fly, and secondary structures can
be manipulated as a whole, e.g., by twisting or curling beta strands or
re-forming alpha helices. Second, entire partial structure assemblies or secondary
structures can be dragged to form tertiary structure. Dragging is achieved by
automatically adjusting dihedral angles of selected coil regions using an Inverse
Kinematics (IK) method [1]. IK allows the manipulator to translate a user’s
six-degree-of-freedom motions into changes of a chain segment’s dihedral angles φ
and ψ. This gives a user great flexibility in aligning parts of proteins without
breaking the entire protein. The main application of dragging is to form
arbitrary beta sheet alignments, either manually, or assisted by selecting
residues for automatic bonding. Manipulation guides such as hydrogen bond
indicators and potential bond site indicators were specifically designed for the
purpose of forming beta sheets. These guides are updated in real-time during
manipulation, including forming/breaking of hydrogen bonds and visualization
of atom collisions.
Created protein structures can be saved in PDB format at any time during
manipulation, either to serve as input to other programs, or to reload structures
at a later time. ProtoShop has been used to create initial configurations for the
global optimization method used by the Head-Gordon group. The tool has
allowed this group to tackle proteins of any size and topology.
1. Welman C. (1993) Inverse kinematics and geometric constraints for
articulated figure manipulation. Master’s Thesis, Simon Fraser University,
Vancouver, Canada.
HOGUE-SLRI (P0267) - 254 predictions: 254 3D
The Distributed Folding Project
H.J. Feldman1,2 and C.W.V. Hogue1,2
1 – Samuel Lunenfeld Research Institute, Mount Sinai Hospital
2 – Department of Biochemistry, University of Toronto
hogue@mshri.on.ca
The number of users connected to the internet is growing faster than ever
before. High speed connections are becoming more and more common, and
will soon be the norm in many countries across the world as the modem goes
the way of the dinosaurs. This, combined with the fact that the average
computing power of a home machine is now comparable to that of
supercomputers just a few decades ago, means that there are massive amounts
of computing resources becoming available, all linked through one common
medium – the internet.
A total of 13 targets were predicted with the help of distributed computing
using an ab initio approach. Using a modified version of our highly
parallelizable TRADES algorithm [1] we developed a distributed computing
application, The Distributed Folding Project, to sample protein conformational
space. We incorporated secondary structure prediction from PsiPred [2] and
performed all-atom kinetic random walks in Ramachandran space on client
CPUs, biased by the 3-state secondary structure prediction. Sidechains were
placed probabilistically using Dunbrack's backbone dependent rotamer library
[3]. All residues are chirally and sterically valid, having a minimum of non-hydrogen van der Waals collisions.
Users download and run the software in the form of a Windows screensaver or
a Windows/UNIX ASCII art text client. The software generates probabilistic
conformers as described above and submits the results to our central server for
analysis and storage. Every time a new protein is begun, the software
automatically updates itself, downloading the information on the new protein
from our server, and the data is digitally signed for security purposes.
Up to one billion structures were generated for each target using the Distributed
Folding Project framework (http://www.distributedfolding.org/). This allowed
us to make use of spare CPU cycles on thousands of computers from volunteers
across the world to sample vast amounts of conformational space.
From the pool of generated structures various statistics were collected including
radius of gyration, exposed surface area, exposed hydrophobic surface area, and
energy score according to three different scoring functions: the EEF1 solvation
term, a modified version of a statistical residue-based potential [4] which also
compared actual secondary structure content to predicted content, and a
species-specific contact potential developed in our lab. Structures with radii of
gyration greater than 120% * 2.59 * N^0.346, where N is the number of
residues in the protein, were all discarded. This ensured only compact
structures were retained. The best structures were chosen based on their energy
scores.
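The compactness filter above can be written out directly from the stated formula. The sketch below assumes an unweighted radius of gyration computed over one point per residue (for example the C-alpha atoms) and coordinates in ångströms; the abstract states neither the atom set nor the unit, and the function names are illustrative.

```python
import math


def rg_cutoff(n_residues, tolerance=1.20):
    """Largest radius of gyration that is retained: 120% * 2.59 * N^0.346."""
    return tolerance * 2.59 * n_residues ** 0.346


def radius_of_gyration(coords):
    """Unweighted radius of gyration of a list of (x, y, z) points."""
    n = len(coords)
    cx = sum(p[0] for p in coords) / n
    cy = sum(p[1] for p in coords) / n
    cz = sum(p[2] for p in coords) / n
    squared = sum((x - cx) ** 2 + (y - cy) ** 2 + (z - cz) ** 2 for x, y, z in coords)
    return math.sqrt(squared / n)


def is_compact(coords, n_residues=None):
    """True if the structure passes the filter (it is not discarded)."""
    n = len(coords) if n_residues is None else n_residues
    return radius_of_gyration(coords) <= rg_cutoff(n)
```

For a 100-residue protein the cutoff evaluates to roughly 15.3, so a fully extended chain of that length, whose radius of gyration is on the order of 100, is rejected.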
The Distributed Folding Project serves as a rapid testing ground for evaluation
of new sampling algorithms and scoring functions, limited only in that they
must remain parallelizable. When used with proteins of known structure, it can
reveal how well different scoring functions are able to distinguish near-native
structures from a large pool of decoy conformers, and how quickly different
sampling algorithms converge towards native-like structure.
1. Feldman H.J. and Hogue C.W.V. (2000) A Fast Method to Sample Real
Protein Conformational Space. Proteins 39 (2), 112-131.
2. Jones D.T. (1999) Protein secondary structure prediction based on
position-specific scoring matrices. J. Mol. Biol. 292 (2), 195-202.
3. Dunbrack R.L., Jr. and Karplus M. (1993) Backbone-dependent rotamer
library for proteins. Application to side-chain prediction. J. Mol. Biol. 230
(2), 543-574.
4. Bryant S.H. and Lawrence C.E. (1993) An Empirical Energy Function for
Threading Protein Sequence through the Folding Motif. Proteins 16 (1),
92-112.
Osgdj (P0292) - 100 predictions: 100 3D
The PROTSCAPE Protein Folding WEB Server
D.J. Osguthorpe & N. WhiteLegg
University of Bath in Swindon
djosg@mgu.bath.ac.uk
The PROTSCAPE protein folding WEB server is a CGI interface to the
PROTSCAPE protein folding algorithm. The sequence is simply pasted into the
submission page and the generated conformations returned by e-mail.
A Beowulf cluster based on 20 dual 1.2 GHz Athlon processors is used to
perform the simulated annealing calculations required to generate the
structures.
The resulting simplified model is converted to an all-heavy atom model using a
combined procedure of RMS fitting and building. The RMS fitting generates the
all-atom backbone and is followed by a side chain building procedure to build
the all-atom side chains using the simplified model side chain atoms as guide
points.
Because of the heavy computational nature of the problem it takes about 24
hours per 100 residues.
CASP5 Other Abstracts
Evaluation of Blind Predictions of Protein-Protein
Interactions Made in the CAPRI Experiment
Raul Mendez, Raphael Laplae, Leonardo DeMaria and
Shoshana J. Wodak
Service de Conformation de Macromolécules Biologiques et Bioinformatique,
Cp263, Université libre de Bruxelles, Blv du Triomphe, 1050 Brussels,
Belgium.
shosh@scmbb.ulb.ac.be
Tens of thousands of gene products are known or suspected to interact with
many others, based on genetic, biochemical or bioinformatics methods, forming
millions of putative complexes. A very small fraction of these complexes will
be characterised, let alone have their 3D structure determined, in the near
future. Procedures for predicting the modes of association from the structures
of the components, docking procedures, have therefore received renewed
attention recently. But before predicted modes of association can serve as a
guide in genetic and biochemical experiments, the performance of the
prediction methods must be systematically assessed.
The Critical Assessment of PRedicted Interactions experiment (CAPRI) is a
community-wide blind test, similar to CASP, but devoted to docking
procedures (http://capri.ebi.ac.uk/Charleston.html). It aims at assessing the
state of the art of methods for predicting protein-protein interactions from the
3D structure of the unbound components.
Here we report the results of the evaluation that we conducted on 535
predictions submitted in two rounds of the CAPRI experiment by 19 different
groups for a total of 7 target complexes. Several of the complexes were large
multisubunit assemblies, and some featured conformational changes between
the bound and unbound species. We recently assessed these predictions and
presented the results to the predictors during the 1st CAPRI evaluation meeting
held in France Sep. 19-21, 2002. Here we would like to present highlights from
this evaluation.
To perform the evaluation we computed for each predicted complex the
fraction of native residue-residue contacts (those observed between the
interacting molecules in the target complex) that is recovered in the predicted
structure and the fraction of native interface residues that is recovered on each
face of the contacting proteins. We also quantified the rigid-body
transformation (center-of-coordinates translation and rigid-body rotation) that
is required to bring the predicted complex into register with the target, and
computed two different RMSD values: one between the main chain of interface
residues in the target versus the prediction, and another between the main chain
of molecules B in the prediction versus the target, after molecules A of both
complexes had been optimally superimposed. Although, overall, the predictions
cannot be qualified as successful, for each target a few groups succeeded in
coming close and sometimes very close to the right answer. But different
groups contributed successful predictions for different targets. It was very
encouraging that near correct predictions were made also in difficult cases
where the components undergo some conformational changes upon binding. It
appeared that in these and other cases, using biochemical information to guide
the calculations provided a clear advantage, but in other cases such information
was misleading.
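The two contact-based measures used in the evaluation above can be expressed compactly once the contacts are known. The sketch assumes each contact is given as a pair (residue in molecule A, residue in molecule B) and that the contact lists have already been derived from the structures; the distance criterion used to define a contact is not stated in the abstract, and the function names are illustrative.

```python
def fraction_native_contacts(native_contacts, predicted_contacts):
    """Fraction of native residue-residue contacts recovered in the prediction."""
    native = set(native_contacts)
    if not native:
        raise ValueError("the target complex has no native contacts")
    return len(native & set(predicted_contacts)) / len(native)


def fraction_interface_residues(native_contacts, predicted_contacts, side):
    """Fraction of native interface residues recovered on one face.

    side = 0 selects molecule A, side = 1 selects molecule B.
    """
    native = {contact[side] for contact in native_contacts}
    predicted = {contact[side] for contact in predicted_contacts}
    if not native:
        raise ValueError("the target complex has no interface residues")
    return len(native & predicted) / len(native)
```

A prediction can recover many interface residues while pairing them incorrectly, so the interface-residue fraction is typically at least as large as the contact fraction for the same model.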
The evaluation and ensuing discussions were also useful in pointing out
directions for future progress. Amongst the different docking procedures, a few
were clearly computationally very efficient in sampling potential docking
solutions, whereas others had better criteria for scoring these solutions.
Approaches that combine the more efficient search algorithms with the best
scoring functions should therefore be a good way forward. Another avenue for
progress will undoubtedly come from several novel procedures, tested for the
first time in CAPRI, with quite promising results.
The results of the CAPRI experiment will be published in a special issue of
Proteins: Structure, Function, and Genetics, in the spring of 2003.
The Pittsburgh Supercomputing Center and Hewlett Packard
Support of CASP5
Troy Wymore1, Angela Loh2, David Deerfield II1, Ralph
Roskies1 and Ken Hackworth1
1 – Pittsburgh Supercomputing Center, 2 – Hewlett Packard Life and Material
Sciences Division
wymore@psc.edu
The National Science Foundation Partnerships for Advanced Computational
Infrastructure program allocated computing time on PSC’s Terascale
Computing System (TCS) for researchers participating in CASP5. The TCS is
comprised of 3,000 HP Alpha Server EV68 processors and has a peak
capability of six teraflops (six trillion operations per second) making it the most
powerful system in the world for open research. This large-scale computational
resource was intended to advance structure predictions and refinements by
allowing researchers to use more accurate potentials and/or better sample
conformational space. This presentation will detail the computational resources
made available for the prediction season, the process by which time was
allocated, the resource usage and plans for future CASP experiments.
Abstract Author Index
A
Adamczak ................................. 107
Affonnikov ....................... 153, 206
Akiyama ....................... 37, 61, 122
Akutsu ...................................... 110
Albrecht ...................................... 33
Alexandrov ...................................3
Amaro ....................................... 146
An ............................................... 62
Arai ....................................... 54, 55
Arakaki ..................................... 152
Arnautova ................................. 145
Athma ................................. 36, 181
Autenrieth ................................. 146
Avbelj ......................................... 10
B
Bachinsky ......................... 153, 206
Badretdinov ..................................3
Baker ...................... 11, 12, 13, 176
Bakker ........................................ 30
Baldi ........................................... 14
Bass ............................................ 16
Bastolla ..................................... 124
Bates ........................................... 16
Benner .................................. 18, 19
Bienkowska .............................. 156
Bindewald .................................. 34
Birney ....................................... 192
Bliss .......................................... 208
Blundell .................. 30, 66, 90, 190
Bogatyreva ....................... 130, 202
Bolanos-Garcia ........................... 30
Boniecki ................................... 152
Boojala ..................................... 147
Bourne ............................. 147, 154
Boxall ...................................... 106
Bradley .............................. 12, 176
Braun ................................. 25, 178
Bredesen .......................... 114, 198
Brenner .............................. 52, 186
Brewerton .................................. 30
Brooks ....................................... 26
Brown ........................................ 49
Bujnicki ....................... 27, 71, 186
Burke ........................... 30, 90, 190
Butenhof .............................. 3, 175
Byrd ........................................... 76
Bystroff ........................ 28, 86, 180
C
Camacho ...................... 31, 32, 180
Canutescu .......................... 48, 184
Cao..................................... 80, 188
Capriotti ............................. 44, 182
Casadio .................43, 44, 100, 182
Casper ...................................... 136
Catherinot ............................ 9, 164
Cestaro ......................... 33, 34, 181
Chen H. ............................ 171, 209
Chen L. ...................................... 30
Chen W. ............................. 41, 148
Cherukuri ................................. 156
Chiba ................................. 41, 182
Chikenji ........................... 133, 203
Chinchio .................................. 145
Chivian ........................ 11, 13, 176
Choi ........................................... 76
Colubri ....................................... 59
Combet ...................................... 68
Coveney ................................... 152
Crivelli ............................... 76, 213
Cubellis ...................................... 30
Cuff A.L. .................................. 106
Cuff J. ...................................... 192
Cymerman .......................... 71, 186
Czaplewski ............................... 145
D
Damien ....................................... 99
Danzer ........................................ 24
Darlington ................................ 192
Day............................................. 62
De Kee ................................. 18, 19
Deane ....................................... 165
Debe ........................................... 24
Deerfield II ............................... 220
Del Carpio-Muñoz ............. 45, 183
del Rio .............................. 114, 198
DeMaria ................................... 219
Dent.................................... 36, 181
Depiereux ..................... 51, 99, 195
DePristo ..................................... 30
Diekhans .................................. 136
Diemand ..................................... 68
Dobbs ................................. 80, 188
Dolgikh .................................... 202
Doniach ...................................... 47
Douguet................................ 9, 164
Drake.......................................... 30
Draper ...................................... 136
Dumontie ................................. 189
Dumontier .................................. 81
Dunbrack R.L............. 48, 184, 185
Dunker ....................................... 49
E
Eastwood .................................. 167
Ehebauer..................................... 30
Eisenberg .................................. 165
Elber ................................. 134, 204
Ellrott ............................... 116, 199
Elofsson ............................ 144, 206
Eskow ......................................... 76
Eyrich ......................................... 62
F
Fang.......................................... 150
Farid ........................................... 62
Fariselli................. 43, 44, 100, 182
Favrin ......................................... 88
Feder .................................. 71, 186
Feig ............................................ 26
Feldman ...................... 81, 189, 214
Fernández ................................... 59
Finkelstein ........ 130, 169, 201, 202
Fischer ........................................ 57
Fiser .................................... 52, 186
Fitzjohn ...................................... 16
Fleming .................................... 192
Flohil .................................... 56, 57
Floudas ....................................... 58
Fogolari .............................. 34, 181
Fooks ........................................ 106
Friesner....................................... 62
Fujita ........................................ 110
Fujitsuka ........................... 133, 203
G
Galor ................................ 134, 204
Galzitskaya ....................... 130, 202
Gao ..................................... 80, 188
Garbuzynskiy ................... 130, 202
Garciarrubio ..................... 114, 198
Garner ......................................... 49
Garnier........................................ 67
Gelfand ..................................... 202
Gerloff ................................ 73, 187
Gibrat.......................................... 65
Gibson K.D. ............................. 145
Gibson R.C. .............................. 106
Ginalski ...................................... 74
Giordanetto ............................... 152
Go ......................................... 5, 175
Godzik ................................ 56, 179
Goede ....................................... 123
Gottlieb ..................................... 177
Gough ............................... 159, 160
Graña ........................................ 100
Grotthuss .................................. 179
Gunn ........................................... 62
Guo ................................... 116, 198
Gweon .................................. 30, 66
H
Hackworth ................................ 220
Hagino ...................................... 183
Haley-Vicente....................... 3, 175
Hamilton ............................. 73, 187
Han ........................................... 120
Hardin ....................................... 167
Harmer........................................ 30
Harrington .................................. 62
Harrison ...................... 75, 207, 213
Head-Gordon .............................. 76
Heger .......................................... 82
Hibi............................................. 20
Hijikata ................................. 5, 175
Hill ..................................... 52, 186
Hirokawa ................................... 37
Ho ...................................... 80, 188
Hodges ..................................... 208
Hogue ........................ 81, 189, 214
Holm .......................................... 82
Holmes ..................................... 208
Honig ......................................... 83
Horimoto.................................. 110
Huber ................................. 85, 190
Hubner ..................................... 148
Hughey .....................136, 137, 204
Huisheng .......................... 144, 206
Hung ........................................ 204
Hussein .............................. 73, 187
Hutchinson ............................... 106
I
Ihm..................................... 80, 188
Imbert ........................................ 41
Irbäck ......................................... 88
Ishida ......................................... 20
Ishizuka...................................... 20
Ivanciuc ............................. 25, 178
Ivanisenko........................ 153, 206
Ivankov .....................130, 169, 202
Iwadate ...................42, 54, 55, 182
J
Jacobson .................................... 62
Jager........................................... 89
Jagielska .................................. 145
Januszyk .................................. 146
Jaroszewski ........................ 56, 179
Jernigan...................................... 67
Jha ............................................ 152
Jones .....................91, 92, 191, 192
Joo ........................96, 97, 193, 194
Joseph ...................................... 162
Juan .......................................... 100
K
Kalisman .................................... 94
Kang ......................................... 145
Karchin .................... 136, 137, 204
Karplus ..................... 136, 137, 204
Katta ......................................... 150
Kaźmierkiewicz ....................... 145
Kaznessis ................................... 93
Keasar ........................................ 94
Kelley ............................... 156, 192
Khalili ...................................... 145
Khandelia ................................... 93
Kihara ...................................... 152
Kilkenny .................................... 30
Kim D. ..................... 116, 198, 199
Kim D.E. .................................... 13
Kim H. ....................................... 98
Kim I. ................... 96, 97, 193, 194
Kim S. ................................ 97, 193
Kim S.-Y. ............. 96, 97, 193, 194
Kinjo ................................ 133, 203
Kister........................................ 202
Klepeis ....................................... 58
Kloczkowski .............................. 67
Knizewski ................................ 179
Kochupurakkal ........................... 30
Koehl................................ 108, 197
Koga ................................. 133, 203
Kolinski.................................... 152
Kolodny ........................... 108, 196
Kolossváry ................................. 22
Kondratova............................... 130
Konerding .......................... 52, 186
Koretke .......................... 68, 69, 70
Kornev ....................................... 78
Kosinski ............................. 71, 186
Kotlovyi ..................................... 78
Kreylos ............................... 76, 213
Krieger ..................................... 169
Krishnaswamy .......................... 150
Kuhn ................................... 12, 176
Kumar....................................... 150
Kurihara ............................. 42, 182
Kurowski ............................ 71, 186
Kussell ...................................... 148
L
Labesse ................................. 9, 164
Lai .............................................. 30
Lambert ........................ 51, 99, 195
Laplae ....................................... 219
Lattimore .................................. 106
Lebedev .................................... 149
Lee J.K. ............................ 170, 171
Lee Jo. .................. 96, 97, 193, 194
Lee Ju. .................. 96, 97, 193, 194
Lee S.J. ............................... 96, 194
Leeuw ................................... 56, 57
LeFlohic ............................. 23, 177
Léonard ...................................... 99
Leplae ....................................... 112
Levitt ........................ 108, 196, 197
Li M. ........................................ 132
Li W. ........................................ 154
Li Xia. ........................................ 49
Li Xin ......................................... 62
Li Y. ........................................... 48
Lin D. ....................................... 199
Lin K.X. ................................... 161
Lindahl ............................. 108, 197
Litvinov .................................... 130
Liu ...................................... 75, 213
Liwo ......................................... 145
Lobanov ........................... 130, 202
Lobley ........................................ 30
Loh ........................................... 220
Lomize.............................. 101, 196
Lovell ......................................... 30
Lowe ................................. 134, 204
Luethy ................................ 16, 101
Lugovskoy ................................ 177
Lund ................................. 102, 196
Lundegaard ....................... 102, 196
Lupas .................................... 69, 70
Luthey-Schulten ............... 146, 167
M
MacCallum ............................... 104
Madera...................................... 160
Mallick ..................................... 165
Malmstrom ................................. 13
Mande ....................................... 162
Mandel-Gutfreund .................... 136
Marin .......................................... 65
Martin A.C.R. ........................... 106
Martin L. ...................................... 9
Mathura .............................. 25, 178
Mavropulo-Stolyarenko............ 149
Max .......................................... 213
McAllister ................................ 156
McCormack .......................... 18, 19
McDermott ............................... 206
McGuffin ...................... 91, 92, 192
Mehta........................................ 150
Meiler ........................... 12, 13, 176
Meller ....................... 107, 134, 204
Mendez ..................................... 219
Michalsky ................................. 123
Migliavacca ................................ 68
Miguel ........................................ 30
Miki ............................................ 20
Misura ................................ 12, 176
Mitchell.............................. 73, 187
Mizuguchi ............................ 30, 66
Montaluoa .................................. 30
Moon ............................... 170, 171
Moreira ...................................... 16
Mosberg ................................... 101
Mueller .................................... 192
Murphy E.F. ............................. 106
Murphy P. .................... 11, 12, 176
Murzin ..................................... 110
Mushegian ........................... 6, 176
N
Nakamura .................................. 20
Nanias ...................................... 145
Newhouse ................................ 192
Ngan .........................126, 142, 205
Nielsen ............................. 102, 196
Nishimura .................................. 20
Noguchi ............................. 37, 122
Noguti .................................. 5, 175
O
O’Donoghue ............................ 146
Obradovic .................................. 88
Oezguen ............................. 25, 178
Offman ....................................... 16
Ogata ....................................... 112
Oliveberg ................................. 202
Olmea ........................................ 43
Onizuka........................................ 4
Orengo ..................................... 192
Osguthorpe ...................... 117, 215
Ota ........................................... 122
P
Pan ........................................... 120
Park ............................................ 98
Pas ............................................ 121
Passovets .......................... 116, 199
Pazos ........................................ 100
Pellegrini-Calace ...................... 191
Pellequer .................................... 41
Peng ........................................... 88
Petock ................................ 75, 213
Petrey ......................................... 83
Pible ........................................... 41
Pillardy ............... 38, 134, 145, 204
Pincus ......................................... 62
Pogorelov ................................. 146
Pogozheva ........................ 101, 196
Pollastri ...................................... 14
Pons...................................... 9, 164
Popovic ...................................... 30
Pothier ........................................ 65
Prasad ................................. 31, 180
Preissner ................................... 123
Prentiss ..................................... 167
Procter ................................ 85, 190
R
Radivojac ................................... 49
Raghava ........................... 131, 132
Rapp ........................................... 62
Raschke ............................ 108, 196
Raval .......................................... 95
Reddy ....................................... 147
Reibarkh ................................... 196
Reva ......................................... 135
Ripoll ................................. 38, 145
Robertson ................................... 13
Robinson .................................. 192
Rohl.............................. 11, 13, 176
Romero ...................................... 49
Roskies ..................................... 220
Rossi .................................. 44, 182
Rotem ....................................... 123
Roytberg ................................... 130
Royyuru .............................. 36, 181
Rychlewski ................................. 21
Rykunov ................................... 135
S
Saigo ........................................ 110
Sali ..................................... 52, 186
Salim ........................................ 161
Samudrala ......... 126, 127, 129, 139, 140, 142, 204, 205, 206
Samuelsson................................. 88
Saqi .................................... 95, 152
Sasaki ......................................... 20
Sasin ................................... 71, 186
Sasson ...................................... 143
Saunders ................................... 145
Sawicka ...................................... 47
Scheib ............................. 68, 69, 70
Schein ................................. 25, 178
Schell ........................................ 208
Scheraga ................................... 145
Schief ................................. 12, 176
Schmid ............................... 73, 187
Schnabel ..................................... 76
Schneider .............................. 3, 175
Schonbrun .......................... 12, 176
Schueler-Furman ................ 12, 176
Shah.................................. 116, 199
Shakhnovich B. ........................ 148
Shakhnovich E.I. ...................... 148
Shao .............................. 28, 86, 180
Sharikov ..................................... 78
Shelenkov ................................. 184
Shestopalov .............................. 149
Shetty ......................................... 30
Shi ........................................ 39, 66
Shigeta ................................ 23, 177
Shimizu ...................................... 20
Shindyalov................................ 154
Shirasawa ................................. 183
Shitaka ........................................ 41
Shortle ...................................... 150
Siew ............................................ 57
Silverman ........................... 36, 181
Singh ........................................ 177
Sjölander ............................ 52, 186
Sjunnesson.................................. 88
Skolnick.................................... 152
Smith .......................................... 16
Soares ................................. 73, 187
Solovyev ........................... 153, 206
Sommer ................................ 7, 176
Sorensen ................................... 192
Soto ............................................ 83
Spassov ................................. 3, 175
Standley ...................................... 62
Stebbings .............................. 30, 66
Sternberg .......................... 156, 192
Strauss .......................... 11, 12, 176
Suenaga ...................................... 37
Summa.............................. 108, 196
Sundaram K. ............................. 158
Sundaram S. ............................. 158
Swanson ................................... 208
Szczesny ................................... 179
Szilagyi ..................................... 152
T
Takada .............................. 133, 203
Takaya ................................ 41, 182
Takeda-Shitaka ................... 42, 182
Talbot ....................................... 106
Tanaka ................................ 41, 182
Tang............................................ 83
Tarakanov ................................ 135
Taylor ...................................... 161
Ten Eyck .................................... 78
Teodorescu ...................... 134, 204
Terashi ....................42, 54, 55, 182
Tereshchenko ............................. 87
Thornton .................................. 192
Thorpe ....................................... 30
Titov ................................ 153, 206
Tomii ........................... 37, 61, 122
Toppo......................................... 33
Torda ................................. 85, 190
Torshin ............................. 163, 207
Tosatto ......................... 33, 34, 181
Tsai .......................................... 208
Tsigelny ..................................... 78
Tungaraza .......................... 36, 181
U
Umeyama ..........41, 42, 54, 55, 182
V
Vajda ................................. 31, 180
Valencia ............................. 43, 100
Valle ............................ 33, 34, 181
Venclovas ................................ 166
Veretnik ................................... 154
Vert .......................................... 110
Vicatos ....................................... 93
Vidyasagar ............................... 162
Vila .......................................... 145
Vlijmen .................................... 177
von Öhsen ............................ 7, 176
Vorobjev .......................... 153, 206
Vriend ...................................... 169
Vucetic ....................................... 88
W
Wallin ........................................ 88
Wallner ............................ 144, 206
Wang B.-C. .............................. 199
Wang C.-Z. ........................ 80, 188
Wang G. ............................. 48, 185
Wang L. ................................... 199
Ward .......................................... 92
Weber ......................... 75, 207, 213
Wedemeyer ........................ 12, 176
Weiss........................................ 165
WhiteLegg ............................... 215
Wild ........................................... 95
Wills ......................................... 106
Wodak .............................. 112, 219
Wolynes ........................... 146, 167
Word .......................................... 68
Worning ........................... 102, 196
Wymore ................................... 220
X
Xiang.......................................... 83
Xu D......................... 116, 198, 199
Xu J. ......................................... 132
Xu Yi. ...................... 116, 198, 199
Xu Yu................................. 25, 178
Y
Yamatsu ................. 42, 54, 55, 182
Yan B.C. .................................. 168
Yan J.F. .................................... 168
Yan L. .................................. 3, 175
Yeh ....................................... 3, 175
Yoon ................................ 170, 171
Z
Zarina ....................................... 161
Zemla ........................................... 8
Zhang F. ................................... 148
Zhang Y.................................... 152
Zheng ......................................... 47
Zhou H.-X. ....................... 171, 209
Zhou R................................ 36, 181
Zimmermann .............................. 65
Abstract Contents
(by abstract type & group)
Methods Abstracts
123D_SERVER (P0476) - 68 PREDICTIONS: 68 3D ................................. 3
ACCELRYS (P0210) - 24 PREDICTIONS: 24 3D ........................................ 3
ADVANCED-ONIZUKA (P0214) - 92 PREDICTIONS: 92 3D ...................... 4
ALAX (P0234) - 39 PREDICTIONS: 39 3D............................................... 5
ALIGNERS (P0064) - 31 PREDICTIONS: 31 3D ......................................... 6
ARBY-SCAI (P0183) - 68 PREDICTIONS: 68 3D ........................................ 7
AS2TS (P0081) - 26 PREDICTIONS: 26 3D ............................................ 8
ATOME (P0464) - 318 PREDICTIONS: 318 3D ....................................... 9
AVBELJ-FRANC (P0341) - 25 PREDICTIONS: 25 3D ............................... 10
BAKER (P0002) - 377 PREDICTIONS: 377 3D...................................... 11
BAKER (P0002) - 377 PREDICTIONS: 377 3D...................................... 12
BAKER-ROBETTA (P0029) - 199 PREDICTIONS: 199 3D ................... 13
BALDI (P0021) - 61 PREDICTIONS: 61 3D ............................................. 14
BALDI-CONPRO (P0022) - 62 PREDICTIONS: 62 RR ............................. 14
BALDI-SSPRO (P0023) - 63 PREDICTIONS: 63 SS ................................ 14
CMAP23DPRO (P0253) - 1 PREDICTION: 1 3D ..................................... 14
CMAPPRO (P0255) - 0 PREDICTIONS ................................................... 14
SSPRO2 (P0254) - 65 PREDICTIONS: 65 SS......................................... 14
BASS-MICHAEL (P0384) - 51 PREDICTIONS: 51 3D ............................... 16
BATES-PAUL (P0096) - 72 PREDICTIONS: 72 3D................................... 16
BENNER-STEVE (P0524) - 35 PREDICTIONS: 18 3D, 17 SS ................... 18
BENNER-STEVE (P0524) - 35 PREDICTIONS: 18 3D, 17 SS ................... 19
BILAB (P0080) - 200 PREDICTIONS: 200 3D ......................................... 20
BIOINFO.PL (P0006) - 75 PREDICTIONS: 75 3D .................................... 21
BIOKOL (P0258) - 23 PREDICTIONS: 23 3D........................................... 22
BION (P0474) - 63 PREDICTIONS: 63 SS .............................................. 23
BIONOMIX (P0475) - 61 PREDICTIONS: 61 3D ....................................... 24
BRAUN-WERNER (P0024) - 65 PREDICTIONS: 65 3D ............................ 25
BROOKS (P0373) - 252 PREDICTIONS: 252 3D ..................................... 26
BUJNICKI-JANUSZ (P0020) - 215 PREDICTIONS: 67 3D, 58 SS, 49 RR, 41 DR ... 27
BYSTROFF (P0131) - 132 PREDICTIONS: 45 3D, 40 SS, 45 RR, 2 DR ... 28
CAM-BIOCHEM (P0447) - 74 PREDICTIONS: 74 3D ................................ 30
CAMACHO-CARLOS (P0098) - 46 PREDICTIONS: 46 3D ......................... 31
CAMACHO-CARLOS (P0099) - 184 PREDICTIONS: 184 3D ..................... 32
CASPITA (P0108) - 133 PREDICTIONS: 70 3D, 63 SS ........................... 33
CASPITA (P0108) - 133 PREDICTIONS: 70 3D, 63 SS ........................... 34
CBC-FOLD (P0008) - 151 PREDICTIONS: 151 3D ............................... 36
CBRC (P0041) - 385 PREDICTIONS: 279 3D, 105 SS, 1 DR ................ 37
CBSU (P0417) - 173 PREDICTIONS: 173 3D ........................................ 38
CELLTECH (P0028) - 347 PREDICTIONS: 347 3D .................................. 39
CHEN-WENDY (P0264) - 37 PREDICTIONS: 37 3D ............................. 41
CHIMERA (P0153) - 94 PREDICTIONS: 94 3D ..................................... 41
CHIMERAX (P0170) - 74 PREDICTIONS: 74 3D................................... 42
CIRB (P0397) - 263 PREDICTIONS: 200 3D, 63 RR ............................. 43
CIRB (P0397) - 263 PREDICTIONS: 200 3D, 63 RR ............................. 44
DELCLAB (P0050) - 310 PREDICTIONS: 310 3D .................................. 45
DONIACH (P0401) - 42 PREDICTIONS: 42 3D ........................................ 47
DOROTA (P0589) - 1 PREDICTION: 1 3D ............................................ 47
DUNBRACK (P0329) - 46 PREDICTIONS: 46 3D ..................................... 48
DUNKER-KEITH (P0355) - 195 PREDICTIONS: 195 DR .......................... 49
ESYPRED3D (P0034) - 36 PREDICTIONS: 36 3D .................................. 51
EVOLUTIONARIES (P0180) - 99 PREDICTIONS: 99 3D ............................ 52
FAMS (P0168) - 324 PREDICTIONS: 324 3D ........................................ 54
FAMSD (P0169) - 322 PREDICTIONS: 322 3D ..................................... 55
FFAS03 (P0309) - 314 PREDICTIONS: 314 3D .................................... 56
FLOHIL (P0545) - 3 PREDICTIONS: 3 3D ............................................... 56
FLOHIL (P0545) - 3 PREDICTIONS: 3 3D ............................................... 57
FISCHER (P0427) - 161 PREDICTIONS: 161 3D .................................. 57
FLOUDAS-C.A. (P0011) - 15 PREDICTIONS: 15 3D ............................... 58
FM-AF (P0571) - 17 PREDICTIONS: 17 3D ........................................... 59
FORTE1 (P0290) - 276 PREDICTIONS: 276 3D ................................... 61
FRIESNER (P0112) - 174 PREDICTIONS: 174 3D................................... 62
FROST-MIG (P0047) - 72 PREDICTIONS: 72 3D ................................. 65
FUGUE2 (P0014) - 330 PREDICTIONS: 330 3D ................................... 66
FUGUE3 (P0226) - 330 PREDICTIONS: 330 3D ................................... 66
GARNIER-KLOCZKOWSKI (P0396) - 91 PREDICTIONS: 91 SS ................. 67
GEM (P0359) - 76 PREDICTIONS: 76 3D ............................................. 68
GEM (P0359) - 76 PREDICTIONS: 76 3D ............................................. 69
GEM (P0359) - 76 PREDICTIONS: 76 3D ............................................. 70
GENESILICO (P0517) - 195 PREDICTIONS: 86 3D, 64 SS, 45 RR.......... 71
GERLOFF (P0240) - 9 PREDICTIONS: 9 3D ......................................... 73
GINALSKI (P0453) - 71 PREDICTIONS: 71 3D ........................................ 74
HARRISON (P0188) - 43 PREDICTIONS: 43 3D....................................... 75
HEAD-GORDON (P0271) - 93 PREDICTIONS: 93 3D .............................. 76
HMMSPECTR (P0025) - 285 PREDICTIONS: 285 3D .......................... 78
HO-KAI-MING (P0437) - 129 PREDICTIONS: 129 3D ............................. 80
HOGUE-SLRI (P0267) - 254 PREDICTIONS: 254 3D ........................... 81
HOLM (P0090) - 38 PREDICTIONS: 38 3D ............................................. 82
HONIG (P0110) - 113 PREDICTIONS: 113 3D ........................................ 83
HUBER-TORDA (P0351) - 83 PREDICTIONS: 83 3D ............................... 85
I-SITES/BYSTROFF (P0132) - 64 PREDICTIONS: 64 3D .......................... 86
INFORMAX (P0326) - 24 PREDICTIONS: 24 3D ................................... 87
IRBACK (P0559) - 20 PREDICTIONS: 20 3D ........................................... 88
IST-ZORAN (P0454) - 195 PREDICTIONS: 195 DR .............................. 88
JAGER (P0582) - 7 PREDICTIONS: 7 3D ................................................ 89
JIVE (P0506) - 37 PREDICTIONS: 37 3D ................................................ 90
JONES (P0067) - 121 PREDICTIONS: 68 3D, 53 SS .............................. 91
JONES-NEWFOLD (P0068) - 214 PREDICTIONS: 87 3D, 63 SS, 64 DR .. 92
KAZNESSIS (P0548) - 15 PREDICTIONS: 15 3D ..................................... 93
KEASAR (P0429) - 90 PREDICTIONS: 90 3D .......................................... 94
KGI-QMW (P0015) - 19 PREDICTIONS: 19 3D ..................................... 95
KIAS (P0531) - 479 PREDICTIONS: 176 3D, 303 SS ............................ 96
KIAS (P0531) - 479 PREDICTIONS: 176 3D, 303 SS ............................ 97
KIM-PARK (P0442) - 65 PREDICTIONS: 65 SS ...................................... 98
LAMBERT-CHRISTOPHE (P0035) - 131 PREDICTIONS: 131 3D ............ 99
LIBELLULA (P0230) - 216 PREDICTIONS: 216 3D ............................. 100
LOMIZE-ANDREI (P0288) - 76 PREDICTIONS: 76 3D ............................ 101
LUETHY (P0419) - 240 PREDICTIONS: 240 3D .................................... 101
LUND-OLE (P0391) - 39 PREDICTIONS: 39 3D .................................... 102
MACCALLUM (P0393) - 130 PREDICTIONS: 130 SS ............................ 104
MARTIN-ANDREW (P0471) - 55 PREDICTIONS: 55 3D.......................... 106
MELLER-ADAMCZAK (P0441) - 23 PREDICTIONS: 23 3D ...................... 107
LEVITT (P0016) - 350 PREDICTIONS: 350 3D...................................... 108
MPALIGN (P0135) - 327 PREDICTIONS: 327 3D................................ 110
MURZIN (P0448) - 21 PREDICTIONS: 21 3D ........................................ 110
MZ-BRUSSELS (P0246) - 54 PREDICTIONS: 54 3D ............................. 112
NEXXUS-DELRIO (P0370) - 7 PREDICTIONS: 7 3D ................................ 114
ORNL-PROSPECT (P0012) - 330 PREDICTIONS: 330 3D ................. 116
OSGDJ (P0292) - 100 PREDICTIONS: 100 3D ..................................... 117
PAN (P0032) - 164 PREDICTIONS: 99 3D, 65 SS ................................ 120
PAS (P0513) - 73 PREDICTIONS: 73 3D ............................................. 121
PILOT (P0378) - 146 PREDICTIONS: 146 3D ..................................... 122
POMI (P0465) - 46 PREDICTIONS: 46 3D .......................................... 123
PREISSNER (P0488) - 20 PREDICTIONS: 20 3D................................... 123
PROTFINDER (P0282) - 222 PREDICTIONS: 222 3D ............................ 124
PROTINFO-AB (P0140) - 260 PREDICTIONS: 260 3D ....................... 126
PROTINFO-CM (P0138) - 251 PREDICTIONS: 251 3D ...................... 127
PROTINFO-FR (P0139) - 325 PREDICTIONS: 325 3D ....................... 129
PUSHCHINO (P0203) - 263 PREDICTIONS: 263 3D .............................. 130
RAGHAVA-GAJENDRA (P0054) - 482 PREDICTIONS: 224 3D, 258 SS ... 131
APSSP/RAGHAVA-GAJENDRA (P0137) - 65 PREDICTIONS: 65 SS ...... 131
APSSP2/RAGHAVA-GAJENDRA (P0055) - 65 PREDICTIONS: 65 SS .... 132
RAPTOR (P0144) - 227 PREDICTIONS: 227 3D................................. 132
ROKKO (P0327) - 109 PREDICTIONS: 109 3D ..................................... 133
RON-ELBER (P0300) - 259 PREDICTIONS: 259 3D.............................. 134
RYKUNOV-REVA-TARAKANOV (P0529) - 198 PREDICTIONS: 198 3D .... 135
SAM-T02-HUMAN (P0001) - 203 PREDICTIONS: 138 3D, 65 SS ......... 136
SAM-T02-SERVER (P0189) - 221 PREDICTIONS: 221 3D ................... 137
SAMUDRALA-COMPARATIVE-MODELLING (P0053) – ............... 139
SAMUDRALA-FOLD-RECOGNITION (P0052) - 315 PREDICTIONS: 315 3D ... 140
SAMUDRALA-NEWFOLD (P0051) - 410 PREDICTIONS: 410 3D ...... 142
SASSON-IRIS (P0265) - 66 PREDICTIONS: 66 3D ................................ 143
SBC (P0084) - 94 PREDICTIONS: 94 3D ............................................ 144
SCHERAGA-HAROLD (P0314) - 135 PREDICTIONS: 135 3D ................. 145
SCHULTEN-WOLYNES (P0093) - 118 PREDICTIONS: 118 3D ............... 146
SDSC2:REDDY-BOURNE (P0347) - 54 PREDICTIONS: 54 3D .............. 147
SHAKHNOVICH-EUGENE (P0459) - 26 PREDICTIONS: 26 3D ................ 148
SHESTOPALOV (P0044) - 159 PREDICTIONS: 79 3D, 80 SS ........... 149
SHORTLE (P0349) - 32 PREDICTIONS: 32 3D...................................... 150
SK-LAB (P0403) - 2 PREDICTIONS: 2 3D ............................................. 150
SKOLNICK-KOLINSKI (P0010) - 361 PREDICTIONS: 361 3D ................. 152
SMD-CCS (P0249) - 4 PREDICTIONS: 4 3D ....................................... 152
SOLOVYEV-SOFTBERRY (P0270) - 242 PREDICTIONS: 177 3D, 65 SS. 153
SPAM1 (P0400) - 87 PREDICTIONS: 87 3D ....................................... 154
SRBI (P0331) - 109 PREDICTIONS: 109 3D ....................................... 156
STERNBERG (P0105) - 71 PREDICTIONS: 71 3D ................................. 156
SUNDARAMS (P0381) – 0 PREDICTIONS ......................................... 158
SUPERFAMILY (P0065) - 925 PREDICTIONS: 925 3D ...................... 159
SUPFAM_PP (P0086) - 728 PREDICTIONS: 728 3D .......................... 160
SZED-ASMAT (P0515) - 6 PREDICTIONS: 6 3D .................................... 161
TAYLOR (P0423) - 113 PREDICTIONS: 113 3D .................................... 161
TCS-BIOINFORMATICS (P0404) - 40 PREDICTIONS: 40 SS .................. 162
THW-FR (P0377) - 241 PREDICTIONS: 241 3D.................................. 163
TOME (P0450) - 260 PREDICTIONS: 260 3D ..................................... 164
UCLA-DOE (P0301) - 59 PREDICTIONS: 59 3D ................................. 165
VENCLOVAS (P0425) - 20 PREDICTIONS: 20 3D .............................. 166
WOLYNES-SCHULTEN (P0294) - 42 PREDICTIONS: 42 3D ................... 167
YAN-RESEARCH (P0069) - 60 PREDICTIONS: 60 SS ........................... 168
YASARA-PUSHCHINO (P0202) - 192 PREDICTIONS: 192 3D ................ 169
YOON (P0262) - 35 PREDICTIONS: 35 3D ........................................... 170
YOON (P0262) - 35 PREDICTIONS: 35 3D ........................................... 171
ZHOU-HX (P0056) - 134 PREDICTIONS: 69 3D, 65 SS ....................... 171
Poster Abstracts
ACCELRYS (P0210) - 24 PREDICTIONS: 24 3D.................................... 175
ALAX (P0234) - 39 PREDICTIONS: 39 3D .......................................... 175
ALIGNERS (P0064) - 31 PREDICTIONS: 31 3D..................................... 176
ARBY-SCAI (P0183) - 68 PREDICTIONS: 68 3D .................................... 176
BAKER (P0002) - 377 PREDICTIONS: 377 3D ................................... 176
BAKER (P0002) - 377 PREDICTIONS: 377 3D ................................... 176
BIOGEN (P0440) - 28 PREDICTIONS: 28 3D ........................................ 177
BION (P0474) - 63 PREDICTIONS: 63 SS ............................................ 177
BRAUN-WERNER (P0024) - 65 PREDICTIONS: 65 3D .......................... 178
BURNHAM (P0516) - 68 PREDICTIONS: 68 3D..................................... 179
BYSTROFF (P0131) - 132 PREDICTIONS: 45 3D, 40 SS, 45 RR, 2 DR. 180
CAMACHO-CARLOS (P0098) - 46 PREDICTIONS: 46 3D ....................... 180
CASPITA (P0108) - 133 PREDICTIONS: 70 3D, 63 SS ......................... 181
CBC-FOLD (P0008) - 151 PREDICTIONS: 151 3D ............................. 181
CHIMERA (P0153) - 94 PREDICTIONS: 94 3D ................................... 182
CHIMERAX (P0170) - 74 PREDICTIONS: 74 3D................................. 182
CIRB (P0397) - 263 PREDICTIONS: 200 3D, 63 RR ........................... 182
DELCLAB (P0050) - 310 PREDICTIONS: 310 3D ................................ 183
DUNBRACK (P0329) - 46 PREDICTIONS: 46 3D ................................... 184
DUNBRACK (P0329) - 46 PREDICTIONS: 46 3D ................................... 185
EVOLUTIONARIES (P0180) - 99 PREDICTIONS: 99 3D .......................... 186
GENESILICO.PL-SERVERS-ONLY (P0242) - 68 PREDICTIONS: 66 3D, 2 SS ... 186
BUJNICKI-JANUSZ (P0020) - 215 PREDICTIONS: 67 3D, 58 SS, 49 RR, 41 DR ... 186
GENESILICO (P0517) - 195 PREDICTIONS: 86 3D, 64 SS, 45 RR ......... 186
GERLOFF (P0240) - 9 PREDICTIONS: 9 3D ....................................... 187
HO-KAI-MING (P0437) - 129 PREDICTIONS: 129 3D ........................... 188
HOGUE-SLRI (P0267) - 254 PREDICTIONS: 254 3D ......................... 189
HUBER-TORDA (P0351) - 83 PREDICTIONS: 83 3D ............................. 190
JIVE (P0506) - 37 PREDICTIONS: 37 3D .............................................. 190
JONES (P0067) - 121 PREDICTIONS: 68 3D, 53 SS ............................ 191
JONES (P0067) - 121 PREDICTIONS: 68 3D, 53 SS ............................ 192
KIAS (P0531) - 479 PREDICTIONS: 176 3D, 303 SS .......................... 193
KIAS (P0531) - 479 PREDICTIONS: 176 3D, 303 SS .......................... 194
LAMBERT-CHRISTOPHE (P0035) - 131 PREDICTIONS: 131 3D .......... 195
LOMIZE-ANDREI (P0288) - 76 PREDICTIONS: 76 3D ............................ 196
LUND-OLE (P0391) - 39 PREDICTIONS: 39 3D .................................... 196
LEVITT (P0016) - 350 PREDICTIONS: 350 3D...................................... 196
LEVITT (P0016) - 350 PREDICTIONS: 350 3D...................................... 197
NEXXUS-DELRIO (P0370) - 7 PREDICTIONS: 7 3D ................................ 198
ORNL-PROSPECT (P0012) - 330 PREDICTIONS: 330 3D ................. 198
ORNL-PROSPECT (P0012) - 330 PREDICTIONS: 330 3D ................. 199
PROTFINDER (P0282) - 222 PREDICTIONS: 222 3D............................. 200
PUSHCHINO (P0203) - 263 PREDICTIONS: 263 3D .............................. 201
PUSHCHINO (P0203) - 263 PREDICTIONS: 263 3D .............................. 202
PUSHCHINO (P0203) - 263 PREDICTIONS: 263 3D .............................. 202
ROKKO (P0327) - 109 PREDICTIONS: 109 3D ..................................... 203
RON-ELBER (P0300) - 259 PREDICTIONS: 259 3D .............................. 204
SAM-T02-SERVER (P0189) - 221 PREDICTIONS: 221 3D ................... 204
SAMUDRALA-NEWFOLD (P0051) - 410 PREDICTIONS: 410 3D ...... 204
SAMUDRALA-NEWFOLD (P0051) - 410 PREDICTIONS: 410 3D ...... 205
SAMUDRALA-NEWFOLD (P0051) - 410 PREDICTIONS: 410 3D ...... 206
SBC (P0084) - 94 PREDICTIONS: 94 3D ............................................ 206
SOLOVYEV-SOFTBERRY (P0270) - 242 PREDICTIONS: 177 3D, 65 SS . 206
SUPERFAMILY (P0065) - 925 PREDICTIONS: 925 3D ...................... 207
THW-FR (P0377) - 241 PREDICTIONS: 241 3D.................................. 207
TSAI (P0061) - 105 PREDICTIONS: 105 3D ......................................... 208
ZHOU-HX (P0056) - 134 PREDICTIONS: 69 3D, 65 SS ....................... 209
Demonstration Abstracts
HARRISON (P0188) - 43 PREDICTIONS: 43 3D .................................... 213
HEAD-GORDON (P0271) - 93 PREDICTIONS: 93 3D ............................ 213
HOGUE-SLRI (P0267) - 254 PREDICTIONS: 254 3D ......................... 214
OSGDJ (P0292) - 100 PREDICTIONS: 100 3D ..................................... 215
Other Abstracts
EVALUATION OF BLIND PREDICTIONS OF PROTEIN-PROTEIN INTERACTIONS
MADE IN THE CAPRI EXPERIMENT ..................................................... 219
THE PITTSBURGH SUPERCOMPUTING CENTER AND HEWLETT PACKARD
SUPPORT OF CASP5 ........................................................................ 220