Second meeting in Bruxelles

Consortium funded by the EU
Under the contract QLG2-CT-2002-01298
The second meeting of the Protein Folding Fragment consortium was held at the Université Libre
de Bruxelles, at the invitation of Marianne Rooman, on November 24th and 25th, 2003.
1. Presentation of the advancement of work during the last semester
1.1. Group 1
How to hopefully get automatic topohydrophobic positions from structures
Jacques Chomilier, Anne Lopes
The Paris team has moved during the past summer from Jussieu to Boucicaut, another
campus in downtown Paris. From the common data bank of 116 PDB entries, about one half
of the entries have been analysed in terms of topohydrophobic positions. Thus the second half
remains to be determined. The major challenge is to find a way to obtain these positions
both quickly (which means as automatically as possible) and reliably (which in turn means
a slow and careful check of the results). The bottleneck in this procedure is performing
reliable structural alignment by blocks with allowed gaps. From a glance at
CKAAP, the Conserved Key Amino Acid Position server from Phil Bourne at the San Diego
Supercomputer Center (http://ckaaps.sdsc.edu/perl/browser.pl), it seems designed to
produce the longest possible blocks of superposition, not to determine the largest number of
deep core positions. We have therefore been interested in an algorithm that is basically the
transposition of the BLAST algorithm to 3D structures. It first retrieves words of
four to five amino acids, and from these seeds it then tries to extend them to form SHSSPs,
structural high scoring segment pairs. One can then deduce a sequence alignment and further
derive the highly conserved regions. This algorithm, called Yakusa, has been developed by
Joël Pothier et al. It can process a list of proteins, so one can fix an upper limit on the
sequence identity of any pair of entries (30%, to be consistent with previous data). It is fairly
fast and has the advantage of comparing internal coordinates instead of the more common rmsd
for comparison of structures; we agree with group 3 that rmsd is not well suited to detecting
evolutionary relatedness.
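The seed-and-extend idea described above can be sketched as follows. This is a minimal illustration and not the Yakusa implementation: it assumes that each structure has already been reduced to a list of one backbone angle per residue (in degrees), and the function names, the word length of 4 and the 20-degree tolerance are all hypothetical choices. Extension is done without gaps, which is a simplification.

```python
def angle_diff(a, b):
    """Smallest absolute difference between two angles, in degrees."""
    d = abs(a - b) % 360.0
    return min(d, 360.0 - d)


def find_seeds(angles_a, angles_b, word=4, tol=20.0):
    """Return (i, j) start pairs where `word` consecutive angles agree within `tol`."""
    seeds = []
    for i in range(len(angles_a) - word + 1):
        for j in range(len(angles_b) - word + 1):
            if all(angle_diff(angles_a[i + k], angles_b[j + k]) <= tol
                   for k in range(word)):
                seeds.append((i, j))
    return seeds


def extend_seed(angles_a, angles_b, i, j, word=4, tol=20.0):
    """Extend a seed in both directions without gaps.

    Returns (start_a, start_b, length) of the extended segment pair.
    """
    start_a, start_b = i, j
    end_a, end_b = i + word, j + word
    # Extend towards the N-terminus while the angles still agree.
    while (start_a > 0 and start_b > 0 and
           angle_diff(angles_a[start_a - 1], angles_b[start_b - 1]) <= tol):
        start_a -= 1
        start_b -= 1
    # Extend towards the C-terminus while the angles still agree.
    while (end_a < len(angles_a) and end_b < len(angles_b) and
           angle_diff(angles_a[end_a], angles_b[end_b]) <= tol):
        end_a += 1
        end_b += 1
    return start_a, start_b, end_a - start_a
```

Because the comparison is made on internal coordinates, two segments can be recognised as equivalent without any prior superposition of the structures.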
1.2. Group 2
Calculations of Mostly Interacting Residues and comparison to topohydrophobic
positions and Tightened End Fragments
Nikolaos Papandreou, Elias Eliopoulos
Group 2 is currently involved in WPs 1 and 2. Concerning WP1, the common protein
dataset is complete and the server that will host the different aspects of the project is
operational at the address http://biotech.aua.gr/LIFE/. Mirrors of the server should shortly be
set up by a number of participants.
Concerning WP2, the Monte-Carlo algorithm that is used to calculate the MIR (Mostly
Interacting Residues) has been refined and a first set of definitive results obtained. They
concern a subset of 43 proteins of the common dataset, for which both topohydrophobic
residues and closed loops are known. The comparison of the MIR with
topohydrophobic positions and loop ends confirmed the preliminary calculations and showed
a very clear correspondence between them. Thus the MIR algorithm should be considered as
a possible method to predict the critical residues that stabilize the protein hydrophobic
nucleus. Within WP1, the Monte-Carlo calculations on the rest of the dataset will be
completed after the calculation of all topohydrophobic residues.
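A correspondence of the kind described above can be quantified, for instance, as the fraction of predicted positions that fall on or near a reference position. The sketch below is only an illustration of such a measure; the function name, the tolerance parameter and the example positions are hypothetical and are not taken from the Group 2 analysis.

```python
def fraction_matched(predicted, reference, tolerance=0):
    """Fraction of predicted positions within `tolerance` residues of a reference position.

    `predicted` could be MIR positions and `reference` topohydrophobic
    positions or loop ends, all given as residue numbers.
    """
    if not predicted:
        return 0.0
    hits = sum(1 for p in predicted
               if any(abs(p - r) <= tolerance for r in reference))
    return hits / len(predicted)


# Hypothetical positions, for illustration only.
mir = [5, 10, 20]
topohydrophobic = [5, 11, 40]
exact = fraction_matched(mir, topohydrophobic)
near = fraction_matched(mir, topohydrophobic, tolerance=1)
```

Allowing a tolerance of one or two residues is a design choice: it avoids penalising predictions that are shifted by one position along the chain.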
Discussions with group 5 started in order to prepare the next WP in which groups 1, 2,
3 and 5 are involved.
1.3. Group 3
Closed loops of TIM barrel proteins
Z. Frenkel. In the absence of Z. Frenkel, who could not get his passport/visa in time, these
results were presented by Edward Trifonov.
In our effort to describe the TIM barrel family by closed loops of a presumably limited
number of types, we came across extreme variability of the sequences belonging to apparently the
same structural family of loops. The similarity of the sequences within each family
becomes obvious when, instead of the 20-letter amino-acid alphabet, we use a 2-letter alphabet
derived from the reconstruction of the origin and evolution of the triplet code (Trifonov, Gene
261:139, 2000). The effect is so dramatic that for some closed loops of the same
structure the 20-letter alphabet sequences are completely different (1-2 matches at most),
while in the binary representation only 1-2 letters out of 25-30 do not match. We suggest that,
whenever the relatedness of structures is questionable sequence-wise, the binary code be used
to reveal the relatedness.
A second observation is that some segments that are related sequence-wise (in the 20-letter
alphabet) turn out to be structurally very different (by RMS). Yet, if the structures are
represented by torsional angles (Angle walk), they turn out to be almost identical: only 2-3
angles differ, which completely distorts the closed loop appearance and the RMS difference, while
the rest of the angles are the same for the two compared structures.
Two letter alphabet for protein sequence comparisons
Edward Trifonov
The earlier published reconstruction of the origin and evolution of the triplet code
(Trifonov, Gene 261, 139, 2000) suggests that the new codons appeared as point-mutated
earlier codons, with conservation of purines and, respectively, pyrimidines in the central
positions of the codons. This introduces two independent amino-acid alphabets: A, F, I, L, M,
P, S, T, V – Ala-family, with pyrimidine-central codons, and C, D, E, G, H, K, N, Q, R, S, W,
Y – Gly-family, with purine-central codons. It turns out that, apparently, even after the triplet
code was completed, the conservation of the two alphabets remained in place. Indeed, the
tabulated replacements (PAM and BLOSUM matrices) show a very strong separation of the
replacements into Ala-Ala-type and Gly-Gly-type replacements. The observed separation of
the amino acids into two alphabets is of fundamental value both for protein evolution studies
and for the practicalities of sequence alignments.
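The two alphabets listed above translate directly into a binary recoding of a sequence. The sketch below is a minimal illustration: the function names and the symbols `a`, `g` and `?` are hypothetical, and treating serine (which appears in both families in the list above) as compatible with either letter is an assumption made here, not a rule stated in the text.

```python
ALA_FAMILY = set("AFILMPTV")      # pyrimidine-central codons
GLY_FAMILY = set("CDEGHKNQRWY")   # purine-central codons
AMBIGUOUS = set("S")              # S is listed in both families


def to_binary(seq):
    """Recode a one-letter amino-acid sequence into the two-letter alphabet."""
    out = []
    for aa in seq.upper():
        if aa in AMBIGUOUS:
            out.append("?")
        elif aa in ALA_FAMILY:
            out.append("a")
        elif aa in GLY_FAMILY:
            out.append("g")
        else:
            raise ValueError(f"unknown residue: {aa!r}")
    return "".join(out)


def count_mismatches(seq1, seq2):
    """Count positions where two equal-length sequences differ in the binary alphabet.

    Positions involving the ambiguous residue S are never counted as mismatches.
    """
    if len(seq1) != len(seq2):
        raise ValueError("sequences must have the same length")
    pairs = zip(to_binary(seq1), to_binary(seq2))
    return sum(1 for x, y in pairs if "?" not in (x, y) and x != y)
```

For example, the hypothetical segments `AVLKD` and `ILMRE` share no identical residue in the 20-letter alphabet, yet `count_mismatches` returns 0 for them because every position stays within the same family.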
1.4. Group 4
Progress with fingerprints
Terri Attwood, Manuel Corpas
The Manchester team has begun to make a systematic analysis of the common dataset
of 116 structures provided earlier in the year. The dataset was divided first into 2 sets: those
that already have some kind of fingerprint in the PRINTS database, and those that have not –
the latter was made a priority for deriving new fingerprints. The approach to producing
new fingerprints is 2-fold – automatic and manual. The automatic approach is itself
pursued from 2 perspectives: (i) using the PDB domain as the seed for the fingerprint
process, and (ii) using the equivalent full Swiss-Prot sequence as the seed for the fingerprint.
Once new fingerprints have been created manually and automatically in this way,
those fingerprints that are already in PRINTS in some form will be revisited. Here it is
necessary to decide if the existing fingerprint falls entirely in the PDB domain, or if it
contains motifs outside the domain. In the latter case, the fingerprint will need to be revised to
better represent the structure. An overview of the process is shown in the Figure below.
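[Figure: Fingerprinting overview. The 117 PDB entries are split into those already in PRINTS (56, of which 43 match the PDB domain and 13 are a partial match, the latter a priority) and those not in PRINTS (61, a priority; ~40 done). New fingerprints are produced manually and automatically (from PDB and Swiss-Prot seeds, the latter a priority), followed by objective comparison. Results: 117 manual and < 117x2 automatic fingerprints, leading to data integration & visualisation.]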
To date, we have done ~40 new fingerprints manually and have some kind of
meaningful representation for 43; thus, we have completed almost half of the manual effort.
During the coming months, we will concentrate on completing both the manual and automatic
efforts. For the future, we are working with Steve Pettifer (Dept. of Computer Science) to
determine how best to store our results and integrate them with those of the other teams.
Dr. Pettifer is an expert on data integration and visualisation, and will help us to extend our
current integration/visualisation software (UTOPIA, of which the CINEMA alignment editor
is a core component), to handle the data emerging from each of the teams, as illustrated
below.
[Figure: Using CINEMA & UTOPIA as a framework for visualising fingerprints, MIRs, LIRs, TEFs, Ts, etc., in 2D and 3D.]
1.5. Group 5
Prelude, Fugue and PoPMuSiC: are they in harmony with other methods?
Dimitri Gilis, Marianne Rooman, Jean-Marc Kwasigroch, Yves Dehouck, Christophe Biot,
René Wintjens
The first 6 months of this project have been dedicated to: (1) the energy functions used
to evaluate the compatibility between a sequence and a conformation, and (2) the comparison
between the results obtained with our programs on the common data bank of 116 PDB entries,
and the topohydrophobic positions and the limits of the TEFs.
The dependence of distance-dependent database-derived potentials on the size of the
proteins in the database used to derive them is a drawback that was identified
some years ago and is not yet fully understood. We have addressed this issue by probing
the theoretical validity of these potentials as potentials of mean force that take the solvent
implicitly into account and involve entropic contributions due to atomic degrees of freedom
and solvation. The results of this analysis have been used to devise new corrective functions
that take into account the size of the protein studied. We have shown that these corrected
potentials perform better than their more classical versions at retrieving the correct
sequence-structure association among a decoy set.
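One common way to quantify how well an energy function retrieves the native structure from a decoy set is to report the rank of the native energy and its Z-score relative to the decoy energies. The sketch below illustrates this kind of evaluation; it is an assumption made for illustration and not necessarily the measure used by Group 5. The function name and the example energies are hypothetical.

```python
import statistics


def native_rank_and_zscore(native_energy, decoy_energies):
    """Return (rank, z) of the native energy within a set of decoy energies.

    rank = 1 means that no decoy has a lower energy than the native.
    z is the native energy minus the decoy mean, in units of the decoy
    standard deviation; more negative values mean better discrimination.
    """
    rank = 1 + sum(1 for e in decoy_energies if e < native_energy)
    mean = statistics.mean(decoy_energies)
    sd = statistics.pstdev(decoy_energies)
    z = (native_energy - mean) / sd if sd > 0 else float("nan")
    return rank, z


# Hypothetical energies in arbitrary units, for illustration only.
rank, z = native_rank_and_zscore(-10.0, [-4.0, -6.0, -8.0, -2.0])
```

Comparing these two numbers for the classical and the size-corrected potentials on the same decoy set would give a direct, if crude, measure of the improvement.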
We have also assembled a collection of decoy sets to evaluate the performance of
energy functions used in the field of protein tertiary structure prediction. There exists a large
number of decoy sets that are available on the web, but their quality is variable. In a first step,
we have analyzed all these sets. We have then selected some of them in order to propose a
collection of sets that are accurate, that have been created for proteins belonging to several
structural classes and that contain non-native structures that are representative of the
conformational space, with structures close to the native. The results of this analysis can be
found at the URL: http://babylone.ulb.ac.be/decoys.
Another part of our work has consisted of running Prelude, Fugue and PoPMuSiC on
the 116 PDB entries of our common data bank. With Fugue we have identified regions of the protein
sequences that show a strong preference towards the native conformations, whereas the
PoPMuSiC results give the positions along the sequence that are (not) optimized with respect
to the thermodynamic stability of the protein. We have correlated these results with the
topohydrophobic positions and the limits of the TEFs for one protein. Our future work will
consist of systematically correlating the results of Fugue and PoPMuSiC with the
topohydrophobic positions and the TEFs. Moreover, we will compare the regions predicted
by Fugue with the foldons derived from the 3D structures by the group of P. Wolynes.
Finally, we plan to analyze, in collaboration with Group 6, a large database of mutated
proteins for which the experimental folding free energy difference has been measured, using
PoPMuSiC and FoldX. This collaboration was initiated via a one-month stay of D. Gilis
in the laboratory of Group 6.
1.6. Group 6
TANGO : an algorithm to predict protein aggregation
Frédéric Rousseau, Luis Serrano
We have modified the FOLD-X algorithm so that it now includes heteroatoms such as Ca, Zn,
Mg and Mn with great accuracy, as well as the Kds for Ca ions. We have also refined
the force field, and we are in the process of comparing our predictions for
mutants with those of partner Rooman (Group 5), with the idea of finding complementarities and
synergies. Regarding protein folding, we have not yet been able to solve the problem of
correctly estimating loop entropy in folding, which we believe is necessary to accurately
describe the folding pathways of proteins. In parallel, and independently of the
proposal, we have developed a software package called TANGO that predicts the tendency of a protein to
aggregate. TANGO predicts with surprisingly good accuracy the regions experimentally
described to be involved in the aggregation of 176 peptides from over 20 proteins. The
predictive capacities of TANGO are further illustrated by two examples: the prediction of the
aggregation propensities of Aβ1-40 and Aβ1-42 and of several disease-related mutations of
the Alzheimer’s β-peptide, as well as the prediction of the aggregation profile of human acyl
phosphatase. Thus, by capturing the energetics of structural parameters observed to contribute
to protein aggregation and taking into account competing conformations, like α-helix and
β-turn formation, it is possible to identify with surprising accuracy protein regions susceptible
of promoting protein aggregation. The success of TANGO shows that the underlying
mechanism of cross-β aggregate formation is universal. This type of prediction is
essential to understand protein folding, as well as for protein design, since it takes into
account the effect of mutations on the denatured state of proteins. It is our intention to link
TANGO to FOLD-X, so that when designing a protein or modifying a folding pathway by
mutagenesis we can see the possible effect on the aggregation properties of the target
molecule.
For next year we plan to run FOLD-X on the PDB database generated by the
consortium, producing an output that contains the energy per position, which can be
compared to the results produced by other members of the consortium. We also hope to finally
have the loop problem solved and to add to the new FOLD-X web server the possibility of
predicting not only point mutations but also folding pathways.
1.7. Group 7
Off Lattice molecular dynamics folding simulations
David Perahia, Charles Robert, Liliane Mouawad
We continued to develop the off-lattice molecular dynamics program (MMSIM) that
was started at the beginning of our participation in the European consortium, by adding new
modules that increase its predictive power for finding the native state of a protein from
the sole knowledge of its sequence. The protein is represented by a chain in which the
residues are represented by single points located at the Cα positions. The force field contains
secondary structure propensity potentials depending on the nature of the residues within segments
of 3 and 4 residues, and residue-residue contact potentials between the 20 types of residues
that are used in a Lennard-Jones function. Our efforts were directed towards developing an
optimisation procedure for energy parameters in such a way that the native state corresponds
to the lowest energy in a set composed of decoy structures and the native one. A second
condition was the introduction of a gap energy between the native structure and structures
beyond 3Å root mean square deviation (rmsd) from the native structure. The optimization
procedure consisted of generating decoy structures of a given protein by molecular dynamics
simulation at various temperatures ranging from 10 to 1000K, starting from the native
structure. Structures collected every 100ps from these trajectories were quenched by energy
minimization. The optimized parameters are the energy-term weighting factors and the
residue-residue contact energies. The optimization procedure was based on the minimization
of an error rate function taking values from 0 to 1; the lowest value corresponds to the
requirement that the native structure is energetically the most favoured structure. A Monte
Carlo method was used to change the parameter values. An iterative scheme was designed
consisting of successive generation of decoy sets with energy parameter optimization at each
iteration step.
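The optimisation loop described above can be sketched as follows. This is a minimal illustration under stated assumptions and not the MMSIM code: the energy is taken to be a weighted sum of precomputed energy terms, each decoy is given as a hypothetical (rmsd, terms) pair, the error rate is taken to be the fraction of decoys beyond the rmsd cutoff whose energy is not at least `gap` above the native energy, and the Monte Carlo acceptance rule, step size and temperature are arbitrary choices. Only the weighting factors are varied here; the contact energies could be handled in the same way.

```python
import math
import random


def energy(weights, terms):
    """Total energy as a weighted sum of energy terms."""
    return sum(w * t for w, t in zip(weights, terms))


def error_rate(weights, native_terms, decoys, gap=1.0, rmsd_cutoff=3.0):
    """Fraction (0 to 1) of far decoys not separated from the native by `gap`."""
    e_native = energy(weights, native_terms)
    far = [terms for rmsd, terms in decoys if rmsd > rmsd_cutoff]
    if not far:
        return 0.0
    bad = sum(1 for terms in far if energy(weights, terms) < e_native + gap)
    return bad / len(far)


def optimise(weights, native_terms, decoys, steps=2000,
             step_size=0.1, temperature=0.05, seed=0):
    """Monte Carlo search over the weights; returns (best_weights, best_error)."""
    rng = random.Random(seed)
    current = list(weights)
    current_err = error_rate(current, native_terms, decoys)
    best, best_err = list(current), current_err
    for _ in range(steps):
        trial = list(current)
        k = rng.randrange(len(trial))
        trial[k] += rng.uniform(-step_size, step_size)
        trial_err = error_rate(trial, native_terms, decoys)
        delta = trial_err - current_err
        # Always accept improvements; accept deteriorations with a
        # Boltzmann-like probability.
        if delta <= 0 or rng.random() < math.exp(-delta / temperature):
            current, current_err = trial, trial_err
            if current_err < best_err:
                best, best_err = list(current), current_err
    return best, best_err
```

In the iterative scheme of the text, the weights returned by `optimise` would be used to generate a new decoy set, which is then fed back into the same function.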
The energy parameter optimization was carried out on a small 5-helix protein (1r69) in
order to test the performance of our procedure and evaluate the limits of our potential energy
function in discriminating the native structure. The first conclusions obtained were that
secondary-structure propensity potentials discriminate helices well (all 5 helices are well
predicted), but the residue-residue contact energies using only Cα atoms had no
discriminating power in favoring native-like contacts between distant residues along the
chain. Among 2000 structures obtained by molecular dynamics folding simulations with the
optimized parameters, starting from an unfolded structure, only six displayed native-like
structures. However, the introduction of an energy constraint favoring a TEF-like structure in
the folding simulations increased the number of native-like structures appreciably.
We have now developed the tools that permit us to find the optimal parameters of a
given energy function and to evaluate its performance. Our next step is to consider a model
with two points per residue, one corresponding to the Cα atom and the other to the side
chain, and to include a solvation energy term. This should increase the probability of
obtaining native-like structures for a given protein. Our second objective is to extend these
structures to a full-atomic model in order to further approach the native structure and to
facilitate comparisons to assessment methods of other groups. All these developments should
be especially useful for finding structures corresponding to sequences for which existing
homologues have less than 20% identity.
2. Common realisations
The web server front page is complete and available to everybody. It might be
useful to add links to other web sites that members of the consortium consider relevant.
Nikolaos Papandreou will take charge of this gateway. It was previously decided to link our
entries of the protein set to the PDB entries; it is much better instead to use PQS (Protein
Quaternary Structure) at EBI, as we are interested in the quaternary structures (one has to
check whether there is a copyright problem).
The logo is still awaiting proposals.
After discussion, we decided to change the dataset format that was agreed in
Paris. The Fasta format appears difficult to handle where structure is concerned, and
we decided to adopt the DSSP format. Jean-Marc Kwasigroch will send, within a couple of
weeks, a test set of ten PDB files in DSSP format. A certain number of columns will
already be used, and each group will have to decide on the number of columns it
needs. This format will have to be definitively fixed at the Manchester meeting.
3. Discussion of the Work Packages
We split into three parts to be more efficient. Groups 1, 2, 3, 4 and 5 discussed
WP2, fragments and structures. Groups 5, 6 and 7 discussed nucleus and structure,
i.e. WP3, and groups 4 and 6 discussed the fingerprints of WP4.
Groups 1, 2, 3 and 4
So far, on a reasonable data set, there seems to be a good correlation between topohydrophobic
positions and MIR, and between topohydrophobic positions and TEF. This will have to be
extended to the full database. Correlations will also have to be established between foldons,
protofragments and TEF. We will also have to find a consensus, which might be based on the
best guess.
Edward reported on the work of his student, E. Aharonovsky, on a vocabulary of
three-letter words that statistically display a preference to be at a distance of 25-30 residues
from one another – presumably the sequences of the locks closing the loops. The work is close to
completion, and the words (about 200 of the total of 8000 triplets) will soon be made available to the
collaborating groups.
Once we receive the 200 words from Edward, we will have to check whether or not they map
to the ends of the TEF.
Groups 5, 6 and 7
One has to discriminate among the structures to find the native ones. One wishes to extend the
database by a few entries (1ubq and 1cro, for instance). This has to be proposed fairly rapidly
to Nikolaos. Natively non-folding sequences might be interesting to study; although there is
no available structure, there is some experimental information that one could use.
The alignment viewers such as CINEMA and UTOPIA, freely available to the
consortium, will have to be improved. The latest version of Fold-X should be used.
Group 4 will gather all the data from the other groups and will see how they match the
fingerprints.
4. Next meeting
It will be held in Manchester in May 2004. Terri will organise it and will
fix the date in early January.
It will be important for Manchester to start producing pairwise correlations of the
different methods.
5. People present at Bruxelles meeting
This meeting was attended by the following people:
- Jacques Chomilier and Anne Lopes from Group 1
- Nikolaos Papandreou and Elias Eliopoulos from Group 2
- Edward Trifonov from Group 3
- Therese Attwood and Manuel Corpas from Group 4
- Marianne Rooman, Christophe Biot, Yves Dehouck, Jean-Marc Kwasigroch, Dimitri
Gilis and René Wintjens from Group 5
- Luis Serrano, Raphael Guerois, Frédéric Rousseau from Group 6
- David Perahia, Charles Robert and Liliane Mouawad from Group 7