Ab Initio Structure Prediction

Ab Initio Documentation in v2.2.0
Ab initio protein folding predicts 3-D structural models for a protein molecule from its sequence
information. The simulation is run in low-resolution mode. The resulting models are then clustered (see Clustering Scripts),
and the cluster centers are subjected to full-atom structure refinement (see Relax).
For more information about how this method works in Rosetta and relevant publications, please refer to the following
reference:
Rohl, C. A., Strauss, C. E., Misura, K. M., Baker, D. (2004). Protein
structure prediction using Rosetta. Methods Enzymol 383, 66-93.
-----------------------------------------------------------------------
In order to run Rosetta Ab initio mode, you need to prepare the
following input files:
(1) protein sequence fasta file;
(2) fragment library files;
(3) secondary structure prediction file (psipred, psipred_ss2 or jones).
Please see ../fragments/README.fragments for how to make fragment libraries;
that process also produces the psipred files you need.
The command line for Rosetta Ab initio is very simple:
$> rosetta.exe aa 2ptl _ -nstruct 9000 -silent
This requires 2ptl_.fasta, aa2ptl_.03(or 09)_05.200_v1_3, and
2ptl_.psipred (or 2ptl_.psipred_ss2 or 2ptl_.jones) at locations
specified accordingly in paths.txt.
The output from this command line is the score file aa2ptl_.sc and the
silent model output aa2ptl_.out. If you omit the -silent option, you will
instead get one file in normal pdb format for each model, such as
aa2ptl_0001.pdb through aa2ptl_9000.pdb. By comparison, the silent output
format records only backbone torsion angles and CA atom coordinates for
each model; since Rosetta folds protein structure models using ideal bond
geometries, the silent output file can later be extracted to normal pdb
files. The silent output file is much smaller than the normal pdbs, and
the next stage of clustering also requires the silent format as input.
Note that Rosetta has a hard-coded limit on the number of digits in the
model index number (currently four), which means that at most 9999
models can be output in each run. To get more models, change the series
code (aa in the above example) to another two-letter code, which gives
the score file and the silent output file a different prefix. Afterwards,
you can combine several silent output files into one file for the next
stage of clustering.
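As a rough illustration of working around the 9999-model limit, the sketch below splits a large request across several runs, one series code per run. The helper name plan_runs and the choice to walk through series codes alphabetically are assumptions made for this example; only the command line layout is taken from the example above.

```python
from itertools import product
from string import ascii_lowercase

MAX_MODELS_PER_RUN = 9999  # four-digit model index limit described above


def plan_runs(total_models, protein="2ptl", chain="_"):
    """Split total_models across runs, using a new two-letter series code per run.

    Hypothetical helper: it only builds command-line strings, it does not run Rosetta.
    """
    # aa, ab, ac, ... zz (676 codes; more runs than that would exhaust the generator)
    series_codes = ("".join(pair) for pair in product(ascii_lowercase, repeat=2))
    commands = []
    remaining = total_models
    while remaining > 0:
        nstruct = min(remaining, MAX_MODELS_PER_RUN)
        series = next(series_codes)
        commands.append(
            f"rosetta.exe {series} {protein} {chain} -nstruct {nstruct} -silent"
        )
        remaining -= nstruct
    return commands


for command in plan_runs(25000):
    print(command)
```

Each printed line can then be run separately, and the resulting silent output files combined for clustering as described above.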
------------------------------------------------------------------------
Pose-based Ab initio protein structure prediction
References: Bradley, P. and Baker, D. (2006). Improved beta-protein
structure prediction by multilevel optimization of nonlocal strand
pairings and local backbone conformation. Proteins. 65(4): 922-929.
In addition to the functionality of classic Rosetta Ab initio folding,
pose-based ab initio protocols allow users to specify a particular
beta-strand pairing topology to start from. This procedure is usually
performed after identifying several preferred folding topologies from
the model population produced by standard Ab initio folding, in order to
increase the sampling density for certain strand pairings.
A. To perform standard ab initio folding using the pose-based representation
$> rosetta.exe aa 2ptl _ -pose_abinitio -nstruct 2000
B. To perform ab initio folding using a specified strand pairing
$> rosetta.exe aa 2ptl _ -jumping -nstruct 2000 -pairing_file pairing-file
-close_chainbreaks -apply_filters
----------------------------------------------------------------------------
PAIRINGS FILE FORMAT:
one line per pairing, 4 fields per line, white-space delimited
format: "pos1 pos2 orientation pleating"
pos1 and pos2 are the sequence numbers of the beta-paired positions
orientation is the beta-strand orientation:
"1" for antiparallel, "2" for parallel, or just "A" or "P"
pleating determines the pleat of the beta-carbons:
"1" if the N and O of pos1 are pointed away from pos2
"2" if the N and O of pos1 are pointed toward pos2
e.g., in the antiparallel case, a pleat of 2 means that there are
two backbone-backbone hydrogen bonds between pos1 and pos2.
In the parallel case, a pleat of 2 means that pos1 is H-bonding with
pos2-1 and pos2+1.
If you check out rosetta_benchmarks, the directory 1d3z/ contains
a pairing file "pairings.dat" with two pairings in it.
These are native pairings, i.e. they match the pdb structure
1d3z.pdb in the same directory. The Fortran version had to be told
how many pairings to expect; that's why the first line
has a "2" on it. This is no longer necessary.
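To make the format concrete, here is a minimal sketch of a reader for a pairings file. It is not Rosetta's own parser: the names Pairing and parse_pairings are invented for this example, the sample positions are made up (they are not the 1d3z pairings), and skipping any line that does not have exactly four fields is an assumption used to tolerate the legacy count line.

```python
from typing import NamedTuple

# Both the numeric and the letter codes described above are accepted.
ORIENTATIONS = {
    "1": "antiparallel", "A": "antiparallel",
    "2": "parallel", "P": "parallel",
}


class Pairing(NamedTuple):
    pos1: int
    pos2: int
    orientation: str
    pleating: int


def parse_pairings(text):
    """Parse 'pos1 pos2 orientation pleating' lines into Pairing records."""
    pairings = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) != 4:
            # Assumption: blank lines and the legacy pairing-count line are ignored.
            continue
        pos1, pos2, orientation, pleating = fields
        if orientation.upper() not in ORIENTATIONS:
            raise ValueError(f"unknown orientation: {orientation!r}")
        if pleating not in ("1", "2"):
            raise ValueError(f"unknown pleating: {pleating!r}")
        pairings.append(
            Pairing(int(pos1), int(pos2), ORIENTATIONS[orientation.upper()], int(pleating))
        )
    return pairings


# Made-up example content, including a legacy count line.
EXAMPLE = "2\n10 25 1 2\n30 45 P 1\n"
for pairing in parse_pairings(EXAMPLE):
    print(pairing)
```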
COMMAND LINE SYNTAX
-pairing_file <pairings-file>
With no other arguments, Rosetta will try to build decoys with
*all* the pairings in the file. You can also specify what
kind of sheet topology Rosetta should try to construct:
-sheet1 <N1> -sheet2 <N2> ... -sheetk <Nk>
Here Nj is the number of strands in sheet number j, so the
number of forced pairings in that sheet will be (Nj-1). The sheet can
gain other strands during the folding simulation -- this
just specifies how many Rosetta should actually build from
the start using the broken-chain machinery.
So the total number of forced pairings will be:
N1-1 + N2-1 + N3-1 + ... + Nk-1
For example, to specify two strand pairings in two different
sheets, use the args "-sheet1 2 -sheet2 2"
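The arithmetic above can be checked with a few lines of Python. The function names are hypothetical; the sketch only counts pairings and formats the -sheetj arguments in the way shown above.

```python
def forced_pairings(strands_per_sheet):
    """Total forced pairings: (N1-1) + (N2-1) + ... + (Nk-1)."""
    if any(n < 1 for n in strands_per_sheet):
        raise ValueError("each sheet needs at least one strand")
    return sum(n - 1 for n in strands_per_sheet)


def sheet_args(strands_per_sheet):
    """Format the strand counts as -sheet1 <N1> -sheet2 <N2> ... arguments."""
    return " ".join(
        f"-sheet{j} {n}" for j, n in enumerate(strands_per_sheet, start=1)
    )


# Two sheets with two strands each: one forced pairing per sheet.
print(sheet_args([2, 2]), "->", forced_pairings([2, 2]), "forced pairings")
```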
------------------------------------------------------------
To extract decoys from the silent *.out file
$> rosetta.gcc -extract -s your.out -all
or you can replace -all with -l <tag_file> to extract only a subset of
the models in the silent out file. A tag is the last column in a line
starting with "SCORE". If a tag starts with "F_", the model failed
to pass certain score filters or has some bad features, and
therefore should be dropped. "S_" models are valid models generated by
the Rosetta simulation.
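A small script can collect the tags of the valid models for use with -l. The sketch below relies only on what is stated above (the tag is the last column of a line starting with "SCORE", and valid tags start with "S_"). The sample text, its column names, and the function names are invented for illustration, and the layout expected for the tag file itself is not specified here, so check it before use.

```python
def read_tags(silent_text):
    """Return the last column of every SCORE line that holds an S_ or F_ tag."""
    tags = []
    for line in silent_text.splitlines():
        if not line.startswith("SCORE"):
            continue
        fields = line.split()
        tag = fields[-1]
        # Skips column-header lines, whose last field is a label rather than a tag.
        if tag.startswith(("S_", "F_")):
            tags.append(tag)
    return tags


def valid_tags(tags):
    """Keep only the models that passed the filters (tags starting with S_)."""
    return [tag for tag in tags if tag.startswith("S_")]


# Made-up excerpt; a real silent file has more columns and more line types.
SAMPLE = (
    "SCORE: score rms description\n"
    "SCORE: -35.2 4.1 S_0001\n"
    "SCORE: -10.0 9.8 F_0002\n"
    "SCORE: -40.7 3.3 S_0003\n"
)
print("\n".join(valid_tags(read_tags(SAMPLE))))
```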
Ab initio structure prediction uses the sequence of a protein to generate models of its final folded structure.
Constraints
Constraints exist to allow you to set hard limits on particular features of accepted decoys, in particular the
conformation of particular residues and atoms. The energy function sets no hard boundaries on which decoys are accepted:
every decoy gets a score, even the worst ones. This can lead to a proliferation of decoys with bad
scores, which obscure better decoys. Additionally, you often know information about the environment your protein exists or
will exist in that is not yet taken into account by the energy function, such as bounds on the size and position of particular
residues or atoms. Constraints allow you to take advantage of such specialized knowledge.
It is worth emphasizing that constraints can be an art more than a science. They exist precisely to allow you to work in
areas that are not well defined enough in the energy function, but it is often counter-productive to try and outsmart the
energy function. Correct use of constraints is often a matter of experience.
There are a variety of different types of constraints, each of which comes with particular files for describing them. The types
are described below, while the file formats are discussed under the section called “Running Ab Initio” along with the
command lines that take advantage of them.
Barcode Constraints
Filters
Filters are a miscellany of conditions applied during postprocessing to eliminate decoys which don't fit some criterion of
scoring. As with constraints they exist to handle specific situations which are not being accounted for by the general energy
function, and are one of the tools of a researcher working with a protein whose precise context is not well known, or for
which the standard ab initio run is not producing good enough results. They differ from constraints in that particular filters
are associated with particular terms in the scoring output, and they are applied after the scoring is complete.
While some filters are built in to Rosetta, you may want to write external filters or alter the Rosetta source to add filters for
postprocessing.
Algorithm
The overall process of ab initio structure prediction is as follows:
1. Fragment generation.
2. Fragment insertion.
3. Scoring.
The Monte Carlo fragment assembly protocol used to generate structures and the knowledge-based potential used to guide
simulations are described in Simo1997 and Simo1999. The bond angles and bond lengths are kept fixed, and sidechains
are approximated by interaction centers at their centroids, so the conformation of a protein is completely specified by the
backbone torsion angles phi, psi, and omega. The energy function used in this phase of ab initio prediction is called the
centroid energy function or the low-resolution energy function (see the section called “Centroid resolution”).
The general algorithm is as follows:
1. The starting configuration is a completely extended chain.
2. Backbone sampling by fragment insertion is carried out by:
   1. Randomly selecting a position in the protein chain.
   2. Randomly selecting a fragment from the library at this position.
   3. Replacing the previous values of the backbone torsion angles for this part of the protein with
      those from the fragment.
   4. During the first phase of the calculation, 9 residue fragments are used to build a broad
      skeleton; during the second phase, 3 residue fragments are used to refine that structure.
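The steps above can be sketched in a few lines of Python. This is a toy illustration, not Rosetta's implementation: the energy function is a stand-in that merely rewards helix-like angles, the fragment library is random, the extended-chain angles are placeholder values, and the Metropolis acceptance rule is an assumption (the text above only says the protocol is Monte Carlo). All function names are invented for this example.

```python
import math
import random

EXTENDED = (180.0, 180.0, 180.0)  # placeholder (phi, psi, omega) for an extended residue


def insert_fragment(torsions, fragment, start):
    """Return a copy of torsions with the fragment's angles written at start."""
    updated = list(torsions)
    updated[start:start + len(fragment)] = fragment
    return updated


def fragment_assembly(length, library, energy, cycles, size,
                      torsions=None, temperature=2.0, rng=random):
    """Run `cycles` insertion moves using fragments of `size` residues."""
    current = list(torsions) if torsions is not None else [EXTENDED] * length
    current_energy = energy(current)
    for _ in range(cycles):
        start = rng.randrange(length - size + 1)           # random position
        fragment = rng.choice(library[size][start])        # random fragment there
        trial = insert_fragment(current, fragment, start)  # replace torsion angles
        delta = energy(trial) - current_energy
        # Assumed Metropolis criterion: always accept downhill, sometimes uphill.
        if delta <= 0 or rng.random() < math.exp(-delta / temperature):
            current, current_energy = trial, current_energy + delta
    return current, energy(current)


def toy_energy(torsions):
    """Stand-in for the centroid energy; NOT Rosetta's scoring function."""
    return sum(((phi + 60) / 60) ** 2 + ((psi + 45) / 60) ** 2
               for phi, psi, _ in torsions)


def toy_library(length, sizes=(9, 3), per_position=25, seed=0):
    """Random fragments standing in for a real fragment library."""
    rng = random.Random(seed)

    def fragment(size):
        return [(rng.uniform(-180, 180), rng.uniform(-180, 180), 180.0)
                for _ in range(size)]

    return {size: [[fragment(size) for _ in range(per_position)]
                   for _ in range(length - size + 1)]
            for size in sizes}


rng = random.Random(1)
library = toy_library(30)
# Phase one: 9-residue fragments; phase two: 3-residue fragments.
coarse, _ = fragment_assembly(30, library, toy_energy, cycles=500, size=9, rng=rng)
refined, final_energy = fragment_assembly(30, library, toy_energy, cycles=500,
                                          size=3, torsions=coarse, rng=rng)
print(round(final_energy, 2))
```

The two calls mirror the two phases: the second starts from the torsion angles produced by the first rather than from the extended chain.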
After an ensemble of structures has been generated using the low-resolution algorithm, several of them are selected for
full-atom refinement. Full-atom refinement is necessary to identify the native structure, as the centroid energy function
does not have the necessary information or discriminative power to distinguish between native and non-native structures.
Put another way, discrimination between the lowest energy native structure and alternative structures requires a high
degree of accuracy for both the backbone and side chain atoms. A more detailed discussion of the issues in ab initio
structure prediction is provided in Brad2005. It is common to take a selection of the lowest 5% of structures by centroid
energy and use these as starting points for full-atom refinement.
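Selecting the lowest 5% by centroid energy is a simple sort-and-slice. The sketch below assumes the energies have already been collected into a mapping from decoy tag to centroid energy; the function name, the rounding rule for the cutoff, and the sample values are illustrative only.

```python
def lowest_fraction(energies, fraction=0.05):
    """Return the tags of the lowest-energy fraction of decoys, best first.

    energies maps a decoy tag to its centroid energy (lower is better).
    """
    # Assumption: round to the nearest whole decoy, but always keep at least one.
    count = max(1, round(len(energies) * fraction))
    ranked = sorted(energies, key=energies.get)
    return ranked[:count]


# Made-up energies for 40 decoys: S_0001 is the lowest, S_0040 the highest.
energies = {f"S_{i:04d}": -100.0 + i for i in range(1, 41)}
print(lowest_fraction(energies))
```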
As has been noted in many publications on ab initio structure prediction, the vast size of the conformational landscape for a
given protein makes searching this landscape very difficult. Rosetta allows searches through this landscape to be
constrained through a search process known as a barcode (see the section called “Barcode constraints”).
Fragment Generation
Running Ab Initio
This section describes the basic elements of running the ab initio protocol, and contains example command lines which
illustrate common uses.
The first step is to make fragment libraries based on the sequence of the protein whose structure is being predicted. This
can be done using the rosetta fragment server or the make_fragments.pl script that comes with the
rosetta_fragments package. The full process of fragment generation is described in the section entitled Fragment
Generation.
Once the fragment files are available, the paths.txt file must be edited so that the proper paths to the input sequence,
rosetta_database, and fragment files are specified. The fragment file names at the bottom of the file must
correspond to the fragment files made above.
Ab initio folding can now be invoked using:
rosetta <series> <protein> <chain>
where <series> is a two letter identifier for the run, <protein> is a four letter PDB code for the protein, and
<chain> is a one letter identifier for the chain. A FASTA format file with the query sequence called
<protein><chain>.fasta must be present in the directory specified in paths.txt.
The resulting models will be named <series><protein><chain>_NNN.pdb, where NNN is a number starting at 001 and
increasing to 999. If you want to generate more than 999 decoys for one particular sequence, you will need
to modify the <series>.
Common Command-line options
-increase_cycles <factor>
Scale the number of cycles at each stage of ab initio by <factor>.
-number_3mer_frags <# fragments>
Changes the number of 3-mer fragments used to build the model. Default: 200. This number was selected as a
cutoff point below which the quality of the fragments decays.
-number_9mer_frags <# fragments>
Changes the number of 9-mer fragments used to build the model. Default: 25. This number was selected as a
cutoff point below which the quality of the fragments decays.
-abinitio_temperature <temperature>
Temperature used in simulated annealing step. Default: 2.0
-fast
Runs protocol without minimization or gradients, trading accuracy for speed. For NOE data only, -fast yields
essentially the protocol published in Bowe2000. For RDC data only, -fast omits the refinement step included in
the examples published in Rohl2002.
Inputs - Required
<PDB file> (.pdb)
The PDB file describing the protein to derive the structure of. Specified by -protein or as the 2nd
command-line argument. Format: See the section called “PDB (decoys)”.
<FASTA (sequence) file> (.fasta)
The sequence description. Specified by the 9th line of paths.txt (see the section called “Paths
(text)”). Format: See the section called “Sequences (FASTA)”. Default: <PDB code><chain>.fasta
in the directory specified by paths.txt. There is no way to change the name expected.
Inputs - Optional
<Sequence file> (.dat)
The sequence description. Specified by the 9th line of paths.txt (see the section called “Paths (text)”).
Sequence files ending in .dat are Rosetta's internal sequence format. They are deprecated: prefer .fasta
files. Note: You should either use .dat or .fasta sequence files, not both. Format: see the section called
“Sequences (.dat)”. Default: <PDB code><chain>.dat in the directory specified by paths.txt.
There is no way to change the name expected.
Outputs
<PDB file>
The PDB file containing the generated decoy. This file's name is the series, protein code, and chain
concatenated together and suffixed with a number indicating which iteration of the decoy this is. So for the
command-line:
rosetta aa 2ptl A
the resulting output files will be named aa2ptlA_001.pdb up to aa2ptlA_999.pdb. This implies that
at most 999 decoys can be generated for each input PDB within one directory, but this is precisely what
the series is for: by altering it you can have considerably more. A PDB file generated as output from ab initio
contains not only the plain PDB but also extended data containing the overall scores, an energy table for each
residue, the average energies for each residue, the average energies for each secondary structure, decoy angles vs.
starting chi angles, fraction of chi angles viewed as correct, docking positions, etc. (see 1brs.pdb for
more). Format: see the section called “PDB (decoys)”.
Reweighting the Scoring Command-line options
All of these options modify the effects of one or more scoring terms on the final output scores. For more information on
the scoring terms, see Scoring.
-vdw_reweight <factor>
Scale the contribution of the van der Waals term to the score by <factor>. Default: 1.0
-env_reweight <factor>
Scale the contribution of the environment term to the score by <factor>. Default: 1.0
-pair_reweight <factor>
Scale the contribution of the pair term to the score by <factor>. Default: 1.0
-cb_reweight <factor>
Scale the contribution of the C-beta (packing density) term to the score by <factor>. Default: 1.0
-sheet_reweight <factor>
Scale the contribution of the sheet term to the score by <factor>. Default: 1.0
-hs_reweight <factor>
Scale the contribution of the helix-strand term to the score by <factor>. Default: 1.0
-ss_reweight <factor>
Scale the contribution of the strand-strand term to the score by <factor>. Default: 1.0
-rsigma_reweight <factor>
Scale the contribution of the R-sigma (strand pair distance/register) term to the score by <factor>. Default: 1.0
-rg_reweight <factor>
Scale the contribution of the radius of gyration term to the score by <factor>. Default: 1.0
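One way to picture what these options do is to treat the final score as a weighted sum of the individual terms, with each weight defaulting to 1.0. That weighted-sum form is an assumption made for this sketch (see Scoring for how the terms are actually combined), and the term values below are made up.

```python
# One weight per reweighting option above; all default to 1.0.
DEFAULT_WEIGHTS = {
    "vdw": 1.0, "env": 1.0, "pair": 1.0, "cb": 1.0, "sheet": 1.0,
    "hs": 1.0, "ss": 1.0, "rsigma": 1.0, "rg": 1.0,
}


def total_score(terms, **reweight):
    """Weighted sum of scoring terms; reweight overrides individual weights."""
    unknown = set(reweight) - set(DEFAULT_WEIGHTS)
    if unknown:
        raise KeyError(f"unknown scoring terms: {sorted(unknown)}")
    weights = {**DEFAULT_WEIGHTS, **reweight}
    return sum(weights[name] * value for name, value in terms.items())


# Made-up term values for a single decoy.
terms = {"vdw": 2.0, "env": -10.0, "rg": 4.0}
print(total_score(terms))          # all weights at their default of 1.0
print(total_score(terms, rg=0.5))  # analogous to passing -rg_reweight 0.5
```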
Secondary Structures Command-line options
All of these options favor different secondary structures by altering the effect on the score given to residues which become
part of particular secondary structures.
-rsd_wt_helix <factor>
Scale the environment, pair and c-beta terms for helix residues by <factor>.
-rsd_wt_strand <factor>
Scale the environment, pair and c-beta scores for strand residues by <factor>.
-rsd_wt_loop <factor>
Scale the environment, pair and c-beta scores for loop residues by <factor>.
-rand_envpair_res_wt <factor>
Scale the environment, pair, and c-beta scores for all residues by random factors (between 0.5 and 1.2).
-rand_SS_wt <factor>
Scale the helix-strand, strand-strand, sheet and rsigma scores for all residues by random factors (between 0.5 and 1.5).
-random_parallel_antiparallel
For each decoy, randomly choose whether to drastically upweight long-range parallel strand pairings by a random factor of
up to 10-fold, and downweight anti-parallel pairings by a similar amount, or vice versa.
-strand_dist_cutoff <distance>
Specify the distance cutoff in Angstroms between strand dimers within which they are designated paired.
Contacts Command-line options
All of these options modify how contacts are scored. (see Contacts).
-score_contact_flag
Turn contact scoring on.
-score_contact_file <filename>
Specify the name of a file containing the probabilities of contacts forming between particular residues.
Format: see the section called “Contacts”. Default: <protein><chain>.contact
-set_contact_weight <factor>
Scale the contribution of successful contacts to the score by <factor>.
-score_contact_threshold <threshold>
Prediction probabilities above <threshold> result in a bonus to the overall score. Probabilities below this
threshold result in a penalty. Default: 0.5.
-score_contact_seq_sep <separation>
Residues separated in sequence by a count of at least <separation> are the only ones scored for
contacts. Default: 2. Note: You can lower this, but not raise it.
-score_contact_calpha
Use distances between alpha carbons, not between centroids, to assign bonuses to score for
contacts. This can radically reinterpret the meaning of the distance entry in the contact file, so
it is advisable to use a different contact file if particular distances are being weighted for.
-score_contact_distance <distance>
Give distances greater than <distance> in Angstroms a bonus to the overall score.
Default: 8 (centroid/centroid), 11 (C-alpha/C-alpha)
-score_contact_readindist
Instead of using -score_contact_distance, read in distance from the fourth column
of .contact (see the section called “Contacts”). This permits specific residues to be given
particular bonuses or penalties. The absence of a fourth column will result in using the defaults
set by -score_contact_distance.
Interpreting Results
Even with a relatively high level of accuracy in model building, most decoys generated by the ab initio process are not
going to be accurate enough. The roughness of the search space allows for many decoys to get trapped in local energy
minima, in configurations which turn out not to be stable. This means that you will generally need to do many runs in order
to produce a large enough sample to find a stable enough model. Out of a large sample one would expect a small fraction
to have low (good) enough score to be a likely structure. There is no tried and true method for finding reliable decoys, but
this section will attempt to get you started in the right direction.
Having been designed to model actual proteins as closely as possible, the energy function tends to produce the lowest
(best) scores for naturally occurring proteins. When the final score of a known structure is plotted against another measure
of structural accuracy (such as RMSD or radius of gyration), the value will appear in the bottom left of the graph. Most
predictions will not fare so well, having higher scores and also deviating from the native along the other axis. Their
points will tend to appear in clouds along the upper regions of the graph. Decoys which are approaching the actual structure
will form a funnel down the left side of the graph as their external measure comes very close to the native's (e.g. as
RMSD approaches a reliable threshold) and the final score gets lower and lower.
In most cases of course you are using structure prediction precisely because you do not know the actual structure. However
the funnel described is one of the major signs that you are getting models that are likely to be stable. The important thing
is to select an axis (and in fact more than one) which is known to be a good indicator of stability to plot against the final
Rosetta score. Low scores can be deceptive, but low scores that appear in a funnel against several axes are likely to be
indicative of a stable decoy.
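As a rough numerical companion to looking at such plots, one can compute a rank correlation between the final score and the chosen axis: a strongly positive value means that lower scores tend to go with lower values of the other measure, which is what a funnel looks like. Using Spearman correlation for this is a heuristic suggested here, not a method prescribed by Rosetta, and the sample numbers are made up.

```python
import math


def ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=values.__getitem__)
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average
        i = j + 1
    return result


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant data")
    return cov / math.sqrt(var_x * var_y)


def spearman(xs, ys):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(xs), ranks(ys))


# Made-up decoys: final score and RMSD to a reference structure.
scores = [-52.1, -48.3, -40.0, -35.5, -30.2]
rmsds = [1.8, 2.5, 6.1, 5.0, 9.4]
print(spearman(scores, rmsds))
```

A value near zero or below suggests that low scores are not concentrating at low values of that axis, so the low-scoring decoys deserve more scrutiny.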
In addition to the final energy value, ab initio scoring generates a great deal of auxiliary information to help guide you
in the process of selecting possible axes.
Protocol used in CASP6