Ab Initio Structure Prediction

Ab Initio Documentation in v2.2.0

Ab initio protein folding predicts 3-D structural models for a protein molecule from its sequence information. The simulation is done in low-resolution mode. The resulting models are then clustered (see Clustering Scripts) and the cluster centers are subjected to full-atom structure refinement (see Relax). For more information about how this method works in Rosetta, and for relevant publications, please refer to the following reference:

Rohl, C. A., Strauss, C. E., Misura, K. M., Baker, D. (2004). Protein structure prediction using Rosetta. Methods Enzymol 383, 66-93.

-----------------------------------------------------------------------
To run Rosetta Ab initio mode, you need to prepare the following input files:
(1) protein sequence fasta file;
(2) fragment library files;
(3) secondary structure prediction file (psipred, psipred_ss2 or jones).

Please see ../fragments/README.fragments for how to make fragment libraries; doing so also produces the psipred files needed.

The command line for Rosetta Ab initio is very simple:

$> rosetta.exe aa 2ptl _ -nstruct 9000 -silent

This requires 2ptl_.fasta, aa2ptl_.03(or 09)_05.200_v1_3, and 2ptl_.psipred (or 2ptl_.psipred_ss2 or 2ptl_.jones) at the locations specified in paths.txt. The output from this command line is the score file aa2ptl_.sc and the silent model output aa2ptl_.out. If you omit the -silent option, you will instead get a normal pdb file for each model, such as aa2ptl_0001.pdb through aa2ptl_9000.pdb. By comparison, the silent output format records only the backbone torsion angles and CA atom coordinates for each model. Because Rosetta folds protein structure models using ideal bond geometries, the silent output file can later be extracted to normal pdb files. The silent output file is much smaller than the equivalent pdb files, and the next stage, clustering, also requires the silent format as input.
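
The naming scheme above can be summarised in a few lines of Python. This is only a sketch built from the 2ptl example in this section: the helper name abinitio_file_names is hypothetical and not part of Rosetta, the fragment library files are left out, and the list returned for a run without -silent covers only the per-model pdb files.

```python
def abinitio_file_names(series, protein, chain, nstruct, silent=True):
    """Return (inputs, outputs) file names for one classic ab initio run."""
    target = f"{protein}{chain}"      # e.g. "2ptl_"
    prefix = f"{series}{target}"      # e.g. "aa2ptl_"
    # Sequence file plus one secondary structure prediction file;
    # .psipred_ss2 or .jones may be used in place of .psipred.
    inputs = [f"{target}.fasta", f"{target}.psipred"]
    if silent:
        # Score file and silent model output.
        outputs = [f"{prefix}.sc", f"{prefix}.out"]
    else:
        # One pdb per model with a four-digit index, as in aa2ptl_0001.pdb.
        outputs = [f"{prefix}{i:04d}.pdb" for i in range(1, nstruct + 1)]
    return inputs, outputs


inputs, outputs = abinitio_file_names("aa", "2ptl", "_", 9000)
print(inputs)
print(outputs)
```

Calling the helper with silent=False lists the individual pdb names instead, which can be handy for checking that a run produced every expected model.
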
Note that Rosetta has a hard-coded limit on the number of digits in the model index number (currently four), which means that at most 9999 models can be output in each run. To get more models, change the series code (aa in the above example) to another two-letter code, which gives the score file and the silent output file a different prefix. Afterwards, you can combine several silent output files into one file for the next stage, clustering.

------------------------------------------------------------------------
Pose-based Ab initio protein structure prediction

References:
Bradley, P. and Baker, D. (2006). Improved beta-protein structure prediction by multilevel optimization of nonlocal strand pairings and local backbone conformation. Proteins. 65(4): 922-929.

In addition to the functionality of classic Rosetta Ab initio folding, the pose-based ab initio protocols allow users to specify a beta-strand pairing topology to start from. This procedure is usually performed after standard Ab initio folding has identified several preferred folding topologies in the model population, in order to increase the sampling density for certain strand pairings.

A. To perform standard ab initio folding using the pose-based representation

$> rosetta.exe aa 2ptl _ -pose_abinitio -nstruct 2000

B. To perform ab initio folding using specified strand pairings

$> rosetta.exe aa 2ptl _ -jumping -nstruct 2000 -pairing_file pairing-file -close_chainbreaks -apply_filters

----------------------------------------------------------------------------
PAIRINGS FILE FORMAT:

one line per pairing, 4 numbers per line, white-space delimited

format: "pos1 pos2 orientation pleating"

pos1 and pos2 are the sequence numbers of the beta-paired positions

orientation is the beta-strand orientation:
  "1" for antiparallel, "2" for parallel
  or just "A" or "P"

pleating determines the pleat of the beta-carbons:
  "1" if the N and O of pos1 are pointed away from pos2
  "2" if the N and O of pos1 are pointed toward pos2

e.g., in the antiparallel case, a pleat of 2 means that there are two backbone-backbone hydrogen bonds between pos1 and pos2. In the parallel case, a pleat of 2 means that pos1 is H-bonding with pos2-1 and pos2+1.

If you check out rosetta_benchmarks, the directory 1d3z/ contains a pairing file "pairings.dat" with two pairings in it. These are native pairings, i.e. they match the pdb structure 1d3z.pdb in the same directory. In Fortran you had to tell Rosetta how many pairings to expect; that's why the first line has a "2" on it. This is no longer necessary.

COMMAND LINE SYNTAX

-pairing_file <pairings-file>

With no other arguments, Rosetta will try to build decoys with *all* the pairings in the file.

You can also specify what kind of sheet topology Rosetta should try to construct:

-sheet1 <N1> -sheet2 <N2> ... -sheetk <Nk>

Here Nj is the number of strands in sheet number j, so the number of forced pairings in that sheet will be (Nj-1). The sheet can gain other strands during the folding simulation -- this only specifies how many strands Rosetta should actually build from the start using the broken-chain code. The total number of forced pairings will therefore be:

N1-1 + N2-1 + N3-1 + ... + Nk-1

For example, to specify two strand pairings in two different sheets, use the args "-sheet1 2 -sheet2 2"

------------------------------------------------------------
To extract decoys from the silent *.out file

$> rosetta.gcc -extract -s your.out -all

or replace -all with -l <tag_file> to extract only a subset of the models in the silent output file. A tag is the last column in a line starting with "SCORE". If a tag starts with "F_", the model failed to pass certain score filters or has some bad features, and should therefore be dropped. "S_" models are valid models generated by the Rosetta simulation.

Ab initio structure prediction uses the sequence of a protein to generate models of its final folded structure.

Constraints

Constraints allow you to set hard limits on which decoys are accepted, in particular on the conformation of particular residues and atoms. The energy function sets no hard boundaries on what decoys are accepted: every decoy gets a score, even the worst ones. This can lead to a proliferation of decoys with bad scores, which obscure the better decoys. Additionally, you often know information about the environment your protein exists or will exist in that is not yet taken into account by the energy function, such as bounds on the size and position of particular residues or atoms. Constraints allow you to take advantage of such specialized knowledge.

It is worth emphasizing that using constraints can be more an art than a science. They exist precisely to allow you to work in areas that are not well enough defined in the energy function, but it is often counter-productive to try to outsmart the energy function. Correct use of constraints is often a matter of experience.

There are a variety of different types of constraints, each of which comes with particular files for describing it.
The types are described below, while the file formats are discussed under the section called “Running Ab Initio” along with the command lines that take advantage of them.

Barcode Constraints

Filters

Filters are a miscellany of conditions applied during postprocessing to eliminate decoys which don't meet some scoring criterion. As with constraints, they exist to handle specific situations that are not accounted for by the general energy function. They are one of the tools of a researcher working with a protein whose precise context is not well known, or for which the standard ab initio run is not producing good enough results. They differ from constraints in that particular filters are associated with particular terms in the scoring output, and they are applied after the scoring is complete. While some filters are built in to Rosetta, you may want to write external filters or alter the Rosetta source to add filters for postprocessing.

Algorithm

The overall process of ab initio structure prediction is as follows:

1. Fragment generation.
2. Fragment insertion.
3. Scoring.

The Monte Carlo fragment assembly protocol used to generate structures, and the knowledge-based potential used to guide simulations, are described in Simo1997 and Simo1999. The bond angles and bond lengths are kept fixed, and side chains are approximated by interaction centers at their centroids, so the conformation of a protein is completely specified by the backbone torsion angles phi, psi, and omega. The energy function used in this phase of ab initio prediction is called the centroid energy function or the low-resolution energy function (see the section called “Centroid resolution”).

The general algorithm is as follows:

1. The starting configuration is a completely extended chain.
2. Backbone sampling by fragment insertion is carried out by:
   1. Randomly selecting a position in the protein chain.
   2. Randomly selecting a fragment from the library at this position.
   3. Replacing the previous values of the backbone torsion angles for this part of the protein with those from the fragment.
   4. During the first phase of the calculation, 9 residue fragments are used to build a broad skeleton; during the second phase, 3 residue fragments are used to refine that structure.

After an ensemble of structures has been generated using the low-resolution algorithm, several of them are selected for full-atom refinement. Full-atom refinement is necessary to identify the native structure, as the centroid energy function does not have the necessary information or discriminative power to distinguish between native and non-native structures. Put another way, discrimination between the lowest energy native structure and alternative structures requires a high degree of accuracy for both the backbone and side chain atoms. A more detailed discussion of the issues in ab initio structure prediction is provided in Brad2005.

It is common to take the lowest 5% of structures by centroid energy and use these as starting points for full-atom refinement.

As has been noted in many publications on ab initio structure prediction, the vast size of the conformational landscape for a given protein makes searching this landscape very difficult. Rosetta allows searches through this landscape to be constrained through a search process known as a barcode (see the section called “Barcode constraints”).

Fragment Generation

Running Ab Initio

This section describes the basic elements of running the ab initio protocol, and contains example command lines which illustrate common uses.

The first step is to make fragment libraries based on the sequence of the protein whose structure is being predicted. This can be done using the rosetta fragment server or the make_fragments.pl script that comes with the rosetta_fragments package. The full process of fragment generation is described in the section entitled Fragment Generation.
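
To make the fragment-insertion loop from the Algorithm section concrete, here is a minimal Python sketch. It is not Rosetta's implementation. The torsion values for the extended chain and for the helical fragments, the toy library, the toy score (a stand-in for the centroid energy function) and the Metropolis acceptance rule are all assumptions chosen for illustration. Only the overall shape comes from the text above: start extended, pick a random position, pick a random fragment at that position, replace the torsions, and use 9-residue fragments before 3-residue fragments.

```python
import math
import random

EXTENDED = (-150.0, 150.0, 180.0)  # assumed (phi, psi, omega) of an extended residue
HELIX = (-60.0, -45.0, 180.0)      # assumed helical torsions for the toy library


def toy_library(n_res, frag_len):
    """Hypothetical library: one all-helix fragment per start position."""
    return {start: [[HELIX] * frag_len] for start in range(n_res - frag_len + 1)}


def toy_score(torsions):
    """Hypothetical stand-in for the centroid energy: count non-helical residues."""
    return sum(1 for t in torsions if t != HELIX)


def fragment_phase(torsions, library, frag_len, n_moves, score, temperature, rng):
    """Run n_moves fragment insertions with an (assumed) Metropolis criterion."""
    current = score(torsions)
    for _ in range(n_moves):
        start = rng.randrange(len(torsions) - frag_len + 1)  # random position
        fragment = rng.choice(library[start])                # random fragment there
        # Replace the torsions of this window with those of the fragment.
        trial = torsions[:start] + list(fragment) + torsions[start + frag_len:]
        trial_score = score(trial)
        delta = trial_score - current
        if delta <= 0 or rng.random() < math.exp(-delta / temperature):
            torsions, current = trial, trial_score
    return torsions


def fold(n_res, n_moves=200, temperature=2.0, seed=0):
    """Fold a chain of n_res residues (n_res must be at least 9)."""
    rng = random.Random(seed)
    torsions = [EXTENDED] * n_res        # start from a completely extended chain
    for frag_len in (9, 3):              # 9-residue phase, then 3-residue phase
        library = toy_library(n_res, frag_len)
        torsions = fragment_phase(torsions, library, frag_len, n_moves,
                                  toy_score, temperature, rng)
    return torsions


model = fold(n_res=30)
print(toy_score(model), "non-helical residues remain")
```

With a real fragment library, each position would offer many candidate fragments drawn from known structures, and the score would be the low-resolution energy rather than a residue count.
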
Once the fragment files are available, the paths.txt file must be edited so that it specifies the proper paths to the input sequence, rosetta_database, and fragment files. The fragment file names at the bottom of the file must correspond to the fragment files made above. Ab initio folding can now be invoked using:

rosetta <series> <protein> <chain>

where <series> is a two-letter identifier for the run, <protein> is a four-letter PDB code for the protein, and <chain> is a one-letter identifier for the chain. A FASTA-format file containing the query sequence, called <protein><chain>.fasta, must be present in the directory specified in paths.txt.

The resulting models will be named <series><protein><chain>_NNN.pdb, where NNN is a number starting at 001 and increasing to 999. If you want to generate more than 999 decoys for one particular sequence you will need to change the <series>.

Common options

Command-line options

-increase_cycles <factor>
    Scale the number of cycles at each stage of ab initio by <factor>.

-number_3mer_frags <# fragments>
    Changes the number of 3-mer fragments used to build the model. Default: 200. This number was selected as a cutoff beyond which the quality of the fragments decays.

-number_9mer_frags <# fragments>
    Changes the number of 9-mer fragments used to build the model. Default: 25. This number was selected as a cutoff beyond which the quality of the fragments decays.

-abinitio_temperature <temperature>
    Temperature used in the simulated annealing step. Default: 2.0

-fast
    Runs the protocol without minimization or gradients, trading accuracy for speed. For NOE data only, -fast yields essentially the protocol published in Bowe2000. For RDC data only, -fast omits the refinement step included in the examples published in Rohl2002.

Inputs - Required

<PDB file> (.pdb)
    The PDB file describing the protein whose structure is to be derived. Specified by -protein or as the 2nd command-line argument.
    Format: see the section called “PDB (decoys)”.

<FASTA (sequence) file> (.fasta)
    The sequence description. Specified by the 9th line of paths.txt (see the section called “Paths (text)”).
    Format: see the section called “Sequences (FASTA)”.
    Default: <PDB code><chain>.fasta in the directory specified by paths.txt. There is no way to change the name expected.

Inputs - Optional

<Sequence file> (.dat)
    The sequence description. Specified by the 9th line of paths.txt (see the section called “Paths (text)”). Sequence files ending in .dat are Rosetta's internal sequence format. They are deprecated: prefer .fasta files.
    Note: You should use either .dat or .fasta sequence files, not both.
    Format: see the section called “Sequences (.dat)”.
    Default: <PDB code><chain>.dat in the directory specified by paths.txt. There is no way to change the name expected.

Outputs

<PDB file>
    The PDB file containing the generated decoy. This file's name is the series, protein code, and chain concatenated together and suffixed with a number indicating which decoy this is. So for the command line:

    rosetta aa 2ptl A

    the resulting output files will be named aa2ptlA_001.pdb through aa2ptlA_999.pdb. This implies that at most 999 decoys can be generated for each input PDB within one directory, but this is precisely what the series is for: by altering it you can have considerably more.

    A PDB file generated as output from ab initio contains not only the plain PDB but also extended data: the overall scores, an energy table for each residue, the average energies for each residue, the average energies for each secondary structure, decoy angles vs. starting chi angles, the fraction of chi angles viewed as correct, docking positions, and more (see 1brs.pdb for an example).
    Format: see the section called “PDB (decoys)”.

Reweighting the Scoring

Command-line options

All of these options modify the effects of one or more scoring terms on the final output scores.
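
As a rough mental model of what the reweighting options in this section do, the Python sketch below treats the final score as a sum of per-term contributions, each multiplied by its reweight factor (1.0 unless changed). This is an illustrative assumption, not Rosetta's scoring code. The function name reweighted_score is hypothetical, the term names simply echo the option names, and the numeric values are made up.

```python
def reweighted_score(terms, reweights=None):
    """Sum per-term contributions, scaling each by its reweight factor.

    terms:     dict of term name -> raw contribution (made-up values here)
    reweights: dict of term name -> factor; missing terms default to 1.0
    """
    reweights = reweights or {}
    return sum(value * reweights.get(name, 1.0) for name, value in terms.items())


# Made-up raw contributions for a single decoy.
terms = {"vdw": 4.0, "env": -12.5, "pair": -3.5, "rg": 20.0}

print(reweighted_score(terms))                # all factors at their default of 1.0
print(reweighted_score(terms, {"vdw": 2.0}))  # double the van der Waals contribution
```

In this picture, raising one factor makes that term dominate the ranking of decoys, which is why changes to the weights are usually made one at a time and compared against a default run.
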
For more information on the scoring terms, see Scoring.

-vdw_reweight <factor>
    Scale the contribution of the van der Waals term to the score by <factor>. Default: 1.0

-env_reweight <factor>
    Scale the contribution of the environment term to the score by <factor>. Default: 1.0

-pair_reweight <factor>
    Scale the contribution of the pair term to the score by <factor>. Default: 1.0

-cb_reweight <factor>
    Scale the contribution of the C-beta (packing density) term to the score by <factor>. Default: 1.0

-sheet_reweight <factor>
    Scale the contribution of the sheet term to the score by <factor>. Default: 1.0

-hs_reweight <factor>
    Scale the contribution of the helix-strand term to the score by <factor>. Default: 1.0

-ss_reweight <factor>
    Scale the contribution of the strand-strand term to the score by <factor>. Default: 1.0

-rsigma_reweight <factor>
    Scale the contribution of the R-sigma (strand pair distance/register) term to the score by <factor>. Default: 1.0

-rg_reweight <factor>
    Scale the contribution of the radius of gyration term to the score by <factor>. Default: 1.0

Secondary Structures

Command-line options

All of these options favor different secondary structures by altering the scores given to residues that become part of particular secondary structures.

-rsd_wt_helix <factor>
    Scale the environment, pair and c-beta scores for helix residues by <factor>.

-rsd_wt_strand <factor>
    Scale the environment, pair and c-beta scores for strand residues by <factor>.

-rsd_wt_loop <factor>
    Scale the environment, pair and c-beta scores for loop residues by <factor>.

-rand_envpair_res_wt <factor>
    Scale the environment, pair, and c-beta scores for all residues by random factors (between 0.5 and 1.2).

-rand_SS_wt <factor>
    Scale the helix-strand, strand-strand, sheet and rsigma scores for all residues by random factors (between 0.5 and 1.5).
-random_parallel_antiparallel
    For each decoy, randomly choose whether to drastically upweight long-range parallel strand pairings by a random factor of up to 10-fold and downweight anti-parallel pairings by a similar amount, or vice versa.

-strand_dist_cutoff <distance>
    Specify the distance cutoff in Angstroms between strand dimers within which they are designated paired.

Contacts

Command-line options

All of these options modify how contacts are scored (see Contacts).

-score_contact_flag
    Turn contact scoring on.

-score_contact_file <filename>
    Specify the name of a file containing the probabilities of contacts forming between particular residues.
    Default: <protein><chain>.contact
    Format: see the section called “Contacts”.

-set_contact_weight <factor>
    Scale the contribution of successful contacts to the score by <factor>.

-score_contact_threshold <threshold>
    Prediction probabilities above <threshold> result in a bonus to the overall score. Probabilities below this threshold result in a penalty. Default: 0.5.

-score_contact_seq_sep <separation>
    Residues separated in sequence by a count of at least <separation> are the only ones scored for contacts. Default: 2.
    Note: You can lower this, but not raise it.

-score_contact_calpha
    Use distances between alpha carbons, not between centroids, to assign bonuses to the score for contacts. This can radically change the meaning of the distance entry in the contact file, so it is advisable to use a different contact file if particular distances are being weighted for.

-score_contact_distance <distance>
    Give distances greater than <distance> in Angstroms a bonus to the overall score.
    Default: 8 (centroid/centroid), 11 (C-alpha/C-alpha)

-score_contact_readindist
    Instead of using -score_contact_distance, read in the distance from the fourth column of the .contact file (see the section called “Contacts”). This permits specific residues to be given particular bonuses or penalties.
    The absence of a fourth column will result in using the defaults set by -score_contact_distance.

Interpreting Results

Even with a relatively high level of accuracy in model building, most decoys generated by the ab initio process are not going to be accurate enough. The roughness of the search space allows many decoys to get trapped in local energy minima, in configurations which turn out not to be stable. This means that you will generally need to do many runs in order to produce a sample large enough to contain a sufficiently stable model. Out of a large sample, one would expect only a small fraction to have a score low (good) enough to be a likely structure. There is no tried and true method for finding reliable decoys, but this section will attempt to get you started in the right direction.

Having been designed to model actual proteins as closely as possible, the energy function tends to produce the lowest (best) scores for naturally occurring proteins. When the final score of a known structure is plotted against another measure of structural accuracy (such as RMSD or radius of gyration), the point will appear in the bottom left of the graph. Most predictions will not fare so well, having higher scores and also deviating from the native along the other axis. Their points will tend to appear in clouds along the upper regions of the graph. Decoys which are approaching the actual structure will form a funnel down the left side of the graph as their external measure comes very close to the native's (e.g. as RMSD approaches a reliable threshold) and the final score gets lower and lower.

In most cases, of course, you are using structure prediction precisely because you do not know the actual structure. However, the funnel described is one of the major signs that you are getting models that are likely to be stable. The important thing is to select an axis (and in fact more than one) which is known to be a good indicator of stability to plot against the final Rosetta score.
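
One crude way to look for such a funnel numerically is to compare the lowest-scoring decoys with the whole population along the chosen axis. The Python sketch below does that. It is a heuristic for illustration only, not a Rosetta tool. The helper names, the 5% default fraction, and the decoy records (tag, score, rmsd) are assumptions, and the values are made up.

```python
def lowest_scoring(decoys, fraction=0.05):
    """Return the given fraction of decoys with the lowest (best) scores."""
    ranked = sorted(decoys, key=lambda d: d["score"])
    keep = max(1, int(len(ranked) * fraction))
    return ranked[:keep]


def funnel_hint(decoys, axis, fraction=0.05):
    """Mean of `axis` over the best-scoring decoys versus over all decoys."""
    best = lowest_scoring(decoys, fraction)
    mean_best = sum(d[axis] for d in best) / len(best)
    mean_all = sum(d[axis] for d in decoys) / len(decoys)
    return mean_best, mean_all


# Made-up decoys in which score and RMSD fall together (a funnel-like set).
decoys = [{"tag": f"S_{i:04d}", "score": -100.0 + 5.0 * i, "rmsd": 1.0 + 0.5 * i}
          for i in range(20)]

mean_best, mean_all = funnel_hint(decoys, "rmsd")
print(mean_best, mean_all)
```

If the best-scoring decoys sit clearly lower on the axis than the population as a whole, that is consistent with a funnel; repeating the comparison on a second axis gives a more trustworthy picture than any single one.
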
Low scores can be deceptive, but low scores that appear in a funnel against several axes are likely to indicate a stable decoy. In addition to the final energy value, ab initio scoring generates a great deal of auxiliary information to help guide you in selecting possible axes.

Protocol used in CASP6