PRO_458_sm_suppinfo

Supplement 1: Packing quality of Rosetta Refined Models
[Figure: distributions of RosettaHoles2 SDECOY scores; axis label: SDECOY Score]
The figure above shows the distribution of RosettaHoles2 SDECOY scores for 120 proteins used as the
Rosetta benchmark set. The green line shows the scores for the crystal structures from the PDB
(“Native”). The black dashed line shows the scores for a population of Rosetta ab-initio structure
predictions (“Decoys”). The blue lines show the scores for structure predictions which include some
information about the correct structure (“Cheater Decoys”), and are thus of higher quality than standard
ab-initio models. The blue dashed line shows scores for the entire population of cheater decoys, while the
solid blue line shows the scores for cheater decoys that are within 1Å of the correct structure. The
red lines show scores for crystal structures that have been refined using the Rosetta force field (“Related
Native”). The two solid red lines show scores for structures that first had bond lengths and angles set to
ideal values, while the dashed red line shows scores for structures refined with the original bond
geometry.
Supplement 2: Cavity Volume vs. RosettaHoles (version 1)
The above figure shows the RosettaHoles (version 1) scores and fractional cavity volumes for a set of
crystal structures and corresponding ab-initio structure predictions. The fractional cavity volume is
defined as the total cavity volume divided by the total volume of the molecule (both using a 1.4Å radius
probe). There is a significant difference between void volume in crystals and Rosetta decoys, but the
cavity volume measurement does not separate the two populations in the same way as the RosettaHoles
score.
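The fractional cavity volume defined above can be written as a one-line calculation. The sketch below assumes the individual cavity volumes and the total molecular volume have already been computed with the same 1.4Å probe; the function name and the numbers are illustrative only and are not taken from Rosetta.

```python
def fractional_cavity_volume(cavity_volumes, molecule_volume):
    """Total cavity volume divided by the total volume of the molecule."""
    if molecule_volume <= 0:
        raise ValueError("molecule_volume must be positive")
    return sum(cavity_volumes) / molecule_volume


# Hypothetical volumes in cubic angstroms, for illustration only.
example = fractional_cavity_volume([12.0, 30.5, 7.5], 20000.0)
print(example)
```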
Supplement 3: Illustration of Volumetric Energy Function
The four images above illustrate the basic volumetric data used in RosettaHoles2 in a 2D cartoon. The
first image shows a hypothetical group of atoms as black filled circles. The second image shows how a
volumetric shell extending a certain radius from the atomic surface is defined. The third image shows
many concentric shells, illustrating how volume can be partitioned according to distance to the atomic
surface. The fourth image shows the region occupied by a particular atom, with space occupied by other
atoms faded. The figure below illustrates how the volumetric shells correspond to the vectors of volumes
used in RosettaHoles2. Two atoms in different environments are pictured, one buried and one on the
surface. Note: in these images, the shells have the same thickness, but in RosettaHoles2, the shells are not
evenly spaced.
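The idea of a per-atom vector of shell volumes can be sketched with a simple Monte Carlo estimate. This is only an illustration under stated assumptions, not the RosettaHoles2 implementation: space is assigned to the atom whose surface is nearest, volumes are estimated by random sampling rather than computed analytically, and the shell edges and coordinates are hypothetical.

```python
import bisect
import math
import random


def shell_volume_vector(atoms, target, shell_edges, n_samples=20000, seed=0):
    """Estimate the volume belonging to atoms[target] in each distance shell.

    atoms: list of (x, y, z, radius) tuples.
    shell_edges: increasing distances from the atomic surface, starting at 0.0.
    Returns one estimated volume per shell.
    """
    rng = random.Random(seed)
    tx, ty, tz, tr = atoms[target]
    reach = tr + shell_edges[-1]
    box_volume = (2.0 * reach) ** 3
    counts = [0] * (len(shell_edges) - 1)
    for _ in range(n_samples):
        p = (tx + rng.uniform(-reach, reach),
             ty + rng.uniform(-reach, reach),
             tz + rng.uniform(-reach, reach))
        # Distance from the sample point to the surface of every atom.
        dists = [math.dist(p, a[:3]) - a[3] for a in atoms]
        nearest = min(range(len(atoms)), key=dists.__getitem__)
        d = dists[target]
        # Keep only points outside the target atom, within the outermost
        # shell, and closer to the target's surface than to any other atom.
        if nearest != target or d < 0.0 or d >= shell_edges[-1]:
            continue
        counts[bisect.bisect_right(shell_edges, d) - 1] += 1
    return [box_volume * c / n_samples for c in counts]


# Hypothetical, unevenly spaced shell edges (angstroms from the surface).
edges = [0.0, 0.5, 1.5]
surface_atom = [(0.0, 0.0, 0.0, 1.0)]
buried_atom = surface_atom + [(2.0, 0.0, 0.0, 1.0), (-2.0, 0.0, 0.0, 1.0)]
print(shell_volume_vector(surface_atom, 0, edges))
print(shell_volume_vector(buried_atom, 0, edges))
```

In this sketch the atom with neighbors receives less volume in every shell than the isolated atom, which is the contrast between buried and surface environments that the volume vectors are meant to capture.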
[Figure: volume plotted against radius for the buried atom and the surface atom]
Supplement 4: Details of the Sdecoy score.
The parameters of the RosettaHoles Sdecoy score are shown above for the 28 different atom types
modeled. A numerical data file is available as part of Rosetta, or upon request.
The parameters were obtained via SVM discrimination using linear kernels. Positive training examples
were drawn from a random subset of Protein Data Bank structures with resolution better than 1.28Å. Negative
examples were taken from a random subset of the August 2008 Rosetta optE decoy set. Testing was
performed on the decoys not used in training. To produce these smooth parameter sets, it was necessary to
train many independent SVM models on subsets of the training set. Because linear kernels were used, the
average prediction of the trained models is exactly the same as the prediction from the average model,
which is simply a linear combination of all the models. Averaging over many models does not improve
discrimination performance, but it does provide a cleaner set of model parameters. Because the input
volumetric features are highly correlated, especially for adjacent spherical shells, each individual model
may upweight one volume and downweight a correlated nearby volume. When thousands of separate
linear models are averaged, these local variations average out to reveal the smooth underlying trends in
the parameters, as seen above. This approach is made possible by the vast surfeit of data available in the
training set.
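The equivalence between averaging predictions and averaging models follows directly from linearity, and can be checked numerically. The sketch below uses randomly generated weight vectors as stand-ins for linear SVMs trained on different subsets; the number of models, the number of features, and all names are hypothetical.

```python
import random


def predict(weights, bias, x):
    """Score of a linear model: w . x + b."""
    return sum(w * xi for w, xi in zip(weights, x)) + bias


def average_model(models):
    """Average the weights and biases of a list of (weights, bias) models."""
    n = len(models)
    dim = len(models[0][0])
    avg_w = [sum(m[0][j] for m in models) / n for j in range(dim)]
    avg_b = sum(m[1] for m in models) / n
    return avg_w, avg_b


rng = random.Random(0)
# Stand-ins for many independently trained linear models (not real SVM output).
models = [([rng.gauss(0.0, 1.0) for _ in range(5)], rng.gauss(0.0, 1.0))
          for _ in range(1000)]
x = [rng.random() for _ in range(5)]

mean_of_predictions = sum(predict(w, b, x) for w, b in models) / len(models)
prediction_of_mean = predict(*average_model(models), x)
print(mean_of_predictions, prediction_of_mean)
```

The two printed values agree up to floating-point rounding, which is why the averaged parameter set can be stored and applied as a single linear model.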
The most pronounced feature of the parameters is the favorability of volume within 0.2Å of the atomic
radius and the unfavorability of volume 0.4Å-0.6Å from the atomic surface. This effect is nearly
universal: it appears for side chain polar atoms, side chain hydrophobic atoms, and backbone atoms,
though it seems most pronounced in atoms with the most freedom to move, such as CH3 side chain atoms
and crystal waters (not shown). Such an effect in the data can be explained by clumping of atoms, as
illustrated below.
Why should atoms in experimental structures be less clumped than in computationally generated
structures? We believe this is due to entropic effects. Energy-minimized computationally generated
structures are essentially at absolute zero temperature: they occupy the lowest energy conformation
without regard to configurational entropy. To illustrate this, we conducted a simple experiment with
packing of hard spheres in a box in two dimensions. One billion random configurations of hard spheres
in a box were generated and scored with a Lennard-Jones-like score function. Shown below are radial
distribution functions of these arrangements for various energy bins. Note that for the low energy (red)
configurations, the spatial distribution is quite peaked, with more spheres very close together, whereas
the “random” or “room temperature” ensemble is more spread out. Two representative samples are shown
below, one from a typical random sample and one from a highly clumped low energy configuration.
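[Figure: two representative configurations, labeled “random” and low energy]
[Figure: radial distribution functions by energy bin. Red: low energy; blue: high energy; black:
equilibrium at “room temperature”]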
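A much smaller version of the hard-sphere experiment can be sketched as follows. The box size, number of discs, sample count, and the exact form of the Lennard-Jones-like score are assumptions chosen for illustration; the original experiment used one billion configurations and its parameters are not given here. The fraction of pairs closer than a cutoff is used as a crude summary of the first peak of the radial distribution function.

```python
import math
import random
from itertools import combinations


def sample_configuration(n, box, diameter, rng):
    """Draw n disc centers uniformly in the box, rejecting any overlap."""
    while True:
        pts = [(rng.uniform(0.0, box), rng.uniform(0.0, box)) for _ in range(n)]
        if all(math.dist(a, b) >= diameter for a, b in combinations(pts, 2)):
            return pts


def lj_like_score(pts, sigma):
    """Sum of 4 * ((sigma/r)^12 - (sigma/r)^6) over all pairs."""
    total = 0.0
    for a, b in combinations(pts, 2):
        s6 = (sigma / math.dist(a, b)) ** 6
        total += 4.0 * (s6 * s6 - s6)
    return total


def close_pair_fraction(configs, cutoff):
    """Fraction of all pair distances that fall below the cutoff."""
    dists = [math.dist(a, b) for pts in configs for a, b in combinations(pts, 2)]
    return sum(d < cutoff for d in dists) / len(dists)


rng = random.Random(1)
# Hypothetical setup: 6 discs of diameter 1 in a 10 x 10 box.
configs = [sample_configuration(6, 10.0, 1.0, rng) for _ in range(2000)]
configs.sort(key=lambda pts: lj_like_score(pts, sigma=1.0))
low_energy = configs[:100]  # lowest-scoring 5% of the sampled configurations

print("close pairs, low energy:", close_pair_fraction(low_energy, 1.5))
print("close pairs, all samples:", close_pair_fraction(configs, 1.5))
```

In this toy setup the low-energy subset has a larger share of near-contact pairs than the full random ensemble, which is the clumping effect described above.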
Supplement 5: Computational Refinement of Structures With Rosetta & RosettaHoles2
This figure compares the RosettaHoles2 scores (left panels) and per-residue Rosetta scores (right panels)
of high quality crystal structures before (X axis) and after (Y axis) refinement with vanilla Rosetta (top
panels) and Rosetta with RosettaHoles2 (bottom panels). Lower is better for both scores. The vanilla
Rosetta minimized structures have much lower Rosetta scores (top right panel) but much higher
RosettaHoles2 scores (top left panel) compared to the starting crystal structures. When RosettaHoles2 is
included in minimization, comparable improvements are achieved in Rosetta score (bottom right panel)
without significantly degrading the RosettaHoles2 score (bottom left panel).
Supplement 6: Details of the Sresl Score
The parameters of the RosettaHoles2 Sresl score are shown above. A numerical data file is available as
part of Rosetta, or upon request.
Although the Sresl score, which correlates with PDB X-ray resolution, has the same inputs, linear
functional form and number of parameters as the Sdecoy score, a more complex training process was
required to produce the parameters displayed above. The process described below is made possible by the
use of linear kernels, which allows us to train many different models and recombine them easily. In all
cases, the training of each individual model was done thousands of times on different subsets of the
relevant training data in order to produce smooth parameter sets. See the discussion of training Sdecoy for
further explanation.
The figure above left shows an overview of the three steps used to train the Sresl score. First, the X-ray
structures in the PDB are broken into 64 groups based on resolution. Step A (no figure) is to train 128
sets of discriminatory classifiers (a set is one classifier for each of the 28 atom types). Sets 1-64
discriminate the highest-resolution group (< 1.28Å) from each of the 64 groups of structures. These
models discriminate volume distributions of individual atoms. Sets 65-128 discriminate each of the
resolution groups from a set of Rosetta decoys. In step B (above right), the whole-structure discrimination
scores for each protein are computed for each classifier set by averaging the scores for each atom (all 28
types).
In step C (above left) SVM regression is used to predict the resolution of each structure using as inputs
the 128 whole-structure discrimination scores produced in step B. The result is a resolution prediction
model with 128 parameters. The final combined model for each atom type is produced by multiplying the
matrix of parameters from the 128 discriminatory models for that atom type with the vector of parameters
from the resolution prediction model.
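The final combination step can be illustrated for a single atom type. The sketch below assumes the 128 discriminatory models are stored as rows of weights over the volumetric features and that bias terms are ignored; the feature count and the random values are hypothetical placeholders, not trained parameters.

```python
import random


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def combine(classifier_weights, regression_weights):
    """Collapse the two-stage model into one weight vector per feature.

    classifier_weights: one row of feature weights per discriminatory model.
    regression_weights: one resolution-prediction weight per model.
    """
    n_features = len(classifier_weights[0])
    return [sum(r * w[j] for r, w in zip(regression_weights, classifier_weights))
            for j in range(n_features)]


rng = random.Random(0)
n_models, n_features = 128, 20  # 20 features is an assumed, illustrative count
classifier_weights = [[rng.gauss(0.0, 1.0) for _ in range(n_features)]
                      for _ in range(n_models)]
regression_weights = [rng.gauss(0.0, 1.0) for _ in range(n_models)]
x = [rng.random() for _ in range(n_features)]  # one atom's volume vector

# Two-stage evaluation: 128 discrimination scores, then the regression.
two_stage = dot(regression_weights, [dot(w, x) for w in classifier_weights])
# One-stage evaluation with the precombined parameters.
one_stage = dot(combine(classifier_weights, regression_weights), x)
print(two_stage, one_stage)
```

Because every stage is linear, the two evaluations agree up to rounding, so only the combined parameter vector needs to be kept for each atom type.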
Supplement 7: How to run RosettaHoles2
RosettaHoles2 is available as part of the Rosetta software suite. The application “holes” should be used to
compute quantitative RosettaHoles scores as well as cavity visualizations. The DalphaBall binary must be
compiled separately from the main Rosetta applications. Scores can be obtained from standard output on
lines beginning with RosettaHoles:.
Required Arguments:
-in:file:database <path to rosetta database>
-holes:dalphaball <path to DalphaBall binary>
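As a minimal sketch of how a run might be scripted, the helpers below assemble a command line from the two required arguments and pull the score lines out of captured standard output. The paths are placeholders, the executable name may carry a platform-specific suffix in a given build, and any input-structure or optional arguments are passed through unchanged via extra_args because they are not specified here. The sample output text is invented to exercise the filter and does not reflect the real format of the score line.

```python
def build_holes_command(executable, database, dalphaball, extra_args=()):
    """Assemble an argument list using the two required arguments."""
    return [executable,
            "-in:file:database", database,
            "-holes:dalphaball", dalphaball,
            *extra_args]


def rosettaholes_lines(stdout_text):
    """Return the content of every stdout line beginning with 'RosettaHoles:'."""
    prefix = "RosettaHoles:"
    return [line[len(prefix):].strip()
            for line in stdout_text.splitlines()
            if line.startswith(prefix)]


# Placeholder paths; replace with the locations on your system.
cmd = build_holes_command("holes", "/path/to/rosetta_database", "/path/to/DAlphaBall")
print(cmd)
# The list could be run with subprocess.run(cmd, capture_output=True, text=True)
# and the resulting .stdout passed to rosettaholes_lines().

# Invented sample text, only to show the filtering behavior.
sample = "other log output\nRosettaHoles: <score fields>\nmore log output\n"
print(rosettaholes_lines(sample))
```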
Optional Arguments:
-remember_unrecognized_res
This argument will cause RosettaHoles2 to include ligand, DNA and other atoms which are not canonical
protein atom types. No scores will be produced for these atoms directly, but they will influence the scores
of nearby atoms. Recommended.
-holes:make_pdb
This argument will cause RosettaHoles2 to output a PDB file for each structure scored with per-atom
RosettaHoles2 scores in the temperature field for scored atoms.
-holes:make_voids
This argument will cause RosettaHoles2 to output an explicit representation of the voids in the input
structure. Voids are represented as spheres and are output as HETATM lines. Contiguous voids will have
the same residue number, and the radius of the voids is placed in the temperature column. PyMOL
commands for visualization are available upon request.
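The void output described above can be summarized with a short script. This sketch assumes the HETATM lines follow the standard fixed-column PDB layout (residue number in columns 23-26, temperature factor in columns 61-66); the atom and residue names in the example lines are hypothetical, since the names used in the actual output are not stated here.

```python
from collections import defaultdict


def read_voids(pdb_lines):
    """Group void spheres by residue number; values are lists of sphere radii."""
    voids = defaultdict(list)
    for line in pdb_lines:
        if not line.startswith("HETATM"):
            continue
        residue_number = int(line[22:26])  # contiguous voids share this number
        radius = float(line[60:66])        # radius stored in the temperature column
        voids[residue_number].append(radius)
    return dict(voids)


# Hypothetical example lines built with standard PDB column widths.
template = "HETATM{:5d} {:<4s} {:>3s} {}{:4d}    {:8.3f}{:8.3f}{:8.3f}{:6.2f}{:6.2f}"
lines = [
    template.format(1, "V", "VOI", "Z", 1, 1.0, 2.0, 3.0, 1.0, 1.40),
    template.format(2, "V", "VOI", "Z", 1, 1.5, 2.0, 3.0, 1.0, 0.90),
    template.format(3, "V", "VOI", "Z", 2, 9.0, 9.0, 9.0, 1.0, 2.10),
]

for number, radii in sorted(read_voids(lines).items()):
    print(f"void {number}: {len(radii)} spheres, largest radius {max(radii):.2f}")
```

Reporting the sphere count and largest radius avoids summing sphere volumes, which would overcount wherever the spheres of a contiguous void overlap.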