Supplement 1: Packing quality of Rosetta Refined Models

[Figure: distributions of the SDECOY score]

The figure above shows the distribution of RosettaHoles2 SDECOY scores for 120 proteins used as the Rosetta benchmark set. The green line shows the scores for the crystal structures from the PDB (“Native”). The black dashed line shows the scores for a population of Rosetta ab-initio structure predictions (“Decoys”). The blue lines show the scores for structure predictions which include some information about the correct structure (“Cheater Decoys”), and are thus of higher quality than standard ab-initio models. The blue dashed line shows scores for the entire population of cheater decoys, while the solid blue line shows the scores for cheater decoys that are within 1Å of the correct structure. The red lines show scores for crystal structures that have been refined using the Rosetta force field (“Related Native”). The two solid red lines show scores for structures that first had bond lengths and angles set to ideal values, while the dashed red line shows scores for structures refined with the original bond geometry.

Supplement 2: Cavity Volume vs. RosettaHoles (version 1)

The above figure shows the RosettaHoles (version 1) scores and fractional cavity volumes for a set of crystal structures and corresponding ab-initio structure predictions. The fractional cavity volume is defined as the total cavity volume divided by the total volume of the molecule (both computed using a 1.4Å radius probe). There is a significant difference between the void volume in crystal structures and in Rosetta decoys, but the cavity volume measurement does not separate the two populations in the same way as the RosettaHoles score.

Supplement 3: Illustration of Volumetric Energy Function

The four images above illustrate, in a 2D cartoon, the basic volumetric data used in RosettaHoles2. The first image shows a hypothetical group of atoms as black filled circles.
The second image shows how a volumetric shell extending a certain radius from the atomic surface is defined. The third image shows many concentric shells, illustrating how volume can be partitioned according to distance from the atomic surface. The fourth image shows the region occupied by a particular atom, with space occupied by other atoms faded.

The figure below illustrates how the volumetric shells correspond to the vectors of volumes used in RosettaHoles2. Two atoms in different environments are pictured, one buried and one on the surface. Note: in these images the shells have the same thickness, but in RosettaHoles2 the shells are not evenly spaced.

[Figure: Volume vs. Radius for each of the two atoms]

Supplement 4: Details of the Sdecoy score

The parameters of the RosettaHoles Sdecoy score are shown above for the 28 different atom types modeled. A numerical data file is available as part of Rosetta, or upon request. The parameters were obtained via SVM discrimination using linear kernels. Positive training examples were drawn from a random subset of Protein Data Bank structures with resolution better than 1.28Å. Negative examples were taken from a random subset of the August 2008 Rosetta optE decoy set. Testing was performed on the decoys not used in training.

To produce these smooth parameter sets, it was necessary to train many independent SVM models on subsets of the training set. Because linear kernels were used, the average prediction of the trained models is exactly the same as the prediction from the average model, which is simply a linear combination of all the models. Averaging over many models does not improve discrimination performance, but it does provide a cleaner set of model parameters. Because the input volumetric features are highly correlated, especially for adjacent spherical shells, each individual model may upweight one volume and downweight a correlated nearby volume.
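The claim that averaging linear models is equivalent to averaging their predictions can be checked numerically. The following is a minimal sketch, not the training code used for Sdecoy: the model weights and feature vectors are random stand-ins, the sizes (50 models, 20 shell volumes, 100 atoms) are arbitrary assumptions, and the equality applies to the raw linear decision values rather than to thresholded class labels.

```python
import random

random.seed(0)

# Arbitrary illustrative sizes (assumptions, not the values used in the paper).
N_MODELS, N_FEATURES, N_ATOMS = 50, 20, 100


def decision_value(weights, bias, features):
    """Raw output of a linear model: w . x + b."""
    return sum(w * x for w, x in zip(weights, features)) + bias


# Random stand-ins for independently trained linear models and for
# per-atom vectors of shell volumes.
models = [
    ([random.gauss(0, 1) for _ in range(N_FEATURES)], random.gauss(0, 1))
    for _ in range(N_MODELS)
]
atoms = [[random.random() for _ in range(N_FEATURES)] for _ in range(N_ATOMS)]

# The "average model": mean of the weight vectors and mean of the biases.
mean_w = [sum(w[j] for w, _ in models) / N_MODELS for j in range(N_FEATURES)]
mean_b = sum(b for _, b in models) / N_MODELS

# (a) Average of the individual predictions for each atom.
avg_of_predictions = [
    sum(decision_value(w, b, x) for w, b in models) / N_MODELS for x in atoms
]
# (b) Prediction of the average model for each atom.
prediction_of_average = [decision_value(mean_w, mean_b, x) for x in atoms]

max_abs_diff = max(
    abs(a - p) for a, p in zip(avg_of_predictions, prediction_of_average)
)
print(f"largest discrepancy: {max_abs_diff:.2e}")
```

The two quantities agree up to floating-point rounding, which is why many models can be trained on subsets and recombined into a single parameter set at no cost in prediction.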
When thousands of separate linear models are averaged, these local variations average out to reveal the smooth underlying trends in the parameters, as seen in the parameter figure. This approach is made possible by the large surplus of data available in the training set.

The most pronounced feature of the parameters is the favorability of volume within 0.2Å of the atomic radius and the unfavorability of volume 0.4Å-0.6Å from the atomic surface. This effect is nearly universal: it appears for side chain polar atoms, side chain hydrophobic atoms, and backbone atoms, though it seems most pronounced in atoms with the most freedom to move, such as CH3 side chain atoms and crystal waters (not shown). Such an effect in the data can be explained by clumping of atoms, as illustrated below.

Why should atoms in experimental structures be less clumped than in computationally generated structures? We believe this is due to entropic effects. Energy-minimized, computationally generated structures are essentially at absolute zero temperature: they occupy the lowest energy conformation without regard to configurational entropy. To illustrate this, we conducted a simple experiment with packing of hard spheres in a box in two dimensions. One billion random configurations of hard spheres in a box were generated and scored with a Lennard-Jones-like score function. Shown below are radial distribution functions of these arrangements for various energy bins. Note that for the low energy (red) configurations the spatial distribution is quite peaked, with more spheres very close together, whereas the “random” or “room temperature” ensemble is more spread out. Two representative samples are shown below, one from a typical random sample and one from a highly clumped low energy configuration.
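The hard-sphere experiment described above can be sketched at a much smaller scale. This is an illustration under stated assumptions, not the original experiment: the box size, disk radius, number of disks, the 12-6 form of the pair term, and the sample count (2,000 rather than one billion) are all chosen here for convenience. It also reports a raw pair-distance histogram rather than a normalized radial distribution function, and compares the lowest-energy 5% of samples with the full random ensemble without any Boltzmann weighting.

```python
import math
import random

BOX = 10.0           # box edge length (assumed)
RADIUS = 0.5         # hard-disk radius (assumed)
N_DISKS = 8          # disks per configuration (assumed)
SIGMA = 2 * RADIUS   # centre-to-centre contact distance


def pair_distances(centers):
    """All centre-to-centre distances within one configuration."""
    return [math.dist(a, b) for i, a in enumerate(centers) for b in centers[i + 1:]]


def random_configuration(rng):
    """Draw all disk centres uniformly; reject the whole draw if any overlap.

    Returns the list of pair distances, or None if two disks overlap.
    """
    centers = [
        (rng.uniform(RADIUS, BOX - RADIUS), rng.uniform(RADIUS, BOX - RADIUS))
        for _ in range(N_DISKS)
    ]
    dists = pair_distances(centers)
    return dists if min(dists) >= SIGMA else None


def lj_like_energy(dists):
    """Sum of 12-6 pair terms; each term has its minimum of -1 near contact."""
    total = 0.0
    for r in dists:
        s6 = (SIGMA / r) ** 6
        total += 4.0 * (s6 * s6 - s6)
    return total


def distance_histogram(configs, bin_width=0.25, r_max=5.0):
    """Fraction of all pairs falling in each distance bin below r_max."""
    counts = [0] * int(r_max / bin_width)
    n_pairs = 0
    for dists in configs:
        for r in dists:
            n_pairs += 1
            if r < r_max:
                counts[int(r / bin_width)] += 1
    return [c / n_pairs for c in counts]


rng = random.Random(0)
samples = []
while len(samples) < 2000:
    dists = random_configuration(rng)
    if dists is not None:
        samples.append((lj_like_energy(dists), dists))

samples.sort(key=lambda s: s[0])
low_energy = [d for _, d in samples[:100]]   # lowest-energy 5% of samples
everything = [d for _, d in samples]         # the full random ensemble

hist_low = distance_histogram(low_energy)
hist_all = distance_histogram(everything)

# Bins 0-5 cover distances below 1.5, i.e. pairs at or near contact.
near_low = sum(hist_low[:6])
near_all = sum(hist_all[:6])
print(f"near-contact pair fraction, low energy: {near_low:.3f}")
print(f"near-contact pair fraction, all samples: {near_all:.3f}")
```

Selecting on energy alone enriches near-contact pairs, which is the clumping effect the text attributes to energy-minimized structures.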
[Figure: two representative configurations, labeled “random” and “low energy”]

[Figure: radial distribution functions by energy bin. Red: low energy; blue: high energy; black: equilibrium at “room temperature”]

Supplement 5: Computational Refinement of Structures With Rosetta & RosettaHoles2

This figure compares the RosettaHoles2 scores (left panels) and per-residue Rosetta scores (right panels) of high quality crystal structures before (X axis) and after (Y axis) refinement with vanilla Rosetta (top panels) and Rosetta with RosettaHoles2 (bottom panels). Lower is better for both scores. The vanilla Rosetta minimized structures have much lower Rosetta scores (top right panel) but much higher RosettaHoles2 scores (top left panel) compared to the starting crystal structures. When RosettaHoles2 is included in minimization, comparable improvements are achieved in Rosetta score (bottom right panel) without significantly degrading the RosettaHoles2 score (bottom left panel).

Supplement 6: Details of the Sresl Score

The parameters of the RosettaHoles2 Sresl score are shown above. A numerical data file is available as part of Rosetta, or upon request. Although the Sresl score, which correlates with PDB X-ray resolution, has the same inputs, linear functional form and number of parameters as the Sdecoy score, a more complex training process was required to produce the parameters displayed above. The process described below is made possible by the use of linear kernels, which allow us to train many different models and recombine them easily. In all cases, each individual model was trained thousands of times on different subsets of the relevant training data in order to produce smooth parameter sets. See the discussion of training Sdecoy for further explanation.

The figure above left shows an overview of the three steps used to train the Sresl score. First, the X-ray structures in the PDB are divided into 64 groups based on resolution. Step A (no figure) is to train 128 sets of discriminatory classifiers (a set is one classifier for each of the 28 atom types).
Sets 1-64 discriminate the highest-resolution group (< 1.28Å) from each of the 64 groups of structures. These models discriminate volume distributions of individual atoms. Sets 65-128 discriminate each of the resolution groups from a set of Rosetta decoys. In step B (above right), the whole-structure discrimination scores for each protein are computed for each classifier set by averaging the per-atom scores over all atoms (all 28 types). In step C (above left), SVM regression is used to predict the resolution of each structure, using as inputs the 128 whole-structure discrimination scores produced in step B. The result is a resolution prediction model with 128 parameters. The final combined model for each atom type is produced by multiplying the matrix of parameters from the 128 discriminatory models for that atom type by the vector of parameters from the resolution prediction model.

Supplement 7: How to run RosettaHoles2

RosettaHoles2 is available as part of the Rosetta software suite. The application “holes” should be used to compute quantitative RosettaHoles scores as well as cavity visualizations. The DalphaBall binary must be compiled separately from the main Rosetta applications. Scores can be obtained from standard output on lines beginning with RosettaHoles:.

Required Arguments:

-in:file:database <path to rosetta database>
-holes:dalphaball <path to DalphaBall binary>

Optional Arguments:

-remember_unrecognized_res
This argument will cause RosettaHoles2 to include ligand, DNA and other atoms which are not canonical protein atom types. No scores will be produced for these atoms directly, but they will influence the scores of nearby atoms. Recommended.

-holes:make_pdb
This argument will cause RosettaHoles2 to output a PDB file for each structure scored, with per-atom RosettaHoles2 scores in the temperature field for scored atoms.

-holes:make_voids
This argument will cause RosettaHoles2 to output an explicit representation of the voids in the input structure.
Voids are represented as spheres and are output as HETATM lines. Contiguous voids will have the same residue number, and the radius of the voids is placed in the temperature column. PyMOL commands for visualization are available upon request.
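The two outputs described above (score lines on standard output and void spheres written as HETATM lines) can be post-processed with a few lines of Python. This is a minimal sketch under assumptions: it presumes the void records follow the standard fixed-column PDB layout (residue number in columns 23-26, coordinates in columns 31-54, temperature factor in columns 61-66) and that the text passed in contains only void HETATM records. The function names, the `VOD` residue name, and the demo data are hypothetical and are not taken from Rosetta; the content of a score line after the `RosettaHoles:` prefix is not assumed.

```python
from collections import defaultdict


def rosettaholes_lines(stdout_text):
    """Return the standard-output lines that begin with 'RosettaHoles:'."""
    return [ln for ln in stdout_text.splitlines() if ln.startswith("RosettaHoles:")]


def read_voids(pdb_text):
    """Group void spheres by residue number.

    Returns {residue_number: [(x, y, z, radius), ...]}, where the radius is
    read from the temperature-factor column and spheres sharing a residue
    number belong to one contiguous void.
    """
    voids = defaultdict(list)
    for line in pdb_text.splitlines():
        if not line.startswith("HETATM"):
            continue
        resnum = int(line[22:26])
        x, y, z = float(line[30:38]), float(line[38:46]), float(line[46:54])
        radius = float(line[60:66])
        voids[resnum].append((x, y, z, radius))
    return dict(voids)


def fake_void_line(serial, resnum, x, y, z, radius):
    """Build a made-up HETATM record in standard PDB columns (demo data only)."""
    return "HETATM%5d  V   VOD X%4d    %8.3f%8.3f%8.3f%6.2f%6.2f" % (
        serial, resnum, x, y, z, 1.0, radius
    )


# Hypothetical inputs for demonstration; real output will differ.
demo_stdout = "some other log line\nRosettaHoles: <score text>\n"
demo_pdb = "\n".join([
    fake_void_line(1, 1, 0.0, 0.0, 0.0, 1.50),
    fake_void_line(2, 1, 1.0, 0.0, 0.0, 1.20),
    fake_void_line(3, 2, 9.0, 9.0, 9.0, 0.80),
])

score_lines = rosettaholes_lines(demo_stdout)
voids = read_voids(demo_pdb)
for resnum, spheres in sorted(voids.items()):
    print(f"void {resnum}: {len(spheres)} sphere(s)")
```

If the void file also contains ligand or water HETATM records, an additional filter on the residue name would be needed before grouping.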