Supplementary Methods

Note on physical relevance of druggability equation parameters

The parameters we obtained for $\gamma(\infty)$ and $C$ using our calibration set have a physical interpretation that may warrant further study. Our $\gamma(\infty)$ value of 45 cal/mol/Å² is smaller than the 72 cal/mol/Å² typically cited (reference 12 in article) but close to the value of 43 cal/mol/Å² derived from measurements of hydrocarbon partitioning by De Young and Dill (reference 15 in article). Use of $C = 0$ implies a fortuitous cancellation of energy terms, which is not uncommon but requires further investigation.

Details of computational algorithms

Here we present further details of the algorithms used to compute the surface area and curvature values required for the MAPPOD equation. The equation,

$$\Delta G_{\mathrm{MAPPOD}} \approx -\gamma(r)\,\frac{A^{\mathrm{target}}_{\mathrm{nonpolar}}}{A^{\mathrm{target}}_{\mathrm{total}}}\,A^{\mathrm{target}}_{\mathrm{druglike}} + C; \qquad \gamma(r) = \gamma(\infty)\left(1 - \frac{1.4}{r}\right),$$

has three parameters that are measured from the binding site: the total solvent-accessible surface area (SASA), represented by $A^{\mathrm{target}}_{\mathrm{total}}$; the non-polar SASA, represented by $A^{\mathrm{target}}_{\mathrm{nonpolar}}$; and the global curvature of the binding site, represented by $r$. The remaining parameters are defined as described in the main text ($A^{\mathrm{target}}_{\mathrm{druglike}} = 300$ Å², $C = 0$, and $\gamma(\infty) = 45$ cal/mol/Å²). The goal here is to detail the algorithms for defining the binding site and for the subsequent computation of the surface areas and curvature. The process is depicted in Figure S1.

Figure S1. Overview of computational approach. a, From the crystal structure and the set of atoms defining the binding site, computational geometry algorithms are used to define non-overlapping tetrahedra (shown in blue) for the binding site (see also Figure S2 for further details), from which an analytical representation of the surface is derived. The analytical representation includes sphere sections and torus sections that represent the molecular surface of the protein.
b, From this surface, the total solvent-accessible surface area (SASA) can be directly computed by summing the surface areas of all the sphere or torus sections. The non-polar SASA is computed by defining surface patches as polar or non-polar based on atom types and summing the surface areas of the non-polar patches. The outline represents the solvent-accessible surface, while the colors depict polar (green) or non-polar (brown) surfaces. c, The curvature is then measured by finding the least-squares fitted sphere to the whole binding site. The yellow outline represents the geometric representation of the binding site molecular surface, while the orange sphere represents the least-squares fitted sphere to the binding site.

Computational representation of binding site

Crystal structures were downloaded from the PDB and inspected for completeness in the binding site. All structures are at atomic resolution (<2.5 Å, except 1QMF at 2.8 Å) and have a co-crystallized ligand (peptide, small molecule, or substrate). PDB structures were not modified, although the programs we use to calculate surface areas and curvature ignore all heteroatoms and hydrogens (as is customary for biological surface area calculations). Ligand binding sites were further filled in using MOE SiteFinder alpha-spheres (version 2004.03; Chemical Computing Group, Montreal, Canada). Binding sites were defined by atoms within 5.0 Å of the ligand or alpha spheres and trimmed at the edges to define a contiguous binding site surface of approximately 300 Å² of surface area.

Analytic representation of the macromolecular surface involves first constructing the three-dimensional weighted Delaunay tessellation of the biomolecule and then subtracting the alpha shape complex, as depicted in Figure S2 and described previously (Liang et al. 1998; Edelsbrunner and Koehl 2003). We used the POCKET program (Edelsbrunner and Koehl 2003) to generate the Delaunay and alpha shape complexes, and we modified the program to output coordinates for all tetrahedra and alpha shapes. Tetrahedra representing the protein pocket of interest are identified using the list of binding site atoms, where tetrahedra are retained if all four vertices are defined in the atom list. The largest set of connected non-alpha shape tetrahedra is retained as the computational definition of the pocket. In the case shown in the right-most panel of Figure S2, the process results in two tetrahedral clusters, and the smaller cluster (in purple above the larger cluster) is removed.

Figure S2. Computational definition of binding site. The Delaunay tessellation of the biomolecule is followed by definition of the alpha shape complex. Subtraction of the alpha shape complex from the Delaunay tessellation results in identification of pockets. The exact pocket of interest is identified by the list of binding site atoms.

The main advantage of this approach for our application is the accurate definition of a physically reasonable pocket surface. By defining the pocket using the tetrahedra that fill the pocket, we only use the portions of the surface that face into the pocket and directly participate in small-molecule binding. Use of standard available software to perform a simple additive summation of the surface areas of atoms in a binding site results in overestimation of binding surface areas by about 40%, largely from contributions from the outer edge, or “lip”, of the pocket, which contributes minimally to binding affinity (A.C.C., R.G.C., unpublished observation). A reasonable pocket definition is also essential to the curvature calculation detailed below.

Surface area calculation

Solvent-accessible surface areas are calculated analytically using the collection of tetrahedra representing the binding site. The Delaunay tessellation generates non-overlapping tetrahedra that divide up the space of the pocket.
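The pocket definition described above (retain the non-alpha-shape tetrahedra whose four vertices are all binding site atoms, then keep the largest connected cluster) can be sketched in a few lines of Python. This is an illustrative sketch only, not the modified POCKET code: the function name `select_pocket`, the representation of tetrahedra as 4-tuples of atom indices, the use of a shared triangular face as the connectivity criterion, and the use of tetrahedron count as the measure of "largest" are all assumptions made for the example.

```python
from collections import defaultdict

def select_pocket(tetrahedra, site_atoms):
    """Return the largest face-connected cluster of tetrahedra whose four
    vertices all belong to site_atoms (hypothetical helper, see lead-in)."""
    site = set(site_atoms)
    # Step 1: keep tetrahedra with all four vertices in the binding site list.
    kept = [tuple(sorted(t)) for t in tetrahedra if set(t) <= site]

    # Step 2: two tetrahedra are neighbours if they share a triangular face
    # (assumed connectivity criterion).
    face_owners = defaultdict(list)
    for i, t in enumerate(kept):
        for skip in range(4):
            face_owners[t[:skip] + t[skip + 1:]].append(i)
    neighbours = defaultdict(set)
    for owners in face_owners.values():
        for i in owners:
            neighbours[i].update(j for j in owners if j != i)

    # Step 3: depth-first search for connected clusters; keep the largest.
    seen, best = set(), []
    for start in range(len(kept)):
        if start in seen:
            continue
        seen.add(start)
        stack, cluster = [start], []
        while stack:
            i = stack.pop()
            cluster.append(i)
            for j in neighbours[i]:
                if j not in seen:
                    seen.add(j)
                    stack.append(j)
        if len(cluster) > len(best):
            best = cluster
    return [kept[i] for i in sorted(best)]

# Toy input: three connected tetrahedra, one isolated tetrahedron, and one
# tetrahedron that touches an atom (index 10) outside the binding site.
example = [(0, 1, 2, 3), (1, 2, 3, 4), (2, 3, 4, 5), (6, 7, 8, 9), (0, 1, 2, 10)]
pocket = select_pocket(example, range(10))
```

In the toy input, the isolated tetrahedron plays the role of the smaller cluster that is discarded in Figure S2. Ranking clusters by volume instead of count would be an equally reasonable choice.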
Because the four corners of the tetrahedra represent atom centers and atoms are represented as spheres, surface areas for each atom present in the tetrahedron can be calculated as portions of spheres. For the purpose of calculating the hydrophobic surface area, we define carbon and sulfur atoms as hydrophobic, and nitrogen and oxygen atoms as polar.

Because spheres are overlapping, calculation of surface areas involves an “inclusion-exclusion” algorithm developed by Liang et al. (1998). Finding the solvent-accessible (SA) surface area within one tetrahedron involves: adding the surface area of each SA sphere section that is within the tetrahedron, subtracting the surface area of the overlap between pairs of spheres within the tetrahedron, and adding back the surface area of the overlap between sets of three spheres within the tetrahedron. If a set of four spheres overlap, there is no accessible surface area within the tetrahedron. We have validated the surface areas calculated by our implementation against results from NACCESS (Laskowski 1995) and POCKET (Edelsbrunner and Koehl 2003).

Curves representing boundaries between atoms are then calculated to allow reconstruction of the surface. The solvent-accessible surface is a union of sphere sections, while the molecular surface additionally includes torus and sphere sections that map the reentrant surface (in Figure S3, lines between the dark gray spheres). We do not calculate the molecular surface area, but we need to generate the molecular surface for the curvature algorithm detailed below.

Figure S3. Definition of solvent-accessible surface, molecular surface, and van der Waals surface. The two light gray circles represent water probe spheres, and the dark gray area represents the area occupied by protein atoms.

Curvature algorithm

A natural way to measure protein surface curvature is to generate the least-squares fitted (LSF) sphere to a surface patch and use the radius as the curvature measure.
While the concept is simple, the sphere-fitting problem is not trivial, and most known approaches to protein surface curvature measurement use alternative approaches that are arguably less straightforward in terms of physical interpretation. We have previously developed an approach to solve the LSF sphere problem by turning the sphere-fitting problem into a solvable plane-fitting problem using a transformation known as geometric inversion (Coleman et al. 2005). The approach works on any arbitrary surface patch and returns a radius of curvature that has direct physical interpretation. This radius of curvature is the radius, $r$, used in the MAPPOD druggability equation.

Finding the best-fitted sphere to a surface requires simultaneous minimization of distances and definition of a center given the restriction to a sphere. We use a transformation known as geometric inversion in generating a least-squares fit (LSF) sphere to the binding site surface. An inversive sphere of radius $k$ can be defined for any inversive point $(p, q, r)$, and all other points $(x_i, y_i, z_i)$ can be transformed around the inversive sphere as follows:

$$x_i' = p + \frac{k^2 (x_i - p)}{(x_i - p)^2 + (y_i - q)^2 + (z_i - r)^2}$$

$$y_i' = q + \frac{k^2 (y_i - q)}{(x_i - p)^2 + (y_i - q)^2 + (z_i - r)^2}$$

$$z_i' = r + \frac{k^2 (z_i - r)}{(x_i - p)^2 + (y_i - q)^2 + (z_i - r)^2}$$

Because the inversive transformation is performed on points, a dot surface representation of the molecular surface is required. A number of available programs can be used to generate this dot surface, including the Connolly approach available from biohedron.com (Connolly 1986) and one based on icosahedra from the Honig lab (Sridharan et al. 1992). Here, we painted each surface piece evenly with points using a spiral dot placement algorithm and subsequently removed points outside the boundary curves. These points can then be used to find the LSF sphere using the inversive transformation.
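As an illustration of spiral dot placement, the Python sketch below distributes points nearly evenly over a sphere with a golden-angle spiral and then discards unwanted points. The text does not specify which spiral scheme was used, so the golden-angle construction is an assumption, as are the function names `spiral_sphere_points` and `remove_buried`. The filter also substitutes a simple test (drop dots that fall inside a neighbouring sphere) for the analytic boundary curves used in the actual method.

```python
import math

GOLDEN_ANGLE = math.pi * (3.0 - math.sqrt(5.0))

def spiral_sphere_points(center, radius, n):
    """Place n roughly evenly spaced dots on a sphere along a spiral
    (golden-angle scheme; an assumed choice, see lead-in)."""
    points = []
    for i in range(n):
        z = 1.0 - (2.0 * i + 1.0) / n        # evenly spaced heights in (-1, 1)
        rho = math.sqrt(1.0 - z * z)         # radius of the circle at height z
        phi = GOLDEN_ANGLE * i               # rotate by the golden angle each step
        points.append((center[0] + radius * rho * math.cos(phi),
                       center[1] + radius * rho * math.sin(phi),
                       center[2] + radius * z))
    return points

def remove_buried(points, neighbours):
    """Drop dots lying inside any neighbouring sphere.

    neighbours is a list of (center, radius) pairs. This is a stand-in for
    trimming against analytic boundary curves.
    """
    return [pt for pt in points
            if all(math.dist(pt, c) >= r for c, r in neighbours)]

# Toy example: dots on one sphere, trimmed where a second sphere overlaps it.
dots = spiral_sphere_points((0.0, 0.0, 0.0), 2.0, 200)
exposed = remove_buried(dots, [((3.0, 0.0, 0.0), 2.0)])
```

The same idea extends to torus sections by parameterizing the patch and rejecting dots that fall outside its boundaries; that case is omitted here for brevity.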
Depicted in Figure S4a and explained below is the two-dimensional case of finding a least-squares fitted circle. The three-dimensional case that we care about for fitting a binding site is depicted in Figure S4b, and involves an inversion sphere that is used to transform points on the binding site surface to points that can be fitted to a plane in inversive space.

Figure S4. Inversive transformation used to generate the least-squares fitted sphere to a binding site. a, In the two-dimensional case (used for clarity), a circle, C, in normal space becomes a line, L’, in inverted space because it passes through the inversion circle center. The dashed line represents the diameter of the circle of interest, C, in normal space, and is the distance between the inversion circle center and the furthest point on circle C. b, For a binding site surface patch in red, the transformation using an inversion sphere results in points that can be fitted to a plane in inverted space. Transforming the LSF plane back to normal space results in the candidate LSF sphere solution shown.

Since the transformation takes the inversion point $(p, q, r)$ to infinity, this point must be treated in practice as a special case. For purposes of measuring curvature we can ignore this point, since it receives a zero weighting in the LSF fit (see below). We use a unit inversion sphere ($k = 1$) here. We make use of the geometric inversion property of being self-dual, meaning that a point transformed twice returns to the same point. Another property we take advantage of is that inversion of points that lie on a sphere that passes through the inversion point results in a set of points that lies on a plane (depicted as L’ for the two-dimensional case in Figure S4a). It follows that a set of points that lie approximately on a sphere, where the LSF sphere passes through the inversion point, will lie near a plane under inversion.
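Both properties can be checked numerically. The Python sketch below implements the inversion about a point $(p, q, r)$ with inversion radius $k$ and applies it to points sampled from a sphere that passes through the inversion point. The function name `invert` and the test geometry (a sphere of radius 2 whose center lies directly above the inversion point) are choices made for this example only.

```python
import math

def invert(point, center, k=1.0):
    """Geometric inversion of `point` about a sphere of radius k at `center`."""
    offset = [a - b for a, b in zip(point, center)]
    d2 = sum(c * c for c in offset)
    if d2 == 0.0:
        raise ValueError("the inversion point itself maps to infinity")
    return tuple(b + k * k * c / d2 for b, c in zip(center, offset))

# A sphere of radius R that passes through the inversion point p.
p = (1.0, 2.0, 3.0)
R = 2.0
sphere_center = (p[0], p[1], p[2] + R)

on_sphere = []
for i in range(1, 6):                 # polar angles 0.5 .. 2.5 rad (never reaches p)
    for j in range(8):
        theta, phi = 0.5 * i, 2.0 * math.pi * j / 8
        on_sphere.append((sphere_center[0] + R * math.sin(theta) * math.cos(phi),
                          sphere_center[1] + R * math.sin(theta) * math.sin(phi),
                          sphere_center[2] + R * math.cos(theta)))

images = [invert(x, p) for x in on_sphere]

# Property 1: inverting twice returns the original point.
round_trip = [invert(t, p) for t in images]

# Property 2: the images lie on a plane, here z = p_z + k^2 / (2 R).
plane_height = p[2] + 1.0 / (2.0 * R)
max_plane_error = max(abs(t[2] - plane_height) for t in images)
```

For this geometry the image plane sits at a distance $k^2 / (2R)$ from the inversion point, which is why the closest point on the plane inverts back to the point on the sphere diametrically opposite the inversion point.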
We use inversion to find an LSF sphere, where the fit is determined by the sum of the smallest distances from the ideal sphere to each data point. The best-fit line to points in two dimensions, or the best-fit plane to points in three dimensions, can be determined by finding the smallest eigenvalue and corresponding eigenvector of a symmetric positive semi-definite matrix (Pearson 1901). Under the inversive transformation, the point on the fitted plane closest to the origin becomes the point furthest from the origin when inverted back to normal space. The origin and this furthest point define the diameter of the sphere, since they both lie on the sphere (depicted as a dashed line in Figure S4a). Because the inverted space shifts the spatial relationships between points, we use a weight of $d_i^4$ for the plane LSF, where $d_i$ is the distance of each point, $i$, from the inversion point in normal space.

Making the reasonable assumption that the LSF sphere passes through at least one of the data points, the set of surface points is then fitted around each surface point to generate a set of possible solutions, and the fitted sphere with the least sum of squares is kept as the best fitted sphere solution. The sphere radius is mathematically referred to as the “radius of curvature”. To determine whether the surface is convex or concave, we calculate the distance between the center of the LSF sphere and the relevant atom center, and assign the surface as concave if the distance is less than the diameter of the LSF sphere, and convex otherwise. The complete algorithm, which includes the transformation, plane fitting, and inversion about each point, is detailed in Listing 1 below, and further details on the method and its validation are available in Coleman et al. (2005).

All algorithms described in these supplementary materials were implemented in Java, and use Java3D libraries for vector math and JAMA libraries for solving eigenvalue problems.
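The fitting procedure described above can be sketched compactly in Python. This is a standard-library illustration, not the authors' Java implementation. The function names are hypothetical, a small Jacobi eigenvalue routine stands in for the JAMA eigen-solver, and the fitted plane is assumed to be the weighted total-least-squares plane through the weighted centroid of the inverted points. Input points are assumed to be distinct.

```python
import math

def invert(pt, center, k=1.0):
    """Geometric inversion of pt about a sphere of radius k at center."""
    d = [a - b for a, b in zip(pt, center)]
    d2 = sum(c * c for c in d)
    return [b + k * k * c / d2 for b, c in zip(center, d)]

def smallest_eigenvector(m, sweeps=50):
    """Cyclic Jacobi iteration on a symmetric 3x3 matrix; returns the unit
    eigenvector belonging to the smallest eigenvalue."""
    a = [row[:] for row in m]
    v = [[float(i == j) for j in range(3)] for i in range(3)]
    for _ in range(sweeps):
        off = a[0][1] ** 2 + a[0][2] ** 2 + a[1][2] ** 2
        if off <= 1e-30 * (a[0][0] ** 2 + a[1][1] ** 2 + a[2][2] ** 2):
            break
        for p, q in ((0, 1), (0, 2), (1, 2)):
            if a[p][q] == 0.0:
                continue
            theta = (a[q][q] - a[p][p]) / (2.0 * a[p][q])
            t = math.copysign(1.0, theta) / (abs(theta) + math.sqrt(theta * theta + 1.0))
            c = 1.0 / math.sqrt(t * t + 1.0)
            s = t * c
            for mat in (a, v):                      # column rotation: A J and V J
                for i in range(3):
                    mp, mq = mat[i][p], mat[i][q]
                    mat[i][p], mat[i][q] = c * mp - s * mq, s * mp + c * mq
            for i in range(3):                      # row rotation: J^T (A J)
                ap, aq = a[p][i], a[q][i]
                a[p][i], a[q][i] = c * ap - s * aq, s * ap + c * aq
            a[p][q] = a[q][p] = 0.0
    col = min(range(3), key=lambda j: a[j][j])
    return [v[0][col], v[1][col], v[2][col]]

def fit_sphere(points, k=1.0):
    """Try every point as the inversion point and keep the sphere with the
    smallest sum of squared radial residuals."""
    best = None
    for idx, p in enumerate(points):
        others = [x for j, x in enumerate(points) if j != idx]
        w = [math.dist(x, p) ** 4 for x in others]          # weights d_i^4
        t = [invert(x, p, k) for x in others]
        wsum = sum(w)
        cen = [sum(wi * ti[c] for wi, ti in zip(w, t)) / wsum for c in range(3)]
        cov = [[sum(wi * (ti[r] - cen[r]) * (ti[c] - cen[c]) for wi, ti in zip(w, t))
                for c in range(3)] for r in range(3)]
        n = smallest_eigenvector(cov)                       # plane normal
        h = sum((cen[c] - p[c]) * n[c] for c in range(3))   # signed distance p -> plane
        if abs(h) < 1e-12:
            continue                                        # flat patch: infinite radius
        a = [p[c] + h * n[c] for c in range(3)]             # plane point closest to p
        a_prime = invert(a, p, k)                           # far end of the diameter
        center = [(p[c] + a_prime[c]) / 2.0 for c in range(3)]
        radius = math.dist(center, p)
        sse = sum((math.dist(x, center) - radius) ** 2 for x in points)
        if best is None or sse < best[0]:
            best = (sse, center, radius)
    if best is None:
        raise ValueError("no finite sphere could be fitted")
    return best[1], best[2]

# Example: dots sampled from a cap of a sphere with center (1, -2, 3), radius 5.
cap = [(1.0 + 5.0 * math.sin(0.2 + 0.1 * i) * math.cos(2.0 * math.pi * j / 8),
        -2.0 + 5.0 * math.sin(0.2 + 0.1 * i) * math.sin(2.0 * math.pi * j / 8),
        3.0 + 5.0 * math.cos(0.2 + 0.1 * i))
       for i in range(6) for j in range(8)]
fitted_center, fitted_radius = fit_sphere(cap)
```

Each candidate costs O(n) work and there are n candidates, so the sketch runs in O(n²) overall, which is why the example uses a few dozen dots and not a full dot surface.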
The curvature used here is the “global” curvature, where a single sphere is fit to the molecular surface. Our curvature calculation is intuitive but admittedly simple, and we are investigating whether localized curvatures can improve the model.

1. Define a set of points $P$ to which the least-sum-of-squares sphere is to be fitted.
2. For each point $p_i \in P$: O(n)
   a. Let $p_i$ be the inversion point $(p, q, r)$ and let the points $\{x_i, y_i, z_i\}$ be all other points in $P$. O(1)
   b. Invert $\{x_i, y_i, z_i\}$ using the inversion defined in the methods to generate points $t_i$. O(n)
   c. Find the least-sum-of-squares plane fit to the points $t_i$. O(n)
   d. Find the point on the plane closest to $p_i$. Call this point $a$. O(1)
   e. Transform $a$ using the inversion defined in the methods to generate $a'$. O(1)
   f. Define the sphere center, $c_i$, as the average of $p_i$ and $a'$. O(1)
   g. Define the radius for the sphere given center $c_i$. O(1)
   h. If the least sum of squares is lower than the previous best fit, keep $c_i$ and the radius. O(n)
3. Output the best center and radius found.

Listing 1. Listing of the algorithm developed for finding the least-squares fitted sphere to a set of points. Algorithmic complexity for each step is given on the right.

References

1. Coleman, R.G., Burr, M.A., Souvaine, D.L. & Cheng, A.C. An intuitive approach to measuring protein surface curvature. Proteins 61, 1068–1074 (2005).
2. Connolly, M.L. Measurement of protein surface shape by solid angles. J. Mol. Graphics 4, 3–6 (1986).
3. Edelsbrunner, H. & Koehl, P. The weighted-volume derivative of a space-filling diagram. Proc. Natl. Acad. Sci. U.S.A. 100, 2203–2208 (2003).
4. Laskowski, R.A. SURFNET: A program for visualizing molecular surfaces, cavities, and intermolecular interactions. J. Mol. Graph. 13, 323–330 (1995).
5. Liang, J., Edelsbrunner, H., Fu, P., Sudhakar, P.V. & Subramaniam, S. Analytical shape computation of macromolecules: I. Molecular area and volume through alpha shape. Proteins 33, 1–17 (1998).
6. Pearson, K. On lines and planes of closest fit to systems of points in space. The Philosophical Magazine 2, 559–572 (1901).
7. Sridharan, S., Nicholls, A. & Honig, B. A new vertex algorithm to calculate solvent accessible surface area. Biophys. J. 61, A174 (1992).