PocketDepth: A new depth based algorithm for identification of ligand binding sites in proteins

Yeturu Kalidas and Nagasuma Chandra*

Bioinformatics Centre and Supercomputer Education and Research Centre, Indian Institute of Science, Bangalore, 560012, India

* Correspondence to: Dr. Nagasuma Chandra, Bioinformatics Centre & SERC, Raman Building, Indian Institute of Science, Bangalore 560 012, INDIA
Tel: +91-80-23601409, 22932892
Fax: +91-80-23600551
E-mail: nchandra@serc.iisc.ernet.in

Keywords: Protein structure, binding site, DBSCAN clustering, grid based method, depth factor

Abstract

Computational methods for identifying and predicting functional sites in protein structures are increasingly becoming important in structural biology and bioinformatics not only for understanding the function of the molecule in detail but also for structure-based design of possible ligands and potential drugs as well as modified protein molecules. While there are a few structure based prediction methods already available, given the complexity and diversity of protein structural types, there is still a great need to explore newer methods and concepts to develop accurate, versatile and efficient binding site prediction algorithms. We have developed a new method PocketDepth, for identification of binding sites in proteins. The method is purely geometry-based and proceeds in two stages, labeling of grid cells with depth factors followed by a depth based clustering that uses neighbourhood information. Depth is an important parameter considered during protein structure visualization and analysis but has been used more often intuitively than systematically. Our current implementation of depth reflects how central a given sub-space is to a putative pocket rather than reflecting merely how far away it is situated from the nearest external surface of the protein. We have tested the algorithm against PDBbind, a large curated set of 1091 proteins obtained from PDB.
A prediction was considered a true-positive if the predicted pocket had at least 10% overlap with the actual ligand. The prediction accuracy using this set was about 96%. Moreover, 87% of the true-positives were identified within the first five ranks for each protein, of which 55% are in the first rank itself. 77% of the predictions had at least 50% overlap with the experimentally observed ligand. High prediction rates were again observed when the method was tested against a data-set of apo-proteins and compared with their respective ligand complexes. A comparison of our method with four other widely used methods for a chosen representative set is also presented.

Introduction

It has long been recognized that understanding ligand binding to a protein molecule holds the key to understanding the function of the molecule. The success of the structural genomics and high-throughput structural biology projects is leading to a significant increase in protein structural data (Congreve, 2005; Burley, 2000). A challenge that is emerging out of these is to identify function(s) from structure. Even when protein structures are determined crystallographically as a complex with a ligand, a complete description of their binding sites is not obtained, because they may not be complexed with all the ligands required for the function of the molecule, or the complexed ligands are often substitutes of the natural ligands. A key step in the process of gaining functional insights from protein structures is therefore identification of all relevant binding sites in protein molecules. A further requirement for accurate identification of binding sites comes from the observation of moonlighting of protein molecules (Jeffery 1999), where many protein molecules have been found to have more than one function, quite often through different binding sites or even different binding modes at overlapping sites on the same protein.
Even where crystal structures are available, they are rarely available as complexes with the different ligands that may be required for moonlighting, hence making prediction by computational methods very important. The need for accurate prediction of binding sites is also accentuated by the requirement for better definitions of suitable pockets for use in structure-based drug design. Further, knowledge of possible binding sites in protein structures will also enable us to analyse and classify proteins in terms of their ligand recognition profiles. In some cases, proteins containing different folds are known to have a common function in terms of the ligand they recognize. ATP binding proteins (Stockwell and Thornton 2006) or sugar binding proteins (Ramachandraiah and Chandra, 2000; Taroni et al., 2000) are some such examples. There are many reports in literature addressing such questions, but most of them start with an implicit premise that the crystallographically observed ligand binding mode is the most optimal for the given protein ligand pair. Having an independent description of the binding sites will enable studying these aspects with a different perspective.

A number of methods have been developed so far, many of them within the last two to three years (Bhinge et al., 2004; Laurie and Jackson 2005; Huang and Schroeder 2006; Glaser et al., 2006; Soga et al., 2007; Chakrabarthi and Lanczycki 2007), indicating high interest in this area. They can be broadly classified into (a) geometry based and (b) energy based methods. The geometry based methods are generally known to be faster, while the energy based methods score better in terms of the accuracy of the sub-pockets predicted.
Some examples of the geometry based methods are LigsiteCSC (Huang and Schroeder 2006), CASTp (Liang et al., 1998), PASS (Brady and Stouten 2000), LigandFit (Venkatachalam et al., 2003), VOIDOO (Kleywegt and Jones 1994), APROPOS (Peters et al., 1996), LIGSITE (Hendlich et al., 1997) and SURFNET (Glaser et al., 2006), while examples of energy based methods are GRID (Goodford 1985), Pocket finder (An et al., 2005), Q-SiteFinder (Laurie and Jackson 2005), desolvation based free-energy models (Coleman et al., 2006) and solvent mapping models (Landon et al., 2007). Roterman and co-workers have also reported identification of active sites based on the characteristics of the spatial distribution of hydrophobicity in a protein molecule, using a fuzzy-oil-drop model (Brylinski et al., 2007). The different methods focus on different properties such as size, hydrophobicity, energy potential, solvent accessibility, desolvation energy or residue propensity for representing and hence analyzing the pockets. The chosen descriptor directly influences the quality of prediction. Hence it is important to explore the use of different features to represent protein molecules and subsequently predict binding sites. Here we report a new geometry based method that divides a putative pocket into sub-spaces in a grid and computes their depths within the pocket. The depths are subsequently used to retain and cluster only the high-depth sub-spaces, thus utilizing the information of the neighbourhood of the relevant atoms in putative sites. High prediction accuracies were obtained using this algorithm, both in terms of the number of correct predictions and the extent of correctness of each prediction. Extensive validation with a large dataset as well as benchmarking with four other prediction methods have also been carried out.
Results and Discussion

The concept of depth factor and its implementation in pocket identification

Depth is an important parameter considered during protein structure visualization and analysis and has been used more often intuitively than systematically. Earlier reports of the formalization of depth as a parameter have pertained either to considering the depth of a residue as the average of the constituent atom depths, where depth is defined as the distance of the atom from the nearest surface water molecule (Chakravarty and Varadarajan, 1999), or to considering it as the distance of a non-hydrogen atom from its closest solvent-accessible protein neighbour (Pintar et al., 2003; Varrazzo et al., 2005). These studies have shown the usefulness of depth in gaining an insight into the protein interior and studying protein folding and stability. Depth has also been found to be correlated with several molecular, residue and atomic properties, such as average protein domain size, protein stability, free energy of formation of protein complexes, amino acid type hydrophobicity, residue conservation and hydrogen/deuterium amide proton exchange rates (Pintar et al., 2003). Depth has also been used to identify binding sites through the identification of superficial depressions from the centre of gravity of a given protein (Caprio et al., 1993). Though these studies vary in detail regarding the precise definition of depth, they demonstrate that depth is a more useful metric than others such as accessible surface area, in protein structure analysis. In a recent study, Coleman and Sharp have formalized depth as a shape descriptor termed Travel-Depth for describing protein surfaces and show the usefulness of this parameter in protein structure, binding site and channel analysis (Coleman and Sharp, 2006). Depth in their work refers to the physical distance a solvent molecule would have to travel from a surface point to a suitably defined reference surface.
These implementations take into account how deep a given pocket or residue is situated from an outermost surface point of the protein. When depth is defined in this manner, an underlying assumption is that two depth vectors of the same length are equivalent. For example, two atoms situated at the same distance from their nearest external surfaces of the protein would be considered to contribute equally to the stability or packing of the protein. In the same way, two atoms at the same depth lining two different putative pockets would be ranked equally during binding site identification. In reality however, this would not be the case, because these implementations disregard the neighbourhood of the atom. In addition to the distance of a given atom from the nearest external surface, the number of different paths that exist for traversal of a hypothetical probe from all surface atoms in the vicinity to the given atom would also have a significant influence on the properties of the given atom. In the context of pocket or cavity detection, this translates to knowing how central a given putative pocket cell is to the pocket, or in other words how many interactions an atom located in that cell can have in that site and therefore how important that pocket cell is for the pocket. Here we seek to address this issue by defining depth differently. Our implementation is based on dividing a given space into multiple subspaces using a grid and weighting the importance of a given subspace within the grid, based on the number of times a depth vector can be drawn through that subspace to the external surface of the protein (Figure 1). A subspace here refers to one or more grid cells connected through common vertices.
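The counting idea behind the depth factor can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the sampling-based rasterisation of a bar and the decision to leave the two end cells uncounted are assumptions made for illustration. The 2 to 15 Å pair-distance window and the rule that a bar must not pass through the protein interior are taken from the algorithm description later in this paper.

```python
from collections import Counter
from itertools import combinations
import math


def cells_on_bar(start, end):
    """Grid cells visited by a straight bar between two cell indices.

    Assumption: the bar is rasterised by sampling the segment twice per
    cell width and rounding to the nearest cell.
    """
    steps = max(1, math.ceil(math.dist(start, end) * 2))
    cells = []
    for s in range(steps + 1):
        t = s / steps
        cell = tuple(round(a + t * (b - a)) for a, b in zip(start, end))
        if not cells or cells[-1] != cell:
            cells.append(cell)
    return cells


def depth_factors(boundary_cells, interior_cells, d_min=2.0, d_max=15.0):
    """Count, for every grid cell, how many grid bars pass through it.

    boundary_cells: cell indices holding putative boundary atoms.
    interior_cells: set of cell indices flagged as protein interior.
    With a 1 Angstrom cell width, index distances equal distances in Angstrom.
    """
    counts = Counter()
    for a, b in combinations(boundary_cells, 2):
        if not d_min <= math.dist(a, b) <= d_max:
            continue
        # Assumption: only the cells strictly between the two atoms count.
        inner = cells_on_bar(a, b)[1:-1]
        if any(cell in interior_cells for cell in inner):
            continue  # the bar would cross the protein interior
        counts.update(inner)
    return counts


# Toy example: four boundary atoms at the corners of a 4 x 4 square.
corners = [(0, 0, 0), (4, 0, 0), (0, 4, 0), (4, 4, 0)]
df = depth_factors(corners, interior_cells=set())
print(df[(2, 2, 0)], df[(1, 0, 0)])
```

In the toy example the centre of the square is crossed by both diagonals and therefore receives a higher count than a cell on an edge, which is the sense in which the depth factor measures how central a cell is.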
The depth information is obtained by first flagging grid cells as internal, external or surface and then drawing grid bars between all pairs of surface atoms within a chosen threshold, leading to incremental counts of the traversed grid cells. Such cells are then clustered based on cumulative depth counts as well as spatial proximity. The depth factors thus obtained have then been utilized to guide the clustering process, enhancing the accuracy of binding site identification. To demonstrate the sensitivity of the clustering process, the fraction of grid cells that participated in one or the other cluster, among all grid cells that participated in grid bars, was computed for each protein in the dataset (Figure 2a). It can be seen that in the majority of proteins, less than a tenth of the grid-bar cells belonged to meaningful clusters, clearly indicating the usefulness of the depth factor in enhancing the sensitivity of the clustering process. This also means that, in most cases, only about a hundredth of the total volume of a protein is capable of binding any ligand (Figure 2b). In other words, ligand binding is possible in only about one in a hundred pieces of the possible volume in a protein, clearly in line with the common knowledge that ligands bind through specific sites.

Prediction accuracies

A comprehensive set of experiments was performed in order to evaluate the binding site prediction abilities of PocketDepth and to ascertain the possible applications of such a prediction tool. In order to assess the quality of predictions obtained, we tested the algorithm against PDBbind, a large and well curated dataset that has been made available recently (Wang et al., 2004; Wang et al., 2005). The predictions obtained were analysed in terms of the rank of the pocket for the correct prediction and the extent of overlap of the predicted pocket with the crystallographically determined ligand location.
The number of common residues between the predicted and observed sites was also analyzed. The algorithm was also assessed by testing against a database of 48 apo proteins and their corresponding ligand complexes. Measuring the prediction success accurately is not a simple task, since we first need to know precisely what the ‘correct’ binding site is for a given protein under physiologically meaningful conditions. Such information may not always be available, given difficulties such as the stability of enzyme-substrate complexes. For practical purposes however, in this study as in many studies of this nature, we have to compare the predictions with the available experimentally observed protein-ligand structures. Different methods have been used for defining a ‘true-positive’ in different algorithms. For example, in Q-SiteFinder, a minimum of 25% overlap between probes in the predicted pocket and 1.6 Å around any ligand atom is considered a true-positive, whereas in Ligsite, the accuracy is measured by the percentage of predicted pocket atoms that are in contact with the ligand. A protein atom and a ligand atom are considered to be in contact if they are found to be within a distance of the sum of their van der Waals radii plus 0.5 Å. The difficulties in comparing different algorithms are described by Huang and Schroeder (Huang and Schroeder 2006). In their study reporting LigsiteCSC, a common criterion for success using a distance-based approach has been used, wherein a geometric centre of the pocket site's grid points is computed to represent the predicted pocket and the prediction is considered a hit if it is within 4 Å of any ligand atom. LigsiteCSC also considers residues lining the geometric centre of the predicted pocket within 8 Å and ranks the pockets based on the extent of conservation of these residues.
While this method provides a common framework for comparison, it does so at the cost of resolution of information, since the entire pocket is represented as a single point. We feel it is reasonable and sufficiently informative to measure prediction success by considering volume overlap with the actual ligand, similar to the approach adopted by Q-SiteFinder. To get a clearer idea of the extent of correctness of the prediction, we also report Tanimoto graphs and common residue graphs. A prediction was considered a true-positive when there was at least 10% overlap in volume with the ligand atoms in the corresponding crystal structure. Each ligand was also represented by grid cells in the same framework, for each protein, as that used for predicting pockets. Non-hydrogen ligand atoms were assigned to individual grid cells, around which one layer of grid cells was considered to ensure capturing the entire ligand molecule. Those grid cells of the cluster surrounding the ligand cells within 1.5 Å were considered to be common to the predicted cluster and the ligand. The overlap score with respect to the ligand is defined as the ratio of the number of common grid cells to the number of ligand cells. By using two parameter sets for controlling the shapes and extents of binding pockets, which are explained in detail in the methods section (Table 1A), we obtained the following prediction accuracies. Of the 1123 ligands contained in the set of 1091 proteins in the PDBbind dataset used in the study, ligand binding sites were predicted correctly for 860 ligands corresponding to 841 proteins with the stringent parameter set (Table 1B, deeper), where the minimum depth factor is set to 4. When the remaining 263 ligands that were false negatives in the first scan were scanned with a second parameter set with less stringent criteria (Table 1B, surface), wherein the minimum depth factor was reduced to 1, 222 ligands corresponding to 215 proteins were predicted correctly.
Considering these two applications of parameters together, it amounts to correct predictions of 1082 ligands out of 1123, from 1053 proteins out of 1091, leading to 96% accuracy both in terms of the ligand and the protein. When the second (surface) parameter set alone was used to scan all the 1091 proteins, 1075 ligands from 1046 proteins were identified correctly, again amounting to 95.7% prediction accuracy for ligands and proteins. However, the quality of prediction was significantly better with the stringent parameter set as discussed below, since it led to a more precise delineation of pocket boundaries than the surface parameter set.

Location and overlap: The extent of overlap between a predicted site and a crystallographically observed site automatically helps in defining whether the correct location has been identified as the binding site. The protocol used here for flagging a prediction as a true-positive implicitly considers this aspect. Figure 3a indicates the extent of overlap between the observed and the predicted pockets in the dataset studied here. It can be seen that the predicted pockets encompass the ligand entirely in a large number (45.6%) of cases. In some other cases, a significant part but not the whole ligand is in the predicted pocket, while in a few other cases, the ligand overlapped to a lesser extent with the predicted pocket. In addition, when analyzed in terms of the residues lining the actual ligand versus those in the predicted pocket, it was observed that most (85.4%) of the predicted pockets overlapping with the ligand had above 60% of residues in common with those within 4.0 Å of the ligand (Figure 3b). It is also important to consider the size of the predicted pocket, because very large pockets will encompass ligands well, but lose precision in terms of defining pocket boundaries.
Therefore, a comparison of the ligand occupied volume versus the volume of the predicted pockets was computed (Figure 3c), using a Tanimoto quotient $t$, where

$$t = \frac{|\mathrm{LigOccupiedCells} \cap \mathrm{SiteClusterCells}|}{|\mathrm{LigOccupiedCells} \cup \mathrm{SiteClusterCells}|}$$

Even with this stringent measure, the overall success of the algorithm can be clearly seen, where nearly 43.7% of the predicted pockets have at least 50% of matching volume with the observed ligand.

Ranks: Protein structures often contain multiple pockets, out of which only one or two may be biologically significant. It is therefore important for the correctly predicted pockets to have the topmost ranks. Here we use the volume of the pocket as a metric for ranking and use the number of non-polar, polar and total number of protein atoms surrounding the pocket as additional metrics, as illustrated in Figure 4a. It is clear that as many as 50% of predictions are ranked first, and 78% are in the first three ranks, when sorted by the volume of the pocket. Similar rankings were observed with the total number of protein atoms or the non-polar or polar atoms surrounding the predicted pocket. Further, in cases of multimeric proteins such as dimeric neuraminidase, the predicted pockets with the top two ranks correspond to the two binding sites, one on each subunit, in many cases (Figure 4b).

The results obtained using the dataset of 48 apo proteins and their corresponding ligand complexes (apo-plc) are shown in Table 2. Of the 78 ligands corresponding to 46 proteins, 70 ligands corresponding to 45 proteins are predicted correctly. Considering the coverage of the ligand pocket by any predicted cluster, we observed that out of the 70 true positive ligands, 53 (75%) have more than 90% overlap with respect to ligand volume and half of the true positive ligands (35) have a Tanimoto coefficient of more than 30%.
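The two overlap measures used in this section, the ligand overlap score and the Tanimoto quotient, can be written compactly over sets of grid-cell indices. The Python sketch below is illustrative only; the function names are hypothetical, and for simplicity it treats a cell as common when the same index occurs in both sets, whereas the paper counts cluster cells within 1.5 Å of ligand cells as common.

```python
def overlap_score(ligand_cells, pocket_cells):
    """Fraction of ligand cells that are also part of the predicted pocket."""
    ligand = set(ligand_cells)
    return len(ligand & set(pocket_cells)) / len(ligand)


def tanimoto(ligand_cells, pocket_cells):
    """Shared cells divided by all cells covered by either set."""
    ligand, pocket = set(ligand_cells), set(pocket_cells)
    return len(ligand & pocket) / len(ligand | pocket)


def is_true_positive(ligand_cells, pocket_cells, threshold=0.10):
    """True-positive criterion of the paper: at least 10% ligand overlap."""
    return overlap_score(ligand_cells, pocket_cells) >= threshold


# Toy example: a 4-cell ligand and a 6-cell pocket sharing 2 cells.
ligand = [(x, 0, 0) for x in range(0, 4)]
pocket = [(x, 0, 0) for x in range(2, 8)]
print(overlap_score(ligand, pocket), tanimoto(ligand, pocket))
```

The example shows why the Tanimoto quotient is the more stringent of the two: a large pocket can cover half of the ligand cells and still score low, because every extra pocket cell enlarges the denominator.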
With an overlap of greater than xx%, 29 (41.4%) of the true positive ligands are in the top rank itself and 58 (82.9%) of them are in the top five ranks. Significant conformational changes have been reported for xx of these 48 proteins. The fact that the algorithm was capable of predicting the pockets in the apo proteins indicates that it can perform well even when the protein is not in the ligand-bound conformation.

Sensitivity of the prediction accuracy to particular SCOP families, the size of the protein and the size of the ligand was also analyzed for the PDBbind dataset. This is important because enzymes are generally known to have buried pockets, which are hence more easily identifiable than those of many other proteins that show differences in the topography of the binding sites. The SCOP classes of the proteins in the dataset were obtained to the super-family level, and the number of correct predictions as well as false negatives were plotted for each SCOP category, as illustrated in Figure 6. The results indicate that the true positives range over 202 of the 206 super-families present in the dataset, indicating that the predictions are independent of the structural class or the fold of the protein. Similarly, graphs plotted to test whether predictions depended upon either the size of the protein or the size of the ligand indicate that the predictions were independent of both the protein size and the ligand size.

Benchmarking: Next, an extensive benchmarking of our predictions vis-a-vis those from the established binding site algorithms CASTp (Liang et al., 1998), Q-SiteFinder (Laurie and Jackson 2005), LigsiteCSC (Huang and Schroeder 2006) and LigandFit (Venkatachalam et al., 2003) was also carried out. For this purpose, datasets reported by the authors of these methods have been used. Table 2 indicates the percentage of true positives obtained for the different datasets using our method.
The performance of PocketDepth was found to be good with all datasets. The prediction accuracies were better than those reported for the individual methods in most cases using the same datasets. Q-SiteFinder, an energy based method, was observed to predict regions of the pockets most accurately among the other methods, albeit with varying extents of overlap with the actual ligand. Our method, purely based on geometry, performed as well, if not better, in terms of identifying the correct pocket with the top ranks, and in many cases with a higher extent of overlap. In a few cases however, the ranks reported by Q-SiteFinder were superior for sub-pockets corresponding to ligand sub-structures, but not the whole ligand itself. Figure 5 shows pocket prediction for six examples from different protein folds and families, using PocketDepth and four other methods, demonstrating that our method outperforms the others in the majority of cases. The examples are chosen to reflect different types of binding sites: an enzyme (alcohol dehydrogenase: 1A72), a carbohydrate binding site (peanut lectin: 2PEL), a nucleotide binding site (RecA protein: 2G88), a DNA binding site (transcription factor: 1AIS) and peptide binding sites (HIV protease: 1SP5 and MHC molecule: 1A1M). It can be seen that PocketDepth performed well, and often the best, in (a) predicting the site as a top ranking cavity and (b) more importantly, in marking the boundaries of the predicted cavity. It was also observed that our method fared significantly better in identifying the entire pocket as one cavity, as against the prediction of several pockets in a single binding site from other methods such as CASTp and Q-SiteFinder, indicating the usefulness of the clustering technique used here. Some examples can be seen in Figure 5a to 5f.
An added advantage of our method compared to those such as CASTp is that fewer pockets are predicted per protein molecule, thus significantly increasing the sensitivity of the prediction. In summary, by considering depth in the sub-spaces available in a pocket as defined and implemented in our method, we come to understand how central, and not merely how deep, a given space is to a pocket. These centrally located cells with high depth counts must be taken into account while designing ligands for a pocket. Our understanding of whether a prediction is successful or not is necessarily based on the position of the ligand in the crystallographically determined structures. While this is the best measure that may be currently available, it must be pointed out that the crystal structures may not always contain the natural substrate but may contain an inhibitor, which may be smaller than the real substrate. Nor do they show us the possible re-orientation of a ligand in a given site during catalysis or upon binding another natural ligand in an adjacent site. Irrespective of the accuracy of the prediction in terms of how closely it mimics the site of the actual ligand, the pockets give us an idea about the empty spaces around the ligand that might be accessible to the ligand, which can be utilized appropriately while designing inhibitors or gaining further functional insights, such as in the case of moonlighting proteins.

The Algorithm

The major steps in the algorithm are: grid construction, grid cell labeling, drawing grid bars, computing depth factors, clustering and ranking, as shown in the flowchart (Figure 7). A stepwise description is given below.

Step-1: Grid construction: For a given PDB file of a protein, a 3D grid is constructed to encompass the minimum and maximum coordinate values along each of the X, Y and Z axes, with a cell size of 1 Å.
Each point (x,y,z) of the protein in 3D Cartesian space is mapped onto a 3D index (i,j,k), where i, j and k represent the number of cells starting from index 0 along each of the 3 dimensions.

Step-2: Defining cell properties and labelling cells: After defining a grid based indexing scheme, the data structures were then defined to represent the properties of interest for each grid cell. (i) Each atom is mapped to an appropriate grid cell by taking the offset of the co-ordinates in all three axes from the corresponding minima, as indicated in the formula

$$\left( \frac{x_i - x_{min}}{CELL\_WIDTH},\ \frac{y_i - y_{min}}{CELL\_WIDTH},\ \frac{z_i - z_{min}}{CELL\_WIDTH} \right)$$

where $x_i$, $y_i$ and $z_i$ indicate the (x,y,z) fields of the ith site point; $x_{min}$, $y_{min}$ and $z_{min}$ indicate the minimum of all the coordinate values along each of the 3 dimensions for the given protein; and CELL_WIDTH indicates the fineness of the grid (1 Å in our case). (ii) Every grid cell is classified by examining its neighbourhood as (a) internal, those grid cells within 2 Å of each atom occupied grid cell and those that are surrounded on six faces by other internal cells, or (b) external, the cells remaining after (a). The flagged external cells were further classified as (c) surface cells, if an atom occupied cell could be found within 3 Å. Further, some external cells would be classified as (d) site cells after drawing grid bars, as explained in the following steps.

Step-3: Finding connected list of surface cells: A depth first traversal from a surface cell is carried out to locate connected surface cells as follows: the grid cells are marked as they are traversed recursively along six faces from each surface cell. When another surface cell is encountered, the traversal continues; otherwise it closes and moves to the next surface cell in the list. A connection is defined to exist between a pair of surface cells if one is an immediate neighbour of the other in any of the six directions.
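Steps 1 to 3 can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' code: the function names are hypothetical, taking the integer part of each offset with `floor` is an assumption consistent with indices that start at 0, and an explicit stack replaces the recursive traversal described in Step-3 so that large grids do not hit Python's recursion limit.

```python
import math

# The six face neighbours of a grid cell.
FACES = [(1, 0, 0), (-1, 0, 0), (0, 1, 0), (0, -1, 0), (0, 0, 1), (0, 0, -1)]


def to_index(point, minimum, cell_width=1.0):
    """Map a 3D coordinate to its (i, j, k) grid index (Step-2 formula)."""
    return tuple(math.floor((p - m) / cell_width) for p, m in zip(point, minimum))


def connected_surface_cells(surface_cells):
    """Group surface cells into face-connected sets (Step-3).

    Depth-first traversal with an explicit stack; two cells are connected
    when they are immediate neighbours across one of the six faces.
    """
    remaining = set(surface_cells)
    components = []
    while remaining:
        seed = remaining.pop()
        component, stack = {seed}, [seed]
        while stack:
            i, j, k = stack.pop()
            for di, dj, dk in FACES:
                neighbour = (i + di, j + dj, k + dk)
                if neighbour in remaining:
                    remaining.remove(neighbour)
                    component.add(neighbour)
                    stack.append(neighbour)
        components.append(component)
    return components


# Toy example: one atom coordinate and four surface cells.
print(to_index((12.7, 3.2, 5.0), minimum=(10.0, 3.0, 4.5)))
groups = connected_surface_cells([(0, 0, 0), (1, 0, 0), (1, 1, 0), (5, 5, 5)])
print(sorted(len(g) for g in groups))
```

Each connected set returned here corresponds to one list of connected surface cells, whose size is compared against the threshold in Step-4.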
Step-4: Defining putative boundary atoms: If the total number of connected surface cells is below a threshold value (default is 500), then the set is reported as an internal void of the protein, so that it can be processed separately. If the number of connected surface cells is above the required threshold, those atoms that surround the present set of surface cells are stored as a list of putative boundary atoms for further processing.

Step-5: Deriving the depth factor: All pairs of boundary atoms that lie within a threshold distance range (2 to 15 Å) are considered for drawing grid bars between them. However, a grid bar is only drawn between a pair of boundary atoms if it does not pass through the interior of the protein. A grid bar is drawn by traversing a trail of grid cells from the grid cell (i1,j1,k1) occupied by the starting boundary atom to the grid cell (i2,j2,k2) occupied by the target boundary atom. The grid bar is drawn such that the target is reached by the shortest Euclidean path from the starting atom. The counter in each grid cell traversed by the bar is cumulatively incremented. The final count for each grid cell, called the depth factor (DF) in our method, corresponds to the number of times a grid bar has traversed through it and hence indicates the density of atoms in its neighbourhood, or in other words how central a given grid cell is to a putative pocket.

Step-6: Second pass scanning of inter grid bar spaces between boundary atoms: Some cells may get trapped between grid bars without being covered during Step-5 and hence possess zero DF values. In order to find and assign appropriate DF values to such cells, for each boundary atom all other boundary atoms within 10 Å around it are considered and the centroid of this set of atoms is computed.
If the centroid falls in the interior of the protein, it is ignored; otherwise a 5 Å cube centered about the centroid is considered, and the average DF of all the cells within that cube touched by grid bars is calculated and assigned to all the remaining external cells.

Step-7: Reporting the labelled grid cells: The grid cells with non-zero DF values are reported as dummy atoms in Protein Data Bank (PDB) format, where cell based indices converted back to 3D coordinates (using the inverse of the formula in Step-2) are written in the coordinate fields and the DF of the cell is written into the temperature factor field, for ease in using standard structure visualization software.

Step-8: Clustering of the site points to identify number and extents of pockets: The density based clustering scheme DBSCAN (Ester et al., 1996), which is suitable for identifying clusters of arbitrary shapes, is adopted here for determining clusters representing binding sites. However, the method we adopted differs from actual DBSCAN in two aspects: (1) a minor variation in the implementation of finding nearest neighbours for a given point and (2) flexible application of clustering parameters based on DF. The method we use for computing neighbouring points for a given point is different from that of the DBSCAN implementation in the sense that here we use simple and efficient data structures suitable for 3 dimensional points. Three sorted arrays of the site points are created, one corresponding to each axis. On a nearest neighbourhood query for a point (x1,x2,x3), a binary search is done on each of the sorted arrays Aj corresponding to the axis along the jth dimension (j ∈ [1..3], corresponding to the X, Y and Z axes). These searches find the set of site points possessing values between xj-ε and xj+ε for the jth dimension, where ε is an offset value computed based on the distance parameter of the neighbourhood query. The set of points determined when processing each dimension are flagged with an incremental count.
Finally, those points that possess a count equal to the number of dimensions, i.e. 3, and that are within the given Euclidean distance are reported as neighbours. The output of this initial neighbourhood search is further filtered by imposing bounds on the DF of grid cells. Once the method of serving nearest neighbour queries has been established, the clustering proceeds as described in DBSCAN (Ester et al., 1996) using the concepts of density-reachability, density-connectivity and core-points, but with the added flexibility of user specified ranges of DF values determining the clustering radius and the minimum number of neighbours. The clustering radius and the minimum number of grid bar cells in that radius define the clustering stringency. The flexibility in our implementation of the DBSCAN algorithm is that it reads clustering stringencies and the corresponding ranges of DF values for a grid cell from a clustering parameter file. Clustering stringency is adjusted in accordance with the depth of the cells as follows: a site cell having a depth factor in the range of 0 to 1.0 should contain at least 200 site points as neighbours within a radius of 3 Å, whereas site cells that have depth factors above 4.0 require only 50 site point neighbours within a radius of 3 Å to form a cluster. The clustering parameters used in this study are described in Table 1A. We observe that the scenarios at the binding sites encountered in the majority of protein structures can be described in two ways: deep pockets and shallow depressions. While the algorithm can detect both these types, the parameters need to be tuned to determine the exact bounds of the predicted pocket. We therefore use two different parameter sets, wherein a given protein is first scanned with the more stringent ‘deeper’ parameter set, and the protein points not identified as putative pocket cells are then scanned with the less stringent ‘surface’ parameter set. Clusters predicted with the ‘deeper’ set are given higher priority while ranking the clusters.
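The neighbourhood query over three sorted arrays and the DF-dependent stringency lookup described in Step-8 can be sketched as follows. This is an illustrative Python sketch under stated assumptions: the class and function names are hypothetical, the offset ε is taken to be equal to the query radius, and `EXAMPLE_RULES` holds only the two example settings quoted in the text (the handling of the range boundaries is also an assumption). The actual parameter sets are those of Table 1A.

```python
from bisect import bisect_left, bisect_right
import math


class AxisSortedIndex:
    """Neighbour lookup over 3D site points using one sorted array per axis."""

    def __init__(self, points):
        self.points = [tuple(p) for p in points]
        self.axes = []
        for axis in range(3):
            order = sorted(range(len(self.points)),
                           key=lambda i: self.points[i][axis])
            keys = [self.points[i][axis] for i in order]
            self.axes.append((keys, order))

    def neighbours(self, query, radius):
        """Indices of points within `radius` of `query` (query included)."""
        hits = [0] * len(self.points)
        for axis, (keys, order) in enumerate(self.axes):
            # Binary search for the slab [q - radius, q + radius] on this axis.
            lo = bisect_left(keys, query[axis] - radius)
            hi = bisect_right(keys, query[axis] + radius)
            for i in order[lo:hi]:
                hits[i] += 1
        # Keep points found on all three axes, then apply the exact distance.
        return [i for i, h in enumerate(hits)
                if h == 3 and math.dist(self.points[i], query) <= radius]


# (DF low, DF high, radius in Angstrom, minimum neighbours); example values only.
EXAMPLE_RULES = [(0.0, 1.0, 3.0, 200), (4.0, math.inf, 3.0, 50)]


def stringency_for(df, rules=EXAMPLE_RULES):
    """Return (radius, min_neighbours) for a depth factor, or None."""
    for low, high, radius, min_pts in rules:
        if low <= df <= high:
            return radius, min_pts
    return None


def is_core_point(index, i, depth, rules=EXAMPLE_RULES):
    """DBSCAN core-point test with DF-dependent radius and neighbour count."""
    rule = stringency_for(depth[i], rules)
    if rule is None:
        return False
    radius, min_pts = rule
    return len(index.neighbours(index.points[i], radius)) >= min_pts


pts = [(0, 0, 0), (1, 0, 0), (0, 2, 0), (3, 3, 3), (2, 2, 0)]
index = AxisSortedIndex(pts)
print(index.neighbours((0, 0, 0), 2.0))
```

The per-axis slabs only select a cube around the query point, which is why the final Euclidean check is still needed: in the example, the point (2, 2, 0) lies inside the cube of half-width 2 but outside the sphere of radius 2.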
This two-pass scan has been automated in the web implementation of the algorithm. Clusters identified with the surface set, if adjacent to those identified with the deeper set, are merged with them and ranked based on the proportion of cluster cells identified with the ‘deeper’ parameter set. Typically, most proteins are found to contain on the order of 5 to 10 clusters by this approach. Cluster numbers are written into the residue-number field of the PDB line corresponding to each site cell, which is represented as a dummy atom. Any site point that does not belong to a cluster, and hence does not have a cluster-id, is assigned a residue number of 999 to indicate that it is a noise point. The clusters are then ranked by the number of site points that they contain, so that the top-ranked cluster has the maximum number of site points.

Different bounds on the value of DF and on clustering stringency (i.e., the radius around each grid cell and the threshold minimum number of neighbours) give different shapes to clusters of grid cells. The quality of clustering depends directly on the shape of the clusters, and the shape depends on the balance between clustering stringency and the bounds on DF. For a surface pocket, if the clustering stringency is too low, almost the entire surface becomes a single cluster, which is unacceptable; for a deeper pocket, if the clustering stringency is too high, small groups of cells/site points that correspond to a pocket are reported as noise, thereby losing a potential pocket. In addition, within the same protein there may exist groups of grid cells with varying densities and depths, requiring a flexible, dynamic clustering scheme. In view of these considerations, our flexible DBSCAN-type clustering has been useful and has performed well in identifying possible pockets.

Validation

In order to verify whether the protocol described above yielded binding sites that matched well with experimentally observed sites, a three-fold validation exercise was carried out.
The predicted sites were matched with the corresponding experimentally observed ones for (a) correctness of the location of the site in terms of the binding-site residues of the protein lining the pocket, (b) the extent of overlap of site points with the atoms of the ligand in the corresponding structure and (c) the rank of the cluster occurring at the expected site. We have used PDBbind, a comprehensive curated data set consisting of 1091 proteins as their respective protein-ligand complexes (Wang et al., 2005). Hydrogens and water molecules were removed from the PDB files. Small ligands such as sulphate, phosphate and metal ions were also removed. The sensitivity of the prediction accuracy of our algorithm to enzymes versus non-enzymes, the SCOP class of the protein, the size of the protein and the size of the ligand was also analysed. Next, a dataset of 48 apo-proteins and their corresponding ligand complexes, compiled by Huang and Schroeder (2006), was used to validate whether the algorithm was capable of identifying pockets in apo-proteins as well. Rasmol (Sayle and Milner-White, 1995) was used for visualization of the predicted pockets and for generating the images presented in this paper. In order to visualize the clusters reported by the program, a line in PDB format is generated for each grid cell, where the x, y, z fields indicate the 3D position of the grid cell and the temperature-factor field holds its depth factor. A higher value of DF indicates a deeply buried cell, whereas a lower value indicates a cell located at the surface.

Acknowledgements

Use of facilities at the Interactive Graphics Based Molecular Modeling Facility and Distributed Information Centre (both supported by the Department of Biotechnology (DBT), Govt. of India) and the facilities at the Supercomputer Education and Research Centre is gratefully acknowledged. Financial support from the DBT computational genomics initiative is also acknowledged.

References

An J., Totrov M. and Abagyan R.
(2005) Pocketome via Comprehensive Identification and Classification of Ligand Binding Envelopes. Molecular and Cellular Proteomics, 4, 752-761.
Bhinge A., Chakrabarti P., Uthanumallian K., Bajaj K., Chakraborty K. and Varadarajan R. (2004) Accurate Detection of Protein:Ligand Binding Sites Using Molecular Dynamics Simulations. Structure, 12, 1989-1999.
Brady G.P. Jr. and Stouten P.F. (2000) Fast Prediction and Visualization of Protein Binding Pockets with PASS. J. Comput. Aided Mol. Des., 14, 383-401.
Burley S.K. (2000) An Overview of Structural Genomics. Nature Structural Biology, Suppl:932-934.
Del Carpio C.A., Takahashi Y. and Sasaki S. (1993) A new approach to the automatic identification of candidates for ligand receptor sites in proteins: (I) Search for pocket regions. J. Mol. Graphics, 11, 23-29.
Chakrabarti S. and Lanczycki C.J. (2007) Analysis and Prediction of functionally important sites in proteins. Protein Science, 16, 4-13.
Chakravarty S. and Varadarajan R. (1999) Residue Depth: A novel parameter for the analysis of protein structure and stability. Structure, 7, 723-732.
Coleman R.G. and Sharp K.A. (2006) Travel Depth, a New Shape Descriptor for Macromolecules: Application to Ligand Binding. J. Mol. Biol., 362, 441-458.
Coleman R.G., Salzberg A.C. and Cheng A.C. (2006) Structure-Based Identification of Small Molecule Binding Sites Using a Free Energy Model. J. Chem. Inf. Model., 46, 2631-2637.
Congreve M., Murray C.W. and Blundell T.L. (2005) Structural Biology and drug discovery. Drug Discovery Today, 10, 895-907.
Ester M., Kriegel H.P., Sander J. and Xu X. (1996) A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise. Proceedings of the 2nd International Conference on Knowledge Discovery and Data Mining (KDD-96).
Glaser F., Morris R.J., Najmanovich R.J., Laskowski R.A. and Thornton J.M.
(2006) A Method for Localizing Ligand Binding Pockets in Protein Structures. Proteins: Structure, Function and Bioinformatics, 62, 479-488.
Goodford P.J. (1985) A Computational Procedure for Determining Energetically Favorable Binding Sites of Biologically Important Macromolecules. Journal of Medicinal Chemistry, 28, 849-857.
Hendlich M., Rippmann F. and Barnickel G. (1997) LIGSITE: Automatic and efficient detection of potential small molecule binding sites in proteins. Journal of Molecular Graphics and Modelling, 15, 359-363.
Huang B. and Schroeder M. (2006) LIGSITECSC: Predicting ligand binding sites using the Connolly surface and degree of conservation. BMC Structural Biology, 6, 19.
Jeffery C.J. (1999) Moonlighting proteins. Trends in Biochemical Sciences, 24, 8-11.
Kleywegt G.J. and Jones T.A. (1994) Detection, delineation, measurement and display of cavities in macromolecular structures. Acta Cryst., D50, 178-185.
Landon M.R., Lancia D.R. Jr., Yu J., Thiel S.C. and Vajda S. (2007) Identification of Hot Spots within Druggable Binding Regions by Computational Solvent Mapping of Proteins. J. Med. Chem., 50, 1231-1240.
Laurie A.T.R. and Jackson R.M. (2005) Q-SiteFinder: an energy-based method for the prediction of protein-ligand binding sites. Bioinformatics, 21, 1908-1916.
Liang J., Edelsbrunner H. and Woodward C. (1998) Anatomy of protein pockets and cavities: measurement of binding site geometry and implications for ligand design. Protein Science, 7, 1884-1897.
Brylinski M., Kochanczyk M., Broniatowska E. and Roterman I. (2007) Localization of ligand binding site in proteins identified in silico. J. Mol. Model., 13, 665-675.
Peters K.P., Fauck J. and Frommel C. (1996) The Automatic Search for Ligand Binding Sites in Proteins of Known Three-dimensional Structure Using only Geometric Criteria. J. Mol. Biol., 256, 201-213.
Pintar A., Carugo O. and Pongor S.
(2003) Atom Depth as a Descriptor of the Protein Interior. Biophysical Journal, 84, 2553-2561.
Ramachandraiah and Chandra N.R. (2000) Sequence and structural determinants of mannose recognition. Proteins: Structure, Function and Bioinformatics, 39, 358-364.
Sayle R.A. and Milner-White E.J. (1995) RASMOL: biomolecular graphics for all. Trends in Biochemical Sciences, 20, 374-376.
Soga S., Shirai H., Kobori M. and Hirayama N. (2007) Use of Amino Acid Composition to Predict Ligand-Binding Sites. J. Chem. Inf. Model., 47, 400-406.
Stockwell G.R. and Thornton J.M. (2006) Conformational diversity of ligands bound to proteins. J. Mol. Biol., 356, 928-944.
Taroni C., Jones S. and Thornton J.M. (2000) Analysis and prediction of carbohydrate binding sites. Protein Engineering, 13, 89-98.
Varrazzo D., Bernini A., Spiga O., Ciutti A., Chiellini S., Venditti V., Bracci L. and Niccolai N. (2005) Three-dimensional computation of atom depth in complex molecular structures. Bioinformatics, 21, 2856-2860.
Venkatachalam C.M., Jiang X., Oldfield T. and Waldman M. (2003) LigandFit: a novel method for the shape-directed rapid docking of ligands to protein active sites. Journal of Molecular Graphics and Modelling, 21, 289-307.
Wang R., Fang X., Lu Y. and Wang S. (2005) The PDBbind Database: Methodologies and Updates. J. Med. Chem., 48, 4111-4119.
Wang R., Fang X., Lu Y. and Wang S. (2004) The PDBbind Database: Collection of Binding Affinities for Protein-Ligand Complexes with Known Three-Dimensional Structures. J. Med. Chem., 47, 2977-2980.

Figure Legends

Figure 1: Illustrations depicting the flagging of grid cells, the computation of grid bars and the clustering performed in PocketDepth. Only one vertex per grid cell is highlighted for clarity. (a) Internal grid cells are marked in blue, surface grid cells in green, and protein atoms are indicated as red spheres.
(b) A comprehensive figure showing the flagging of the different grid cells: cyan, external; blue, internal; green, surface; black, internal voids; red, protein atoms; magenta, grid bars. (c) Clustering of grid bars: two clusters, one in green and the other in pink, are shown along with protein atoms in red.

Figure 2: Histograms showing the distribution of the ratio (in %) of the number of cluster grid cells (a) to the total number of grid bar cells and (b) to the total number of internal, surface and grid bar cells.

Figure 3: Histograms depicting the percentage overlap between the best predicted cluster and the corresponding ligand, where CV indicates the volume occupied by the predicted cluster, LV indicates the volume occupied by the ligand in terms of grid cells, LR indicates the set of residues within 4 Å of the ligand and CR indicates the set of residues within 4 Å of the predicted pocket. The percentage overlap is shown in terms of (a) volume overlap, (b) the number of residues that surround the pocket and (c) the more stringent Tanimoto coefficient based on volume.

Figure 4: (a) Rank distribution of predicted pockets corresponding to all true positives in the dataset. Ranking is based on four different schemes: blue, size of cluster; cyan, number of polar atoms; orange, number of non-polar atoms; brown, number of surface atoms. (b) The top-ranked clusters in PDB:1FDQ, corresponding to the binding site of a derivative of hexaenoic acid (HXA) in each subunit of the dimer. The actual ligand is shown in red as a ball-and-stick model and the clusters are shown as tiny blue and green spheres.

Figure 5: Comparison of pocket prediction for six different proteins using PocketDepth along with four other methods. Ranks for each prediction, where appropriate, are indicated below each case. Multiple pockets predicted for each ligand are indicated in different colours other than grey and red.
Proteins are shown as grey ribbons and ligand atoms are shown in red as ball-and-stick models.

Figure 6: Distribution of the fraction of true positives (TP) and false negatives (FN) across (a) SCOP classes, (b) different protein sizes and (c) different ligand sizes. True positives are indicated above the zero line of the Y-axis, while false negatives are indicated below it.

Figure 7: Overview of the algorithm shown as a flow-chart indicating grid construction, labeling, computation of grid bars and clustering of grid bar cells based on DF values and spatial proximity.

Table-1A: Description of parameter sets. ‘Deeper’ corresponds to the more stringent criteria for clustering. LB, UB: lower and upper bounds of the DepthFactor (DF). Cluster Radius and Num. of neighbours are the DBSCAN clustering parameters to be used for any grid cell that has a DF within the specified range.

Parameter Set               DF LB   DF UB   Cluster Radius   Num. of neighbours
deeper/stringent            0       4.0     3.0              190
                            4.0     Max     3.0              50
surface/(less stringent)    0       1.0     5.0              200
                            1.0     3.0     5.0              60
                            3.5     Max     3.0              50

Table-1B: Prediction accuracies from PocketDepth against the PDBbind dataset. TP and FN (protein) indicate the number of true positives and false negatives with respect to the number of proteins in the dataset, whereas TP and FN (ligand) indicate the same metrics with respect to the number of ligands bound to those proteins. Ranks correspond to the ranks of the predicted sites that overlapped with some ligand of the protein above the threshold minimum overlap. Deeper and surface parameters refer to clustering stringency. Ranking based on protein considers the top-ranked cluster per protein, whereas ranking based on ligand considers the top-ranked cluster per ligand, since more than one cluster may overlap with a ligand.
Dataset (clustering parameter set)            Based on  #TP                       #FN   (==1)  (<=2)  (<=3)  (<=5)  (<=10)
PDBbind 1091 (deeper)                         Protein   841                       255   55.2   73.8   82.2   91.2   97.4
                                              Ligand    860                       263   54.5   73.2   81.4   91     97.4
PDBbind 255 proteins / 263 ligands (surface)  Protein   215                       41    34     56.7   64.2   73     89.3
                                              Ligand    222                       41    32.9   57.7   64.9   73.4   89.2
Combination (in two steps)                    Protein   1053 (1053/1091 = 96.5%)  41    50.8   70.3   78.5   87.3   95.7
                                              Ligand    1082 (1082/1123 = 96.3%)  41    50     70     77.9   87.3   95.6
PDBbind (surface only)                        Protein   1046 (1046/1091 = 95.9%)  48    36.4   51.9   58.8   66.3   77.5
                                              Ligand    1075 (1075/1123 = 95.7%)  48    35.9   52.3   58.9   66.2   77.2
Apo-Plc 46                                    Protein   45                        4     42.2   57.8   64.4   88.9   95.6
                                              Ligand    70                        4     41.4   52.8   61.4   82.8   91.4

Table-2: Prediction accuracies using PocketDepth against 3 other datasets.

                                                                  Protein based Ranks
Dataset       #proteins in the dataset   #tp proteins   %-TP      1..5      1..10
LigsiteCSC    209                        204            97.6      89.7%     96.9%
Qsitefinder   126                        120            95.24     89.4%     96.5%
LigandFit     15                         15             100%      100%      100%

[Figure 1: panels (a), (b) and (c)]

[Figure 5: panels for PDB entries 1sp5 (AB), 1a72, 1ais, 1a1m (A), 2pel (A) and 2g88, with columns for CASTp, PocketDepth (labelled CavityDepth in the second set of panels), LigandFit, LigSiteCSC and QSiteFinder; the ranks of the predicted pockets are listed beneath each panel]

[Figure 7]