PocketDepth: A new depth based algorithm for identification of ligand binding sites in proteins
Yeturu Kalidas and Nagasuma Chandra*
Bioinformatics Centre and Supercomputer Education and Research Centre,
Indian Institute of Science, Bangalore, 560012, India,
* Correspondence to:
Dr. Nagasuma Chandra,
Bioinformatics Centre & SERC,
Raman Building,
Indian Institute of Science,
Bangalore 560 012, INDIA
Tel: +91-80-23601409, 22932892
Fax: +91-80-23600551
E-mail: nchandra@serc.iisc.ernet.in
Keywords: Protein structure, binding site, DBSCAN clustering, grid based method, depth factor
Abstract
Computational methods for identifying and predicting functional sites in protein structures are increasingly becoming
important in structural biology and bioinformatics not only for understanding the function of the molecule in detail but also for
structure-based design of possible ligands and potential drugs as well as modified protein molecules. While there are a few structure
based prediction methods already available, given the complexity and diversity of protein structural types, there is still a great need to
explore newer methods and concepts to develop accurate, versatile and efficient binding site prediction algorithms. We have
developed a new method PocketDepth, for identification of binding sites in proteins. The method is purely geometry-based and
proceeds in two stages, labeling of grid cells with depth factors followed by a depth based clustering that uses neighbourhood
information. Depth is an important parameter considered during protein structure visualization and analysis but has been used more
often intuitively than systematically. Our current implementation of depth reflects how central a given sub-space is to a putative
pocket rather than reflecting merely how far away it is situated from the nearest external surface of the protein. We have tested the
algorithm against PDBbind, a large curated set of 1091 proteins obtained from PDB. A prediction was considered a true-positive if
the predicted pocket had at least 10% overlap with the actual ligand. The prediction accuracy using this set was about 96%.
Moreover, 87% of the true-positives were identified within the first five ranks for each protein, of which 55% are in the first rank
itself. 77% of the predictions had at least 50% overlap with the experimentally observed ligand. High prediction rates were again
observed, when the method was tested against a data-set of apo-proteins and compared with their respective ligand complexes. A
comparison of our method with four other widely used methods for a chosen representative set is also presented.
Introduction
It has long been recognized that understanding ligand binding to a protein molecule holds the key to understanding function of the
molecule. The success of the structural genomics and high-throughput structural biology projects is leading to a significant increase
in protein structural data (Congreve, 2005; Burley, 2000). A challenge that is emerging out of these is to identify function(s) from
structure. Even when protein structures are determined crystallographically as a complex with a ligand, a complete description of their
binding sites is not obtained because they may not be complexed with all the ligands required for the function of the molecule or the
complexed ligands are often substitutes of the natural ligands. A key step in the process of gaining functional insights from protein
structures is therefore identification of all relevant binding sites in protein molecules. A further requirement for accurate identification
of binding sites comes from the observation of moonlighting of protein molecules (Jeffery 1999), where many protein molecules have
been found to have more than one function, quite often through different binding sites or even different binding modes at overlapping
sites on the same protein. Even where crystal structures are available, they are rarely available as complexes with different ligands
that may be required for moonlighting, hence making prediction by computational methods very important. The need for accurate
prediction of binding sites is also accentuated by the requirement for better definitions of suitable pockets for use in structure-based
drug design. Further, knowledge of possible binding sites in protein structures will also enable us to analyse and classify proteins in
terms of their ligand recognition profiles. In some cases, proteins containing different folds are known to have a common function in
terms of the ligand they recognize. ATP binding proteins (Stockwell and Thornton 2006) or sugar binding proteins (Ramachandraiah
and Chandra, 2000; Taroni et al., 2000) are some such examples. There are many reports in literature addressing such questions, but
most of them start with an implicit premise that the crystallographically observed ligand binding mode is the most optimal for the
given protein ligand pair. Having an independent description of the binding sites will enable studying these aspects with a different
perspective.
A number of methods have been developed so far, many of them within the last two to three years (Bhinge et al., 2004;
Laurie and Jackson 2005; Huang and Schroeder 2006; Glaser et al., 2006; Soga et al., 2007; Chakrabarthi and Lanczycki 2007)
indicating high interest in this area. They can be broadly classified into (a) geometry based and (b) energy based methods. The
geometry based methods are generally known to be faster while the energy based methods score better in terms of high accuracy of the
sub-pockets predicted. Some examples of the geometry based methods are LigsiteCSC (Huang and Schroeder 2006), CASTP (Liang et
al., 1998), PASS (Brady and Stouten 2000), LigandFit (Venkatachalam et al., 2002), VOIDOO (Kleywegt and Jones 1994),
APROPOS (Peters et al., 1996), LIGSITE (Hendlich et al., 1997) and SURFNET (Glaser et al., 2006), while examples of energy based
methods are GRID (Goodford 1985), Pocket finder (An et al., 2005), QsiteFinder (Laurie and Jackson 2005), desolvation based free-energy models (Coleman et al., 2006) and solvent mapping models (Landon et al., 2007). Roterman and co-workers have also
reported identification of active sites based on the characteristics of the spatial distribution of hydrophobicity in a protein molecule,
using a fuzzy-oil-drop model (Brylinski et al., 2007).
The different methods focus on different properties such as size,
hydrophobicity, energy potential, solvent accessibility, desolvation energy or residue propensity for representing and hence analyzing
the pockets. The chosen descriptor directly influences the quality of prediction. Hence it is important to explore use of different
features to represent protein molecules and subsequently predict binding sites. Here we report a new geometry based method that
divides a putative pocket into sub-spaces in a grid and computes their depths within the pocket, which is subsequently used to retain
and cluster only the high-depth sub-spaces, thus utilizing the information of the neighbourhood of the relevant atoms in putative sites.
High prediction accuracies were obtained using this algorithm both in terms of the number of correct predictions as well as the extent
of correctness of each prediction. Extensive validation with a large dataset as well as benchmarking with four other prediction
methods have also been carried out.
Results and Discussion
The concept of depth factor and its implementation in pocket identification
Depth is an important parameter considered during protein structure visualization and analysis and has been used more often
intuitively than systematically. Earlier reports of the formalization of depth as a parameter have pertained either to considering the
depth of a residue as the average of the constituent atom depths, where depth is defined as the distance of the atom from the nearest
surface water molecule (Chakravarty and Varadarajan, 1999) or considering it as the distance of a nonhydrogen atom from its closest
solvent-accessible protein neighbour (Pintar et al., 2003; Varrazzo et al., 2005). These studies have shown the usefulness of depth in
gaining an insight into the protein interior and studying protein folding and stability. Depth has also been found to be correlated with
several molecular, residue and atomic properties, such as average protein domain size, protein stability, free energy of formation of
protein complexes, amino acid type hydrophobicity, residue conservation and hydrogen/deuterium amide proton exchange rates
(Pintar et al., 2003). Depth has also been used to identify binding sites through the identification of superficial depressions from the
centre of gravity of a given protein (Caprio et al., 1993). Though these studies vary in detail regarding the precise definition of depth,
they demonstrate that depth is a more useful metric than others such as accessible surface area, in protein structure analysis. In a
recent study, Coleman and Sharp have formalized depth as a shape descriptor termed Travel-Depth for describing protein surfaces and
show the usefulness of this parameter in protein structure, binding site and channel analysis (Coleman and Sharp, 2006). Depth in
their work refers to the physical distance a solvent molecule would have to travel from a surface point to a suitably defined reference
surface. These implementations take into account how deep a given pocket or residue is situated from an outermost surface point of
the protein. When defined in this manner, an underlying assumption is that two depth vectors of the same length would be regarded to
be equivalent. For example, two atoms situated at the same distance from their nearest external surfaces of the protein would be
considered to be equally contributing to the stability or packing of the protein. In the same way, two atoms at the same depth lining
two different putative pockets would be ranked equally during binding site identification. In reality however, this would not be the
case because, these implementations disregard the perspective of the neighbourhood of the atom. This is because, in addition to the
distance of a given atom from the nearest external surface, the number of different paths that exist for traversal of a hypothetical probe
from all surface atoms in the vicinity to the given atom would also have a significant influence on the properties of the given atom. In
the context of pocket or cavity detection, it would translate to knowing how central a given putative pocket cell is to the pocket or in
other words how many interactions an atom located in that cell can have in that site and therefore how important is that pocket cell for
the pocket. Here we seek to address this issue by defining depth differently. Our implementation is based on dividing a given space
into multiple subspaces using a grid and weighting the importance of a given subspace within the grid, based on the number of times a
depth vector can be drawn through that subspace to the external surface of the protein (Figure 1). A subspace here refers to one or
more connected grid cells through common vertices. The depth information is obtained by first flagging grid cells as internal, external
or surface and then drawing grid bars between all pairs of surface atoms within a chosen threshold, leading to incremental counts of
the traversed grid cells. Such cells are then clustered based on cumulative depth counts as well as spatial proximity. The depth factors
thus obtained have then been utilized to guide the clustering process, enhancing the accuracy of binding site identification. To
demonstrate the sensitivity of the clustering process, the fraction of grid cells that participated in one or the other cluster among all
grid cells that participated in grid bars was computed for each protein in the dataset (Figure 2a). It can be seen that in the majority of
proteins, less than a tenth of the grid bar cells belonged to meaningful clusters, clearly indicating the usefulness of the depth factor in
enhancing sensitivity of the clustering process. This also amounts to only about a hundredth of the total volume of a protein being
capable of binding any ligand in most cases (Figure 2b). In other words, ligand binding is possible in only about one hundredth
of the volume of a protein, clearly in line with the common knowledge that ligands bind through specific sites.
Prediction accuracies
A comprehensive set of experiments was performed in order to evaluate the binding site prediction abilities of PocketDepth
and to ascertain the possible applications of such a prediction tool. In order to assess the quality of predictions obtained, we tested the
algorithm against PDBbind, a large and well curated dataset that has been made available recently (Wang et al., 2004; Wang et al.,
2005). The predictions obtained were analysed in terms of the rank of the pocket for the correct prediction and the extent of overlap of the
predicted pocket with the crystallographically determined ligand location. The number of common residues between the predicted and
observed sites was also analyzed.
The algorithm was also assessed by testing against a database of 48 apo proteins and their
corresponding ligand complexes.
Measuring the prediction success accurately is not a simple task, since we first need to know precisely what the ‘correct’
binding site is for a given protein under physiologically meaningful conditions. Such information may not always be available, given
the difficulties such as stability of enzyme-substrate complexes. For practical purposes however, in this study as in many studies of
this nature, we have to compare the predictions with the available experimentally observed protein-ligand structures. Different
methods have been used for defining a ‘true-positive’ in different algorithms. For example, in Qsitefinder, a minimum of 25% overlap
between probes in the predicted pocket and 1.6 Å around any ligand atom is considered as true-positive, whereas in Ligsite, the
accuracy is measured by the percentage of predicted pocket atoms that are in contact with the ligand. A protein and a ligand atom are
considered to be in contact if they are found to be within a distance of the sum of their van der Waals radii plus 0.5 Å. The difficulties
in comparing different algorithms are described by Huang and Schroeder (Huang and Schroeder 2006). In their study reporting
Ligsitecsc, a common criterion for success using a distance-based approach has been used, wherein a geometric centre of the pocket
sites’ grid points is computed to represent the predicted pocket and the prediction is considered a hit if it is within 4 Å of any ligand
atom. Ligsitecsc also considers residues lining the geometric centre of the predicted pocket within 8 Å and ranks the pockets based
on the extent of conservation of these residues. While this method provides a common framework for comparison, it does so at the
cost of resolution of information since the entire pocket is represented as a single point. We feel it is reasonable and sufficiently
informative to measure prediction success by considering volume overlap with the actual ligand, similar to the approach adopted by
Qsitefinder. To get a clearer idea of the extent of correctness of the prediction, we also report Tanimoto graphs and common residue
graphs.
A prediction was considered as a true-positive, when there was at least 10% overlap in volume with the ligand atoms in the
corresponding crystal structure. Each ligand was also represented by grid cells in the same framework for each protein, used for
predicting pockets. Non-hydrogen ligand atoms were assigned to individual grid cells, around which one layer of grid cells was
considered to ensure capturing the entire ligand molecule. Those grid cells of the cluster surrounding the ligand cells within 1.5Å
were considered to be common to the predicted cluster and the ligand. The overlap score with respect to the ligand is defined as the ratio of
common grid cells to the number of ligand cells. Using two parameter sets for controlling the shapes and extents of binding pockets, which are explained in detail in the methods section (Table-1A), we obtained the following prediction accuracies. Of the 1123 ligands contained
in the set of 1091 proteins in the PDBbind dataset used in the study, ligand binding sites were predicted correctly for 860 ligands
corresponding to 841 proteins with the stringent parameter set (Table-1B, deeper), where the minimum depth factor is set to 4. When
the remaining 263 ligands that were false negatives in the first scan were scanned with a second parameter set with less stringent
criteria (Table-1B, surface), wherein the minimum depth factor was reduced to 1, 222 ligands corresponding to 215 proteins were
predicted correctly. Considering these two applications of parameters together, it amounts to correct predictions of 1082 ligands out of
1123, from 1053 proteins out of 1091 leading to 96% accuracy both in terms of the ligand and the protein. When the second (surface)
parameter set alone was used to scan all the 1091 proteins, 1075 ligands from 1046 proteins were identified correctly, again amounting
to 95.7% prediction accuracies for ligand and proteins. However the quality of prediction was significantly better with the stringent
parameter set as discussed below, since it led to the delineation of pocket boundaries more precisely as compared to the surface
parameter set.
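The overlap criterion described above can be illustrated with a minimal Python sketch. It assumes that the ligand and the predicted cluster are both available as collections of (i, j, k) grid indices, and that a cell counts as common only when the indices coincide, which simplifies the 1.5Å neighbourhood test used in the study. The function names are hypothetical and not part of PocketDepth.

```python
def overlap_score(ligand_cells, cluster_cells):
    """Ratio of common grid cells to the number of ligand cells."""
    ligand = set(ligand_cells)
    if not ligand:
        raise ValueError("the ligand must occupy at least one grid cell")
    # Assumption: a cell is 'common' only if the same index occurs in both sets.
    common = ligand & set(cluster_cells)
    return len(common) / len(ligand)


def is_true_positive(ligand_cells, cluster_cells, min_overlap=0.10):
    """Apply the 10% overlap threshold quoted in the text."""
    return overlap_score(ligand_cells, cluster_cells) >= min_overlap


# Arithmetic check of the combined figures quoted in the text:
# ligands recovered by the stringent scan plus those recovered by the second scan.
combined = 860 + 222
print(combined, round(100 * combined / 1123, 1))
```

The final two lines simply recompute the combined ligand count and its percentage of the 1123 ligands from the numbers given in the paragraph.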
Location and overlap: The extent of overlap between a predicted site and a crystallographically observed site, automatically helps in
defining if the correct location has been identified as the binding site. The protocol used here for flagging a prediction as a true-positive implicitly considers this aspect. Figure 3a indicates the extent of overlap between the observed and the predicted pockets in
the dataset studied here. It can be seen that the predicted pockets encompass the ligand entirely in a large number (45.6%) of cases. In
some other cases, a significant part but not the whole ligand is in the predicted pocket while in a few other cases, the ligand
overlapped to a lesser extent with the predicted pocket. In addition, when analyzed in terms of the residues lining the actual ligand
versus those in the predicted pocket, it was observed that most (85.4%) of the predicted pockets overlapping with the ligand had above
60% of residues in common with those around ligand within 4.0Å (Figure 3b). It is also important to consider the size of the predicted
pocket, because very large pockets will encompass ligands well, but lose precision in terms of defining pocket boundaries. Therefore,
a comparison of the ligand occupied volume versus the volume of the predicted pockets was computed (Figure 3c), using a Tanimoto
quotient t, where
\[ t = \frac{|\mathrm{LigOccupiedCells} \cap \mathrm{SiteClusterCells}|}{|\mathrm{LigOccupiedCells} \cup \mathrm{SiteClusterCells}|} \]
Even with this stringent measure, the overall success of the algorithm can be clearly seen, as 43.7% of the predicted pockets
have at least 50% of matching volume with the observed ligand.
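A minimal Python sketch of the Tanimoto quotient follows, taken here in its usual sense as the number of cells shared by the ligand and the predicted cluster divided by the number of cells in their union. The function name and the representation of cells as (i, j, k) index tuples are assumptions made for illustration.

```python
def tanimoto(lig_occupied_cells, site_cluster_cells):
    """Intersection over union of ligand-occupied cells and cluster cells."""
    lig = set(lig_occupied_cells)
    site = set(site_cluster_cells)
    union = lig | site
    if not union:
        return 0.0  # convention chosen for this sketch when both sets are empty
    return len(lig & site) / len(union)


# A cluster that covers half of a four-cell ligand and adds two extra cells.
ligand = [(0, 0, 0), (1, 0, 0), (2, 0, 0), (3, 0, 0)]
cluster = [(2, 0, 0), (3, 0, 0), (4, 0, 0), (5, 0, 0)]
print(tanimoto(ligand, cluster))
```

Unlike the overlap score, this measure penalises oversized pockets, because every cluster cell outside the ligand enlarges the denominator.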
Ranks: Protein structures often contain multiple pockets, out of which only one or two may be biologically significant. It is therefore
important for the correctly predicted pockets to have the top most ranks. Here we use the volume of the pocket as a metric for ranking
and use the number of non polar, polar and total number of protein atoms surrounding the pocket as additional metrics, as illustrated in
Figure 4a. It is clear that as many as 50% of predictions are ranked first, and 78% are in the first three ranks, when sorted by the
volume of the pocket. Similar rankings were observed with the total number of protein atoms or the non-polar or polar atoms
surrounding the predicted pocket. Further in cases of multimeric proteins such as dimeric neuraminidase, predicted pockets with the
top two ranks correspond to the two binding sites, one on each subunit, in many cases (Figure 4b).
The results obtained using the dataset of 48 apo proteins and their corresponding ligand complexes (apo-plc) are shown in Table 2.
Of the 78 ligands corresponding to 46 proteins, 70 ligands corresponding to 45 proteins were predicted correctly. Considering
the coverage of the ligand pocket by any predicted cluster, we observed that out of the 70 true-positive ligands, 53 (75%) have more
than 90% overlap with respect to ligand volume and half of the true-positive ligands (35) have a Tanimoto coefficient of more than 30%.
With an overlap of greater than xx%, 29 (41.4%) of the true-positive ligands are in the top rank itself and 58 (82.9%) of
them are in the top five ranks. Significant conformational changes have been reported for xx of these 48 proteins. The fact that the
algorithm was capable of predicting the pockets in the apo proteins indicates that it can perform well even when the protein is not in
the ligand-bound conformation.
Sensitivity of the prediction accuracy to particular SCOP families, size of the protein and the ligand was also analyzed for the
PDBbind dataset. This is important because enzymes are generally known to have buried pockets, and their binding sites are hence more
easily identifiable than those of many other proteins which show differences in topography of the binding sites. The SCOP classes of the
proteins in the dataset were obtained to the super-family level and the number of correct predictions as well as false negatives were
plotted for each SCOP category, as illustrated in Figure 6. The results indicate that the true positives range over 202 of the 206 super-families present in the dataset, indicating the predictions are independent of the structural class or the fold of the
protein. Similarly, graphs plotted to test whether predictions depended upon either the size of the protein or the size of the ligand,
indicate that the predictions were independent both of the protein size as well as the ligand size.
Benchmarking: Next, an extensive benchmarking of our prediction vis-a-vis those from the established binding site algorithms CastP
(Liang et al., 1998), Q-Sitefinder (Laurie and Jackson 2005), Ligsitecsc (Huang and Schroeder 2006) and LigandFit (Venkatachalam
et al., 2003), was also carried out. For this purpose, datasets reported by the authors of these methods have been used. Table 2
indicates the percentage of true positives obtained for the different datasets using our method. The performance of PocketDepth was
found to be good with all datasets. The prediction accuracies were better than reported for individual methods in most cases using the
same datasets. Q-SiteFinder, an energy based method was observed to predict regions of the pockets most accurately among the other
methods, albeit with varying extents of overlap with the actual ligand. Our method, purely based on geometry, performed as well, if
not better, in terms of identifying the correct pocket with the top ranks, but in many cases with a higher extent of overlap. In a few
cases however, the ranks reported by Q-Site Finder were superior for sub-pockets corresponding to ligand sub-structures, but not the
whole ligand itself. Figure 5 shows pocket prediction for five examples from different protein folds and families, using PocketDepth
and four other methods, clearly demonstrating that our method outperforms the others in the majority of the cases. The examples are
chosen to reflect different types of binding sites such as, in an enzyme (alcohol dehydrogenase:1A72), a carbohydrate binding site
(peanut lectin:2PEL), a nucleotide binding site (RecA protein:2G88), a DNA binding site (transcription factor:1AIS), peptide binding
sites (HIV protease: 1SP5 and MHC molecule:1A1M). It can be seen that PocketDepth performed well and often the best in (a)
predicting the site as a top ranking cavity and (b) more importantly, in marking the boundaries of the predicted cavity. It was also
observed that our method fared significantly better in identifying the entire pocket as one cavity as against the prediction of several
pockets in a single binding site from other methods such as CastP and QSiteFinder, indicating the usefulness of the clustering
technique used here. Some examples can be seen in Figure 5a to 5f. An added advantage with our method compared to those such as
CastP is that fewer pockets are predicted per protein molecule, thus significantly increasing the sensitivity of the prediction.
In summary, by considering depth in the sub-spaces available in a pocket as defined and implemented in our method, we come
to understand how central a given space is to a pocket, and not merely how deep it is. These centrally located cells with high depth counts
must be taken into account while designing ligands for a pocket. Our understanding of whether a prediction is successful or not, is
necessarily based on the position of the ligand in the crystallographically determined structures. While this is the best measure that
may be currently available, it must be pointed out that the crystal structures may not always contain the natural substrate but contain
an inhibitor which may be smaller than the real substrate. Nor does it show us the possible re-orientation of a ligand in a given site
during catalysis or upon binding another natural ligand in an adjacent site. Irrespective of the accuracy of the prediction in terms of
how closely it mimics the site of the actual ligand, the pockets give us an idea about the empty spaces around the ligand that might be
accessible for the ligand, which can be utilized appropriately while designing inhibitors or gaining further functional insights such as
in the case of moonlighting proteins.
The Algorithm
The major steps in the algorithm are: grid construction, grid cell labeling, drawing grid bars, computing depth factors,
clustering and ranking as shown in the flowchart (Figure 7). A stepwise description is given below.
Step-1: Grid construction: For a given PDB file of a protein, a 3D grid is constructed to encompass minimum and maximum
coordinate-values along each of the X, Y and Z axes, with a cell size of 1Å. Each point (x,y,z) of the protein, in 3D cartesian space is
mapped onto a 3D index (i,j,k) where i,j and k represent the number of cells starting from index 0 along each of the 3 dimensions.
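The grid construction of Step-1 can be sketched in Python as follows. This is only an illustration under stated assumptions: atom coordinates are taken as plain (x, y, z) tuples already read from the PDB file, and the function name build_grid is hypothetical.

```python
import math

CELL_WIDTH = 1.0  # cell size in Å, as stated in the text


def build_grid(coords, cell_width=CELL_WIDTH):
    """Return the coordinate minima and the number of cells along X, Y and Z."""
    xs, ys, zs = zip(*coords)
    mins = (min(xs), min(ys), min(zs))
    maxs = (max(xs), max(ys), max(zs))
    # One extra cell per axis so that the maximum coordinate also falls inside the grid.
    shape = tuple(
        int(math.floor((hi - lo) / cell_width)) + 1 for lo, hi in zip(mins, maxs)
    )
    return mins, shape


atoms = [(0.0, 0.0, 0.0), (2.5, 1.2, 0.3)]
mins, shape = build_grid(atoms)
print(mins, shape)
```

Indices then run from 0 to shape[d] - 1 along each dimension d, which matches the index convention described in this step.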
Step-2: Defining cell properties and labelling cells: After defining a grid based indexing scheme, the data structures were then defined
to represent the properties of interest for each grid cell. (i) Each atom is mapped to an appropriate grid cell by taking the offset of the
co-ordinates in all three axes from the corresponding minima, as indicated in the formula
\[ \left( \frac{x_i - x_{min}}{\mathrm{CELL\_WIDTH}},\ \frac{y_i - y_{min}}{\mathrm{CELL\_WIDTH}},\ \frac{z_i - z_{min}}{\mathrm{CELL\_WIDTH}} \right) \]
where x_i, y_i and z_i indicate the (x,y,z) fields of the ith site point; x_min, y_min and z_min indicate the minimum of all the coordinate values along each
of the 3 dimensions for the given protein and CELL_WIDTH indicates the fineness of the grid (1Å in our case). (ii) Every grid cell is
classified by examining its neighbourhood, as (a) internal, those grid cells within 2Å from each atom occupied grid cell and those that
are surrounded on six faces by other internal cells, or (b) external, the remaining cells after (a). The flagged external cells were further
classified as (c) surface cells, if an atom occupied cell could be found within 3Å. Further, some external cells would be classified as (d)
site cells after drawing grid bars as explained in the following steps.
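The mapping and labelling of Step-2 may be sketched as below. Several choices here are assumptions rather than details given in the text: the quotient in the mapping formula is truncated to an integer index, distances are measured between cell indices scaled by the cell width, the search over occupied cells is brute force, and the names atom_to_cell and label_cells are hypothetical. Site cells are not assigned here because they depend on the grid bars of the later steps.

```python
import itertools
import math

CELL_WIDTH = 1.0  # Å
FACES = ((1, 0, 0), (-1, 0, 0), (0, 1, 0), (0, -1, 0), (0, 0, 1), (0, 0, -1))


def atom_to_cell(atom, mins, cell_width=CELL_WIDTH):
    """Offset from the coordinate minima divided by the cell width, truncated."""
    return tuple(int(math.floor((c - lo) / cell_width)) for c, lo in zip(atom, mins))


def label_cells(occupied, shape, internal_r=2.0, surface_r=3.0, cell_width=CELL_WIDTH):
    """Label every cell of a grid with the given shape as internal, surface or external."""
    occupied = list(occupied)
    labels = {}
    for cell in itertools.product(*(range(n) for n in shape)):
        d = min(math.dist(cell, occ) for occ in occupied) * cell_width
        if d <= internal_r:
            labels[cell] = "internal"
        elif d <= surface_r:
            labels[cell] = "surface"   # an external cell close to an atom
        else:
            labels[cell] = "external"
    # Cells enclosed on all six faces by internal cells also become internal.
    enclosed = [
        c for c, lab in labels.items()
        if lab != "internal"
        and all(labels.get((c[0] + f[0], c[1] + f[1], c[2] + f[2])) == "internal" for f in FACES)
    ]
    for c in enclosed:
        labels[c] = "internal"
    return labels


labels = label_cells({(4, 4, 4)}, (9, 9, 9))
print(labels[(4, 4, 6)], labels[(4, 4, 7)], labels[(4, 4, 8)])
```

The brute-force distance search is adequate for a small illustration only; a real implementation would restrict the search to the neighbourhood of each atom occupied cell.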
Step-3: Finding connected list of surface cells: A depth first traversal from a surface cell is carried out to locate connected surface
cells as follows: the grid cells are marked as they are traversed recursively along six faces from each surface cell. When another
surface cell is encountered, the traversal continues or else it closes and moves to the next surface cell in the list. A connection is
defined to exist between a pair of surface cells if one is an immediate neighbour of the other in any of the six directions.
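A minimal Python sketch of this traversal is given below. It replaces the recursive formulation with an explicit stack, which visits the same cells while avoiding Python's recursion limit on large proteins; the function name and the representation of surface cells as (i, j, k) tuples are assumptions.

```python
FACES = ((1, 0, 0), (-1, 0, 0), (0, 1, 0), (0, -1, 0), (0, 0, 1), (0, 0, -1))


def connected_surface_components(surface_cells):
    """Group surface cells into lists connected through shared faces."""
    remaining = set(surface_cells)
    components = []
    while remaining:
        start = remaining.pop()
        component = {start}
        stack = [start]
        while stack:  # depth-first traversal using an explicit stack
            i, j, k = stack.pop()
            for di, dj, dk in FACES:
                nb = (i + di, j + dj, k + dk)
                if nb in remaining:  # an unvisited surface cell: continue through it
                    remaining.remove(nb)
                    component.add(nb)
                    stack.append(nb)
        components.append(component)
    return components


cells = [(0, 0, 0), (1, 0, 0), (1, 1, 0), (5, 5, 5), (6, 6, 6)]
print(sorted(len(c) for c in connected_surface_components(cells)))
```

Cells that touch only at an edge or a corner, such as (5, 5, 5) and (6, 6, 6) above, end up in separate components because only the six face directions are followed.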
Step-4: Defining putative boundary atoms: If the total number of connected surface cells is below a threshold value (default is 500),
then it gets reported as an internal void of the protein, so that it can be processed separately. If the number of connected surface cells is
above the required threshold, those atoms that surround the present set of surface cells are stored as list of putative boundary atoms for
further processing.
Step-5: Deriving the depth factor: All pairs of boundary atoms that lie within a threshold distance range (2 to 15Å) are considered for
drawing grid bars between them. However, a grid bar is only drawn between a pair of boundary atoms if it does not pass through the
interior of the protein. A grid bar is drawn by traversing a trail of grid cells from the grid cell (i1, j1, k1) occupied by the starting
boundary atom to the grid cell (i2,j2,k2) occupied by target boundary atom. The grid bar is drawn such that the target is reached with
the shortest Euclidean path from the starting atom. The counter in each grid cell traversed by the bar is cumulatively incremented. The
final count called depth factor (DF) in our method, for each grid cell would correspond to the number of times a grid bar has traversed
through it and hence indicates the density of atoms in its neighbourhood or in other words how central a given grid cell is to a putative
pocket.
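The accumulation of depth factors in Step-5 can be sketched as follows. This is an illustration under several assumptions that the text does not specify: boundary atoms are represented by the grid cells they occupy, distances are taken between cell indices scaled by the cell width, a bar is rasterised by sampling the straight segment once per step along its longest axis, and only the cells strictly between the two end cells are tested and incremented. The names bar_cells and depth_factors are hypothetical.

```python
import math
from collections import Counter
from itertools import combinations


def bar_cells(start, end):
    """Cells visited by a straight bar from start to end (simple line sampling)."""
    n = max(abs(e - s) for s, e in zip(start, end))
    if n == 0:
        return [tuple(start)]
    cells = []
    for t in range(n + 1):
        cell = tuple(round(s + (e - s) * t / n) for s, e in zip(start, end))
        if not cells or cells[-1] != cell:
            cells.append(cell)
    return cells


def depth_factors(boundary_cells, internal_cells, d_min=2.0, d_max=15.0, cell_width=1.0):
    """Count, for each cell, how many admissible grid bars pass through it."""
    internal = set(internal_cells)
    df = Counter()
    for a, b in combinations(boundary_cells, 2):
        if not d_min <= math.dist(a, b) * cell_width <= d_max:
            continue
        interior = bar_cells(a, b)[1:-1]
        if any(cell in internal for cell in interior):
            continue  # the bar would pass through the protein interior
        df.update(interior)
    return df


boundary = [(0, 2, 0), (4, 2, 0), (2, 0, 0), (2, 4, 0)]
df = depth_factors(boundary, internal_cells=[])
print(df[(2, 2, 0)], df[(1, 2, 0)])
```

In this toy arrangement the central cell is crossed by two bars and its neighbours by one, which mirrors the idea that a higher count marks a cell that is more central to the putative pocket.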
Step-6: Second pass scanning of inter grid bar spaces between boundary atoms:
There may be some cells that get trapped between grid bars, are not covered during Step-5 and hence possess zero DF
values. In order to find and assign appropriate DF values to such cells, for each boundary atom all other boundary atoms within 10Å
around it are considered and the centroid of this set of atoms is computed. If the centroid falls in the interior of the protein, it is ignored;
otherwise a 5Å cube centered about the centroid is considered and the average DF of all the cells within that cube touched by grid-bars is
calculated and assigned to all the remaining external cells.
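One possible reading of this second pass is sketched below in Python. The assumptions are labelled in the comments: the group of atoms around each boundary atom includes the atom itself, the 5Å cube is taken as the 5 x 5 x 5 block of cells centred on the centroid cell, and only zero-DF, non-internal cells inside that cube receive the average. The function name fill_trapped_cells is hypothetical.

```python
import math
from itertools import product


def fill_trapped_cells(boundary_atoms, df, internal_cells, mins,
                       radius=10.0, cube=5.0, cell_width=1.0):
    """Return a copy of df in which trapped zero-DF cells receive a local average DF."""
    internal = set(internal_cells)
    filled = dict(df)
    half = int(cube // (2 * cell_width))  # 2 cells on either side for a 5 Å cube
    offsets = list(product(range(-half, half + 1), repeat=3))
    for atom in boundary_atoms:
        # Assumption: the atom itself is part of the set whose centroid is taken.
        group = [b for b in boundary_atoms if math.dist(atom, b) <= radius]
        centroid = tuple(sum(c) / len(group) for c in zip(*group))
        centre = tuple(int(math.floor((c - lo) / cell_width)) for c, lo in zip(centroid, mins))
        if centre in internal:
            continue  # centroid lies in the protein interior
        # Cells outside the grid may appear here; a real grid would clip them.
        cube_cells = [tuple(c + o for c, o in zip(centre, off)) for off in offsets]
        touched = [df[c] for c in cube_cells if df.get(c, 0) > 0]
        if not touched:
            continue
        mean_df = sum(touched) / len(touched)
        for c in cube_cells:
            if c not in internal and filled.get(c, 0) == 0:
                filled[c] = mean_df
    return filled


atoms = [(0.5, 2.5, 2.5), (4.5, 2.5, 2.5)]
df = {(1, 2, 2): 2, (3, 2, 2): 4}
filled = fill_trapped_cells(atoms, df, internal_cells=[], mins=(0.0, 0.0, 0.0))
print(filled[(2, 2, 2)])
```

Averages are always computed from the original bar counts rather than from values filled in earlier in the loop, so the result does not depend on the order in which boundary atoms are visited.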
Step-7: Reporting the labelled grid cells: The grid cells with non zero DF values are reported as dummy atoms in Protein Data Bank
(PDB) format where, cell based indices converted back to 3D coordinates (using the converse of the formula in step-2) are written in
coordinate fields and DF of the cell is written into the temperature factor field, for ease in using standard structure visualization
software.
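Writing a labelled cell as a dummy atom can be sketched as follows, using the fixed-column layout of a PDB HETATM record with the DF placed in the temperature factor columns (61 to 66). The atom name X, the residue name PKT, the chain identifier A, the occupancy of 1.00 and the use of the cell centre as the coordinate are all assumptions of this sketch, as are the function names.

```python
CELL_WIDTH = 1.0  # Å


def cell_to_coords(cell, mins, cell_width=CELL_WIDTH):
    """Inverse of the Step-2 mapping; the half-cell shift to the centre is an assumption."""
    return tuple(lo + (idx + 0.5) * cell_width for idx, lo in zip(cell, mins))


def site_cell_record(serial, cell, df, mins, cell_width=CELL_WIDTH):
    """Format one grid cell as a 78-column HETATM line with DF as the B-factor."""
    x, y, z = cell_to_coords(cell, mins, cell_width)
    # Serial numbers above 99999 would overflow the field and are not handled here.
    return (
        f"HETATM{serial:5d} {'X':<4s} {'PKT':>3s} {'A':1s}{serial % 10000:4d}    "
        f"{x:8.3f}{y:8.3f}{z:8.3f}{1.0:6.2f}{df:6.2f}          {'X':>2s}"
    )


record = site_cell_record(1, (0, 0, 0), 4.0, mins=(10.0, 20.0, 30.0))
print(record)
```

Because the DF sits in the temperature factor field, standard viewers that colour atoms by B-factor will colour the dummy atoms by depth factor without further processing.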
Step-8: Clustering of the site points to identify number and extents of pockets: Density based clustering scheme DBSCAN (Ester et
al., 1996), which is suitable for identifying clusters of arbitrary shape, is adopted here for determining clusters representing binding sites.
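Since the remainder of this step describes a neighbourhood query built on three per-axis sorted arrays searched by binary search, a minimal Python sketch of such a query may help fix ideas. The class name AxisSortedIndex is hypothetical, the offset ε is simply set equal to the query radius, and the subsequent filtering by DF bounds is left out.

```python
import bisect
import math


class AxisSortedIndex:
    """Three sorted arrays of site points, one per axis, queried by binary search."""

    def __init__(self, points):
        self.points = list(points)
        # For each axis: (coordinate, point id) pairs sorted by coordinate.
        self.axes = [
            sorted((p[d], idx) for idx, p in enumerate(self.points)) for d in range(3)
        ]
        self.keys = [[coord for coord, _ in axis] for axis in self.axes]

    def neighbours(self, query, eps):
        """Ids of points within Euclidean distance eps of the query point."""
        counts = {}
        for d in range(3):
            lo = bisect.bisect_left(self.keys[d], query[d] - eps)
            hi = bisect.bisect_right(self.keys[d], query[d] + eps)
            for _, idx in self.axes[d][lo:hi]:
                counts[idx] = counts.get(idx, 0) + 1  # flagged once per matching axis
        # Keep points flagged on all three axes, then apply the distance test.
        return [
            idx for idx, n in counts.items()
            if n == 3 and math.dist(self.points[idx], query) <= eps
        ]


index = AxisSortedIndex([(0, 0, 0), (1, 0, 0), (0, 3, 0), (5, 5, 5), (2, 2, 2)])
print(sorted(index.neighbours((0, 0, 0), 3.0)))
```

The point (2, 2, 2) in the example lies inside the axis-aligned box around the query but outside the sphere of radius 3, which is why the final Euclidean test is still required after the per-axis counts.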
However, the method we adopted is different from actual DBSCAN in two aspects. (1) Minor variation in implementation of finding
nearest neighbours for a given point and (2) Flexible application of clustering parameters based on DF. The method we use for
computing neighbouring points for a given point is different from that of DBSCAN implementation in the sense that here we use
simple and efficient data structures suitable for 3 dimensional points. Three sorted arrays of the site points are created, one
corresponding to each axis. On a nearest neighbourhood query for a point (x1,x2,x3), a binary search is done on each of the sorted
arrays Aj corresponding to the axis along jth dimension (j ∈ [1..3] corresponding to X, Y and Z axes). These searches find the set of
site points possessing values between xj-ε and xj+ε for its jth dimension where ε is an offset value computed based on the distance
parameter of the neighbourhood query. The set of points determined when processing for each dimension are flagged with an
incremental count. Finally those points that possess a count equal to the number of dimensions i.e., 3 and those that are within the
given Euclidian distance are reported as neighbours. The output of this initial neighbourhood search is further filtered by imposing
bounds on DF of grid cells. Once the method of serving nearest neighbor query has been established, the clustering proceeds as
described in DBSCAN (Ester et al., 1996) using the concepts of density-reachability, density-connectivity and core-points but with
added flexibility of user specified ranges of DF values determining clustering-radius and minimum-number of neighbors. Clusteringradius and minimum number of grid bar cells in that radius defines clustering stringency. The flexibility in our implementation of
DBSCAN algorithm is that it uses clustering stringencies and corresponding range of DF values for a grid cell from a clustering
parameter file. Clustering stringency is adjusted in accordance with the depth of the cells as follows: a site cell having depth factor in
the range of 0 to 1.0 and should contain at-least 200 site points as neighbours within a radius of 3Ǻ, where as those site cells that have
depth factors above 4.0, require only 50 site point neighbours within a radius of 3Ǻ to form a cluster. The clustering parameters used
in this study are described in Table-1A. We observe that the scenarios at the binding sites encountered in majority of protein structures
can be described in two ways: deep pockets and shallow depressions. While the algorithm can detect both these types, the parameters
need to be tuned to determine the exact bounds of the predicted pocket. We therefore use two different parameter sets where-in a
given protein is first scanned with the more stringent ‘deeper’ parameter, the protein points not identified as putative pocket cells then
scanned with the less stringent ‘surface’ parameter set. Clusters predicted with the ‘deeper’ set are given higher priority while ranking
15
the clusters. This has been automated on the web implementation of the algorithm. Clusters identified with the surface set, if adjacent
to those with the deeper set are merged together and ranked based on the proportion of cluster cells identified with the ‘deeper’
parameter set.
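The neighbourhood query of Step-8 (three axis-sorted arrays, a binary search per axis, a hit count per point and a final Euclidean check) can be sketched in Python as follows. This is an illustrative sketch, not the authors' code: the class name AxisSortedIndex, its method names, the choice of ε equal to the query radius, and the inclusive treatment of the DF bounds are assumptions.

```python
import bisect
import math


class AxisSortedIndex:
    """Neighbour search over 3D site points using one sorted array per axis."""

    def __init__(self, points, dfs=None):
        self.points = [tuple(p) for p in points]
        self.dfs = list(dfs) if dfs is not None else None
        self.axes = []
        for j in range(3):
            # Point ids ordered by their j-th coordinate, plus the sorted values.
            order = sorted(range(len(self.points)), key=lambda i: self.points[i][j])
            coords = [self.points[i][j] for i in order]
            self.axes.append((coords, order))

    def neighbours(self, query, radius, df_bounds=None):
        """Return sorted ids of points within `radius` of `query`.

        A site point located exactly at `query` is included in the result.
        """
        counts = {}
        for j, (coords, order) in enumerate(self.axes):
            # Binary search for the slab [x_j - eps, x_j + eps]; eps = radius here.
            lo = bisect.bisect_left(coords, query[j] - radius)
            hi = bisect.bisect_right(coords, query[j] + radius)
            for i in order[lo:hi]:
                counts[i] = counts.get(i, 0) + 1
        # Keep points flagged on all three axes that also pass the distance test.
        hits = [
            i for i, c in counts.items()
            if c == 3 and math.dist(self.points[i], query) <= radius
        ]
        if df_bounds is not None and self.dfs is not None:
            low, high = df_bounds
            hits = [i for i in hits if low <= self.dfs[i] <= high]
        return sorted(hits)
```

The three slab searches select a cube around the query point, so the Euclidean test is still needed to discard points in the corners of that cube, such as a point at (1,1,1) for a query at the origin with a radius of 1.5.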
Typically, most proteins are found to contain on the order of 5 to 10 clusters by this approach. Cluster numbers are written into the
residue-number field of the PDB line corresponding to the site cell, which is represented as a dummy atom. Any site point that does not belong to
a cluster, and hence does not have a cluster-id, is assigned a residue number of 999 to indicate that it is a noise point. The clusters are
then ranked by the number of site points that they contain, the top-ranked cluster being the one with the maximum number of site
points. Different bounds on the value of DF and different clustering stringencies, i.e., the radius around each grid cell and the threshold minimum number
of neighbours, give different shapes to the clusters of grid cells. The quality of clustering depends directly on the shape of the clusters, and
the shape depends on the balance between clustering stringency and the bounds on DF. For a surface pocket, if the clustering stringency is
too low, almost the entire surface would become a single cluster, which is unacceptable, whereas for a deeper pocket, if the clustering
stringency is too high, small groups of cells/site points that correspond to a pocket would be reported as noise, thereby losing a potential
pocket. In addition, within the same protein there may exist groups of grid cells with varying densities and depths, requiring a flexible,
dynamic clustering scheme. In view of these considerations, our flexible DBSCAN-type clustering has been useful and has performed well in identifying
possible pockets.
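A minimal Python sketch of depth-dependent DBSCAN-type clustering, followed by ranking of clusters by size, is given below. It is an illustration under stated assumptions rather than the published implementation: the function names are hypothetical, a brute-force neighbourhood query stands in for the sorted-array search, neighbours are counted regardless of their own DF, the DF ranges are treated as half-open intervals, and the parameter rows in the usage note are toy values, not those of Table-1A.

```python
import math
from collections import Counter, deque

NOISE = 999  # residue number used in the text for noise points


def params_for(df, param_rows):
    """Return (radius, min_neighbours) for the first row with lower <= df < upper."""
    for lower, upper, radius, min_neighbours in param_rows:
        if lower <= df < upper:
            return radius, min_neighbours
    return None


def region_query(points, i, radius):
    """Brute-force neighbourhood of point i (the point itself is included)."""
    return [k for k, p in enumerate(points) if math.dist(points[i], p) <= radius]


def depth_dbscan(points, dfs, param_rows):
    """DBSCAN in which each point's radius and density threshold depend on its DF."""
    labels = [None] * len(points)
    cluster = 0
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        prm = params_for(dfs[i], param_rows)
        seeds = region_query(points, i, prm[0]) if prm else []
        if prm is None or len(seeds) < prm[1]:
            labels[i] = NOISE  # may still be claimed later as a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = deque(seeds)
        while queue:
            k = queue.popleft()
            if labels[k] == NOISE:
                labels[k] = cluster  # border point: joins, but is not expanded
            if labels[k] is not None:
                continue
            labels[k] = cluster
            prm_k = params_for(dfs[k], param_rows)
            if prm_k:
                nbrs = region_query(points, k, prm_k[0])
                if len(nbrs) >= prm_k[1]:  # k is a core point: keep expanding
                    queue.extend(nbrs)
    return labels


def rank_clusters(labels):
    """Cluster ids ordered by decreasing number of site points."""
    sizes = Counter(label for label in labels if label != NOISE)
    return [cid for cid, _ in sorted(sizes.items(), key=lambda kv: (-kv[1], kv[0]))]
```

With toy rows such as [(0.0, 4.0, 1.5, 6), (4.0, math.inf, 1.5, 3)], sparse shallow points need six neighbours within 1.5 Å and are reported as noise, while deeper points need only three, which mirrors the trade-off between stringency and depth discussed above.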
Validation
In order to verify whether the protocol described above yielded binding sites that matched well with experimentally observed sites, a
three-fold validation exercise was carried out. The predicted sites were compared with the corresponding experimentally observed ones
for (a) correctness of the location of the site in terms of the binding site residues of the protein lining the pocket, (b) the extent of
overlap of site points with the atoms of the ligand in the corresponding structure and (c) the rank of the cluster occurring at the expected
site. We have used PDBbind, a comprehensive curated data set consisting of 1091 proteins in the form of their respective protein-ligand complexes
(Wang et al., 2005). Hydrogens and water molecules were removed from the PDB files. Small ligands such as sulphate, phosphate and
metal ions were also removed. The sensitivity of the prediction accuracy of our algorithm to enzymes vs non-enzymes, the SCOP class
of the protein, the size of the protein and the size of the ligand was also analysed. Next, a dataset of 48 apo-proteins and their
corresponding ligand complexes, compiled by Huang and Schroeder (2006), was used to validate whether the algorithm was capable of
identifying pockets in apo proteins as well.
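The three validation criteria can be sketched with simple set operations in Python. This is a hedged illustration only: the function names are hypothetical, the 4 Å cutoff for lining residues is taken from the legend of Figure 3, the use of grid-cell sets to represent volumes is an assumption, and the minimum-overlap threshold is left as a required argument because its value is not restated in this section.

```python
import math


def residues_within(atoms, probes, cutoff=4.0):
    """Residue ids having at least one atom within `cutoff` of any probe point.

    atoms: iterable of (residue_id, (x, y, z)); probes: iterable of (x, y, z).
    """
    probes = list(probes)
    return {
        res for res, xyz in atoms
        if any(math.dist(xyz, p) <= cutoff for p in probes)
    }


def percent_overlap(reference, predicted):
    """Percentage of the reference set that is recovered by the predicted set."""
    reference, predicted = set(reference), set(predicted)
    if not reference:
        return 0.0
    return 100.0 * len(reference & predicted) / len(reference)


def tanimoto(a, b):
    """Tanimoto coefficient |A ∩ B| / |A ∪ B| of two sets."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def first_matching_rank(ranked_clusters, ligand_cells, min_overlap):
    """1-based rank of the first cluster covering >= min_overlap % of the ligand cells.

    ranked_clusters: list of sets of grid cells, best-ranked first.
    Returns None when no cluster reaches the threshold (a false negative).
    """
    for rank, cells in enumerate(ranked_clusters, start=1):
        if percent_overlap(ligand_cells, cells) >= min_overlap:
            return rank
    return None
```

Here percent_overlap applied to residue sets corresponds to criterion (a), the same function applied to grid cells corresponds to criterion (b), and first_matching_rank corresponds to criterion (c). The Tanimoto coefficient is the stricter measure because it also penalises predicted cells or residues that lie outside the observed site.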
Rasmol (Sayle and Milner-White 1995) was used for visualization of the predicted pockets and for generating the images
presented in this paper. In order to visualize the clusters reported by the program, a line in PDB format is generated for each grid cell,
where the x, y, z fields indicate the 3D position of the grid cell and the temperature factor field holds its depth factor. A higher
value of DF indicates a deeply buried cell, whereas a lower value indicates a cell located at the surface.
Acknowledgements
Use of facilities at the Interactive Graphics Based Molecular Modeling Facility and the Distributed Information Centre (both
supported by the Department of Biotechnology (DBT), Govt. of India) and of the facilities at the Supercomputer Education and Research
Centre is gratefully acknowledged. Financial support from the DBT computational genomics initiative is also acknowledged.
References
An J., Totrov M. and Abagyan R. (2005)
Pocketome via Comprehensive Identification and Classification of Ligand Binding Envelopes
Molecular and Cellular Proteomics 4 , 752-61
Bhinge A., Chakrabarti P., Uthanumallian K., Bajaj K., Chakraborty K., and
Varadarajan R. (2004)
Accurate Detection of Protein: Ligand Binding Sites Using Molecular Dynamics Simulations
Structure, 12, 1989-1999
Brady G.P. Jr, and Stouten P.F. (2000)
Fast Prediction and visualization of protein binding pockets with PASS
J. Comput Aided Mol Des. 14, 383-401
Burley S.K. (2000)
An Overview of structural genomics
Nature Structural Biology, Suppl:932-934
Del Carpio C.A., Takahashi Y. and Sasaki S. (1993)
A new approach to the automatic identification of candidates for ligand receptor sites in proteins: (I) Search for pocket regions
J. Mol. Graphics, 11, 23-29
Chakrabarti S. and Lanczycki C.J. (2007)
Analysis and Prediction of functionally important sites in proteins
Protein Science 16, 4-13
Chakravarty S. and Varadarajan R. (1999)
Residue Depth: A novel parameter for the analysis of protein structure and stability
Structure, 7 , 723-732
Coleman R.G. and Sharp K.A. (2006)
Travel Depth, a New Shape Descriptor for Macromolecules: Application to Ligand Binding
J. Mol. Biol. 362, 441-458
Coleman R.G., Salzberg A.C., and Cheng A.C (2006)
Structure-Based Identification of Small Molecule Binding Sites Using a Free Energy Model
J. Chem. Inf. Model. 46, 2631-2637
Congreve, M., Murray, C.W. and Blundell, T.L. (2005)
Structural Biology and drug discovery
Drug Discovery Today. 10, 895-907
Ester M., Kriegel H.P., Sander J. and Xu X. (1996)
A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise
Proceedings of 2nd International conference on Knowledge Discovery and Data Mining (KDD-96)
Glaser F., Morris R.J., Najmanovich R.J., Laskowski R.A., Thornton J.M. (2006)
A Method for Localizing Ligand Binding Pockets in Protein Structures
Proteins: Structure, Function and Bioinformatics 62, 479-488
Goodford P.J. (1985)
A Computational Procedure for Determining Energetically Favorable Binding Sites of Biologically Important Macromolecules
Journal of Medicinal Chemistry, 28, 849-857
Hendlich M., Rippmann F. and Barnickel G. (1997)
LIGSITE: Automatic and efficient detection of potential small molecule binding sites in proteins
Journal of Molecular Graphics and Modelling 15, 359-363
Huang B. and Schroeder M. (2006)
LIGSITEcsc: Predicting ligand binding sites using the Connolly surface and degree of conservation
BMC Structural Biology 6, 19
Jeffery C.J. (1999)
Moonlighting proteins
Trends in Biochemical Sciences, 24, 8-11
Kleywegt G.J. and Jones T.A. (1994)
Detection, delineation, measurement and display of cavities in macromolecular structures
Acta Cryst. D50, 178-185
Landon M.R., Lancia D.R. Jr., Yu J., Thiel S.C., and Vajda S. (2007)
Identification of Hot Spots within Druggable Binding Regions by Computational Solvent Mapping of Proteins
J. Med. Chem. 50, 1231-1240
Laurie A.T.R. and Jackson R.M. (2005)
Q-SiteFinder: an energy-based method for the prediction of protein-ligand binding sites
Bioinformatics, 21, 1908-1916
Liang J., Edelsbrunner H., and Woodward C. (1998)
Anatomy of protein pockets and cavities: measurement of binding site geometry and implications for ligand design
Protein Science. 7, 1884-97
Brylinski M., Kochanczyk M., Broniatowska E. and Roterman I. (2007)
Localization of ligand binding site in proteins identified in silico.
J. Mol. Model. 13, 665-75
Peters K.P., Fauck J. and Frommel C. (1996)
The Automatic Search for Ligand Binding Sites in Proteins of Known Three-dimensional Structure Using only Geometric Criteria
J. Mol. Biol. 256, 201-213
Pintar A., Carugo O., and Pongor S. (2003)
Atom Depth as a Descriptor of the Protein Interior
Biophysical Journal 84 , 2553-2561
Ramachandraiah and Chandra N.R. (2000)
Sequence and structural determinants of mannose recognition.
Proteins: Structure, Function and Bioinformatics , 39:358-364
Sayle R.A., and Milner-White E.J. (1995)
RASMOL: biomolecular graphics for all
Trends in Biochemical Sciences, 20, 374-376
Soga S., Shirai H., Kobori M., and Hirayama N. (2007)
Use of Amino Acid Composition to Predict Ligand-Binding Sites
J. Chem. Inf. Model. 47, 400-406
Stockwell G.R. and Thornton J.M. (2006)
Conformational diversity of ligands bound to proteins
J. Mol. Biol. 356, 928-44
Taroni C., Jones S. and Thornton J.M. (2000)
Analysis and prediction of carbohydrate binding sites
Protein Engineering, 13, 89-98
Varrazzo D., Bernini A., Spiga O., Ciutti A., Chiellini S., Venditti V., Bracci L., and Niccolai N. (2005)
Three-dimensional computation of atom depth in complex molecular structures
Bioinformatics 21, 2856-2860
Venkatachalam C.M., Jiang X., Oldfield T., and Waldman M. (2003)
LigandFit: a novel method for the shape-directed rapid docking of ligands to protein active sites
Journal of Molecular Graphics and Modelling 21, 289-307
Wang R., Fang X., Lu Y., and Wang S. (2005)
The PDBbind Database: Methodologies and Updates
J. Med. Chem. 48, 4111-4119
Wang R., Fang X., Lu Y., and Wang S. (2004)
The PDBbind Database: Collection of Binding Affinities for Protein-Ligand Complexes with Known Three-Dimensional Structures
J. Med. Chem. 47, 2977-2980
Figure Legends
Figure 1: Illustrations depicting the flagging of grid cells, computing grid bars and clustering as in PocketDepth. Only one vertex per
grid cell is highlighted for clarity.
(a) Internal grid cells are marked in blue whereas the surface grid cells are in green and protein atoms are indicated as red spheres. (b)
A comprehensive figure to show the flagging of different grid cells, cyan: external, blue: internal, green: surface, black: internal voids,
red: protein atoms, magenta: grid bars. (c) Clustering of grid bars, two clusters one in green and the other in pink are shown along with
protein atoms in red.
Figure 2: Histograms showing distribution of ratio (in %) of number of cluster grid cells (a) to the total number of grid bar cells (b) to
the total number of internal, surface and grid bar cells.
Figure 3: Histograms depicting percentage overlap between the best predicted cluster and the corresponding ligand where CV
indicates volume occupied by the predicted cluster, LV indicates volume occupied by the ligand in terms of grid cells, LR indicates set
of residues lining 4 Å around the ligand and CR indicates the set of residues lining 4 Å around the predicted pocket. The percentage
overlap is shown in terms of (a) volume overlap, (b) number of residues that surround the pocket and (c) the more stringent Tanimoto
coefficient based on volume.
Figure 4(a): Rank distribution of predicted pockets corresponding to all true positives in the dataset. Ranking is based on four
different schemes. Ranking scheme blue: size of cluster, cyan: number of polar atoms, orange: number of non-polar atoms and brown:
number of surface atoms. (b) The top ranked clusters in PDB:1FDQ corresponding to the binding site of a derivative of hexaenoic
acid (HXA) in each subunit of the dimer. The actual ligand is shown in red as a ball-and-stick model and clusters are shown as blue
and green coloured tiny spheres.
Figure 5: Comparison of pocket prediction for six different proteins using PocketDepth along with four other methods. Ranks for
each prediction, where appropriate, are indicated below in each case. Multiple pockets predicted for each ligand are indicated in different
colours other than grey and red. Proteins are shown as grey ribbons and ligand atoms are shown in red as ball-and-stick models.
Figure 6: Distribution of fraction of true positives (TP) and false negatives (FN) across (a) SCOP classes, (b) different protein sizes
and (c) different ligand sizes. True positives are indicated above the zero line of Y-axis while false negatives are indicated below.
Figure 7: Overview of the algorithm shown as a flow-chart indicating grid construction, labeling, computing grid bars and clustering
of grid bar cells based on DF values and spatial proximity.
Table-1A: Description of parameter sets. Deeper corresponds to more stringent criteria for clustering. LB, UB: Lower and Upper
bounds of DepthFactor (DF). Cluster Radius and Num. of neighbours represent DBSCAN clustering parameters to be used for any grid
cell that has DF within the specified range.

Parameter Set               DF Range LB   DF Range UB   Cluster Radius   Num. of neighbours
deeper/stringent            0             4.0           3.0              190
                            4.0           Max           3.0              50
surface/(less stringent)    0             1.0           3.0              200
                            1.0           5.0           3.5              60
                            5.0           Max           3.0              50
Table-1B: Prediction accuracies from PocketDepth against the PDBbind dataset. TP and FN (protein) indicate the number of true-positives and
false-negatives with respect to the number of proteins in the dataset, whereas TP and FN (ligand) indicate the corresponding numbers for the
ligands bound to those proteins in the dataset. Ranks correspond to ranks of the predicted sites that overlapped with some ligand of the protein
above the threshold minimum overlap. Deeper and surface parameters refer to clustering stringency. Ranking based on protein considers the top-ranked
cluster per protein, whereas ranking based on ligand considers the top-ranked cluster per ligand, since there may be more than one cluster
overlapping with a ligand.
Dataset                          Clustering       Ranking     #TP                        #FN    Ranks
                                 parameter set    based on                                      (==1)   (<=2)   (<=3)   (<=5)   (<=10)
PDBbind 1091                     (deeper)         Protein     841                        255    55.2    73.8    82.2    91.2    97.4
                                                  Ligand      860                        263    54.5    73.2    81.4    91      97.4
PDBbind 255 protein              (surface)        Protein     215                        41     34      56.7    64.2    73      89.3
corresponding to 263 ligands                      Ligand      222                        41     32.9    57.7    64.9    73.4    89.2
Combination                      (in two steps)   Protein     1053 (1053/1091, 96.5%)    41     50.8    70.3    78.5    87.3    95.7
                                                  Ligand      1082 (1082/1123, 96.3%)    41     50      70      77.9    87.3    95.6
PDBbind                          surface only     Protein     1046 (1046/1091, 95.9%)    48     36.4    51.9    58.8    66.3    77.5
                                                  Ligand      1075 (1075/1123, 95.7%)    48     35.9    52.3    58.9    66.2    77.2
46 Apo-Plc                                        Protein     45                         4      42.2    57.8    64.4    88.9    95.6
                                                  Ligand      70                         4      41.4    52.8    61.4    82.8    91.4
Table-2: Prediction accuracies using PocketDepth against 3 other datasets

Dataset        #proteins in the dataset   #tp proteins   %-TP     Protein based Ranks 1..5   Protein based Ranks 1..10
LigsiteCSC     209                        204            97.6     89.7%                      96.9%
Qsitefinder    126                        120            95.24    89.4%                      96.5%
LigandFit      15                         15             100%     100%                       100%
Figure 5 (image panels): columns show predictions by CASTp, PocketDepth, LigandFit, LigSiteCSC and QSiteFinder for PDB entries 1sp5 (AB), 1a72, 1ais, 1a1m (A), 2pel (A) and 2g88, with the ranks of the predicted pockets given below each panel.