Extreme Gradient Boosting as a Method for Quantitative Structure−
Activity Relationships
Robert P. Sheridan,*,† Wei Min Wang,‡ Andy Liaw,§ Junshui Ma,§ and Eric M. Gifford∥
†Modeling and Informatics Department, Merck & Co. Inc., 126 E. Lincoln Ave., Rahway, New Jersey 07065, United States
‡Data Science Department, MSD International GmbH (Singapore Branch), 1 Fusionopolis Place, #06-10/07-18, Galaxis, Singapore 138522
§Biometrics Research Department, Merck & Co. Inc., 126 E. Lincoln Ave., Rahway, New Jersey 07065, United States
∥Bioinformatics Department, MSD International GmbH (Singapore Branch), 1 Fusionopolis Place, #06-10/07-18, Galaxis, Singapore 138522

J. Chem. Inf. Model. 2016, 56, 2353−2360. DOI: 10.1021/acs.jcim.6b00591
Received: September 29, 2016. Published: November 23, 2016.
ABSTRACT: In the pharmaceutical industry it is common to generate many QSAR
models from training sets containing a large number of molecules and a large number of
descriptors. The best QSAR methods are those that can generate the most accurate
predictions but that are not overly expensive computationally. In this paper we compare
eXtreme Gradient Boosting (XGBoost) to random forest and single-task deep neural nets
on 30 in-house data sets. While XGBoost has many adjustable parameters, we can define a
set of standard parameters at which XGBoost makes predictions, on the average, better than
those of random forest and almost as good as those of deep neural nets. The biggest
strength of XGBoost is its speed. Whereas efficient use of random forest requires generating
each tree in parallel on a cluster, and deep neural nets are usually run on GPUs, XGBoost
can be run on a single CPU in less than a third of the wall-clock time of either of the other
methods.
■ INTRODUCTION
Quantitative structure−activity relationship (QSAR) modeling is a very
commonly used technique in the pharmaceutical industry for
predicting on-target and off-target activities. Such predictions
help prioritize the experiments during the drug discovery
process. Higher prediction accuracy is always desirable, but
there are practical constraints due to the fact that in an
industrial environment there may be a large number (dozens)
of models trained on a very large number (>100 000) of
compounds and a large number (several thousand) of
substructure descriptors. These models may need to be
updated frequently (say, every few months). Our general rule
of thumb is that a QSAR method should be able to build a
model from 300 000 molecules with 10 000 descriptors within
12 h elapsed time, without manual intervention. QSAR
methods that are particularly compute-intensive or require
the adjustment of many sensitive parameters to achieve good
prediction for an individual QSAR data set are less desirable.
Only a small number of the many new machine learning
algorithms are suitable for routine application in drug discovery
because of the above constraint. Currently, the most commonly
used methods are variations on random forest (RF)1,2 and
support vector machine (SVM),3 which are among the most
predictive and have few adjustable parameters. In particular, RF
has been very popular since it was introduced as a QSAR
method by Svetnik et al.2 Due to its high prediction accuracy,
ease of use, and robustness to adjustable parameters, RF has
been something of a "gold standard" to which other QSAR methods are compared.4 This is also true for non-QSAR types of machine learning.5 RF has been our workhorse QSAR method for many years.

In 2012, Merck sponsored a Kaggle competition (www.kaggle.com/c/MerckActivity) to examine how well state-of-the-art machine learning methods can perform on QSAR problems. From that exercise we became aware that neural nets had undergone a renaissance and that deep neural nets (DNN) had disruptively improved machine learning. We6 and other laboratories (reviewed in Gawehn et al.7) subsequently showed that DNNs are a practical method to apply to large QSAR problems and that DNNs make somewhat better predictions on the average than RF. One interesting point is that, although DNNs have many adjustable parameters, one can find a standard set of parameters suitable for most QSAR problems.

In the past this laboratory8 experimented with alternative recursive partitioning methods such as gradient boosting. Whereas RF builds an ensemble of independent recursive partitioning trees of unlimited depth, gradient boosting builds a sequential series of shallow trees, where each tree corrects for the residuals in the predictions made by all the previous trees. At the time we concluded that, while the gbm9 implementation of gradient boosting [https://cran.r-project.org/web/packages/gbm/gbm.pdf] provided prediction accuracy similar to that of RF, its large number of adjustable parameters made it harder to use.

Analogous to the situation with neural nets, the field of gradient boosting has progressed, and a new implementation by Chen and Guestrin called eXtreme Gradient Boosting (XGBoost)10 is readily available [https://github.com/dmlc/xgboost]. XGBoost builds on previous ideas in gradient boosting. Basically, the training is done using an "additive strategy": given a molecule i with a vector of descriptors x_i, a tree ensemble model uses K additive functions to predict the output:

\hat{y}_i = \phi(x_i) = \sum_{k=1}^{K} f_k(x_i), \qquad f_k \in \mathcal{F}    (1)

Here \mathcal{F} is the set of all possible regression trees. The f_k function at each of the k steps maps the descriptor values in x_i to a certain output. It is the function we need to learn, containing the structure of the tree and the leaf scores. However, there are minor improvements in the regularized objective which turn out to be helpful in practice. Specifically, XGBoost tries to minimize the regularized objective

\mathcal{L}(\phi) = \sum_i l(\hat{y}_i, y_i) + \sum_k \Omega(f_k), \qquad \text{where } \Omega(f) = \gamma T + \tfrac{1}{2}\lambda \|\omega\|^2    (2)

In the equation above, l is a differentiable convex loss function that measures the difference between the prediction \hat{y}_i and the target y_i (in our case of QSAR, l is simply mean-square error). The second term penalizes the complexity of the model in terms of the number of leaves in the tree T and the vector of scores on the leaves ω. It helps to smooth the final learned weights to avoid overfitting. We expect the regularized objective will tend to select a model employing simple and predictive functions.

Two other techniques aimed at reducing overfitting are shrinkage and column subsampling. After each step of boosting, the algorithm scales the newly added weights by a factor η. This reduces the influence of each tree and makes the model learn slowly and (hopefully) better. Column (feature) subsampling considers only a random subset of descriptors in building a given tree. This also speeds up the training process by reducing the number of descriptors to consider.

XGBoost also uses a sparsity-aware split-finding approach to train more efficiently on sparse data (the computational complexity is approximately linear in the number of nonmissing entries in the input). This is potentially very helpful for QSAR because, if one uses substructure-type chemical descriptors, the compound/descriptor table is very sparse (with most entries zero), and XGBoost is potentially very fast when trained on even the largest QSAR data set.

The impact of XGBoost has been widely recognized in a number of machine learning and data mining challenges. Among the 29 challenge-winning solutions published at Kaggle's blog during 2015, 17 solutions used XGBoost. However, XGBoost has not yet been used extensively for QSAR, and this study applies XGBoost to 30 in-house QSAR data sets of a practical size. While XGBoost has a very large number of adjustable parameters, we can show that, for our problems, these parameters are not particularly sensitive and one can pick a standard set of parameters with which most QSAR problems can achieve a level of prediction better than RF and almost as good as DNN. Most importantly, XGBoost takes much less compute effort than either of those two methods and produces much smaller models.
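To make eqs 1 and 2 concrete, the following is a minimal residual-fitting sketch of gradient boosting in Python (scikit-learn trees and squared-error loss are assumptions made here for illustration; it shows only the additive strategy and the shrinkage factor η, not XGBoost's Ω penalty or sparsity-aware split finding):

```python
# Minimal sketch of the additive strategy of eq 1 with shrinkage (eta):
# each shallow tree is fit to the residuals of the ensemble built so far.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def boost(X, y, num_round=100, eta=0.05, max_depth=6):
    pred = np.full(len(y), y.mean())    # start from the mean activity
    trees = []
    for _ in range(num_round):
        residual = y - pred             # negative gradient of squared-error loss
        tree = DecisionTreeRegressor(max_depth=max_depth).fit(X, residual)
        pred += eta * tree.predict(X)   # shrunken additive update: f_k scaled by eta
        trees.append(tree)
    return trees

def predict(trees, X, y_mean, eta=0.05):
    return y_mean + eta * sum(t.predict(X) for t in trees)   # eq 1: sum over K trees
```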
Table 1. Data Sets for Prospective Prediction

| data set | type | description | number of molecules (training + test) | number of unique AP, DP descriptors | mean ± stdev activity |
|---|---|---|---|---|---|
| 2C8 | ADME | CYP P450 2C8 inhibition −log(IC50) M | 29958 | 8217 | 4.88 ± 0.66 |
| 2C9BIG | ADME | CYP P450 2C9 inhibition −log(IC50) M | 189670 | 11730 | 4.77 ± 0.59 |
| 2D6 | ADME | CYP P450 2D6 inhibition −log(IC50) M | 50000 | 9729 | 4.50 ± 0.46 |
| 3A4^a | ADME | CYP P450 3A4 inhibition −log(IC50) M | 50000 | 9491 | 4.69 ± 0.65 |
| A-II | Target | binding to Angiotensin-II receptor −log(IC50) M | 2763 | 5242 | 8.70 ± 2.72 |
| BACE | Target | inhibition of beta-secretase −log(IC50) M | 17469 | 6200 | 6.07 ± 1.40 |
| CAV | ADME | inhibition of Cav1.2 ion channel | 50000 | 8959 | 4.93 ± 0.45 |
| CB1^a | Target | binding to cannabinoid receptor 1 −log(IC50) M | 11640 | 5877 | 7.13 ± 1.21 |
| CLINT | ADME | clearance by human microsome log(clearance) μL/min/mg | 23292 | 6782 | 1.93 ± 0.58 |
| DPP4^a | Target | inhibition of dipeptidyl peptidase 4 −log(IC50) M | 8327 | 5203 | 6.28 ± 1.23 |
| ERK2 | Target | inhibition of ERK2 kinase −log(IC50) M | 12843 | 6596 | 4.38 ± 0.68 |
| FACTORXIA | Target | inhibition of factor XIa −log(IC50) M | 9536 | 6136 | 5.49 ± 0.97 |
| FASSIF | ADME | solubility in simulated gut conditions log(solubility) mol/L | 89531 | 9541 | −4.04 ± 0.39 |
| HERG | ADME | inhibition of hERG channel −log(IC50) M | 50000 | 9388 | 5.21 ± 0.78 |
| HERGBIG | ADME | inhibition of hERG ion channel −log(IC50) M | 318795 | 12508 | 5.07 ± 0.75 |
| HIVINT^a | Target | inhibition of HIV integrase in a cell based assay −log(IC50) M | 2421 | 4306 | 6.32 ± 0.56 |
| HIVPROT^a | Target | inhibition of HIV protease −log(IC50) M | 4311 | 6274 | 7.30 ± 1.71 |
| LOGD^a | ADME | logD measured by HPLC method | 50000 | 8921 | 2.70 ± 1.17 |
| METAB^a | ADME | percent remaining after 30 min microsomal incubation | 2092 | 4595 | 23.2 ± 33.9 |
| NAV | ADME | inhibition of Nav1.5 ion channel −log(IC50) M | 50000 | 8302 | 4.77 ± 0.40 |
| NK1^a | Target | inhibition of neurokinin1 (substance P) receptor binding −log(IC50) M | 13482 | 5803 | 8.28 ± 1.21 |
| OX1^a | Target | inhibition of orexin 1 receptor −log(Ki) M | 7135 | 4730 | 6.16 ± 1.22 |
| OX2^a | Target | inhibition of orexin 2 receptor −log(Ki) M | 14875 | 5790 | 7.25 ± 1.46 |
| PAPP | ADME | apparent passive permeability in PK1 cells log(permeability) cm/s | 30938 | 7713 | 1.35 ± 0.39 |
| PGP^a | ADME | transport by p-glycoprotein log(BA/AB) | 8603 | 5135 | 0.27 ± 0.53 |
| PPB^a | ADME | human plasma protein binding log(bound/unbound) | 11622 | 5470 | 1.51 ± 0.89 |
| PXR | ADME | induction of 3A4 by pregnane X receptor; percentage relative to rifampicin | 50000 | 9282 | 42.5 ± 42.1 |
| RAT_F^a | ADME | log(rat bioavailability) at 2 mg/kg | 7821 | 5698 | 1.43 ± 0.76 |
| TDI^a | ADME | time dependent 3A4 inhibitions log(IC50 without NADPH/IC50 with NADPH) | 5559 | 5945 | 0.37 ± 0.48 |
| THROMBIN^a | Target | inhibition of human thrombin −log(IC50) M | 6924 | 5552 | 6.67 ± 2.02 |

^a Kaggle data sets.
■ METHODS

Data Sets. Table 1 shows the in-house data sets used in this study, which are the same as those in the work of Ma et al.6 These data sets represent a mixture, relevant to pharmaceutical research, of on-target and off-target activities, activities that are easy and hard to predict, and large and small data sets. The data sets are available in the Supporting Information with the descriptors and molecule names disguised so the original compounds cannot be reverse-engineered from the descriptors. The 15 data sets labeled "Kaggle" were used for the Kaggle competition. As before, we use in-house data sets because:
1. We wanted data sets which are realistically large and whose compound activity measurements have a realistic amount of experimental uncertainty and include a non-negligible amount of qualified data.
2. Time-split validation (see below), which we consider more realistic than any random cross-validation, requires dates of testing, and these are almost impossible to find in public domain data sets.

A number of these data sets contain significant amounts of qualified data. For example, one might know only that the measured IC50 is greater than 30 μM because 30 μM was the highest concentration titrated. For the purposes of model-building those activities are treated as fixed numbers, because most off-the-shelf QSAR methods handle only fixed numbers. For example, IC50 > 30 μM is set to IC50 = 30 × 10−6 M, or −log(IC50) = 4.5. Our experience is that it is best to keep such qualified data in the QSAR training sets; otherwise less active compounds are often predicted to be more active than they really are.

In order to compare the predictive ability of QSAR methods, each of these data sets was split into two nonoverlapping subsets: a training set and a test set. Although a usual way of making the split is by random selection, i.e. "split at random," in actual practice in a pharmaceutical environment QSAR models are applied "prospectively". That is, predictions are made for compounds not yet tested in the appropriate assay, and these compounds may or may not have analogs in the training set. The best way of simulating this is to generate training and test sets by "time-split": for each data set, the first 75% of the molecules assayed for the particular activity form the training set, while the remaining 25% of the compounds assayed later form the test set. We have found that, for regressions, R2 from time-split validation better estimates the R2 for true prospective prediction than R2 from any "split at random" scheme.11 Since training and test sets are not selected from the same pool of compounds, the descriptor distributions in these two subsets are frequently not the same as, or even similar to, each other.
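A minimal sketch of this time-split, assuming a pandas DataFrame with a hypothetical assay_date column:

```python
# Time-split: first 75% of compounds by assay date -> training set,
# last 25% -> test set ("assay_date" is a hypothetical column name).
import pandas as pd

def time_split(df: pd.DataFrame, date_col: str = "assay_date", frac: float = 0.75):
    df = df.sort_values(date_col)
    cut = int(len(df) * frac)
    return df.iloc[:cut], df.iloc[cut:]
```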
QSAR Descriptors. Each molecule is represented by a list of features, i.e. "descriptors" in QSAR nomenclature. In this paper, we use a set of descriptors that is the union of AP, the original "atom pair" descriptor from Carhart et al.,12 and DP descriptors ("donor−acceptor pair"), called BP in the work of Kearsley et al.13 Both descriptors are of the form:

atom type i − (distance in bonds) − atom type j

For AP, atom type includes the element, number of nonhydrogen neighbors, and number of pi electrons; it is very specific. For DP, atom type is one of seven (cation, anion, neutral donor, neutral acceptor, polar, hydrophobe, and other); it contains a more generic description of chemistry.
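As a toy illustration of the AP string form (not the authors' production code), one could enumerate such pairs with RDKit; the pi-electron count below is only a rough proxy for the Carhart definition:

```python
# Toy enumeration of AP-style descriptors "atom type i-(bond distance)-atom type j",
# assuming RDKit; atom type = element, heavy-atom degree, crude pi-electron count.
from rdkit import Chem

def atom_pairs(smiles: str):
    mol = Chem.MolFromSmiles(smiles)
    dmat = Chem.GetDistanceMatrix(mol)   # topological (bond) distances

    def atype(atom):
        n_pi = sum(1 for b in atom.GetBonds()
                   if b.GetBondType() != Chem.BondType.SINGLE)  # crude pi proxy
        return f"{atom.GetSymbol()}{atom.GetDegree()}.{n_pi}"

    n = mol.GetNumAtoms()
    return [f"{atype(mol.GetAtomWithIdx(i))}-{int(dmat[i][j])}-{atype(mol.GetAtomWithIdx(j))}"
            for i in range(n) for j in range(i + 1, n)]
```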
Random Forest. RF is an ensemble recursive partitioning
method where each recursive partitioning “tree” is generated
from a bootstrapped sample of compounds, and a random
subset of descriptors is used at the branching of each node in
the tree. RF naturally handles correlation between descriptors,
and does not need a separate descriptor selection procedure to
obtain good performance. Importantly, while there are a
handful of adjustable parameters (e.g., number of trees, fraction
of descriptors used at each branching, node size, etc.), the
quality of predictions is generally insensitive to changes in these
parameters.
The version of RF we are using is a modification of the
original FORTRAN code from Breiman,1 which is built for
regressions. It has been parallelized to run one tree per
processor on a cluster. Such parallelization is necessary to run
some of our larger data sets in a reasonable time. For all RF
models, we generate 100 trees with m/3 descriptors used at
each branch-point, where m is the number of unique
descriptors in the training set. The tree nodes with five or
fewer molecules are not split further. We apply these
parameters to every data set.
Deep Neural Nets. Our Python-based DNN code is the
one obtained through the Kaggle competition from Dahl,14
then at the University of Toronto, and modified by us. The
DNN results we present are for single-task regression models
using the “standard parameters” in Ma et al.,6 which are applied
to all data sets:
1. Four hidden layers with 4000, 2000, 1000, and 1000
neurons.
2. Dropout rate 0% in the input layer, 25% in the first three hidden layers, and 10% in the fourth hidden layer.
3. Logarithmic transformation of the descriptors. ReLU
activation.
4. Minibatch size of 125, and 300 training epochs.
5. No unsupervised pretraining.
For timing purposes we also have implemented a simplified
(“quick”) version of the DNN, which achieves almost identical
prediction accuracy to the standard parameters but uses a
smaller neural net.
1. Two intermediate layers of 1000 and 500 neurons with
25% dropout.
2. 75 training epochs.
Otherwise identical to the standard parameters.
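A sketch of the "standard" architecture in PyTorch (an analog for illustration; the original is Dahl's Python code, and training details such as the minibatch size of 125 and 300 epochs are omitted):

```python
# Standard single-task DNN: 4000-2000-1000-1000 hidden units, ReLU,
# dropout 25%/25%/25%/10%, no dropout on the (log-transformed) input.
import torch.nn as nn

def standard_dnn(n_descriptors: int) -> nn.Sequential:
    widths, drops = [4000, 2000, 1000, 1000], [0.25, 0.25, 0.25, 0.10]
    layers, prev = [], n_descriptors
    for width, p in zip(widths, drops):
        layers += [nn.Linear(prev, width), nn.ReLU(), nn.Dropout(p)]
        prev = width
    layers.append(nn.Linear(prev, 1))   # single regression output
    return nn.Sequential(*layers)
```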
XGBoost. The implementation of XGBoost is the C++ version runnable on Linux. There are several dozen adjustable parameters; the full set is visible at https://github.com/dmlc/xgboost/blob/master/doc/parameter.md. The ones we examined in detail are those which could show deviations from the defaults in QSAR problems:
ETA (step size shrinkage). This controls the weights of subsequent trees. Default = 0.3.
MAX_DEPTH (maximum depth of a tree). Default = 6.
NUM_ROUND (maximum number of trees). Default = 10.
COLSAMPLE_BYTREE (the fraction of descriptors examined for each tree). Default = 1.
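For reference, these parameters map onto the xgboost Python API roughly as follows (a sketch; the study itself used the C++ command-line version, and older xgboost releases name the squared-error objective "reg:linear"):

```python
# ETA, MAX_DEPTH, and COLSAMPLE_BYTREE go in the params dict;
# NUM_ROUND is the num_boost_round argument of xgb.train.
import numpy as np
import xgboost as xgb

X_train = np.random.rand(200, 50)   # placeholder descriptor matrix
y_train = np.random.rand(200)       # placeholder activities

params = {
    "eta": 0.05,               # ETA, kept small throughout this study
    "max_depth": 7,            # MAX_DEPTH
    "colsample_bytree": 0.56,  # COLSAMPLE_BYTREE
    "objective": "reg:squarederror",
    "nthread": 4,              # NTHREAD = 4, as in the Timing section
}
dtrain = xgb.DMatrix(X_train, label=y_train)   # sparse matrices also accepted
model = xgb.train(params, dtrain, num_boost_round=745)   # NUM_ROUND
```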
Metrics. In this paper, the metric used to evaluate prediction performance is R2, which is the squared Pearson correlation coefficient between predicted and observed activities in the test set. The same metric was used in the Kaggle competition and in our previous work. R2 measures the degree of concordance between the predictions and corresponding observations. This value is especially relevant when the whole range of activities is included in the test set. R2 is an attractive measurement for model comparison across many data sets because it is unitless and ranges from 0 to 1 for all data sets. The relative predictivity of the three methods we examine does not change if we use alternative metrics such as Q2 or RMSE.
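In code, this metric is simply (a sketch):

```python
# R^2 as used here: squared Pearson correlation between predicted and
# observed activities (not the "coefficient of determination").
import numpy as np

def r_squared(pred: np.ndarray, obs: np.ndarray) -> float:
    return float(np.corrcoef(pred, obs)[0, 1] ** 2)
```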
Workflow for XGBoost. The goal is to identify the few most important parameters and/or identify a set of parameters that would be useful for most QSAR data sets. We found early on that a small value of ETA (0.05) was useful to keep the models growing slowly, and therefore steadily, so we kept ETA constant. A smaller ETA reduces the influence of each individual tree and leaves space for future trees to improve the model. Use of slow growth in boosting was suggested by Friedman.15 We constructed sets of parameters in the following way (a sketch of the grid search appears after this list):

1. TESTOPT. Find the values of MAX_DEPTH, NUM_ROUND, and COLSAMPLE_BYTREE that, used in the model generated from the training set, give the highest R2 for the test set. This is done by a grid search over the values MAX_DEPTH = (4, 5, 6, 7, 8, 9, 10), NUM_ROUND = (250, 500, 750, 1000), and COLSAMPLE_BYTREE = (0.50, 0.75, 1.0). There are more efficient search methods, but a grid search turned out to be the most straightforward approach to find an optimum quickly, given XGBoost's speed. It should be noted that TESTOPT does not reflect the real-life situation; in reality, we would not know the activity values of the test set in advance. However, it gives us an upper limit on the R2 we can expect on the test set by optimizing these parameters, and it is interesting to know what parameter values we should use if we had prior knowledge. TESTOPT finds a different set of optimum parameters for each data set.

2. TRAINOPT. Find the optimal values of MAX_DEPTH, NUM_ROUND, and COLSAMPLE_BYTREE, again by grid search, by cross-validation of each training set. That is, split the training set in half randomly, make a model from the first half using the parameters, and then predict the remaining half. The set of parameters that gives the highest R2 for the prediction of the second half of the training set is used to generate a model using the entire training set. This model is used to predict the test set. This is more representative of the real-life situation because we are optimizing only on the training set. However, since the training set is not necessarily like the test set, optimizing on the training set may not be useful for getting the maximum R2 on the test set. TRAINOPT finds a different set of optimum parameters for each data set.

3. STANDARD. The goal is to find a common value of MAX_DEPTH, NUM_ROUND, and COLSAMPLE_BYTREE to be used for all data sets. The most straightforward way of generating such a standard set is to take the mean optimum values of MAX_DEPTH, NUM_ROUND, and COLSAMPLE_BYTREE over all data sets from TESTOPT, TRAINOPT, or both.
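A sketch of the TRAINOPT grid search under these definitions (a hypothetical helper; assumes NumPy arrays and the xgboost Python API):

```python
# TRAINOPT sketch: split the training set in half at random, fit on one half
# for every grid point, and keep the combination with the best R^2 on the other.
import itertools
import numpy as np
import xgboost as xgb
from scipy.stats import pearsonr

def trainopt(X, y, eta=0.05, seed=0):
    idx = np.random.default_rng(seed).permutation(len(y))
    a, b = idx[: len(y) // 2], idx[len(y) // 2:]
    da, db = xgb.DMatrix(X[a], label=y[a]), xgb.DMatrix(X[b])
    best_params, best_r2 = None, -1.0
    grid = itertools.product((4, 5, 6, 7, 8, 9, 10),      # MAX_DEPTH
                             (250, 500, 750, 1000),       # NUM_ROUND
                             (0.50, 0.75, 1.0))           # COLSAMPLE_BYTREE
    for depth, rounds, csbt in grid:
        params = {"eta": eta, "max_depth": depth, "colsample_bytree": csbt,
                  "objective": "reg:squarederror"}
        bst = xgb.train(params, da, num_boost_round=rounds)
        r2 = pearsonr(bst.predict(db), y[b])[0] ** 2
        if r2 > best_r2:
            best_params, best_r2 = (depth, rounds, csbt), r2
    return best_params, best_r2
```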
Timing. The three methods we are comparing (RF, DNN, and XGBoost) work on different machine architectures and/or modes in our environment:
1. RF runs as 100 jobs (one for each tree) running in parallel on a cluster. The cluster nodes are HP ProLiant BL460c Gen8 server blades, each equipped with two 8-core Intel Xeon E5-2670 processors at 2.60 GHz and 256 GB of random access memory (RAM).
2. XGBoost runs on a single node of the above cluster with multiple threads (NTHREAD = 4).
3. DNN runs on a single NVIDIA Tesla C2070 GPU.

It is hard to precisely compare the total computational expense of each method in our mixed environment, but some estimation is possible. By running DNN on a cluster CPU vs a GPU, we know that the GPU is 10-fold faster than a CPU. We also know the total time for a RF model is 100 times the time to generate one tree on one cluster node. Perhaps most relevant to users of these methods, however, is the wall-clock time elapsed between when a model calculation is initiated and when the model is written to disk and available to predict new compounds. Estimating the wall-clock time for RF is problematic because the 100 parallel jobs may not finish simultaneously. Sometimes one or two may be delayed unpredictably, and the time between the launch of the jobs and the completion of all of them is too dependent on the system load. A more robust, albeit slightly optimistic, measure is the time at which the first 95 jobs (i.e., trees) are completed. Included in the wall-clock time is the time to read through the descriptor file and create the molecule/descriptor table.
Model Size. Another interesting parameter is the size of the model file, in that the speed of prediction is sometimes limited in practice by the time taken to read the model into memory or by copying the model from an archive to the computer doing the predicting. Here we note the size of the (binary) files comprising the model. In the case of RF, we multiply the size of a single tree by 100. In practice the size of a model file can vary depending on the particular implementation of the QSAR method, but looking at the size will give us a rough idea of the relative complexity of models from the respective methods.
■ RESULTS

Optimizations and Standard Parameters for XGBoost. The optimum set of parameters for individual data sets and the R2 for each type of optimization are in the Supporting Information. The difference between the best and worst R2 in the grid search is 0.07 ± 0.03 over all 30 data sets, and this is true for both TESTOPT and TRAINOPT. That is, prediction accuracy is not particularly sensitive to the parameters we examined.

Mean optimum parameter values are shown in Table 2. While the optimum values of the parameters do not correlate well between TESTOPT and TRAINOPT for individual data sets (not shown), the mean optimum parameter values for TESTOPT and TRAINOPT are not far apart relative to the overall range of each parameter. There is also not much difference between the Kaggle and non-Kaggle subsets. These observations suggest it is possible to construct a set of standard parameters usable for most data sets. We averaged over both TESTOPT and TRAINOPT to obtain the standard parameters MAX_DEPTH = 7 (nearest integer to 7.4), NUM_ROUND = 745, and COLSAMPLE_BYTREE = 0.56. Predictions using these values are called XGBoost_STANDARD.

Table 2. Mean Optimal Values for Three Parameters

| | MAX_DEPTH | NUM_ROUND | COLSAMPLE_BYTREE |
|---|---|---|---|
| TESTOPT Kaggle | 6.1 | 700 | 0.60 |
| TESTOPT non-Kaggle | 8.1 | 783 | 0.60 |
| TESTOPT ALL | 7.1 | 741 | 0.60 |
| TRAINOPT Kaggle | 6.8 | 683 | 0.55 |
| TRAINOPT non-Kaggle | 8.5 | 816 | 0.52 |
| TRAINOPT ALL | 7.7 | 750 | 0.53 |
| TESTOPT + TRAINOPT ALL | 7.4 | 745 | 0.56 |

In TRAINOPT and TESTOPT there is a weak relationship between the optimum MAX_DEPTH and the optimum NUM_ROUND vs Ntraining, the number of molecules in the training set: smaller data sets tend to prefer smaller MAX_DEPTH and smaller NUM_ROUND. Therefore, it might be possible to guess a good MAX_DEPTH and NUM_ROUND for individual data sets based only on Ntraining. However, we show below that the STANDARD parameters already give predictions as good as the TRAINOPT grid search, so this type of refinement is not likely to be helpful overall.

Accuracy of Prediction for the QSAR Methods. Figure 1 (top) shows the R2 for prediction of the test set for standard DNN, XGBoost_TRAINOPT, and XGBoost_STANDARD vs the R2 for RF, which we take as the baseline method. On the average, the XGBoost results are better than those of RF, i.e. the red and green best-fit lines are above the dashed line indicating the diagonal, and nearly as good as those of DNN, indicated by the blue best-fit line. The improvement of XGBoost over RF tends to become smaller at higher R2. These trends are made more visible by subtracting out the RF value, as shown in Figure 1 (bottom). That there is no systematic difference between XGBoost_TRAINOPT (green) and XGBoost_STANDARD (red) is seen in the superposition of the green and red lines in Figure 1. That is, optimizing the parameters for each individual training set in TRAINOPT does not result in better predictions on the test set than are obtained by simply using the STANDARD set of parameters for all data sets.

Figure 1. Prediction accuracy on the test set for XGBoost and deep neural nets vs the prediction accuracy of random forest. Two different types of XGBoost parameters are shown, one with the parameters optimized for individual training sets (green), and one using a standard set of parameters for all data sets (red). (top) Absolute R2. (bottom) R2 minus the R2 for random forest. In both plots the dashed line indicates the R2 for random forest.

The visual impression in Figure 1 is consistent with the mean R2 over the 30 data sets: RF = 0.39, DNN = 0.43, XGBoost_TRAINOPT = 0.42, XGBoost_STANDARD = 0.42. One should note these are only averages; any of the three methods may do best on a particular data set. We do not show XGBoost_TESTOPT in Figure 1 because it is, as previously mentioned, expected to be unrealistically optimistic; however, we note that XGBoost_TESTOPT achieves only a slightly better R2 than XGBoost_TRAINOPT (mean R2 = 0.44).

Timing. Total compute effort and wall-clock time for the different methods are shown in Figure 2 as a function of Ntraining. The log−log plot is the one in which all methods show maximally linear correlations and the range of timings can be appreciated. The total compute effort for DNN and XGBoost is roughly linear with Ntraining. As expected, the DNN using fewer neurons ("quick") requires less computation than the standard DNN. The total compute effort for RF rises roughly as the square of Ntraining. Clearly, XGBoost requires the least total compute effort, by an order of magnitude (top), and is somewhat faster in wall-clock time (bottom), at least for the larger data sets. QSAR methods tend to converge in wall-clock time for small data sets; for them, reading the descriptor files and forming a molecule/descriptor table, which is shared by all methods, becomes the rate-limiting step.

Figure 2. Total computational effort (top) and wall-clock time (bottom) for random forest, deep neural nets, and XGBoost. The computational effort is expressed in units of hours on a single cluster node. For this plot, all XGBoost models were made with NUM_ROUND = 1000, which takes the maximum compute time.

Model Size. Total model file size (in megabytes) is shown in Figure 3 as a function of Ntraining. Again, the log−log plot is the one in which all methods show maximally linear correlations and the range of model file sizes can be appreciated. The size of DNN models is expected to depend on the total number of neurons. The number of neurons in the lowest layer will depend on the number of descriptors, which varies roughly as log(Ntraining), and the number of neurons in the intermediate layers will depend on the number of intermediate layers and the number of neurons per layer set by the user. Effectively, the dependence of size is approximately log(Ntraining). We would expect networks with fewer layers and fewer neurons per layer (the quick DNN) to produce smaller models than the original standard DNN, and they do. The size of XGBoost models might be expected to be almost constant with Ntraining because there is a fixed number of trees with a small maximum depth, both set by the user. However, there is a small dependence of size roughly with log(Ntraining), which probably reflects the fact that larger data sets have more trees closer to the maximum depth. In contrast, the number of nodes in an unpruned recursive partitioning tree should be linear with Ntraining, and we see this for RF. The model sizes of XGBoost are much smaller than those of any of the other methods for a given value of Ntraining, except RF at very small values of Ntraining.

Figure 3. Total model file size for the QSAR methods. Here the XGBoost models are made with the standard parameters.
■ DISCUSSION

XGBoost appears to be a very effective and efficient machine-learning method. Here we have demonstrated this is true in the realm of QSAR. It is effective in the sense that it provides prospective predictions at least as good on the average as those of RF, a gold-standard method, and nearly as good as those of single-task DNN.
It is efficient because it achieves these results with much less
computational effort than either of those methods and
produces much smaller models. Having XGBoost means we
can potentially handle many more and larger data sets and/or
update them more frequently than we have previously, given
our current compute environment.
The potential difficulty with XGBoost having multiple
adjustable parameters turns out, in practice, to not be a real
issue for QSAR because we can identify standard values of at
least some parameters. We have shown that these standard
parameters can be used effectively with a large number of
QSAR data sets; it is not necessary to optimize the parameters
for each individual data set. Since we did not examine every
parameter, it is possible that we could get results even better
than we show here. We hasten to point out that our standard
parameters apply only to QSAR problems in our environment.
We can say nothing about non-QSAR applications of XGBoost.
We should point out, apropos of the above paragraph, that our work brings out an issue with any kind of QSAR model-building, which we have noted before16 in another context. It is a general assumption in QSAR, and in machine learning in general, that the molecules to be predicted (in the test set) are similar enough to the training set that maximizing the cross-validated predictions of the training set (by using different descriptors, tweaking adjustable parameters, etc.) is equivalent to maximizing predictivity on the test set. In practice, the training and test sets may be different enough that this is not true.
Despite the excellent performance of XGBoost, there are
some issues that should be examined further. Recursive
partitioning methods make predictions based on the average
observed activities of molecules at their terminal nodes. This
has the effect of compressing the range of predictions relative to
the observed activities. For RF we routinely do “prediction
rescaling”16 where the self-fit predicted activities in the training
set of a particular model are linearly scaled to match the
observed activities, and this scaling is applied to further
predictions from that model. This does not affect the R2 of
prediction but does help the numerical match of predicted and
observed activities at the highest and lowest ranges of activity.
We have found XGBoost also benefits from prediction
rescaling.
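The rescaling amounts to a linear correction fit on the training set; a minimal sketch, assuming NumPy arrays of self-fit predictions and observed activities:

```python
# Prediction rescaling: fit observed ~ self-fit predictions on the training
# set, then apply the same line to all later predictions from that model.
import numpy as np

def fit_rescaler(selffit_pred: np.ndarray, observed: np.ndarray):
    slope, intercept = np.polyfit(selffit_pred, observed, 1)
    return lambda pred: slope * pred + intercept

# usage: rescale = fit_rescaler(train_selffit, y_train); y_new = rescale(raw_pred)
```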
One disadvantage of XGBoost (and DNN) is that it does not
produce the domain applicability metric TREE_SD (the
variation in prediction among an ensemble) which is very
useful in estimating the error bar on a QSAR prediction,17 and
which comes “for free” in RF.
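For context, TREE_SD is just the spread of the per-tree predictions; a sketch with a fitted scikit-learn forest as a stand-in for the authors' RF:

```python
# TREE_SD: standard deviation of the individual tree predictions in the
# ensemble, available "for free" from any fitted random forest.
import numpy as np

def tree_sd(rf, X):
    per_tree = np.stack([tree.predict(X) for tree in rf.estimators_])
    return per_tree.std(axis=0)   # one error-bar estimate per molecule
```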
Another potential issue is that boosting methods like gbm9
appear much more sensitive to artificial noise added to the
activity data18 than other methods, in the sense that the R2 of
prediction of a test set goes down more quickly with a given
amount of noise added to the training set. We can confirm that
XGBoost is more sensitive to artificial noise than RF, but about
equivalent in sensitivity to DNN. However, it is not clear how
relevant sensitivity to added noise is to the usefulness of a
QSAR method. Clearly XGBoost is better than RF in the limit
of no added noise.
■ ASSOCIATED CONTENT

Supporting Information
The Supporting Information is available free of charge on the ACS Publications website at DOI: 10.1021/acs.jcim.6b00591.
Table of R2 results for RF, DNN, and XGBoost, and similar tables for timing and model size (TXT).
Training and test data for all 30 data sets (with disguised molecule names and descriptors); these data sets are suitable for comparing the relative goodness of different QSAR methods given the AP,DP descriptors (ZIP).

■ AUTHOR INFORMATION

Corresponding Author
*E-mail: sheridan@merck.com.
ORCID
Robert P. Sheridan: 0000-0002-6549-1635
Notes
The authors declare no competing financial interest.

■ ACKNOWLEDGMENTS

W.M.W. and E.M.G. would like to thank Roy Goh and Ivan Clement, Data Sciences MSD Singapore, for their discussions and support of this study. Dr. Matt Tudor suggested we compare model file sizes.

■ REFERENCES
(1) Breiman, L. Random Forests. Machine Learning 2001, 45, 5−32.
(2) Svetnik, V.; Liaw, A.; Tong, C.; Culberson, J. C.; Sheridan, R. P.;
Feuston, B. P. Random Forest: a Classification and Regression Tool
for Compound Classification and QSAR Modeling. J. Chem. Inf.
Comput. Sci. 2003, 43, 1947−1958.
(3) Cortes, C.; Vapnik, V. N. Support-Vector Networks. Machine
Learning 1995, 20, 273−297.
(4) Bruce, C. L.; Melville, J. L.; Pickett, S. D.; Hirst, J. D.
Contemporary QSAR Classifiers Compared. J. Chem. Inf. Model. 2007,
47, 219−227.
(5) Fernandez-Delgado, M.; Cernades, E.; Barro, S.; Amorim, D. Do
We Need Hundreds of Classifiers to Solve Real World Problems? J.
Machine. Learning. Res. 2014, 15, 3133−3181.
(6) Ma, J.; Sheridan, R. P.; Liaw, A.; Dahl, G. E.; Svetnik, V. Deep Neural Nets as a Method for Quantitative Structure−Activity Relationships. J. Chem. Inf. Model. 2015, 55, 263−274.
(7) Gawehn, E.; Hiss, J. A.; Schneider, G. Deep Learning in Drug
Discovery. Mol. Inf. 2016, 35, 3−14.
(8) Svetnik, V.; Wang, T.; Tong, C.; Liaw, A.; Sheridan, R. P.; Song,
Q. Boosting: an Ensemble Learning Tool for Compound Classification
and QSAR Modeling. J. Chem. Inf. Model. 2005, 45, 786−799.
(9) Friedman, J. H. Stochastic Gradient Boosting. Computational
Statistics and Data Analysis 2002, 38, 367−378.
(10) Chen, T.; Guestrin, C. XGBoost: a Scalable Tree Boosting
System. arXiv:1603.02754v3, 2016.
(11) Sheridan, R. P. Time-Split Cross-Validation as a Method For
Estimating the Goodness of Prospective Prediction. J. Chem. Inf.
Model. 2013, 53, 783−790.
(12) Carhart, R. E.; Smith, D. H.; Venkataraghavan, R. Atom Pairs as
Molecular Features in Structure-Activity Studies: Definition and
Application. J. Chem. Inf. Model. 1985, 25, 64−73.
(13) Kearsley, S. K.; Sallamack, S.; Fluder, E. M.; Andose, J. D.;
Mosley, R. T.; Sheridan, R. P. Chemical Similarity Using
Physiochemical Property Descriptors. J. Chem. Inf. Comput. Sci.
1996, 36, 118−27.
(14) Dahl, G. E.; Jaitly, N.; Salakhutdinov, R. Multi-Task Neural
Networks for QSAR Predictions. arXiv:1406.1231 [stat.ML], 2014;
http://arxiv.org/abs/1406.1231.
(15) Friedman, J. Greedy Function Approximation: a Gradient
Boosting Machine. Annals of Statistics 2001, 29, 1189−1232.
(16) Sheridan, R. P. Global Quantitative Structure-Activity Relationship Models vs Selected Local Models as Predictors of Off-Target
Activities For Project Compounds. J. Chem. Inf. Model. 2014, 54,
1083−1092.
(17) Sheridan, R. P. Using Random Forest to Model the Domain
Applicability of Another Random Forest Model. J. Chem. Inf. Model.
2013, 53, 2837−2850.
(18) Cortes-Ciriano, I.; Bender, A.; Malliavin, T. E. Comparing the
Influence of Simulated Errors on 12 Machine Learning Algorithms in
Bioactivity Modeling Using 12 Diverse Data Sets. J. Chem. Inf. Model.
2015, 55, 1413−1425.