Extreme Gradient Boosting as a Method for Quantitative Structure−Activity Relationships

Robert P. Sheridan,*,† Wei Min Wang,‡ Andy Liaw,§ Junshui Ma,§ and Eric M. Gifford∥

† Modeling and Informatics Department, Merck & Co. Inc., 126 E. Lincoln Ave., Rahway, New Jersey 07065, United States
‡ Data Science Department, MSD International GmbH (Singapore Branch), 1 Fusionopolis Place, #06-10/07-18, Galaxis, Singapore 138522
§ Biometrics Research Department, Merck & Co. Inc., 126 E. Lincoln Ave., Rahway, New Jersey 07065, United States
∥ Bioinformatics Department, MSD International GmbH (Singapore Branch), 1 Fusionopolis Place, #06-10/07-18, Galaxis, Singapore 138522

Received: September 29, 2016. Published: November 23, 2016. DOI: 10.1021/acs.jcim.6b00591. J. Chem. Inf. Model. 2016, 56, 2353−2360.

ABSTRACT: In the pharmaceutical industry it is common to generate many QSAR models from training sets containing a large number of molecules and a large number of descriptors. The best QSAR methods are those that can generate the most accurate predictions without being overly expensive computationally. In this paper we compare eXtreme Gradient Boosting (XGBoost) to random forest and single-task deep neural nets on 30 in-house data sets. While XGBoost has many adjustable parameters, we can define a set of standard parameters at which XGBoost makes predictions, on the average, better than those of random forest and almost as good as those of deep neural nets. The biggest strength of XGBoost is its speed. Whereas efficient use of random forest requires generating each tree in parallel on a cluster, and deep neural nets are usually run on GPUs, XGBoost can be run on a single CPU in less than a third of the wall-clock time of either of the other methods.

■ INTRODUCTION

Quantitative structure−activity relationships (QSAR) are a very commonly used technique in the pharmaceutical industry for predicting on-target and off-target activities. Such predictions help prioritize the experiments during the drug discovery process.

Higher prediction accuracy is always desirable, but there are practical constraints: in an industrial environment there may be a large number (dozens) of models trained on a very large number (>100 000) of compounds and a large number (several thousand) of substructure descriptors. These models may need to be updated frequently (say, every few months). Our general rule of thumb is that a QSAR method should be able to build a model from 300 000 molecules with 10 000 descriptors within 12 h of elapsed time, without manual intervention. QSAR methods that are particularly compute-intensive, or that require the adjustment of many sensitive parameters to achieve good prediction on an individual QSAR data set, are less desirable. Because of these constraints, only a small number of the many new machine learning algorithms are suitable for routine application in drug discovery.

Currently, the most commonly used methods are variations on random forest (RF)1,2 and support vector machine (SVM),3 which are among the most predictive and have few adjustable parameters. In particular, RF has been very popular since it was introduced as a QSAR method by Svetnik et al.2 Due to its high prediction accuracy, ease of use, and robustness to adjustable parameters, RF has been something of a "gold standard" to which other QSAR methods are compared.4 This is also true for non-QSAR types of machine learning.5 RF has been our workhorse QSAR method for many years.

In 2012, Merck sponsored a Kaggle competition (www.kaggle.com/c/MerckActivity) to examine how well state-of-the-art machine learning methods can perform on QSAR problems. From that exercise we became aware that neural nets had undergone a renaissance and that deep neural nets (DNN) had disruptively improved machine learning. We6 and other laboratories (reviewed in Gawehn et al.7) subsequently showed that DNNs are a practical method to apply to large QSAR problems and that DNNs make somewhat better predictions on the average than RF. One interesting point is that, although DNNs have many adjustable parameters, one can find a standard set of parameters suitable for most QSAR problems.

In the past this laboratory8 experimented with alternative recursive partitioning methods such as gradient boosting. Whereas RF builds an ensemble of independent recursive partitioning trees of unlimited depth, gradient boosting builds a sequential series of shallow trees, where each tree corrects for the residuals in the predictions made by all the previous trees. At the time we concluded that, while the gbm9 implementation of gradient boosting [https://cran.r-project.org/web/packages/gbm/gbm.pdf] provided prediction accuracy similar to that of RF, its large number of adjustable parameters made it harder to use.
Analogous to the situation with neural nets, the field of gradient boosting has progressed, and a new implementation by Chen and Guestrin called eXtreme Gradient Boosting (XGBoost)10 is readily available [https://github.com/dmlc/xgboost]. XGBoost builds on previous ideas in gradient boosting. Basically, the training is done using an "additive strategy": given a molecule i with a vector of descriptors $x_i$, a tree ensemble model uses K additive functions to predict the output

$$\hat{y}_i = \phi(x_i) = \sum_{k=1}^{K} f_k(x_i), \qquad f_k \in \mathcal{F} \tag{1}$$

Here $\mathcal{F}$ is the set of all possible regression trees. The $f_k$ function at each of the k steps maps the descriptor values in $x_i$ to a certain output. It is the function we need to learn, containing the structure of the tree and the leaf scores. However, there are minor improvements in the regularized objective which turn out to be helpful in practice. Specifically, XGBoost tries to minimize the regularized objective

$$\mathcal{L}(\phi) = \sum_i l(\hat{y}_i, y_i) + \sum_k \Omega(f_k) \tag{2}$$

where

$$\Omega(f) = \gamma T + \frac{1}{2}\lambda \lVert \omega \rVert^2$$

In the equations above, $l$ is a differentiable convex loss function that measures the difference between the prediction $\hat{y}_i$ and the target $y_i$ (in our case of QSAR, $l$ is simply the mean-square error). The second term penalizes the complexity of the model in terms of the number of leaves in the tree, T, and the vector of scores on the leaves, ω. It helps to smooth the final learned weights to avoid overfitting. We expect the regularized objective to tend to select a model employing simple and predictive functions.

Two other techniques aimed at reducing overfitting are shrinkage and column subsampling. After each step of boosting, the algorithm scales the newly added weights by a factor η. This reduces the influence of each tree and makes the model learn slowly and (hopefully) better (a toy sketch of this scheme appears at the end of this section). Column (feature) subsampling considers only a random subset of descriptors in building a given tree. This also speeds up the training process by reducing the number of descriptors to consider.

XGBoost also uses a sparsity-aware split-finding approach to train more efficiently on sparse data (the computational complexity is approximately linear in the number of nonmissing entries in the input). This is potentially very helpful for QSAR because, if one uses substructure-type chemical descriptors, the compound/descriptor table is very sparse (with most entries zero), and XGBoost is potentially very fast when trained on even the largest QSAR data set.

The impact of XGBoost has been widely recognized in a number of machine learning and data mining challenges. Among the 29 challenge-winning solutions published on Kaggle's blog during 2015, 17 used XGBoost. However, XGBoost has not yet been used extensively for QSAR, and this study applies XGBoost to 30 in-house QSAR data sets of a practical size. While XGBoost has a very large number of adjustable parameters, we can show that, for our problems, these parameters are not particularly sensitive and one can pick a standard set of parameters with which most QSAR problems achieve a level of prediction better than RF and almost as good as DNN. Most importantly, XGBoost takes much less compute effort than either of those two methods and produces much smaller models.
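To make the additive strategy concrete, the following is a toy sketch in Python (a simplification for illustration, not XGBoost's actual implementation): each shallow tree is fit to the residuals of the running prediction, and its contribution is damped by the shrinkage factor η before being added to the ensemble. The regularization term of eq 2, column subsampling, and sparsity-aware split finding are all omitted here.

```python
# Toy sketch of the additive strategy of eq 1 with shrinkage (eta).
# Not XGBoost's implementation; just the core boosting idea under
# squared-error loss, where the residual is the negative gradient.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def toy_boost(X, y, num_round=100, max_depth=3, eta=0.1):
    """Return a list of shallow trees; each is fit to the current residuals."""
    trees, pred = [], np.zeros(len(y))
    for _ in range(num_round):
        tree = DecisionTreeRegressor(max_depth=max_depth)
        tree.fit(X, y - pred)            # fit the residuals of the ensemble so far
        pred += eta * tree.predict(X)    # shrinkage: scale the new tree's weights by eta
        trees.append(tree)
    return trees

def toy_predict(trees, X, eta=0.1):
    # eq 1: the prediction is the (shrunken) sum of the K additive functions
    return eta * sum(tree.predict(X) for tree in trees)
```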
■ METHODS

Data Sets. Table 1 shows the in-house data sets used in this study, which are the same as those in the work of Ma et al.6 These data sets represent a mixture, relevant to pharmaceutical research, of on-target and off-target activities, activities both easy and hard to predict, and large and small data sets. The data sets are available in the Supporting Information, with the descriptors and molecule names disguised so that the original compounds cannot be reverse-engineered from the descriptors. The 15 data sets labeled "Kaggle" were used for the Kaggle competition.

Table 1. Data Sets for Prospective Prediction

| data set | type | description | no. of molecules (training + test) | no. of unique AP, DP descriptors | mean ± stdev activity |
|---|---|---|---|---|---|
| 2C8 | ADME | CYP P450 2C8 inhibition −log(IC50) M | 29958 | 11730 | 4.88 ± 0.66 |
| 2C9BIG | ADME | CYP P450 2C9 inhibition −log(IC50) M | 189670 | 9729 | 4.77 ± 0.59 |
| 2D6 | ADME | CYP P450 2D6 inhibition −log(IC50) M | 50000 | 9491 | 4.50 ± 0.46 |
| 3A4^a | ADME | CYP P450 3A4 inhibition −log(IC50) M | 50000 | 5242 | 4.69 ± 0.65 |
| A-II | Target | binding to Angiotensin-II receptor −log(IC50) M | 2763 | 6200 | 8.70 ± 2.72 |
| BACE | Target | inhibition of beta-secretase −log(IC50) M | 17469 | 8959 | 6.07 ± 1.40 |
| CAV | ADME | inhibition of Cav1.2 ion channel | 50000 | 5877 | 4.93 ± 0.45 |
| CB1^a | Target | binding to cannabinoid receptor 1 −log(IC50) M | 11640 | 6782 | 7.13 ± 1.21 |
| CLINT | ADME | clearance by human microsome log(clearance) μL/min/mg | 23292 | 5203 | 1.93 ± 0.58 |
| DPP4^a | Target | inhibition of dipeptidyl peptidase 4 −log(IC50) M | 8327 | 6596 | 6.28 ± 1.23 |
| ERK2 | Target | inhibition of ERK2 kinase −log(IC50) M | 12843 | 6136 | 4.38 ± 0.68 |
| FACTORXIA | Target | inhibition of factor XIa −log(IC50) M | 9536 | 9541 | 5.49 ± 0.97 |
| FASSIF | ADME | solubility in simulated gut conditions log(solubility) mol/L | 89531 | 9388 | −4.04 ± 0.39 |
| HERG | ADME | inhibition of hERG channel −log(IC50) M | 50000 | 12508 | 5.21 ± 0.78 |
| HERGBIG | ADME | inhibition of hERG ion channel −log(IC50) M | 318795 | 4306 | 5.07 ± 0.75 |
| HIVINT^a | Target | inhibition of HIV integrase in a cell based assay −log(IC50) M | 2421 | 6274 | 6.32 ± 0.56 |
| HIVPROT^a | Target | inhibition of HIV protease −log(IC50) M | 4311 | 8921 | 7.30 ± 1.71 |
| LOGD^a | ADME | logD measured by HPLC method | 50000 | 4595 | 2.70 ± 1.17 |
| METAB^a | ADME | percent remaining after 30 min microsomal incubation | 2092 | 8302 | 23.2 ± 33.9 |
| NAV | ADME | inhibition of Nav1.5 ion channel −log(IC50) M | 50000 | 5803 | 4.77 ± 0.40 |
| NK1^a | Target | inhibition of neurokinin1 (substance P) receptor binding −log(IC50) M | 13482 | 4730 | 8.28 ± 1.21 |
| OX1^a | Target | inhibition of orexin 1 receptor −log(Ki) M | 7135 | 5790 | 6.16 ± 1.22 |
| OX2^a | Target | inhibition of orexin 2 receptor −log(Ki) M | 14875 | 7713 | 7.25 ± 1.46 |
| PAPP | ADME | apparent passive permeability in PK1 cells log(permeability) cm/s | 30938 | 5135 | 1.35 ± 0.39 |
| PGP^a | ADME | transport by p-glycoprotein log(BA/AB) | 8603 | 5470 | 0.27 ± 0.53 |
| PPB^a | ADME | human plasma protein binding log(bound/unbound) | 11622 | 9282 | 1.51 ± 0.89 |
| PXR | ADME | induction of 3A4 by pregnane X receptor; percentage relative to rifampicin | 50000 | 5698 | 42.5 ± 42.1 |
| RAT_F^a | ADME | log(rat bioavailability) at 2 mg/kg | 7821 | 5945 | 1.43 ± 0.76 |
| TDI^a | ADME | time-dependent 3A4 inhibition log(IC50 without NADPH/IC50 with NADPH) | 5559 | 6924 | 0.37 ± 0.48 |
| THROMBIN^a | Target | inhibition of human thrombin −log(IC50) M | 8217 | 5552 | 6.67 ± 2.02 |

^a Kaggle data sets.

As before, we use in-house data sets because:
1. We wanted data sets that are realistically large, whose compound activity measurements have a realistic amount of experimental uncertainty, and which include a non-negligible amount of qualified data.
2. Time-split validation (see below), which we consider more realistic than any random cross-validation, requires the dates of testing, and these are almost impossible to find in public domain data sets.

A number of these data sets contain significant amounts of qualified data. For example, one might know only that the measured IC50 is greater than 30 μM because 30 μM was the highest concentration titrated. For the purposes of model building those activities are treated as fixed numbers, because most off-the-shelf QSAR methods handle only fixed numbers. For example, IC50 > 30 μM is set to IC50 = 30 × 10−6 M, i.e., −log(IC50) = 4.5. Our experience is that it is best to keep such qualified data in QSAR training sets; otherwise, less active compounds are often predicted to be more active than they really are.

In order to compare the predictive ability of QSAR methods, each of these data sets was split into two nonoverlapping subsets: a training set and a test set. Although the usual way of making the split is by random selection, i.e., "split at random", in actual practice in a pharmaceutical environment QSAR models are applied "prospectively". That is, predictions are made for compounds not yet tested in the appropriate assay, and these compounds may or may not have analogs in the training set. The best way of simulating this is to generate training and test sets by "time-split": for each data set, the first 75% of the molecules assayed for the particular activity form the training set, while the remaining 25% of the compounds assayed later form the test set. We have found that, for regressions, R2 from time-split validation better estimates the R2 for true prospective prediction than R2 from any "split at random" scheme.11 Since training and test sets are not selected from the same pool of compounds, the descriptor distributions in these two subsets are frequently not the same as, or even similar to, each other.
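To make this preparation concrete, here is a minimal sketch of the two steps just described: converting qualified activities to fixed numbers and making the 75/25 time-split. The pandas layout and the column names (`ic50_molar`, `assay_date`) are our own illustrative assumptions, not the authors' actual pipeline.

```python
# Sketch of qualified-data conversion and time-split partitioning
# (illustrative column names; assumed pandas layout).
import numpy as np
import pandas as pd

def prepare(df: pd.DataFrame):
    df = df.copy()
    # Qualified data: an "IC50 > 30 uM" result is kept as the fixed value
    # 30e-6 M, i.e. -log10(30e-6) = 4.5 on the -log(IC50) scale.
    df["pIC50"] = -np.log10(df["ic50_molar"].clip(upper=30e-6))

    # Time-split: the first 75% of molecules assayed form the training set;
    # the 25% assayed later form the test set.
    df = df.sort_values("assay_date")
    n_train = int(0.75 * len(df))
    return df.iloc[:n_train], df.iloc[n_train:]
```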
QSAR Descriptors. Each molecule is represented by a list of features, i.e., "descriptors" in QSAR nomenclature. In this paper, we use a set of descriptors that is the union of AP, the original "atom pair" descriptor from Carhart et al.,12 and DP descriptors ("donor−acceptor pair"), called BP in the work of Kearsley et al.13 Both descriptors are of the form

atom type i − (distance in bonds) − atom type j

For AP, the atom type includes the element, the number of nonhydrogen neighbors, and the number of pi electrons; it is very specific. For DP, the atom type is one of seven (cation, anion, neutral donor, neutral acceptor, polar, hydrophobe, and other); it gives a more generic description of the chemistry.

Random Forest. RF is an ensemble recursive partitioning method in which each recursive partitioning "tree" is generated from a bootstrapped sample of compounds, and a random subset of descriptors is used at the branching of each node in the tree. RF naturally handles correlation between descriptors and does not need a separate descriptor selection procedure to obtain good performance. Importantly, while there are a handful of adjustable parameters (e.g., number of trees, fraction of descriptors used at each branching, node size, etc.), the quality of predictions is generally insensitive to changes in these parameters. The version of RF we are using is a modification of the original FORTRAN code from Breiman,1 which is built for regression. It has been parallelized to run one tree per processor on a cluster; such parallelization is necessary to run some of our larger data sets in a reasonable time. For all RF models, we generate 100 trees with m/3 descriptors used at each branch point, where m is the number of unique descriptors in the training set. Tree nodes with five or fewer molecules are not split further. We apply these parameters to every data set. (An equivalent configuration in a common open-source implementation is sketched below, after the DNN description.)

Deep Neural Nets. Our Python-based DNN code is the one obtained through the Kaggle competition from Dahl,14 then at the University of Toronto, and modified by us. The DNN results we present are for single-task regression models using the "standard parameters" in Ma et al.,6 which are applied to all data sets:
1. Four hidden layers with 4000, 2000, 1000, and 1000 neurons.
2. Dropout rates of 0% in the input layer, 25% in the first three hidden layers, and 10% in the fourth hidden layer.
3. Logarithmic transformation of the descriptors; ReLU activation.
4. Minibatch size of 125 and 300 training epochs.
5. No unsupervised pretraining.

For timing purposes we also implemented a simplified ("quick") version of the DNN, which achieves almost identical prediction accuracy to the standard parameters but uses a smaller neural net:
1. Two intermediate layers of 1000 and 500 neurons with 25% dropout.
2. 75 training epochs.
It is otherwise identical to the standard parameters.
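The authors' RF is a parallelized modification of Breiman's FORTRAN code, but the settings stated above map directly onto common open-source implementations. The following is a sketch assuming scikit-learn, not the authors' actual code:

```python
# Sketch of the stated RF settings (100 trees, m/3 descriptors tried per
# branch point, nodes of five or fewer molecules not split further) using
# scikit-learn in place of the authors' parallelized FORTRAN code.
from sklearn.ensemble import RandomForestRegressor

rf = RandomForestRegressor(
    n_estimators=100,      # 100 trees per model
    max_features=1 / 3,    # m/3 descriptors considered at each branch point
    min_samples_split=6,   # nodes with <=5 molecules are not split further
    n_jobs=-1,             # trees in parallel, analogous to the cluster setup
)
# rf.fit(X_train, y_train); y_pred = rf.predict(X_test)
```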
XGBoost. The implementation of XGBoost we use is the C++ version runnable on Linux. There are several dozen adjustable parameters; the full set is documented at https://github.com/dmlc/xgboost/blob/master/doc/parameter.md. The ones we examined in detail are those that could plausibly deviate from their defaults for QSAR problems:

ETA (step size shrinkage). This controls the weights of subsequent trees. Default = 0.3.
MAX_DEPTH (maximum depth of a tree). Default = 6.
NUM_ROUND (maximum number of trees). Default = 10.
COLSAMPLE_BYTREE (the fraction of descriptors examined for each tree). Default = 1.

Metrics. In this paper, the metric used to evaluate prediction performance is R2, the squared Pearson correlation coefficient between the predicted and observed activities in the test set. The same metric was used in the Kaggle competition and in our previous work. R2 measures the degree of concordance between the predictions and the corresponding observations. This value is especially relevant when the whole range of activities is included in the test set. R2 is an attractive measurement for model comparison across many data sets because it is unitless and ranges from 0 to 1 for all data sets. The relative predictivity of the three methods we examine does not change if we use alternative metrics such as Q2 or RMSE.
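Because several different quantities are called R2 in the QSAR literature, the definition used here can be made explicit in a few lines:

```python
# The R^2 used in this paper: the squared Pearson correlation coefficient
# between observed and predicted test-set activities.
import numpy as np

def r_squared(observed, predicted):
    r = np.corrcoef(observed, predicted)[0, 1]  # Pearson correlation
    return r * r
```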
Workflow for XGBoost. The goal is to identify the few most important parameters and/or a set of parameter values that would be useful for most QSAR data sets. We found early on that a small value of ETA (0.05) was useful in keeping the models growing slowly, and therefore steadily, so we kept ETA constant. A smaller ETA reduces the influence of each individual tree and leaves room for future trees to improve the model. The use of slow growth in boosting was suggested by Friedman.15 We constructed sets of parameters in the following three ways:

1. TESTOPT. Find the values of MAX_DEPTH, NUM_ROUND, and COLSAMPLE_BYTREE that, used in the model generated from the training set, give the highest R2 on the test set. This is done by a grid search over the values MAX_DEPTH = (4, 5, 6, 7, 8, 9, 10), NUM_ROUND = (250, 500, 750, 1000), and COLSAMPLE_BYTREE = (0.50, 0.75, 1.0). There are more efficient search methods, but thanks to XGBoost's speed a grid search turned out to be the most straightforward way to find an optimum quickly. It should be noted that TESTOPT does not reflect the real-life situation: in reality, we would not know the activity values of the test set in advance. However, it gives us an upper limit on the test-set R2 we can expect from optimizing these parameters, and it is interesting to know what parameter values we should use if we had prior knowledge. TESTOPT finds a different set of optimum parameters for each data set.

2. TRAINOPT. Find the optimal values of MAX_DEPTH, NUM_ROUND, and COLSAMPLE_BYTREE, again by grid search, by cross-validation of each training set. That is, split the training set in half at random, make a model from the first half using the candidate parameters, and then predict the remaining half. The set of parameters that gives the highest R2 for the prediction of the second half of the training set is used to generate a model from the entire training set, and that model is used to predict the test set. This is more representative of the real-life situation because we are optimizing only on the training set. However, since the training set is not necessarily like the test set, optimizing on the training set may not be useful for getting the maximum R2 on the test set. TRAINOPT also finds a different set of optimum parameters for each data set. (A sketch of this grid search is given after this list.)

3. STANDARD. The goal is to find a common value of MAX_DEPTH, NUM_ROUND, and COLSAMPLE_BYTREE to be used for all data sets. The most straightforward way of generating such a standard set is to take the mean optimum values of MAX_DEPTH, NUM_ROUND, and COLSAMPLE_BYTREE over all data sets from TESTOPT, TRAINOPT, or both.
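For illustration, the TRAINOPT grid search could be written with the xgboost Python package as follows. The half/half split, the parameter grids, and the fixed ETA = 0.05 come from the text above; the API calls (xgb.DMatrix, xgb.train) and the reg:squarederror objective reflect current xgboost releases rather than the 2016 C++ build the authors actually used.

```python
# Sketch of the TRAINOPT grid search using the xgboost Python API
# (the authors used the C++ command-line version; this is illustrative).
import itertools
import numpy as np
import xgboost as xgb

def trainopt(X_train, y_train, seed=0):
    # Split the training set in half at random.
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(y_train))
    half = len(idx) // 2
    a, b = idx[:half], idx[half:]
    d_fit = xgb.DMatrix(X_train[a], label=y_train[a])
    d_val = xgb.DMatrix(X_train[b])

    best_r2, best_params = -np.inf, None
    grid = itertools.product([4, 5, 6, 7, 8, 9, 10],   # MAX_DEPTH
                             [250, 500, 750, 1000],    # NUM_ROUND
                             [0.50, 0.75, 1.0])        # COLSAMPLE_BYTREE
    for depth, rounds, colsample in grid:
        params = {"eta": 0.05,                         # fixed small shrinkage
                  "max_depth": depth,
                  "colsample_bytree": colsample,
                  "objective": "reg:squarederror"}
        model = xgb.train(params, d_fit, num_boost_round=rounds)
        pred = model.predict(d_val)
        r2 = np.corrcoef(y_train[b], pred)[0, 1] ** 2
        if r2 > best_r2:
            best_r2, best_params = r2, (depth, rounds, colsample)
    return best_params  # then refit on the entire training set with these values
```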
Timing. The three methods we are comparing, RF, DNN, and XGBoost, run on different machine architectures and/or in different modes in our environment:
1. RF runs as 100 jobs (one for each tree) in parallel on a cluster. The cluster nodes are HP ProLiant BL460c Gen8 server blades, each equipped with two 8-core Intel Xeon E5-2670 processors at 2.60 GHz and 256 GB of random access memory (RAM).
2. XGBoost runs on a single node of the above cluster with multiple threads (NTHREAD = 4).
3. DNN runs on a single NVIDIA Tesla C2070 GPU.

It is hard to compare precisely the total computational expense of each method in our mixed environment, but some estimation is possible. By running DNN on a cluster CPU vs a GPU, we know that the GPU is 10-fold faster than a CPU. We also know the total time for an RF model is 100 times the time to generate one tree on one cluster node. Perhaps most relevant to users of these methods, however, is the wall-clock time elapsed between when a model calculation is initiated and when the model is written to disk and available to predict new compounds. Estimating the wall-clock time for RF is problematic because the 100 parallel jobs may not finish simultaneously: one or two may be delayed unpredictably, and the time between the launch of the jobs and the completion of all of them depends too much on the system load. A more robust, albeit slightly optimistic, measure is the time by which the first 95 jobs (i.e., trees) are completed. Included in the wall-clock time is the time to read through the descriptor file and create the molecule/descriptor table.

Model Size. Another interesting quantity is the size of the model file, in that the speed of prediction is sometimes limited in practice by the time taken to read the model into memory or to copy the model from an archive to the computer doing the predicting. Here we note the size of the (binary) files comprising the model. In the case of RF, we multiply the size of a single tree by 100. In practice the size of a model file can vary depending on the particular implementation of the QSAR method, but looking at the size gives a rough idea of the relative complexity of models from the respective methods.

■ RESULTS

Optimizations and Standard Parameters for XGBoost. The optimum sets of parameters for the individual data sets and the R2 for each type of optimization are in the Supporting Information. The difference between the best and worst R2 in the grid search is 0.07 ± 0.03 over all 30 data sets, and this is true for both TESTOPT and TRAINOPT. That is, prediction accuracy is not particularly sensitive to the parameters we examined.

Mean optimum parameter values are shown in Table 2. While the optimum values of the parameters do not correlate well between TESTOPT and TRAINOPT for individual data sets (not shown), the mean optimum parameter values for TESTOPT and TRAINOPT are not far apart relative to the overall range of each parameter. There is also not much difference between the Kaggle and non-Kaggle subsets. These observations suggest it is possible to construct a set of standard parameters usable for most data sets. We averaged over both TESTOPT and TRAINOPT to obtain the standard parameters MAX_DEPTH = 7 (the nearest integer to 7.4), NUM_ROUND = 745, and COLSAMPLE_BYTREE = 0.56. Predictions using these values are called XGBoost_STANDARD.

Table 2. Mean Optimal Values for Three Parameters

| optimization, subset | MAX_DEPTH | NUM_ROUND | COLSAMPLE_BYTREE |
|---|---|---|---|
| TESTOPT Kaggle | 6.1 | 700 | 0.60 |
| TESTOPT non-Kaggle | 8.1 | 783 | 0.60 |
| TESTOPT ALL | 7.1 | 741 | 0.60 |
| TRAINOPT Kaggle | 6.8 | 683 | 0.55 |
| TRAINOPT non-Kaggle | 8.5 | 816 | 0.52 |
| TRAINOPT ALL | 7.7 | 750 | 0.53 |
| TESTOPT + TRAINOPT ALL | 7.4 | 745 | 0.56 |

In TRAINOPT and TESTOPT there is a weak relationship between the optimum MAX_DEPTH and the optimum NUM_ROUND vs Ntraining, the number of molecules in the training set: smaller data sets tend to prefer smaller MAX_DEPTH and smaller NUM_ROUND. It therefore might be possible to guess a good MAX_DEPTH and NUM_ROUND for an individual data set based only on Ntraining. However, we will show below that the STANDARD parameters already give predictions as good as the TRAINOPT grid search, so this type of refinement is not likely to be helpful overall.
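Expressed in terms of the same Python API (again an illustration; the authors used the C++ version), training one data set with the XGBoost_STANDARD parameters derived above amounts to the few lines below.

```python
# Sketch: fitting one data set with the XGBoost_STANDARD parameters,
# with ETA fixed at 0.05 as in the workflow section and NTHREAD = 4
# as in the timing section.
import xgboost as xgb

STANDARD = {"eta": 0.05, "max_depth": 7, "colsample_bytree": 0.56,
            "objective": "reg:squarederror", "nthread": 4}

def fit_standard(X_train, y_train):
    dtrain = xgb.DMatrix(X_train, label=y_train)
    return xgb.train(STANDARD, dtrain, num_boost_round=745)  # NUM_ROUND = 745
```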
Accuracy of Prediction for the QSAR Methods. Figure 1 (top) shows the R2 for prediction of the test set for standard DNN, XGBoost_TRAINOPT, and XGBoost_STANDARD vs the R2 for RF, which we take as the baseline method. On the average, the XGBoost results are better than those of RF (i.e., the red and green best-fit lines are above the dashed diagonal line) and nearly as good as those of DNN (indicated by the blue best-fit line). The improvement of XGBoost over RF tends to become smaller at higher R2. These trends are made more visible by subtracting out the RF value, as shown in Figure 1 (bottom).

Figure 1. Prediction accuracy on the test set for XGBoost and deep neural nets vs the prediction accuracy of random forest. Two different types of XGBoost parameters are shown, one with the parameters optimized for the individual training sets (green) and one using a standard set of parameters for all data sets (red). (top) Absolute R2. (bottom) R2 minus the R2 for random forest. In both plots the dashed line indicates the R2 for random forest.

That there is no systematic difference between XGBoost_TRAINOPT (green) and XGBoost_STANDARD (red) is seen in the superposition of the green and red lines in Figure 1. That is, optimizing the parameters for each individual training set in TRAINOPT does not result in better predictions on the test set than are obtained by simply using the STANDARD set of parameters for all data sets. The visual impression in Figure 1 is consistent with the mean R2 over the 30 data sets: RF = 0.39, DNN = 0.43, XGBoost_TRAINOPT = 0.42, XGBoost_STANDARD = 0.42. One should note these are only averages; any of the three methods may do best on a particular data set. We do not show XGBoost_TESTOPT in Figure 1 because it is, as previously mentioned, expected to be unrealistically optimistic; however, we note that XGBoost_TESTOPT achieves only slightly better R2 than XGBoost_TRAINOPT (mean R2 = 0.44).
Timing. The total compute effort and wall-clock time for the different methods are shown in Figure 2 as a function of Ntraining. The log−log plot is the one in which all methods show maximally linear correlations and the range of timings can be appreciated. The total compute effort for DNN and XGBoost is roughly linear in Ntraining. As expected, the DNN using fewer neurons ("quick") requires less computation than the standard DNN. The total compute effort for RF rises roughly as the square of Ntraining. Clearly, XGBoost is the smallest by an order of magnitude in total compute effort (top) and somewhat faster in wall-clock time (bottom), at least for the larger data sets. The QSAR methods tend to converge in wall-clock time for small data sets; for them, reading the descriptor files and forming the molecule/descriptor table, which is shared by all methods, becomes the rate-limiting step.

Figure 2. Total computational effort (top) and wall-clock time (bottom) for random forest, deep neural nets, and XGBoost. The computational effort is expressed in units of hours on a single cluster node. For this plot, all XGBoost models were made with NUM_ROUND = 1000, which takes the maximum compute time.

Model Size. The total model file size (in megabytes) is shown in Figure 3 as a function of Ntraining. Again, the log−log plot is the one in which all methods show maximally linear correlations and the range of model file sizes can be appreciated. The size of DNN models is expected to depend on the total number of neurons. The number of neurons in the lowest layer depends on the number of descriptors, which varies roughly as log(Ntraining), and the number of neurons in the intermediate layers depends on the number of intermediate layers and the number of neurons per layer set by the user. Effectively, the dependence of size is approximately log(Ntraining). We would expect networks with fewer layers and fewer neurons per layer (the quick DNN) to produce smaller models than the original standard DNN, and they do. The size of XGBoost models might be expected to be almost constant with Ntraining because there is a fixed number of trees with a small maximum depth, both set by the user. However, there is a small dependence of size roughly on log(Ntraining), which probably reflects the fact that larger data sets have more trees closer to the maximum depth. In contrast, the number of nodes in an unpruned recursive partitioning tree should be linear in Ntraining, and we see this for RF. The model sizes of XGBoost are much smaller than those of the other methods for a given value of Ntraining, except RF at very small values of Ntraining.

Figure 3. Total model file size for the QSAR methods. Here the XGBoost models are made with the standard parameters.

■ DISCUSSION

XGBoost appears to be a very effective and efficient machine-learning method, and here we have demonstrated that this is true in the realm of QSAR. It is effective in the sense that it provides prospective predictions at least as good on the average as those of RF, a gold-standard method, and nearly as good as those of single-task DNN. It is efficient because it achieves these results with much less computational effort than either of those methods and produces much smaller models. Having XGBoost means we can potentially handle many more and larger data sets and/or update them more frequently than we have previously, given our current compute environment.

The potential difficulty of XGBoost having multiple adjustable parameters turns out, in practice, not to be a real issue for QSAR, because we can identify standard values for at least some of the parameters. We have shown that these standard parameters can be used effectively with a large number of QSAR data sets; it is not necessary to optimize the parameters for each individual data set. Since we did not examine every parameter, it is possible that we could get results even better than we show here. We hasten to point out that our standard parameters apply only to QSAR problems in our environment; we can say nothing about non-QSAR applications of XGBoost.
We should point out, apropos of the above paragraph, that our work brings out an issue with any kind of QSAR model-building, which we have noted before16 in another context. It is a general assumption in QSAR, and in machine learning in general, that the molecules to be predicted (the test set) are similar enough to the training set that maximizing the cross-validated predictions on the training set (by using different descriptors, tweaking adjustable parameters, etc.) is equivalent to maximizing predictivity on the test set. In practice, the training and test sets may be different enough that this is not true.

Despite the excellent performance of XGBoost, there are some issues that should be examined further. Recursive partitioning methods make predictions based on the average observed activities of the molecules at their terminal nodes. This has the effect of compressing the range of predictions relative to the observed activities. For RF we routinely do "prediction rescaling",16 in which the self-fit predicted activities for the training set of a particular model are linearly scaled to match the observed activities, and this scaling is applied to further predictions from that model. This does not affect the R2 of prediction but does improve the numerical match of predicted and observed activities at the highest and lowest ranges of activity. We have found that XGBoost also benefits from prediction rescaling (a sketch is given at the end of this section).

One disadvantage of XGBoost (and of DNN) is that it does not produce the domain-applicability metric TREE_SD (the variation in prediction among an ensemble), which is very useful in estimating the error bar on a QSAR prediction17 and which comes "for free" with RF. Another potential issue is that boosting methods like gbm9 appear much more sensitive to artificial noise added to the activity data18 than other methods, in the sense that the R2 of prediction on a test set falls more quickly with a given amount of noise added to the training set. We can confirm that XGBoost is more sensitive to artificial noise than RF, but about equivalent in sensitivity to DNN. However, it is not clear how relevant sensitivity to added noise is to the usefulness of a QSAR method. Clearly, XGBoost is better than RF in the limit of no added noise.
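As a sketch of the prediction rescaling described above (our own minimal formulation, assuming an ordinary least-squares line relating the self-fit predictions to the observed activities):

```python
# Sketch of "prediction rescaling": fit a line mapping a model's self-fit
# predictions on its training set to the observed activities, then apply
# that same line to all subsequent predictions from the model. A linear
# rescaling leaves the Pearson R^2 unchanged, consistent with the text.
import numpy as np

def fit_rescaler(self_fit_pred, observed):
    slope, intercept = np.polyfit(self_fit_pred, observed, deg=1)
    return lambda pred: slope * np.asarray(pred) + intercept

# rescale = fit_rescaler(model.predict(dtrain), y_train)
# y_test_pred = rescale(model.predict(dtest))
```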
■ ASSOCIATED CONTENT

Supporting Information. The Supporting Information is available free of charge on the ACS Publications website at DOI: 10.1021/acs.jcim.6b00591. Table of R2 results for RF, DNN, and XGBoost, and similar tables for timing and model size (TXT). Training and test data for all 30 data sets, with disguised molecule names and descriptors; these data sets are suitable for comparing the relative goodness of different QSAR methods given the AP, DP descriptors (ZIP).

■ AUTHOR INFORMATION

Corresponding Author
*E-mail: sheridan@merck.com.
ORCID
Robert P. Sheridan: 0000-0002-6549-1635
Notes
The authors declare no competing financial interest.

■ ACKNOWLEDGMENTS

W.M.W. and E.M.G. would like to thank Roy Goh and Ivan Clement, Data Sciences, MSD Singapore, for their discussions and support of this study. Dr. Matt Tudor suggested we compare model file sizes.

■ REFERENCES

(1) Breiman, L. Random Forests. Machine Learning 2001, 45, 5−32.
(2) Svetnik, V.; Liaw, A.; Tong, C.; Culberson, J. C.; Sheridan, R. P.; Feuston, B. P. Random Forest: a Classification and Regression Tool for Compound Classification and QSAR Modeling. J. Chem. Inf. Comput. Sci. 2003, 43, 1947−1958.
(3) Cortes, C.; Vapnik, V. N. Support-Vector Networks. Machine Learning 1995, 20, 273−297.
(4) Bruce, C. L.; Melville, J. L.; Pickett, S. D.; Hirst, J. D. Contemporary QSAR Classifiers Compared. J. Chem. Inf. Model. 2007, 47, 219−227.
(5) Fernandez-Delgado, M.; Cernades, E.; Barro, S.; Amorim, D. Do We Need Hundreds of Classifiers to Solve Real World Problems? J. Mach. Learn. Res. 2014, 15, 3133−3181.
(6) Ma, J.; Sheridan, R. P.; Liaw, A.; Dahl, G. E.; Svetnik, V. Deep Neural Nets as a Method for Quantitative Structure−Activity Relationships. J. Chem. Inf. Model. 2015, 55, 263−274.
(7) Gawehn, E.; Hiss, J. A.; Schneider, G. Deep Learning in Drug Discovery. Mol. Inf. 2016, 35, 3−14.
(8) Svetnik, V.; Wang, T.; Tong, C.; Liaw, A.; Sheridan, R. P.; Song, Q. Boosting: an Ensemble Learning Tool for Compound Classification and QSAR Modeling. J. Chem. Inf. Model. 2005, 45, 786−799.
(9) Friedman, J. H. Stochastic Gradient Boosting. Computational Statistics and Data Analysis 2002, 38, 367−378.
(10) Chen, T.; Guestrin, C. XGBoost: a Scalable Tree Boosting System. arXiv:1603.02754v3, 2016.
(11) Sheridan, R. P. Time-Split Cross-Validation as a Method for Estimating the Goodness of Prospective Prediction. J. Chem. Inf. Model. 2013, 53, 783−790.
(12) Carhart, R. E.; Smith, D. H.; Venkataraghavan, R. Atom Pairs as Molecular Features in Structure−Activity Studies: Definition and Application. J. Chem. Inf. Comput. Sci. 1985, 25, 64−73.
(13) Kearsley, S. K.; Sallamack, S.; Fluder, E. M.; Andose, J. D.; Mosley, R. T.; Sheridan, R. P. Chemical Similarity Using Physiochemical Property Descriptors. J. Chem. Inf. Comput. Sci. 1996, 36, 118−127.
(14) Dahl, G. E.; Jaitly, N.; Salakhutdinov, R. Multi-Task Neural Networks for QSAR Predictions. arXiv:1406.1231 [stat.ML], 2014; http://arxiv.org/abs/1406.1231.
(15) Friedman, J. H. Greedy Function Approximation: a Gradient Boosting Machine. Annals of Statistics 2001, 29, 1189−1232.
(16) Sheridan, R. P. Global Quantitative Structure−Activity Relationship Models vs Selected Local Models as Predictors of Off-Target Activities for Project Compounds. J. Chem. Inf. Model. 2014, 54, 1083−1092.
(17) Sheridan, R. P. Using Random Forest to Model the Domain Applicability of Another Random Forest Model. J. Chem. Inf. Model. 2013, 53, 2837−2850.
(18) Cortes-Ciriano, I.; Bender, A.; Malliavin, T. E. Comparing the Influence of Simulated Errors on 12 Machine Learning Algorithms in Bioactivity Modeling Using 12 Diverse Data Sets. J. Chem. Inf. Model. 2015, 55, 1413−1425.