Comparative study of biodegradability prediction of chemicals using decision trees, functional trees, and logistic regression

Guangchao Chen†‡, Xuehua Li*†, Jingwen Chen†, Ya-nan Zhang†, Willie J.G.M. Peijnenburg‡§

†Key Laboratory of Industrial Ecology and Environmental Engineering (MOE), School of Environmental Science and Technology, Dalian University of Technology, Linggong Road 2, Dalian 116024, China
‡Institute of Environmental Sciences (CML), Leiden University, 2300 RA Leiden, The Netherlands
§Center for Safety of Products and Substances, National Institute of Public Health and the Environment, Bilthoven, The Netherlands

Table S1. An illustration of data sampling when the ratio of readily biodegradable class (R) to non-biodegradable class (N) is 1:4
Table S2. The overall predictive accuracy, SE, SP, weighted average of SE and SP for the training and test set when class distribution of the training set (R:N) is 1:4
Figure S1. Effects of the training-set class distribution on model performance
Table S3. Details of the training and test sets used for model construction
Table S4. Performance comparison on test set II of our models with the EPI suite Biowin 5 and 6, and the CHAID-SVM and PubChemFP-SVM models
Table S5. Statistical significances of the descriptor distributions between non-biodegradable and readily biodegradable classes
Equation S1. Logistic model built on 13 molecular descriptors
Table S6. Information on number of leaves and the size of C4.5 decision tree

*Correspondence to: Xuehua Li, e-mail: lixuehua@dlut.edu.cn, Tel/Fax: +86-411-8470 7844.

The effect of training-set class distribution on model performance

We evaluated the effect of the class distribution of readily biodegradable and non-biodegradable chemicals on the statistical performance of the models. A total of 7 class-distribution ratios (readily biodegradable to non-biodegradable class: 1:4, 2:4, 3:4, 4:4, 4:3, 4:2, 4:1) were investigated.
For this purpose, a data set covering biodegradability information on 470 chain-like chemicals was retrieved from the Japanese NITE data set, of which non-biodegradable compounds account for 50% (235 chemicals). To reduce the chance that the results reflect one particular split, the data were sampled by enumerating combinations of subsets. The chemicals of the readily biodegradable and the non-biodegradable class were each randomly divided into five equal groups, labeled Ri and Ni (i = 1-5), with 47 compounds in each group. Subsets R1 and N1 were always kept in the test set to ensure that both classes were present in both the training and the test sets. Taking the ratio 1:4 (readily biodegradable to non-biodegradable class, R:N) as an example, each of the subsets Ri (i = 2-5) was in turn combined with N2, N3, N4 and N5 to form a training set, which yields four models. The remaining data were used for external validation (see Table S1). Similarly, when R:N equals 2:4, two of the subsets Ri (i = 2-5) were selected and combined with the subsets Ni (i = 2-5), generating 6 models. The constitutional descriptors and functional groups in the DRAGON software were used to build the FT models. The effect of class distribution on model performance was evaluated by two average statistics, i.e., the average of the overall predictive accuracies and the weighted average of SE (and SP) between the training and test sets. Evaluation statistics for R:N = 1:4 are listed in Table S2 as an example. The effects of class distribution on model performance are shown in Figure S1. It can be concluded that the predictive accuracy for the minority class is lower than that for the majority class, and that models built on balanced training data outperform models constructed on data with imbalanced class distributions.
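The subset-combination sampling described above can be sketched in a few lines of Python. This is only an illustration, not the authors' code: the function name `enumerate_splits` is hypothetical, and it is assumed that the ratios 4:3, 4:2 and 4:1 are built by the mirrored rule (all of R2-R5 combined with a selection of the N subsets), which the text states explicitly only for 1:4 and 2:4.

```python
from itertools import combinations

# R1 and N1 are always held out, so only subsets 2-5 can enter a training set.
R_POOL = ["R2", "R3", "R4", "R5"]
N_POOL = ["N2", "N3", "N4", "N5"]
ALL_SUBSETS = ["R1"] + R_POOL + ["N1"] + N_POOL


def enumerate_splits(n_r, n_n):
    """Return every (training, test) assignment for an R:N ratio of n_r:n_n.

    Hypothetical helper: picks n_r of the R subsets and n_n of the N subsets
    for training; every subset not picked goes to the test set.
    """
    splits = []
    for r_train in combinations(R_POOL, n_r):
        for n_train in combinations(N_POOL, n_n):
            training = list(r_train) + list(n_train)
            test = [s for s in ALL_SUBSETS if s not in training]
            splits.append((training, test))
    return splits


# Number of models per ratio (1:4 and 2:4 are stated in the text; the
# remaining counts follow from the assumed mirrored rule).
for n_r, n_n in [(1, 4), (2, 4), (3, 4), (4, 4), (4, 3), (4, 2), (4, 1)]:
    print(f"R:N = {n_r}:{n_n} -> {len(enumerate_splits(n_r, n_n))} models")
```

For R:N = 1:4 the first split returned is training on R2, N2, N3, N4, N5 and testing on R1, R3, R4, R5, N1, which corresponds to Model No. 1 in Table S1.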
Therefore, a rational choice of the class distribution (i.e., the ratio of readily biodegradable to non-biodegradable chemicals) of the training data is essential for the development of biodegradation models. Based on these results, a balanced class distribution for the training data was chosen for subsequent model construction.

Table S1. An illustration of data sampling when the ratio of readily biodegradable class (R) to non-biodegradable class (N) is 1:4

Model   R1    R2        R3        R4        R5        N1    N2        N3        N4        N5
No. 1   Test  Training  Test      Test      Test      Test  Training  Training  Training  Training
No. 2   Test  Test      Training  Test      Test      Test  Training  Training  Training  Training
No. 3   Test  Test      Test      Training  Test      Test  Training  Training  Training  Training
No. 4   Test  Test      Test      Test      Training  Test  Training  Training  Training  Training

Method of dividing training and test sets

In this paper one training set and two test sets (I and II) were used for building and validating the models. In total, 1629 compounds were collected from the three sources. The 27 compounds from the external validation set of Cheng et al. (J. Chem. Inf. Model. 2012, 52:655-669) were used as test set II. The remaining 1602 compounds were divided into the training set (825 compounds) and test set I (777 compounds) by the following procedure:
(1) The number of non-biodegradable (N) compounds was 968 and the number of readily biodegradable (R) compounds was 634.
(2) To obtain a dataset with an N:R ratio of around 1:1, 644 of the 968 non-biodegradable compounds (roughly two thirds) were randomly selected. These 644 compounds together with the 634 R compounds formed a dataset of 1278 compounds.
(3) The dataset of 1278 compounds was randomly divided at a ratio of approximately 2:1, giving datasets of 825 compounds (the training set) and 453 compounds.
To make the best use of the collected data, the 453 compounds from step 3 together with the 324 (968 - 644) N compounds left over in step 2 were used as test set I (777 compounds).

Table S2.
The overall predictive accuracy, SE, SP, and weighted average of SE and SP for the training and test sets when the class distribution of the training set (R:N) is 1:4

          Training set                     Test set                     Weighted average
          Qtraining (%)  SE (%)  SP (%)    Qtest (%)  SE (%)  SP (%)    SE (%)  SP (%)
Model 1   83.8           46.8    93.1      62.1       53.7    95.7      52.3    93.6
Model 2   82.1           36.2    93.6      40.0       26.6    93.6      28.5    93.6
Model 3   84.3           55.3    91.5      64.3       56.9    93.6      56.6    91.9
Model 4   76.2           23.4    89.4      58.3       50.5    89.4      45.1    89.4
Average   81.6           -       -         56.2       -       -         45.6    92.1

Figure S1. Effects of the training-set class distribution on model performance. (a) Variation of the predictive accuracy of the models for readily biodegradable and non-biodegradable chemicals; (b) Variation of the predictive accuracy of the models for the training and test sets.

Table S3. Details of the training and test sets used for model construction

                        Training set  Test set I  Test set II
Readily biodegradable   409           225         4
Non-biodegradable       416           552         23
Total                   825           777         27

Table S4. Performance comparison on test set II of our models with the EPI suite Biowin 5 and 6, and the CHAID-SVM and PubChemFP-SVM models

Model                 TP  FN  TN  FP  SE (%)  SP (%)  Q (%)  AUC
C4.5 decision tree    4   0   17  6   100.0   73.9    77.8   0.793
FT-Inner              4   0   23  0   100.0   100.0   100.0  1.00
Logistic regression   4   0   20  3   100.0   87.0    88.9   0.924
Biowin 5              3   1   20  3   75.0    87.0    85.2   -
Biowin 6              2   2   21  2   50.0    91.3    85.2   -
CHAID-SVM             4   0   22  1   100.0   95.7    96.3   0.978
PubChemFP-SVM         4   0   23  0   100.0   100.0   100.0  1.00

SE (sensitivity) = TP/(TP + FN); SP (specificity) = TN/(TN + FP); Q (overall predictive accuracy) = (TP + TN)/(TP + FP + TN + FN); AUC: area under the curve; TP: true positives; FP: false positives; TN: true negatives; FN: false negatives.

Table S5.
Statistical significances of the descriptor distributions between non-biodegradable and readily biodegradable classes

Descriptor  Significance
nCIC        < 0.001
nN          < 0.001
nS          < 0.001
nX          < 0.001
SRW10       < 0.001
ATS3p       < 0.001
MATS3m      < 0.001
GATS3m      0.951
C-001       0.046
C-007       0.177
C-040       < 0.001
O-061       < 0.001
Cl-089      < 0.001

Equation S1. The constructed logistic model has the form:

z = - 0.5239 + 1.3552×nCIC + 1.0402×nN + 1.2639×nS + 0.6266×nX + 0.00001226×SRW10 + 1.5046×ATS3p - 3.7317×MATS3m + 62.5716×GATS3m + 0.3933×C-001 - 1.8468×C-007 - 0.4511×C-040 + 0.8399×O-061 + 0.7382×Cl-089

where

\[ f(z) = \frac{1}{1 + e^{-z}} = \frac{1}{1 + e^{-(\alpha + \sum_i \beta_i X_i)}} \qquad \text{(S1)} \]

When f(z) > 0.50, the corresponding organic chemical is classified as non-biodegradable; otherwise it is classified as readily biodegradable.

Table S6. Information on number of leaves and size of the C4.5 decision tree

Number of leaves  Size of the tree
55                109

Tree size is the total number of nodes in the tree, i.e., internal (decision) nodes plus leaves (here 54 + 55 = 109). Number of leaves is the number of terminal nodes, each of which assigns a class.
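As a reading aid, Equation S1 can be evaluated numerically as in the minimal Python sketch below. The coefficients are copied from Equation S1; the function names `logistic_score` and `classify` and the all-zero descriptor vector are hypothetical and serve only to illustrate the arithmetic, not a real compound.

```python
import math

# Intercept and coefficients copied from Equation S1.
INTERCEPT = -0.5239
COEFFICIENTS = {
    "nCIC": 1.3552, "nN": 1.0402, "nS": 1.2639, "nX": 0.6266,
    "SRW10": 0.00001226, "ATS3p": 1.5046, "MATS3m": -3.7317,
    "GATS3m": 62.5716, "C-001": 0.3933, "C-007": -1.8468,
    "C-040": -0.4511, "O-061": 0.8399, "Cl-089": 0.7382,
}


def logistic_score(descriptors):
    """Return f(z) = 1 / (1 + exp(-z)) for a dict of the 13 descriptor values."""
    z = INTERCEPT + sum(
        coef * descriptors[name] for name, coef in COEFFICIENTS.items()
    )
    return 1.0 / (1.0 + math.exp(-z))


def classify(descriptors, threshold=0.50):
    """Return 'N' (non-biodegradable) if f(z) > threshold, else 'R'."""
    return "N" if logistic_score(descriptors) > threshold else "R"


# Hypothetical input: all descriptors set to zero, so z equals the intercept.
example = {name: 0.0 for name in COEFFICIENTS}
print(round(logistic_score(example), 3), classify(example))  # 0.372 R
```

With every descriptor at zero, z reduces to the intercept, so f(z) is about 0.372 and falls below the 0.50 threshold.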
The C4.5 decision tree is shown as follows (output by the Weka software):

ATS3p <= 0.398
|   nN <= 1
|   |   nS <= 0
|   |   |   nCIC <= 0
|   |   |   |   C-040 <= 0
|   |   |   |   |   C-001 <= 2
|   |   |   |   |   |   MATS3m <= 0.968
|   |   |   |   |   |   |   GATS3m <= 0.024: N (8.0/1.0)
|   |   |   |   |   |   |   GATS3m > 0.024: R (7.0/1.0)
|   |   |   |   |   |   MATS3m > 0.968: R (99.0/11.0)
|   |   |   |   |   C-001 > 2
|   |   |   |   |   |   SRW10 <= 2736: R (4.0)
|   |   |   |   |   |   SRW10 > 2736: N (10.0/2.0)
|   |   |   |   C-040 > 0: R (100.0/2.0)
|   |   |   nCIC > 0
|   |   |   |   nCIC <= 1
|   |   |   |   |   GATS3m <= 0.021
|   |   |   |   |   |   nN <= 0: R (36.0/8.0)
|   |   |   |   |   |   nN > 0
|   |   |   |   |   |   |   SRW10 <= 3860: R (6.0)
|   |   |   |   |   |   |   SRW10 > 3860: N (6.0/1.0)
|   |   |   |   |   GATS3m > 0.021: N (7.0/1.0)
|   |   |   |   nCIC > 1: N (5.0/1.0)
|   |   nS > 0
|   |   |   nN <= 0
|   |   |   |   C-040 <= 0
|   |   |   |   |   GATS3m <= 0.019: R (3.0/1.0)
|   |   |   |   |   GATS3m > 0.019: N (8.0/1.0)
|   |   |   |   C-040 > 0: R (3.0/1.0)
|   |   |   nN > 0: R (2.0)
|   nN > 1
|   |   nX <= 0
|   |   |   nCIC <= 1
|   |   |   |   C-001 <= 1
|   |   |   |   |   GATS3m <= 0.02
|   |   |   |   |   |   nN <= 2: R (13.0/3.0)
|   |   |   |   |   |   nN > 2
|   |   |   |   |   |   |   C-040 <= 1: R (2.0)
|   |   |   |   |   |   |   C-040 > 1: N (3.0)
|   |   |   |   |   GATS3m > 0.02: N (3.0)
|   |   |   |   C-001 > 1: N (5.0/1.0)
|   |   |   nCIC > 1: N (3.0)
|   |   nX > 0: N (2.0)
ATS3p > 0.398
|   nN <= 1
|   |   Cl-089 <= 0
|   |   |   SRW10 <= 13724
|   |   |   |   GATS3m <= 0.037
|   |   |   |   |   C-040 <= 0
|   |   |   |   |   |   nN <= 0
|   |   |   |   |   |   |   nCIC <= 1
|   |   |   |   |   |   |   |   GATS3m <= 0.013
|   |   |   |   |   |   |   |   |   GATS3m <= 0.008: N (4.0/1.0)
|   |   |   |   |   |   |   |   |   GATS3m > 0.008: R (32.0/3.0)
|   |   |   |   |   |   |   |   GATS3m > 0.013
|   |   |   |   |   |   |   |   |   ATS3p <= 0.423: R (8.0)
|   |   |   |   |   |   |   |   |   ATS3p > 0.423
|   |   |   |   |   |   |   |   |   |   SRW10 <= 6788
|   |   |   |   |   |   |   |   |   |   |   ATS3p <= 0.478: R (14.0/3.0)
|   |   |   |   |   |   |   |   |   |   |   ATS3p > 0.478
|   |   |   |   |   |   |   |   |   |   |   |   C-001 <= 1: N (7.0/1.0)
|   |   |   |   |   |   |   |   |   |   |   |   C-001 > 1: R (3.0/1.0)
|   |   |   |   |   |   |   |   |   |   SRW10 > 6788
|   |   |   |   |   |   |   |   |   |   |   C-001 <= 1
|   |   |   |   |   |   |   |   |   |   |   |   SRW10 <= 8082: N (2.0)
|   |   |   |   |   |   |   |   |   |   |   |   SRW10 > 8082: R (2.0)
|   |   |   |   |   |   |   |   |   |   |   C-001 > 1: N (19.0)
|   |   |   |   |   |   |   nCIC > 1
|   |   |   |   |   |   |   |   C-001 <= 0
|   |   |   |   |   |   |   |   |   SRW10 <= 8928: N (3.0)
|   |   |   |   |   |   |   |   |   SRW10 > 8928
|   |   |   |   |   |   |   |   |   |   MATS3m <= 0.987: R (7.0/1.0)
|   |   |   |   |   |   |   |   |   |   MATS3m > 0.987: N (4.0/1.0)
|   |   |   |   |   |   |   |   C-001 > 0: N (7.0)
|   |   |   |   |   |   nN > 0
|   |   |   |   |   |   |   C-001 <= 1
|   |   |   |   |   |   |   |   O-061 <= 0
|   |   |   |   |   |   |   |   |   MATS3m <= 0.965: R (4.0)
|   |   |   |   |   |   |   |   |   MATS3m > 0.965
|   |   |   |   |   |   |   |   |   |   MATS3m <= 0.992: N (20.0/3.0)
|   |   |   |   |   |   |   |   |   |   MATS3m > 0.992: R (4.0/1.0)
|   |   |   |   |   |   |   |   O-061 > 0: N (6.0)
|   |   |   |   |   |   |   C-001 > 1: N (10.0)
|   |   |   |   |   C-040 > 0
|   |   |   |   |   |   C-001 <= 1
|   |   |   |   |   |   |   nCIC <= 0: R (10.0)
|   |   |   |   |   |   |   nCIC > 0
|   |   |   |   |   |   |   |   ATS3p <= 0.415: N (3.0)
|   |   |   |   |   |   |   |   ATS3p > 0.415: R (53.0/7.0)
|   |   |   |   |   |   C-001 > 1
|   |   |   |   |   |   |   GATS3m <= 0.013: R (2.0)
|   |   |   |   |   |   |   GATS3m > 0.013: N (7.0)
|   |   |   |   GATS3m > 0.037: N (14.0)
|   |   |   SRW10 > 13724
|   |   |   |   nN <= 0
|   |   |   |   |   SRW10 <= 16326: N (16.0)
|   |   |   |   |   SRW10 > 16326
|   |   |   |   |   |   nX <= 0
|   |   |   |   |   |   |   MATS3m <= 0.983: N (32.0/5.0)
|   |   |   |   |   |   |   MATS3m > 0.983: R (5.0)
|   |   |   |   |   |   nX > 0: N (4.0)
|   |   |   |   nN > 0: N (22.0)
|   |   Cl-089 > 0: N (72.0/8.0)
|   nN > 1
|   |   C-007 <= 0: N (89.0/4.0)
|   |   C-007 > 0
|   |   |   nN <= 3: N (3.0)
|   |   |   nN > 3: R (2.0)
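The leaf count and tree size reported in Table S6 can be checked against a Weka-style text listing with a short script. The sketch below is illustrative only: `tree_stats` is a hypothetical helper, the excerpt is the small subtree under "ATS3p > 0.398, nN > 1" with its leading indentation removed, and it is assumed that every line of the listing represents one non-root node and that leaf lines end with a class label followed by the instance counts in parentheses.

```python
import re

# Excerpt of the listing: the subtree under "ATS3p > 0.398 -> nN > 1",
# with the leading indentation of the full tree removed.
TREE_EXCERPT = """\
C-007 <= 0: N (89.0/4.0)
C-007 > 0
|   nN <= 3: N (3.0)
|   nN > 3: R (2.0)
"""

# A leaf line ends with ": <class> (<covered>[/<misclassified>])".
LEAF_PATTERN = re.compile(r":\s*(\w+)\s*\(([\d.]+)(?:/([\d.]+))?\)\s*$")


def tree_stats(tree_text):
    """Return (number of leaves, total nodes, instances covered by leaves)."""
    lines = [line for line in tree_text.splitlines() if line.strip()]
    leaves = 0
    covered = 0.0
    for line in lines:
        match = LEAF_PATTERN.search(line)
        if match:
            leaves += 1
            covered += float(match.group(2))
    # Every line is one non-root node, so add 1 for the (sub)tree root.
    return leaves, len(lines) + 1, covered


leaves, size, covered = tree_stats(TREE_EXCERPT)
print(leaves, size, covered)  # 3 5 94.0
```

Under the same assumptions, running the helper on the complete listing gives a direct comparison with the 55 leaves and tree size of 109 in Table S6, and the covered-instance total can be compared with the 825 training compounds in Table S3.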