Comparative study of chemical biodegradability prediction

Comparative study of biodegradability prediction of chemicals
using decision trees, functional trees, and logistic regression
Guangchao Chen†‡, Xuehua Li*†, Jingwen Chen†, Ya-nan Zhang†, Willie J.G.M. Peijnenburg‡§
†Key Laboratory of Industrial Ecology and Environmental Engineering (MOE), School of
Environmental Science and Technology, Dalian University of Technology, Linggong Road 2,
Dalian 116024, China
‡Institute of Environmental Sciences (CML), Leiden University, 2300 RA Leiden, The
Netherlands
§Center for Safety of Products and Substances, National Institute of Public Health and the
Environment, Bilthoven, The Netherlands
Table S1. An illustration of data sampling when the ratio of readily biodegradable class (R) to
non-biodegradable class (N) is 1:4
Table S2. The overall predictive accuracy, SE, SP, weighted average of SE and SP for the
training and test set when class distribution of the training set (R:N) is 1:4
Figure S1. Effects of the training-set class distribution on model performance
Table S3. Details of the training and test sets used for model construction
Table S4. Performance comparison on test set II of our models with the EPI suite Biowin 5
and 6, and the CHAID-SVM and PubChemFP-SVM models
Table S5. Statistical significances of the descriptor distributions between non-biodegradable
and readily biodegradable classes
Equation S1. Logistic model built on 13 molecular descriptors
Table S6. Information on number of leaves and the size of C4.5 decision tree
*Correspondence to: Xuehua Li, e-mail: lixuehua@dlut.edu.cn, Tel/Fax: +86-411-8470 7844.
The effect of training-set class distribution on model performance
We evaluated how the class distribution of readily biodegradable and non-biodegradable
chemicals affects the statistical performance of the models. Seven ratios of readily
biodegradable to non-biodegradable chemicals (1:4, 2:4, 3:4, 4:4, 4:3, 4:2, 4:1) were
investigated. For this purpose, a data set with biodegradability information on 470
chain-like chemicals was retrieved from the Japanese NITE data set; non-biodegradable
compounds account for 50% of it (235 chemicals). To keep the results from depending on one
particular sample, the data were sampled by enumerating combinations of subsets. The
chemicals of each class were randomly divided into five equal groups, labeled Ri and Ni
(i = 1-5), with 47 compounds in each group. Subsets R1 and N1 were always kept in the test
set to ensure that both classes were present in both the training and test sets. Taking the
ratio 1:4 (readily biodegradable to non-biodegradable class, R:N) as an example, each of the
subsets Ri (i = 2-5) in turn was combined with N2, N3, N4 and N5, which yields four models.
The remaining data were used for external validation (see Table S1). Similarly, when R:N
equals 2:4, two of the subsets Ri (i = 2-5) were selected and combined with the subsets Ni
(i = 2-5), generating six models. The constitutional descriptors and functional groups in
the DRAGON software were used to build the FT models.
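The subset sampling described above can be sketched as a small enumeration. This is only an illustration under stated assumptions: the function name `enumerate_splits` and the string labels are hypothetical, and the sketch reproduces only the counts stated in the text (four models for 1:4, six for 2:4).

```python
from itertools import combinations

# R1 and N1 are always held out for testing, so only subsets 2-5 can be drawn.
R_TRAIN_POOL = ["R2", "R3", "R4", "R5"]
N_TRAIN_POOL = ["N2", "N3", "N4", "N5"]
ALL_SUBSETS = [f"R{i}" for i in range(1, 6)] + [f"N{i}" for i in range(1, 6)]


def enumerate_splits(n_r, n_n):
    """Return every (training, test) assignment for n_r R subsets and n_n N subsets."""
    splits = []
    for r_pick in combinations(R_TRAIN_POOL, n_r):
        for n_pick in combinations(N_TRAIN_POOL, n_n):
            training = list(r_pick) + list(n_pick)
            # Everything not used for training serves as external validation data.
            test = [s for s in ALL_SUBSETS if s not in training]
            splits.append((training, test))
    return splits


splits_1_4 = enumerate_splits(1, 4)  # R:N = 1:4
splits_2_4 = enumerate_splits(2, 4)  # R:N = 2:4
print(len(splits_1_4), len(splits_2_4))
```

Each entry of `splits_1_4` corresponds to one row of Table S1: one R subset and all four N subsets for training, and the rest for testing.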
The effect of class distribution on model performance was evaluated with two averaged
statistics: the average overall predictive accuracy, and the weighted average of SE (and of
SP) over the training and test sets. The statistics for R:N = 1:4 are listed in Table S2 as
an example, and the effects of class distribution on model performance are shown in
Figure S1. Predictive accuracy for the minority class is lower than that for the majority
class, and models built on balanced training data outperform models built on imbalanced
data. The choice of class distribution (i.e., the ratio of readily biodegradable to
non-biodegradable chemicals) in the training data is therefore essential for developing
biodegradation models. Based on these results, a balanced class distribution was chosen for
the training data in subsequent model construction.
Table S1. An illustration of data sampling when the ratio of readily biodegradable class (R) to
non-biodegradable class (N) is 1:4
Model   R1    R2        R3        R4        R5        N1    N2        N3        N4        N5
No. 1   Test  Training  Test      Test      Test      Test  Training  Training  Training  Training
No. 2   Test  Test      Training  Test      Test      Test  Training  Training  Training  Training
No. 3   Test  Test      Test      Training  Test      Test  Training  Training  Training  Training
No. 4   Test  Test      Test      Test      Training  Test  Training  Training  Training  Training
Method of dividing training and test sets
In this paper, one training set and two test sets (I and II) were used to build and validate
the models. In total, 1629 compounds were collected from the three sources. The 27 compounds
from the external validation set of Cheng et al. (J. Chem. Inf. Model. 2012, 52:655-669) were
used as test set II. The remaining 1602 compounds were divided into the training set
(825 compounds) and test set I (777 compounds) by the following procedure:
(1) The 1602 compounds comprised 968 non-biodegradable (N) and 634 readily biodegradable
(R) compounds.
(2) To obtain a dataset with an N:R ratio of about 1:1, 644 of the 968 (nearly 33/50)
non-biodegradable compounds were randomly selected. These 644 compounds and the 634 R
compounds formed a dataset of 1278 compounds.
(3) The dataset of 1278 compounds was randomly divided at a ratio of approximately 2:1,
giving 825 compounds (the training set) and 453 compounds.
To make the best use of the collected data, the 453 compounds from step 3 and the remaining
324 (968 - 644) N compounds from step 2 were combined to form test set I (777 compounds).
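The bookkeeping of this procedure can be checked with a minimal sketch. The function name `split_dataset`, the placeholder compound labels, and the random seed are assumptions made for illustration; a random split like this reproduces the set sizes (825 and 777) but not the exact per-class composition of the published sets.

```python
import random


def split_dataset(n_ids, r_ids, n_keep=644, n_train=825, seed=0):
    """Balance the classes, then split into a training set and test set I."""
    rng = random.Random(seed)  # arbitrary seed, only for repeatability
    n_shuffled = rng.sample(n_ids, len(n_ids))
    n_selected, n_leftover = n_shuffled[:n_keep], n_shuffled[n_keep:]
    balanced = n_selected + list(r_ids)  # 644 N + 634 R = 1278 compounds
    rng.shuffle(balanced)
    training = balanced[:n_train]
    # Test set I = rest of the balanced data plus the N compounds left out in step 2.
    test_one = balanced[n_train:] + n_leftover
    return training, test_one


n_ids = [f"N{i}" for i in range(968)]  # placeholder labels, not real compounds
r_ids = [f"R{i}" for i in range(634)]
training, test_one = split_dataset(n_ids, r_ids)
print(len(training), len(test_one))
```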
Table S2. The overall predictive accuracy, SE, SP, weighted average of SE and SP for the training and test
set when class distribution of the training set (R:N) is 1:4
           Training set                    Test set                    Weighted average
           Qtraining (%)  SE (%)  SP (%)   Qtest (%)  SE (%)  SP (%)   SE (%)  SP (%)
Model 1    83.8           46.8    93.1     62.1       53.7    95.7     52.3    93.6
Model 2    82.1           36.2    93.6     40.0       26.6    93.6     28.5    93.6
Model 3    84.3           55.3    91.5     64.3       56.9    93.6     56.6    91.9
Model 4    76.2           23.4    89.4     58.3       50.5    89.4     45.1    89.4
Average    81.6                            56.2                        45.6    92.1
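The weighted averages in Table S2 can be reproduced with a short calculation. The text does not spell out the weights, so this sketch assumes that each rate is weighted by the number of compounds of the relevant class in each set (47 R and 188 N in the training set, 188 R and 47 N in the test set at R:N = 1:4). Under that assumption, the tabulated values for Model 1 are recovered.

```python
def weighted_average(value_train, n_train, value_test, n_test):
    """Average two rates, weighting each by the number of compounds behind it."""
    return (value_train * n_train + value_test * n_test) / (n_train + n_test)


# Model 1 at R:N = 1:4 (values taken from Table S2).
se_weighted = weighted_average(46.8, 47, 53.7, 188)   # SE is computed on R compounds
sp_weighted = weighted_average(93.1, 188, 95.7, 47)   # SP is computed on N compounds
print(round(se_weighted, 1), round(sp_weighted, 1))
```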
Figure S1. Effects of the training-set class distribution on model performance. (a) Variation of the
predictive accuracy of the models for readily biodegradable and non-biodegradable chemicals; (b) Variation
of the predictive accuracy of the models for the training and test sets.
Table S3. Details of the training and test sets used for model construction
                        Training set   Test set I   Test set II
Readily biodegradable   409            225          4
Non-biodegradable       416            552          23
Total                   825            777          27
Table S4. Performance comparison on test set II of our models with the EPI suite Biowin 5 and 6, and the
CHAID-SVM and PubChemFP-SVM models
Model                 TP   FN   TN   FP   SE (%)   SP (%)   Q (%)   AUC
C4.5 decision tree    4    0    17   6    100.0    73.9     77.8    0.793
FT-Inner              4    0    23   0    100.0    100.0    100.0   1.00
Logistic regression   4    0    20   3    100.0    87.0     88.9    0.924
Biowin 5              3    1    20   3    75.0     87.0     85.2    -
Biowin 6              2    2    21   2    50.0     91.3     85.2    -
CHAID-SVM             4    0    22   1    100.0    95.7     96.3    0.978
PubChemFP-SVM         4    0    23   1    100.0    100.0    100.0   1.00
SE (sensitivity) = TP/(TP + FN); SP (specificity) = TN/(TN + FP); Q (overall predictive accuracy) = (TP +
TN)/(TP+ FP + TN + FN); AUC (area under the curve); TP - true positives; FP - false positives; TN - true
negatives; FN - false negatives.
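As a worked example of these definitions, the following sketch computes SE, SP and Q from the confusion-matrix counts of the C4.5 decision tree in Table S4. The function name `classification_metrics` is hypothetical.

```python
def classification_metrics(tp, fn, tn, fp):
    """Return sensitivity, specificity and overall accuracy in percent."""
    se = 100.0 * tp / (tp + fn)
    sp = 100.0 * tn / (tn + fp)
    q = 100.0 * (tp + tn) / (tp + fp + tn + fn)
    return se, sp, q


# Counts for the C4.5 decision tree on test set II (Table S4).
se, sp, q = classification_metrics(tp=4, fn=0, tn=17, fp=6)
print(f"SE = {se:.1f}%, SP = {sp:.1f}%, Q = {q:.1f}%")
```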
Table S5. Statistical significances of the descriptor distributions between non-biodegradable and readily
biodegradable classes
Descriptor   Significance
nCIC         < 0.001
nN           < 0.001
nS           < 0.001
nX           < 0.001
SRW10        < 0.001
ATS3p        < 0.001
MATS3m       < 0.001
GATS3m       0.951
C-001        0.046
C-007        0.177
C-040        < 0.001
O-061        < 0.001
Cl-089       < 0.001
Equation S1.
The constructed logistic model has the form:
z = - 0.5239 + 1.3552×nCIC + 1.0402×nN + 1.2639×nS + 0.6266×nX + 0.00001226×SRW10
+ 1.5046×ATS3p - 3.7317×MATS3m + 62.5716×GATS3m + 0.3933×C-001
-1.8468×C-007 - 0.4511×C-040 + 0.8399×O-061 + 0.7382×Cl-089
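where

$$ f(z) = \frac{1}{1 + e^{-z}} = \frac{1}{1 + e^{-(\alpha + \sum_i \beta_i X_i)}} \qquad \text{(S1)} $$

with intercept \alpha and coefficients \beta_i of the descriptors X_i. When f(z) > 0.50, the
corresponding organic chemical is judged as non-biodegradable, and as readily biodegradable otherwise.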
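Applying the logistic model can be sketched as follows. The sketch assumes the standard logistic function f(z) = 1/(1 + e^(-z)) and takes the coefficients from the expression for z above. The function name `predict`, the choice to treat missing descriptors as zero, and the input values in the example are hypothetical and do not correspond to any real chemical.

```python
import math

INTERCEPT = -0.5239
COEFFICIENTS = {
    "nCIC": 1.3552, "nN": 1.0402, "nS": 1.2639, "nX": 0.6266,
    "SRW10": 0.00001226, "ATS3p": 1.5046, "MATS3m": -3.7317,
    "GATS3m": 62.5716, "C-001": 0.3933, "C-007": -1.8468,
    "C-040": -0.4511, "O-061": 0.8399, "Cl-089": 0.7382,
}


def predict(descriptors):
    """Return (z, f(z), class label) for a dict of descriptor values.

    Descriptors missing from the dict are treated as 0 (an illustrative choice).
    """
    z = INTERCEPT + sum(
        coef * descriptors.get(name, 0.0) for name, coef in COEFFICIENTS.items()
    )
    f = 1.0 / (1.0 + math.exp(-z))
    label = "non-biodegradable" if f > 0.50 else "readily biodegradable"
    return z, f, label


# Hypothetical input: two halogen atoms, all other descriptors zero.
z, f, label = predict({"nX": 2})
print(f"z = {z:.4f}, f(z) = {f:.3f}, class = {label}")
```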
Table S6. Information on number of leaves and size of the C4.5 decision tree
Number of leaves   Size of the tree
55                 109
Tree size is the total number of nodes in the decision tree (inner nodes plus leaves). Number of leaves
is the number of terminal nodes, each of which assigns a class.
The C4.5 decision tree is shown as follows (output by the Weka software):
ATS3p <= 0.398
| nN <= 1
| | nS <= 0
| | | nCIC <= 0
| | | | C-040 <= 0
| | | | | C-001 <= 2
| | | | | | MATS3m <= 0.968
| | | | | | | GATS3m <= 0.024: N (8.0/1.0)
| | | | | | | GATS3m > 0.024: R (7.0/1.0)
| | | | | | MATS3m > 0.968: R (99.0/11.0)
| | | | | C-001 > 2
| | | | | | SRW10 <= 2736: R (4.0)
| | | | | | SRW10 > 2736: N (10.0/2.0)
| | | | C-040 > 0: R (100.0/2.0)
| | | nCIC > 0
| | | | nCIC <= 1
| | | | | GATS3m <= 0.021
| | | | | | nN <= 0: R (36.0/8.0)
| | | | | | nN > 0
| | | | | | | SRW10 <= 3860: R (6.0)
| | | | | | | SRW10 > 3860: N (6.0/1.0)
| | | | | GATS3m > 0.021: N (7.0/1.0)
| | | | nCIC > 1: N (5.0/1.0)
| | nS > 0
| | | nN <= 0
| | | | C-040 <= 0
| | | | | GATS3m <= 0.019: R (3.0/1.0)
| | | | | GATS3m > 0.019: N (8.0/1.0)
| | | | C-040 > 0: R (3.0/1.0)
| | | nN > 0: R (2.0)
| nN > 1
| | nX <= 0
| | | nCIC <= 1
| | | | C-001 <= 1
| | | | | GATS3m <= 0.02
| | | | | | nN <= 2: R (13.0/3.0)
| | | | | | nN > 2
| | | | | | | C-040 <= 1: R (2.0)
| | | | | | | C-040 > 1: N (3.0)
| | | | | GATS3m > 0.02: N (3.0)
| | | | C-001 > 1: N (5.0/1.0)
| | | nCIC > 1: N (3.0)
| | nX > 0: N (2.0)
ATS3p > 0.398
| nN <= 1
| | Cl-089 <= 0
| | | SRW10 <= 13724
| | | | GATS3m <= 0.037
| | | | | C-040 <= 0
| | | | | | nN <= 0
| | | | | | | nCIC <= 1
| | | | | | | | GATS3m <= 0.013
| | | | | | | | | GATS3m <= 0.008: N (4.0/1.0)
| | | | | | | | | GATS3m > 0.008: R (32.0/3.0)
| | | | | | | | GATS3m > 0.013
| | | | | | | | | ATS3p <= 0.423: R (8.0)
| | | | | | | | | ATS3p > 0.423
| | | | | | | | | | SRW10 <= 6788
| | | | | | | | | | | ATS3p <= 0.478: R (14.0/3.0)
| | | | | | | | | | | ATS3p > 0.478
| | | | | | | | | | | | C-001 <= 1: N (7.0/1.0)
| | | | | | | | | | | | C-001 > 1: R (3.0/1.0)
| | | | | | | | | | SRW10 > 6788
| | | | | | | | | | | C-001 <= 1
| | | | | | | | | | | | SRW10 <= 8082: N (2.0)
| | | | | | | | | | | | SRW10 > 8082: R (2.0)
| | | | | | | | | | | C-001 > 1: N (19.0)
| | | | | | | nCIC > 1
| | | | | | | | C-001 <= 0
| | | | | | | | | SRW10 <= 8928: N (3.0)
| | | | | | | | | SRW10 > 8928
| | | | | | | | | | MATS3m <= 0.987: R (7.0/1.0)
| | | | | | | | | | MATS3m > 0.987: N (4.0/1.0)
| | | | | | | | C-001 > 0: N (7.0)
| | | | | | nN > 0
| | | | | | | C-001 <= 1
| | | | | | | | O-061 <= 0
| | | | | | | | | MATS3m <= 0.965: R (4.0)
| | | | | | | | | MATS3m > 0.965
| | | | | | | | | | MATS3m <= 0.992: N (20.0/3.0)
| | | | | | | | | | MATS3m > 0.992: R (4.0/1.0)
| | | | | | | | O-061 > 0: N (6.0)
| | | | | | | C-001 > 1: N (10.0)
| | | | | C-040 > 0
| | | | | | C-001 <= 1
| | | | | | | nCIC <= 0: R (10.0)
| | | | | | | nCIC > 0
| | | | | | | | ATS3p <= 0.415: N (3.0)
| | | | | | | | ATS3p > 0.415: R (53.0/7.0)
| | | | | | C-001 > 1
| | | | | | | GATS3m <= 0.013: R (2.0)
| | | | | | | GATS3m > 0.013: N (7.0)
| | | | GATS3m > 0.037: N (14.0)
| | | SRW10 > 13724
| | | | nN <= 0
| | | | | SRW10 <= 16326: N (16.0)
| | | | | SRW10 > 16326
| | | | | | nX <= 0
| | | | | | | MATS3m <= 0.983: N (32.0/5.0)
| | | | | | | MATS3m > 0.983: R (5.0)
| | | | | | nX > 0: N (4.0)
| | | | nN > 0: N (22.0)
| | Cl-089 > 0: N (72.0/8.0)
| nN > 1
| | C-007 <= 0: N (89.0/4.0)
| | C-007 > 0
| | | nN <= 3: N (3.0)
| | | nN > 3: R (2.0)
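In this Weka output, each leaf ends with the predicted class followed by the number of training instances that reach the leaf and, after a slash, how many of them are misclassified. A leaf such as N (8.0/1.0) therefore covers eight training compounds, one of which is not of class N. The following sketch reads these annotations from a few lines taken from the start of the listing. The regular expression and the helper `parse_leaf` are illustrative assumptions, not part of Weka.

```python
import re

# Matches a leaf such as ": N (8.0/1.0)" or ": R (4.0)" at the end of a line.
LEAF = re.compile(r":\s*([NR])\s*\((\d+(?:\.\d+)?)(?:/(\d+(?:\.\d+)?))?\)\s*$")


def parse_leaf(line):
    """Return (class label, instances covered, instances misclassified) or None."""
    match = LEAF.search(line)
    if match is None:
        return None  # inner node: no class assignment on this line
    label = match.group(1)
    covered = float(match.group(2))
    errors = float(match.group(3) or 0.0)  # no "/x" part means no errors
    return label, covered, errors


excerpt = """\
| | | | | | MATS3m <= 0.968
| | | | | | | GATS3m <= 0.024: N (8.0/1.0)
| | | | | | | GATS3m > 0.024: R (7.0/1.0)
| | | | | | MATS3m > 0.968: R (99.0/11.0)
"""

leaves = [leaf for leaf in map(parse_leaf, excerpt.splitlines()) if leaf is not None]
covered_total = sum(covered for _, covered, _ in leaves)
errors_total = sum(errors for _, _, errors in leaves)
print(len(leaves), covered_total, errors_total)
```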