
Assessing Prediction Reliability with High
Confidence for High Dimensional QSAR Data
(Supporting Information)
Jianping Huang1, Xiaohui Fan1, *
1 Pharmaceutical Informatics Institute, College of Pharmaceutical Sciences, Zhejiang University, 866 YuHangTang Rd, Hangzhou 310058, China.
Tel: +86 571 88208596
Fax: +86 571 88208428
Email: fanxh@zju.edu.cn or jimny@zju.edu.cn
Introduction to Ensemble Methods
In the field of machine learning, ensemble methods have attracted much attention for more than a decade[1]. In QSAR modeling, ensembles of numerous homogeneous models (generated by identical learners) or heterogeneous models (generated by different learners) have likewise been recognized as able to produce more accurate and stable results[2,3]. Two elements matter most for ensemble accuracy: the individual accuracy of the base classifiers and the diversity of their predictions (i.e., making different errors). So far, a clear theoretical basis for the complete relationship between these two elements is still lacking. Intuitively, one may prefer to create many high-accuracy individual classifiers, in the hope that they can be combined into a stronger ensemble with higher accuracy. However, ensembles built from many diverse weak classifiers may outperform those built from many strong but similar individual classifiers. Bagging[4], boosting[5], the random subspace method[6] and random forest (RF)[8] are successful examples of this kind, combining a number of unstable, weak learners into a strong final classifier. An ensemble should therefore take into account the trade-off between individual accuracy and the diversity among the base classifiers. Because the fundamental relationship between these two elements remains unknown, it is difficult to create a set of base classifiers that possess both high accuracy and desirable diversity, and most efforts devoted to this problem are empirically based. Many approaches have been proposed for making the base classifiers different; the popular ones are: 1) varying the initial parameters of the learning method[7], 2) varying the variable space[6,8], and 3) varying the training data[4,5,8]. Usually, constructing an effective ensemble from weak learners requires a large number of base learners, leading to a huge ensemble.
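To make the diversification strategies concrete, the following minimal Python sketch (not from the original work; all names are illustrative) combines strategies 2 and 3 by growing scikit-learn decision trees on bootstrap samples restricted to random descriptor subsets:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def build_diverse_ensemble(X, y, n_trees=30, n_feats=None, seed=0):
    """Grow trees on bootstrap samples (strategy 3) restricted to random
    descriptor subsets (strategy 2), as in bagging / random subspace."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    n_feats = n_feats or max(1, int(np.sqrt(p)))
    ensemble = []
    for _ in range(n_trees):
        rows = rng.integers(0, n, size=n)                  # bootstrap rows
        cols = rng.choice(p, size=n_feats, replace=False)  # feature subset
        tree = DecisionTreeClassifier().fit(X[np.ix_(rows, cols)], y[rows])
        ensemble.append((tree, cols))
    return ensemble
```

Averaging the votes of trees grown this way typically reduces variance relative to any single tree, which is the mechanism behind the weak-learner ensembles cited above.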
TreeEC Method
Unlike the aforementioned prevailing ensemble approaches, which often need to combine hundreds or even thousands of weak trees to converge, TreeEC is intended to combine fewer trees without sacrificing either the accuracy or the diversity of the component trees. Its appeal is that it builds the ensemble from fewer but stronger trees.
Fig. S1 The construction flowchart of the TreeEC method
The construction steps of TreeEC were detailed in Fig. S1. The base classifier used in TreeEC was similar to the Classification and Regression Tree (CART)[9]. However, a variant of the entropy function, rather than the Gini index, was employed as the splitting criterion, and the trees were built without pruning; thus, no misclassification-cost calculation was required.
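Since the exact entropy variant used by TreeEC is not specified here, the sketch below assumes the standard Shannon entropy and the resulting information gain of a candidate binary split; it is illustrative only:

```python
import numpy as np

def entropy(y):
    """Shannon entropy of a 0/1 label vector (the exact variant used
    by TreeEC is not specified; this is the standard form)."""
    p = np.bincount(y, minlength=2) / len(y)
    p = p[p > 0]                       # skip empty classes to avoid log(0)
    return -np.sum(p * np.log2(p))

def information_gain(y, mask):
    """Entropy reduction when y is split into y[mask] and y[~mask]."""
    n, nl = len(y), mask.sum()
    if nl == 0 or nl == n:             # degenerate split: no gain
        return 0.0
    return entropy(y) - (nl / n) * entropy(y[mask]) \
                      - ((n - nl) / n) * entropy(y[~mask])
```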
In order to make the individual trees strong and different, several points should be
addressed:
1) Three parameters were introduced: the maximum number of samples allowed in a leaf node (L), the maximum number of times a descriptor can be used as a node splitter across all trees (R), and the maximum number of trees to create (T).
2) To make the individual trees different, the key step was to carefully pick a different descriptor for the root node of each tree, because decision trees are inherently unstable: a small disturbance at a node may lead to completely different descendant sub-trees. Therefore, if a descriptor had been used in the root node of one tree, it would not be chosen as the root-node splitter in other trees. At the same time, to avoid high correlation among root descriptors, descriptors with a high mutual correlation coefficient (e.g. > 0.9) would not both be chosen as root-node splitters (see the sketch after this list). Following these rules, the trees grow differently.
3) The parameter R controls the number of times a descriptor can be used in building the ensemble. Ideally, if there are many informative descriptors to choose from, R can be set larger; otherwise, R should be reduced to avoid building too many trees from uninformative descriptors.
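A hypothetical helper illustrating rules 2) and 3) is sketched below; the function name, the score source, and the correlation threshold are assumptions for illustration, not part of the published method:

```python
import numpy as np

def pick_root_descriptor(X, scores, used_roots, usage, R=5, max_corr=0.9):
    """Choose the best-scoring descriptor for a new tree's root node,
    skipping descriptors already used as a root, already used R times
    overall, or highly correlated with a previous root.  `scores` would
    come from the splitting criterion (illustrative helper)."""
    order = np.argsort(scores)[::-1]                  # best score first
    for j in order:
        if j in used_roots or usage.get(j, 0) >= R:
            continue
        # reject descriptors strongly correlated with any earlier root
        if any(abs(np.corrcoef(X[:, j], X[:, r])[0, 1]) > max_corr
               for r in used_roots):
            continue
        return j
    return None                                       # no admissible root left
```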
The three parameters L, R and T were by default set to 5, 5 and 30, respectively. These settings were sufficient to reach reasonable results in most cases; reasonable accuracy can be obtained when L and R are set to relatively small integers. The final prediction was obtained by averaging the predictions produced by the individual trees. TreeEC therefore classifies samples according to a probability (or confidence), and a sample is designated active (1) or inactive (0) according to whether this probability is greater than 0.5.
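The voting rule just described can be written compactly; this is a minimal sketch with an assumed function name:

```python
import numpy as np

def treeec_vote(tree_predictions):
    """Average the 0/1 outputs of the component trees into a class-1
    probability and threshold it at 0.5 (assumed helper, per the rule above)."""
    proba = np.mean(np.asarray(tree_predictions), axis=0)
    return (proba > 0.5).astype(int), proba
```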
It should be noted that we borrowed some ideas from the Decision Forest (DF) developed by Tong et al.[10,11], whose aim is to construct individual trees of similar predictive quality so that the ensemble does not lose its robustness and predictability. However, DF is more complex to construct than TreeEC. For example, DF includes a pruning step, and it introduces a misclassification criterion to guarantee that the individual trees produce as small a misclassification error as possible; if a tree fails this criterion, it has to be abandoned and a new tree grown. In addition, every tree in DF is built with a unique set of descriptors so as to maintain consistent quality and, at the same time, guarantee diversity among the trees. However, because each descriptor can be used in only one tree, the contribution of the truly informative descriptors may be limited. When not enough informative descriptors are available, building more trees may not improve performance and can more easily degrade it.
References
1. Dietterich TG (1997) Machine Learning Research: Four Current Directions. AI Magazine 18(4):97
2. Merkwirth C, Mauser H, Schulz-Gasch T, Roche O, Stahl M, Lengauer T (2004) Ensemble methods for classification in cheminformatics. Journal of Chemical Information and Computer Sciences 44(6):1971-1978
3. Huang J, Fan X (2011) Why QSAR Fails: An Empirical Evaluation Using Conventional Computational Approach. Molecular Pharmaceutics 8(2):600-608
4. Breiman L (1996) Bagging predictors. Machine Learning 24(2):123-140
5. Freund Y, Schapire R (1996) Experiments with a new boosting algorithm. In: Machine Learning: Proceedings of the Thirteenth International Conference, pp 148-156
6. Ho T (1998) The random subspace method for constructing decision forests. IEEE Transactions on Pattern Analysis and Machine Intelligence 20(8):832-844
7. Hansen L, Salamon P (1990) Neural network ensembles. IEEE Transactions on Pattern Analysis and Machine Intelligence 12(10):993-1001
8. Breiman L (2001) Random Forests. Machine Learning 45(1):5-32
9. Breiman L (1998) Classification and Regression Trees. Chapman & Hall/CRC
10. Tong W, Hong H, Fang H, Xie Q, Perkins R (2003) Decision forest: combining the predictions of multiple independent decision tree models. Journal of Chemical Information and Computer Sciences 43(2):525-531
11. Xie Q, Ratnasinghe L, Hong H, Perkins R, Tang Z, Hu N, Taylor P, Tong W (2005) Decision forest analysis of 61 single nucleotide polymorphisms in a case-control study of esophageal cancer: a novel method. BMC Bioinformatics 6(Suppl 2):S4