Assessing Prediction Reliability with High Confidence for High Dimensional QSAR Data (Supporting Information)

Jianping Huang1, Xiaohui Fan1,*

1 Pharmaceutical Informatics Institute, College of Pharmaceutical Sciences, Zhejiang University, 866 YuHangTang Rd, Hangzhou 310058, China. Tel: +86 571 88208596 Fax: +86 571 88208428 Email: fanxh@zju.edu.cn or jimny@zju.edu.cn

1 Introduction of Ensemble Method

In the field of machine learning, ensemble methods have attracted much attention for more than a decade [1]. In QSAR modeling, ensembles of numerous homogeneous models (generated by identical learners) or heterogeneous models (generated by different learners) have also been recognized to produce more accurate and stable results [2,3]. With respect to ensemble accuracy, two elements are most important: the accuracy of the individual base classifiers and the diversity of their predictions (i.e., making different errors). So far, a clear fundamental basis establishing the complete relationship between these two elements is still lacking. Intuitively, one may prefer to create many high-accuracy individual classifiers, in the hope that they can be combined into a stronger ensemble with even higher accuracy. However, ensembles built from many diverse weak classifiers may outperform those built from many strong but similar individual classifiers. Bagging [4], boosting [5], the random subspace method [6] and RF are successful examples that combine a number of unstable, weak learners into a strong final classifier. An ensemble should therefore take into account the trade-off between the individual accuracy and the diversity of the base classifiers. Since a fundamental relationship between these two elements is still lacking, it is difficult to create a set of base classifiers that possess both high accuracy and desirable diversity, and as a result most efforts devoted to this problem are empirically based. Many approaches have been proposed to make the base classifiers different; the most popular ones are: 1) varying the initial parameters of the learning method [7], 2) varying the variable space [8,6], and 3) varying the training data [4,5,8]. Usually, a large number of base learners is needed to construct an effective ensemble with weak learners, leading to a huge ensemble.

TreeEC Method

Unlike the aforementioned ensemble approaches, which often need to combine hundreds or even thousands of weak trees to converge, TreeEC was intended to combine fewer trees without sacrificing either the accuracy or the diversity of the component trees. It is attractive in that it aims to build the ensemble from fewer but stronger trees.

Fig. S1 The construction flowchart of the TreeEC method

The construction steps of TreeEC are detailed in Fig. S1. The base classifier used in TreeEC is similar to the Classification and Regression Tree (CART) [9]. However, a variant of the entropy function, instead of the Gini index, was employed as the splitting criterion, and the trees were built without pruning; thus, no misclassification cost calculation was required.
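To illustrate the splitting criterion, the minimal sketch below evaluates an entropy-based split on a single descriptor. The exact entropy variant used by TreeEC is not specified here, so standard Shannon entropy and an information-gain split are assumed; the function names (entropy, best_split) are hypothetical.

```python
import numpy as np

def entropy(y):
    """Shannon entropy of a binary (0/1) label vector.
    Assumption: TreeEC's 'variant of entropy' is unspecified, so standard
    information entropy is used for illustration."""
    p = np.bincount(y, minlength=2) / len(y)
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

def best_split(x, y):
    """Return the threshold on descriptor x that maximizes information gain."""
    parent = entropy(y)
    best_gain, best_thr = 0.0, None
    for thr in np.unique(x):
        left, right = y[x <= thr], y[x > thr]
        if len(left) == 0 or len(right) == 0:
            continue
        child = (len(left) * entropy(left) + len(right) * entropy(right)) / len(y)
        if parent - child > best_gain:
            best_gain, best_thr = parent - child, thr
    return best_thr, best_gain
```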
In order to make the individual trees strong and different, several points should be addressed:

1) Three parameters were introduced, namely the maximum number of samples allowed in a leaf node (L), the maximum number of times a descriptor can be used as a node splitter across all trees (R), and the maximum number of trees to create (T).

2) To make the individual trees different, the main task was to carefully select a different descriptor for the root node of each tree, because decision trees are inherently unstable: a small disturbance at a node may lead to completely different descendant sub-trees. Therefore, once a descriptor had been used in the root node of one tree, it was not chosen as the root node splitter of any other tree. At the same time, to avoid high correlation among root descriptors, descriptors with a high correlation coefficient (e.g. > 90%) were not chosen as root node splitters simultaneously. Following these rules, the trees grow differently.

3) The parameter R controls the number of times a descriptor can be used in building the ensemble. Ideally, if many informative descriptors are available, R can be set larger; otherwise, R should be reduced to avoid building too many trees from uninformative descriptors.

The three parameters L, R and T were set by default to 5, 5 and 30, respectively. These settings were sufficient to reach reasonable results in most cases, and reasonable accuracy can be obtained when L and R are set to relatively small integers. The final prediction was obtained by averaging the predictions produced by the individual trees. TreeEC therefore classifies samples according to a probability (or confidence), and a sample is designated active (1) or inactive (0) depending on whether this probability is greater than 0.5. A minimal sketch of these construction and prediction rules is given at the end of this section.

It should be noted that we borrowed some ideas from the Decision Forest (DF) developed by Tong et al. [10,11], whose aim is to construct individual trees of similar predictive quality so that the ensemble does not lose robustness and predictability. However, DF is more complex to construct than TreeEC. For example, DF includes a pruning step. It also introduces a misclassification criterion to guarantee that the individual trees produce as small a misclassification error as possible; if this criterion is not met, the constructed tree has to be abandoned and a new tree grown. In addition, every tree in DF is built with a unique set of descriptors so as to maintain consistent quality and, at the same time, guarantee diversity between the trees. However, because each descriptor can then be used in only one tree, the contribution of the truly informative descriptors could be limited. When not enough informative descriptors are available, building more trees may not improve performance and may more easily lead to its degradation.
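The sketch below (Python; the names build_treeec and predict_treeec are hypothetical) illustrates the construction and prediction rules described above, using scikit-learn's CART implementation with an entropy criterion as a stand-in for the unpruned base tree. Several simplifications are assumptions rather than the authors' exact procedure: previously used root descriptors, and descriptors highly correlated with them, are removed from the entire feature set of the next tree rather than only from its root position; the leaf-size limit L is approximated with min_samples_split; and R is enforced only when selecting eligible descriptors.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def build_treeec(X, y, L=5, R=5, T=30, corr_cut=0.9):
    """Sketch of the TreeEC construction rules (simplified, see text above)."""
    n_desc = X.shape[1]
    corr = np.abs(np.corrcoef(X, rowvar=False))   # descriptor-descriptor correlations
    root_used = []                                # descriptors already used as root splitters
    usage = np.zeros(n_desc, dtype=int)           # times each descriptor has split a node
    trees = []
    for _ in range(T):                            # build at most T trees
        # descriptors still eligible: never a root, used fewer than R times,
        # and not highly correlated with any previous root descriptor
        eligible = [j for j in range(n_desc)
                    if j not in root_used
                    and usage[j] < R
                    and all(corr[j, k] <= corr_cut for k in root_used)]
        if not eligible:
            break
        tree = DecisionTreeClassifier(criterion="entropy",
                                      min_samples_split=L + 1)  # unpruned; ~at most L samples per leaf
        tree.fit(X[:, eligible], y)
        feats = tree.tree_.feature                # per-node feature indices (-2 marks leaves)
        if feats[0] < 0:                          # degenerate single-leaf tree: stop
            break
        root_used.append(eligible[feats[0]])
        for f in feats[feats >= 0]:
            usage[eligible[f]] += 1
        trees.append((tree, eligible))
    return trees

def predict_treeec(trees, X):
    """Average the individual tree probabilities; a sample is active if the mean > 0.5."""
    proba = np.mean([t.predict_proba(X[:, cols])[:, 1] for t, cols in trees], axis=0)
    return (proba > 0.5).astype(int), proba
```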
References

1. Dietterich TG (1997) Machine Learning Research: Four Current Directions. AI Magazine 18 (4):97
2. Merkwirth C, Mauser H, Schulz-Gasch T, Roche O, Stahl M, Lengauer T (2004) Ensemble methods for classification in cheminformatics. Journal of Chemical Information and Computer Sciences 44 (6):1971-1978
3. Huang J, Fan X (2011) Why QSAR Fails: An Empirical Evaluation Using Conventional Computational Approach. Molecular Pharmaceutics 8 (2):600-608
4. Breiman L (1996) Bagging predictors. Machine Learning 24 (2):123-140
5. Freund Y, Schapire R (1996) Experiments with a new boosting algorithm. In: Machine Learning: Proceedings of the Thirteenth International Conference, pp 148-156
6. Ho T (1998) The random subspace method for constructing decision forests. IEEE Transactions on Pattern Analysis and Machine Intelligence 20 (8):832-844
7. Hansen L, Salamon P (1990) Neural network ensembles. IEEE Transactions on Pattern Analysis and Machine Intelligence 12 (10):993-1001
8. Breiman L (2001) Random Forests. Machine Learning 45 (1):5-32
9. Breiman L (1998) Classification and Regression Trees. Chapman & Hall/CRC
10. Tong W, Hong H, Fang H, Xie Q, Perkins R (2003) Decision forest: combining the predictions of multiple independent decision tree models. Journal of Chemical Information and Computer Sciences 43 (2):525-531
11. Xie Q, Ratnasinghe L, Hong H, Perkins R, Tang Z, Hu N, Taylor P, Tong W (2005) Decision forest analysis of 61 single nucleotide polymorphisms in a case-control study of esophageal cancer: a novel method. BMC Bioinformatics 6 (Suppl 2):S4