A COMPARATIVE STUDY OF MACHINE LEARNING ALGORITHMS APPLIED TO PREDICTIVE TOXICOLOGY DATA MINING

Neagu C.D.*, Guo G.*, Trundle P.R.* and Cronin M.T.D.**

*Department of Computing, University of Bradford, Bradford, BD7 1DP, UK
{D.Neagu, G.Guo, P.R.Trundle}@bradford.ac.uk
**School of Pharmacy and Chemistry, Liverpool John Moores University, L3 3AF, UK
M.T.Cronin@ljmu.ac.uk

Abstract: This paper reports the results of a comparative study of widely used machine learning algorithms applied to predictive toxicology data mining. The algorithms were chosen for their representability and diversity, and are extensively evaluated on seven toxicity data sets which come from real-world applications. We emphasize results based on visual analysis of the correlations of different descriptors to the class values of chemical compounds, and on the relationship between the range of chosen descriptors and the performance of the machine learning algorithms. Some interesting findings on the quality of data and models are presented: no specific algorithm appears best for all seven toxicity data sets, and up to five descriptors are sufficient for creating classification models with good accuracy for each toxicity data set. We suggest that, for a specific data set, model accuracy is affected by both the feature selection method and the model development technique. Models built with too many or too few descriptors are both undesirable, and finding the optimal feature subset appears at least as important as selecting appropriate algorithms with which to build a final model.

Keywords: predictive toxicology, data mining, algorithm, visual analysis, feature selection

1. Introduction

The increasing amount and complexity of data used in predictive toxicology calls for new and flexible approaches to mine the data. Traditional manual data analysis has become inefficient and computer-based analysis is indispensable.
Statistical methods [1], expert systems [2], fuzzy neural networks [3] and other machine learning algorithms [4, 5] have been extensively studied and applied to predictive toxicology for model development and decision making. However, because existing toxicity data sets are complicated by numerous irrelevant descriptors, skewed distributions, missing values and noisy data, no single machine learning algorithm can be proposed that accurately models all the toxicity data sets available. This motivated us to conduct a comparative study of machine learning algorithms applied to seven toxicity data sets. The intention of this study was to discuss the applicability of some widely used machine learning algorithms to the toxicity data sets at hand. For this purpose, seven machine learning algorithms, described in the next section, were chosen for their representability and diversity, and a library of models was built in order to provide some useful model benchmarks for researchers working in this area.

2. Methods

2.1. Machine Learning Algorithms

Seven algorithms have been chosen for this study in terms of their representability, i.e. their ability to learn numerical data as reported by the machine learning community [6]. They were also chosen in terms of their diversity, i.e. the different ways in which they learn from data and represent the final models [6]. A brief introduction to the seven machine learning algorithms applied in this study is given below:

Support Vector Machine [7] - SVM is based on the Structural Risk Minimization principle from statistical learning theory. Given a training set in a vector space, SVM finds the best decision hyperplane that separates the instances into two classes. The quality of a decision hyperplane is determined by the distance (referred to as the margin) between two hyperplanes that are parallel to the decision hyperplane and touch the closest instances from each class.
Bayes Net [8] - Given a data set with instances characterized by features $A_1, \dots, A_k$, the BN method assigns to a new instance with observed feature values $a_1$ through $a_k$ the most probable class value $c$, i.e. the value for which $P(C = c \mid A_1 = a_1 \wedge \dots \wedge A_k = a_k)$ is maximal.

Decision Tree [9] - DT is a widely used classification method in machine learning and data mining. The decision tree is grown by recursively splitting the training set based on a locally optimal criterion until all or most of the records belonging to each of the leaf nodes bear the same class label.

Instance-Based Learners - IBLs [10] classify an instance by comparing it to a set of pre-classified instances and choosing the dominant class among the most similar instances as the classification result.

Repeated Incremental Pruning to Produce Error Reduction - RIPPER [11] is a propositional rule learning algorithm that performs efficiently on large noisy data sets. It induces classification (if-then) rules from a set of pre-labeled instances and looks at the instances to find a set of rules that predict the class of earlier instances. It also allows users to specify constraints on the learned if-then rules to add prior knowledge about the concepts, in order to obtain more accurate hypotheses.

Multi-Layer Perceptrons - MLPs [11] are feedforward neural networks with one or two hidden layers, trained with the standard backpropagation algorithm. They can approximate virtually any input-output map and have been shown to approximate the performance of optimal statistical classifiers in difficult problems.

Fuzzy Neural Networks - FNNs [12] are connectionist structures that implement fuzzy rules and fuzzy inference. We use the Back Propagation (BP) algorithm to identify and express input-output relationships in the form of fuzzy rules, which in turn allows possible knowledge extraction by humans.

2.2. Toxicity Data Sets

For the purpose of evaluation, seven data sets from real-world applications are chosen.
Among these data sets, five, i.e. TROUT, ORAL_QUAIL, DAPHNIA, DIETARY_QUAIL and BEE, come from the DEMETRA project [13]; the APC data set is provided by the Central Science Laboratory (CSL), York, England [14]; and the PHENOLS data set comes from the TETRATOX database [15]. A random division of each data set into a training set and a testing set was carried out before evaluation. General information about these data sets is given in Table 1.

<Table 1>

In Table 1, the column titles have the following meanings: NI - Number of Instances; NF_FS - Number of Features after Feature Selection using a correlation-based method which identifies subsets of features that are highly correlated to the class [16]; NC - Number of Classes; CD - Class Distribution; CD_TR - Class Distribution of TRaining set; and CD_TE - Class Distribution of TEsting set.

3. Results

Experimental results of the different algorithms evaluated on these seven data sets are presented in Tables 2 and 3, where parameter LR for MLP stands for the learning rate and parameter k for IBL stands for the number of nearest neighbours used for classifying new instances. The learning rate is a parameter that controls the adjustment of connection strengths during the training process of a neural network [11]. The classification accuracies of the models created by each algorithm vary from one data set to another: some accuracies are relatively poor when compared to ‘benchmark’ data sets from the University of California at Irvine (UCI) machine learning repository [17]. The UCI machine learning repository is a collection of databases, domain theories and data generators that are used by the machine learning community for the empirical analysis of machine learning algorithms. We ran the same algorithms against some UCI data sets and found that the performances obtained are better on average than for the toxicity models [18].
This indicates that the seven toxicity data sets used in this paper, which are often noisy, unevenly distributed across the multi-dimensional attribute space and have a low ratio of instances (rows) to features (columns), make accurate class prediction difficult. In Tables 2 and 3, the classification accuracy is defined by eq. (1):

$$\text{Classification Accuracy} = \frac{\text{Number of test instances correctly classified by the model}}{\text{Total number of instances used for testing}} \qquad (1)$$

<Table 2>

In Tables 2 and 3, the figures in bold in each row represent the best classification accuracy for the data set named to the left. Table 2 helps identify the best model developed by the considered algorithms. Table 3 focuses on identifying the most suitable algorithm for developing good models for the data sets under consideration. Moreover, Table 2 reports accuracies for a single train/test split of the data (see Table 1), whereas the models in Table 3 were evaluated by ten-fold cross-validation: the data were automatically split ten times, with 90 percent of the toxicity data used for training and the remaining 10 percent for testing in each of the ten cases. The results reported in Table 3 are the average classification accuracy over the ten tests. This means that the models listed in Table 2 are more dependent on the division of the data sets than the models reported in Table 3. Consequently, the classification accuracies listed in Table 3 reflect more fairly the learning ability of each machine learning algorithm.

<Table 3>

Data set properties such as noisiness, uneven distribution and size can make creating accurate models difficult. As shown in Table 3, some algorithms appear more suitable for particular data sets, i.e. they obtain higher classification accuracy: IBL for BEE, SVM for PHENOLS and BN for APC. They exhibit higher than average accuracy compared to their results across all seven data sets.
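As an illustration of the evaluation protocol behind Table 3, the following Python sketch computes eq. (1) and averages it over ten train/test splits. It is only a sketch under stated assumptions: the function names, the shuffling seed and the majority-class stand-in learner are hypothetical and are not the tools used in the study; any of the seven algorithms could take the place of the stand-in.

```python
import random
from collections import Counter


def classification_accuracy(y_true, y_pred):
    """Eq. (1): correctly classified test instances / all test instances."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


def k_fold_indices(n, k=10, seed=0):
    """Shuffle indices 0..n-1 and deal them into k nearly equal folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]


def majority_class_model(train_labels):
    """Hypothetical stand-in learner: always predicts the commonest class."""
    most_common = Counter(train_labels).most_common(1)[0][0]
    return lambda instance: most_common


def cross_validated_accuracy(X, y, k=10, seed=0):
    """Average eq. (1) over k splits (about 90/10 train/test when k=10).

    Assumes len(y) >= k so that no fold is empty.
    """
    scores = []
    for test_idx in k_fold_indices(len(y), k, seed):
        held_out = set(test_idx)
        train_labels = [y[i] for i in range(len(y)) if i not in held_out]
        model = majority_class_model(train_labels)
        preds = [model(X[i]) for i in test_idx]
        scores.append(classification_accuracy([y[i] for i in test_idx], preds))
    return sum(scores) / len(scores)
```

Because every instance is used for testing exactly once, the averaged figure depends less on one particular division of the data than a single-split accuracy does, which is the distinction drawn between Tables 2 and 3.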
The per-data-set differences in Table 3 imply that careful algorithm selection can make the creation of accurate models more straightforward.

A case study of visual analysis [19] of the correlations of different descriptors to the class values of chemical compounds has been carried out on two data sets: PHENOLS and TROUT. Figures 1 and 2 show the three selected attributes that are most highly correlated to the class for these data sets. For PHENOLS the three selected attributes were Log P, magnitude of dipole moment and molecular weight, and the class is described by the mechanism of action. For TROUT the three selected attributes were the 3rd order valence corrected cluster molecular connectivity, specific polarity and Log D at pH9, and the class value is given by LC50 (mg/l) after 96 hours for the rainbow trout. Figure 1 (PHENOLS) shows a moderately good distribution of data, but lacks clearly defined boundaries between classes. In particular, Class 2 and Class 3 show a large amount of overlap in the lower portion of the graph. Figure 2 (TROUT) shows the same lack of boundaries between classes, but also shows an uneven distribution of data: a large cluster of data points from all three classes can be seen to the left of the graph, with only a small number of data points falling in the remaining attribute space. These factors contribute to the relatively low prediction accuracies obtained on these toxicity data sets. Whilst it is common practice to remove outliers from data sets with the intention of improving the prediction accuracy of models, the aim of this paper was not to create highly predictive models, but rather to investigate the probable causes of poor model performance; undoubtedly outliers are one such cause.

<Figure 1>

<Figure 2>

A further study of the implications of data quality for classification accuracy has been carried out. Two data sets, PHENOLS and ORAL_QUAIL, and six algorithms (BN, MLP, IBL, DT, RIPPER, and SVM) were considered in the experiment.
The top 20 descriptors from each data set with the highest correlation to class values were extracted using the feature selection method ReliefF [20] implemented in Weka, a free data mining software package (http://www.cs.waikato.ac.nz/~ml/weka) [9]. ReliefF is an extension of the Relief algorithm. Relief works only for two-class problems: it randomly samples an instance and locates its nearest neighbour from the same class and from the opposite class. The feature values of these nearest neighbours are compared to those of the sampled instance and used to update the relevance score of each feature. ReliefF extends this scheme to handle multi-class, noisy and incomplete data. Twenty models were created for each data set, with each model using the n descriptors most correlated to the class, where n varied from 1 to 20. Ten-fold cross-validated accuracies of these models are presented in Figures 3 and 4.

<Figure 3>

<Figure 4>

Figure 3 shows that increasing the number of descriptors used to build the models on the PHENOLS data set has little impact once the top 3-4 descriptors (1: an indicator variable for the presence of a 2- or 4-dihydroxy phenol (OH OH), 2: the maximum donor superdelocalisability, 3: Log P (calculated by the ACD software), 4: the number of elements in each molecule (Nelem)) are included. After this point the accuracies of the various algorithms vary by little more than 5%. This suggests that the first 4 descriptors of the PHENOLS data set have a high correlation to the class value, and that they are sufficient to describe the majority of variation within the data.

Figure 4 (ORAL_QUAIL data) shows that increasing the number of descriptors used to create a model can decrease the subsequent accuracy. This reflects the unreliability of the ORAL_QUAIL data set, i.e. a large amount of noise, less relevant descriptors etc.
The first 4-5 descriptors (1: SdsssP_acnt - Count of all ( > P = ) groups in molecule; 2: SdsssP - Sum of all ( -> P = ) E-state values; 3: SdS_acnt - Count of all ( = S ) groups in molecule; 4: SdS - Sum of all ( = S ) E-State values in molecule; 5: SssO_acnt - Count of all ( - O ) groups in molecule) of this data set appear to be sufficient for creating models, and including any further descriptors could lead to overfitting on the noisy and irrelevant data they contain.

4. Conclusions

The outcomes of our comparative study and experiments showed that single classifier-based models are not sufficiently discriminative for all the data sets considered, given the main characteristics of toxicity data (noisiness, uneven distribution and size). Case studies of a multiple classifier combination system [21] indicate that hybrid intelligent systems are worthy of further research in order to obtain better performance for specific applications in predictive toxicology data mining. This is because multiple classifier combination systems can manage complex class distributions by combining the different learning abilities of their component models. The authors would also speculate that model accuracy could be improved further by choosing a particular feature selection method based on the data set and algorithm used. The inclusion of more feature selection methods, such as kNNMFS [22] and ReliefF [20], is proposed as future work. The comparison of models created using different numbers of features highlights the need for care when using feature selection techniques. Reducing the number of descriptors in a data set is commonly accepted as a necessary step towards highly predictive, yet interpretable, models.
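To make the feature ranking step of Section 3 concrete, the following is a minimal Python sketch of the basic two-class Relief procedure summarised there: sample an instance, find its nearest neighbour of the same class and of the opposite class, and update each feature's relevance score. It is not the ReliefF implementation in Weka that was used for the experiments; the function names, the range-scaled Manhattan distance and the number of sampled instances are assumptions made for illustration.

```python
import random


def relief_scores(X, y, n_samples=100, seed=0):
    """Basic two-class Relief: reward features that differ on the nearest
    opposite-class neighbour and penalise those that differ on the nearest
    same-class neighbour."""
    n, d = len(X), len(X[0])
    # Per-feature ranges, used to scale differences into [0, 1].
    spans = []
    for j in range(d):
        col = [row[j] for row in X]
        spans.append((max(col) - min(col)) or 1.0)

    def diff(j, a, b):
        return abs(a[j] - b[j]) / spans[j]

    def distance(a, b):
        return sum(diff(j, a, b) for j in range(d))

    rng = random.Random(seed)
    weights = [0.0] * d
    for _ in range(n_samples):
        i = rng.randrange(n)
        same = [k for k in range(n) if k != i and y[k] == y[i]]
        other = [k for k in range(n) if y[k] != y[i]]
        if not same or not other:
            continue  # the sampled instance has no usable neighbour
        hit = min(same, key=lambda k: distance(X[i], X[k]))
        miss = min(other, key=lambda k: distance(X[i], X[k]))
        for j in range(d):
            delta = diff(j, X[i], X[miss]) - diff(j, X[i], X[hit])
            weights[j] += delta / n_samples
    return weights


def top_n_features(weights, n):
    """Indices of the n highest-scoring features, best first."""
    order = sorted(range(len(weights)), key=lambda j: weights[j], reverse=True)
    return order[:n]
```

Under these assumptions, calling top_n_features(relief_scores(X, y), n) for n from 1 to 20 would give the nested descriptor subsets on which models could then be trained and compared, in the spirit of Figures 3 and 4.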
Table 1. General information about toxicology data sets

Data sets       NI    NF_FS  NC  CD               CD_TR            CD_TE
TROUT           282   22     3   129:89:64        109:74:53        20:15:11
ORAL_QUAIL      116   8      4   4:28:24:60       3:24:19:51       1:4:5:9
DAPHNIA         264   20     4   122:65:52:25     105:53:43:21     17:12:9:4
DIETARY QUAIL   123   12     5   8:37:34:34:10    7:31:28:29:8     1:6:6:5:2
BEE             105   11     5   13:23:13:42:14   12:18:11:35:12   1:5:2:7:2
PHENOLS         250   11     3   61:152:37        43:106:26        18:46:11
APC             60    6      4   17:16:16:11      12:12:12:9       5:4:4:2

Table 2. Classification accuracies of different algorithms on seven data sets

                Average classification accuracy of data sets
Data sets       BN      MLP     LR    IBL     K    DT      RIPPER  SVM     FNN
TROUT           56.52   65.22   0.3   63.04   5    56.52   54.35   60.87   50.00
ORAL_QUAIL      47.37   47.37   0.3   47.37   5    47.37   42.10   47.37   47.37
DAPHNIA         47.62   54.76   0.3   64.29   5    45.24   57.14   52.38   57.14
DIETARY QUAIL   40.00   70.00   0.9   60.00   10   45.00   40.00   55.00   40.00
BEE             58.82   58.82   0.9   70.59   1    58.82   58.82   58.82   47.06
PHENOLS         70.67   86.67   0.3   73.33   5    77.33   72.00   78.67   73.33
APC             40.00   53.33   0.9   53.33   5    53.33   46.67   46.67   40.00
Average         51.57   62.31   /     61.71   /    54.80   53.76   57.11   50.70

Table 3. Classification accuracies of different algorithms on seven data sets using ten-fold cross validation

                Average Classification Accuracy of ten-fold
Data sets       BN      MLP     LR    IBL     K    DT      RIPPER  SVM     FNN
TROUT           61.70   58.16   0.9   59.93   5    55.32   56.74   62.06   59.79
ORAL_QUAIL      62.07   51.72   0.3   57.76   5    62.93   60.34   65.52   55.27
DAPHNIA         50.38   53.41   0.3   54.17   5    50.00   50.00   54.55   50.00
DIETARY QUAIL   42.28   55.28   0.3   48.78   5    45.53   39.84   48.78   37.50
BEE             49.52   51.43   0.3   58.09   5    45.71   46.67   53.33   55.89
PHENOLS         76.40   78.40   0.3   74.80   10   74.40   76.40   80.00   72.67
APC             58.33   40.00   0.3   43.33   5    43.33   40.00   43.33   40.00
Average         57.24   55.49   /     56.69   /    53.89   52.86   58.22   53.02

Figures

Figure 1: Three attributes most correlated to class in PHENOLS data set
Figure 2: Three attributes most correlated to class in TROUT data set
Figure 3: Performances for PHENOLS
Figure 4: Performances for ORAL_QUAIL