A COMPARATIVE STUDY OF MACHINE LEARNING
ALGORITHMS APPLIED TO PREDICTIVE TOXICOLOGY DATA
MINING
Neagu C.D.*, Guo G.*, Trundle P.R.* and Cronin M.T.D.**
*Department of Computing, University of Bradford, Bradford, BD7 1DP, UK
{D.Neagu, G.Guo, P.R.Trundle}@bradford.ac.uk
**School of Pharmacy and Chemistry, Liverpool John Moores University, L3 3AF, UK
M.T.Cronin@ljmu.ac.uk
Abstract: This paper reports the results of a comparative study of widely used machine learning algorithms applied to predictive toxicology data mining. The machine learning algorithms involved were chosen for their representativeness and diversity, and were extensively evaluated on seven toxicity data sets drawn from real-world applications. We highlight experimental results based on visual analysis of the correlations between different descriptors and the class values of chemical compounds, and on the relationship between the number of descriptors chosen and the performance of the machine learning algorithms. Some interesting findings on data and model quality are presented: no single algorithm appears best for all seven toxicity data sets, and up to five descriptors are sufficient to create classification models with good accuracy for each toxicity data set. We suggest that, for a specific data set, model accuracy is affected by the feature selection method and the model development technique. Models built with too many or too few descriptors are both undesirable, and finding the optimal feature subset appears at least as important as selecting appropriate algorithms with which to build a final model.
Keywords: predictive toxicology, data mining, algorithm, visual analysis, feature selection
1. Introduction
The increasing amount and complexity of data used in predictive toxicology calls for new and flexible
approaches to mine the data. Traditional manual data analysis has become inefficient and computer-based
analysis is indispensable. Statistical methods [1], expert systems [2], fuzzy neural networks [3], other machine
learning algorithms [4, 5] are extensively studied and applied to predictive toxicology for model development
and decision making. However, due to the complexity of modelling existing toxicity data sets caused by
numerous irrelevant descriptors, skewed distribution, missing values and noisy data, no dominant machine
learning algorithm can be proposed to model accurately all the toxicity data sets available. This motivated us
to conduct a comparative study of machine learning algorithms applied to seven toxicity data sets. The
intention of this study was to assess the applicability of some widely used machine learning algorithms to the toxicity data sets at hand. For this purpose, seven machine learning algorithms, described in the next section, were chosen for this comparative study in terms of their representativeness and diversity, and a library of models was built in order to provide useful model benchmarks for researchers working in this area.
2. Methods
2.1. Machine Learning Algorithms
Seven algorithms were chosen for this study in terms of their representability, i.e. their ability to learn numerical data, as reported by the machine learning community [6]. They were also chosen in terms of their diversity, i.e. the differing ways in which they learn from data and represent the final models [6].
A brief introduction to the seven machine learning algorithms applied in this study is given below:

Support Vector Machine [7] - SVM is based on the Structural Risk Minimization principle from statistical learning theory. Given a training set in a vector space, SVM finds the best decision hyperplane separating the instances of the two classes. The quality of a decision hyperplane is determined by the distance (referred to as the margin) between two hyperplanes that are parallel to the decision hyperplane and touch the closest instances from each class.
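To make the margin concrete, here is a minimal Python sketch (illustrative only; the study used existing SVM implementations, and the hyperplane below is supplied by hand rather than learned) that computes the geometric margin of a fixed separating hyperplane on a toy two-class set:

```python
import math

def geometric_margin(w, b, points):
    """Distance from the closest labelled point to the hyperplane
    w.x + b = 0. For a separating hyperplane, SVM training seeks the
    w, b that maximise this quantity."""
    norm = math.sqrt(sum(wi * wi for wi in w))
    dists = [abs(sum(wi * xi for wi, xi in zip(w, x)) + b) / norm
             for x, _label in points]
    return min(dists)

# Toy 2-D set: class +1 above the line x2 = x1, class -1 below it.
pts = [((0.0, 2.0), +1), ((1.0, 3.0), +1),
       ((2.0, 0.0), -1), ((3.0, 1.0), -1)]
# Candidate hyperplane x1 - x2 = 0, i.e. w = (1, -1), b = 0.
print(geometric_margin((1.0, -1.0), 0.0, pts))  # → 1.4142… (sqrt 2)
```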

Bayes Net [8] - Given a data set with instances characterized by features A1, ..., Ak, the BN method assigns to a new instance with observed feature values a1, ..., ak the most probable class value c, i.e. the value for which P(C = c | A1 = a1, ..., Ak = ak) is maximal.
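The argmax rule above can be sketched as follows, using the naive simplification that features are independent given the class (a hypothetical toy example with invented feature names, not the full Bayes Net method of [8]):

```python
from collections import Counter

def naive_bayes_predict(train, x):
    """Return the class c maximising P(c) * prod_i P(a_i | c).

    `train` is a list of (feature_tuple, class) pairs. Conditional
    independence of features given the class is assumed (the naive
    simplification of the Bayesian rule in the text)."""
    classes = Counter(c for _, c in train)
    n = len(train)
    best, best_p = None, -1.0
    for c, nc in classes.items():
        p = nc / n
        for i, a in enumerate(x):
            match = sum(1 for f, cc in train if cc == c and f[i] == a)
            p *= (match + 1) / (nc + 2)   # add-one smoothing
        if p > best_p:
            best, best_p = c, p
    return best

# Hypothetical descriptors: (dose level, solubility) -> class.
train = [(("high", "soluble"), "toxic"), (("high", "insoluble"), "toxic"),
         (("low", "soluble"), "safe"), (("low", "insoluble"), "safe")]
print(naive_bayes_predict(train, ("high", "soluble")))  # → 'toxic'
```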

Decision Tree [9] - DT is a widely used classification method in machine learning and data mining. The
decision tree is grown by recursively splitting the training set based on a locally optimal criterion until all
or most of the records belonging to each of the leaf nodes bear the same class label.
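The locally optimal splitting criterion can be illustrated with a one-level sketch (misclassification count as the criterion; real decision tree learners such as [9] use information gain, but the recursion applies the same pattern at each node):

```python
def best_split(rows):
    """Exhaustively pick the (feature, threshold) pair whose binary
    split misclassifies the fewest training rows -- the locally
    optimal choice a decision tree makes before recursing on each
    side. `rows` is a list of (feature_vector, class_label) pairs."""
    def errors(group):
        # Majority-vote error if this group became a leaf node.
        if not group:
            return 0
        labels = [c for _, c in group]
        return len(labels) - max(labels.count(l) for l in set(labels))

    best = None
    for f in range(len(rows[0][0])):
        for thr in sorted({x[f] for x, _ in rows}):
            left = [r for r in rows if r[0][f] <= thr]
            right = [r for r in rows if r[0][f] > thr]
            err = errors(left) + errors(right)
            if best is None or err < best[0]:
                best = (err, f, thr)
    return best[1], best[2]

rows = [((1.0, 5.0), "A"), ((2.0, 6.0), "A"),
        ((8.0, 5.5), "B"), ((9.0, 6.5), "B")]
print(best_split(rows))  # → (0, 2.0): feature 0 separates A from B
```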

Instance-Based Learners - IBLs [10] classify an instance by comparing it to a set of pre-classified instances, choosing the dominant class among the most similar instances as the classification result.
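A minimal sketch of this scheme (Euclidean distance and majority vote; the toy data and the value of k are illustrative):

```python
from collections import Counter
import math

def knn_classify(train, x, k=3):
    """Majority class among the k training instances nearest to x
    (Euclidean distance) -- the instance-based scheme sketched above."""
    nearest = sorted(train, key=lambda r: math.dist(r[0], x))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

train = [((0.1, 0.2), "low"), ((0.2, 0.1), "low"),
         ((0.9, 0.8), "high"), ((0.8, 0.9), "high"),
         ((0.85, 0.85), "high")]
print(knn_classify(train, (0.8, 0.8), k=3))  # → 'high'
```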

Repeated Incremental Pruning to Produce Error Reduction - RIPPER [11] is a propositional rule learning algorithm that performs efficiently on large, noisy data sets. It induces classification (if-then) rules from a set of pre-labelled instances, searching for a rule set that predicts the classes of the training instances well. It also allows users to specify constraints on the learned if-then rules to add prior knowledge about the concepts, in order to obtain more accurate hypotheses.

Multi-Layer Perceptrons - MLPs [11] are feedforward neural networks with one or two hidden layers,
trained with the standard backpropagation algorithm. They can approximate virtually any input-output
map and have been shown to approximate the performance of optimal statistical classifiers in difficult
problems.

Fuzzy Neural Networks – FNNs [12] are connectionist structures that implement fuzzy rules and fuzzy
inference. We use the Back Propagation (BP) algorithm to identify and express input-output relationships
in the form of fuzzy rules, thus leading further to possible knowledge extraction by humans.
2.2. Toxicity Data Sets
For the purpose of evaluation, seven data sets from real-world applications were chosen. Five of them (TROUT, ORAL_QUAIL, DAPHNIA, DIETARY_QUAIL and BEE) come from the DEMETRA project [13]; the APC data set was provided by the Central Science Laboratory (CSL), York, England [14]; and the PHENOLS data set comes from the TETRATOX database [15]. Each data set was randomly divided into a training set and a testing set before evaluation. General information about these data sets is given in Table 1.
<Table 1>
In Table 1, the column headings are as follows: NI - Number of Instances; NF_FS - Number of Features after Feature Selection using a correlation-based method that identifies subsets of features highly correlated to the class [16]; NC - Number of Classes; CD - Class Distribution; CD_TR - Class Distribution of the TRaining set; CD_TE - Class Distribution of the TEsting set.
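The single random train/test division described above can be sketched as below (the seed and test fraction are illustrative; the paper's actual splits are those summarized in Table 1):

```python
import random

def split_data(rows, test_fraction=0.1, seed=0):
    """Randomly divide a data set into training and testing portions,
    as done once per data set before evaluation. The seed and fraction
    here are placeholders, not the values used in the study."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)
    cut = int(len(rows) * (1 - test_fraction))
    return rows[:cut], rows[cut:]

# A TROUT-sized example: 282 instances, 46 held out for testing
# (matching the 20:15:11 = 46 test instances of Table 1).
train, test = split_data(range(282), test_fraction=0.16)
print(len(train), len(test))  # → 236 46
```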
3. Results
Experimental results of different algorithms evaluated on these seven data sets are presented in Tables 2 and
3, where parameter LR for MLP stands for learning rate and parameter k for IBL stands for the number of
nearest neighbours used for classifying new instances. The learning rate is a parameter to control the
adjustment of connections strength during the training process of a neural network [11]. The classification
accuracies of the models created by each algorithm vary between data sets: some accuracies are relatively
poor when compared to ‘benchmark’ data sets from the University of California at Irvine (UCI) machine
learning repository [17]. The UCI machine learning repository is a collection of databases, domain theories
and data generators that are used by the machine learning community for the empirical analysis of machine
learning algorithms. We ran the same algorithms against some UCI data sets and found that performances
obtained are better on average than for the toxicity models [18]. This indicates that the data from the seven
toxicity data sets used in this paper, which are often noisy, unevenly distributed across the multi-dimensional
attribute space and have a low ratio of instances (rows) to features (columns), can make accurate class
predictions difficult. In Tables 2 and 3, the classification accuracy is defined by eq. (1):
Classification Accuracy = (Number of test instances correctly classified by the model)
                          / (Total number of instances used for testing)          (1)
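Eq. (1) translates directly into code; a minimal sketch:

```python
def classification_accuracy(predictions, truths):
    """Eq. (1): number of test instances correctly classified by the
    model, divided by the total number of instances used for testing."""
    correct = sum(p == t for p, t in zip(predictions, truths))
    return correct / len(truths)

# Illustration: 26 of 46 test compounds correct gives 56.52 %,
# the order of magnitude of the figures reported in Table 2.
print(round(100 * classification_accuracy([1] * 26 + [0] * 20, [1] * 46), 2))
```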
<Table 2>
In Tables 2 and 3, the figures in bold in each row represent the best classification accuracy for the data
set named to the left. Table 2 helps identify the best model developed by the considered algorithms. Table 3
focuses on identification of the most suitable algorithm to develop good models for the data sets under
consideration. Moreover, Table 2 reports accuracies for a single train/test split of the data (see Table 1), whereas the data used for the models in Table 3 were automatically split 90/10 ten times (ten-fold cross-validation): 90 percent of the toxicity data were used for training and the remaining 10 percent for testing in each of the 10 runs. The results reported in Table 3 are the average classification accuracy over the 10 tests. This means that the models listed in Table 2 are more dependent on the division of the data sets than the models reported in Table 3. Consequently, the classification accuracies listed in Table 3 reflect more fairly the learning ability of each machine learning algorithm.
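The ten-fold scheme behind Table 3 can be sketched as follows (the fold assignment and the placeholder scorer are illustrative, not the Weka procedure used in the paper):

```python
def cross_validate(rows, train_and_score, folds=10):
    """Average accuracy over `folds` train/test splits: each fold is
    held out once for testing while the remaining rows train the
    model. `train_and_score(train, test)` is any routine returning an
    accuracy for a model trained on `train` and tested on `test`."""
    scores = []
    for i in range(folds):
        test = rows[i::folds]            # every folds-th row, offset i
        train = [r for j, r in enumerate(rows) if j % folds != i]
        scores.append(train_and_score(train, test))
    return sum(scores) / folds

# Trivial scorer just to exercise the loop: "accuracy" = test-set size.
rows = list(range(25))
print(cross_validate(rows, lambda tr, te: len(te)))  # → 2.5
```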
<Table 3>
Data set properties such as noisiness, uneven distribution and size can make creating accurate models difficult. As shown in Table 3, some algorithms appear more suitable for particular data sets, i.e. obtain higher classification accuracy: IBL for BEE, SVM for PHENOLS and BN for APC. These exhibit higher-than-average accuracy compared with their results across all seven data sets. This implies that careful algorithm selection can make the creation of accurate models more straightforward.
A case study of visual analysis [19] of the correlations between different descriptors and the class values of
chemical compounds has been carried out on two data sets: PHENOLS and TROUT. Figures 1 and 2 show
three selected attributes that are the most highly correlated to the class for these data sets. For PHENOLS the
three selected attributes were Log P, magnitude of dipole moment and molecular weight and the class is
described by the mechanism of action. For TROUT the three selected attributes were the 3rd order valence
corrected cluster molecular connectivity, specific polarity and Log D at pH9 and the class value is given by
LC50 (mg/l) after 96 hours for the rainbow trout. Figure 1 (PHENOLS) shows a moderately good distribution
of data, but lacks clearly defined boundaries between classes. In particular, Class 2 and Class 3 show a large
amount of overlap in the lower portion of the graph. Figure 2 (TROUT) shows the same lack of boundaries
between classes, but also shows an uneven distribution of data: a large cluster of data-points from all three
classes can be seen to the left of the graph, with only a small amount of data-points falling in the remaining
attribute space. These factors contribute to the relatively low prediction accuracies obtained on these toxicity
data sets. Whilst it is common practice to remove outliers from data sets with the intention of improving the
prediction accuracy of models, the aim of this paper was not to create highly predictive models, but rather to
investigate the probable causes of poor model performance; undoubtedly outliers are one such cause.
<Figure 1>
<Figure 2>
A further study of the implications of data quality for classification accuracy was carried out. Two data sets, PHENOLS and ORAL_QUAIL, and six algorithms (BN, MLP, IBL, DT, RIPPER and SVM) were considered in the experiment. The top 20 descriptors from each data set with the highest correlation to the class values were extracted using the feature selection method ReliefF [20] implemented in Wekaa. ReliefF is an extension of the Relief algorithm, which works only for binary classification problems: Relief randomly samples an instance and locates its nearest neighbours from the same and the opposite class; the feature values of these nearest neighbours are compared to those of the sampled instance and used to update the relevance score of each feature. ReliefF extends this scheme to handle multi-class, noisy and incomplete data. Twenty models were created for
each data set, with each model using the n most correlated descriptors to the class – where n varied from 1 to
20. 10-fold cross validated accuracies of these models are presented in Figures 3 and 4.
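A sketch of the two-class Relief update described above (pure Python, illustrative only; the ReliefF variant used in Weka additionally averages over several neighbours and handles multiple classes and missing values):

```python
import math
import random

def relief_weights(rows, seed=0):
    """Binary Relief: for each instance, find its nearest hit (same
    class) and nearest miss (opposite class), then move each feature's
    relevance score down by the hit difference and up by the miss
    difference. `rows` is a list of (feature_tuple, class) pairs with
    at least two instances per class."""
    rng = random.Random(seed)          # kept for the sampling variant
    n_feat = len(rows[0][0])
    w = [0.0] * n_feat
    for x, c in rows:
        same = [r for r in rows if r[1] == c and r[0] is not x]
        diff = [r for r in rows if r[1] != c]
        hit = min(same, key=lambda r: math.dist(r[0], x))[0]
        miss = min(diff, key=lambda r: math.dist(r[0], x))[0]
        for f in range(n_feat):
            w[f] += abs(x[f] - miss[f]) - abs(x[f] - hit[f])
    return w

# Feature 0 separates the classes; feature 1 is noise.
rows = [((0.0, 0.3), 0), ((0.1, 0.9), 0), ((1.0, 0.8), 1), ((0.9, 0.2), 1)]
w = relief_weights(rows)
print(w[0] > w[1])  # → True: the informative feature scores higher
```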
<Figure 3>
<Figure 4>
Figure 3 shows that increasing the number of descriptors used to build the models on the PHENOLS data set has little impact once the top 3-4 descriptors (1: an indicator variable for the presence of a 2- or 4-dihydroxy phenol (OH OH); 2: the maximum donor superdelocalisability; 3: Log P (calculated by the ACD software); 4: the number of elements in each molecule (Nelem)) are included. Beyond this point the accuracies of the various algorithms vary by little more than 5%. This suggests that the first 4 descriptors of the PHENOLS data set have a high correlation to the class value, and that they are sufficient to describe the majority of the variation within the data.
a Weka: free data mining software: http://www.cs.waikato.ac.nz/~ml/weka [9]
Figure 4 (ORAL_QUAIL data) shows that increasing the number of descriptors used to create a model can decrease the subsequent accuracy. This reflects the unreliability of the ORAL_QUAIL data set, i.e. a large amount of noise, less relevant descriptors, etc. The first 4-5 descriptors (1: SdsssP_acnt - count of all (->P=) groups in the molecule; 2: SdsssP - sum of all (->P=) E-state values; 3: SdS_acnt - count of all (=S) groups in the molecule; 4: SdS - sum of all (=S) E-state values in the molecule; 5: SssO_acnt - count of all (-O) groups in the molecule) of this data set appear to be sufficient for creating models, and including any further descriptors could lead to overfitting on the noisy and irrelevant data they contain.
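The experiment behind Figures 3 and 4 (rebuild a model on the n best-ranked descriptors for n = 1..20) can be sketched as follows; `train_and_score` is a placeholder for any model-building routine, not the Weka runs used in the paper:

```python
def top_n_accuracies(rows, ranking, train_and_score, max_n=20):
    """For n = 1..max_n, project the data onto its n best-ranked
    descriptors and rebuild a model, recording the accuracy each time.

    `ranking` lists feature indices from most to least relevant (e.g.
    as produced by ReliefF); `train_and_score(projected_rows)` returns
    an accuracy for a model built on the projected data."""
    results = {}
    for n in range(1, min(max_n, len(ranking)) + 1):
        keep = ranking[:n]
        projected = [([x[f] for f in keep], c) for x, c in rows]
        results[n] = train_and_score(projected)
    return results

# Toy data and a dummy scorer that just reports how many descriptors
# each rebuilt model received.
rows = [((1.0, 2.0, 3.0), "A"), ((4.0, 5.0, 6.0), "B")]
res = top_n_accuracies(rows, ranking=[2, 0, 1],
                       train_and_score=lambda r: len(r[0][0]))
print(res)  # → {1: 1, 2: 2, 3: 3}
```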
4. Conclusions
The outcomes of our comparative study and experiments show that single-classifier models are not sufficiently discriminative for all the data sets considered, given the main characteristics of toxicity data (noisiness, uneven distribution and size). Case studies of a multiple classifier combination system [21] indicate that hybrid intelligent systems are worthy of further research in order to obtain better performance for specific applications in predictive toxicology data mining, since multiple classifier combination systems can manage complex class distributions by combining the learning abilities of different models.
The authors would also speculate that model accuracy could be improved further by choosing a particular feature selection method based on the data set and algorithm used. The inclusion of more feature selection methods, e.g. kNNMFS [22] and ReliefF [20], is proposed as future work.
The comparison of models created using different numbers of features highlights the need for care when
using feature selection techniques. Reducing the number of descriptors in a data set is commonly accepted as
a necessary step towards highly predictive, yet interpretable, models. However, as our results show, an
optimum number of descriptors exists, at least for the data sets used here. Models built with too many or too
few descriptors are both undesirable, and finding the optimal feature subset appears at least as important as
selecting appropriate algorithms with which to build a final model.
Acknowledgements
This work is partially supported by the EU FP5 project DEMETRA (http://www.demetra-tox.net). DN, MC and GG acknowledge the support of the EPSRC project PYTHIA GR/T02508/01 (http://pythia.inf.brad.ac.uk). PT acknowledges the support of the EPSRC + CSL grant.
References
1. Eriksson, L., Johansson, E. & Lundstedt, T. (2004). Regression- and Projection-based Approaches in Predictive Toxicology. In Predictive Toxicology (ed. C. Helma). Marcel Dekker, New York.
2. Parsons, S. & McBurney, P. (2004). The Use of Expert Systems for Toxicology Risk Prediction. In Predictive Toxicology (ed. C. Helma). Marcel Dekker, New York.
3. Mazzatorta, P., Benfenati, E., Neagu, D. & Gini, G. (2000). Tuning Neural and Fuzzy-neural Networks for Toxicity Modelling. Journal of Chemical Information and Computer Sciences, American Chemical Society, Washington, 42 (5), 1250-1255.
4. Craciun, M.V., Neagu, D., Craciun, C.A. & Smiesko, M. (2004). A Study of Supervised and Unsupervised Machine Learning Methodologies for Predictive Toxicology. In Intelligent Systems in Medicine (ed. H.N. Teodorescu), pp. 61-69. Performantica, Iasi, Romania.
5. Guo, G. & Neagu, D. (2005). Fuzzy kNNModel Applied to Predictive Toxicology Data Mining. Journal of Computational Intelligence and Applications, Imperial College Press, 5 (3), 1-13.
6. Caruana, R. & Niculescu-Mizil, A. (2006). An Empirical Comparison of Supervised Learning Algorithms. In Procs. of ICML 2006, pp. 161-168.
7. Cristianini, N. & Shawe-Taylor, J. (2000). An Introduction to Support Vector Machines (and Other Kernel-based Learning Methods). Cambridge University Press.
8. Cooper, G.F. & Herskovits, E. (1992). A Bayesian Method for the Induction of Probabilistic Networks from Data. Machine Learning, Kluwer Academic Publishers, 9, 309-347.
9. Quinlan, J.R. (1986). Induction of Decision Trees. Machine Learning, Kluwer Academic Publishers, 1 (1), 81-106.
10. Aha, D.W., Kibler, D. & Albert, M.K. (1991). Instance-based Learning Algorithms. Machine Learning, Kluwer Academic Publishers, 6, 37-66.
11. Witten, I.H. & Frank, E. (2000). Data Mining: Practical Machine Learning Tools with Java Implementations. Morgan Kaufmann, San Francisco.
12. Liu, P. & Li, H. (2004). Fuzzy Neural Network Theory and Application. Series in Machine Perception and Artificial Intelligence, 59.
13. Website of the EU FP5 Quality of Life project DEMETRA QLRT-2001-00691: Development of Environmental Modules for Evaluation of Toxicity of Pesticide Residues in Agriculture, 2001-2006: http://www.demetra-tox.net
14. Project CSL: Development of Artificial Intelligence-based In-silico Toxicity Models for Use in Pesticide Risk Assessment, 2004-2007.
15. Schultz, T.W. (1997). TETRATOX: Tetrahymena pyriformis Population Growth Impairment Endpoint - A Surrogate for Fish Lethality. Toxicol. Methods, 7, 289-309.
16. Hall, M.A. (1998). Correlation-based Feature Subset Selection for Machine Learning. PhD Thesis, University of Waikato.
17. Newman, D.J., Hettich, S., Blake, C.L. & Merz, C.J. (1998). UCI Repository of Machine Learning Databases: http://www.ics.uci.edu/~mlearn/MLRepository.html. Irvine, CA: University of California, Department of Information and Computer Science.
18. Shhab, A., Guo, G. & Neagu, D. (2005). A Study on Applications of Machine Learning Techniques in Data Mining. In Procs. of the 22nd BNCOD Workshop on Data Mining and Knowledge Discovery in Databases (eds. D. Nelson, S. Stirk, H. Edwards, K. McGarry), pp. 131-138. University of Sunderland Press.
19. Keim, D.A. (2002). Information Visualization and Visual Data Mining. IEEE Transactions on Visualization and Computer Graphics, 8 (1), 1-8.
20. Kononenko, I. (1994). Estimating Attributes: Analysis and Extension of Relief. In Procs. of ECML-94, the Seventh European Conference on Machine Learning, Springer-Verlag, pp. 171-182.
21. Neagu, D. & Guo, G. (2006). An Effective Combination based on Class-wise Expertise of Diverse Classifiers for Predictive Toxicology Data Mining. In Procs. of ADMA 2006, Springer Berlin/Heidelberg, LNAI 4093, 165-172.
22. Guo, G., Neagu, D. & Cronin, M.T.D. (2005). Using kNN Model for Automatic Feature Selection. In Procs. of ICAPR 2005, Springer Berlin/Heidelberg, LNCS 3686, 410-419.
Tables
Table 1. General information about toxicology data sets
Data sets       NI   NF_FS  NC  CD              CD_TR           CD_TE
TROUT           282  22     3   129:89:64       109:74:53       20:15:11
ORAL_QUAIL      116  8      4   4:28:24:60      3:24:19:51      1:4:5:9
DAPHNIA         264  20     4   122:65:52:25    105:53:43:21    17:12:9:4
DIETARY_QUAIL   123  12     5   8:37:34:34:10   7:31:28:29:8    1:6:6:5:2
BEE             105  11     5   13:23:13:42:14  12:18:11:35:12  1:5:2:7:2
PHENOLS         250  11     3   61:152:37       43:106:26       18:46:11
APC             60   6      4   17:16:16:11     12:12:12:9      5:4:4:2
Table 2. Classification accuracies of different algorithms on seven data sets
Data sets       BN     MLP    (LR)  IBL    (k)  DT     RIPPER  SVM    FNN
TROUT           56.52  65.22  0.3   63.04  5    56.52  54.35   60.87  50.00
ORAL_QUAIL      47.37  47.37  0.3   47.37  5    47.37  42.10   47.37  47.37
DAPHNIA         47.62  54.76  0.3   64.29  5    45.24  57.14   52.38  57.14
DIETARY_QUAIL   40.00  70.00  0.9   60.00  10   45.00  40.00   55.00  40.00
BEE             58.82  58.82  0.9   70.59  1    58.82  58.82   58.82  47.06
PHENOLS         70.67  86.67  0.3   73.33  5    77.33  72.00   78.67  73.33
APC             40.00  53.33  0.9   53.33  5    53.33  46.67   46.67  40.00
Average         51.57  62.31  /     61.71  /    54.80  53.76   57.11  50.70
Table 3. Classification accuracies of different algorithms on seven data sets using ten-fold cross validation
Data sets       BN     MLP    (LR)  IBL    (k)  DT     RIPPER  SVM    FNN
TROUT           61.70  58.16  0.9   59.93  5    55.32  56.74   62.06  59.79
ORAL_QUAIL      62.07  51.72  0.3   57.76  5    62.93  60.34   65.52  55.27
DAPHNIA         50.38  53.41  0.3   54.17  5    50.00  50.00   54.55  50.00
DIETARY_QUAIL   42.28  55.28  0.3   48.78  5    45.53  39.84   48.78  37.50
BEE             49.52  51.43  0.3   58.09  5    45.71  46.67   53.33  55.89
PHENOLS         76.40  78.40  0.3   74.80  10   74.40  76.40   80.00  72.67
APC             58.33  40.00  0.3   43.33  5    43.33  40.00   43.33  40.00
Average         57.24  55.49  /     56.69  /    53.89  52.86   58.22  53.02
Figures
Figure 1: Three attributes most correlated to class in PHENOLS data set
Figure 2: Three attributes most correlated to class in TROUT data set
Figure 3: Performances for PHENOLS
Figure 4: Performances for ORAL_QUAIL