A Comparative Study for Various Methods of Classification Peiman Mamani Barnaghi

advertisement
2012 International Conference on Information and Computer Networks (ICICN 2012)
IPCSIT vol. 27 (2012) © (2012) IACSIT Press, Singapore
A Comparative Study for Various Methods of Classification
Peiman Mamani Barnaghi+,Vahid Alizadeh Sahzabi and Azuraliza Abu Bakar
Department of Artificial Intelligence
National University of Malaysia (UKM)
Bangi, Malaysia
{peiman.barnaghi,vahid.alizadeh,aab}@ftsm.ukm.my
Abstract. This paper discusses data mining techniques to process a medical dataset and identify the
relevance of liver disorder and drinking alcohol drink by classification of blood test data. We have used four
different classification methods including decision tree, Bayesian algorithms (Naive Bayes and Bayesian
Networks), Neural Network classification and Rough Sets methods. To evaluate the methods, we have used
the Waikato Environment for Knowledge Analysis (WEKA) open source tool. WEKA is a collection of
machine learning algorithms that can be used for different processing tasks such as classification, and
clustering. Bayesian algorithms and Neural Network classification methods are implemented with WEKA.
However, as WEKA does not support methods based on Rough Sets we have used Rosetta. We have
provided an evaluation based on applying these classification methods to our dataset and measuring the
accuracy of test results. The evaluation results show that using Neural Networks obtains the best result
among the other methods.
Keywords: Classification, Decision tree methods, Bayesian algorithms, Neural Network, Rough Sets.
1. Introduction
In this paper we process a blood test dataset and use different classification methods to learn from the
test data set and develop a system that is able to identify the existing of a liver disorder by processing the
blood test data. The Liver disorder dataset consists of a set of blood test results that are used to describe the
liver disorders that might arise from extreme alcohol consumption and find any relationship between alcohol
consumption and liver disease. This dataset has some attributes that show values depend on red blood cell
volume, hydrolase enzyme, Serum glutamate primitive transaminase; that high level of this kind of enzyme
released into the blood may be a sign of liver damage, Aspartate transaminase that it is another enzyme
associated with liver parenchyma cells, “Gamma-glutamyl transpeptidase” An enzyme that catalyzes the
transfer of a γ -glutamyl group from glutathione or γ-glutamyl peptide to another peptide or amino acid; that
high level value is the best single screening assay for detecting latent or chronic liver disease.
The Waikato Environment for Knowledge Analysis (WEKA) is a collection of machine learning
algorithms that can be used for processing tasks such as classification and clustering. In this paper, we use
decision tree methods that contain J48 that is implemented of C.4.5 and LMT type. Bayes algorithms and
Neural Network classification methods, Naive Bayes and Bayesian methods (i.e. MLP and RBF) are
implemented in WEKA. We used Rosetta for implementing Rough Set methods. We divided our data set to
training data (66%) and all of methods tested and compared according 34% of this data.
2. The classification methods
+
Corresponding author. Tel. (+6)012 241 31 68
E-mail address: peiman.barnaghi@gmail.com
62
This section describes the classification methods are used in this paper. We discuss each method and
explain how the method has been used in our experiment.
2.1.
Decision tree
Decision Trees (DT) tree learning algorithms work based on processing and deciding upon attributes of
the data. Attributes in DT are nodes and each leaf node is representing a classification. Two algorithms
namely J48 and LMT were used in our experiments [1-2]. LMT is a method based on DT that final nodes are
replaced with logistic regression functions. After modelling result, comparison of results analyzes shows the
approach that provides a better performance. J48 is an implementation of C4.5 in WEKA. C4.5 uses
information entropy concept [3]. The disadvantages of DT are focus on continues attributes, computational
efficiently with growing tree size. According to comparison provided for different classification methods in
emotion recognition [4], DT is the best classifier method on that group with 15 attributes.
2.2.
Artificial Neural Networks
Artificial Neural Networks (ANN) are one of the common classification methods in data mining. To
employ Neural Network based classifiers, Multi Layer Perceptron (MLP) and Radial Base Function (RBF)
were used in this work. MLP is a feed forward network that makes a model to map input data to output data.
Hidden layer in MLP can include various layers between input and output. The structure of MLP is shown in
Figure 1. RBF is another type on ANN. The input of NN in RBF is linear and the output is nonlinear. The
output of this type of ANN is taken from weighted sum of hidden layer’s output. The RBF networks are
divided in two feed-forward layer. Figure 2 is illustrates the structure of this networks (The figure are
adapted from [5]).
Fig 1. Multi Layer Perception schema
Fig 2. Radial Base network
In [5], Avellaneda, et al compared four types of Neural Network classifiers. The experiment data
included 700 records with nine attributes. Avellaneda, et al reported that that MLP obtained the best result
in their experiment.
2.3.
Rough Sets
Rough set method is another classification technique in data mining. This section describes some basic
concepts and features of rough set theory that are important to describe the classification method. A suite of
methodologies that is useful for characterization of data that is not accurate has been provided by Rough set
theory. One of the main aspects in rough set methods is providing a formal framework for automated
transformation of data into knowledge for changing the modules [6]. Rough set theory offers mathematical
tools to discover hidden patterns in data that can be used in data mining. The main goal of the rough set
analysis is providing an estimate of data [7]. The rough set theory has been used for several types of data
with various sizes. In [7] a Rough Set method is tested on small, medium and large size of data. The results
show that Rough Sets are useful for small and medium size datasets [7]. As the dataset that we have used for
liver diseases is small size, we also consider applying Rough Set based classification in our evaluations.
2.4.
Bayesian methods
Bayesian methods are also used as one of the classification solutions in data mining. In our work we use
two main Bayesian methods namely naive Bayes and Bayesian networks that are implemented in WEKA
software for classification [8].
63
A naive Bayes classifier could be defined as an independent feature model deals with a simple
probabilistic classifier based on applying Bayes' theorem with strong independence assumptions. There are
several models that make different assumption fitting for Naïve Bayes [9] [10].
3. Result and discussion
In this work four classification methods: ANN (MLP, RBF), DT (J48, LMT), Bayes (Nave Bayes,
Bayesian network) and rough set were evaluated. We used a training set for our data and then applied the test
part of our dataset and measured the accuracy of the classifications.
There are 340 instances with 7 attributes in Liver disorder dataset that are divided into two classes. As
we can see in Table.1 almost all of four methods improved accuracy by increasing the training size. However,
the high accuracy for large training sets can be also resulted due to the over fitting problem. J48, MLP and
RBF with 76.41% have higher accuracy compared to other methods. Figure 3 shows the changes in accuracy
of different methods by changing the size of the training set.
Table 1. Accuracy of all methods in different training sizes
Training
Size
J48
LMT
Bayes
Net
Naïve
Bayes
10-90
20-80
30-70
40-60
50-50
60-40
70-30
80-20
90-10
53.28%
57.03%
54.85%
57.14%
64.49%
60.74%
66.33%
63.23%
79.41%
55.59%
59.62%
56.54%
69.45%
71.00%
70.37%
59.40%
69.17%
74.47%
58.55%
61.11%
59.91%
65.02%
65.08%
61.48%
67.32%
66.17%
76.47%
57.89%
59.62%
59.91%
64.03%
65.58%
61.48%
67.32%
66.17%
76.47%
MLP (2
hidden
layer)
57.23%
61.85%
62.02%
69.95%
65.68%
66.66%
69.30%
70.58%
79.41%
RBF
Rough
Set
61.84%
62.96%
64.13%
67.98%
65.68%
62.96%
67.32%
67.64%
79.41%
55.92%
60.37%
67.51%
67.98%
65.08%
65.18%
68.31%
54.70%
76.47%
Fig 3. The accuracy is increased fluctuated during raising training set size
We have also considered different attributes and their affected in liver disorder dataset classification.
According to the attribute selection methods such as best first and ranker methods, some attributes including
SGOT, ALK, MCV and SGPT are removed and then the classification methods are applied. The result of
this experiment is shown in Table 2.
Table 2. Accuracy of all methods in different feature sizes
Number
Of
features
3
4
5
6
7
J48
LMT
Bayes
Net
Naïve
Bayes
76.47%
82.35%
79.41%
79.41%
79.41%
76.47%
59.40%
82.35%
73.52%
74.47%
76.47%
82.35%
76.47%
76.47%
76.47%
76.47%
82.35%
76.47%
76.47%
76.47%
64
MLP
(5 hidden
layer)
76.47%
76.47%
88.23%
91.17%
79.41%
RBF
Rough Set
73.52%
79.41%
79.41%
82.35%
79.41%
58.82%
61.76%
64.70%
73.52%
76.47%
As shown in Fig.4 there was not a clear effect in accuracy of some of the methods with increasing
feature sizes; however the performance of MLP is raised sharply by using only 6 features. MLP showed high
accuracy at 91.17 %; in addition RBF had a slight improvement by using only 6 features. As we can see,
there is not significant change in Naïve Bayes, Bayes Net, J48 and LMT methods by increasing the number
of features.
Fig 4. Relevant of number of features with accuracy
Figure 4 demonstrate the the accuracy of the evaluations by applying different numver of attributes. This
justifies that feature size of data set have important role in accuracy and in particular is significantly affects
the performance of the Rough Set method.
4. Conclusions and future work
In this paper, seven types of four classification methods including MLP and RBF in NN, Naïve Bayes
and Bayesian Net in Bayesian, J48 and LMT in decision tree and Rough set are applied to a Liver disorder
dataset [11]. All blood tests should be classified in two classes: “Class 0” and “Class 1”. Compared to
Bayesian and Rough Sets, Neural Networks classifier methods obtain a good result. MLP obtains higher
results than RFB and also J48 shows good results but Rough Sets did not perform well to classify the
experimental dataset compare to other methods. It is assumed that in liver disorder dataset, increasing the
size of training set will produce better results. MLP shows that can it provide better results with larger
training set. The future work will focus on attribute selection attribute techniques. We will also study the
tuning and optimization techniques for the classifiers and to make sure that the large training set will not
cause over fitting problem.
5. References
[1] K. Golnabi, et al., "Analysis of firewall policy rules using data mining techniques," 2006, pp. 305-315.
[2] H. Li, et al., "Data Mining Techniques for Complex Formation Evaluation in Petroleum Exploration and
Production: A Comparison of Feature Selection and Classification Methods," 2008, pp. 37-43.
[3] J. Liang and Z. Shi, "The information entropy, rough entropy and knowledge granulation in rough set
theory," International Journal of Uncertainty Fuzziness and Knowledge-Based Systems, vol. 12, pp. 3746, 2004.
[4] T. Justin, et al., "Comparison of different classification methods for emotion recognition," pp. 700-703.
[5] D. A. Avellaneda, et al., "Natural Texture Classification: A Neural Network Models Benchmark," 2009,
pp. 325-329.
[6] J. F. Peters and S. Ramanna, "Towards a software change classification system: A rough set approach,"
Software Quality Journal, vol. 11, pp. 121-147, 2003.
[7] A. Butalia, et al., "Applications of Rough Sets in the Field of Data Mining," 2008, pp. 498-503.
[8] J. Nicholson, et al., "Emotion recognition in speech using neural networks," Neural computing &
applications, vol. 9, pp. 290-296, 2000.
65
[9] G. Qiang, "An Effective Algorithm for Improving the Performance of Naive Bayes for Text
Classification," pp. 699-701.
[10] Y. Herdiyeni, et al., "A Bayesian network approach for image similarity," pp. 1-6.
[11] Liver disorder data set-Bupa medical research category. Available at:
http://archive.ics.uci.edu/ml/datasets/Liver+Disorders
66
Download