c4.5 algorithm - Academic Science,International Journal of

An incisive study of the Naïve Bayes Classifier and
Decision Trees C4.5 and CART for the Diagnosis of
Diabetes
Umang Shah
D. J. Sanghvi College of Engineering
Mumbai, India
umang.k.shah@gmail.com

Prof. Khushali Deulkar
D. J. Sanghvi College of Engineering
Mumbai, India
khushali.deulkar@djsce.ac.in
ABSTRACT
Data mining is a technique used to extract meaningful
information from large sets of unstructured data. Combined
with machine learning, it has been used as a valuable tool in
medical diagnosis, both for prediction and for diagnosing the
presence of diseases based on common factors or symptoms.
Various classification algorithms aid in the diagnosis and
prediction of disease by identifying the required information.
This paper analyzes and contrasts the C4.5, CART and Naïve
Bayes methods of classification mining for diagnosing the
presence of diabetes mellitus in patients. The open source
WEKA tool was used to evaluate the performance of the three
methods.
Keywords
Data Mining, Diabetes Mellitus, Decision Trees, CART, C4.5,
Naïve Bayes.
Table 1. Literature Survey
| Author | Method Used | Solution |
|---|---|---|
| Barakat et al. (2004) [3] | Three methods: genetic algorithm, EM algorithm and H-means+ clustering. The study aimed to discover hidden knowledge in a specific dataset in order to improve the quality of healthcare for diabetic patients. | An effective method for classifying diabetes patients was formed. |
| Humar Kahramanli, Novruz Allahverdi (2008) [4] | Artificial neural networks (ANN) and fuzzy neural networks (FNN). | Achieved a very high accuracy for diagnosis of diabetes and heart diseases. |
| Sathyadevi (2011) [5] | Data transformation and discretization methods are applied to improve the quality of data. A data mining technique groups diabetes patients in healthcare into diverse subpopulations using a decision tree algorithm. | Determining treatments and health policies for diabetes patients. |
| A. Ambica, Satyanarayana Gandi, Amarendra Kothalanka (2013) [6] | A two-level approach: initially, optimal features are extracted from the existing training data and the positive and negative probabilities are calculated, until a new dataset is formed. | An efficient knowledge expert system based on Naïve Bayes classification, which is highly accurate in classifying diabetes patients. |
INTRODUCTION
Diabetes mellitus is a group of metabolic diseases in which
either cells do not respond properly to insulin or the body
does not produce sufficient insulin. If left uncontrolled,
diabetes causes other complications such as heart disease,
stroke, high blood pressure, liver disease, kidney disease,
neuropathy and the loss of some organs in the body. It gives
rise to high blood sugar levels, whose common symptoms
include frequent urination, increased thirst and hunger. [1]
More than 380 million people worldwide are afflicted by it.
WHO estimates that this number will double by 2030, as the
numbers are increasing considerably day by day. [2]
In the same way that doctors identify various symptoms and
body parameters to diagnose diseases, medical diagnosis is
possible using data mining and machine learning. The data
mining techniques considered here are decision tree generation
using the CART technique and the C4.5 algorithm, and Naïve
Bayes classification, all of which use training data to form
empirical models or patterns for the diagnosis of diabetes
mellitus. By taking a number of factors into account, it is
possible to develop classification models that analyze given
symptoms to determine whether a patient should be diagnosed
with diabetes.
LITERATURE SURVEY
Due to the urgency associated with a disease like diabetes, a
large amount of work has been conducted in this field, and
high importance is given to medical diagnosis and to the use
of machine learning and data mining for it. Table 1 summarizes
some of the advances and findings.
In the second level of the two-level approach of Ambica et al.
[6], the testing data features in the optimal dataset are
classified using a Naïve Bayes classifier.
NAÏVE BAYES CLASSIFIER
Naïve Bayes classifiers are a family of simple probabilistic
classifiers based on applying Bayes' theorem with strong
(naive) independence assumptions between the features.
Naïve Bayes overcomes various limitations: it avoids complex
iterative estimation of parameters, so it can be applied to a
large data set in real-time situations such as the prediction
of diabetes. The probability formula is shown in Fig. 1. [7]
C4.5 ALGORITHM
C4.5, an extension of the ID3 algorithm developed by Quinlan,
is widely used for medical diagnosis. C4.5 uses training data
to build decision trees based on the concept of information
entropy. These decision trees are then used for
classification. The training data is a set S of already
classified samples. The standard decision tree structure is
shown in Fig 2.
At every node of the tree, C4.5 chooses the attribute of the
dataset that most effectively splits the set of samples into
subsets enriched in one class or the other. The splitting
criterion is the normalized information gain (difference in
entropy). The attribute with the highest normalized
information gain is chosen to make the decision. The C4.5
algorithm then recurses on the smaller subsets. [9]
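The attribute selection step of C4.5 can be illustrated with a minimal Python sketch. It assumes discrete attributes only and computes the normalized information gain as information gain divided by the entropy of the attribute's own value distribution. The function names and the toy records are invented for illustration; they are not taken from the Pima dataset or from WEKA's J48.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def gain_ratio(values, labels):
    """Normalized information gain of splitting `labels` by a discrete attribute."""
    total = len(labels)
    # Group the class labels by attribute value.
    subsets = {}
    for value, label in zip(values, labels):
        subsets.setdefault(value, []).append(label)
    # Information gain: entropy before the split minus weighted entropy after it.
    remainder = sum(len(s) / total * entropy(s) for s in subsets.values())
    gain = entropy(labels) - remainder
    # Normalizer: entropy of the attribute's own value distribution.
    split_info = entropy(values)
    return gain / split_info if split_info > 0 else 0.0


# Hypothetical toy records (assumption: invented purely for illustration).
glucose = ["high", "high", "high", "low", "low", "low"]
age = ["old", "young", "old", "young", "old", "young"]
diabetic = [True, True, True, False, False, False]

ratios = {
    "glucose": gain_ratio(glucose, diabetic),
    "age": gain_ratio(age, diabetic),
}
# The attribute with the highest normalized information gain becomes the split.
best_attribute = max(ratios, key=ratios.get)
```

In this toy data, `glucose` separates the two classes completely, so it is selected over `age`. Continuous attributes, missing values and pruning are left out of the sketch.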
Fig 1: Naïve Bayes Formula
The conditional independence assumption of Naïve Bayes states
that the presence or absence of one parameter does not depend
on the presence or absence of any other parameter, so each
parameter's value has an independent effect on the result. For
example, for the parameter "Frequency of Urination", the
probabilities of both Diabetic = 'True' and Diabetic = 'False'
are calculated: P(Diabetic = 'True') given "Frequency of
Urination" = 'value from test data', and P(Diabetic = 'False')
given "Frequency of Urination" = 'value from test data'. The
probabilities of all parameters are stored individually, so
their individual contributions to the final result can be
calculated in separate variables. Where a parameter has a zero
probability value, Laplace correction was used. Finally, the
test data is classified as either Diabetic or Not Diabetic. [8]
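The per-parameter probabilities and the Laplace correction described above can be illustrated with a minimal Python sketch. It assumes categorical features and add-one smoothing. The function names, the two features and the toy records are hypothetical; this is not the procedure of [8] or WEKA's implementation.

```python
from collections import Counter, defaultdict


def train_naive_bayes(rows, labels):
    """Count class frequencies and per-feature value frequencies."""
    class_counts = Counter(labels)
    value_counts = defaultdict(Counter)  # (feature index, class) -> value counts
    feature_values = defaultdict(set)    # feature index -> distinct values seen
    for row, label in zip(rows, labels):
        for i, value in enumerate(row):
            value_counts[(i, label)][value] += 1
            feature_values[i].add(value)
    return class_counts, value_counts, feature_values


def predict(model, row):
    """Return the most probable class and the unnormalized score of each class."""
    class_counts, value_counts, feature_values = model
    total = sum(class_counts.values())
    scores = {}
    for label, n_label in class_counts.items():
        score = n_label / total  # prior P(class)
        for i, value in enumerate(row):
            # Laplace correction: add 1 to every count so that a value never
            # seen with this class does not force the whole product to zero.
            count = value_counts[(i, label)][value]
            score *= (count + 1) / (n_label + len(feature_values[i]))
        scores[label] = score
    return max(scores, key=scores.get), scores


# Hypothetical records: (frequency of urination, thirst); invented for illustration.
rows = [("high", "high"), ("high", "low"), ("high", "high"),
        ("low", "low"), ("low", "low"), ("low", "high")]
labels = [True, True, True, False, False, False]

model = train_naive_bayes(rows, labels)
predicted, scores = predict(model, ("high", "high"))
```

Without the correction, the score for Diabetic = False would be exactly zero here, because no non-diabetic record in the toy data has a high frequency of urination.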
CART ALGORITHM
Classification and Regression Trees, or simply CART, is a data
mining algorithm capable of finding hidden patterns in complex
data. While forming the decision tree, CART uses the Gini
index to decide the splitting node. The Gini index is
calculated for all attributes, and the attribute with the
smallest Gini index is selected as the splitting attribute.
The Gini index is calculated as follows:
$$G = 1 - \sum_{i=1}^{k} p_i^{2}$$
Here, p_i is the probability of class i among the k classes.
[15] After the best split is identified, the search process is
repeated recursively until splitting is stopped. Once a
decision tree is generated, it is pruned by removing sections
of the tree that contribute little to classifying the data.
Pruning reduces the prediction error rate. CART can use
exhaustive searches as well as computer-based testing to find
patterns and relationships in the given data. It can be
applied to any data set and needs very little input from the
user. [16]
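The Gini-based choice of splitting attribute can be illustrated with a minimal Python sketch. It assumes discrete attributes and scores each attribute by the size-weighted Gini index of the subsets it produces. The function names and toy records are invented for illustration, and the binary-split search and pruning of a full CART implementation are omitted.

```python
from collections import Counter


def gini(labels):
    """Gini index G = 1 - sum(p_i^2) over the classes present in `labels`."""
    total = len(labels)
    return 1.0 - sum((n / total) ** 2 for n in Counter(labels).values())


def split_gini(values, labels):
    """Size-weighted Gini index of the subsets produced by a discrete attribute."""
    total = len(labels)
    subsets = {}
    for value, label in zip(values, labels):
        subsets.setdefault(value, []).append(label)
    return sum(len(s) / total * gini(s) for s in subsets.values())


# Hypothetical toy records (assumption: invented purely for illustration).
glucose = ["high", "high", "high", "low", "low", "low"]
age = ["old", "young", "old", "young", "old", "young"]
diabetic = [True, True, True, False, False, False]

scores = {
    "glucose": split_gini(glucose, diabetic),
    "age": split_gini(age, diabetic),
}
# The attribute with the smallest Gini index becomes the splitting attribute.
splitting_attribute = min(scores, key=scores.get)
```

A subset containing a single class has a Gini index of 0, which is why the attribute that separates the toy classes completely is preferred.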
Fig 2: Standard Decision Tree Structure
Once the decision tree has been formed from the training data,
classification follows a hierarchical path from the root to a
leaf node, which specifies the class variable as either
Diabetic or Non-Diabetic.
CLASSIFICATION OF DATA SET AND RESULT ANALYSIS
The Pima Indian Data Set of the National Institute of Diabetes
and Digestive and Kidney Diseases, from the UCI Machine
Learning Repository, is used. It consists of 9 attributes,
namely [10]:
1. Number of times pregnant
2. Plasma glucose concentration at 2 hours in an oral glucose tolerance test
3. Diastolic blood pressure (mm Hg)
4. Triceps skin fold thickness (mm)
5. 2-Hour serum insulin (mu U/ml)
6. Body mass index (weight in kg/(height in m)^2)
7. Diabetes pedigree function
8. Age (years)
9. Class variable (0 or 1)
The open source tool WEKA [11], a collection of machine
learning algorithms for data mining tasks, was used to carry
out the testing on the Pima Indian data set, which consists of
768 records. Simulation was conducted for all three
algorithms. For C4.5, its Java implementation J48 was used. A
SimpleCart decision tree was generated to process results
using CART, and for Naïve Bayes, the fixed Naïve Bayes search
of the Bayesian classifiers was used.
For the 1st case (shown in blue in Fig 3), 10% of the data was
supplied as training data and the remaining 691 instances were
tested. For the 2nd case (shown in red in Fig 3), 80% of the
data was supplied as training data and the remaining 154
instances were tested.
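The test-set sizes in the two cases follow from the 768 records. The short Python check below assumes that the training size is the percentage of the records rounded to the nearest integer; that rounding rule is an assumption, not something stated in the text.

```python
TOTAL_RECORDS = 768  # size of the Pima Indian data set, as stated in the text


def split_sizes(total, train_percent):
    """Return (training, testing) instance counts for a percentage split."""
    # Assumption: the training size is rounded to the nearest integer.
    train = round(total * train_percent / 100)
    return train, total - train


case_1 = split_sizes(TOTAL_RECORDS, 10)  # 1st case: 10% training data
case_2 = split_sizes(TOTAL_RECORDS, 80)  # 2nd case: 80% training data
```

Under this assumption the 1st case trains on 77 instances and tests on 691, and the 2nd case trains on 614 instances and tests on 154, which matches the counts reported above.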
COMPARATIVE ANALYSIS OF NAÏVE BAYES, CART AND C4.5
After studying the various methods of diagnosing diabetes and
performing experiments on test samples, we now compare and
contrast the results in Table 2.
Fig 3: Correctness rate of Naïve Bayes, CART and C4.5 as
obtained using the WEKA tool
Table 2. Comparing Naïve Bayes, CART and C4.5
| Factor | Naïve Bayes | CART | C4.5 |
|---|---|---|---|
| Interdependence of Parameters | Factors used for prediction are independent of the presence or absence of other variables. Probability of one factor does not affect the overall classification. All factors are considered independently. | CART generates a decision tree. Each leaf node is dependent on the parent nodes. However, Gini factor is calculated to decide the splitting node. Some nodes have higher contribution than other nodes in final classification. | C4.5 generates a decision tree. Each leaf node is dependent on the parent nodes. Thus, parent factor leads to the factors in child nodes. |
| Error Rate (Figure 3) | When a small amount of training data is provided, the error rate (approx. 30%) is much lower than C4.5. | CART has an error rate of approximately 25%, given a small training data set. | Results obtained from WEKA show that C4.5 has a lower error rate (approx. 20%) given a large amount of training data. |
| Time | Runtime of 0.02s | Runtime of 0.35s | Runtime of 0.08s |
| Advantages | Naïve Bayes algorithm scales linearly, as number of predictors and rows increase. [13] It can handle real as well as discrete data equally well. | It can generate regression trees, where each leaf predicts a real number and not just a class. CART can identify the most significant variables and eliminate non-significant ones by referring to the Gini index. [12] | It can have continuous as well as discrete attributes. [12] C4.5 allows the use of missing attributes. |
| Disadvantages | The independence of factors may introduce errors in cases where one feature is absolutely dependent on another feature. | CART may have an unstable decision tree. When the learning sample is modified there may be an increase or decrease of tree complexity, changes in splitting variables and values. [12] | It suffers from the problem of overfitting whenever the algorithm picks up data with uncommon characteristics. [14] |
CONCLUSION
We have studied three data mining methods used for diagnosing
the presence of diabetes in a person. Experiments were
conducted on the Pima Indian diabetes dataset. The evaluation
of results indicated that C4.5 and CART perform better when a
large amount of training data is available; their accuracy
(approx. 80%) is then better than that of Naïve Bayes.
However, Naïve Bayes has better accuracy than CART given a
practical amount of training data. CART also suffers from a
longer execution time than both Naïve Bayes and C4.5. With a
small amount of training data, C4.5 has the highest error
rate. In practical applications of medical diagnosis, the
training set size is variable, in which case the Naïve Bayes
classifier's accuracy, combined with its quick execution even
for large data sets, is more promising and should be the
preferred approach.
REFERENCES
[1] Diabetes mellitus - Wikipedia, the free encyclopedia [Online]. https://en.wikipedia.org/wiki/Diabetes_mellitus
[2] Diabetes Research [Online]. http://www.diabetesresearch.org/what-is-diabetes
[3] Barakat, N. 2004. "Learning-based rule-extraction from support vector machines".
[4] Humar Kahramanli, Novruz Allahverdi. 2008. "Design of a hybrid system for the diabetes and heart disease". Expert Systems with Applications 35, 2008, p 82–89.
[5] Sathyadevi, G. 2011. Application of CART algorithm in hepatitis disease diagnosis. Recent Trends in Information Technology, 44-78.
[6] A. Ambica, Satyanarayana Gandi, Amarendra Kothalanka (2013). An Efficient Expert System For Diabetes By Naïve Bayesian Classifier. International Journal of Engineering Trends and Technology (IJETT) – Volume 4 Issue 10 - Oct 2013.
[7] Aiswarya Iyer, S. J. (2015). DIAGNOSIS OF DIABETES USING CLASSIFICATION MINING TECHNIQUES. Dubai: International Journal of Data Mining & Knowledge Management Process (IJDKP) Vol.5, No.1, January 2015.
[8] Sarvar, A., & Vinod Sharma, D. o. (2012). Intelligent Naïve Bayes Approach to Diagnose. Jammu: Special Issue of International Journal of Computer Applications (0975 – 8887) on Issues and Challenges in Networking, Intelligence and Computing Technologies – ICNICT 2012, November 2012.
[9] C4.5 Algorithm [Online]. Available: https://en.wikipedia.org/wiki/C4.5_algorithm
[10] UCI Machine Learning Repository - Center for Machine Learning and Intelligent System [Online]. http://archive.ics.uci.edu
[11] Weka Tool [Online]. http://www.cs.waikato.ac.nz/ml/weka/
[12] Singh, Soniya & Priyanka Gupta. 2014. Comparative Study ID3, CART and C4.5 Decision Tree Algorithm: A Survey. International Journal of Advanced Information Science and Technology (IJAIST) Vol.27, No.27, July 2014.
[13] Oracle Documentation – Naïve Bayes [Online]. Available: http://docs.oracle.com/cd/B28359_01/datamine.111/b28129/algo_nb.htm
[14] Mohammad M Mazid, A B M Shawkat Ali, Kevin Tickle. "Improved C4.5 Algorithm for Rule Based Classification". School of Computing Science, Central Queensland University, Australia.
[15] Ding, W. and Marchionini, G. 1997. A Study on Video Browsing Strategies. Technical Report. University of Maryland at College Park.
[16] Bel, L. 2009. CART algorithm for spatial data: Application to environmental and ecological data. Statistics & Data Analysis, 33-78.