c4.5 algorithm - Academic Science,International Journal of

An incisive study of the Naïve Bayes Classifier and
Decision Trees C4.5 and CART for the Diagnosis of
Umang Shah
Prof. Khushali Deulkar
D. J. Sanghvi College of Engineering
Mumbai, India
D. J. Sanghvi College of Engineering
Mumbai, India
Data mining is a technique used to extract meaningful
information from a large set of unstructured data. Combined
with machine learning, it has been used as a valuable tool in
medical diagnosis, both for predicting and for diagnosing the
presence of diseases based on common factors or symptoms.
Various classification algorithms aid in the diagnosis and
prediction of disease by identifying the required information.
This paper analyzes and contrasts the C4.5, CART and Naïve
Bayes methods of classification mining for diagnosing the
presence of diabetes mellitus in patients. The open source
WEKA tool was used to evaluate the performance of all three
methods.

Keywords: Data Mining, Diabetes Mellitus, Decision Trees, CART,
C4.5, Naïve Bayes.

Table 1. Literature Survey

(2004) [3]
  Method Used: Three methods: Genetic Algorithm, EM algorithm
    and clustering. A study planned for discovering the hidden
    knowledge from a specific dataset on the quality of
    healthcare for diabetic patients.
  Result: An effective method for diabetes patients was formed.

Allahveri (2008)
  Method Used: Artificial neural networks (ANN) and fuzzy
    neural networks (FNN).
  Result: Achieved a very high accuracy for diagnosis of
    diabetes and heart diseases.

(2011) [5]
  Method Used: Data transformation and discretization methods
    are applied for improving the quality of data. Data mining
    technique that groups diabetes patients in healthcare into
    [...] using decision tree [...]
  Result: [...] treatments and health policies for diabetes
    patients.

Gandi, (2013) [6]
  Method Used: A two-level approach: initially, optimal
    features are extracted from the existing training data and
    the positive and negative probability is calculated, until
    a new dataset is [...]
  Result: An efficient knowledge expert system by the [...]
    which is highly accurate in [...] diabetes patients.
Diabetes mellitus is a group of metabolic diseases in which
either the cells do not respond properly to insulin or the
body does not produce sufficient insulin. If left
uncontrolled, diabetes causes other complications such as
heart disease, stroke, high blood pressure, liver disease,
kidney disease, neuropathy and the loss of some organs in the
body. It gives rise to high blood sugar levels, whose common
symptoms include frequent urination, increased thirst and
hunger. [1] More than 380 million people worldwide are
afflicted by it. WHO estimates that this number will double by
2030, as the number of cases is increasing considerably. [2]
Just as doctors identify various symptoms and body parameters
to diagnose diseases, medical diagnosis is possible using data
mining and machine learning. Decision tree generation using
the CART technique and the C4.5 algorithm, and Naïve Bayes
classification, all use training data to form empirical models
or patterns for the diagnosis of diabetes mellitus. By taking
a number of factors into account, it is possible to develop
classification models that analyze given symptoms to determine
whether a patient has diabetes.
Due to the urgency associated with a disease like diabetes, a
large amount of work has been conducted in this field, and
high importance is given to medical diagnosis and the use of
machine learning and data mining for the same. Table 1
summarizes some of the advances and findings.
Table 1 (continued), method used in [6]: At the next level,
the testing data features in the optimal dataset are
classified using the Naïve Bayes classifier.
Naïve Bayes classifiers are a family of simple probabilistic
classifiers based on applying Bayes' theorem with strong
(naive) independence assumptions between the features.
Naïve Bayes avoids complex iterative estimation of parameters
and can therefore be applied to a large data set in real-time
situations such as the prediction of diabetes. The probability
formula is shown in Fig. 1. [7]
C4.5, an extension of the ID3 algorithm by Quinlan, is widely
used for medical diagnosis. C4.5 uses training data to build
decision trees using the concept of information entropy. These
decision trees are used for the purpose of classification. The
training data is a set S of already classified samples. The
standard decision tree structure is shown in Fig 2.
In this tree, at every node, C4.5 chooses the attribute of the
dataset that most effectively splits the set of samples into
subsets enriched in one class or the other. The splitting
criterion is the normalized information gain (difference in
entropy). The attribute with the highest normalized
information gain is chosen to make the decision. The C4.5
algorithm then recurses on the smaller sublists. [9]
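The splitting criterion described above can be sketched in a few lines of Python. This is a minimal illustration, not the C4.5 implementation: it handles only discrete attributes, and the function names and the toy data are assumptions made for this example.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def gain_ratio(values, labels):
    """Normalized information gain of splitting `labels` by a discrete attribute."""
    total = len(labels)
    # Group the class labels by attribute value.
    subsets = {}
    for value, label in zip(values, labels):
        subsets.setdefault(value, []).append(label)
    # Information gain: entropy before the split minus weighted entropy after.
    remainder = sum(len(s) / total * entropy(s) for s in subsets.values())
    gain = entropy(labels) - remainder
    # Split information: entropy of the partition induced by the attribute.
    split_info = entropy(values)
    return gain / split_info if split_info > 0 else 0.0


# Hypothetical toy data: one discretized attribute and the class label.
glucose = ["high", "high", "low", "low", "high", "low"]
diabetic = [1, 1, 0, 0, 1, 0]
perfect = gain_ratio(glucose, diabetic)  # attribute separates the classes fully

noise = ["a", "b", "a", "b", "a", "b"]
weak = gain_ratio(noise, diabetic)       # attribute carries little information
```

An attribute that separates the classes completely scores higher than an uninformative one, so it would be chosen for the split at that node.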
Fig 1: Naïve Bayes Formula
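The figure itself is not reproduced in this text. Assuming it shows the standard form of Bayes' theorem under the naive independence assumption, with class C and feature values x_1, ..., x_n (symbols chosen here for illustration), the formula can be written as:

```latex
P(C \mid x_1, \dots, x_n)
  = \frac{P(C)\,\prod_{i=1}^{n} P(x_i \mid C)}{P(x_1, \dots, x_n)}
```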
The conditional independence assumption of Naïve Bayes states
that the presence or absence of one parameter does not depend
on the presence or absence of any other parameter, so each
parameter’s value has an independent effect on the result. For
example, for the parameter “Frequency of Urination”, the
probabilities of both Diabetic = ‘True’ and Diabetic = ‘False’
are calculated: P(Diabetic = ‘True’) given “Frequency of
Urination” = ‘Value from Test Data’, and P(Diabetic = ‘False’)
given “Frequency of Urination” = ‘Value from Test Data’. The
probabilities of all parameters are thus stored individually,
in different variables, so that their individual contributions
to the final result can be calculated. Where some parameter
had a zero probability value, Laplace correction was used.
Finally, the test data is classified as Diabetic = ‘True’ or
‘False’.
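The per-parameter probabilities and the Laplace correction can be sketched as follows. This is a minimal illustration for categorical features, not WEKA's implementation: the add-one form of the correction, the function names and the toy data are assumptions made for this example.

```python
from collections import Counter, defaultdict


def train(rows, labels):
    """Count class frequencies and per-feature value frequencies."""
    class_counts = Counter(labels)
    # value_counts[(feature_index, class)][value] -> number of occurrences
    value_counts = defaultdict(Counter)
    domains = defaultdict(set)
    for row, label in zip(rows, labels):
        for i, value in enumerate(row):
            value_counts[(i, label)][value] += 1
            domains[i].add(value)
    return class_counts, value_counts, domains


def posterior(row, model):
    """Return normalized P(class | row) under the naive independence assumption."""
    class_counts, value_counts, domains = model
    total = sum(class_counts.values())
    scores = {}
    for label, count in class_counts.items():
        score = count / total  # prior P(class)
        for i, value in enumerate(row):
            # Laplace correction: add 1 to every count so no factor is zero.
            numerator = value_counts[(i, label)][value] + 1
            denominator = count + len(domains[i])
            score *= numerator / denominator
        scores[label] = score
    norm = sum(scores.values())
    return {label: score / norm for label, score in scores.items()}


# Hypothetical toy data: (frequency of urination, thirst) -> diabetic?
rows = [("high", "high"), ("high", "low"), ("low", "low"), ("low", "high")]
labels = [True, True, False, False]
model = train(rows, labels)
probs = posterior(("high", "high"), model)
prediction = max(probs, key=probs.get)
```

In the toy data no non-diabetic record has a high frequency of urination, so without the correction the probability of Diabetic = False would be exactly zero; with it, the class keeps a small non-zero probability.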
Classification and Regression Trees, or simply CART, is a
data mining algorithm capable of finding hidden patterns in
complex data. While forming the decision tree, CART uses the
Gini index to decide the splitting node. The Gini index is
calculated for all attributes, and the attribute with the
smallest Gini index is selected as the splitting attribute.
The Gini index is calculated as follows:
$$G = 1 - \sum_{i=1}^{k} p_i^{2}$$
Here, $p_i$ is the probability of class $i$ among the $k$
classes. [15] After the best split is identified, the search
process is repeated recursively until splitting stops. Once a
decision tree is generated, it is pruned by removing sections
of the tree that contribute little to classifying the data.
Pruning reduces the prediction error rate. CART can use
exhaustive searches as well as computer-based testing to find
patterns and relationships in the given data. It can be
applied to any data set and needs very little input from the
user. [16]
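The Gini-based choice of a split can be sketched as follows. This is a minimal illustration for a single numeric attribute and a binary split; it omits pruning, and the function names and the toy readings are assumptions made for this example.

```python
from collections import Counter


def gini(labels):
    """Gini index G = 1 - sum(p_i^2) over the class proportions p_i."""
    total = len(labels)
    return 1.0 - sum((n / total) ** 2 for n in Counter(labels).values())


def split_gini(left, right):
    """Size-weighted Gini index of a binary split."""
    total = len(left) + len(right)
    return len(left) / total * gini(left) + len(right) / total * gini(right)


def best_threshold(values, labels):
    """Try each midpoint between sorted distinct values; keep the lowest Gini."""
    distinct = sorted(set(values))
    best = None
    for lo, hi in zip(distinct, distinct[1:]):
        threshold = (lo + hi) / 2
        left = [l for v, l in zip(values, labels) if v <= threshold]
        right = [l for v, l in zip(values, labels) if v > threshold]
        score = split_gini(left, right)
        if best is None or score < best[1]:
            best = (threshold, score)
    return best


# Hypothetical glucose readings and class labels (1 = diabetic).
glucose = [85, 90, 110, 150, 160, 180]
diabetic = [0, 0, 0, 1, 1, 1]
threshold, score = best_threshold(glucose, diabetic)
```

A Gini index of 0 means every subset produced by the split contains a single class, which is the best possible outcome for that node.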
Fig 2: Standard Decision Tree Structure
Once the decision tree has been formed from the training
data, classification follows a hierarchical path to a leaf
node, which specifies the class variable as either Diabetic or
non-diabetic.
The Pima Indian data set of the National Institute of Diabetes
and Digestive and Kidney Diseases, from the UCI Machine
Learning Repository, is used. It consists of 9 attributes,
namely [10]:
Number of times pregnant
Plasma glucose concentration at 2 hours in an oral glucose tolerance test
Diastolic blood pressure (mm Hg)
Triceps skin fold thickness (mm)
2-Hour serum insulin (mu U/ml)
Body mass index (weight in kg/(height in m)^2)
Diabetes pedigree function
Age (years)
Class variable (0 or 1)
The open source tool WEKA [11], a collection of machine
learning algorithms for data mining tasks, was used to carry
out the testing on the Pima Indian data set, which consists of
768 records. Simulation was conducted for all three
algorithms. For C4.5, its Java implementation in WEKA, J48,
was used. A SimpleCart decision tree was generated to process
results using CART, and for Naïve Bayes, the Bayesian
Classifiers’ fixed Naïve Bayes search was used.
For the 1st case (shown in blue in Fig 3), 10% of the data was
supplied as training data and the remaining 691 instances
were tested. For the 2nd case (shown in red in Fig 3), 80% of
the data was supplied as training data and the remaining 154
instances were tested.
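The instance counts quoted for the two cases follow from the 768 records in the data set. A short check, assuming the training share is rounded to the nearest whole instance (the rounding rule is an assumption here; it reproduces the counts given in the text):

```python
TOTAL_RECORDS = 768  # size of the Pima Indian data set, as stated in the text


def split_sizes(total, train_fraction):
    """Return (training, testing) instance counts for a percentage split."""
    train = round(total * train_fraction)
    return train, total - train


small_train, small_test = split_sizes(TOTAL_RECORDS, 0.10)
large_train, large_test = split_sizes(TOTAL_RECORDS, 0.80)
print(small_train, small_test)  # 77 691
print(large_train, large_test)  # 614 154
```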
After studying the various methods of diagnosing diabetes and
performing experiments on test samples, we now compare and
contrast the results in Table 2.
Fig 3: Correctness rate of Naïve Bayes, CART and C4.5 as
obtained using the WEKA tool
Table 2. Comparing Naïve Bayes, CART and C4.5

Naïve Bayes
  [...] of Parameters: Factors used for prediction are
    independent of the presence or absence of other [...]
    Probability of one factor does not affect the overall [...]
    All factors are considered [...]
  Error Rate (Figure 3): When a small amount of training data
    is provided, the error rate (approx. 30%) is much lower
    than [...]
  Runtime: 0.02s
  Advantages: Naïve Bayes algorithm scales linearly as the
    number of predictors and rows increases. It can handle real
    as well as discrete data equally well.
  Disadvantages: The independence of factors may introduce
    errors in cases where one feature is absolutely dependent
    on another feature.

CART
  [...] of Parameters: CART generates a decision [...] Each
    leaf node is dependent on the parent nodes. However, the
    Gini factor is calculated to decide the splitting node.
    Some nodes have a higher contribution than other nodes in
    the final classification.
  Error Rate (Figure 3): CART has an error rate of
    approximately 25%, given a small training data set.
  Runtime: 0.35s
  Advantages: It can generate regression trees, where each leaf
    predicts a real number and not just a class. CART can
    identify the most significant variables and eliminate
    non-significant ones by referring to the Gini index. [12]
  Disadvantages: CART may have an unstable decision tree. When
    the learning sample is modified there may be an increase or
    decrease of tree complexity, and changes in splitting
    variables and values. [12]

C4.5
  [...] of Parameters: C4.5 generates a decision [...] Each
    leaf node is dependent on the parent nodes. Thus, the
    parent factor leads to the factors in child nodes.
  Error Rate (Figure 3): Results obtained from WEKA show that
    C4.5 has a lower error rate (approx. 20%) given a large
    amount of training data.
  Runtime: 0.08s
  Advantages: It can have continuous as well as discrete [...]
    C4.5 allows the use of missing attributes.
  Disadvantages: It suffers from the problem of overfitting
    whenever the algorithm picks up data with uncommon
    characteristics. [14]
We have studied three methods of data mining that are used
for diagnosing the presence of diabetes in a person.
Experiments were conducted on the Pima Indian diabetes
dataset. The evaluation of results indicates that C4.5 and
CART perform better when a large amount of training data is
present; their accuracy (approx. 80%) is then better than
that of Naïve Bayes. However, Naïve Bayes has better accuracy
than CART given a practical amount of training data. CART
also suffers from a longer execution time than both Naïve
Bayes and C4.5. With a small amount of training data, C4.5
has the highest error rate. In practical applications of
medical diagnosis the training set size is variable, in which
case the Naïve Bayes classifier’s accuracy, combined with its
quick execution even for large data sets, is more promising
and should be preferred.
