An Incisive Study of the Naïve Bayes Classifier and the Decision Trees C4.5 and CART for the Diagnosis of Diabetes

Umang Shah, D. J. Sanghvi College of Engineering, Mumbai, India, umang.k.shah@gmail.com
Prof. Khushali Deulkar, D. J. Sanghvi College of Engineering, Mumbai, India, khushali.deulkar@djsce.ac.in

ABSTRACT
Data mining is a technique used to extract meaningful information from large sets of unstructured data. Combined with machine learning, it has become a valuable tool in medical diagnosis, both for prediction and for detecting the presence of diseases from common factors or symptoms. Various classification algorithms aid in the diagnosis and prediction of disease by identifying the required information. This paper analyzes and contrasts the C4.5, CART and Naïve Bayes methods of classification mining for diagnosing the presence of diabetes mellitus in patients. The open-source WEKA tool was used to evaluate the performance of all three methods.

Keywords
Data Mining, Diabetes Mellitus, Decision Trees, CART, C4.5, Naïve Bayes.

INTRODUCTION
Diabetes mellitus is a group of metabolic diseases in which either the body's cells do not respond properly to insulin or the body does not produce sufficient insulin. If left uncontrolled, diabetes causes further complications such as heart disease, stroke, high blood pressure, liver disease, kidney disease, neuropathy and the loss of function of some organs. It gives rise to high blood sugar levels, whose common symptoms include frequent urination and increased thirst and hunger [1]. More than 380 million people worldwide are afflicted by it, and the WHO estimates that this number will double by 2030, as the numbers are increasing considerably day by day [2].

In the same way that doctors evaluate various symptoms and body parameters to diagnose diseases, medical diagnosis is possible using data mining and machine learning. This paper uses data mining techniques based on decision tree generation with the CART and C4.5 algorithms, and on Naïve Bayes classification, all of which use training data to form empirical models or patterns for the diagnosis of diabetes mellitus. By taking a number of factors into account, it is possible to develop classification models that analyze the given symptoms to determine whether a patient has diabetes.

LITERATURE SURVEY
Due to the urgency associated with a disease like diabetes, a large amount of work has been conducted in this field, and high importance is given to medical diagnosis and to the use of machine learning and data mining for it. Table 1 summarizes some of the advances and findings.

Table 1. Literature Survey

Barakat et al. (2004) [3]
  Method Used: Three methods: a genetic algorithm, the EM algorithm and H-means+ clustering; a study planned for discovering the hidden knowledge in a specific dataset to improve the quality of healthcare for diabetic patients.
  Solution: An effective method for classifying diabetes patients was formed.

Humar Kahramanli, Novruz Allahverdi (2008) [4]
  Method Used: Artificial neural networks (ANN) and fuzzy neural networks (FNN).
  Solution: Achieved a very high accuracy for the diagnosis of diabetes and heart disease.

Sathyadevi (2011) [5]
  Method Used: Data transformation and discretization methods applied to improve the quality of the data; a data mining technique that groups diabetes patients in healthcare into diverse subpopulations using a decision tree algorithm.
  Solution: Determining treatments and health policies for diabetes patients.

A. Ambica, Satyanarayana Gandi, Amarendra Kothalanka (2013) [6]
  Method Used: A two-level approach: first, optimal features are extracted from the existing training data and the positive and negative probabilities are calculated until a new dataset is formed; at the next level, the testing data features in the optimal dataset are classified using a Naïve Bayes classifier.
  Solution: An efficient knowledge expert system based on Naïve Bayes classification, which is highly accurate in classifying diabetes patients.

NAÏVE BAYES CLASSIFIER
Naïve Bayes classifiers are a family of simple probabilistic classifiers based on applying Bayes' theorem with strong (naïve) independence assumptions between the features. Naïve Bayes overcomes various limitations, including the omission of complex iterative parameter estimation, and is applicable to large datasets in real-time situations such as the prediction of diabetes. The probability formula is shown in Fig. 1 [7].

Fig 1: Naïve Bayes Formula

The conditional independence assumption of Bayes' theorem states that the presence or absence of one parameter does not depend on the presence or absence of any other parameter, so each parameter's value has an independent effect on the result. For example, for the parameter "Frequency of Urination", the probabilities of both Diabetic = 'True' and Diabetic = 'False' are calculated as P(Diabetic = 'True' | "Frequency of Urination" = value from the test data) and P(Diabetic = 'False' | "Frequency of Urination" = value from the test data). The probabilities of all parameters are therefore stored individually, and their individual contributions to the final result can be computed in separate variables. When a parameter has a zero probability value, Laplace correction is used to handle it. Finally, the test data is classified into one of two categories, Diabetic or Not Diabetic [8].
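The following Python sketch illustrates the mechanism just described: per-parameter conditional probabilities combined under the independence assumption, with a Laplace correction to avoid zero probabilities. It is only an illustration, not the WEKA implementation used in the experiments; the feature names, toy records and helper names (train_naive_bayes, classify) are invented for this example.

```python
# Minimal sketch (not the authors' implementation) of the Naive Bayes idea:
# independent per-parameter conditional probabilities with Laplace correction,
# applied to a toy "Diabetic / Not Diabetic" example with invented records.
from collections import Counter, defaultdict

def train_naive_bayes(records, labels):
    """Estimate P(class) and P(feature=value | class) with Laplace correction."""
    classes = Counter(labels)
    counts = defaultdict(int)          # (feature_index, value, class) -> count
    values = defaultdict(set)          # distinct values seen per feature
    for row, y in zip(records, labels):
        for i, v in enumerate(row):
            counts[(i, v, y)] += 1
            values[i].add(v)
    priors = {c: n / len(labels) for c, n in classes.items()}

    def likelihood(i, v, c):
        # Laplace correction: add 1 to every count so unseen values
        # never produce a zero probability.
        return (counts[(i, v, c)] + 1) / (classes[c] + len(values[i]))

    return priors, likelihood

def classify(row, priors, likelihood):
    """Pick the class with the largest product of prior and likelihoods."""
    scores = {}
    for c, p in priors.items():
        score = p
        for i, v in enumerate(row):
            score *= likelihood(i, v, c)
        scores[c] = score
    return max(scores, key=scores.get)

# Toy training data: (frequent urination, increased thirst) -> diabetic?
X = [("high", "yes"), ("high", "no"), ("low", "no"), ("low", "no")]
y = ["Diabetic", "Diabetic", "Not Diabetic", "Not Diabetic"]
priors, likelihood = train_naive_bayes(X, y)
print(classify(("high", "yes"), priors, likelihood))   # expected: Diabetic
```

Because every count is incremented by one, an attribute value never seen with a class still receives a small nonzero likelihood instead of zeroing out the whole product, which is exactly the role of the Laplace correction mentioned above.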
C4.5 ALGORITHM
C4.5, an extension of Quinlan's ID3 algorithm, is widely used for medical diagnosis. C4.5 uses training data to build decision trees based on the concept of information entropy, and these decision trees are then used for classification. The training data is a set S of already classified samples. The standard decision tree structure is shown in Fig 2. At every node of the tree, C4.5 chooses the attribute of the dataset that most effectively splits the set of samples into subsets enriched in one class or the other. The splitting measure is the normalized information gain (difference in entropy); the attribute with the highest normalized information gain is chosen to make the decision, and the algorithm then recurses on the smaller sublists [9].
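To make the splitting criterion concrete, the sketch below computes the entropy of a node, the information gain of a candidate attribute, and the gain ratio (the normalization C4.5 applies). It is an illustration only, not WEKA's J48 code, and the tiny attribute and label arrays are invented.

```python
# Sketch of the C4.5-style splitting measure: entropy, information gain and
# gain ratio. The toy values below are hypothetical, not from the Pima dataset.
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())

def information_gain(values, labels):
    """Entropy reduction obtained by splitting `labels` on attribute `values`."""
    total = len(labels)
    after = 0.0
    for v in set(values):
        subset = [y for x, y in zip(values, labels) if x == v]
        after += (len(subset) / total) * entropy(subset)
    return entropy(labels) - after

def gain_ratio(values, labels):
    """Normalized gain: information gain divided by the split information."""
    split_info = entropy(values)      # entropy of the attribute's own value distribution
    gain = information_gain(values, labels)
    return gain / split_info if split_info > 0 else 0.0

# Candidate attribute "frequent urination" against the class labels.
attribute = ["yes", "yes", "no", "no", "yes"]
labels    = ["Diabetic", "Diabetic", "Not Diabetic", "Not Diabetic", "Not Diabetic"]
print(round(information_gain(attribute, labels), 3))  # raw information gain
print(round(gain_ratio(attribute, labels), 3))        # C4.5's normalized measure
```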
CART ALGORITHM
Classification and Regression Trees (CART) is a data mining algorithm capable of finding hidden patterns in complex data. While forming the decision tree, CART uses the Gini index to decide the splitting node: the Gini index is calculated for all attributes, and the attribute with the smallest Gini index is selected as the splitting attribute. The Gini index is calculated as

G = 1 - \sum_{i=1}^{k} p_i^{2}

where p_i refers to the probability of each factor [15]. After identifying the best split, the search process is repeated recursively until splitting stops. Once a decision tree has been generated it is pruned by removing sections of the tree that have little impact on classifying the data; pruning reduces the prediction error rate. CART can use exhaustive searches as well as computer-based testing to find patterns and relationships in the given data. It can be applied to any dataset and needs very little input from the user [16].

Fig 2: Standard Decision Tree Structure

Once the decision tree has been formed from the training data, classification follows the hierarchical structure down to a leaf node, which specifies the class variable as either Diabetic or Not Diabetic.
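As with the entropy sketch above, the Gini computation can be written in a few lines. This is illustrative only (the toy split is invented) and is not the SimpleCart implementation used in the experiments.

```python
# Sketch of the Gini index used by CART to evaluate a candidate split:
# G = 1 - sum_i p_i^2, computed per node and weighted across child nodes.
from collections import Counter

def gini(labels):
    """Gini impurity of the class labels in one node."""
    total = len(labels)
    return 1.0 - sum((n / total) ** 2 for n in Counter(labels).values())

def gini_of_split(values, labels):
    """Weighted Gini impurity after splitting on an attribute; smaller is better."""
    total = len(labels)
    score = 0.0
    for v in set(values):
        subset = [y for x, y in zip(values, labels) if x == v]
        score += (len(subset) / total) * gini(subset)
    return score

# Toy example: splitting on "increased thirst".
attribute = ["yes", "yes", "no", "no"]
labels    = ["Diabetic", "Diabetic", "Not Diabetic", "Diabetic"]
print(round(gini(labels), 3))                      # impurity before the split
print(round(gini_of_split(attribute, labels), 3))  # weighted impurity after the split
```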
CLASSIFICATION OF DATA SET AND RESULT ANALYSIS
The Pima Indians Diabetes dataset of the National Institute of Diabetes and Digestive and Kidney Diseases, obtained from the UCI Machine Learning Repository, was used. It consists of 9 attributes [10]:
- Number of times pregnant
- Plasma glucose concentration at 2 hours in an oral glucose tolerance test
- Diastolic blood pressure (mm Hg)
- Triceps skin fold thickness (mm)
- 2-hour serum insulin (mu U/ml)
- Body mass index (weight in kg/(height in m)^2)
- Diabetes pedigree function
- Age (years)
- Class variable (0 or 1)

Testing was carried out with the open-source WEKA tool [11], a collection of machine learning algorithms for data mining tasks, on the Pima Indians dataset of 768 records. Simulations were conducted for all three algorithms: for C4.5, its Java implementation J48 was used; a SimpleCart decision tree was generated to obtain the CART results; and for Naïve Bayes, the Naïve Bayes scheme from WEKA's Bayesian classifiers group was used. In the first case (shown in blue in Fig 3), 10% of the data was supplied as training data and the remaining 691 instances were tested. In the second case (shown in red in Fig 3), 80% of the data was supplied as training data and the remaining 154 instances were tested.
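For readers who want to reproduce a comparable experiment outside WEKA, the sketch below runs the same two train/test splits with scikit-learn. It is only a rough stand-in: GaussianNB and DecisionTreeClassifier (with the gini and entropy criteria) substitute for WEKA's NaiveBayes, SimpleCart and J48, the file path pima.csv is hypothetical, and the accuracies will not exactly match the WEKA figures reported here.

```python
# Rough Python stand-in for the WEKA experiment described above.
# Assumptions: pima.csv holds the 768 records with the class in the last
# column (hypothetical local path); the scikit-learn classifiers are
# substitutes for WEKA's NaiveBayes, SimpleCart (gini) and J48 (entropy).
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

data = np.loadtxt("pima.csv", delimiter=",")      # hypothetical copy of the dataset
X, y = data[:, :-1], data[:, -1]

classifiers = {
    "Naive Bayes": GaussianNB(),
    "CART (gini)": DecisionTreeClassifier(criterion="gini"),
    "C4.5-like (entropy)": DecisionTreeClassifier(criterion="entropy"),
}

# Case 1: 10% training data; Case 2: 80% training data (as in Fig 3).
for train_fraction in (0.1, 0.8):
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, train_size=train_fraction, random_state=0, stratify=y)
    for name, clf in classifiers.items():
        accuracy = clf.fit(X_tr, y_tr).score(X_te, y_te)
        print(f"train={train_fraction:.0%}  {name}: {accuracy:.3f}")
```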
COMPARATIVE ANALYSIS OF NAÏVE BAYES, CART AND C4.5
After studying the various methods of diagnosing diabetes and performing the experiments on the test samples, we now compare and contrast the results in Table 2.

Fig 3: Correctness rate of Naïve Bayes, CART and C4.5 as obtained using the WEKA tool

Table 2. Comparing Naïve Bayes, CART and C4.5

Interdependence of Parameters
  Naïve Bayes: The factors used for prediction are independent of the presence or absence of other variables; the probability of one factor does not affect the overall classification, and all factors are considered independently.
  CART: Generates a decision tree in which each leaf node depends on its parent nodes; the Gini index is calculated to decide the splitting node, so some nodes contribute more than others to the final classification.
  C4.5: Generates a decision tree in which each leaf node depends on its parent nodes; thus the parent factor leads to the factors in the child nodes.

Error Rate (Figure 3)
  Naïve Bayes: When a small amount of training data is provided, the error rate (approx. 30%) is much lower than that of C4.5.
  CART: Has an error rate of approximately 25% given a small training dataset.
  C4.5: Results obtained from WEKA show a lower error rate (approx. 20%) given a large amount of training data.

Time
  Naïve Bayes: Runtime of 0.02 s.
  CART: Runtime of 0.35 s.
  C4.5: Runtime of 0.08 s.

Advantages
  Naïve Bayes: Scales linearly as the number of predictors and rows increases [13]; handles real as well as discrete data equally well.
  CART: Can generate regression trees, where each leaf predicts a real number and not just a class; can identify the most significant variables and eliminate non-significant ones by referring to the Gini index [12].
  C4.5: Handles continuous as well as discrete attributes [12]; allows the use of missing attribute values.

Disadvantages
  Naïve Bayes: The independence assumption may introduce errors in cases where one feature is strongly dependent on another.
  CART: May produce an unstable decision tree; when the learning sample is modified there may be an increase or decrease in tree complexity and changes in splitting variables and values [12].
  C4.5: Suffers from overfitting whenever the algorithm picks up data with uncommon characteristics [14].

CONCLUSION
We have studied three data mining methods used to detect the presence of diabetes in a person. Experiments were conducted on the Pima Indians diabetes dataset. The evaluation of the results indicates that C4.5 and CART perform better when a large amount of training data is present; their accuracy (approx. 80%) is better than that of Naïve Bayes. However, Naïve Bayes achieves better accuracy than CART given a practical amount of training data, and CART also suffers from a longer execution time than both Naïve Bayes and C4.5. With a small amount of training data, C4.5 has the highest error rate. In the practical application of medical diagnosis the training set size varies, in which case the Naïve Bayes classifier's accuracy, combined with its quick execution even for large datasets, is more promising and should be the preferred approach.

REFERENCES
[1] Diabetes mellitus - Wikipedia, the free encyclopedia [Online]. Available: https://en.wikipedia.org/wiki/Diabetes_mellitus
[2] Diabetes Research [Online]. Available: http://www.diabetesresearch.org/what-is-diabetes
[3] Barakat, N. 2004. Learning-based rule-extraction from support vector machines.
[4] Humar Kahramanli, Novruz Allahverdi. 2008. Design of a hybrid system for the diabetes and heart disease. Expert Systems with Applications 35, 2008, pp. 82-89.
[5] Sathyadevi, G. 2011. Application of CART algorithm in hepatitis disease diagnosis. Recent Trends in Information Technology, pp. 44-78.
[6] A. Ambica, Satyanarayana Gandi, Amarendra Kothalanka. 2013. An Efficient Expert System for Diabetes by Naïve Bayesian Classifier. International Journal of Engineering Trends and Technology (IJETT), Vol. 4, Issue 10, Oct 2013.
[7] Aiswarya Iyer, S. J. 2015. Diagnosis of Diabetes Using Classification Mining Techniques. International Journal of Data Mining & Knowledge Management Process (IJDKP), Vol. 5, No. 1, January 2015.
[8] Sarvar, A., & Vinod Sharma. 2012. Intelligent Naïve Bayes Approach to Diagnose. Jammu: Special Issue of International Journal of Computer Applications (0975-8887) on Issues and Challenges in Networking, Intelligence and Computing Technologies - ICNICT 2012, November 2012.
[9] C4.5 Algorithm [Online]. Available: https://en.wikipedia.org/wiki/C4.5_algorithm
[10] UCI Machine Learning Repository - Center for Machine Learning and Intelligent Systems [Online]. Available: http://archive.ics.uci.edu
[11] Weka Tool [Online]. Available: http://www.cs.waikato.ac.nz/ml/weka/
[12] Singh, Soniya & Priyanka Gupta. 2014. Comparative Study ID3, CART and C4.5 Decision Tree Algorithm: A Survey. International Journal of Advanced Information Science and Technology (IJAIST), Vol. 27, No. 27, July 2014.
[13] Oracle Documentation - Naïve Bayes [Online]. Available: http://docs.oracle.com/cd/B28359_01/datamine.111/b28129/algo_nb.htm
[14] Mohammad M. Mazid, A. B. M. Shawkat Ali, Kevin Tickle. Improved C4.5 Algorithm for Rule Based Classification. School of Computing Science, Central Queensland University, Australia.
[15] Ding, W. and Marchionini, G. 1997. A Study on Video Browsing Strategies. Technical Report. University of Maryland at College Park.
[16] Bel, L. 2009. CART algorithm for spatial data: Application to environmental and ecological data. Statistics & Data Analysis, pp. 33-78.