International Journal of Application or Innovation in Engineering & Management (IJAIEM)
Web Site: www.ijaiem.org Email: editor@ijaiem.org
Volume 4, Issue 5, May 2015, ISSN 2319 - 4847

A DATA MINING APPROACH FOR CLASSIFICATION OF HEART DISEASE DATASET USING NEURAL NETWORK

Miss. Manjusha B. Wadhonkar1, Prof. P. A. Tijare2, Prof. S. N. Sawalkar3

1 M.E. Computer Engineering, Computer Science and Engineering Department, Sipna College of Engineering & Technology, Amaravati
2 Associate Professor, Computer Science and Engineering Department, Sipna College of Engineering & Technology, Amaravati
3 Assistant Professor, Computer Science and Engineering Department, Sipna College of Engineering & Technology, Amaravati

ABSTRACT
Data mining is the process of automating information discovery. The artificial neural network (ANN) is a widely used data mining method for extracting patterns. Classification is one of the important data mining techniques for assigning a given set of input data to classes. In this experiment the Cleveland Heart Disease Dataset is classified using a multilayer perceptron (MLP) neural network classifier. Performance measures are compared for the chosen optimal parameters of the MLP neural network. When it is trained and tested over cross validation, a training accuracy of 98±0.5%, a testing accuracy of 98±1.5%, an overall accuracy of 97±1.2%, a sensitivity of 95±0.5% and a specificity of 100% are achieved. This shows the consistent performance of the MLP neural network compared to other models. In this work the heart disease dataset is classified using 13 input attributes as well as 16 input attributes. The accuracy difference between 13 attributes and 16 attributes is 1.67% on training data, 3.7% on testing data and 1.47% overall. The results obtained in this experiment show the efficiency and accuracy of the MLP NN.
Keywords: Heart Disease dataset, MLP, Neural Network, Back-Propagation Algorithm, Classification, PE, Knowledge Data Discovery

1. INTRODUCTION
Data mining is an important step in the discovery of knowledge from large datasets. In recent years data mining has found significance in every field, including healthcare [1]. A major challenge for healthcare organizations is the provision of quality services at affordable costs. Quality service implies diagnosing patients correctly and administering effective treatment [2]. Medical data comprises a number of tests essential to diagnose a particular disease. Integration of clinical decision support with computer-based patient records could reduce medical errors, increase patient safety, decrease unwanted practice variation, and improve patient outcome [15]. Global burden of disease estimates for 2001 by World Bank Country Groups indicate a severity statistic of 25.2% for India in 2001, and according to the literature survey this has since increased to 46% [3]. Effective and efficient automated heart disease classification systems can therefore be beneficial in the healthcare sector. The aim of our work is to introduce a classification approach using a Multilayer Perceptron (MLP) with the Back Propagation learning algorithm on a heart disease dataset.

Classification is one of the important techniques of data mining [5]. Classification is the process of finding a set of models or functions which describe and distinguish data classes or concepts. In classification, the input is a set of data, called the training set, where each record consists of several fields or attributes. One of the attributes, called the classifying attribute, indicates the class to which each record belongs. The aim of classification is to build a model of the classifying attribute based upon the other attributes, so that records which are not from the training dataset can be classified [4]. The artificial neural network (ANN) is a widely used technique for extraction of patterns in data mining.
An ANN has some advantages: it automatically allows arbitrary nonlinear relations between the independent and dependent variables, and it allows all possible interactions between the dependent variables. Because of these advantages, the neural network technique is adopted for the classification of the dataset [5]. A parallel processing approach is implemented at each node to increase the efficiency of classification.

2. LITERATURE SURVEY AND RELATED WORK
Integration of clinical decision support with computer-based patient records could reduce medical errors, increase patient safety, decrease unwanted practice variation, and improve patient outcome [15]. Global burden of disease estimates for 2001 by World Bank Country Groups indicate a severity statistic of 25.2% for India in 2001, and according to the literature survey this has since increased to 46% [6]. In spite of the rapid development of pathological research, more than 60,000 people die suddenly each year in India due to cardiovascular diseases [5]. A number of techniques have been used for identification or prediction of heart diseases, such as waveform analysis, time-frequency analysis, complexity measures and Neuro-Fuzzy RBF NN, but it has been observed that classification accuracies reached only up to 79% [5]. Classification of the heart disease dataset using ANN with feature selection gives only about 80% accuracy, so there is still scope for improvement by choosing an appropriate NN model [7]. The above analysis shows that neural networks with 8 input attributes and with 13 input attributes have achieved an accuracy of approximately 81% so far [8].

3. DATASET FOR THE CLASSIFICATION OF HEART DISEASE
Data for the classification of the heart disease dataset is obtained from four different datasets of the UCI Center for Machine Learning and Intelligent Systems [5]. The database contains a total of 76 attributes, but for classification a subset of 17 of them is used, namely: Age (in years), Sex, Chest pain type, Resting blood pressure, Serum cholesterol, Fasting blood sugar, Resting ECG, Maximum heart rate achieved, Exercise induced angina, ST depression induced by exercise relative to rest, Slope of the peak exercise ST segment, Number of major vessels, Number of cigarettes per day, Years as a smoker and fam_history. The last feature is the output, based on the classification of heart disease. Table 1 shows the names of the datasets and their number of instances [9].

Table 1: Database Names and their Instances [5]

Name of Database    Number of instances
Cleveland           303
Hungarian           294
Switzerland         123
Long Beach V.A      200

The goal or output field is a five-bit value which represents five different classes: class 0 - normal person, class 1 - first stroke, class 2 - second stroke, class 3 - third stroke and class 4 - end of life.

4. DESIGN OF CLASSIFICATION SYSTEM
The design of the neural network mainly consists of the topology (i.e. the arrangement of processing elements (PEs), connections and patterns in the neural network) and the architecture of the network [10]. For the classification of the heart disease dataset using 13 input attributes and one output, testing results give a maximum accuracy of 90.6% for a single-layer network and 94% for a multilayer feed-forward network [11]. To increase the accuracy of classification, our system uses three additional input attributes that increase the risk of cardiovascular disease: number of cigarettes per day, years as a smoker and fam_history. Thus this system is an attempt to introduce a classification approach using a Multilayer Perceptron (MLP) with the back-propagation algorithm which includes 16 input attributes and an output attribute.
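The five-bit output field described in Section 3 can be illustrated with a short sketch. It assumes that class labels are stored as integers 0 to 4; the names `NUM_CLASSES`, `encode_class` and `decode_class` are hypothetical and are not taken from the original system.

```python
NUM_CLASSES = 5  # classes 0..4 as listed in Section 3

def encode_class(label):
    """Return a five-element list with a 1 at the position of the class."""
    if not 0 <= label < NUM_CLASSES:
        raise ValueError(f"class label must be in 0..{NUM_CLASSES - 1}, got {label}")
    return [1 if k == label else 0 for k in range(NUM_CLASSES)]

def decode_class(outputs):
    """Pick the class whose output unit has the largest value (an assumed rule)."""
    return max(range(len(outputs)), key=lambda k: outputs[k])

print(encode_class(0))  # [1, 0, 0, 0, 0]
print(encode_class(1))  # [0, 1, 0, 0, 0]
```

Decoding by the largest output is one common choice for turning the five network outputs back into a class; the paper does not state which rule its system uses.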
The output attribute is the resultant class to which the patient belongs and is displayed as a combination of five bits: [1 0 0 0 0] represents class 0, [0 1 0 0 0] represents class 1, and so on. Attribute values for cig-per-day, yrs and family-history are tabulated in Table 2.

Table 2: Proposed attribute values

Name             Description        Range
Cig-per-day      Value              Continuous
Yrs              Value              Continuous
Family history   1=True, 0=False    1, 0

4.1 Multilayer Perceptron Neural Network
For more complex decision functions, the inputs are fed into a number of perceptron nodes, each with its own set of weights and threshold [5]. The outputs of these nodes are given as inputs to another layer of nodes. The output of the final layer of nodes is the output of the network. Such a network is termed an MLP [7]. The layers of nodes whose inputs and outputs are seen only by other nodes are termed hidden [8]. The back-propagation learning algorithm, a supervised learning method, is used to compute the connection weights. There are different variants of the back-propagation algorithm in the literature [12]-[13].

4.2 Data Flow for the Classification of Heart Disease Dataset
The workflow for the classification of the dataset is shown in Figures 1(a) and 1(b), which give a brief description of the fundamental steps to be followed to apply ANNs for the classification of the heart disease dataset.

[Flowchart, Figure 1(a): Start; Collect patient data from database; Preprocessing; ANN training using back propagation algorithm; Are test results OK? (No / Yes); Trained and verified ANN; Stop]
[Flowchart, Figure 1(b): Start; New patient data acquisition; Preprocessing; Processing for prediction of class; Predicted result; Stop]

Figure 1 (a): Training Procedure for Classification of Heart Disease Dataset Using MLP Network [5].
Figure 1 (b): Testing of ANN based Classification of Heart Disease Dataset for new patient data [5].

4.2.1 Data Collection
The neural network is trained using example cases from the Cleveland dataset. This dataset consists of patient records stored in a database. The database contains a number of reliable examples to be given as input to the training network.

4.2.2 Pre-processing
Data in the training dataset must be pre-processed before evaluation by the neural network. Input data are scaled to the interval (0, 1) because the transfer function used is the logistic one. During pre-processing of the Cleveland dataset, 11 records were found to contain missing attribute values; these are removed from the dataset to improve the classification performance. Thus a total of 272 records are given as input to the neural network.

4.2.3 Training & verification using ANN
The neural network is trained with the heart disease database using a feed-forward neural network model and the back-propagation learning algorithm with momentum and variable learning rate. The input layer of the network consists of 16 neurons, one for each of the 16 attributes in the database. Several neural networks with a single hidden layer are constructed and trained with the heart disease dataset. A maximum number of epochs, within which the training is expected to converge, is selected prior to training. Convergence is said to be achieved when the error between the output generated by the trained network and the actual output from the database falls within an error limit preset before the training. If convergence is not achieved, training is carried out with a new network configuration (i.e. a different hidden neuron count).
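The training procedure described in this subsection can be sketched in a few lines of plain Python. This is only an illustrative sketch, not the authors' implementation: it assumes inputs that are already scaled, tanh units in both layers, online weight updates, and a fixed step size with momentum (the variable learning rate is left out). The momentum value 0.7 and the function names `train_mlp` and `train_until_converged` are assumptions made for this example.

```python
import math
import random

def train_mlp(samples, targets, n_hidden, step=0.1, momentum=0.7,
              error_limit=0.1, max_epochs=1000, seed=0):
    """Train a one-hidden-layer tanh MLP with back-propagation and momentum.

    Returns ((hidden_weights, output_weights), epochs_run, converged).
    """
    rng = random.Random(seed)
    n_in, n_out = len(samples[0]), len(targets[0])
    # Each row holds the weights of one neuron; the last entry is its bias.
    w_h = [[rng.uniform(-0.5, 0.5) for _ in range(n_in + 1)] for _ in range(n_hidden)]
    w_o = [[rng.uniform(-0.5, 0.5) for _ in range(n_hidden + 1)] for _ in range(n_out)]
    v_h = [[0.0] * (n_in + 1) for _ in range(n_hidden)]   # momentum terms
    v_o = [[0.0] * (n_hidden + 1) for _ in range(n_out)]

    for epoch in range(1, max_epochs + 1):
        worst = 0.0  # largest absolute output error seen during this epoch
        for x, t in zip(samples, targets):
            xb = list(x) + [1.0]
            h = [math.tanh(sum(w * a for w, a in zip(row, xb))) for row in w_h]
            hb = h + [1.0]
            y = [math.tanh(sum(w * a for w, a in zip(row, hb))) for row in w_o]
            worst = max(worst, max(abs(tk - yk) for tk, yk in zip(t, y)))
            # Output deltas for squared error with tanh units.
            d_o = [(tk - yk) * (1.0 - yk * yk) for tk, yk in zip(t, y)]
            # Hidden deltas: propagate the output deltas back through w_o.
            d_h = [(1.0 - h[j] * h[j]) * sum(d_o[k] * w_o[k][j] for k in range(n_out))
                   for j in range(n_hidden)]
            for k in range(n_out):
                for j in range(n_hidden + 1):
                    v_o[k][j] = momentum * v_o[k][j] + step * d_o[k] * hb[j]
                    w_o[k][j] += v_o[k][j]
            for j in range(n_hidden):
                for i in range(n_in + 1):
                    v_h[j][i] = momentum * v_h[j][i] + step * d_h[j] * xb[i]
                    w_h[j][i] += v_h[j][i]
        if worst <= error_limit:  # converged within the preset error limit
            return (w_h, w_o), epoch, True
    return (w_h, w_o), max_epochs, False

def train_until_converged(samples, targets, hidden_counts, **kwargs):
    """Try each hidden-neuron count in turn until one configuration converges."""
    for n_hidden in hidden_counts:
        weights, epochs, converged = train_mlp(samples, targets, n_hidden, **kwargs)
        if converged:
            return n_hidden, weights, epochs
    return None  # no configuration converged within max_epochs

# Toy usage with two made-up records; real inputs would have 16 attributes.
toy_x = [[0.0, 1.0], [1.0, 0.0]]
toy_t = [[1.0, 0.0], [0.0, 1.0]]
print(train_until_converged(toy_x, toy_t, hidden_counts=[2, 4]) is not None)
```

The outer function mirrors the rule stated above: when one hidden-neuron count fails to converge within the epoch budget, the next configuration is tried.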
Figures 2(a) and 2(b) show the training graph and the error graph, which depict the actual output, the output predicted by the trained neural network and the absolute error difference between the actual and predicted output.

Figure 2(a): Training graph converged at 704 epochs. X axis = Number of epochs, Y axis = Error difference
Figure 2(b): Error graph converged at 704 epochs. X axis = Number of instances, Y axis = Scaled output value and error difference

4.2.4 Verification
Once convergence is achieved the ANN is declared to be trained and its verification is initiated. Verification is similar to the check carried out during training, comparing the predicted outputs of the ANN with the actual ones; the only difference is that the dataset used is different from the one used in training. Once the verification results match, the ANN is declared trained and verified for application purposes. Periodic verification of the ANN, and retraining if verification fails, is a normal process with ANNs.

4.2.5 Testing
Once an ANN is declared to be trained and verified it is usable for application to the classification problem. In this phase it is provided with a new user's heart disease data and asked to classify it. The results it generates are taken as correct.

4.3 Architecture for the Classification of Heart Disease Dataset
The architecture for the classification of the heart disease dataset is shown in Figure 3. Initially the Cleveland database (76 attributes) and its subset of 16 attributes were acquired, and a database structure for the system was set in place for loading the database as well as for presenting help on the database. A scalable approach is used through a Database module which uses two scripts labeled “Database Info” and “Database Load”.
The first one provides information about the database features/attributes and their naming; the second one loads the database into memory for processing.

[Figure 3 block diagram components: GUI for the Classification of Heart Disease Dataset Using MLP with Back Propagation Algorithm; Database Modules (Load & Info.); Preprocessing Modules; Training Module; Verification Module; Application Module; Training Result; Verification Result; Classification Result; Heart Database]

Figure 3: Architecture of the System for classification of Heart Disease dataset using MLP [5]

Training Module: The ANN is trained using an MLP with the back-propagation learning algorithm.
Training Result: The predicted results, obtained by summing the inputs multiplied by the adjusted weights.
Verification Module: In this module the predicted output of the ANN is compared with the actual output.
Verification Result: Once the verification results match, the ANN is declared trained and verified for application purposes.
Application Module: Once an ANN is declared to be trained and verified, the weights from the input to the hidden layer and from the hidden to the output layer are stored and reloaded for application to the classification problem. In this phase the module is provided with a new patient's heart disease data as input and displays the class to which the patient belongs.
Classification Result: For the heart disease data of any new patient, it reports whether the patient is a healthy person or, if not, to which class the patient belongs. If the input is given as a file containing patient records, the classification result takes the form of a confusion matrix.

5. EXPERIMENTAL RESULTS AND DISCUSSION OF MLP NN CLASSIFIER
The network is trained several times with different random initializations of the connection weights to ensure true learning. Training terminates when it converges, i.e. when the error difference between the actual and predicted output is less than or equal to the error limit. The results also establish that the 90% training and 10% testing data partition gives the best results. The results indicate that the transfer function of the neurons in the hidden layer as well as the output layer should be tanh (hyperbolic tangent). Details of the training algorithm and its parameters are given in Table 3. The MLP neural network is trained using the back-propagation algorithm with a supervised learning rule. The designed classifier is evaluated on cross validation with regard to percent classification accuracy, specificity and sensitivity.

Table 3: Variable Parameters of MLP NN (16-60-05)

Parameter                                             Range                                                                   Optimal value
Exemplars for training                                10% to 90%                                                              90%
Exemplars for cross validation                        10% to 90%                                                              10%
Number of epochs                                      1000 to 10000                                                           Class 0: 704, Class 1: 954, Class 2: 604, Class 3: 1742, Class 4: 689
Number of hidden layers                               1 to 3                                                                  1
Number of hidden neurons                              2 to 100                                                                60
Transfer function of the neurons in hidden layer      Tanh, Sigmoid, Linear tanh, Log sigmoid, Bias axon, Linear axon, Axon   Tanh
Transfer function of the neurons in the output layer  Tanh, Sigmoid, Linear tanh, Log sigmoid, Bias axon, Linear axon, Axon   Tanh
Supervised learning rule                              Step, Momentum, Conjugate Gradient (CG)                                 Step
Step size at hidden and output layer                  0 to 1                                                                  0.1
Error limit                                           0 to 0.9                                                                0.1

5.1 Selection of Error Criteria
Normally the Euclidean or L2 norm is used.
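As a point of reference for the error norms discussed in this subsection, the sketch below computes the error between a desired output vector and a network output under the L1, L2 (Euclidean) and maximum norms. The function name `error_norm` and the sample vectors are made up for illustration and do not come from the experiment.

```python
import math

def error_norm(desired, predicted, p=2.0):
    """Lp norm of the difference between desired and predicted outputs."""
    diffs = [abs(d - y) for d, y in zip(desired, predicted)]
    if math.isinf(p):
        return max(diffs)  # maximum (L-infinity) norm
    return sum(v ** p for v in diffs) ** (1.0 / p)

desired = [1, 0, 0, 0, 0]               # class 0 in five-bit form
predicted = [0.7, 0.2, 0.1, 0.0, 0.0]   # hypothetical network output
for p in (1.0, 2.0, math.inf):
    print(p, error_norm(desired, predicted, p))
```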
When a problem incorporates a very high degree of nonlinearity, different error norms can be examined for their suitability in computing the error between the output of the NN model and the desired output. For the MLP NN, the L2 norm provides the highest classification accuracy on training, testing and cross validation.

5.2 Performance Measures of MLP NN
The proposed neural network is trained using the back-propagation algorithm, and confusion matrices are computed over cross validation to ensure that its performance does not depend on any specific data partitioning scheme. For this, rows are selected randomly by a factor n which depends on the data partitioning percentage of training and cross validation. Table 4 shows the performance measures for the MLP NN classifier with different datasets with respect to normal and diseased heart instances.

Table 4: Performance Measures for MLP NN Classifiers

Data set                          % Accuracy (90% training data)   % Accuracy (10% testing data)   % Sensitivity   % Specificity   Error limit
13:60:05 MLP, 90% training data   96.69%                           96.29%                          92.56%          100%            0.0001
16:60:05 MLP, 90% training data   98.367%                          100%                            95.86%          100%            0.1

From the above table it follows that the MLP NN as a classifier in this work possesses more learning ability compared to previously implemented techniques. The most important criterion in this work is the extent to which the MLP NN classifier is able to correctly classify the exemplars [16]. To confirm whether the proposed model is consistently capable of more accurate classification, different data partition sets are used to train the network. As per the confusion matrices, the MLP neural classifier has the advantage of reducing misclassifications among neighboring classes compared to other NN classifiers [13].

6. RESULTS AND DISCUSSION

Table 5: Performance comparison of the proposed technique with others on the same dataset

Technique                               Performance (% accuracy, % sensitivity, % specificity, error limit)         Reference
Previous technique, 13 input 2 output   Accuracy 94%, Error rate 0.1                                                 [11]
13:60:05 MLP, 90% train data            Accuracy 96.69%, Sensitivity 92.56%, Specificity 100%, Error rate 0.0001
16:60:05 MLP, 90% train data            Accuracy 98.161%, Sensitivity 96.86%, Specificity 100%, Error rate 0.1

The performance comparison of the proposed technique with others on the same dataset in Table 5 shows that the proposed MLP NN classifier with 16 input attributes clearly outperforms earlier researchers' techniques. Previous related research on the heart disease dataset reports 94% accuracy. With the selected parameters of the MLP NN, when it is trained several times and tested over cross validation, an overall accuracy of 98.16%, a sensitivity of 95.86% and a specificity of 100% are achieved, which shows more consistent performance than other neural networks.

7. CONCLUSION
The proposed neural network method proved to be reliable for diagnosis of angina in patients with heart disease. Additional studies with a larger number of patients are required to improve the accuracy and usefulness of the artificial neural network. It is observed that the MLP NN is the fastest, is simple in design, and gives the lowest MSE and the highest accuracy. Given the wide range of applicability of ANNs, neural networks are well suited to solving complex problems due to their ability to learn complex and nonlinear relationships, including from noisy or less precise information. From the design of the neural networks, it is evident that the MLP NN requires a compact architecture compared to other NNs in terms of the number of hidden nodes required for the classification.
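The size of the 16-60-05 architecture reported in Table 3 can be made concrete by counting its weights and biases. The sketch below assumes fully connected layers with one bias per neuron, which the paper does not state explicitly; `count_parameters` is a hypothetical helper name.

```python
def count_parameters(layer_sizes):
    """Number of weights plus biases in a fully connected feed-forward network."""
    return sum((n_in + 1) * n_out
               for n_in, n_out in zip(layer_sizes, layer_sizes[1:]))

print(count_parameters([16, 60, 5]))  # 17*60 + 61*5 = 1325
print(count_parameters([13, 60, 5]))  # 14*60 + 61*5 = 1145
```

Under this assumption, moving from 13 to 16 inputs adds 3 x 60 = 180 weights to the hidden layer.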
The number of parameters, such as weights and biases, required for designing the MLP NN is considerably lower than for other networks. This simple and compact structure indicates the feasibility of the MLP NN for online implementation and hardware implementation [14]. Whenever new dataset findings are listed, this classification system can be retrained to accommodate the new knowledge. This MLP NN classifier can be used to assist physicians in detecting the heart disease class for preliminary diagnosis, and thus help them improve the diagnosis of heart disease.

REFERENCES
[1] Anamica Gupta, Naveen Kumar and Vasuda Bhatnagar, “Analysis of Medical Data using Data Mining and Formal Concept Analysis”, Proceedings of World Academy of Science, Engineering and Technology, Vol. 6, June 2005.
[2] Bonow, Libby, Mann, Zipes, “Heart Disease: A Textbook of Cardiovascular Medicine”, Eighth edition, Saunders, Elsevier, 2006.
[3] Mathers C.D., Lopez A., Stein D., “Deaths and disease burden by cause: Global burden of Disease estimates by World Bank Country Group”, 2004.
[4] John Shafer, Rakesh Agarwal, and Manish Mehta, “SPRINT: A Scalable Parallel Classifier for Data Mining”, In Proceedings of the VLDB Conference, Bombay, India, 1996.
[5] Manjusha B. Wadhonkar, P. A. Tijare, S. N. Sawalkar, “Artificial Neural Network Approach for Classification of Heart Disease Dataset”, International Journal of Application or Innovation in Engineering & Management (IJAIEM), Vol. 3, Issue 4, pp. 388-392, April 2014.
[6] R. Rojas, “Neural Networks: A Systematic Introduction”, Springer-Verlag, 1996.
[7] R. P. Lippmann, “Pattern Classification using Neural Networks”, IEEE Commun. Mag., pp. 47-64, 1989.
[8] Simon Haykin, “Neural Network: A Comprehensive Foundation”, Pearson Prentice Hall, New Delhi, 2007.
[9] Murphy P.M. and Aha D.W., “UCI Machine Learning Databases Repository”, Irvine, CA: University of California, Department of Information and Computer Science, ftp://ftp.ics.uci.edu/pub/machine-learning-databases/heart/, 2004.
[10] Bose, N.K. and Liang, P., “Neural Network Fundamentals with Graphs, Algorithms and Applications”, Tata McGraw-Hill Publishing Company Ltd., New Delhi, 2001.
[11] Dr. K. Usha Rani, “Analysis of Heart Disease Dataset using Neural Network Approach”, International Journal of Data Mining & Knowledge Management (IJDKP), Vol. 1, No. 5, pp. 1-6, September 2011.
[12] Hagan, M.T., Demuth H.B., Beale M.H., “Neural Network Design”, PWS Publishing, Boston, MA, 1997.
[13] Ranjana Raut, Dr. S.V. Dudul, “Intelligent Diagnosis of Heart Diseases using Neural Network Approach”, International Journal of Computer Applications (0975-8887), Vol. 1, No. 2, pp. 97-102, 2010.
[14] Reyneri, L.M., “Implementation Issues of Neuro-Fuzzy Hardware: going towards HW/SW co design”, IEEE Trans. on Neural Networks, Vol. 14, No. 1, pp. 176-194, 2003.
[15] Sahana Devanathan, Ambika R, “Heart Disease Prediction System using Bayes Theorem”, International Journal of Scientific Engineering Research, Vol. 4, Issue 4, pp. 1914-1918, Apr 2013.
[16] Nadir N. Chamiya, Sanjay V. Dudul, “Classification of material type and its surface properties using Digital signal Processing techniques and neural network”, Applied Soft Computing, Elsevier, Vol. 11, Issue 1, pp. 1108-1116, Jan 2011.

AUTHOR
Manjusha B. Wadhonkar, M.E. (Computer Engineering), Second Year, Computer Science and Engineering Department, Sipna College of Engineering and Technology, Amaravati (M.S.).