International Journal of Advanced Computer Engineering and Communication Technology (IJACECT)

Improving the Classification Ratio of the ID3 Algorithm Using Attribute Correlation and a Genetic Algorithm

1 Priti Bhagwatkar, 2 Parmalik Kumar
Department of Computer Science & Engineering, PIT, Bhopal, India; PCST, Bhopal, India
Email: bhagwatkarpriti@gmail.com, Parmalik.kumar@patelcollege.com

Abstract— As the size of the data grows, the classification performance of the ID3 algorithm decreases. Various techniques have been used to improve the ID3 algorithm, such as attribute selection methods, neural networks, and fuzzy-based ant colony optimization. All of these techniques face the problem of attribute correlation, which degrades the performance of the ID3 algorithm. In this paper we propose a GA-based ID3 algorithm for data classification. The proposed algorithm computes attribute correlations and passes them through a genetic algorithm. It is implemented in MATLAB and evaluated on standard data sets from the UCI Machine Learning Repository. Our experimental results show a better classification ratio than ID3 and fuzzy ID3.

Index Terms— ID3, attribute correlation, data mining, GA

I. INTRODUCTION

The diversity and applicability of data mining increase day by day in the fields of engineering and science, for example in prediction and product market analysis. Data mining provides many techniques for mining data in several fields, such as association rule mining, clustering, classification, and emerging techniques such as ensemble classification. An ensemble classifier increases the classification rate and improves on the majority voting of individual classification algorithms such as KNN, decision trees, and support vector machines. A new paradigm for ID3 is the use of a GA technique for the classification of data [1,2].

This paper applies a classification process based on GA selection to the data and proposes an ID3 classifier selection method. In this method, many features are selected for a hybrid process. Then, the standard performance of each feature on the selected ID3 is calculated, and the classifier with the best average performance is chosen to classify the given data. In the computation of the average performance, a weighted average is used; weight values are calculated according to the distances between the given data and each selected feature. There are generally two types of multiple classifier combination: multiple classifier selection and multiple classifier fusion [3,4]. Multiple classifier selection assumes that each classifier has expertise in some local regions of the feature space and attempts to find which classifier has the highest local accuracy in the vicinity of an unknown test sample. This classifier is then nominated to make the final decision of the system [8].

Attribute correlation is a technique for finding the relation between attributes using the correlation coefficient. The correlation coefficient estimates the correlation value, which is then passed to a genetic algorithm. A genetic algorithm is a population-based search technique that looks for the best possible set of values for the classification process.

The decision tree is one of the most widely used classification methods in data mining, and its core problem is the choice of splitting attributes. In the ID3 algorithm, information theory is applied to choose the attribute with the largest information gain as the splitting attribute at each step [10,11,12], and the decision tree is generated recursively until a stopping condition is reached. The ID3 algorithm has already been applied extensively in many fields, but some inherent defects remain. The most obvious one is its bias toward attributes with many values. To correct this bias, researchers have proposed several improvements, for example modifying the information gain by weighting it with the number of attribute values, the users' interestingness, or attribute similarity. However, these methods apply only under specific conditions and restrictions [14,15]. Therefore, building on these research achievements, an improved ID3 based on weighted modified information gain using a genetic algorithm, called GA_ID3, is proposed in this paper.
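As a concrete illustration of the splitting criterion just described, the following short Python sketch computes the entropy and information gain that ID3 uses to choose a splitting attribute. It is only an illustrative sketch: the paper's experiments are run in MATLAB, and the helper names and toy data here are our own.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total)
                for c in Counter(labels).values())

def information_gain(rows, labels, attr_index):
    """Information gain obtained by splitting on the attribute at attr_index."""
    base = entropy(labels)
    # Partition the labels by the value the attribute takes in each row.
    partitions = {}
    for row, label in zip(rows, labels):
        partitions.setdefault(row[attr_index], []).append(label)
    remainder = sum(len(part) / len(labels) * entropy(part)
                    for part in partitions.values())
    return base - remainder

# ID3 splits on the attribute with the largest information gain.
rows = [['sunny', 'hot'], ['sunny', 'cool'], ['rain', 'cool'], ['rain', 'hot']]
labels = ['no', 'yes', 'yes', 'no']
best = max(range(len(rows[0])), key=lambda i: information_gain(rows, labels, i))
print(best, information_gain(rows, labels, best))
```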
The rest of this paper is organized as follows. Section II describes related work on the ID3 algorithm. Section III discusses attribute correlation and the genetic algorithm. Section IV presents the proposed classification methodology. Section V discusses the experimental results, and Section VI concludes the paper.

II. RELATED WORK

This section discusses related work on feature correlation and attribute selection for the ID3 algorithm. Nowadays the ID3 algorithm is used for prediction and classification in data science, but the diversity of data decreases its performance. Various techniques have been used to improve its performance, such as weighting, ant colony optimization, and fuzzy logic. These contributions are discussed below.

[5] The authors proposed a new attribute-based method for multiclass data classification. Graph-based representations have been successfully used to support various machine learning and data mining algorithms. The learning algorithms rely strongly on the procedure employed for constructing the graph from the input data, given as a set of vector-based patterns. A popular way to build such graphs is to treat each data pattern as a vertex; vertices are then connected according to some similarity measure, resulting in a structure known as a data graph.

[6] The authors proposed an improved decision tree ID3 algorithm. The decision tree is an important method for both induction research and data mining, used mainly for model classification and prediction, and ID3 is the most widely used decision tree algorithm so far. Starting from the basic ideas of decision trees in data mining, the paper discusses ID3's tendency to choose attributes with many values and then presents a new decision tree algorithm combining ID3 and an Association Function (AF).

[1] The authors proposed a new qualitative bankruptcy prediction method. Many qualitative bankruptcy prediction models are available.
These models use non-financial information as qualitative factors to predict bankruptcy. However, existing models use only a small number of qualitative factors, and the generated rules are redundant and overlapping. To improve the prediction accuracy, the authors propose a model that applies a larger number of qualitative factors, which are categorized using a fuzzy ID3 algorithm, while the prediction rules are generated using Ant Colony Optimization (ACO). In fuzzy ID3, the concepts of entropy and information gain help to rank the qualitative parameters, and this ranking can be used to generate prediction rules for qualitative bankruptcy prediction.

[7] The authors proposed a new approach to detecting network anomalies using an improved ID3 with a horizontal-partitioning-based decision tree. During the last decades, different approaches to intrusion detection have been explored; the two most common are misuse detection and anomaly detection. In misuse detection, attacks are detected by matching the current traffic pattern with the signatures of known attacks. Anomaly detection keeps a profile of normal system behavior and interprets any significant deviation from this profile as malicious activity. One of the strengths of anomaly detection is its ability to detect new attacks; its most serious weakness is that it generates too many false alarms.

[2] The authors describe ID3 as a decision-tree-based mining algorithm that selects the attribute with the highest gain as the test attribute for its sample sets, establishes decision nodes, and divides the samples in turn. ID3 involves repeated logarithm operations, which affects the efficiency of generating the decision tree when there is a large amount of data, so the selection criterion for data set attributes must be changed. The authors use the Taylor formula to transform the algorithm, reducing the amount of calculation and the generation time of the decision tree and thus improving the efficiency of the decision tree classifier. It is shown that using the improved ID3 algorithm on customer data samples reduces the computational cost and improves the efficiency of decision tree generation.

[8] To address the same problem, the authors propose a decision tree algorithm based on attribute importance. The improved algorithm uses attribute importance to increase the information gain of attributes that have fewer values, and compares ID3 with the improved ID3 on an example. The experimental analysis shows that the improved ID3 algorithm obtains more reasonable and more effective rules. By introducing attribute importance, the improved algorithm emphasizes attributes with fewer values and higher importance, dilutes attributes with more values and lower importance, and addresses ID3's defect of inclining toward attributes with more values.

[3] The authors proposed a fuzzy decision tree for stock market analysis, which has traditionally proven difficult due to the large amount of noise in the data. Decision trees based on the ID3 algorithm are used to derive short-term trading decisions from candlesticks. To handle the large amount of uncertainty in the data, both inputs and output classifications are fuzzified using well-defined membership functions. Testing results of the derived decision trees show significant gains compared to ideal mid- and long-term trading simulations, both in frictionless and realistic markets.
[9] This paper summarizes the advances in rough set theory (RST), its extensions, and their applications, and identifies important areas that require further investigation. Typical application domains are examined which demonstrate the success of RST in a wide variety of areas and disciplines, and which also exhibit the strengths and limitations of the respective underlying approaches. Formally, a rough set is the approximation of a vague concept (set) by a pair of precise concepts, called the lower and upper approximations, which form a classification of the domain of interest into disjoint categories.

[10] The authors proposed a method for anti-spam filtering. The task of an anti-spam filter is to automatically rule out unsolicited bulk e-mail (junk) from a user's mail stream. The two approaches used for classification are based on fuzzy logic and decision trees, combined to build an automatic anti-spam filter that classifies e-mails as spam or legitimate. The fuzzy-similarity and ID3-based systems derive the classification from training data using learning techniques: the fuzzy method uses fuzzy sets, and the decision tree method uses a set of heuristic rules to classify e-mail messages.

[11] The authors study various data mining algorithms based on decision trees. A decision tree algorithm is a data mining model for induction learning from examples: it makes it easy to extract explicit rules, has a small computational cost, can display important decision attributes, and achieves high classification precision. For the problems of missing attribute values, multi-valued attribute selection, and attribute selection criteria, the article puts forward specific solutions and proposes introducing a weighted and simplified entropy into the decision tree algorithm so as to improve the ID3 algorithm.

III. ATTRIBUTE CORRELATION AND GENETIC ALGORITHM

The correlation coefficient is a statistical test that measures the strength and quality of the relationship between two variables. Correlation coefficients range from -1 to 1. The absolute value of the coefficient gives the strength of the relationship: absolute values closer to 1 indicate a stronger relationship. The sign gives the direction of the relationship: a positive sign indicates that the two variables increase or decrease together, while a negative sign indicates that one variable increases as the other decreases. In machine learning problems, the correlation coefficient is used to evaluate how accurately a feature predicts the target, independent of the context of the other features. The features are then ranked by their correlation scores [11].

For problems where the covariance cov(X_i, Y) between a feature X_i and the target Y and the variances var(X_i) and var(Y) are known, the correlation can be calculated directly:

R(i) = \frac{\operatorname{cov}(X_i, Y)}{\sqrt{\operatorname{var}(X_i)\,\operatorname{var}(Y)}}    (1)

Equation (1) can only be used when the true values of the covariance and variances are known. When these values are unknown, the correlation can be estimated using Pearson's product-moment correlation coefficient over a sample of the population (x_{k,i}, y_k). This formula only requires the mean of each feature and of the target:

R(i) = \frac{\sum_{k=1}^{m} (x_{k,i} - \bar{x}_i)(y_k - \bar{y})}{\sqrt{\sum_{k=1}^{m} (x_{k,i} - \bar{x}_i)^2 \, \sum_{k=1}^{m} (y_k - \bar{y})^2}}    (2)

where m is the number of data points. Correlation coefficients can be used for both regressors and classifiers. When the learning machine is a regressor, the target may take any value on a ratio scale; when it is a classifier, we restrict the range of target values to ±1. We then use the coefficient of determination, R(i)^2, to enforce a ranking of the features according to the goodness of the linear fit between individual features and the target [25]. When using the correlation coefficient as a feature selection metric, we must remember that it only detects linear relationships between a feature and the target: a feature and the target may be perfectly related in a non-linear manner, yet the correlation may equal 0. This restriction can be lifted by applying simple non-linear pre-processing to the feature before calculating the correlation coefficient, establishing a goodness of fit for a non-linear relationship between the feature and the target [12].
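The ranking step described above can be sketched in a few lines of Python. This is our own illustrative code, not the authors' MATLAB implementation; the function and variable names are hypothetical. It computes the Pearson correlation of each feature with a numeric class target according to equation (2) and ranks the features by R(i)^2.

```python
import numpy as np

def correlation_ranking(X, y):
    """Rank features by squared Pearson correlation with the target.

    X: (m, n) array of m samples and n features.
    y: (m,) array of targets, e.g. +1 / -1 class labels.
    Returns feature indices sorted from most to least correlated, and R(i).
    """
    Xc = X - X.mean(axis=0)            # centre each feature: x_ki - mean(x_i)
    yc = y - y.mean()                  # centre the target: y_k - mean(y)
    num = Xc.T @ yc                    # numerator of equation (2) per feature
    den = np.sqrt((Xc ** 2).sum(axis=0) * (yc ** 2).sum())
    r = num / den                      # Pearson correlation R(i)
    return np.argsort(-r ** 2), r      # rank by coefficient of determination R(i)^2

# Small usage example with synthetic data.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = np.sign(X[:, 2] + 0.1 * rng.normal(size=100))   # target mostly driven by feature 2
order, r = correlation_ranking(X, y)
print(order[0], r[order[0]])                        # feature 2 should rank first
```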
A genetic algorithm is a population-based heuristic used for optimization. It combines survival of the fittest among string structures with a structured yet randomized information exchange to form a search algorithm with some of the innovative flair of human search. The algorithm starts with a set of random solutions called the initial population. Each member of this population is called a chromosome, and each chromosome consists of a string of genes. The number of genes and their values in each chromosome depend on the problem specification. In the algorithm of this paper, the number of genes in each chromosome equals the number of nodes in the tree, and the gene values indicate the selection priority of the classification for the node, where a higher priority means the task must be executed earlier [16]. The set of chromosomes in each iteration of the GA is called a generation, and generations are evaluated by their fitness functions. The new generation, i.e., the offspring, is created by applying operators to the current generation: crossover selects two chromosomes of the current population, combines them, and generates a new child (offspring), while mutation randomly changes some gene values of a chromosome to create a new offspring. The best offspring are then selected by the evolutionary selection operator according to their fitness values. The GA has four steps, as shown in Fig. 1.

Fig. 1: Working process of the genetic algorithm.
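A minimal sketch of the generation loop just described is given below, written in Python under our own assumptions: binary chromosomes, a user-supplied fitness function, truncation selection, single-point crossover, and bit-flip mutation. It only illustrates the initialize/select/crossover/mutate cycle and is not the authors' implementation.

```python
import random

def run_ga(fitness, n_genes, pop_size=20, generations=50,
           crossover_rate=0.8, mutation_rate=0.05):
    """Simple generational GA over binary chromosomes."""
    # Step 1: random initial population.
    pop = [[random.randint(0, 1) for _ in range(n_genes)] for _ in range(pop_size)]
    for _ in range(generations):
        scored = sorted(pop, key=fitness, reverse=True)
        parents = scored[:pop_size // 2]          # Step 2: selection by fitness
        children = []
        while len(children) < pop_size:
            a, b = random.sample(parents, 2)
            if random.random() < crossover_rate:  # Step 3: single-point crossover
                cut = random.randrange(1, n_genes)
                child = a[:cut] + b[cut:]
            else:
                child = a[:]
            child = [1 - g if random.random() < mutation_rate else g
                     for g in child]              # Step 4: bit-flip mutation
            children.append(child)
        pop = children
    return max(pop, key=fitness)

# Usage example: favour chromosomes that switch on exactly three genes.
best = run_ga(lambda c: -abs(sum(c) - 3), n_genes=10)
print(best)
```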
IV. PROPOSED METHODOLOGY

This section discusses the proposed algorithm for data classification. The proposed algorithm is a combination of ID3 and a genetic algorithm. Feature correlation is a very important function for both the genetic algorithm and the ID3 algorithm. The GA is used to train the data, for minority- and majority-class data samples, for the processing of tree classification. The input to the training phase is the data sampling technique for the classifier. While the fitness function selects the initial input of the ID3 algorithm, the GA function, optimized with a single value, may find relationships more quickly. The steps of the proposed method are listed below; a sketch of the overall pipeline follows the list.

1. Sample the data using a sampling technique.
2. Split the data into two parts: a training part and a testing part.
3. Apply the GA function for training on the sample values.
4. Using 2/3 of the sample, fit a tree, choosing the split at each node. For each tree, classify the remaining 1/3 using the tree and calculate the misclassification rate (the out-of-GA error).
5. For each variable in the tree, compute the error rate, i.e., the overall percentage of misclassification.
6. Variable selection: from the average increase in GA error over all trees, and assuming a normal distribution of this increase among the trees, decide an associated value for the feature.
7. Classify with the resulting classifier set. Finally, to estimate the entire model, compute the misclassification and convert the binary attributes back into actual values.
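The following self-contained Python sketch ties the pieces together in the spirit of the steps above. It is an illustrative outline under our own assumptions, not the authors' MATLAB implementation: scikit-learn's DecisionTreeClassifier with the entropy criterion stands in for ID3, the correlation scores follow equation (2), and a simple correlation-biased random search over feature masks stands in for the GA, with mean 10-fold cross-validation accuracy as the fitness.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

def pearson_scores(X, y):
    """Squared Pearson correlation of each feature with the target, as in equation (2)."""
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    r = (Xc.T @ yc) / np.sqrt((Xc ** 2).sum(axis=0) * (yc ** 2).sum())
    return r ** 2

def fitness(mask, X, y):
    """Fitness of a binary feature mask: mean 10-fold CV accuracy of an entropy tree."""
    if not mask.any():
        return 0.0
    tree = DecisionTreeClassifier(criterion="entropy", random_state=0)
    return cross_val_score(tree, X[:, mask], y, cv=10).mean()

X, y = load_iris(return_X_y=True)
scores = pearson_scores(X, (y == 1).astype(float))   # class 1 vs. the rest as numeric target

# Stand-in for the GA search: sample feature masks biased toward highly
# correlated features and keep the best-scoring one.
rng = np.random.default_rng(0)
best_mask = np.ones(X.shape[1], dtype=bool)
best_fit = fitness(best_mask, X, y)
for _ in range(30):
    mask = rng.random(X.shape[1]) < scores / scores.max()
    fit = fitness(mask, X, y)
    if fit > best_fit:
        best_mask, best_fit = mask, fit

print("selected features:", np.flatnonzero(best_mask), "CV accuracy:", round(best_fit, 4))
```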
V. EXPERIMENTAL RESULT ANALYSIS

For the experimental analysis of the proposed algorithm we collected three data sets from the UCI Machine Learning Repository. The data sets have between 150 and 1000 items and between 4 and 10 features. A few data sets have missing values, which we replaced with negative values. Nominal attributes were converted to integers, numbered from 1 in order of appearance. For data sets with multiple classes, class 1 is used as the positive class and all other classes as the negative class. We used 10-fold cross-validation for each experiment and, over the 10 rounds of cross-validation for each data set, recorded the mean of the average accuracy of the individual classifiers. All experiments were performed in MATLAB 7.8.0, and the results are reported in the tables below.

Table 1: Comparative results on the wine data set
Dataset   Algorithm    Accuracy (%)   Time
Wine      ID3          84.82          34.23
Wine      FUZZY_ID3    87.46          32.82
Wine      ID3-GA       93.17          17.89

Table 2: Comparative results on the iris data set
Dataset   Algorithm    Accuracy (%)   Time
Iris      ID3          84.52          35.11
Iris      FUZZY_ID3    86.76          31.33
Iris      ID3-GA       94.67          16.86

Table 3: Comparative results on the cancer data set
Dataset   Algorithm    Accuracy (%)   Time
Cancer    ID3          83.34          37.43
Cancer    FUZZY_ID3    88.48          34.26
Cancer    ID3-GA       95.23          15.46

Fig. 2: Comparative analysis of classification accuracy and execution time of the three algorithms on the wine data set.

Fig. 3: Comparative analysis of classification accuracy and execution time of the three algorithms on the iris data set.

Fig. 4: Comparative analysis of classification accuracy and execution time of the three algorithms on the cancer data set.

VI. CONCLUSION

In this paper we proposed an optimized ID3 method based on a genetic algorithm. Our method combines a feature correlation factor for attribute selection with the genetic algorithm and ID3; together they form the GA-ID3 model. The GA-ID3 model passes over the data, reduces the amount of unclassified data, and improves the majority voting of the classifier. Our experimental results show better performance in comparison with the traditional ID3 classifier. The experiments were performed on UCI data sets such as wine, iris, and cancer. The model is stable under different machine learning algorithms, data set sizes, and feature sizes.

REFERENCES

[1] A. Martin, Aswathy V., Balaji S., T. Miranda Lakshmi, and V. Prasanna Venkatesan, "An Analysis on Qualitative Bankruptcy Prediction Using Fuzzy ID3 and Ant Colony Optimization Algorithm," IEEE, 2012, pp. 56-67.
[2] Feng Yang, Hemin Jin, and Huimin Qi, "Study on the Application of Data Mining for Customer Groups Based on the Modified ID3 Algorithm in the E-commerce," IEEE, 2012, pp. 78-87.
[3] Carlo Noel Ochotorena, Cecille Adrianne Yap, Elmer Dadios, and Edwin Sybingco, "Robust Stock Trading Using Fuzzy Decision Trees," IEEE, 2012, pp. 24-33.
[4] Joao Roberto Bertini Junior, Maria do Carmo Nicoletti, and Liang Zhao, "Attribute-based Decision Graphs for Multiclass Data Classification," IEEE, 2013, pp. 97-106.
[6] Chen Jin, Luo De-lin, and Mu Fen-xiang, "An Improved ID3 Decision Tree Algorithm," IEEE, 2009, pp. 76-87.
[7] Sonika Tiwari and Roopali Soni, "Horizontal partitioning ID3 algorithm: A new approach of detecting network anomalies using decision tree," IJERT, ISSN: 2278-0181, vol. 1, issue 7, September 2012.
[8] Liu Yuxun and Xie Niuniu, "Improved ID3 Algorithm," IEEE, 2010, pp. 34-42.
[9] N. Mac Parthaláin and Q. Shen, "On rough sets, their recent extensions and applications," The Knowledge Engineering Review, vol. 25, no. 4, pp. 365-395, Cambridge University Press, 2010.
[10] Binsy Thomas and J. W. Bakal, "Fuzzy Similarity and ID3 algorithm for anti spam filtering," IJEA, ISSN: 2320-0804, vol. 2, issue 7, 2013.
[11] Linna Li and Xuemin Zhang, "Study of Data Mining Algorithm Based on Decision Tree," ICCDA, IEEE, 2010, pp. 78-88.
[12] C. H. L. Lee, Y. C. Liaw, and L. Hsu, "Investment Decision Making by Using Fuzzy Candlestick Pattern and Genetic Algorithm," in IEEE International Conference on Fuzzy Systems, 2011, pp. 2696-2701.
[13] W. Bi and J. Kwok, "Multi-label classification on tree and DAG structured hierarchies," in Proceedings of the 28th International Conference on Machine Learning (ICML 2011), ACM, 2011, pp. 17-24.
[14] Narasimha Prasad, Prudhvi Kumar Reddy, and Naidu M. M., "An Approach to Prediction of Precipitation Using Gini Index in SLIQ Decision Tree," 4th International Conference on Intelligent Systems, 2013, pp. 56-60.
[15] B. Chandra and P. Paul Varghese, "Fuzzy SLIQ Decision Tree Algorithm," IEEE Transactions on Systems, Man and Cybernetics, Part B: Cybernetics, vol. 38, 2008.
[16] Sung-Hwan Min, Jumin Lee, and Ingoo Han, "Hybrid genetic algorithms and support vector machines for bankruptcy prediction," Expert Systems with Applications, Elsevier, 2010, pp. 5689-5697.