International Journal of Advanced Computer Engineering and Communication Technology (IJACECT)
Improving the Classification Ratio of the ID3 Algorithm Using Attribute
Correlation and Genetic Algorithm

1Priti Bhagwatkar, 2Parmalik Kumar
Department of Computer Science & Engineering
1PIT, Bhopal, India; 2PCST, Bhopal, India
Email: bhagwatkarpriti@gmail.com, Parmalik.kumar@patelcollege.com
Abstract— As the size of the data increases, the classification performance of the ID3 algorithm decreases. Various techniques have been used to improve the ID3 algorithm, such as attribute selection methods, neural networks, and fuzzy-based ant colony optimization. All of these techniques face the problem of attribute correlation, which degrades the performance of the ID3 algorithm. In this paper we propose a GA-based ID3 algorithm for data classification. The proposed algorithm derives attribute correlations and optimizes them using a genetic algorithm. The algorithm is implemented in MATLAB and evaluated on standard datasets from the UCI machine learning repository. Our experimental results show a better classification ratio than both ID3 and fuzzy ID3.

Index Terms— ID3, attribute correlation, data mining, GA
I. INTRODUCTION

The diversity and applicability of data mining increase day by day in the fields of engineering and science, for example in the prediction and analysis of product markets. Data mining provides many techniques for mining data in several fields, such as association rule mining, clustering, classification, and emerging techniques such as ensemble classification. An ensemble classifier increases the classification rate and improves on the majority voting of individual classification algorithms such as KNN, decision trees and support vector machines. A new paradigm for ID3 is the use of GA techniques for data classification [1,2]. This paper applies a GA-based selection procedure to the data and proposes an ID3 classifier selection method. In this method, many features are selected for a hybrid process. Then the average performance of each feature on the selected ID3 classifier is calculated, and the classifier with the best average performance is chosen to classify the given data. In the computation of this average, a weighted average is used. Weight values are calculated according to the distances between the given data and each selected feature.
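A minimal sketch of this distance-weighted selection is given below. The paper reports a MATLAB implementation; this Python fragment is only an illustrative approximation of the idea, and the names (select_classifier, ref_X, ref_y) are ours rather than the authors'.

```python
import numpy as np

def select_classifier(x_query, ref_X, ref_y, classifiers, eps=1e-9):
    """Pick the classifier with the best distance-weighted local accuracy.

    ref_X, ref_y : reference samples with known labels (e.g. validation data).
    classifiers  : list of fitted objects exposing .predict(X).
    Weights are inverse distances between x_query and each reference sample,
    so samples in the vicinity of the query dominate the accuracy estimate.
    """
    dists = np.linalg.norm(ref_X - x_query, axis=1)
    weights = 1.0 / (dists + eps)          # closer samples get larger weights
    weights /= weights.sum()

    best_clf, best_score = None, -np.inf
    for clf in classifiers:
        correct = (clf.predict(ref_X) == ref_y).astype(float)
        score = float(np.dot(weights, correct))   # weighted local accuracy
        if score > best_score:
            best_clf, best_score = clf, score
    return best_clf, best_score
```

The selected classifier is then used to label x_query, which mirrors the "highest local accuracy in the vicinity of an unknown test sample" criterion discussed next.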
There are generally two types of multiple classifier combination: multiple classifier selection and multiple classifier fusion [3,4]. Multiple classifier selection assumes that each classifier has expertise in some local region of the feature space and attempts to find which classifier has the highest local accuracy in the vicinity of an unknown test sample; this classifier is then nominated to make the final decision of the system [8]. Attribute correlation is a new method for finding the relationship between attributes using the correlation coefficient. The correlation coefficient estimates the correlation value, which is then passed to a genetic algorithm. A genetic algorithm is a population-based search technique that finds the best possible set of values for the classification process. The decision tree is one of the most widely used classification methods in data mining, and its core problem is the choice of splitting attributes. In the ID3 algorithm, information theory is applied to choose, at each step, the attribute with the largest information gain as the splitting attribute [10,11,12], and the decision tree is generated recursively until a stopping condition is reached. The ID3 algorithm has already been applied extensively in many fields, but some inherent defects remain; the most obvious one is its bias towards attributes with many values. To reduce this bias, researchers have proposed many improved methods, such as modifying the information gain of an attribute by weighting it with the number of attribute values, or weighting the information gain with the users' interestingness or attribute similarity. However, these methods carry specific conditions and restrictions [14,15]. Therefore, building on these research achievements, an improved ID3 based on weighted modified information gain and a genetic algorithm, called GA_ID3, is proposed in this paper.

The preceding paragraphs introduced the ID3 algorithm and attribute correlation. Section II describes related work on the ID3 algorithm. Section III discusses attribute correlation and the genetic algorithm. Section IV presents the proposed classification methodology. Section V discusses the experimental results, and Section VI concludes the paper.
II. RELATED WORK

This section discusses related work on feature correlation and attribute selection for the ID3 algorithm. Nowadays the ID3 algorithm is widely used for prediction and classification in data science, but the diversity of the data decreases its performance. Various techniques have therefore been used to improve the performance of ID3, such as weighting techniques, ant colony optimization and fuzzy logic. The main contributions to the improvement of the ID3 algorithm are discussed below.

[4] The authors propose an attribute-based method for multiclass data classification. Graph-based representations have been used successfully to support various machine learning and data mining algorithms. These learning algorithms rely strongly on the algorithm employed to construct the graph from the input data, given as a set of vector-based patterns. A popular way to build such graphs is to treat each data pattern as a vertex; vertices are then connected according to some similarity measure, resulting in a structure known as a data graph.
[6] The authors propose an improved ID3 decision tree algorithm. The decision tree is an important method for both induction research and data mining, mainly used for model classification and prediction, and ID3 is the most widely used decision tree algorithm so far. After illustrating the basic ideas of decision trees in data mining, the paper discusses ID3's shortcoming of preferring attributes with many values and then presents a new decision tree algorithm combining ID3 with an Association Function (AF).
[1] The authors propose a new qualitative bankruptcy prediction method. Many qualitative bankruptcy prediction models are available; these models use non-financial information as qualitative factors to predict bankruptcy. However, existing models use only a small number of qualitative factors, and the generated rules are redundant and overlapping. To improve prediction accuracy, the authors propose a model that applies a larger number of qualitative factors, which are categorized using the fuzzy ID3 algorithm, and prediction rules are generated using the Ant Colony Optimization (ACO) algorithm. In fuzzy ID3, the concepts of entropy and information gain help to rank the qualitative parameters, and this ranking can be used to generate prediction rules for qualitative bankruptcy prediction.
[7] The authors propose a new approach for detecting network anomalies using improved ID3 with a horizontal-partitioning-based decision tree. During the last decades, different approaches to intrusion detection have been explored; the two most common are misuse detection and anomaly detection. In misuse detection, attacks are detected by matching the current traffic pattern with the signatures of known attacks. Anomaly detection keeps a profile of normal system behaviour and interprets any significant deviation from this normal profile as malicious activity. One of the strengths of anomaly detection is its ability to detect new attacks; its most serious weakness is that it generates too many false alarms.
[2] In this paper the authors describe ID3 as a decision-tree-based mining algorithm that selects the attribute with the highest information gain as the test attribute for its sample sets, establishes a decision node, and partitions the samples in turn. The ID3 algorithm involves repeated logarithm operations, which affects the efficiency of generating the decision tree when there is a large amount of data. The authors therefore change the attribute selection criterion, using the Taylor formula to transform the calculation, reducing the amount of computation and the generation time of the decision tree and thus improving the efficiency of the decision tree classifier. They show that using the improved ID3 algorithm on customer data samples reduces the computational cost and improves the efficiency of decision tree generation.
[8] To solve this problem, the authors propose a decision tree algorithm based on attribute importance. The improved algorithm uses attribute importance to increase the information gain of attributes that have fewer values, and compares ID3 with the improved ID3 on an example. The experimental analysis shows that the improved ID3 algorithm obtains more reasonable and more effective rules. By introducing attribute importance, the improved algorithm emphasizes attributes with fewer values and higher importance, dilutes attributes with more values and lower importance, and corrects the classification defect of preferring attributes with more values.
[3] The authors propose a fuzzy decision tree for stock market analysis, a task that has traditionally proven difficult due to the large amount of noise in the data. Decision trees based on the ID3 algorithm are used to derive short-term trading decisions from candlestick patterns. To handle the large amount of uncertainty in the data, both the inputs and the output classifications are fuzzified using well-defined membership functions. Testing of the derived decision trees shows significant gains compared to ideal mid- and long-term trading simulations, in both frictionless and realistic markets.
[9] This paper summarizes the advances in rough set theory (RST), its extensions, and their applications, and identifies important areas that require further investigation. Typical example application domains are examined which demonstrate the success of applying RST to a wide variety of areas and disciplines, and which also exhibit the strengths and limitations of the respective underlying approaches. Formally, a rough set is the approximation of a vague concept (set) by a pair of precise concepts, called the lower and upper approximations, which are obtained from a classification of the domain of interest into disjoint categories.
[10] The authors propose a method for anti-spam filtering. The task of an anti-spam filter is to automatically rule out unsolicited bulk e-mail (junk) from a user's mail stream. The two classification approaches used here are based on fuzzy logic and decision trees, and they are used to build an automatic anti-spam filter that classifies e-mails as spam or legitimate. The fuzzy-similarity- and ID3-based systems derive the classification from training data using learning techniques; the fuzzy method uses fuzzy sets, and the decision tree method uses a set of heuristic rules to classify e-mail messages.

[11] The authors study various data mining algorithms based on decision trees. A decision tree algorithm is a kind of data mining model for inductive learning from examples. It makes it easy to extract explicit rules, has a small computational cost, can highlight important decision attributes, and has high classification precision. For the study of decision-tree-based data mining algorithms, the article puts forward specific solutions for the problems of missing attribute values, multi-valued attribute selection and attribute selection criteria, and proposes introducing a weighted and simplified entropy into the decision tree algorithm so as to improve the ID3 algorithm.
III. ATTRIBUTE CORRELATION & GA ALGORITHM
The correlation coefficient is a statistical measure of the strength and quality of the relationship between two variables. Correlation coefficients range from -1 to 1. The absolute value of the coefficient gives the strength of the relationship: absolute values closer to 1 indicate a stronger relationship. The sign of the coefficient gives the direction of the relationship: a positive sign indicates that the two variables increase or decrease together, and a negative sign indicates that one variable increases as the other decreases. In machine learning problems, the correlation coefficient is used to evaluate how accurately a feature predicts the target independently of the context of the other features; the features are then ranked by their correlation scores [11]. For problems where the covariance cov(X_i, Y) between a feature X_i and the target Y, and the variances var(X_i) and var(Y), are known, the correlation can be calculated directly:

\[
R(i) = \frac{\operatorname{cov}(X_i, Y)}{\sqrt{\operatorname{var}(X_i)\,\operatorname{var}(Y)}} \qquad (1)
\]

Equation (1) can only be used when the true values of the covariance and variances are known. When these values are unknown, the correlation can be estimated using Pearson's product-moment correlation coefficient over a sample of the population. This estimate only requires the mean of each feature and of the target:

\[
R(i) = \frac{\sum_{k=1}^{m}\left(x_{k,i}-\bar{x}_i\right)\left(y_k-\bar{y}\right)}{\sqrt{\sum_{k=1}^{m}\left(x_{k,i}-\bar{x}_i\right)^2\;\sum_{k=1}^{m}\left(y_k-\bar{y}\right)^2}} \qquad (2)
\]

where m is the number of data points. Correlation coefficients can be used for both regressors and classifiers. When the learning machine is a regressor, the target values may lie on any ratio scale; when the learning machine is a classifier, we restrict the target values to ±1. We then use the coefficient of determination, R(i)^2, to enforce a ranking of the features according to the goodness of the linear fit between the individual features and the target [25]. When using the correlation coefficient as a feature selection metric, we must remember that the correlation only detects linear relationships between a feature and the target; a feature and the target may be perfectly related in a non-linear manner while their correlation is 0. This restriction can be lifted by applying simple non-linear pre-processing to the feature before computing the correlation coefficient, so as to assess the goodness of a non-linear fit between the feature and the target [12].
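The ranking described by equations (1) and (2) can be made concrete with a short sketch. This is an illustrative Python fragment, not the paper's MATLAB code; the function names pearson_r and rank_features are ours.

```python
import numpy as np

def pearson_r(x, y):
    """Sample estimate of the correlation coefficient, equation (2)."""
    x_c = x - x.mean()
    y_c = y - y.mean()
    denom = np.sqrt(np.sum(x_c**2) * np.sum(y_c**2))
    return 0.0 if denom == 0 else float(np.sum(x_c * y_c) / denom)

def rank_features(X, y):
    """Rank features by R(i)^2, the coefficient of determination.

    X : (m, n) array of m samples and n features.
    y : (m,) target vector; for a classifier, encode the classes as +1/-1.
    Returns feature indices ordered from most to least linearly related.
    """
    scores = np.array([pearson_r(X[:, i], y) ** 2 for i in range(X.shape[1])])
    return np.argsort(scores)[::-1], scores
```

Because the correlation only captures linear relationships, a simple non-linear pre-processing step (for example, ranking the squared feature values alongside the raw ones) can expose dependencies that the raw feature would hide, as noted above.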
A genetic algorithm is a population-based heuristic used for optimization. It combines survival of the fittest among string structures with a structured yet randomized information exchange to form a search algorithm with some of the innovative flair of human search. The algorithm starts with a set of random solutions called the initial population. Each member of this population is called a chromosome, and each chromosome consists of a string of genes. The number of genes and their values in each chromosome depend on the problem specification. In the algorithm of this paper, the number of genes in each chromosome is equal to the number of nodes in the tree, and the gene values give the selection priority of the classification at that node, where a higher priority means that the node must be processed earlier [16]. The set of chromosomes in each iteration of the GA is called a generation, and the chromosomes of a generation are evaluated by their fitness function. The new generation, i.e. the offspring, is created by applying operators to the current generation: crossover, which selects two chromosomes of the current population, combines them and generates a new child (offspring), and mutation, which randomly changes some gene values of a chromosome and creates a new offspring. The best offspring are then selected by the evolutionary selection operator according to their fitness values. The GA thus consists of four main steps, whose working process is shown in Fig. 1.

Fig. 1: Working process of the genetic algorithm
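The generation / crossover / mutation cycle described above and in Fig. 1 can be sketched as a generic GA over binary chromosomes. The encoding, operator choices and parameter values below are illustrative assumptions only (the paper encodes node priorities and works in MATLAB); the fitness function is supplied by the caller, for example the held-out accuracy of a tree built from the selected attributes.

```python
import random

def genetic_algorithm(fitness, n_genes, pop_size=30, generations=50,
                      crossover_rate=0.8, mutation_rate=0.05, seed=0):
    """Minimal GA: initialise, select, crossover, mutate, repeat."""
    rng = random.Random(seed)
    # 1. Initial population of random binary chromosomes.
    population = [[rng.randint(0, 1) for _ in range(n_genes)]
                  for _ in range(pop_size)]

    def tournament(pop):
        # Selection operator: keep the fitter of two random chromosomes.
        a, b = rng.sample(pop, 2)
        return a if fitness(a) >= fitness(b) else b

    for _ in range(generations):
        offspring = []
        while len(offspring) < pop_size:
            p1, p2 = tournament(population), tournament(population)
            # 2. Crossover: combine two parents at a random cut point.
            if rng.random() < crossover_rate:
                cut = rng.randrange(1, n_genes)
                child = p1[:cut] + p2[cut:]
            else:
                child = p1[:]
            # 3. Mutation: randomly flip some gene values.
            child = [1 - g if rng.random() < mutation_rate else g
                     for g in child]
            offspring.append(child)
        # 4. Evolutionary selection: keep the fittest chromosomes.
        population = sorted(population + offspring, key=fitness,
                            reverse=True)[:pop_size]

    return max(population, key=fitness)
```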
IV. PROPOSED METHODOLOGY

This section discusses the proposed algorithm for data classification. The proposed algorithm is a combination of ID3 and a genetic algorithm. Feature correlation is a very important function in the interaction between the genetic algorithm and the ID3 algorithm. The GA is applied during training to handle minority- and majority-class data samples for the tree classification process. The input to the training phase is produced by a data sampling technique for the classifier. The fitness function selects the initial input of the ID3 algorithm, and a GA optimized with a single value may find relationships more quickly. The steps of the proposed algorithm are listed below, and a short sketch of the resulting pipeline follows the list.
1. Sample the data using the chosen sampling technique.
2. Split the data into two parts, a training part and a testing part.
3. Apply the GA function to train on the sample values.
4. Using 2/3 of the sample, fit a tree, choosing the split at each node. For each tree:
   - classify the remaining 1/3 of the sample using the tree, and calculate the misclassification rate (the out-of-GA error).
5. For each variable in the tree:
   - variable selection: average the increase in GA error over all trees and, assuming a normal distribution of the increase among the trees, decide an associated value for the feature.
6. Compute the error rate: calculate the overall percentage of misclassification; finally, the misclassification is used to estimate the entire model.
7. The resulting classifier set is classified, and the binary attributes are converted back to their actual values.
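A compact sketch of steps 1-7 is given below. It uses scikit-learn's DecisionTreeClassifier with the entropy (information-gain) criterion as a stand-in for ID3 and reuses the genetic_algorithm sketch from Section III; the 2/3 : 1/3 split and the exact fitness definition are our reading of the steps rather than code taken from the paper.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def ga_id3_pipeline(X, y, genetic_algorithm, seed=0):
    """Sketch of steps 1-7: sample, split, GA-driven attribute selection,
    tree fitting, and error estimation."""
    rng = np.random.default_rng(seed)

    # Steps 1-2: sample the data and split it 2/3 training, 1/3 testing.
    idx = rng.permutation(len(y))
    cut = (2 * len(y)) // 3
    train, test = idx[:cut], idx[cut:]

    def fitness(mask):
        # Steps 3-4: fit an entropy-based tree (ID3-style splits) on the
        # attributes switched on by the chromosome, score it on the 1/3.
        cols = [i for i, g in enumerate(mask) if g]
        if not cols:
            return 0.0
        tree = DecisionTreeClassifier(criterion="entropy", random_state=0)
        tree.fit(X[np.ix_(train, cols)], y[train])
        return float((tree.predict(X[np.ix_(test, cols)]) == y[test]).mean())

    # Steps 5-6: the GA searches attribute subsets; the best chromosome's
    # held-out error is the model's misclassification estimate.
    best_mask = genetic_algorithm(fitness, n_genes=X.shape[1])
    error_rate = 1.0 - fitness(best_mask)

    # Step 7: the selected attributes define the final classifier.
    return best_mask, error_rate
```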
V. EXPERIMENTAL RESULT ANALYSIS

For the experimental analysis of the proposed algorithm we collected three datasets from the UCI Machine Learning Repository. The datasets have item counts varying from 150 to 1000 and feature counts from 4 to 10. A few datasets have missing values, which we replaced with negative values. Nominal attributes were converted to integers, numbered from 1 in order of appearance. For datasets with multiple classes, we used class 1 as the positive class and all other classes as the negative class. We used 10-fold cross-validation for each experiment. Over the 10 rounds of cross-validation for each dataset in each experiment, we recorded the mean of the average accuracy of the individual classifiers. All experiments were performed in MATLAB 7.8.0 and the results are shown in the tables below.

Table 1: Comparative results on the wine dataset

Dataset         Algorithm    Accuracy    Time
Wine Dataset    ID3          84.82       34.23
                FUZZY_ID3    87.46       32.82
                ID3-GA       93.17       17.89

Table 2: Comparative results on the iris dataset

Dataset         Algorithm    Accuracy    Time
Iris Dataset    ID3          84.52       35.11
                FUZZY_ID3    86.76       31.33
                ID3-GA       94.67       16.86
Table 3: Comparative results on the cancer dataset

Dataset          Algorithm    Accuracy    Time
Cancer Dataset   ID3          83.34       37.43
                 FUZZY_ID3    88.48       34.26
                 ID3-GA       95.23       15.46

Fig. 2: Comparative classification accuracy and execution time of the three algorithms on the wine dataset
Fig. 3: Comparative classification accuracy and execution time of the three algorithms on the iris dataset
Fig. 4: Comparative classification accuracy and execution time of the three algorithms on the cancer dataset
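The pre-processing and 10-fold cross-validation protocol described at the beginning of this section can be sketched as follows. The encoding choices (negative sentinel for missing values, integer codes starting at 1, class 1 versus the rest) follow the description above; the function names and the use of scikit-learn are our own illustrative assumptions, since the original experiments were run in MATLAB 7.8.0.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

def encode_dataset(raw_rows, class_index):
    """Nominal values -> integers numbered from 1 in order of appearance;
    missing values ('?') -> a negative sentinel; class 1 vs. rest -> +1/-1."""
    codes = [dict() for _ in raw_rows[0]]
    X, y = [], []
    for row in raw_rows:
        enc = []
        for j, v in enumerate(row):
            if v in ("?", "", None):
                enc.append(-1.0)                      # missing value sentinel
            else:
                try:
                    enc.append(float(v))              # numeric attribute
                except ValueError:
                    codes[j].setdefault(v, len(codes[j]) + 1)
                    enc.append(float(codes[j][v]))    # nominal -> 1, 2, ...
        label = enc.pop(class_index)
        y.append(1 if label == 1 else -1)             # class 1 vs. the rest
        X.append(enc)
    return np.array(X), np.array(y)

def cross_validated_accuracy(X, y, fit_predict, folds=10, seed=0):
    """Mean accuracy over stratified 10-fold cross-validation."""
    skf = StratifiedKFold(n_splits=folds, shuffle=True, random_state=seed)
    accs = []
    for tr, te in skf.split(X, y):
        preds = fit_predict(X[tr], y[tr], X[te])
        accs.append(float((preds == y[te]).mean()))
    return float(np.mean(accs))
```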
VI. CONCLUSION

In this paper we proposed an optimized ID3 method based on a genetic algorithm. Our method combines a feature correlation factor with the genetic algorithm for attribute selection. The correlation factor, the genetic algorithm and ID3 together form the GA-ID3 model, which passes through the data, reduces the amount of unclassified data, and improves the majority voting of the classifier. Our experimental results show an improvement in comparison with the traditional ID3 classifier. The experiments were performed on UCI datasets such as wine, iris and cancer. The model is stable under different machine learning algorithms, dataset sizes, and feature sizes.
REFERENCES
[1] A. Martin, Aswathy V., Balaji S., T. Miranda Lakshmi, and V. Prasanna Venkatesan, "An Analysis on Qualitative Bankruptcy Prediction Using Fuzzy ID3 and Ant Colony Optimization Algorithm," IEEE, 2012, pp. 56-67.

[2] Feng Yang, Hemin Jin, and Huimin Qi, "Study on the Application of Data Mining for Customer Groups Based on the Modified ID3 Algorithm in the E-commerce," IEEE, 2012, pp. 78-87.

[3] Carlo Noel Ochotorena, Cecille Adrianne Yap, Elmer Dadios, and Edwin Sybingco, "Robust Stock Trading Using Fuzzy Decision Trees," IEEE, 2012, pp. 24-33.

[4] Joao Roberto Bertini Junior, Maria do Carmo Nicoletti, and Liang Zhao, "Attribute-based Decision Graphs for Multiclass Data Classification," IEEE, 2013, pp. 97-106.

[6] Chen Jin, Luo De-lin, and Mu Fen-xiang, "An Improved ID3 Decision Tree Algorithm," IEEE, 2009, pp. 76-87.

[7] Sonika Tiwari and Roopali Soni, "Horizontal partitioning ID3 algorithm: A new approach of detecting network anomalies using decision tree," IJERT, ISSN: 2278-0181, Vol. 1, Issue 7, September 2012.

[8] Liu Yuxun and Xie Niuniu, "Improved ID3 Algorithm," IEEE, 2010, pp. 34-42.

[9] N. Mac Parthaláin and Q. Shen, "On rough sets, their recent extensions and applications," The Knowledge Engineering Review, Vol. 25, No. 4, pp. 365-395, Cambridge University Press, 2010.
[10] Binsy Thomas and J. W. Bakal, "Fuzzy Similarity and ID3 algorithm for anti spam filtering," IJEA, ISSN: 2320-0804, Vol. 2, Issue 7, 2013.

[11] Linna Li and Xuemin Zhang, "Study of Data Mining Algorithm Based on Decision Tree," ICCDA, IEEE, 2010, pp. 78-88.

[12] C. H. L. Lee, Y. C. Liaw, and L. Hsu, "Investment Decision Making by Using Fuzzy Candlestick Pattern and Genetic Algorithm," in IEEE International Conference on Fuzzy Systems, 2011, pp. 2696-2701.

[13] W. Bi and J. Kwok, "Multi-label classification on tree and DAG structured hierarchies," in Proceedings of the 28th International Conference on Machine Learning (ICML 2011), ACM, 2011, pp. 17-24.

[14] Narasimha Prasad, Prudhvi Kumar Reddy, and Naidu M. M., "An Approach to Prediction of Precipitation Using Gini Index in SLIQ Decision Tree," 4th International Conference on Intelligent Systems, 2013, pp. 56-60.

[15] B. Chandra and P. Paul Varghese, "Fuzzy SLIQ Decision Tree Algorithm," IEEE Transactions on Systems, Man, and Cybernetics, Part B: Cybernetics, Vol. 38, 2008.

[16] Sung-Hwan Min, Jumin Lee, and Ingoo Han, "Hybrid genetic algorithms and support vector machines for bankruptcy prediction," Expert Systems with Applications, Elsevier, 2010, pp. 5689-5697.
