A Data Mining Algorithm based on Genetic Algorithm

Yun Bai (1), Deyu Qi (2), Qiangguo Pu (1), Nikos Mastorakis (3)
(1) Computer Center, University of Science and Technology of Suzhou, Jiangsu 215011, China
(2) College of Computer Science, South China University of Technology, Guangzhou 510640, China
(3) Hellenic Naval Academy, Terma Hatzikyriakou, 18539 Piraeus, Greece
Also: WSEAS, Ag.I.Theologou 17-23, 15773, Zografou, Athens, Greece
Abstract: Data Mining (DM) is an active new research topic in the database field. In this paper, we discuss the application of the genetic algorithm (GA) to DM, and present the basic idea and the key design questions of a new DM algorithm based on GA, such as knowledge rule expression, knowledge rule coding and fitness function definition. We illustrate the validity of the new DM algorithm with a worked instance.
Keywords: Genetic Algorithm, Data Mining, Knowledge Expression, Knowledge Rule
1. Data Mining and Knowledge Expression
1.1 Data Mining
Data Mining (DM) is the process of extracting previously unknown and useful information and knowledge from large amounts of incomplete, uncertain, fuzzy and random data. Similar processes include Knowledge Discovery, Data Analysis, Data Fusion and Decision-Making Support[1]. The types of knowledge discovered by DM are generalization, characteristic, contrast, correlation, prediction and deviation knowledge, while classification, clustering, dimension reduction, pattern recognition, visualization, decision trees, genetic algorithms and uncertainty handling are the tools and methods normally used for the discovery[2].

1.2 Knowledge Expression
Knowledge expression is very important to DM: the correct selection and use of a knowledge expression greatly improves the efficiency of problem solving[3]. Knowledge may be expressed as predicate logic, production rules, semantic networks, frames and neural networks. The knowledge extracted in DM is normally expressed as concepts, rules, regularities and patterns[1]. Most DM can be regarded as search, with the database as the search space. The DM algorithm described in this paper may be considered a search policy that uses the genetic algorithm to search the database: a randomly generated group of rules is evolved until the rule group covers the database, so as to mine the useful rules hidden in the database.

Because the production rule is simple and clear, it is the form most widely used in current expert systems. It is composed of two parts: premise and conclusion. The premise is the precondition that must be met for the rule to apply, while the conclusion is the action or result of applying the rule. A production rule may be described as:

IF E1(A1, A2, …, Am) ∧ E2(A1, A2, …, Am) ∧ … ∧ En(A1, A2, …, Am) THEN H (Conclusion)

in which Ei(A1, A2, …, Am) (1≤i≤n) is a premise over the attribute values Aj (1≤j≤m). If the values are related by OR (∨), Ei(A1, A2, …, Am) is expressed in disjunctive form, that is, Ei=A1 OR Ei=A2 … OR Ei=Am. If the values are related by AND (∧), Ei(A1, A2, …, Am) is expressed in conjunctive form, that is, Ei=A1 AND Ei=A2 … AND Ei=Am. The following is an example of a rule expression and its meaning.

A rule on knowledge classification:

IF PRICE(moderate, low) ∧ QUALITY(high, normal) ∧ SERVICE(good) THEN CLASS=acc

is equivalent to

IF (PRICE=moderate ∨ PRICE=low) ∧ (QUALITY=high ∨ QUALITY=normal) ∧ (SERVICE=good) THEN CLASS=acc.

The meaning of this rule is that if the price of the commodity is moderate or low, the quality is high or normal, and the service is good, then the commodity is accepted.

2. Genetic algorithm and realization of genetic operation
2.1 Genetic algorithm
The genetic algorithm (GA) performs optimized search by simulating the process of biological evolution; it is essentially a random search method based on that simulation[6]. Its basic principle is to map the objective function of the optimization problem to the fitness of a biological population in its environment, to map the variables being optimized to the individuals of the population, and to treat the procedure that seeks the optimal solution as the evolution of the population[7].

Three features distinguish the genetic algorithm from traditional optimization algorithms: high robustness, global search capability and inherent parallelism.

2.2 Genome coding
An important step in GA is coding, that is, the transformation from the parameter form of a solution of the optimization problem to a genome code string.

Taking the coding of knowledge classification rules as an example, the genetic code string of a rule (individual) is realized as a binary code according to the features of the classification rule[8]. A rule is divided into two parts, the rule characteristic (premise) and the rule class (conclusion), that is,

IF PRICE(moderate, low) ∧ QUALITY(high, normal) ∧ SERVICE(good) THEN CLASS=acc

Suppose that the characteristic attributes are discrete (any numerical attribute is discretized first). If a discrete attribute has k possible values, then k bits are allocated to it in the binary string, and each bit corresponds to one value: 0 means that the value does not appear in the disjunction, while 1 means that it does. To simplify the algorithm, the class of an individual is expressed as a contiguous binary string for the class attribute. For the above rule, if the value field of the characteristic attribute PRICE is {high, moderate, low}, the value field of QUALITY is {high, normal, poor}, the value field of SERVICE is {good, bad}, and the value field of the class attribute is {uacc, acc, good, vgood}, coded as 00, 01, 10 and 11 respectively, then the rule is expressed as the genome (binary string) 0111101001, and rules correspond to genomes one to one. Likewise, the binary string 0011001011 corresponds to the rule:

IF PRICE(low) ∧ QUALITY(high) ∧ SERVICE(good) THEN CLASS=vgood

If the code of a characteristic attribute consists entirely of 1s, then the attribute does not affect the validity of the rule, whatever its value. For example, if PRICE is coded as 111 and the rule code is 1111001010, then the rule means that if the quality of the commodity is high and the service is good, the commodity is classed as good (class code 10) regardless of price. Conversely, the rule "if the service of the commodity is bad, then the commodity is unaccepted", that is, IF SERVICE(bad) THEN CLASS=uacc, corresponds to the binary code string 1111110100.
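The coding scheme of Section 2.2 can be illustrated with a short sketch. It is only one possible reading of the scheme, not the authors' implementation: the names `FEATURES`, `CLASSES`, `encode_rule` and `decode_rule` are hypothetical, and the bit order (PRICE, QUALITY, SERVICE, then a 2-bit class code) is taken from the example strings 0111101001 and 1111110100.

```python
# Attribute value domains in the bit order used by the example strings.
FEATURES = [
    ("PRICE", ["high", "moderate", "low"]),
    ("QUALITY", ["high", "normal", "poor"]),
    ("SERVICE", ["good", "bad"]),
]
CLASSES = ["uacc", "acc", "good", "vgood"]  # coded 00, 01, 10, 11


def encode_rule(premise, conclusion):
    """Encode a rule as a binary string.

    premise maps an attribute name to the values the rule allows; an
    attribute missing from premise is a "don't care" and becomes all 1s.
    """
    bits = []
    for name, values in FEATURES:
        allowed = premise.get(name, values)
        bits.extend("1" if v in allowed else "0" for v in values)
    bits.append(format(CLASSES.index(conclusion), "02b"))
    return "".join(bits)


def decode_rule(genome):
    """Decode a binary string into (premise, conclusion)."""
    premise, pos = {}, 0
    for name, values in FEATURES:
        segment = genome[pos:pos + len(values)]
        pos += len(values)
        if "0" in segment:  # an all-ones segment places no constraint
            premise[name] = [v for v, b in zip(values, segment) if b == "1"]
    conclusion = CLASSES[int(genome[pos:pos + 2], 2)]
    return premise, conclusion


rule = {"PRICE": {"moderate", "low"},
        "QUALITY": {"high", "normal"},
        "SERVICE": {"good"}}
print(encode_rule(rule, "acc"))   # 0111101001
print(decode_rule("1111110100"))  # ({'SERVICE': ['bad']}, 'uacc')
```

With one bit per attribute value, an all-ones segment falls out naturally as "any value", which is why the decoder simply omits such attributes from the premise.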
2.3 Definition of the fitness function
When GA mines classification rules from the knowledge database, the "good rules" are kept and act as the parent-generation rules for reproduction, crossover and mutation until the optimized rule group is discovered. A "good rule" is one that matches the instances of the test data set well (an instance is also called a record, and includes characteristic and class attributes), so the fitness function shall reflect the degree of matching between the rule and the data set. Three important parameters of a rule are considered in the definition of the fitness function: accuracy, utility and coverage. They are described below, where U is a test data set and e is an instance (element) of it.

Accuracy: The accuracy of a rule ri is measured by the degree of matching between the instances of the test data set U and the rule, as given by the following formula:

$$\mathrm{Accuracy}(r_i) = \frac{|u^{+}_{r_i}|}{|u^{+}_{r_i}| + |u^{-}_{r_i}|}$$

in which $u^{+}_{r_i}$ is the subset of U whose instances match the rule ri, and $|u^{+}_{r_i}|$ is the cardinality of that subset. $u^{-}_{r_i}$ is also a subset of U, in which only the characteristic part (premise) of each instance matches the rule ri, and $|u^{-}_{r_i}|$ is its cardinality. Apparently, the higher the accuracy of a rule, the more the rule can be trusted. If the accuracy is 1, then the rule always holds on the test data set U: whenever the condition of the rule holds, its conclusion holds as well.

Utility: Within the test data set U, an instance e may match several of the rules to be evolved, and each rule has a different, measurable utility. If an instance e in U matches only one rule in the current population, then the utility of that rule is 1. If an instance e matches m rules in the current population, then each of these rules has a utility of 1/m, as given by the following formula:

$$\mathrm{Utility}(r_i) = \sum_{e \in U} \frac{\delta(r_i, e)}{\sum_{r} \delta(r, e)}, \qquad \delta(r, e) = \begin{cases} 1 & \text{if the instance } e \text{ successfully matches the rule } r \\ 0 & \text{otherwise} \end{cases}$$

in which U and e are defined as above, and r ranges over the rules in the current population.

Coverage: The coverage of the rule ri is the number of instances e in the test data set U that match the premise (characteristic part) of ri, as given by the following formula:

$$\mathrm{Coverage}(r_i) = |u^{+}_{r_i}| + |u^{-}_{r_i}|$$

Apparently, the higher the coverage of a rule, the more general the rule. As the values of the three parameters determine the fitness of a rule, higher values of all three are desirable.

Before the fitness function of the rule is fixed, the relationships between rules shall be analyzed. There are three possible relationships between rules: contain, contradictory and redundant.

Contain:
IF PRICE(moderate, low) ∧ SERVICE(good) THEN CLASS=acc
IF SERVICE(good) THEN CLASS=acc

Contradictory:
IF PRICE(high) ∧ QUALITY(poor) ∧ SERVICE(bad) THEN CLASS=acc
IF PRICE(high) ∧ QUALITY(poor) ∧ SERVICE(bad) THEN CLASS=uacc

Redundant: the simplest case, two identical rules.

The relationships between rules shall be considered in the design of the fitness function: contained, contradictory and redundant rules shall be deleted to ensure that every parent-generation rule selected is a "good rule". The algorithm to evaluate rule fitness is as follows:

Step 1: Assume that U is a test data set and P is a rule (genome) population of size N;
Step 2: Calculate Accuracy(ri) and Utility(ri) for each rule ri (i=1,2,…,N) in the current population P;
Step 3: Arrange the rules ri in descending order of the product of Accuracy(ri) and Utility(ri), and write the result into the sort table;
Step 4: Calculate Coverage(rj) for the rule rj in the sort table, starting with the first rule (initially j=0), and set fitness(rj) = Coverage(rj) * Accuracy(rj) * Utility(rj);
Step 5: Delete from U the instances covered by the rule rj, i.e. $U = U - (u^{+}_{r_j} \cup u^{-}_{r_j})$;
Step 6: If rj is the last rule in the sort table, the fitness of all rules has been calculated, so exit; otherwise set j=j+1 and goto Step 4.

2.4 Genetic operators
2.4.1 Selection operator
We adopt the usual selection policy: the fitness scaling method together with the roulette selection method.

2.4.2 Crossover operator
The crossover operator plays a very important role. It preserves the features of the best individuals of the original population to a certain extent, while letting the algorithm search new regions of the gene space, so that the new individuals in the population are diverse. In the algorithm described in this paper, the crossover operator uses single-point crossover, that is, a crossover point is set randomly within the individual binary gene string for the crossover of the parents' genomes.

2.4.3 Mutation operator
The mutation operator changes the gene value at some positions of the individual binary gene string in the population, that is, 1→0 and 0→1.

2.5 Algorithm description
The following is the flow of the data mining algorithm based on the genetic algorithm.

Begin
Input: Test data set U and all control parameters of the genetic algorithm
Output: The optimized set of evolved classification rules
Step 1: Initialize the population: N genomes (binary gene strings) are generated randomly and checked for rule validity.
Step 2: Calculate the fitness value of each individual (rule) in the current population.
Step 3: If the current generation number has reached the configured maximum number of generations, or the specified control parameters meet their set values, goto Step 7; otherwise continue.
Step 4: Apply selection, crossover and mutation to the current generation to produce the child-generation population.
Step 5: Replace the low-fitness individuals of the parent-generation population with the high-fitness individuals of the child-generation population to form the new generation.
Step 6: Goto Step 2.
Step 7: Output the optimized rule set.
End
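The operators of Section 2.4 and the flow of Section 2.5 can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: all function names are hypothetical, the fitness function is passed in as `fitness_fn` and assumed to be non-negative, fitness scaling is omitted, and Step 5 is read as "keep the fittest individuals among parents and children", which is only one way to interpret the replacement rule.

```python
import random


def roulette_select(population, fitnesses, rng):
    """Pick one individual with probability proportional to its fitness."""
    if sum(fitnesses) <= 0:  # degenerate case: fall back to a uniform choice
        return rng.choice(population)
    return rng.choices(population, weights=fitnesses, k=1)[0]


def single_point_crossover(a, b, rng):
    """Swap the tails of two equal-length bit strings at a random point."""
    point = rng.randint(1, len(a) - 1)
    return a[:point] + b[point:], b[:point] + a[point:]


def mutate(genome, pm, rng):
    """Flip each bit (1->0, 0->1) independently with probability pm."""
    return "".join(
        ("1" if bit == "0" else "0") if rng.random() < pm else bit
        for bit in genome
    )


def evolve(fitness_fn, length=10, size=128, pc=0.9, pm=0.03,
           generations=500, seed=0):
    """Run the Step 1 to Step 7 loop and return the final population."""
    rng = random.Random(seed)
    # Step 1: random initial population of binary gene strings.
    population = ["".join(rng.choice("01") for _ in range(length))
                  for _ in range(size)]
    for _ in range(generations):  # Step 3: stop at the generation limit
        # Step 2: fitness of every individual in the current population.
        fitnesses = [fitness_fn(g) for g in population]
        # Step 4: selection, crossover and mutation produce the children.
        children = []
        while len(children) < size:
            a = roulette_select(population, fitnesses, rng)
            b = roulette_select(population, fitnesses, rng)
            if rng.random() < pc:
                a, b = single_point_crossover(a, b, rng)
            children += [mutate(a, pm, rng), mutate(b, pm, rng)]
        # Step 5 (assumed reading): keep the fittest of parents + children.
        merged = population + children[:size]
        merged.sort(key=fitness_fn, reverse=True)
        population = merged[:size]
    return population  # Step 7: fittest individual first


if __name__ == "__main__":
    # Toy fitness (number of 1 bits), used only to exercise the loop.
    best = evolve(lambda g: g.count("1"), size=20, generations=30)
    print(best[0])
```

The default values of `pc`, `pm`, `size` and `generations` mirror the settings reported for the example in Section 3; the toy fitness in the usage lines is a stand-in for a rule fitness function.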
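As a worked illustration of the accuracy, utility and coverage measures of Section 2.3, the sketch below evaluates two rules on the 13 instances of the example table in Section 3. It rests on several assumptions: the function names are hypothetical, "e matches r" in the utility measure is read as a match on both premise and class, and the population contains only the two rules shown. Because fitness depends on the whole population, the resulting fitness values are not expected to reproduce the figures reported in Section 3.

```python
# Instances of the example table: (price, quality, service, class).
DATA = [
    ("high", "poor", "bad", "uacc"),
    ("moderate", "normal", "bad", "uacc"),
    ("low", "high", "good", "vgood"),
    ("high", "high", "good", "good"),
    ("moderate", "high", "good", "good"),
    ("moderate", "normal", "good", "acc"),
    ("high", "normal", "good", "acc"),
    ("high", "poor", "good", "uacc"),
    ("moderate", "poor", "bad", "uacc"),
    ("low", "poor", "bad", "uacc"),
    ("high", "normal", "good", "acc"),
    ("moderate", "normal", "good", "acc"),
    ("low", "normal", "good", "good"),
]
ATTRS = {"PRICE": 0, "QUALITY": 1, "SERVICE": 2}


def premise_matches(rule, e):
    """True if every constrained attribute of e takes an allowed value."""
    premise, _ = rule
    return all(e[ATTRS[a]] in allowed for a, allowed in premise.items())


def matches(rule, e):
    """True if e matches both the premise and the class of the rule."""
    return premise_matches(rule, e) and e[3] == rule[1]


def coverage(rule, data):
    """Number of instances whose premise part matches the rule."""
    return sum(premise_matches(rule, e) for e in data)


def accuracy(rule, data):
    """Share of premise-matching instances that also match the class."""
    covered = coverage(rule, data)
    return sum(matches(rule, e) for e in data) / covered if covered else 0.0


def utility(rule, population, data):
    """Sum of 1/m over matched instances; m = matching rules in population."""
    total = 0.0
    for e in data:
        if matches(rule, e):
            m = sum(matches(r, e) for r in population)
            if m:  # m is 0 only when rule itself is not in population
                total += 1.0 / m
    return total


def evaluate_fitness(population, data):
    """Steps 1 to 6 of the fitness evaluation, returning one value per rule."""
    scores = [accuracy(r, data) * utility(r, population, data)
              for r in population]                               # Step 2
    order = sorted(range(len(population)),
                   key=lambda i: scores[i], reverse=True)        # Step 3
    remaining, fitness = list(data), [0.0] * len(population)
    for i in order:                                              # Steps 4, 6
        fitness[i] = coverage(population[i], remaining) * scores[i]
        remaining = [e for e in remaining
                     if not premise_matches(population[i], e)]   # Step 5
    return fitness


RULE_1 = ({"QUALITY": {"poor"}}, "uacc")
RULE_2 = ({"SERVICE": {"good"}}, "acc")
print(accuracy(RULE_1, DATA), coverage(RULE_1, DATA))  # 1.0 4
print(coverage(RULE_2, DATA))                          # 9
```

On this table all four poor-quality instances are classed uacc, so the first rule has accuracy 1, while only four of the nine good-service instances are classed acc, so the second rule has accuracy 4/9.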
3. Simulation on example
As an example, the acceptance of a commodity by customers is uacc, acc, good or vgood, measured on price, quality and service, where the price is high, moderate or low, the quality is high, normal or poor, and the service is good or bad. The data are given in the following table:

Example    Featured attribute of commodity     Acceptance
no.        Price       Quality    Service      (Classification)
1          High        Poor       Bad          Uacc
2          Moderate    Normal     Bad          Uacc
3          Low         High       Good         Vgood
4          High        High       Good         Good
5          Moderate    High       Good         Good
6          Moderate    Normal     Good         Acc
7          High        Normal     Good         Acc
8          High        Poor       Good         Uacc
9          Moderate    Poor       Bad          Uacc
10         Low         Poor       Bad          Uacc
11         High        Normal     Good         Acc
12         Moderate    Normal     Good         Acc
13         Low         Normal     Good         Good

Applying the genetic algorithm to the above example with crossover probability Pc=0.9, mutation probability Pm=0.03, population size P_SIZE=128 and 500 iterations, the following two valid rules are mined:

Rule 1: IF QUALITY(poor) THEN CLASS=uacc (fitness=36.00) (meaning that poor quality would not be accepted by the customer).
Rule 2: IF SERVICE(good) THEN CLASS=acc (fitness=7.88) (meaning that good service would be accepted by the customer).

The above example only shows that the genetic algorithm is valid for mining knowledge classification rules; in real data mining over general and professional knowledge, the given data set or metadata (meta-knowledge) may contain millions of records.

4. Conclusion
Based on the realization technology of the genetic algorithm, combined with the mining of knowledge classification rules in data mining, this paper presents detailed methods for fixing the parameters and operations of the genetic algorithm, including the genome coding method and the design of the genetic operators, and especially the definition and realization of the fitness function. Finally, the data mining algorithm based on the genetic algorithm is given. In data mining, because the genetic algorithm has the characteristics of high robustness, global search capability and inherent parallelism, it can evolve from meta-knowledge (metadata) to obtain more satisfactory knowledge and rules, while previously absent knowledge may be generated by the mutation operator, which amounts to searching for hidden knowledge, as shown by the example described in this paper.

REFERENCES
1. Tubao Ho, Trongdung Nguyen, Ducdung Nguyen, Saori Kawasaki, Visualization Support for User Centered Model Selection in Knowledge Discovery and Data Mining, International Journal of Artificial Intelligence Tools, 2001, 10(4). 691-713
2. Susan E. George, A Visualization and Design Tool (AVID) for Data Mining with the Self-Organizing Feature Map, International Journal of Artificial Intelligence Tools, 2000, 9(3). 369-375
3. Tzung-Pei Hong, Chan-Sheng Kuo, Sheng-Chai Chi, Trade-off Between Computation Time and Number of Rules for Fuzzy Mining from Quantitative Data, International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems, 2001, 9(5). 587-604
4. Vladimir Estivill-Castro, Jianhua Yang, Clustering Web Visitors by Base, Robust and Convergent Algorithms, International Journal of Foundations of Computer Science, 2002, 13(4). 497-520
5. R. Félix, T. Ushio, Binary Encoding of Discernibility Patterns to Find Minimal Coverings, International Journal of Software Engineering and Knowledge Engineering, 2002, 12(1). 1-18
6. D. E. Goldberg, Genetic Algorithms in Search, Optimization and Machine Learning, Addison-Wesley Publishing Company, 1989
7. Ai Lirong, He Huacan, Summarise on Genetic Algorithms, Journal of Computer Application and Research, 1997, 14(4). 3-6
8. Xiao Yong, Chen Yiyun, Constructing Decision Trees by Using Genetic Algorithm, Journal of Computer Research and Development, 1998, 35(1). 49-52
9. Xiaomin Zhong, Eugene Santos, Directing Genetic Algorithms for Probabilistic Reasoning Through Reinforcement Learning, International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems, 2000, 8(2). 167-185
10. Gary William Grewal, Thomas Charles Wilson, An Enhanced Genetic Algorithm for Solving the High-Level Synthesis Problems of Scheduling, Allocation, and Binding, International Journal of Computational Intelligence and Applications, 2001, 10(1). 91-110