A Data Mining Algorithm Based on Genetic Algorithm

Yun Bai (1), Deyu Qi (2), Qiangguo Pu (1), Nikos Mastorakis (3)
(1) Computer Center, University of Science and Technology of Suzhou, Jiangsu 215011, China
(2) College of Computer Science, South China University of Technology, Guangzhou 510640, China
(3) Hellenic Naval Academy, Terma Hatzikyriakou, 18539 Piraeus, Greece; also: WSEAS, Ag. I. Theologou 17-23, 15773 Zografou, Athens, Greece

Abstract: Data Mining (DM) is an active research topic in the database field. In this paper we discuss the application of the genetic algorithm (GA) to DM and present the basic idea and the key design issues of a new GA-based DM algorithm, namely knowledge rule expression, knowledge rule coding and the definition of the fitness function. We illustrate the validity of the new DM algorithm with a worked example.

Keywords: Genetic Algorithm, Data Mining, Knowledge Expression, Knowledge Rule

1. Data Mining and Knowledge Expression

1.1 Data Mining
Data Mining (DM) is the process of extracting previously unknown and potentially useful information and knowledge from large amounts of incomplete, noisy, fuzzy and random data. Closely related processes include knowledge discovery, data analysis, data fusion and decision support [1]. The kinds of knowledge discovered by DM include generalized, characteristic, contrast, associative, predictive and deviation knowledge, while classification, clustering, dimensionality reduction, pattern recognition, visualization, decision trees, genetic algorithms and uncertainty handling are the tools and methods normally used for the discovery [2].

1.2 Knowledge Expression
Knowledge expression is very important to DM: the correct selection and use of a knowledge representation can greatly improve the efficiency of problem solving [3]. Knowledge may be expressed as predicate logic, production rules, semantic networks, frames or neural networks, and the knowledge extracted by DM is normally expressed as concepts, rules, regularities and patterns [1]. Most DM can be viewed as search, with the database as the search space. The DM algorithm described in this paper can be regarded as a search policy that uses the genetic algorithm to evolve a group of randomly generated rules over the database until the database is covered by the rule group, thereby mining the useful rules hidden in the database.

Because production rules are simple and clear, they are the representation most widely used in current expert systems. A production rule consists of two parts, a premise and a conclusion: the premise is the precondition that must be satisfied for the rule to fire, and the conclusion is the action or result produced when the rule fires. A production rule may be written as:

    IF E1(A1, A2, …, Am) ∧ E2(A1, A2, …, Am) ∧ … ∧ En(A1, A2, …, Am) THEN H(Conclusion)

where Ei(A1, A2, …, Am) (1 ≤ i ≤ n) is a premise over the attributes Ai (1 ≤ i ≤ m). If the attribute values within Ei are related by OR (∨), Ei(A1, A2, …, Am) is written in disjunctive form, Ei = A1 ∨ Ei = A2 ∨ … ∨ Ei = Am; if they are related by AND (∧), it is written in conjunctive form, Ei = A1 ∧ Ei = A2 ∧ … ∧ Ei = Am. The following is an example of a rule expression and its meaning.
A classification rule:

    IF PRICE(moderate, low) ∧ QUALITY(high, normal) ∧ SERVICE(good) THEN CLASS=acc

is equivalent to

    IF (PRICE=moderate ∨ PRICE=low) ∧ (QUALITY=high ∨ QUALITY=normal) ∧ (SERVICE=good) THEN CLASS=acc.

The meaning of this rule is: if the price of a commodity is moderate or low, its quality is high or normal and its service is good, then the commodity is accepted.

2. Genetic Algorithm and Realization of the Genetic Operations

2.1 Genetic algorithm
The genetic algorithm (GA) simulates the process of biological evolution to carry out an optimizing search; it is essentially a stochastic search method based on a simulation of biological evolution [6]. Its basic principle is to map the objective function of the optimization problem onto the fitness of a biological population in its environment, to let each candidate solution correspond to an individual of the population, and to treat the optimization process as the evolution of that population [7]. Compared with traditional optimization algorithms, the genetic algorithm has three distinctive features: strong robustness, global search capability and intrinsic parallelism.

2.2 Genome coding
An important step in a GA is coding, i.e. transforming the parameter form of a candidate solution into a genome code string. Taking the coding of classification rules as an example, the genetic code string of a rule (an individual) is realized as a binary string according to the structure of the classification rule [8]. A rule is divided into two parts, the rule characteristic (premise) and the rule class (conclusion), for example:

    IF PRICE(moderate, low) ∧ QUALITY(high, normal) ∧ SERVICE(good) THEN CLASS=acc

Suppose every characteristic attribute is of discrete type (numerical attributes are discretized first). If a discrete attribute has k possible values, k bits are allocated in the binary string, one bit per value: 0 means the value does not appear in the disjunction, and 1 means it does. To simplify the algorithm, the class of an individual is encoded as a short contiguous binary code appended to the string.

For the rule above, let the value field of the characteristic attribute PRICE be {high, moderate, low}, that of QUALITY be {high, normal, poor}, that of SERVICE be {good, bad}, and let the value field of the class attribute be {uacc, acc, good, vgood}, coded as 00, 01, 10 and 11 respectively. The rule is then expressed as the genome (binary string) 0111101001, and rules correspond to genomes one to one. Likewise, the binary string 0011001011 corresponds to the rule:

    IF PRICE(low) ∧ QUALITY(high) ∧ SERVICE(good) THEN CLASS=vgood

If the code of a characteristic attribute consists entirely of 1s, that attribute does not constrain the rule, whatever its value. For example, if PRICE is coded as 111 and the whole code is 1111001010, the rule means that if the quality of a commodity is high and its service is good, then the commodity is rated good regardless of price. Conversely, the rule "if the service of a commodity is bad, then the commodity is unaccepted", i.e. IF SERVICE(bad) THEN CLASS=uacc, corresponds to the binary string 1111110100.
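As an illustration of this coding scheme, a minimal Python sketch is given below. The names used here (ATTRIBUTES, CLASSES, encode, decode) are ours, introduced only to make the mapping between rules and bit strings concrete; they are not part of the original algorithm.

# A minimal sketch of the genome coding described above; the names below
# (ATTRIBUTES, CLASSES, encode, decode) are illustrative only.
# Each characteristic attribute with k possible values occupies k bits
# (1 = the value appears in the disjunction, 0 = it does not), and the
# class is appended as a fixed-width binary code.

ATTRIBUTES = [("PRICE", ["high", "moderate", "low"]),
              ("QUALITY", ["high", "normal", "poor"]),
              ("SERVICE", ["good", "bad"])]
CLASSES = ["uacc", "acc", "good", "vgood"]            # coded 00, 01, 10, 11

def encode(premise, cls):
    """premise: dict mapping attribute name -> set of allowed values."""
    bits = ""
    for name, values in ATTRIBUTES:
        allowed = premise.get(name, set(values))      # absent attribute -> all 1s
        bits += "".join("1" if v in allowed else "0" for v in values)
    return bits + format(CLASSES.index(cls), "02b")   # 2-bit class code

def decode(bits):
    premise, pos = {}, 0
    for name, values in ATTRIBUTES:
        chunk = bits[pos:pos + len(values)]
        premise[name] = {v for v, b in zip(values, chunk) if b == "1"}
        pos += len(values)
    return premise, CLASSES[int(bits[pos:pos + 2], 2)]

# IF PRICE(moderate, low) AND QUALITY(high, normal) AND SERVICE(good) THEN CLASS=acc
print(encode({"PRICE": {"moderate", "low"},
              "QUALITY": {"high", "normal"},
              "SERVICE": {"good"}}, "acc"))           # prints 0111101001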
2.3 Definition of the fitness function
Mining classification rules from the data with a GA proceeds by repeatedly selecting "good rules" as parent rules for reproduction, crossover and mutation until an optimized rule group is discovered. A "good rule" is one that matches the instances of the test data set well (an instance, also called a record, contains both characteristic and class attributes), so the fitness function must reflect the degree of matching between a rule and the data set. Three parameters of a rule are considered in the definition of the fitness function: accuracy, utility and coverage. In the following, U denotes the test data set and e an instance (element) of U.

Accuracy: The accuracy of a rule ri measures how well the instances of the test data set U that satisfy the premise of ri also satisfy its conclusion:

    Accuracy(ri) = |Uri| / |U'ri|

where Uri is the subset of U whose elements (instances) match the rule ri completely, |Uri| is the cardinality of Uri, U'ri is the subset of U whose elements match only the characteristic part (premise) of ri, and |U'ri| is the cardinality of U'ri. Clearly, the higher the accuracy of a rule, the more the rule can be trusted. If the accuracy is 1, the rule holds throughout the test data set U: whenever the condition of the rule is satisfied, its conclusion is satisfied as well.

Utility: Within the test data set U, an instance e may match several of the rules being evolved, and each such rule receives a different measurable utility. If an instance e of U matches only one rule of the current population, the utility contributed to that rule is 1; if e matches m rules of the current population, each of those rules receives a utility of 1/m. This is expressed by the formula:

    Utility(ri) = Σ(e∈U) δ(ri, e) / Σ(r∈P) δ(r, e)

where δ(r, e) = 1 if the instance e matches the rule r and 0 otherwise, U and e are defined as above, and P is the current population of rules.

Coverage: The coverage of a rule ri is the proportion of instances e of the test data set U that match the premise (characteristic) part of ri:

    Coverage(ri) = |U'ri| / |U|

Clearly, the higher the coverage of a rule, the better its generality.

Since the values of these three parameters all affect the fitness of a rule, higher values are desirable for all of them. Before fixing the fitness function of a rule, the relationships between rules must be analyzed. Three relationships are possible between rules: containment, redundancy and contradiction. For example:

Containment:
    IF PRICE(moderate, low) ∧ SERVICE(good) THEN CLASS=acc
    IF SERVICE(good) THEN CLASS=acc

Contradiction:
    IF PRICE(high) ∧ QUALITY(poor) ∧ SERVICE(bad) THEN CLASS=acc
    IF PRICE(high) ∧ QUALITY(poor) ∧ SERVICE(bad) THEN CLASS=uacc

Redundancy: simply two identical rules.

These relationships must be taken into account in the design of the rule fitness function: contained, contradictory and redundant rules are deleted to ensure that every parent rule selected is a "good rule".
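The three measures defined above may be illustrated by the following minimal Python sketch. It assumes, purely for illustration, that a rule is held in decoded form (a premise dictionary plus a class) and that an instance is a dictionary of attribute values; the function names are ours and are not taken from the original algorithm.

# A minimal sketch of the three fitness components defined above; names are
# illustrative only.  A rule is a pair (premise, cls), where premise maps
# each attribute to the set of values allowed in its disjunction, and an
# instance is a dict of attribute values plus a "class" entry.

def premise_matches(rule, instance):
    premise, _ = rule
    return all(instance[a] in allowed for a, allowed in premise.items())

def rule_matches(rule, instance):
    return premise_matches(rule, instance) and instance["class"] == rule[1]

def accuracy(rule, U):
    premise_hits = [e for e in U if premise_matches(rule, e)]        # U'ri
    if not premise_hits:
        return 0.0
    full_hits = [e for e in premise_hits if rule_matches(rule, e)]   # Uri
    return len(full_hits) / len(premise_hits)

def utility(rule, U, population):
    total = 0.0
    for e in U:
        if rule_matches(rule, e):
            m = sum(1 for r in population if rule_matches(r, e))     # rules matched by e
            total += 1.0 / m                                         # each gets 1/m
    return total

def coverage(rule, U):
    return sum(premise_matches(rule, e) for e in U) / len(U)         # |U'ri| / |U|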
The following algorithm evaluates the rule fitness:

Step 1: Let U be the test data set and P the current rule (genome) population of size N.
Step 2: Calculate Accuracy(ri) and Utility(ri) for each rule ri (i = 1, 2, …, N) in the current population P.
Step 3: Sort the rules ri in descending order of the product Accuracy(ri) × Utility(ri) and write the result into a sort table.
Step 4: Calculate Coverage(rj) of the current rule rj in the sort table (initially the first rule, j = 0) and set fitness(rj) = Coverage(rj) × Accuracy(rj) × Utility(rj).
Step 5: Delete from U the instances covered by the rule rj, i.e. U ← U − U'rj.
Step 6: If rj is the last rule in the sort table, the fitness of every rule has been calculated and the procedure ends; otherwise set j = j + 1 and go to Step 4.

2.4 Genetic operators

2.4.1 Selection operator
We adopt the usual selection policy: fitness scaling combined with roulette-wheel selection.

2.4.2 Crossover operator
The crossover operator plays a very important role. It preserves, to a certain extent, the features of the best individuals of the original population while allowing the algorithm to search new regions of the gene space, thus maintaining diversity in the population. In the algorithm described in this paper, the crossover operator uses single-point crossover: a crossover point is chosen at random within the individual binary gene string and the parents' genomes are crossed over at that point.

2.4.3 Mutation operator
The mutation operator changes the gene values at some positions of an individual's binary gene string in the population, i.e. 1→0 and 0→1.

2.5 Algorithm description
The following is the flow of the data mining algorithm based on the genetic algorithm.

Begin
  Input: the test data set U and all control parameters of the genetic algorithm
  Output: the evolved, optimized set of classification rules
  Step 1: Initialize the population: N genomes (binary gene strings) are generated at random and adjusted so that each represents a valid rule.
  Step 2: Calculate the fitness value of each individual (rule) in the current population.
  Step 3: If the number of generations evolved so far has reached the preset maximum, or some other control parameter meets its stopping condition, go to Step 7; otherwise continue.
  Step 4: Apply selection, crossover and mutation to the current generation to produce the offspring population.
  Step 5: Replace the low-fitness individuals of the parent population with the high-fitness individuals of the offspring population to form the new generation.
  Step 6: Go to Step 2.
  Step 7: Output the optimized rule set.
End
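The operators and the overall flow above might be realized along the following lines. This is a rough Python sketch, not the authors' implementation: evaluate_fitness(population, U) is assumed to decode each genome and apply the rule-fitness procedure of Section 2.3 (it is not shown here), and the replacement of Step 5 is simplified to plain generational replacement.

import random

# A rough sketch of the genetic operators and main loop described above;
# genomes are fixed-length binary strings as in Section 2.2.

def roulette_select(population, fitness):
    pick, acc = random.uniform(0, sum(fitness)), 0.0
    for genome, f in zip(population, fitness):
        acc += f
        if acc >= pick:
            return genome
    return population[-1]

def crossover(p1, p2, pc=0.9):
    if random.random() < pc:                       # single-point crossover
        point = random.randrange(1, len(p1))
        return p1[:point] + p2[point:], p2[:point] + p1[point:]
    return p1, p2

def mutate(genome, pm=0.03):                       # bit-flip mutation, 1->0 and 0->1
    return "".join(b if random.random() >= pm else ("0" if b == "1" else "1")
                   for b in genome)

def evolve(U, evaluate_fitness, genome_len=10, n=128, generations=500,
           pc=0.9, pm=0.03):
    population = ["".join(random.choice("01") for _ in range(genome_len))
                  for _ in range(n)]
    for _ in range(generations):
        fitness = evaluate_fitness(population, U)  # Section 2.3 procedure (assumed)
        offspring = []
        while len(offspring) < n:
            a = roulette_select(population, fitness)
            b = roulette_select(population, fitness)
            c1, c2 = crossover(a, b, pc)
            offspring += [mutate(c1, pm), mutate(c2, pm)]
        population = offspring[:n]                 # generational replacement (simplified)
    return population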
3. Simulation on an example

As an example, suppose the acceptance of a commodity by customers is uacc, acc, good or vgood, and is judged by price, quality and service, where the price may be high, moderate or low, the quality high, normal or poor, and the service good or bad. The training examples are listed in the following table:

    Example no.   Price      Quality   Service   Acceptance (classification)
    1             High       Poor      Bad       Uacc
    2             Moderate   Normal    Bad       Uacc
    3             Low        High      Good      Vgood
    4             High       High      Good      Good
    5             Moderate   High      Good      Good
    6             Moderate   Normal    Good      Acc
    7             High       Normal    Good      Acc
    8             High       Poor      Good      Uacc
    9             Moderate   Poor      Bad       Uacc
    10            Low        Poor      Bad       Uacc
    11            High       Normal    Good      Acc
    12            Moderate   Normal    Good      Acc
    13            Low        Normal    Good      Good

Running the genetic algorithm on this example with crossover probability Pc = 0.9, mutation probability Pm = 0.03, population size P_SIZE = 128 and 500 iterations, the following two valid rules were mined:

Rule 1: IF QUALITY(poor) THEN CLASS=uacc (fitness = 36.00), meaning that a commodity of poor quality is not accepted by the customer.

Rule 2: IF SERVICE(good) THEN CLASS=acc (fitness = 7.88), meaning that a commodity with good service is accepted by the customer.

This example only shows that the genetic algorithm is valid for mining classification rules; in real data mining the given data set or metadata (meta-knowledge), drawn from general and domain knowledge, may contain millions of records.
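For illustration only, the training data and control parameters of this experiment could be transcribed as follows; evolve and evaluate_fitness refer to the illustrative sketches given earlier, not to the authors' implementation.

# Illustrative transcription of the 13 training examples and the control
# parameters reported above.
examples = [
    {"PRICE": "high",     "QUALITY": "poor",   "SERVICE": "bad",  "class": "uacc"},
    {"PRICE": "moderate", "QUALITY": "normal", "SERVICE": "bad",  "class": "uacc"},
    {"PRICE": "low",      "QUALITY": "high",   "SERVICE": "good", "class": "vgood"},
    {"PRICE": "high",     "QUALITY": "high",   "SERVICE": "good", "class": "good"},
    {"PRICE": "moderate", "QUALITY": "high",   "SERVICE": "good", "class": "good"},
    {"PRICE": "moderate", "QUALITY": "normal", "SERVICE": "good", "class": "acc"},
    {"PRICE": "high",     "QUALITY": "normal", "SERVICE": "good", "class": "acc"},
    {"PRICE": "high",     "QUALITY": "poor",   "SERVICE": "good", "class": "uacc"},
    {"PRICE": "moderate", "QUALITY": "poor",   "SERVICE": "bad",  "class": "uacc"},
    {"PRICE": "low",      "QUALITY": "poor",   "SERVICE": "bad",  "class": "uacc"},
    {"PRICE": "high",     "QUALITY": "normal", "SERVICE": "good", "class": "acc"},
    {"PRICE": "moderate", "QUALITY": "normal", "SERVICE": "good", "class": "acc"},
    {"PRICE": "low",      "QUALITY": "normal", "SERVICE": "good", "class": "good"},
]

PC, PM, P_SIZE, GENERATIONS = 0.9, 0.03, 128, 500   # parameters used in the run

# final_rules = evolve(examples, evaluate_fitness, genome_len=10,
#                      n=P_SIZE, generations=GENERATIONS, pc=PC, pm=PM)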
4. Conclusion

Based on the realization technology of the genetic algorithm and in combination with the mining of knowledge classification rules in data mining, this paper has presented detailed methods for fixing the parameters and operations of the genetic algorithm, including the genome coding method and the design of the genetic operators, and especially the definition and realization of the fitness function. Finally, the data mining algorithm based on the genetic algorithm has been given.

In data mining, because the genetic algorithm has the characteristics of strong robustness, global search capability and intrinsic parallelism, it can evolve on the basis of meta-knowledge (metadata) to obtain more satisfactory knowledge and rules; moreover, previously unknown knowledge may be generated by the mutation operator of the genetic algorithm, as if searching out knowledge hidden in the data, which is fully illustrated by the example described in this paper.

REFERENCES
1. Tubao Ho, Trongdung Nguyen, Ducdung Nguyen, Saori Kawasaki, Visualization Support for User Centered Model Selection in Knowledge Discovery and Data Mining, International Journal of Artificial Intelligence Tools, 2001, 10(4), 691-713.
2. Susan E. George, A Visualization and Design Tool (AVID) for Data Mining with the Self-Organizing Feature Map, International Journal of Artificial Intelligence Tools, 2000, 9(3), 369-375.
3. Tzung-Pei Hong, Chan-Sheng Kuo, Sheng-Chai Chi, Trade-off Between Computation Time and Number of Rules for Fuzzy Mining from Quantitative Data, International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems, 2001, 9(5), 587-604.
4. Vladimir Estivill-Castro, Jianhua Yang, Clustering Web Visitors by Fast, Robust and Convergent Algorithms, International Journal of Foundations of Computer Science, 2002, 13(4), 497-520.
5. R. Félix, T. Ushio, Binary Encoding of Discernibility Patterns to Find Minimal Coverings, International Journal of Software Engineering and Knowledge Engineering, 2002, 12(1), 1-18.
6. D. E. Goldberg, Genetic Algorithms in Search, Optimization and Machine Learning, Addison-Wesley, 1989.
7. Ai Lirong, He Huacan, A Summary of Genetic Algorithms, Journal of Computer Application and Research, 1997, 14(4), 3-6.
8. Xiao Yong, Chen Yiyun, Constructing Decision Trees by Using Genetic Algorithm, Journal of Computer Research and Development, 1998, 35(1), 49-52.
9. Xiaomin Zhong, Eugene Santos, Directing Genetic Algorithms for Probabilistic Reasoning Through Reinforcement Learning, International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems, 2000, 8(2), 167-185.
10. Gary William Grewal, Thomas Charles Wilson, An Enhanced Genetic Algorithm for Solving the High-Level Synthesis Problems of Scheduling, Allocation, and Binding, International Journal of Computational Intelligence and Applications, 2001, 10(1), 91-110.