KnowledgeDiscovery using Genetic Programmingwith RoughSet Evaluation David H. Foster, W. James Bishop, Scott A. King, Jack Park ThinkA1ongSoftware,Inc. Brownsville, California 95919 jpark@aip.org Abstract An important area of KDDresearch involves development of techniques which transform raw data into forms more useful for prediction or explanation. Wepresent an approach to automating the search for "indicator functions" whichmediate such transformations. The fitness of a function is measuredas its contribution to discerning different classes of data. Genetic programming techniques are applied to the search for and improvementof the programs which makeup these functions. Roughset theory is used to evaluate the fitness of functions. Roughset theory provides a unique evaluator in that it allows the fitness of each function to dependon the combinedperformance of a population of functions. This is desirable in applications which need a population of programsthat perform well in concert and contrasts with traditional genetic programmingapplications which have as there goal to find a single program which performs well. This approach has been applied to a small database of iris flowers with the goal of learning to predict the species of the flower given the values of four iris attributes and to a larger breast cancer database with the goal of predicting whether remission will occur within a five year period. Introduction An important area of KDDresearch involves development of techniques which transform raw data into forms more useful for prediction or explanation. Wepresent an approach to automating the search for "indicator functions" whichmediate such transformations. The fitness of a function is measuredas its contribution to discerning different classes of data. Genetic programming techniques are applied to searching for and improvingthe programs which makeup these functions. Roughset theory is used to evaluate the fitness of functions. Roughset theory provides a unique evaluator in that it allows the fitness of each function to dependon the combinedperformanceof a population of functions. This is desirable in applications which need a population of programsthat perform well in concert and contrasts with traditional genetic programmingapplications which have as there goal to find a single program which performs well. This approach has been applied to a small database of iris flowers with the goal of learning to predict the species of the flower given the values of four iris attributes (Fisher’s iris data reproducedin Salzberg, 1990) and to a larger breast cancer database (breast cancer data reproducedin Salzberg, 1990) with the goal of predicting whether remission will occur within a five year period. The process begins by applying a population of randomly-generated programs to elements of a database. Programresults are placed in a matrix and evaluated to obtain a measureof each program’s fitness. These fitness values are used to determine which programswill be kept and used in breeding the next generation. This process continues for a specified numberof generations and is illustrated in Figure 1. From: AAAI Technical Report WS-93-02. Compilation copyright © 1993, AAAI (www.aaai.org). All rights reserved. Page 254 KnowledgeDiscovery in Databases WorkshopI993 AAAI-93 Reproduction Crossover Mutation I Population of programs P = {pl,p2,...,pn} Execution Rankorderedprograms Attribute valuematrix P x Data I 1 Knowledge Base Application Data: Program primitives Significance of programs Minimal program sets =@ I Figure 1: Genetic Programming Executing Programsand Filling Evaluation Matrix Programs are constructed of boolean operators connecting application-specific primitives. Our implementation constructs programs using scheme, a dialect of lisp. This representation allows programs to be both easily modified and directly executable. An evaluation matrix is created by applying each program to each point in the training data. For each data point fi), and for each program (j), the programis executed and the result is stored at location (i,j) in the matrix. Thus each column of the matrix contains the values returned by a particular program for all the data points, and each row contains the values returned by all programs for a particular data point. Rough set theory is applied to the evaluation matrix to determine each programs fitness. RoughSet Fitness Evaluation Our fitness evaluation is based on rough classification described by Pawlak (1984,1991) and Ziarko (1989). These methods allow the fitness of functions to be evaluated in a context with other functions. Weexpect this approach to promote population diversity by a natural tendency to assign lower fitness to redundant programs. Definitions S = (U,A,V,f) is an informationsystemconsisting of a set of data objects (U), a set of attributes (A), set of possible attribute values(V), and a function (f) that mapsdata object-attributes pairs attribute values. U is the set of all dataobjects.In our irises applicationit is the set of all examples in our database. A is the set of all attributes. In our implementationit is the set of conceptsmeasured or tested by the attribute programs.Theset A also correspondsto the set of all columnsin the evaluation matrix. V is the domainof attribute values. In our irises applicationit is {setosa,virginica, versicolor} in the prediction columnand { true, false} elsewhere. f is the description function mappingUxA->V.In our implementationit correspondsto the evaluation matrix. AAAI-93 Knowledge Discove~ in Databases Workshop 1993 Page 255 Nis the set of predictedattributesor columns of the evaluation matrix.In our applicationit is a single columncontainingthe speciesnamefor the data objects. P is the set of predictorattributes or columns of the evaluationmatrix. Thesecorrespond to the set of programsto be determined. Mis a minimalsubsetof P whichretainsthe full ability of P to discernelements of N’. N’ is the set of elementary sets or equivalence classesbasedon predictedattributes. It corresponds to the set of sets of rowswhichhavematching valuesin all the N columns. P’ is the set of elementary sets or equivalence classesbasedon all predictorattributes. It corresponds to the set of sets of rowswhichhavematching valuesin all the P columns. M’ is a set of elementary sets or equivalence classesbasedon a minimalset of predictorattributes. It corresponds to the set of sets of rowswhichhavematchingvaluesin all the Mcolumns. ind(P)n’~is the unionof all elementary sets in P’ whicharesubsetsof the ith element in N’. These are the lowerapproximations of the sets in N’. POS(P,N) is the unionof ind(P)n’~for all elements n’~ in N’. Thisis the subsetof U for which sufficient for disceming membership in the equivalenceclassesof N’. k(P,N)- card(POS(P,N)) / card(U). Thisis the fraction of the set Ufor whichP is sufficient for discerningmembership in the equivalence classesof N’. SGF(P,N,p~ = (k(P,N)- k(P-p,N)) / k(P,N). Thisis the relative changein k(P,N)resulting fromdeletionof p~ fromP. SGF hasthe advantage that it rewards programs for their contributionto recall or predictionin the contextof all other programs. Evaluating Attributes The sets and measuresabove are used to determine the significance of membersof P for classifying membersof Uinto the equivalence classes of N’. Onegoal is to find a minimalsubset of attributes that retains the capability of all P for discerning elements of N’. Anothergoal is to assign a significance value to each attribute. This value will be used in calculating programfitness for the genetic programmingprocess. The following methodcalculates significance factors for each program while determining a minimal set. Programswith zero significance are not included in the minimalset, M. Initialize: M1 = P For eachPkin P: Calculate SGF(Mk,N,p~) If SGF(Mk,N,pk ) =0 then Mk+ 1 = Mk-Pk else Mk,1 k =M Theminimalset is not unique. The order of selecting Pk during calculation of Mwill affect the final contents of M. Wehave chosen to select the Pk beginning with the previous generation and in order from lowest to highest SGF.This choice of old rules first encourages turnover in the population by allowing an older programto be replaced by an equivalent set of new programs. Page 256 KnowledgeDiscove~ in Databases Workshop 1998 AAALg$ Genetic Programming Process Our implementation of genetic programmingis based on Koza(1992). Genetic programmingis used modify a population of programs which perform tests on the data. Programsin this study are constructed of boolean expressions. The terminal values of these expressions are generated by application-specific primitives. These primitives are simple functions which return true and false answers to questions about the data. For exampleone primitive asks whether the r~itio of petal length to petal width is greater than a randomlygenerated number. The following describes the generation cycle: Createan initial (random) populationof programs. 1. Executeand evaluate programsto determinefitness. 2. Rankorder programsaccordingto fitness. 3. Generatea newpopulationof programsby applyingreproduction,crossover,andmutationto the best of the old programs. Goto step 1. NOTE:This cycle terminates whena predefined numberof generations havepassedor when accuracyof recall andpredictionhasfailed to improvein the previousn generations.Referto "TestingRecall andPrediction". Normallythe result is chosen as the best programto appear in any generation. In our application the result is the final population of programswith non-zero fitness. These programswill be used in concert for recall and prediction. Survival and Reproduction While investigating survival and reproduction methods we examinedthe following: Program Survival Wehave experimentedwith three options for survival of programsfrom one generation to the next. 1. Retaina fixed numberof the best programs. 2. Calculatethe minimum SGFfor all programs andretain all programs with an SGFgreaterthan the minimum. NOTE: Typically several programswill possesthe minimum SGFandwill therefore be discarded. with SGF > 0. Notethat Mis a minimalset so it will by definition containa 3. RetainM;i.e. programs diverse population. Program Reproduction Wehave experimentedwith two options for production of newprograms from one generation to the next. 1. Replaceall discardedprogramsthus maintaininga constantP. 2. Supplement M by a constant numberof programsthus allowing P to increase or decrease accordingto the size of the minimalset M. This causesthe programto adaptas the problem progressestowardsolution. Regardless of the options chosen the general sequence of events were as follows: 1. Select survivors. 2. Determinethe numberor programsto be reproduced. reproducea fraction of thesevia crossover. 3. Accordingto a preset percentage 4. Reproduce the remainder via mutation. AAA1-98 KnowledgeDiscover# in Databases Workshop1998 Page 257 Crossover and Mutation Crossover was implementedby selecting one sub-expression from each parent and swappingthem to produce two new programs. Mutation was implementedby selecting a sub-expression within a single parent and replacing it with a randomlygenerated expression. Program Crossover The crossover operator first conducts a randomselection of two membersof the current population of programs. The selection is weighted by programfitness in the "roulette wheel" fashion. This involves assigning each programa fraction of an interval proportional to its fitness, randomlyselecting an element of this interval, and choosing the programdetermined by this element. A crossover point is selected, again at random, within each programand the sub-expressions below these points are swappedto produce two new programs. Figure 2 shows two programs with sub-expressions selected. Figure 2: Sub-expressions selectedfor crossoverin parent programs Figure 3 showsthe two new programscreated by the crossover operation. Figure 3: Newprogramscreated by crossover Program Mutation The mutation operator first makesa weighted randomselection of one memberof the current population of programs. A mutation point is randomlyselected within the programand the sub-expression below this point is replaced with a randomlygenerated expression. Figure 4 showsan exampleof an original programand one possible result of applying the mutation operator. Page 258 KnowledgeDiscovery in Databases Workshop1993 AAA1-93 Figure 4: Original andnewprogramcreatedby mutation Testing Recall and Prediction Wehave tested our implementation with a relatively small database of iris flowers (Fisher’s iris data reproduced in Salzberg, 1990) and with a larger database of breast cancer patients (Breast cancer data reproduced in Salzberg, 1990). Wetested for both recall and prediction. The data waspartitioned into two disjoint sets for training and prediction testing. Recall was tested using a subset of the training data. Prediction was tested using data which was not used in training. Testing was accomplishedas follows: Eachprogram in the minimalset, M, is appliedto the test datapoint. Theresulting list of valuesis matched againstthe corresponding valuesin eachrowof the evaluationmatrix. Weusean analogof hamming distanceto select a matchingrowfor prediction. Wecalculate the distanceas the sumof SGF(M,N,m~ for columnsm, that do not matchthe corresponding test value. Therowwith the minimum distancefromthe list of test valuesis selected.If severalrowshavethe samedistancemeasure thenthe first of theseis arbitrarily selected. Thefield to be recalled or predictedis retrieved fromthe Ncolumns of the selectedrow. Success is measured as the fraction of the test datathat is correctly recalledor predicted. Our testing confirmedthe ability of this approach to recall stored patterns and to predict from unseen patterns. IRIS DATA Each entry includes the nameof the species and values for sepal length, sepal width, petal length, and petal width. The prediction field is species. Weachieved 100%accuracy on recall and 96%accuracy for prediction. This is consistent with the 93% to 100%accuracy reported by Salzberg (1990). Our results for the iris database are illustrated below. Figure 4 showsthe k(P,N) values obtained during iris training. t ,¢ 0.0 0.6 0.4 0.2 20 40 60 O0 I00 Figure4: k(P.N)duringiris training AAAI-9$ KnowledgeDiscovery in Databases Workshoplgg3 Page 2,59 Figure 5 showsresults of testing for pattern recall during iris training. This data was obtained by testing on a subset of the data used for training. koourtoy Tr~inlngDi%a 0.9 0.8 0.9 0.6 ¯ . ¯ 0 20 ~neration~ 40 60 80 100 Figure5: Recallaccuracy duringiris training Figure 6 showsresults of testing for predictive ability during iris training. This data was obtained by testing on a subset of the data disjoint fromthat used for training. Test D~t~ I 0.9 0.O 0.7 0.6 Oer.rttio:~ 0 20 40 60 80 100 Figure6: Predictionaccuracy duringiris training Figure7 showschanges in the size of the minimalset, M, during iris training. HII_SET 10 9 e ? 5 8enerttlons RO 40 60 80 100 Figure7: Number of programs in minimalset dudngiris training Page 260 Knowledge Discove~ in Databases Workshop 1993 AAAI-93 CANCER DATA Each entry includes the patient age, whether they had gone through menopause,which breast had the cancer, whether radiation was used, whether the patient suffered a recurrence during the five years following surgery, the tumor diameter and four other measuresof the tumor itself. The prediction field is recurrence. Weachieved 100%accuracy on recall and 79%accuracy for prediction. Again this is comparableto the 78%accuracy reported by Salzberg (1990). These results are illustrated below. Figure 8 shows the k(P,N) values obtained during cancer training. 0.$ 0.6 0.4 , i 0 | 30 20 Figure8: k(P.N)during cancertraining 10 8enerstions I 40 Figure 9 showsresults of testing for pattern recall during cancer training. ~ouraolr 0.9 0.8 0.7 0.6 0.5 i 0 AAAI-93 ! 10 l 20 30 Figure 9: Recall accuracyduring cancer training KnowledgeDiscovery in Databases Workshop1998 I 8eneratior~ 40 Page 261 Figure 10 showsresults of testing for predictive ability during cancer training. &co~Lolr TestO~t~ 0.9 0.8 0.7 0.6 0.5 i 10 | I 9.0 30 i , 40 Figure1 O: Predictionaccuracyduringcancertraining Figure 11 showschanges in the size of the minimalset, M, during cancer training. MIN_SET 2O 18 \ 16 14 12 I 10 20 30 40 Figure 11: Number of programsin minimalset during cancertraining Conclusions Our test results suggest there maybe two distinct aspects to learning in this approach. Thefirst is the developmentof a population of programssufficient to recall membersof the training set. Wesee this in Figures 4 and 5 and Figures 8 and 9 where k(P,N) and the recall accuracy proceed to the maximum value of 1.0. At this point we might expect learning to stop. However,in a numberof trials we found that prediction accuracy continued to improve, albeit sometimeserratically. Lookingat Figure 6 wesee that prediction accuracy continued to increase for sometime after k(P,N) reached its maximum.Wesuggest an explanation maybe found in the size of the minimalset shownin Figure 7. At about the sametime as prediction accuracy began its rise to a final maximum of 96%the size of the minimalset began to decrease from a high of 10 programsto a range of 6-8 programs.This is not inconsistent with other machine-learning techniques in which smaller representations tend to have a greater ability to generalize. Webelieve the approach described here will have important advantages for some difficult applications of KDD.The biggest drawbackto this approach is the computational cost. Our implementation performed well on iris and cancer data bases but the computational burden was a significant problemin preliminary tests on the muchlarger task of discovering regularities betweenthe primary and secondary structure of proteins. Wehave experimented with reducing the effective size of the data set through randomsampling. This madethe evaluation process muchfaster, but accuracy tended to stop improvingand begin fluctuating about slightly lower values. Page 262 KnowledgeDiscoverT in Databases Workshop1998 AAAI-9$ Future directions There are several aspects of this approach that warrant further research. The most important of these are related to improvingcomputational efficiency. It would be useful to explore the relationships between the complexity of individual programs, the size of the programpopulation, and the time to solution. Reducingthe search space for the genetic programmingprocess is a worthwhile goal. The use of more intelligent methodsfor creating and modifyingprimitive functions is one possibility. Another possibility wouldbe to reduce the dimensionality of the data using application-specific knowledgefor preprocessing. Alternatives to strict boolean expressions should also be explored. Stopping criteria wouldbe useful for less goal-oriented applications of KDDsuch as database mining. For example,in a blind search for regularities, the data attributes might be partitioned into different N (predicted) and P (predictor) sets. The methodspresented here wouldbe applied for each partition in a search for interdependencies. REFERENCES Koza, John R. Genetic Programming:on the programmingof computers by meansof natural selection. 1992, Cambridge MA, MITPress Orlowska, Ewaand Pawlak, Zdzislaw. "Expressive power of knowledge representation systems," 1984, Int. J. Man-Machine Studies, vol. 20 pp. 485-500, London,AcademicPress Inc. Park, Jack. "Onan Approachto Index Discovery." 1993, submitted to KDD1993. Park, Jack, and Dan Wood.User’s Manual--TheScholar’s Companion.1993, Brownsville, CA, ThinkAlong Software. Pawlak, Zdzislaw. "RoughClassification," London,AcademicPress Inc. 1984, Int. J. Man-Machine Studies, vol. 20 pp. 469-483, Pawlak, Zdzislaw. RoughSets: Theoretical Aspects of Reasoning about Data, 1991, Boston, Kluwer AcademicPublishers. Salzberg, Steven L., Learning with nested generalized exemplars, 1990, Boston, KluwerAcademic Publishers. Ziarko, Wojciech. "A Technique for Discovering and Analysis of Cause-Effect Relationships in Empirical Data," 1989, KDDWorkshop 1989. AAAI-9$ KnowledgeDiscovery in Databases Workshop1993 Page 263