Knowledge Discovery using Genetic Programming with Rough Set Evaluation

David H. Foster, W. James Bishop, Scott A. King, Jack Park
ThinkAlong Software, Inc.
Brownsville, California 95919
jpark@aip.org
Abstract

An important area of KDD research involves development of techniques which transform raw data into forms more useful for prediction or explanation. We present an approach to automating the search for "indicator functions" which mediate such transformations. The fitness of a function is measured as its contribution to discerning different classes of data. Genetic programming techniques are applied to the search for and improvement of the programs which make up these functions. Rough set theory is used to evaluate the fitness of functions.

Rough set theory provides a unique evaluator in that it allows the fitness of each function to depend on the combined performance of a population of functions. This is desirable in applications which need a population of programs that perform well in concert, and contrasts with traditional genetic programming applications which have as their goal to find a single program which performs well.

This approach has been applied to a small database of iris flowers, with the goal of learning to predict the species of the flower given the values of four iris attributes, and to a larger breast cancer database, with the goal of predicting whether remission will occur within a five year period.
Introduction

An important area of KDD research involves development of techniques which transform raw data into forms more useful for prediction or explanation. We present an approach to automating the search for "indicator functions" which mediate such transformations. The fitness of a function is measured as its contribution to discerning different classes of data. Genetic programming techniques are applied to searching for and improving the programs which make up these functions. Rough set theory is used to evaluate the fitness of functions.

Rough set theory provides a unique evaluator in that it allows the fitness of each function to depend on the combined performance of a population of functions. This is desirable in applications which need a population of programs that perform well in concert, and contrasts with traditional genetic programming applications which have as their goal to find a single program which performs well.

This approach has been applied to a small database of iris flowers, with the goal of learning to predict the species of the flower given the values of four iris attributes (Fisher's iris data reproduced in Salzberg, 1990), and to a larger breast cancer database (breast cancer data reproduced in Salzberg, 1990), with the goal of predicting whether remission will occur within a five year period.

The process begins by applying a population of randomly-generated programs to elements of a database. Program results are placed in a matrix and evaluated to obtain a measure of each program's fitness. These fitness values are used to determine which programs will be kept and used in breeding the next generation. This process continues for a specified number of generations and is illustrated in Figure 1.
From: AAAI Technical Report WS-93-02. Compilation copyright © 1993, AAAI (www.aaai.org). All rights reserved.
Figure 1: Genetic Programming. [Diagram: a population of programs P = {p1, p2, ..., pn} is executed against the data to fill an attribute value matrix (P x Data); evaluation yields the significance of programs and minimal program sets; rank-ordered programs then undergo reproduction, crossover, and mutation to produce the next generation. Inputs include application data, program primitives, and a knowledge base.]
Executing Programs and Filling the Evaluation Matrix

Programs are constructed of boolean operators connecting application-specific primitives. Our implementation constructs programs using Scheme, a dialect of Lisp. This representation allows programs to be both easily modified and directly executable.

An evaluation matrix is created by applying each program to each point in the training data. For each data point (i), and for each program (j), the program is executed and the result is stored at location (i,j) in the matrix. Thus each column of the matrix contains the values returned by a particular program for all the data points, and each row contains the values returned by all programs for a particular data point.

Rough set theory is applied to the evaluation matrix to determine each program's fitness.
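The matrix-filling step above can be sketched as follows. The paper's implementation is in Scheme; this Python version, along with the toy data points and example boolean "programs", is an illustrative assumption rather than the authors' code.

```python
# Sketch of filling the evaluation matrix: entry (i, j) is program j
# applied to data point i. Programs and data here are hypothetical.

def fill_evaluation_matrix(programs, data):
    """Apply every program to every data point."""
    return [[program(point) for program in programs] for point in data]

# Two toy boolean primitives over (petal_length, petal_width) pairs.
programs = [
    lambda p: p[0] > 2.5,           # petal length test
    lambda p: p[0] / p[1] > 3.0,    # petal length/width ratio test
]
data = [(1.4, 0.2), (4.7, 1.4), (5.1, 1.9)]

matrix = fill_evaluation_matrix(programs, data)
# Each row holds all program results for one data point:
# [[False, True], [True, True], [True, False]]
```

Each column of `matrix` is then treated as one attribute in the rough set evaluation that follows.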
Rough Set Fitness Evaluation

Our fitness evaluation is based on rough classification as described by Pawlak (1984, 1991) and Ziarko (1989). These methods allow the fitness of functions to be evaluated in a context with other functions. We expect this approach to promote population diversity by a natural tendency to assign lower fitness to redundant programs.

Definitions

S = (U, A, V, f) is an information system consisting of a set of data objects (U), a set of attributes (A), a set of possible attribute values (V), and a function (f) that maps data object-attribute pairs to attribute values.

U is the set of all data objects. In our irises application it is the set of all examples in our database.

A is the set of all attributes. In our implementation it is the set of concepts measured or tested by the attribute programs. The set A also corresponds to the set of all columns in the evaluation matrix.

V is the domain of attribute values. In our irises application it is {setosa, virginica, versicolor} in the prediction column and {true, false} elsewhere.

f is the description function mapping U x A -> V. In our implementation it corresponds to the evaluation matrix.
N is the set of predicted attributes, or columns of the evaluation matrix. In our application it is a single column containing the species name for the data objects.

P is the set of predictor attributes, or columns of the evaluation matrix. These correspond to the set of programs to be determined.

M is a minimal subset of P which retains the full ability of P to discern elements of N'.

N' is the set of elementary sets or equivalence classes based on predicted attributes. It corresponds to the set of sets of rows which have matching values in all the N columns.

P' is the set of elementary sets or equivalence classes based on all predictor attributes. It corresponds to the set of sets of rows which have matching values in all the P columns.

M' is a set of elementary sets or equivalence classes based on a minimal set of predictor attributes. It corresponds to the set of sets of rows which have matching values in all the M columns.

ind(P)n'_i is the union of all elementary sets in P' which are subsets of the ith element in N'. These are the lower approximations of the sets in N'.

POS(P,N) is the union of ind(P)n'_i for all elements n'_i in N'. This is the subset of U for which P is sufficient for discerning membership in the equivalence classes of N'.

k(P,N) = card(POS(P,N)) / card(U). This is the fraction of the set U for which P is sufficient for discerning membership in the equivalence classes of N'.

SGF(P,N,p_i) = (k(P,N) - k(P - p_i, N)) / k(P,N). This is the relative change in k(P,N) resulting from deletion of p_i from P. SGF has the advantage that it rewards programs for their contribution to recall or prediction in the context of all other programs.
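The measures defined above can be computed directly from an evaluation matrix. The following is a hedged Python sketch (the paper's code is in Scheme); the column indexing convention and the toy matrix are illustrative assumptions.

```python
# Rough set measures POS(P,N), k(P,N), and SGF(P,N,p) over an evaluation
# matrix, where p_cols/n_cols index predictor and predicted columns.

def partition(matrix, cols):
    """Group row indices into equivalence classes by their values in cols."""
    classes = {}
    for i, row in enumerate(matrix):
        key = tuple(row[c] for c in cols)
        classes.setdefault(key, set()).add(i)
    return list(classes.values())

def pos(matrix, p_cols, n_cols):
    """POS(P,N): rows whose P-class lies wholly inside one N-class."""
    n_classes = partition(matrix, n_cols)
    result = set()
    for p_class in partition(matrix, p_cols):
        if any(p_class <= n_class for n_class in n_classes):
            result |= p_class
    return result

def k(matrix, p_cols, n_cols):
    """k(P,N) = card(POS(P,N)) / card(U)."""
    return len(pos(matrix, p_cols, n_cols)) / len(matrix)

def sgf(matrix, p_cols, n_cols, p):
    """SGF(P,N,p) = (k(P,N) - k(P-{p},N)) / k(P,N)."""
    k_full = k(matrix, p_cols, n_cols)
    reduced = [c for c in p_cols if c != p]
    return (k_full - k(matrix, reduced, n_cols)) / k_full

# Toy matrix: columns 0-1 are predictor programs, column 2 is the class.
matrix = [
    [True,  True,  "setosa"],
    [True,  False, "setosa"],
    [False, False, "virginica"],
    [False, True,  "virginica"],
]

k(matrix, [0, 1], [2])       # 1.0: together the programs discern the classes
sgf(matrix, [0, 1], [2], 1)  # 0.0: program 1 is redundant given program 0
```

Note how the second program receives zero significance because the first already separates the classes; this is the mechanism that penalizes redundant programs.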
Evaluating Attributes

The sets and measures above are used to determine the significance of members of P for classifying members of U into the equivalence classes of N'. One goal is to find a minimal subset of attributes that retains the capability of all P for discerning elements of N'. Another goal is to assign a significance value to each attribute. This value will be used in calculating program fitness for the genetic programming process.

The following method calculates significance factors for each program while determining a minimal set. Programs with zero significance are not included in the minimal set, M.

    Initialize: M1 = P
    For each pk in P:
        Calculate SGF(Mk, N, pk)
        If SGF(Mk, N, pk) = 0
            then Mk+1 = Mk - pk
            else Mk+1 = Mk

The minimal set is not unique. The order of selecting pk during calculation of M will affect the final contents of M. We have chosen to select the pk beginning with the previous generation and in order from lowest to highest SGF. This choice of old rules first encourages turnover in the population by allowing an older program to be replaced by an equivalent set of new programs.
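The minimal-set procedure above can be sketched as follows. This is a self-contained Python illustration, not the authors' Scheme code: the `partition`/`k` helpers restate the rough set measures from the Definitions section, and the toy matrix (in which column 1 duplicates column 0) is an assumption chosen to show a redundant program being dropped.

```python
# Minimal-set computation: drop each predictor whose deletion leaves
# k(M,N) unchanged, i.e. whose SGF is zero.

def partition(matrix, cols):
    """Group row indices into equivalence classes by their values in cols."""
    classes = {}
    for i, row in enumerate(matrix):
        classes.setdefault(tuple(row[c] for c in cols), set()).add(i)
    return list(classes.values())

def k(matrix, p_cols, n_cols):
    """Fraction of rows discernible into their N-class using p_cols."""
    n_classes = partition(matrix, n_cols)
    covered = set()
    for p_class in partition(matrix, p_cols):
        if any(p_class <= n_class for n_class in n_classes):
            covered |= p_class
    return len(covered) / len(matrix)

def minimal_set(matrix, p_cols, n_cols):
    """Scan predictors in order, removing those with SGF = 0."""
    m = list(p_cols)
    for p in list(p_cols):   # the paper orders this scan oldest/lowest-SGF first
        rest = [c for c in m if c != p]
        if rest and k(matrix, rest, n_cols) == k(matrix, m, n_cols):
            m = rest         # SGF(M,N,p) = 0: p is redundant, drop it
    return m

# Columns 0-2 are programs (column 1 duplicates column 0); column 3 is the class.
matrix = [
    [True,  True,  False, "a"],
    [True,  True,  True,  "a"],
    [False, False, True,  "b"],
    [False, False, False, "b"],
]
```

Here `minimal_set(matrix, [0, 1, 2], [3])` keeps only one column, since column 0 is made redundant by its duplicate and column 2 carries no class information; the scan order determines which of the equivalent columns survives.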
Genetic Programming Process

Our implementation of genetic programming is based on Koza (1992). Genetic programming is used to modify a population of programs which perform tests on the data. Programs in this study are constructed of boolean expressions. The terminal values of these expressions are generated by application-specific primitives. These primitives are simple functions which return true and false answers to questions about the data. For example, one primitive asks whether the ratio of petal length to petal width is greater than a randomly generated number. The following describes the generation cycle:

    Create an initial (random) population of programs.
    1. Execute and evaluate programs to determine fitness.
    2. Rank order programs according to fitness.
    3. Generate a new population of programs by applying reproduction, crossover, and mutation to the best of the old programs.
    Go to step 1.

    NOTE: This cycle terminates when a predefined number of generations have passed or when accuracy of recall and prediction has failed to improve in the previous n generations. Refer to "Testing Recall and Prediction".

Normally the result is chosen as the best program to appear in any generation. In our application the result is the final population of programs with non-zero fitness. These programs will be used in concert for recall and prediction.
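The generation cycle above can be sketched generically. Everything in this Python sketch is a placeholder assumption: the real system uses rough set SGF values for fitness and s-expression crossover/mutation for breeding, whereas here "programs" are plain numbers and `breed` is a trivial blend-plus-noise operator.

```python
import random

# Generic evolve loop: evaluate, rank, keep the best, breed replacements.

def evolve(population, fitness, breed, generations, rng):
    for _ in range(generations):
        scored = sorted(population, key=fitness, reverse=True)  # rank order
        survivors = scored[: len(scored) // 2]                  # keep the best half
        children = [breed(rng.choice(survivors), rng.choice(survivors), rng)
                    for _ in range(len(population) - len(survivors))]
        population = survivors + children
    return population

# Toy stand-in problem: fitness is closeness to the target value 10.
rng = random.Random(0)
pop = [rng.uniform(0, 20) for _ in range(20)]
fitness = lambda x: -abs(x - 10)
breed = lambda a, b, rng: (a + b) / 2 + rng.gauss(0, 0.1)  # "crossover" + "mutation"

final = evolve(pop, fitness, breed, 50, rng)
best = max(final, key=fitness)
```

Because the top-ranked individual always survives the cut, the best fitness in the population never decreases across generations.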
Survival and Reproduction

While investigating survival and reproduction methods we examined the following:

Program Survival

We have experimented with three options for survival of programs from one generation to the next.

1. Retain a fixed number of the best programs.
2. Calculate the minimum SGF for all programs and retain all programs with an SGF greater than the minimum. NOTE: Typically several programs will possess the minimum SGF and will therefore be discarded.
3. Retain M; i.e. programs with SGF > 0. Note that M is a minimal set, so it will by definition contain a diverse population.

Program Reproduction

We have experimented with two options for production of new programs from one generation to the next.

1. Replace all discarded programs, thus maintaining a constant P.
2. Supplement M by a constant number of programs, thus allowing P to increase or decrease according to the size of the minimal set M. This causes the program to adapt as the problem progresses toward solution.
Regardless of the options chosen, the general sequence of events was as follows:

1. Select survivors.
2. Determine the number of programs to be reproduced.
3. According to a preset percentage, reproduce a fraction of these via crossover.
4. Reproduce the remainder via mutation.
Crossover and Mutation

Crossover was implemented by selecting one sub-expression from each parent and swapping them to produce two new programs. Mutation was implemented by selecting a sub-expression within a single parent and replacing it with a randomly generated expression.

Program Crossover

The crossover operator first conducts a random selection of two members of the current population of programs. The selection is weighted by program fitness in the "roulette wheel" fashion. This involves assigning each program a fraction of an interval proportional to its fitness, randomly selecting an element of this interval, and choosing the program determined by this element. A crossover point is selected, again at random, within each program and the sub-expressions below these points are swapped to produce two new programs. Figure 2 shows two programs with sub-expressions selected.

Figure 2: Sub-expressions selected for crossover in parent programs

Figure 3 shows the two new programs created by the crossover operation.

Figure 3: New programs created by crossover
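Roulette-wheel selection and subtree crossover can be sketched over boolean expressions stored as nested lists, mimicking the Scheme s-expressions the paper uses. The operator names (`gt-ratio`, etc.) and expression shapes below are hypothetical.

```python
import random

def roulette_select(population, fitnesses, rng):
    """Pick one program with probability proportional to its fitness."""
    mark = rng.uniform(0, sum(fitnesses))
    acc = 0.0
    for program, fit in zip(population, fitnesses):
        acc += fit
        if mark <= acc:
            return program
    return population[-1]

def subtrees(expr, path=()):
    """Yield (path, sub-expression) for every compound node below the root."""
    if isinstance(expr, list):
        for i, child in enumerate(expr[1:], start=1):  # index 0 is the operator
            if isinstance(child, list):
                yield (path + (i,), child)
                yield from subtrees(child, path + (i,))

def replace_at(expr, path, new):
    """Return a copy of expr with the subtree at path replaced by new."""
    if not path:
        return new
    copy = list(expr)
    copy[path[0]] = replace_at(copy[path[0]], path[1:], new)
    return copy

def crossover(parent_a, parent_b, rng):
    """Swap one randomly chosen sub-expression between two parents."""
    path_a, sub_a = rng.choice(list(subtrees(parent_a)))
    path_b, sub_b = rng.choice(list(subtrees(parent_b)))
    return (replace_at(parent_a, path_a, sub_b),
            replace_at(parent_b, path_b, sub_a))

rng = random.Random(0)
a = ["and", ["gt-ratio", 3.0], ["lt-petal", 1.5]]
b = ["or", ["gt-sepal", 5.0], ["not", ["gt-ratio", 2.0]]]
child_a, child_b = crossover(a, b, rng)
```

Mutation works the same way, except the selected subtree is replaced by a freshly generated random expression instead of one taken from a second parent.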
Program Mutation

The mutation operator first makes a weighted random selection of one member of the current population of programs. A mutation point is randomly selected within the program and the sub-expression below this point is replaced with a randomly generated expression. Figure 4 shows an example of an original program and one possible result of applying the mutation operator.

Figure 4: Original and new program created by mutation
Testing Recall and Prediction

We have tested our implementation with a relatively small database of iris flowers (Fisher's iris data reproduced in Salzberg, 1990) and with a larger database of breast cancer patients (breast cancer data reproduced in Salzberg, 1990).

We tested for both recall and prediction. The data was partitioned into two disjoint sets for training and prediction testing. Recall was tested using a subset of the training data. Prediction was tested using data which was not used in training. Testing was accomplished as follows:

    Each program in the minimal set, M, is applied to the test data point. The resulting list of values is matched against the corresponding values in each row of the evaluation matrix. We use an analog of Hamming distance to select a matching row for prediction. We calculate the distance as the sum of SGF(M,N,m_i) for columns m_i that do not match the corresponding test value. The row with the minimum distance from the list of test values is selected. If several rows have the same distance measure then the first of these is arbitrarily selected.

    The field to be recalled or predicted is retrieved from the N columns of the selected row. Success is measured as the fraction of the test data that is correctly recalled or predicted.

Our testing confirmed the ability of this approach to recall stored patterns and to predict from unseen patterns.
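The SGF-weighted Hamming match described above can be sketched as follows; the matrix, the weight values, and the class labels here are illustrative assumptions, not the paper's data.

```python
# Nearest-row prediction: distance is the sum of SGF weights of the
# columns where the test values disagree with a stored row.

def predict(test_values, matrix, weights, class_col):
    """Return the class of the row minimizing the SGF-weighted mismatch."""
    best_row, best_dist = None, float("inf")
    for row in matrix:
        dist = sum(w for v, r, w in zip(test_values, row, weights) if v != r)
        if dist < best_dist:            # strict < keeps the first row on ties
            best_row, best_dist = row, dist
    return best_row[class_col]

# Columns 0-1: program outputs; column 2: species. Weights are SGF values.
matrix = [
    [True,  True,  "setosa"],
    [False, True,  "virginica"],
    [False, False, "versicolor"],
]
weights = [0.7, 0.3]

species = predict([True, False], matrix, weights, 2)
# The first row mismatches only on the low-weight column (distance 0.3),
# so "setosa" is returned.
```

Weighting mismatches by SGF means disagreement on a highly significant program costs more than disagreement on a nearly redundant one.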
IRIS DATA

Each entry includes the name of the species and values for sepal length, sepal width, petal length, and petal width. The prediction field is species.

We achieved 100% accuracy on recall and 96% accuracy for prediction. This is consistent with the 93% to 100% accuracy reported by Salzberg (1990). Our results for the iris database are illustrated below.

Figure 4 shows the k(P,N) values obtained during iris training.

[Figure 4: plot of k(P,N) versus generations (0-100) during iris training]
Figure 5 shows results of testing for pattern recall during iris training. This data was obtained by testing on a subset of the data used for training.

[Figure 5: plot of recall accuracy versus generations (0-100) during iris training]
Figure 6 shows results of testing for predictive ability during iris training. This data was obtained by testing on a subset of the data disjoint from that used for training.

[Figure 6: plot of prediction accuracy versus generations (0-100) during iris training]
Figure 7 shows changes in the size of the minimal set, M, during iris training.

[Figure 7: plot of the number of programs in the minimal set versus generations (0-100) during iris training]
CANCER DATA

Each entry includes the patient age, whether they had gone through menopause, which breast had the cancer, whether radiation was used, whether the patient suffered a recurrence during the five years following surgery, the tumor diameter, and four other measures of the tumor itself. The prediction field is recurrence.

We achieved 100% accuracy on recall and 79% accuracy for prediction. Again this is comparable to the 78% accuracy reported by Salzberg (1990). These results are illustrated below.

Figure 8 shows the k(P,N) values obtained during cancer training.

[Figure 8: plot of k(P,N) versus generations (0-40) during cancer training]
Figure 9 shows results of testing for pattern recall during cancer training.

[Figure 9: plot of recall accuracy versus generations (0-40) during cancer training]
Figure 10 shows results of testing for predictive ability during cancer training.

[Figure 10: plot of prediction accuracy versus generations (0-40) during cancer training]
Figure 11 shows changes in the size of the minimal set, M, during cancer training.

[Figure 11: plot of the number of programs in the minimal set versus generations (0-40) during cancer training]
Conclusions

Our test results suggest there may be two distinct aspects to learning in this approach. The first is the development of a population of programs sufficient to recall members of the training set. We see this in Figures 4 and 5 and Figures 8 and 9, where k(P,N) and the recall accuracy proceed to the maximum value of 1.0. At this point we might expect learning to stop. However, in a number of trials we found that prediction accuracy continued to improve, albeit sometimes erratically.

Looking at Figure 6 we see that prediction accuracy continued to increase for some time after k(P,N) reached its maximum. We suggest an explanation may be found in the size of the minimal set shown in Figure 7. At about the same time as prediction accuracy began its rise to a final maximum of 96%, the size of the minimal set began to decrease from a high of 10 programs to a range of 6-8 programs. This is not inconsistent with other machine-learning techniques in which smaller representations tend to have a greater ability to generalize.

We believe the approach described here will have important advantages for some difficult applications of KDD. The biggest drawback to this approach is the computational cost. Our implementation performed well on the iris and cancer databases, but the computational burden was a significant problem in preliminary tests on the much larger task of discovering regularities between the primary and secondary structure of proteins.

We have experimented with reducing the effective size of the data set through random sampling. This made the evaluation process much faster, but accuracy tended to stop improving and begin fluctuating about slightly lower values.
Future directions

There are several aspects of this approach that warrant further research. The most important of these are related to improving computational efficiency. It would be useful to explore the relationships between the complexity of individual programs, the size of the program population, and the time to solution.

Reducing the search space for the genetic programming process is a worthwhile goal. The use of more intelligent methods for creating and modifying primitive functions is one possibility. Another possibility would be to reduce the dimensionality of the data using application-specific knowledge for preprocessing. Alternatives to strict boolean expressions should also be explored.

Stopping criteria would be useful for less goal-oriented applications of KDD such as database mining. For example, in a blind search for regularities, the data attributes might be partitioned into different N (predicted) and P (predictor) sets. The methods presented here would be applied for each partition in a search for interdependencies.
REFERENCES

Koza, John R. Genetic Programming: On the Programming of Computers by Means of Natural Selection. 1992, Cambridge, MA, MIT Press.

Orlowska, Ewa and Pawlak, Zdzislaw. "Expressive power of knowledge representation systems," 1984, Int. J. Man-Machine Studies, vol. 20, pp. 485-500, London, Academic Press Inc.

Park, Jack. "On an Approach to Index Discovery." 1993, submitted to KDD 1993.

Park, Jack, and Dan Wood. User's Manual--The Scholar's Companion. 1993, Brownsville, CA, ThinkAlong Software.

Pawlak, Zdzislaw. "Rough Classification," 1984, Int. J. Man-Machine Studies, vol. 20, pp. 469-483, London, Academic Press Inc.

Pawlak, Zdzislaw. Rough Sets: Theoretical Aspects of Reasoning about Data, 1991, Boston, Kluwer Academic Publishers.

Salzberg, Steven L. Learning with Nested Generalized Exemplars, 1990, Boston, Kluwer Academic Publishers.

Ziarko, Wojciech. "A Technique for Discovering and Analysis of Cause-Effect Relationships in Empirical Data," 1989, KDD Workshop 1989.