From: AAAI Technical Report WS-93-01. Compilation copyright © 1993, AAAI (www.aaai.org). All rights reserved.

Using a Genetic Algorithm to Learn Prototypes for Case Retrieval and Classification

David B. Skalak
Department of Computer Science
University of Massachusetts
Amherst, MA 01003
skalak@cs.umass.edu

Abstract*

We describe how a genetic algorithm can identify prototypical examples from a case base that can be used reliably as reference instances for nearest-neighbor classification. A case-based retrieval and classification system called Off Broadway implements this approach. Using the Fisher Iris data set as a case base, we describe an experiment showing that nearest-neighbor classification accuracy of over 95% can be achieved with a set of prototypes that constitutes less than 5% of the case base.

Introduction

One of the fundamental problems of case-based reasoning (CBR) is to ensure that exhaustive examination of every case in case memory need not be performed to retrieve the most similar or the most relevant cases. A variety of indexing and other strategies have been brought to bear successfully on this problem.

In particular, nearest-neighbor case retrieval and classification systems have dealt with the problem of exhaustive case comparison. One class of solutions is to pre-process cases into data structures that enable fast nearest-neighbor retrieval, such as Voronoi diagrams [Preparata & Shamos, 1985] or k-d trees [Moore, 1990]. An alternative approach is to perform exhaustive comparison, but to use parallel, distributed memory hardware [Stanfill & Waltz, 1986]. For tasks that do not require that all instances be stored in case memory, a third alternative is to reduce the number of reference instances in the case base.¹

Reducing the number of instances used for nearest-neighbor retrieval has been a topic of research in the pattern recognition and instance-based learning communities for some time, where it is sometimes referred to as the "reference selection problem." Approaches to the problem have included storing misclassified instances (e.g., the Condensed Nearest Neighbor algorithm [Hart, 1968], the Reduced Nearest Neighbor algorithm [Gates, 1972], IB2 [Aha, 1990]); storing only training instances that have been correctly classified by other training instances [Wilson, 1972]; exploiting domain knowledge [Kurtzberg, 1987]; and combining these techniques [Voisin & Devijver, 1987]. (For discussions of these approaches, see generally [Aha, 1990].) For example, for numeric prediction tasks, Aha's IB2 (also called CBL2) algorithm saves only those training instances whose prediction error is above a given tolerance threshold [Aha, 1990]. Other systems deal with reference selection by storing averages or abstractions of instances.

In this paper we describe a novel approach to reference selection: apply a genetic algorithm to identify small sets of instances that may be used as references for nearest-neighbor retrieval. This paper is an interim report on on-going research in which we describe a case retrieval and classification system called Off Broadway. We present an experiment on the Fisher Iris data set [Fisher, 1936], [Murphy & Aha, 1992] showing that Off Broadway achieves classification accuracy of over 95% with fewer than eight reference instances. Related research is discussed, and proposed work and a summary end the paper.

Off Broadway

Off Broadway is a case retrieval and classification system that performs classification based on a 1-nearest-neighbor algorithm. The system attempts to learn a set of distinguished cases (prototypes) that have demonstrated classification power.
The system maintains a population of sets of potentially prototypical cases, where each prototype set is a subset of a fixed cardinality of the case base. Each set of prototypes is evaluated by its classification accuracy on a set of training cases, where the class of each training case is compared with the class of the prototype that is its nearest neighbor in a prototype set. A genetic algorithm is used to search the space of prototype sets in order to find a set with superior classification accuracy. We show that sets of prototypes can evolve whose classification performance on the Iris data set has success rates comparable to reported nearest-neighbor algorithms that use much larger subsets of the data set.

* This research was supported in part by the National Science Foundation, contract IRI-890841, and the Air Force Office of Sponsored Research, contract 90-0359.
¹ Legal argument is an example of a task for which it would be dangerous to eliminate any case from memory.

The connotation of "prototype" suggests one of a small number of distinguished instances. For example, the number of leading cases on a particular legal issue is usually very small, often just one or two cases. Our immediate focus, therefore, is on identifying a very small number of cases as prototypes, and we arbitrarily look to designate fewer than 5% of the Iris data base as classification prototypes.

Two general benefits of using nearest-neighbor classification prototypes as surrogates for a much larger case set are obvious:

• decreased time to perform classification, since classification involves comparison with only a handful of prototypical cases; and
• decreased memory requirements, since only the prototypes need be stored in short-term memory.

One specific benefit of our approach is that no expert domain knowledge is needed to identify prototypical cases. Case-based reasoning often is used in domains lacking strong, operational domain theories, and so the absence of reliance on domain knowledge is an important aspect of our approach.

Genetic Algorithms

Genetic algorithms (GAs) are a class of adaptive search techniques that have often proven effective for searching large search spaces [Goldberg, 1989]. GAs maintain a population of members, usually called "genotypes" and classically represented by binary strings, which can be mutated and combined according to a measure of their worth or "fitness," as measured by a task-dependent evaluation function. As described in [Holland, 1986], the basic execution cycle of a typical GA is straightforward (a sketch of this cycle, specialized to our task, follows the next paragraph):

1. Select pairs of population members from the population according to fitness, so that stronger members are more likely to be selected.
2. Apply genetic operators to the pairs, creating offspring. Random mutation of a population member is a typical operator. Another genetic operator, "crossover," exchanges a random segment between two members of the population.
3. The members of the population with the lowest fitness are replaced by the offspring.
4. Return to 1 unless certain termination criteria have been satisfied, such as the creation of an individual whose fitness meets some threshold, or the stabilization of the population.

In this application, the search space is the set of all subsets of a fixed cardinality of a case base. If n prototypes are selected from a case base of m cases, there are C(m, n), "m choose n," possible sets of prototypes. For our experiments using the Iris data set, m = 120 (30 cases are reserved for testing) and n ≤ 8, so the search space is still quite large.
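For concreteness, the following is a minimal Python sketch of this cycle as it applies to searching over bit-string organisms. It is illustrative only: Off Broadway itself is written in Macintosh Common Lisp, uses the stochastic-remainder selection scheme described later (simple fitness-proportional selection is substituted here), and the `fitness` argument is assumed to score an organism by its training-set classification accuracy.

```python
import random

def select_pair(population, scores):
    """Step 1: pick two parents, favoring higher-fitness members."""
    weights = [max(s, 1e-9) for s in scores]     # avoid all-zero weights
    return random.choices(population, weights=weights, k=2)

def uniform_crossover(a, b):
    """Step 2 (crossover): each bit comes from either parent.
    Uniform crossover is the variant Off Broadway uses (see below)."""
    return [x if random.random() < 0.5 else y for x, y in zip(a, b)]

def mutate(bits, p_organism=0.03):
    """Step 2 (mutation): with small per-organism probability, flip one bit."""
    bits = list(bits)
    if random.random() < p_organism:
        i = random.randrange(len(bits))
        bits[i] ^= 1
    return bits

def generation(population, fitness, p_crossover=0.7):
    """Steps 1-3: create one offspring and replace the weakest member."""
    scores = [fitness(ind) for ind in population]
    a, b = select_pair(population, scores)
    child = uniform_crossover(a, b) if random.random() < p_crossover else list(a)
    child = mutate(child)
    weakest = scores.index(min(scores))
    population[weakest] = child
    return population
```

Step 4 then amounts to calling `generation` in a loop until a termination criterion is met.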
While GAs have been used in the past for rule-based classification (e.g., [Holland, 1986], [De Jong, 1990], [De Jong & Spears, 1991]), there has been recent interest in exemplar-based classification techniques using GAs [Kelly & Davis, 1991] as well. De Jong and Spears point out that the two concept description languages in general use in machine learning are decision trees and rules. The research in this paper investigates a third description language that is characteristic of exemplar-based approaches to classification: the cases themselves. Our approach is an example of the "single representation trick" applied elsewhere in machine learning (see [Barr et al., 1981]), in which instances and instance generalizations are expressed in the same language (e.g., [Michalski & Chilausky, 1980]). However, we use this trick in reverse: here the concepts are expressed in the language of cases, rather than expressing instances in the generalization language.

We see several advantages to this exemplar-based approach to concept description. First, given a fixed case base, the set of concept descriptions is finite, and therefore possibly more tractable than description languages that admit infinitely many concept descriptions. Second, there is minimal bias in the concept representation language of the cases: the only bias is implicit in the cases that have already been exposed to the system through inclusion in the case base. We view the presence of minimal bias as an advantage supporting wider applicability of this approach, although we speculate with [De Jong & Spears, 1991] that performance may suffer on tasks that are amenable to some a priori bias in the concept description language.

Details of the Genetic Algorithm to Identify Sets of Prototype Cases

Basic Approach: In general terms, Off Broadway's GA is a generational genetic algorithm that uses uniform crossover and a stochastic remainder without replacement selection procedure. The GA maintains a population of individuals, where each individual is a set of n prototypes. Each prototype is a member of a fixed case base. Here we report on experiments in which n ranges from 3 to 8, which represent prototype sets constituting less than approximately 5% of the entire case base. The fitness of each member of the population is its classification accuracy. Our basic approach is analogous to the Pittsburgh ("Pitt") approach to optimizing rule sets [Smith, 1983] in that organisms in the population are sets of prototypes, rather than individual prototypical cases.

Encoding: Each prototype set is encoded as a binary string, which is conceptually divided into n substrings, one for each of the n prototypes, where each substring encodes an index into a case base stored as an array (Figure 1). The length of the binary string encoding a prototype set is the number of bits required to represent the largest index of a case in the case base, multiplied by the number of prototypes, i.e., ⌈log₂ m⌉ × n, where m is the number of cases in the case base.

Figure 1. Each organism in the genetic population is a binary encoding of a set of indexes into a case base array.
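To make the encoding of Figure 1 concrete, here is a minimal sketch of decoding a bit string into case-base indexes. The function and variable names are ours, and the paper does not specify how a substring that decodes to an out-of-range index is handled, so the modulus used here is an assumption.

```python
import math

def decode(bits, n_prototypes, case_base_size):
    """Decode a flat bit string into n case-base indexes (cf. Figure 1).

    The string is split into n equal substrings of ceil(log2 m) bits,
    and each substring is read as an unsigned integer index. Indexes
    beyond the case base are wrapped with a modulus (an assumption)."""
    width = math.ceil(math.log2(case_base_size))
    assert len(bits) == width * n_prototypes
    indexes = []
    for k in range(n_prototypes):
        chunk = bits[k * width:(k + 1) * width]
        value = int("".join(str(b) for b in chunk), 2)
        indexes.append(value % case_base_size)
    return indexes

# Example: m = 120 training cases and n = 8 prototypes give 7 bits
# per index, so each organism is a 56-bit string.
```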
Evaluation Function: The fitness of each set of prototypes in the population is determined by its classification accuracy, specifically, the percentage of the training set of cases that it correctly classifies. A training case is correctly classified by a prototype set if the training case's class equals the class of the prototype that is its 1-nearest neighbor in the prototype set.

The similarity function used in this nearest-neighbor computation is simple, particularly since the Iris cases are represented by numerical features. The raw values for each case feature are linearly scaled from 0 to 100, where any value less (greater) than three standard deviations from the mean of that feature's value across all cases is assigned the value 0 (100). To compute the similarity distance between two cases, we first calculate the feature-by-feature difference of the two cases with these scaled values. The unweighted arithmetic average of these differences over all the features of a case is the similarity distance between the two cases. The prototype with the smallest such similarity distance to a test case is its nearest neighbor. The ReMind case-based reasoning development shell has previously incorporated a similar approach to nearest-neighbor similarity assessment [Cognitive Systems, 1992].
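A minimal sketch of this fitness computation follows, assuming cases are (features, label) pairs. The exact endpoints of the linear scaling are our reading of the description above, the feature differences are taken as absolute differences, and all names are illustrative rather than Off Broadway's own.

```python
def scale(value, mu, sigma):
    """Linearly map a raw feature value to [0, 100]; values more than
    three standard deviations from the feature mean clip to 0 or 100."""
    lo, hi = mu - 3.0 * sigma, mu + 3.0 * sigma
    if value <= lo:
        return 0.0
    if value >= hi:
        return 100.0
    return 100.0 * (value - lo) / (hi - lo)

def distance(a, b):
    """Similarity distance: unweighted average of feature-by-feature
    absolute differences between two already-scaled cases."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)

def fitness(prototype_indexes, cases, case_base):
    """Fraction of `cases` whose class matches the class of the
    1-nearest prototype under `distance`."""
    correct = 0
    for features, label in cases:
        nearest = min(prototype_indexes,
                      key=lambda i: distance(features, case_base[i][0]))
        correct += (case_base[nearest][1] == label)
    return correct / len(cases)
```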
Initialization: The population is initialized to 20 random prototype sets.

Operators: Mutation and crossover operators effectively alter the set of prototypes by changing the index of one or more members of each prototype set. Crossover splits may occur anywhere within an organism. The probability of mutating an organism (not a single locus) was fixed at 0.03. The crossover probability was set at 0.7.

Off Broadway is implemented in Macintosh Common Lisp v2.0 using CLOS. Part of the system instantiates a shell for creating GA applications available as GNU software [Williams, 1992]. We now turn to an experiment to identify prototypes for classification.

Experiment: Classification by Off Broadway

We have performed a set of experiments to demonstrate the identification of prototypes by GA using the UCI data set of 150 Iris types [Murphy & Aha, 1992]. Each Iris case has four rational-number features (sepal length, sepal width, petal length, petal width) and is assigned one of three classifications (Iris-setosa, Iris-versicolor, Iris-virginica). The database contains 50 cases of each classification. An example of an Iris case is (5.1 3.5 1.4 0.2 Iris-setosa).

Our experimental methodology was to perform 5-fold cross validation of the classification performance of Off Broadway. The case base was randomly partitioned into 5 groups of 30 cases. For each validation, we used one of the partitions of 30 random cases as testing instances and reserved the remaining 120 instances for training. Prototypes were selected only from the training sets and were used to classify the 30 test instances. We tested the classification accuracy of sets of n prototypes determined by Off Broadway, where n varied from 3 to 8. Our lower limit was 3 prototypes in these experiments, since the Iris data set contains three classes of Iris types.

For each partition, the GA was run until it performed 100 evaluations of the training set, approximately 6 generations on average, taking approximately 10 minutes on average on an Apple Macintosh II.

Iris cases were stored in a case array in the order in which they appear in the UCI Iris database file, where they are grouped by class. Recall from the description of the algorithm that the order of the stored cases may be relevant because the members of the GA population encode indexes into this case array.

We were primarily interested in the classification accuracy of the best performing member of a population. Average best performance for the learned prototype systems was computed using the following procedure (a sketch of this protocol appears at the end of this section):

a. With the number of prototypes in each prototype set fixed, we divided the case base into five random partitions, each consisting of a 120-element training set and a 30-element test set, as described above.

b. For each of the five partitions of the case base, we determined the maximum percentage of the 30 test cases correctly classified by an individual in the population of 20 prototype sets after 100 evaluations of the population.

c. Finally, we computed the average of the resulting 5 maxima. This average is reported in the second column of Table 1, labeled "GA Prototypes Ave. Max. Correct".

Prototypes (n)    GA Prototypes Ave. Max. Correct (%)
3                 28.6 (95.3%)
4                 29.2 (97.3%)
5                 28.6 (95.3%)
6                 29.0 (96.7%)
7                 29.2 (97.3%)
8                 28.8 (96.0%)

Table 1. Average maximum performance for learned prototype sets, using test sets of 30 instances.

The basic result, which is reported in Table 1, is that the small number of prototypes identified by Off Broadway performed with greater than 95% classification accuracy in the 5-fold cross validation experiments using the Iris data set. The data in Table 1, which show the average best performance of a member of the final population, reflect how the algorithm would probably be used in practice: a best performing member of the population would be identified and used as a surrogate for the entire case base.

These results are comparable to or better than a number of reported nearest-neighbor methods. [Weiss & Kapouleas, 1989], for example, report a correctness of 96% for nearest-neighbor retrieval on the Iris data set, using a leaving-one-out evaluation methodology. Our approach performed significantly better than the 76.6% correct given in [Fertig & Gelernter, 1991], which in turn performed slightly better than the "neighborhood census rule" described in [Dasarathy, 1980]. In one run, [Gates, 1972] reported 100% classification accuracies using the condensed nearest neighbor rule and the reduced nearest neighbor rule, which used 20 and 15 reference instances, respectively. It is unclear whether cross-validation or some other experimental test was performed to confirm this result, however. By pre-processing the raw Iris data and using a 1-nearest-neighbor algorithm, Gates also reported classification accuracies from 93.3% to 96.7% using from 15 to 31 reference instances. Finally, using from 3 to 6 "pathological" reference instances, Gates found classification accuracy varied from approximately 17% to 40%. See [Gates, 1972] for descriptions of these and related experiments.
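The protocol of steps (a)-(c) amounts to the following harness, sketched here under the same assumptions as the earlier fragments: `ga_run` and its signature are placeholders, not Off Broadway's actual interface, and `fitness` is the evaluation function sketched earlier, here applied to the held-out cases.

```python
import random

def average_max_correct(cases, n_prototypes, ga_run, fitness, folds=5):
    """Steps (a)-(c): k-fold cross validation reporting the average,
    over folds, of the best held-out accuracy in each final population.

    `cases` is a list of (features, label) pairs; `ga_run(train, n)` is
    assumed to return the final population of prototype-index sets after
    100 evaluations; `fitness(p, eval_cases, case_base)` is as above."""
    shuffled = random.sample(cases, len(cases))
    fold = len(cases) // folds                      # 150 / 5 = 30
    maxima = []
    for f in range(folds):
        test = shuffled[f * fold:(f + 1) * fold]    # 30 held-out cases
        train = shuffled[:f * fold] + shuffled[(f + 1) * fold:]  # 120 cases
        population = ga_run(train, n_prototypes)
        maxima.append(max(fitness(p, test, train) for p in population))
    return sum(maxima) / folds                      # average of 5 maxima
```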
Related Research

GA classification systems have been created by [Kelly & Davis, 1991] to learn real-valued weights for features in a data set and by [De Jong & Spears, 1991] to learn conceptual classification rules. Our approach is similar to De Jong and Spears's in that they used a Pitt approach to defining their population, and applied the same straightforward fitness function for each individual in their population: the percentage correct on a set of training examples. However, an important distinction from this work is that De Jong and Spears used classification rule sets as population individuals, rather than prototypical cases.

Protos [Bareiss, 1989] is a good example of a case-based reasoning system that relies on case prototypes for classification. An exemplar that is successfully matched to a problem has its prototypicality rating increased. The prototypicality rating is used to determine the order in which exemplars are selected for further, knowledge-based pattern matching. Protos is an intelligent classification assistant, and the prototypicality rating may be increased based in part on the actions of the supervising teacher. The ReMind case-based reasoning development shell [Cognitive Systems, 1992] also incorporates a facility for the user to select prototypes to further index a case base.

Approaches to reducing the number of reference instances for nearest-neighbor retrieval were discussed generally in the introduction.

This line of research is an immediate outgrowth of two projects from our group. In [Skalak & Rissland, 1990] we suggested and evaluated a case-based approach to the problem of reducing the number of training instances needed to create a decision tree using ID5 [Utgoff, 1988]. Recently we described a system called Broadway [Skalak, 1992] that represents cases as blackboard knowledge sources whose preconditions invoke local similarity functions that apply only within a closed neighborhood of each case in the space of cases. The connection between the current project, Off Broadway, and Broadway is that the set of prototypes learned by Off Broadway tessellates the case space into local neighborhoods (as for any nearest-neighbor algorithm), where each neighborhood contains the volume of cases in case space for which a particular prototype is the nearest neighbor.

Limitations of Off Broadway

The current implementation is limited by the assumption that the number of prototypes desired is fixed a priori, in advance by the user, and is supplied as a parameter to the system. However, since GAs using the Pitt approach have explored variable-length rule sets, we believe that a fairly minor re-implementation of the system would dispense with this assumption.

One important limitation of the research reported here lies with the simplicity of the Iris data set, and we plan to test the approach on more complex data sets. In the Iris data set, one class is linearly separable from the other two, but the other two are not separable from each other. We have begun to gather evidence that shows, in fact, that sets of prototypical instances may be relatively easy to find in the Iris domain. Other data sets may also include elements that are more clearly prototypical, as that term is usually used. For example, the LED-7 display domain for light-emitting diodes [Murphy & Aha, 1992] contains obvious prototypical elements: the correct numerals themselves.²

Future Work and Summary

A next step in the Off Broadway system is to learn what similarity metric applies in each region in which a specific prototype is the nearest neighbor. We are investigating an algorithm that uses a version of the absolute error correction rule [Nilsson, 1990] together with a state preference method [Utgoff & Clouse, 1991] to learn a similarity function for each such region.

In summary, Off Broadway attempts to learn sets of case classification prototypes using a genetic algorithm. We have applied Off Broadway to classify elements of the Iris database and have shown that classification performance better than or equal to that achieved using much larger reference sets can be achieved using fewer than 8 Iris cases as classification prototypes.

² "The problem of reading printed characters is a clear-cut instance of a situation in which the classification is based ultimately on a fixed set of 'prototypes'." [Minsky, 1965, p. 21].

Acknowledgments

This work has benefited from discussions with Edwina L. Rissland, Jeff Clouse, Claire Cardie and Timur Friedman.

References

Aha, D. W. (1990). A Study of Instance-Based Algorithms for Supervised Learning Tasks: Mathematical, Empirical, and Psychological Evaluations. Ph.D. Thesis, available as Technical Report 90-42, Dept. of Information and Computer Science, University of California, Irvine, CA.
Bareiss, E. R. (1989). Exemplar-Based Knowledge Acquisition. Boston, MA: Academic Press.

Barr, A., Feigenbaum, E. A. & Cohen, P. (1981). The Handbook of Artificial Intelligence. Reading, MA: Addison-Wesley.

Cognitive Systems, Inc. (1992). ReMind: Case-based Reasoning Development Shell. New Haven, CT.

Dasarathy, B. V. (1980). Nosing Around the Neighborhood: A New System Structure and Classification Rule for Recognition in Partially Exposed Environments. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2(1), 67-71.

De Jong, K. (1990). Genetic-Algorithm-Based Learning. In Y. Kodratoff & R. Michalski (Eds.), Machine Learning (pp. 611-638). San Mateo, CA: Morgan Kaufmann.

De Jong, K. A. & Spears, W. M. (1991). Learning Concept Classification Rules Using Genetic Algorithms. Proceedings, 12th International Joint Conference on Artificial Intelligence, 651-656. Sydney, Australia: International Joint Conferences on Artificial Intelligence.

Fertig, S. & Gelernter, D. H. (1991). FGP: A Virtual Machine for Acquiring Knowledge from Cases. Proceedings, 12th International Joint Conference on Artificial Intelligence, 796-802. Sydney, Australia: International Joint Conferences on Artificial Intelligence.

Fisher, R. A. (1936). The use of multiple measurements in taxonomic problems. Annals of Eugenics, 7, 179-188.

Gates, G. W. (1972). The Reduced Nearest Neighbor Rule. IEEE Transactions on Information Theory (May), 431-433.

Goldberg, D. E. (1989). Genetic Algorithms in Search, Optimization, and Machine Learning. Reading, MA: Addison-Wesley.

Hart, P. E. (1968). The Condensed Nearest Neighbor Rule. IEEE Transactions on Information Theory (Corresp.), IT-14 (May), 515-516.

Holland, J. H. (1986). Escaping brittleness: The possibilities of general purpose learning algorithms applied to parallel rule-based systems. In Machine Learning. San Mateo, CA: Morgan Kaufmann.

Kelly, J. D., Jr. & Davis, L. (1991). A Hybrid Genetic Algorithm for Classification. Proceedings, 12th International Joint Conference on Artificial Intelligence, 645-650. Sydney, Australia: International Joint Conferences on Artificial Intelligence.

Kurtzberg, J. M. (1987). Feature analysis for symbol recognition by elastic matching. International Business Machines Journal of Research and Development, 31, 91-95.

Michalski, R. S. & Chilausky, R. L. (1980). Learning by being told and learning from examples: An experimental comparison of the two methods of knowledge acquisition in the context of developing an expert system for soybean disease diagnosis. International Journal of Policy Analysis and Information Systems, 4, 125-161.

Minsky, M. (1965). Steps Toward Artificial Intelligence. In R. D. Luce, R. R. Bush & E. Galanter (Eds.), Readings in Mathematical Psychology, vol. II, 18-40. New York: John Wiley.

Moore, A. W. (1990). Acquisition of Dynamic Control Knowledge for a Robot Manipulator. Proceedings, Seventh International Conference on Machine Learning, 244-252. Austin, TX: Morgan Kaufmann.

Murphy, P. M. & Aha, D. W. (1992). UCI repository of machine learning databases. Irvine, CA: University of California, Dept. of Information and Computer Science.

Nilsson, N. J. (1990). The Mathematical Foundations of Learning Machines. San Mateo, CA: Morgan Kaufmann.

Preparata, F. P. & Shamos, M. I. (1985). Computational Geometry: An Introduction. New York: Springer-Verlag.
Skalak, D. B. (1992). Representing Cases as Knowledge Sources that Apply Local Similarity Metrics. Proceedings, The 14th Annual Conference of the Cognitive Science Society, 325-330. Bloomington, IN: Lawrence Erlbaum.

Skalak, D. B. & Rissland, E. L. (1990). Inductive Learning in a Mixed Paradigm Setting. Proceedings of AAAI-90, Eighth National Conference on Artificial Intelligence, 840-847. Boston, MA: American Association for Artificial Intelligence.

Smith, S. F. (1983). Flexible Learning of Problem Solving Heuristics Through Adaptive Search. Proceedings, 8th International Joint Conference on Artificial Intelligence, 422-425. Karlsruhe, West Germany: International Joint Conferences on Artificial Intelligence.

Stanfill, C. & Waltz, D. (1986). Toward Memory-Based Reasoning. Communications of the ACM, 29(12), 1213-1228.

Utgoff, P. E. (1988). ID5: An Incremental ID3. Proceedings of the Fifth International Conference on Machine Learning. Ann Arbor, MI.

Utgoff, P. E. & Clouse, J. A. (1991). Two Kinds of Training Information for Evaluation Function Learning. Proceedings, AAAI-91, 596-600. AAAI Press/MIT Press.

Voisin, J. & Devijver, P. A. (1987). An application of the Multiedit-Condensing technique to the reference selection problem in a print recognition system. Pattern Recognition, 20(5), 465-474.

Weiss, S. M. & Kapouleas, I. (1989). An Empirical Comparison of Pattern Recognition, Neural Nets, and Machine Learning Classification Methods. Proceedings, 11th International Joint Conference on Artificial Intelligence, 781-787. Detroit, MI: International Joint Conferences on Artificial Intelligence.

Williams, G. P. W., Jr. (1992). GECO: Genetic Evolution through Combination of Objects. Free Software Foundation.

Wilson, D. (1972). Asymptotic Properties of Nearest Neighbor Rules Using Edited Data. IEEE Transactions on Systems, Man and Cybernetics, 2, 408-421.