Geometric Comparison of Classifications and Rule Sets*

Trevor J. Monk, R. Scott Mitchell, Lloyd A. Smith and Geoffrey Holmes
Department of Computer Science, University of Waikato, Hamilton, New Zealand
tmonk@cs.waikato.ac.nz, rsm1@cs.waikato.ac.nz, las@waikato.ac.nz, geoff@waikato.ac.nz

Abstract

We present a technique for evaluating classifications by geometric comparison of rule sets. Rules are represented as objects in an n-dimensional hyperspace. The similarity of classes is computed from the overlap of the geometric class descriptions. The system produces a correlation matrix that indicates the degree of similarity between each pair of classes. The technique can be applied to classifications generated by different algorithms, with different numbers of classes and different attribute sets. Experimental results from a case study in a medical domain are included.

Keywords: Machine Learning, Classification, Rules, Geometric Comparison

* This project was funded by the New Zealand Foundation for Research in Science and Technology.

1. Introduction

Inductive learning algorithms fall into two broad categories based on the learning strategy that is used. In supervised learning (learning from examples) the system is provided with examples, each of which belongs to one of a finite number of classes. The task of the learning algorithm is to induce descriptions of each class that will correctly classify both the training examples and unseen test cases. Unsupervised learning, on the other hand, does not require pre-classified examples. An unsupervised algorithm will attempt to discover its own classes in the examples by clustering the data. This learning mechanism is often referred to as 'learning by observation and discovery.'

One of the most important criteria for evaluating a learning scheme is the quality of the class descriptions it produces. In general, many descriptions can be found that cover the examples equally well, but most perform badly on unseen cases. Techniques are therefore required for evaluating the quality of classifications, either with respect to a reference classification or simply relative to each other.

This paper presents a new method for comparing classifications, using a geometric representation of class descriptions. The similarity of classes is determined from their overlap in an n-dimensional hyperspace. The technique has a number of advantages over existing statistical or instance-based methods of comparison:

- Descriptions are produced that indicate how two classifications differ, rather than simply how much they differ.
- The algorithm bases its evaluation on descriptions of the classifications (expressed as production rules), not on the instances in the training set. This approach will tend to smooth out irregular or anomalous data that might otherwise give misleading results.
- Classifications using differing numbers or groups of attributes can be compared, provided there is some overlap between the attribute sets.
- The technique will work with any clustering scheme whose output can be expressed as a set of production rules.

In this study, the geometric comparison technique is used to evaluate the performance of a clustering algorithm (AUTOCLASS) in a medical domain. Two other techniques are also used to compare the automatically generated classifications against a clinical classification produced by experts.
1.1. Comparing Classifications and Rule Sets

Regarding the evaluation of unsupervised 'clustering' methods, Michalski & Stepp (1983) state that:

"The problem of how to judge the quality of a clustering is difficult, and there seems to be no universal answer to it. One can, however, indicate two major criteria. The first is that the descriptions formulated for clusters (classes) should be 'simple', so that it is easy to assign objects to classes and to differentiate between the classes. This criterion alone, however, could lead to trivial and arbitrary classifications. The second criterion is that class descriptions should 'fit well' the actual data. To achieve a very precise 'fit', however, a description may have to be complex. Consequently, the demands for simplicity and good fit are conflicting, and the solution is to find a balance between the two."

The CLUSTER/2 algorithm used a combined measure of cluster quality based on a number of elementary criteria, including the 'simplicity of description' and 'goodness of fit' mentioned above (Michalski & Stepp, 1983). Hanson & Bauer (1989) use an information-theoretic measure of cluster quality in their WITT system. This cohesion metric evaluates clusters in terms of their within-class and between-class similarities, using the training examples that have been assigned to each class. Other measures are based on the class descriptions themselves, in the form of decision trees, rules, or a 'concept hierarchy' as used by UNIMEM (Lebowitz, 1987) or COBWEB (Fisher, 1987). Some clustering systems produce class descriptions as part of their normal operation, while others, such as AUTOCLASS (Cheeseman et al., 1988), merely assign examples to classes.
In this case, a supervised learning algorithm such as C4.5 (Quinlan, 1992) can be used to induce descriptions for the classification. This is the method used in this study for evaluating AUTOCLASS clusterings.

Mingers (1989) uses the two criteria of size and accuracy for evaluating a decision tree (or an equivalent set of rules). Following the principle of Occam's Razor, it is generally accepted that the fewer terms in a model the better; therefore, in general, a small tree or rule set will perform better on test data. Accuracy is a measure of the predictive ability of the class description when classifying unseen test cases. It is usually measured by the error rate: the proportion of incorrect predictions on the test set (Mingers, 1989). Accuracy is often used to measure classification quality, but it is known to have several defects (Mingers, 1989; Kononenko & Bratko, 1991). An information-based measure of classifier performance developed by Kononenko & Bratko (1991) eliminates these problems and provides a more useful measure of quality in a variety of domains.

The remainder of this paper is organised as follows. The next section briefly describes WEKA, a machine learning workbench currently under development at the University of Waikato, including an overview of the AUTOCLASS and C4.5 algorithms used in our experiment. Section 3 describes our experimental methodology and the algorithm used by the geometric rule set comparison system. Results of the experiment, using a diabetes data set, are presented in Section 4. These results are discussed in Section 5, including some analysis of the performance of the geometric comparison algorithm. Section 6 contains some concluding remarks and ideas for further research in this area.

2. The WEKA Workbench

WEKA, the Waikato Environment for Knowledge Analysis, is a machine learning workbench currently under development at the University of Waikato (McQueen et al., 1994). (The name is taken from the weka, a small, inquisitive native New Zealand bird related to the well-known kiwi.) The purpose of the workbench is to give users access to many machine learning algorithms, and to allow these to be applied to real-world data. WEKA provides a uniform interactive interface to a variety of tools, including machine learning schemes, data manipulation programs, and the LISP-STAT statistics and graphics package (Tierney, 1990). The WEKA interface is implemented using the TK X-Window widget set under the TCL scripting language (Ousterhout, 1993); ARFF filters and data manipulation programs are written in C. WEKA runs under UNIX on Sun workstations. Figure 1 shows an example display presented by the workbench.

[Figure 1. The WEKA Workbench: a screenshot of the interface with a diabetes relation loaded and buttons for the available tools (C4.5, spreadsheet and data viewers, histograms, statistics).]

Data sets to be manipulated by the workbench use the ARFF (Attribute-Relation File Format) intermediate file format. An ARFF file records information about a relation, such as its name, its attribute names, types and values, and its instances (examples).
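The paper describes the contents of an ARFF file but does not reproduce one. The fragment below is a sketch of what such a file might look like for the diabetes relation; it follows the attribute/data layout that later WEKA documentation standardises, and the attribute names and the data rows here are invented for illustration (the 1994 format may have differed in detail).

```
@relation diabetes
@attribute relative_weight  real
@attribute glucose          real
@attribute insulin          real
@attribute sspg             real
@attribute cclass           {normal, chemical, overt}
@data
0.81, 80, 356, 124, normal
1.20, 356, 124, 55, chemical
```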
2.1. AutoClass

AUTOCLASS is an unsupervised induction algorithm that automatically discovers classes in a database using a Bayesian statistical technique. The Bayesian approach has several advantages over other methods (Cheeseman et al., 1988). The number of classes is determined automatically; examples are assigned to each class with a probability, rather than absolutely to a single class; all attributes are potentially significant to the classification; and the example data can be real or discrete.

An AUTOCLASS run proceeds entirely without supervision from the user. The program continuously generates classifications until a user-specified time has elapsed, and the best classification found is saved at this point. A variety of reports can be produced from saved classifications. A WEKA filter has been written that extracts the most probable class for each instance, and outputs this information in a form suitable for inclusion in an ARFF file. This allows the AUTOCLASS classification to be used as input to other programs, such as rule and decision tree inducers.

2.2. C4.5

C4.5 (Quinlan, 1992) is a powerful tool for inducing decision trees and production rules from a set of examples. Much of C4.5 is derived from Quinlan's earlier induction system, ID3 (Quinlan, 1986). The basic ID3 algorithm has been extensively described, tested and modified since its invention (Mingers, 1989; Utgoff, 1989) and will not be discussed in detail here. However, C4.5 adds a number of enhancements to ID3 which are worth examining.

C4.5 uses a new 'gain ratio' criterion to determine how to split the examples at each node of the decision tree. This removes ID3's strong bias towards tests with many outcomes (Quinlan, 1992). Additionally, C4.5 allows splits to be made on the values of continuous (real and integer) attributes as well as enumerations.

Decision trees induced by ID3 are often very complex, with a tendency to 'over-fit' the data (Quinlan, 1992). C4.5 provides a solution to this problem in the form of pruned decision trees or production rules. These are derived from the original decision tree, and lead to structures that generally cover the training set less thoroughly but perform better on unseen cases. Pruned trees and rules are roughly equivalent in terms of their classification accuracy; the advantage of a rule representation is that it is more comprehensible to people than a decision tree (Cendrowska, 1987; Quinlan, 1992).

The first stage in rule generation is to turn the initial decision tree 'inside-out' and generate a rule corresponding to each leaf. The resulting rules are then generalised by removing conditions that do not contribute to the accuracy of the classification. A side-effect of this process is that the rules are no longer exhaustive or mutually exclusive (Quinlan, 1992). C4.5 copes with this by using 'ripple-down rules' (Compton et al., 1992). The rules are ordered, and any example is classified by the first rule that covers it. In fact, only the classes are ranked, with the twin advantages that the final rule set is more intelligible and the order of rules within each class becomes irrelevant (Quinlan, 1992). C4.5 also defines a default class that is used to classify examples not covered by any of the rules.
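To make the ripple-down evaluation order concrete, here is a minimal sketch (ours, not C4.5's implementation) in which ordered rules are tried until one covers the instance, with the default class as a fallback. The rules and the CHEM default shown are hypothetical, loosely styled on the diabetes examples used later in the paper.

```python
def classify(instance, ordered_rules, default_class):
    """Ripple-down classification: the first rule that covers the instance fires."""
    for conditions, label in ordered_rules:
        if all(test(instance[attr]) for attr, test in conditions):
            return label
    return default_class  # no rule covers the instance

# Hypothetical ordered rules in the style of the paper's diabetes domain.
rules = [
    ([("Glucose", lambda v: v > 741)], "OVERT"),
    ([("Glucose", lambda v: v < 418), ("SSPG", lambda v: v < 145)], "NORM"),
]
print(classify({"Glucose": 900, "SSPG": 120}, rules, "CHEM"))  # -> OVERT
```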
3. Methodology

3.1. The Data Set

Reaven and Miller (1979) examined the relationship between chemical and overt diabetes in 145 non-obese subjects. Their data set involves six attributes: patient age, patient relative weight, fasting plasma Glucose, Glucose, Insulin, and steady state plasma Glucose (SSPG). The data set also involves two classifications, labelled CClass and EClass, presumably representing 'clinical classification' and 'electronic classification'. Each classification describes three classes: Overt diabetics, those requiring Insulin injections; Chemical diabetics, whose condition may be controlled by diet; and a Normal group, those without any form of diabetes. The same data set is used in the present study, with the omission of patient age; Reaven and Miller found the three attributes Glucose, Insulin and SSPG to be more significant than any of the others.

Both the clinical and electronic classifications assume the presence of three classes. The scatterplot matrix in Figure 2 shows the data set and its clinical classification. Each of the six small plots shows a different pair of variables (the plots at the lower right are simply reflections of those at the upper left). For example, the plot in the upper left corner shows Glucose vs. SSPG. The clinical classification appears to be highly related to the Glucose measurement: in fact, the three classes appear to be divided by the Glucose measurement alone. This is particularly well illustrated in the upper left plot of Glucose vs. SSPG.

[Figure 2. Scatterplot matrix of diabetes data (Glucose, Insulin, SSPG). Solid dots denote the Normal class, circles the Chemical class, and crosses the Overt class.]

The electronic classification (EClass) was generated using a clustering algorithm by Friedman and Rubin (1967). Each class is described by the mean of the three variables of the instances in that class (Glucose, Insulin and SSPG), giving a single point in 3-space. Each instance is assigned to the class whose mean lies closest to it (in terms of the minimum Euclidean distance). After each instance has been assigned to a class, the class means are recalculated. This process of assigning instances to classes and recalculating class means repeats until no patients are reassigned. The algorithm assumes prior knowledge of the class means and the number of classes, to define the initial classes; Reaven and Miller used the results of a previous study (Reaven et al., 1976) to determine suitable starting means.
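As a concrete rendering of this reassignment procedure, the sketch below implements the loop as the paper summarises it (essentially k-means with analyst-supplied starting means); it is our illustration, not Friedman and Rubin's program.

```python
def cluster(points, means):
    """points: list of attribute vectors; means: initial class means (one per class)."""
    assignment = [None] * len(points)
    changed = True
    while changed:  # repeat until no instance is reassigned
        changed = False
        for i, p in enumerate(points):
            # assign the instance to the class whose mean is nearest (Euclidean)
            nearest = min(range(len(means)),
                          key=lambda k: sum((a - b) ** 2 for a, b in zip(p, means[k])))
            if nearest != assignment[i]:
                assignment[i], changed = nearest, True
        # recalculate each class mean from its current members
        for k in range(len(means)):
            members = [p for p, c in zip(points, assignment) if c == k]
            if members:
                means[k] = [sum(vals) / len(members) for vals in zip(*members)]
    return assignment, means
```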
3.3. Classification using AutoClass

One objective of this study was to determine if the patients in the data set naturally fall into groups, without prior knowledge of the number of groups or the attributes of the groups. We also wished to determine the effectiveness of AUTOCLASS's classification of the diabetes data, and to compare it to the clinical classification.

It is difficult to determine which attributes from the data set are reasonable ones from which to generate a classification. Some may be irrelevant, and a classification generated using them may produce insignificant classes. Reaven and Miller considered only Glucose, Insulin and SSPG to be relevant attributes. They found that fasting plasma Glucose exhibited a high degree of linear association with Glucose (r = 0.96), indicating that these two variables are essentially equivalent. Four classifications were made in the present study using AUTOCLASS, each with a different selection of attributes, as shown in Table 1. Since AUTOCLASS completes its classification after a specified time has elapsed, an arbitrary execution time of one hour was chosen. Classes did not appear to change significantly with longer execution times; however, a comprehensive study of the effects of differing execution time has not been performed.

Classification   Attributes
AClass           Glucose, Insulin, SSPG
AClass1          Relative weight, Fasting plasma Glucose, Glucose, Insulin, SSPG
AClass2          Glucose, SSPG
AClass3          Glucose, Insulin

Table 1. AUTOCLASS classifications

Although different in detail, all the classifications divide the data set into classes that bear an obvious similarity to the clinical classification. Thus we were able to assign the names 'Normal', 'Chemical' and 'Overt' to the generated classes. This is unlikely to be easy to achieve in general. C4.5 was used to generate rule sets describing all six classifications. The attributes used to induce the rule sets were in all cases the same as those used to generate the initial classification. Cross-validation checks were performed on the rule sets to derive a reliable estimate of their accuracy. The predicted error rate of each of the rule sets on unseen cases is shown in Table 2.

Classification   Error Rate
CClass           7.2%
EClass           5.6%
AClass           5.6%
AClass1          7.2%
AClass2          3.2%
AClass3          4.8%

Table 2. Predicted error rate of rule sets

Three techniques were used to compare the generated classifications with the clinical classification: classification differences by instance, classification differences by comparison of means, and classification differences by comparison of rules.

3.3.1. Comparing individual instances

The automatic classification of each instance was compared to the clinical classification, and the differences were tallied. A difference from the clinical classification in the classification of an instance is assumed to be a misclassification; the percentage misclassification gives some indication of the 'goodness' of a classification. Some misclassifications are more important than others, however, and a single error statistic does not illustrate this. A large number of Normal patients misclassified as Chemical diabetics may not be important, since they are in no danger of dying from this classification error; if Overt patients are misclassified as Normal, however, then death could result. The Friedman and Rubin automatic classification (EClass) has 20 misclassifications, giving an error statistic of 13.8%. This was deemed acceptable by Reaven and Miller.

3.3.2. Class comparison using two-sample comparison of means

Each class may be described by the means of the attributes of the instances it contains. This is similar to the generation of a classification using Friedman and Rubin's clustering algorithm, where the means characterize the classes. Each class is described by the means and standard deviations of the three main attributes: Glucose, Insulin, and SSPG. For example, the Glucose mean for the Normal class of the electronic classification (EClass) will be compared with the Glucose mean for the Normal class of the clinical classification (CClass).

We used a two-way t-test (Nie et al., 1975) to compare the class means. The null hypothesis was that there is no difference between the two means. The level of similarity between two classifications is given by the number of rejected t-tests. All tests were performed at the 95% level of significance. For example, the SSPG mean of the Chemical class for EClass is significantly different from the SSPG mean of the Chemical class for CClass. Assuming that the means of the classes in one classification must be the same as the means of the classes in another classification for the two classes to be considered equivalent, we therefore cannot say that the Chemical class for CClass is equivalent to the Chemical class generated by Friedman and Rubin's classification algorithm (EClass).
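The paper ran these tests in SPSS; the sketch below approximates the procedure with an ordinary two-sample t-test per class and attribute (the exact SPSS variant used may differ), counting the rejections that Table 4 later tallies.

```python
from scipy.stats import ttest_ind

def rejected_tests(cclass_groups, other_groups, attributes, alpha=0.05):
    """Return (class, attribute) pairs whose means differ significantly.
    Each *_groups argument maps a class name to its list of instances (dicts)."""
    rejections = []
    for cls, cclass_instances in cclass_groups.items():
        for attr in attributes:
            a = [inst[attr] for inst in cclass_instances]
            b = [inst[attr] for inst in other_groups[cls]]
            _, p = ttest_ind(a, b)        # two-sample t-test on this attribute
            if p < alpha:                 # null hypothesis (equal means) rejected
                rejections.append((cls, attr))
    return rejections
```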
3.3.3. Comparing rules for classification comparison

Neither technique described above allows us to accurately compare different classifications. Comparing classifications by examining misclassified instances is dependent on the two individual sets of data being compared, and examining misclassifications may not provide an accurate estimate of the classification error rate for unseen data.

Assuming that a rule set accurately describes the classification of a set of data, then by comparing two sets of rules we are effectively comparing the two classifications. The technique presented in this paper for comparing rules produces a new set of rules describing the differences between the two classifications. An analysis of this kind allows machine learning researchers to ask the question "Why are these two classifications different?" Previously it has only been possible to ask "How different are these two classifications?"

3.4. Multidimensional Geometric Rule Comparison

This technique represents each rule as a geometric object in n-space. A rule set produced using C4.5 is represented as a set of such objects. As a geometric object, a rule forms a boundary within which all instances are classified according to that rule. The domain coverage of a set of rules is the proportion of the entire domain which they cover. Domain coverage is calculated by determining the hypervolume (n-dimensional volume) of a set of rules as a proportion of the hypervolume of the entire domain. The 'ripple down' rules of C4.5 must be made mutually exclusive to ensure that no part of the domain is counted more than once. The size of the overlap between two sets of rules provides an indication of their similarity. The non-overlapping portions of the two rule sets are converted into a set of rules describing the differences between the two rule sets.

3.4.1. Geometric Representation of Rules

A production rule can be considered to delimit a region of an n-dimensional space, where n is the 'dimension' of the rule: the number of distinct attributes used in the terms on the left hand side of the rule. The 'volume' of a rule is then simply the volume of the region it encloses. Any instance lying inside this region will be classified according to the right hand side of the rule. There is a problem, however, with rules that specify only a single bound for some attributes. For example, the two-dimensional rule

Glucose < 418 ∧ SSPG < 145 ⇒ NORM

does not give lower bounds for either of the two attributes. The volume of this rule is effectively infinite, making it very hard to compare against anything else. Unfortunately, most of the rules induced from our classifications take this form, as C4.5 works by splitting attributes rather than explicitly defining regions of space. In this study, our solution to this problem has been to define absolute upper and lower limits for each attribute. Rules that do not specify a particular bound are then assumed to use the appropriate absolute limit. We have used the maximum and minimum values in the data set for each attribute as our absolute limits. The limits for the Glucose attribute are zero and 1600; for SSPG they are zero and 500. The rule above can then be re-written as:

Glucose > 0 ∧ Glucose < 418 ∧ SSPG > 0 ∧ SSPG < 145 ⇒ NORM

The geometric representation of this rule is shown in Figure 3, in which the solid dots represent the Normal examples from the CClass classification.

[Figure 3. Geometric representation of a rule: the rule above drawn as a rectangle in the Glucose-SSPG plane, with the CClass Normal examples as solid dots.]

An entire set of rules describing a data set can be represented as a collection of geometric objects of dimension m, where m is the maximum dimension of any rule in the set. Any rule with dimension less than the maximum in the set is promoted to the maximum dimension. For example, if another rule, the one-dimensional

Glucose > 741 ⇒ OVERT

were added to the first, its dimension would have to be increased to two, the current maximum in the set. This is because two squares may be easily compared, but a line and a square, or a square and a cube, may not. Promotion to a higher dimension is achieved by adding another attribute to a rule, but not restricting the range of that attribute. The rule Glucose > 741 ⇒ OVERT in one dimension is equivalent to

Glucose > 741 ∧ SSPG > 0 ∧ SSPG < 500 ⇒ OVERT

in two dimensions. The range of SSPG is not a factor in determining if a patient is Overt, since no patient lies outside the range of SSPG specified in this rule.
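The representation is easy to state in code. Below is a minimal sketch (ours, using the attribute limits quoted above) in which a rule is an axis-aligned box mapping each attribute to an interval; filling unspecified bounds from the absolute limits also performs the promotion to the full dimension.

```python
# Absolute attribute limits from the paper (data set maxima and minima).
LIMITS = {"Glucose": (0.0, 1600.0), "SSPG": (0.0, 500.0)}

def make_rule(bounds, label, limits=LIMITS):
    """Build a rule as a box over *all* attributes: missing bounds are filled
    in from the absolute limits, which promotes the rule to full dimension."""
    box = {}
    for attr, (abs_lo, abs_hi) in limits.items():
        lo, hi = bounds.get(attr, (abs_lo, abs_hi))
        box[attr] = (max(lo, abs_lo), min(hi, abs_hi))
    return box, label

def volume(box):
    """Hypervolume of the region a rule encloses."""
    v = 1.0
    for lo, hi in box.values():
        v *= max(0.0, hi - lo)
    return v

# Glucose < 418 AND SSPG < 145 => NORM becomes a finite rectangle...
norm_box, _ = make_rule({"Glucose": (0.0, 418.0), "SSPG": (0.0, 145.0)}, "NORM")
# ...and Glucose > 741 => OVERT is promoted to two dimensions automatically.
overt_box, _ = make_rule({"Glucose": (741.0, 1600.0)}, "OVERT")
```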
Consider a comparison between the Normal classes of EClass and AClass. The input rule sets are as follows.

EClass:
Glucose < 376 ⇒ NORM
Glucose < 557 ∧ SSPG < 204 ⇒ NORM

AClass:
Glucose < 503 ∧ SSPG < 145 ⇒ NORM
Glucose < 336 ⇒ NORM
Glucose < 418 ∧ SSPG < 165 ⇒ NORM

Figure 4 shows the geometric representation of the two rule sets. The solid boxes represent the EClass rules; the dotted ones represent the AClass rules.

[Figure 4. Geometric input rules: the EClass (solid) and AClass (dotted) Normal-class rules drawn in the Glucose-SSPG plane.]

3.4.2. The Cutting Function

Making rules mutually exclusive, and determining their similarities and differences, both involve the cutting function. Consider the two-dimensional example shown in Figure 5. Rule A overlaps Rule B in every dimension. Rule A is the cutting rule, and Rule B is the rule being cut.

[Figure 5. Before cutting: two overlapping rectangles, A and B.]

Each dimension in turn is examined, the first being the x dimension. The minimum bound (left hand edge) of rule A does not overlap rule B, so it need not be considered. The maximum bound (right hand edge) of rule A cuts rule B into two segments. Rule B becomes the section of rule B which was overlapped by A in the x dimension, and a new rule, B1, is created which is the section of rule B not overlapped by rule A in the x dimension (Figure 6).

[Figure 6. After cutting the x dimension: B has been split into B and B1.]

Next, the second (y) dimension is considered; all references to rule B now refer to the rule B left by the last cut. The minimum bound of rule A (the bottom edge) does not overlap rule B, so no cut is made. The maximum bound of rule A (the top edge) overlaps rule B, creating a new rule B2, which is the section of rule B not overlapped by rule A; rule B becomes the section of rule B which is overlapped by rule A (Figure 7). The remaining portion of the original rule B, after all dimensions have been cut, is the overlap between the two rules.

[Figure 7. After cutting the y dimension: the remaining overlap B and the pieces B1 and B2.]

The new rules A, B, B1 and B2 are all mutually exclusive. Rules B1 and B2 describe the part of the domain covered by the original rule B and not by rule A. The cutting function generalises to n dimensions, since each dimension is cut independently of every other dimension. Rule A is assumed to have a higher 'priority' than rule B: any instances which lie in the overlap between rule A and rule B will remain within rule A after the cut.
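A sketch of the cutting function over the box representation introduced above (our rendering of the procedure in Figures 5-7, not the authors' code):

```python
def cut(cutter, target):
    """Cut box `target` by box `cutter`, one dimension at a time.
    Returns (overlap, pieces): the part of target inside cutter (None if the
    boxes are disjoint) and the mutually exclusive pieces left outside it."""
    # If the boxes are disjoint in any dimension, nothing is cut.
    for attr, (c_lo, c_hi) in cutter.items():
        t_lo, t_hi = target[attr]
        if c_hi <= t_lo or c_lo >= t_hi:
            return None, [target]
    pieces, overlap = [], dict(target)
    for attr, (c_lo, c_hi) in cutter.items():
        t_lo, t_hi = overlap[attr]
        if c_lo > t_lo:   # the cutter's minimum bound slices off a piece (B1)
            piece = dict(overlap); piece[attr] = (t_lo, c_lo); pieces.append(piece)
        if c_hi < t_hi:   # the cutter's maximum bound slices off a piece (B2)
            piece = dict(overlap); piece[attr] = (c_hi, t_hi); pieces.append(piece)
        overlap[attr] = (max(t_lo, c_lo), min(t_hi, c_hi))  # B shrinks to the overlap
    return overlap, pieces
```

Because each dimension is handled independently, the same function works unchanged in n dimensions, and the overlap piece stays with the higher-priority rule A, as the paper requires.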
3.4.3. Generating mutually exclusive rules

The 'ripple down' rules of C4.5 may be thought of as a priority ordering of rules: those that appear first are more important than those that follow. Any instance that falls into the overlap of two rules will be correctly classified by the higher priority rule, that is, the one which appears first in the list. Obviously, higher priority rules must not be cut by lower priority rules; the result would no longer correctly classify instances. In an ordered list of rules from highest to lowest priority, such as those output by C4.5, the rule at the head of the list is used as a cutter for all those following it. Each cut rule is replaced by the segments not overlapped by the cutter (B is replaced by B1 and B2 in the example above). Once all the rules below the cutter have been cut, the rule following the cutter is chosen as the new cutter. When no cutters remain, the result is a list of mutually exclusive rules which classify all instances exactly as before. The mutually exclusive rules for the example comparison between EClass and AClass are shown in Figure 8. As before, the solid boxes represent the EClass rules and the dotted boxes represent the AClass rules; the rules within each classification no longer overlap.

[Figure 8. Mutually exclusive rules: the EClass (solid) and AClass (dotted) rules after cutting.]

3.4.4. Rule Set Differences

The objective of the geometric algorithm is to compare two sets of rules and determine their differences. Each class in the first set of rules is compared with each class in the second, generating a new rule set describing the differences between the two classes. Consider two rule subsets, A and B, describing two classes within different rule sets. Each rule from A is used to cut each rule from B, and the resulting set of rules describes the part of the domain described by rule set B and not by rule set A. The same process is used in reverse (set B cutting set A) to describe the part of the domain covered by set A that is not covered by set B.

The difference matrix shows rule set coverage of the differences between the rules describing two classes. This is not an accurate statistic, since bigger rules cover more of the domain and their differences will therefore appear more significant than those of smaller rules, even though their relative differences may be similar.

For example, the rules below describe the part of the domain covered by EClass (class Norm) but not by AClass (class Norm). Figure 9 shows the geometric representation of these rules in relation to the data set.

Glucose > 336 ∧ Glucose < 376 ∧ SSPG > 165 ⇒ NORM
Glucose > 503 ∧ Glucose < 557 ∧ SSPG < 204 ⇒ NORM
Glucose > 418 ∧ Glucose < 503 ∧ SSPG > 145 ∧ SSPG < 204 ⇒ NORM
Glucose > 376 ∧ Glucose < 418 ∧ SSPG > 165 ∧ SSPG < 204 ⇒ NORM

[Figure 9. Output rules: the regions of the Glucose-SSPG plane covered by EClass Norm but not AClass Norm, shown against the data.]

The EClass rules entirely encompass those of AClass, so there are no rules describing the part of the domain covered by AClass and not by EClass.
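Both operations reduce to repeated application of the cutting function sketched earlier; the helpers below are our illustration of that reduction.

```python
def subtract(boxes_b, boxes_a):
    """Boxes covering the part of the domain in rule set B but not in rule set A
    (the rule set difference of Section 3.4.4)."""
    remainder = list(boxes_b)
    for a in boxes_a:                    # every rule in A cuts every rule in B
        next_remainder = []
        for b in remainder:
            _, outside = cut(a, b)
            next_remainder.extend(outside)
        remainder = next_remainder
    return remainder

def make_mutually_exclusive(ordered_boxes):
    """Ripple-down ordering: each rule keeps only the region not already
    covered by the higher-priority rules before it."""
    exclusive = []
    for box in ordered_boxes:
        exclusive.extend(subtract([box], exclusive))
    return exclusive
```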
3.4.5. Correlation Matrix

The correlation matrix provides a similarity measure, or correlation estimate, between two sets of rules: it describes the amount by which the two rule sets overlap. Each class from the first set of rules is compared to each class from the second set. Consider again two subsets of rules, A and B, each describing a class within two different sets of rules, where A describes the ideal classification and B describes the classification to be compared against the ideal. As with the difference matrix, each rule from set A is compared to (cuts) each rule in set B. The hypervolume of the overlap between each pair of rules is taken as a percentage of the hypervolume of the rule from set A. This statistic indicates the proportion of the rule in set A overlapped by the one in set B, and the sum of these proportions indicates the similarity between the two rule sets. Comparing a set of rules to itself will produce a correlation matrix with 100% along the main diagonal and 0% everywhere else, indicating that each class completely overlaps itself and that the classes are mutually exclusive (since all the rules are mutually exclusive).
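Continuing the running sketch, a correlation matrix can be computed from the same primitives. As the paper notes later, treating hypervolume as a proxy for probability mass amounts to assuming each attribute is uniformly distributed over its range.

```python
def correlation_matrix(ideal, candidate):
    """Overlap of each candidate class with each ideal class, expressed as a
    percentage of the ideal class's hypervolume. `ideal` and `candidate` map
    class names to lists of mutually exclusive boxes."""
    matrix = {}
    for i_name, i_boxes in ideal.items():
        total = sum(volume(a) for a in i_boxes)
        for c_name, c_boxes in candidate.items():
            shared = sum(volume(overlap)
                         for a in i_boxes for b in c_boxes
                         for overlap, _ in [cut(a, b)] if overlap is not None)
            matrix[(c_name, i_name)] = 100.0 * shared / total
    return matrix
```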
4. Results

This section presents the results produced by the three classification comparison methods described earlier: instance comparison, two-sample comparison of means, and geometric rule comparison. For each method, the five automatically generated classifications (EClass, AClass, AClass1, AClass2 and AClass3) are compared against the original clinical classification, CClass.

Table 3 shows the results of the instance comparison test. The numbers in each column indicate the number of examples classified differently to CClass; these values are also expressed as a percentage of the total training set of 145 instances. The rows of the table break the misclassifications down further to show what types of misclassification are occurring. For example, the row 'Norm ⇒ Chem' represents instances classified as Normal by CClass, but as Chemical by the automatic classifications.

Difference Type   EClass       AClass       AClass1      AClass2      AClass3
Norm ⇒ Chem        3 (2.1%)     9 (6.2%)     3 (2.1%)     5 (3.4%)     2 (1.4%)
Chem ⇒ Norm        4 (2.8%)    10 (6.9%)    13 (8.9%)    12 (8.3%)    19 (13.1%)
Overt ⇒ Norm       1 (0.7%)     0 (0.0%)     1 (0.7%)     3 (2.1%)     8 (5.5%)
Overt ⇒ Chem       6 (4.1%)     8 (5.5%)     4 (2.8%)     5 (3.4%)     2 (1.4%)
Total             20 (13.8%)   21 (14.5%)   21 (14.5%)   25 (17.2%)   31 (21.4%)

Table 3. Results of classification instance comparison

Results of the two-sample t-test comparison are given in Table 4. Here the entries indicate the attributes for which the t-test failed, for each class in the five classifications. A failed test is one where the null hypothesis was rejected, i.e. the mean of the attribute is significantly different from the corresponding CClass mean at the 95% significance level. The bottom row of the table shows the total number of rejections for each classification.

Class      EClass   AClass   AClass1   AClass2   AClass3
Normal     -        -        Glucose   Glucose   Glucose, SSPG
Chemical   SSPG     -        SSPG      SSPG      Insulin, SSPG
Overt      -        -        -         -         Insulin
Total      1        0        2         2         5

Table 4. Results of t-tests on classification means

Tables 5-9 show the correlation matrix output produced by the geometric rule comparison system. The table entries indicate the amount of overlap between the two classes as a proportion of the class listed across the top of the table, in this case CClass.

                      CClass
EClass      Normal    Chemical   Overt
Normal      94.05%    31.33%     0.00%
Chemical     5.95%    68.67%    14.19%
Overt        0.00%     0.00%    85.81%

Table 5. EClass vs. CClass correlation

                      CClass
AClass      Normal    Chemical   Overt
Normal      86.86%    13.62%     0.00%
Chemical    13.14%    86.38%    14.19%
Overt        0.00%     0.00%    85.81%

Table 6. AClass vs. CClass correlation

                      CClass
AClass1     Normal    Chemical   Overt
Normal      94.89%    41.68%     1.64%
Chemical     5.11%    58.32%     2.76%
Overt        0.00%     0.00%    95.60%

Table 7. AClass1 vs. CClass correlation

                      CClass
AClass2     Normal    Chemical   Overt
Normal      87.68%    37.20%    37.20%
Chemical    12.32%    62.80%     8.91%
Overt        0.00%     0.00%    53.89%

Table 8. AClass2 vs. CClass correlation

                      CClass
AClass3     Normal    Chemical   Overt
Normal      35.25%    35.25%     8.29%
Chemical    64.75%    64.75%    20.76%
Overt        0.00%     0.00%    70.94%

Table 9. AClass3 vs. CClass correlation

5. Discussion

The correlation matrices produced by the geometric rule comparison system can be used to determine which classification rule set compares most closely to the clinical classification. Choosing the best rule set to describe the classification also depends on the particular misclassifications made by each rule set. We indicated previously that misclassifying Overt patients as Normal can result in death; obviously it is more important to minimise these misclassifications than those from Normal to Chemical, for example, where the effect of misclassification is not so serious. The correlation matrices indicate that AClass and EClass have 0% misclassification of Overt patients to Normal, whereas AClass2 has a significant misclassification error of 37.2%. AClass3 has low correlation between similar classes, and significant overlap between the Normal and Chemical classes: 35.25% of Chemical diabetics are misclassified as Normal, and 64.75% of Normal patients are misclassified as Chemical.

The Normal and Overt classes of AClass1 are very similar to the equivalent CClass classes, but the Chemical classes are not highly correlated between these two classifications. If misclassification of Chemical diabetics to Normal were considered unimportant, AClass1 would obviously be the best classification. However, AClass has good correlation between similar classes (all above 85%), and would be used if an acceptable error rate over all classes were desired.

It is desirable to have a single metric that describes the similarity between two rule sets. This could be calculated as a weighted average of the correlation percentages for equivalent classes; in the tables above, equivalent classes lie on the main diagonal of the correlation matrix. If no misclassification is considered more important than any other, a weighting of 1.0 would be used for each class. Table 10 shows this metric, calculated using a weight of 1.0, for each rule set compared to the CClass rule set. The table indicates that AClass is the classification most similar to CClass, and that AClass3 is the least similar. This supports the 'intuitive' reading of the correlation matrices.

Classification   Similarity
EClass           82.84%
AClass           86.35%
AClass1          82.94%
AClass2          68.12%
AClass3          56.98%

Table 10. Overall correlation with CClass
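The metric of Table 10 is a one-liner over the correlation matrix sketched earlier; weights other than 1.0 would let a user penalise the dangerous Overt-to-Normal confusions more heavily.

```python
def overall_similarity(matrix, classes, weights=None):
    """Weighted average of the diagonal correlation entries (Table 10 uses
    a weight of 1.0 for every class)."""
    weights = weights or {c: 1.0 for c in classes}
    total = sum(weights[c] * matrix[(c, c)] for c in classes)
    return total / sum(weights.values())
```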
The table of instance misclassifications can be used in the same way as the correlation matrix, and the classification error for each type of misclassification corresponds roughly to the equivalent entry in the correlation matrix. For example, there are 19 misclassifications from Chemical to Normal for the AClass3 classification, giving an error rate of 13.1%, one of the highest rates in Table 3; the same misclassification in Table 9 gives 35.25%, also one of the highest. We believe the similarity statistics in the correlation matrices are more indicative of potential misclassification error than the estimates of Table 2. The instance misclassification errors are only representative of the current data set: for large data sets this error may be close to the population error, but small data sets are not representative of the population. The geometric rules are more indicative of classification errors because the rules are representative of the population domain, that is, it is assumed the rules will be used to classify unseen cases. Misclassifications indicated in Table 3 are not always accounted for by the rules, because C4.5 reclassifies some instances when the rules are constructed. The single misclassification from the Overt class to the Normal class by EClass, for example, is not represented in Table 5.

An advantage of the instance misclassifications is that the distribution of the data is inherent in the misclassifications themselves. The geometric representation of the classifications has no knowledge of the underlying distribution of the data; the similarity estimates assume a uniform distribution of each attribute across the domain.

The t-test comparison of classes is very imprecise, and imparts very little information about the quality of a classification compared with the clinical classification. AClass is indicated as a similar classification, since the means of all the variables for each class are equivalent, and AClass3 obviously compares poorly to the clinical classification, since the means are significantly different in all three classes (two out of three means are different in both the Normal and Chemical classes). However, the t-test results would indicate that AClass1 and AClass2 are comparable classifications. This is an obvious disagreement with the analysis of the geometric correlation matrices: the similarity metrics for AClass1 and AClass2 alone differ by 14.82%, and 37.2% of Overt diabetics are classified as Normal by AClass2, compared with 1.64% by AClass1.

6. Conclusions

This paper has presented a new method for evaluating the quality of classifications, based on a geometric representation of class descriptions. Rule sets are produced that describe the difference between pairs of classes, and correlation matrices are used to determine the relative degree of similarity between classifications. The method has been applied to classifications in a medical domain generated by AUTOCLASS, and its evaluation compared to those of simple instance comparison and statistical methods.

The results obtained so far are encouraging. The evaluations produced by the geometric algorithm appear to correlate reasonably well with the simple instance comparison. We believe that the geometric evaluation is more useful because it reflects the performance of the classification in the real world, on unseen data; instance-based or statistical methods cannot reproduce this. However, several aspects of the technique require further experimentation and development.

The algorithm currently assumes a uniform distribution for all attributes, which in general is not valid. We plan to incorporate a mechanism for specifying the distribution of attributes to the algorithm, or at least to use 'standard' configurations such as a normal distribution.

The system is at present limited to comparing ripple-down rule sets. Ideally it would also be able to handle rule sets in which every rule has equal priority. The difficulty in this case is with overlapping rules from different classes: unlike ripple-down rules, there is no easy way to decide which rule, if any, should take priority.

Finally, we plan to extend the system to handle enumerated as well as continuous attributes, and to integrate it fully with the WEKA workbench.

Acknowledgements

We wish to thank Ray Littler and Ian Witten for their inspiration, comments and suggestions. Credit goes to Craig Nevill-Manning for his last minute editing. Special thanks to Andrew Donkin for his excellent work on the WEKA Workbench.
References

Cendrowska, J., 1987. "PRISM: An algorithm for inducing modular rules". International Journal of Man-Machine Studies 27, 349-370.

Cheeseman, P., Kelly, J., Self, M., Stutz, J., Taylor, W. & Freeman, D., 1988. "AutoClass: A Bayesian Classification System". In Proceedings of the Fifth International Conference on Machine Learning, 54-64. Morgan Kaufmann Publishers, San Mateo, California.

Compton, P., Edwards, G., Srinivasan, A., Malor, R., Preston, P., Kang, B. & Lazarus, L., 1992. "Ripple down rules: turning knowledge acquisition into knowledge maintenance". Artificial Intelligence in Medicine 4, 47-59.

Fisher, D., 1987. "Knowledge Acquisition Via Incremental Conceptual Clustering". Machine Learning 2, 139-172.

Friedman, H. & Rubin, J., 1967. "On some invariant criteria for grouping data". Journal of the American Statistical Association 62, 1159-1178.

Hanson, S. & Bauer, M., 1989. "Conceptual Clustering, Categorization, and Polymorphy". Machine Learning 3, 343-372.

Kononenko, I. & Bratko, I., 1991. "Information-Based Evaluation Criterion for Classifier's Performance". Machine Learning 6, 67-80.

Lebowitz, M., 1987. "Experiments with Incremental Concept Formation: UNIMEM". Machine Learning 2, 103-138.

McQueen, R., Neal, D., DeWar, R. & Garner, S., 1994. "Preparing and processing relational data through the WEKA machine learning workbench". Internal report, Department of Computer Science, University of Waikato, Hamilton, New Zealand.

Michalski, R. & Stepp, R., 1983. "Learning from Observation: Conceptual Clustering". In Michalski, R., Carbonell, J. & Mitchell, T. (eds.), Machine Learning: An Artificial Intelligence Approach, 331-363. Tioga Publishing Company, Palo Alto, California.

Mingers, J., 1989. "An Empirical Comparison of Pruning Methods for Decision Tree Induction". Machine Learning 4, 227-243.

Nie, N., Hull, C., Jenkins, J., Steinbrenner, K. & Bent, D., 1975. SPSS: Statistical Package for the Social Sciences, 2nd ed. McGraw-Hill Book Company, New York.

Ousterhout, J., 1993. An Introduction to TCL and TK. Addison-Wesley Publishing Company, Inc., Reading, Massachusetts.

Quinlan, J., 1986. "Induction of Decision Trees". Machine Learning 1, 81-106.

Quinlan, J., 1992. C4.5: Programs for Machine Learning. Morgan Kaufmann Publishers, San Mateo, California.

Reaven, G., Bernstein, R., Davis, B. & Olefsky, J., 1976. "Non-ketotic diabetes mellitus: insulin deficiency or insulin resistance?". American Journal of Medicine 60, 80-88.

Reaven, G. & Miller, R., 1979. "An Attempt to Define the Nature of Chemical Diabetes Using a Multidimensional Analysis". Diabetologia 16, 17-24.

Tierney, L., 1990. LISP-STAT: An Object-Oriented Environment for Statistical Computing and Dynamic Graphics. John Wiley & Sons, New York.

Utgoff, P., 1989. "Incremental Induction of Decision Trees". Machine Learning 4, 161-186.