From: Proceedings of the Eleventh International FLAIRS Conference. Copyright © 1998, AAAI (www.aaai.org). All rights reserved.

Determining the Incremental Worth of Members of an Aggregate Set through Difference-Based Induction

Avelino J. Gonzalez
University of Central Florida
PO Box 162450
Orlando, FL 32816-2450

Sylvia Daroszewski
Harris Corporation
Palm Bay, FL

Howard J. Hamilton
Dept. of Computer Science
University of Regina
Regina, Canada
hamilton@cs.uregina.ca

Abstract

Calculating the incremental worth or weight of the individual components of an aggregate set when only the total worth or weight of the whole set is known is a problem common to several domains. In this article we describe an algorithm capable of inducing such incremental worth from a database of similar (but not identical) aggregate sets. The algorithm focuses on finding aggregate sets in the database that exhibit minimal differences in their corresponding components (referred to here as attributes and their values). This procedure isolates the dissimilarities between nearly similar aggregate sets so that any difference in worth between the sets is attributed to these dissimilarities. In effect, this algorithm serves as a mapping function that maps the makeup and overall worth of an aggregate set into the incremental worth of its individual attributes. It could also be categorized as a way of calculating interpolation vectors for the attributes in the aggregate set.

Introduction

Knowing the relative contribution of individual components of an aggregate set in terms of their worth (or cost, weight, etc.) to the total worth (or cost, weight, etc.) of the whole set can be important in domains such as engineering analysis, scientific discovery, accounting, and real estate appraisal. A slightly different but very similar problem is that of obtaining the incremental worth of such individual components. This task can be a difficult one when only the total worth of the whole set is known.

The work described in this article focuses on an approach to calculating the incremental worth of components of aggregate sets. This is possible when a database contains many aggregate sets that are homogeneous (i.e., have the same attributes) but dissimilar (i.e., have different values). An algorithm is formulated which finds aggregate sets in the database that exhibit minimal differences in their attributes' values. This procedure strives to isolate the single differentiating attribute/value combination of nearly similar aggregate sets. Any difference in worth between the sets can be attributed to this singular difference. For this reason, the algorithm is called the difference-based induction algorithm, or DBI algorithm for short.

Definition of Terms

An aggregate set represents a collection, assembly or grouping of member components which have a common nature or purpose. An aggregate set is made up of these aforementioned components. The term attribute will be used to signify a component of an aggregate set, and each attribute is said to have a particular value. A group of such attribute/value combinations can singularly describe a specific aggregate set. The components of an aggregate set act as integral members, which individually add worth to the aggregate set. However, their individual contributions to the total worth of the aggregate set may not be explicitly known.
We will use the term worth here generically to mean a real number representing the worth, weight, cost, price or any other similarly quantifiable factor. An aggregate set, AG, is therefore defined as a pair consisting of a set containing its components (called attributes, A) and the total worth of the aggregate set (worth). Therefore, the i-th aggregate set AG_i (of n total aggregate sets) can be represented by the pair

AG_i = <A_i, worth_i>.

A_i is further defined as the group of attributes for the set AG_i, such that for m attributes in AG_i:

A_i = {a_i1, a_i2, a_i3, ..., a_im}

where a_ij is the name of the j-th attribute of the i-th aggregate set. Furthermore, each attribute, a_ij, represents another pair depicting the value of the attribute (v_ij) and its individual worth (w_ij). The individual worth of an attribute is the amount that this attribute contributes to the overall worth of the aggregate set (worth_i). In the problems described here, w_ij is typically not known.

Care should be taken not to confuse the terms "value" and "worth". Value is the value inherent in that attribute, much like attribute-value pairs in frames or the value of an attribute in a relational database. The value of an attribute can be either discrete or continuous. Discrete values can be binary, such as present or not-present, Yes or No, etc., or non-binary. A non-binary attribute can take on one of several possible values, which may be numerical or symbolic in nature. Non-binary discrete attributes (such as the engine size of a car) typically have a relatively small set of possible values (4-cyl, 6-cyl, 8-cyl). Continuous attributes, on the other hand, such as the time to accelerate from 0 to 60 MPH, can take on a value from a much larger domain. The range of the continuous domain values can be divided into several sub-intervals (i.e., 5-6 secs, 7-8 secs, 9-10 secs, 10-11 secs, 12-13 secs and so on) to convert a continuous value into a discrete one.

Thus, the pair defining the j-th attribute a_ij of the aggregate set AG_i is

a_ij = <v_ij, w_ij>.

The value of an attribute, v_ij, must come from a pre-defined set of possible values or range:

v_ij ∈ {pv_1, pv_2, pv_3, pv_4, pv_5, ...}   (if discrete)

pv_min ≤ v_ij ≤ pv_max   (if continuous range)
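To make this notation concrete, the pairs above can be rendered directly as data structures. The sketch below is ours, not the paper's (which specifies no implementation); the class names, field names, and figures are illustrative only:

```python
from typing import NamedTuple, Optional, Union

Value = Union[str, float]  # a discrete (symbolic) value or an underlying continuous one

class Attribute(NamedTuple):
    """The pair a_ij = <v_ij, w_ij>."""
    value: Value
    worth: Optional[float] = None  # w_ij: typically not known in these problems

class AggregateSet(NamedTuple):
    """The pair AG_i = <A_i, worth_i>, with attributes keyed by name."""
    attributes: dict  # maps attribute name -> Attribute
    worth: float      # worth_i: the only worth that is actually observed

# An illustrative aggregate set: only its total worth is observed.
car = AggregateSet(
    attributes={"engine-size": Attribute("6-cyl"),
                "air-conditioning": Attribute("present")},
    worth=21500.0)
```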
We furthermore define the subtle difference between relative worth and incremental worth. If w_ij were to represent the relative worth, it must be true that

worth_i = w_i1 + w_i2 + w_i3 + ... + w_im.

However, this is not a necessary condition if w_ij represents the incremental worth of an attribute/value combination. Our algorithm does not require the above expression to hold.

A database, DB, containing n aggregate sets can be defined as

DB = {AG_1, AG_2, AG_3, ..., AG_n}.

The DBI algorithm, however, assumes that all aggregate sets in DB have exactly the same attributes. Any difference between the aggregate sets is reflected by the values of their attributes. If an aggregate set were to lack a particular attribute, this can be represented by assigning a value of false or none to the missing attribute. Using the same definition formalism, this homogeneity of the aggregate sets can be represented as: for all i,

A_i = A_(i+1).

This allows the following simplification in our definition:

A = {a_1, a_2, a_3, a_4, ..., a_j, ..., a_m}.

However, since the value of an attribute can differ between the various aggregate sets, the value and its worth remain as previously defined: a_j = <v_ij, w_ij>. Thus, it is not required that, for all i, v_ij = v_(i+1)j and w_ij = w_(i+1)j.

It is the goal of our work to determine the incremental worth of a specific attribute/value combination in an aggregate set. Since the domains where this information would be highly useful are not fields of exact science, the relative or incremental worth of an individual attribute/value combination can vary somewhat between the various aggregate sets in the database. Our goal, therefore, is to find the average incremental or relative worth of a particular attribute/value combination based on the overall worths of several aggregate sets. Lastly, the critical attribute is defined as the attribute whose incremental worth is to be discovered for different value assignments.
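The homogeneity assumption above (A_i = A_(i+1)), together with the false/none convention for missing attributes, might be enforced with a small helper. This continues our illustrative sketch; the paper itself only states the convention:

```python
def make_homogeneous(db):
    """Give every aggregate set in DB the same attribute names, assigning the
    value 'none' to any attribute a set lacks, so that any remaining difference
    between sets lies only in their attributes' values."""
    all_names = set()
    for ag in db:
        all_names.update(ag.attributes.keys())
    return [AggregateSet(attributes={name: ag.attributes.get(name, Attribute("none"))
                                     for name in sorted(all_names)},
                         worth=ag.worth)
            for ag in db]
```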
To put these definitions in proper perspective, we will apply them to a description of the value of cars in an automobile dealership. A dealership may have a large number of automobiles (aggregate sets) in its inventory (database). Each automobile has a set of attributes (engine size, tires, color, doors, power steering, power brakes, etc.) which, when assigned specific values, describe it. Each of these attribute/value combinations contributes to the car's total worth. Each attribute is defined by its name (e.g., a_3 = engine-size), its value (e.g., v_i3 = 6-cyl), and its incremental worth, which in this case represents a price (e.g., w_i3 = 500). In some cases, the difference between cars is in the values of their attributes (e.g., 4-cyl vs. 6-cyl engine size). In others, it is in whether an attribute even exists (a car has or does not have air conditioning).

The relative versus incremental worth of each attribute can be explained by the difference between buying a new car and a used car. In a new car purchase, the final price of the car is (ideally!) determined by the car's cost of manufacture plus the overhead and profit set by the manufacturer, plus that of each option ordered by the buyer. A reduction in any of these directly affects the final price of the car. In a used car purchase, however, the price of the car is set by the market, rather than by the manufacturer's cost. While increments to the retail price are suggested because of things like low mileage or year of make, these do not add up to the total price of the car. Thus, these represent incremental values.

The Difference-Based Induction Algorithm and Associated Procedures

The basic concept of our approach is that by isolating the differences (as represented by attribute/value combinations) between nearly similar aggregate sets in a database, the incremental or relative worth of the differentiating attribute/value combination can be said to account for the difference in the total worth of the aggregate sets. The average incremental worth of an attribute/value combination can then be computed from several specific computations of these differences in total worth.

We propose a technique that can efficiently identify the differences between aggregate sets that have multiple attributes, and quantify these differences. This quantification of differences is not difficult when the attributes are few, but it can be a daunting task when they are numerous. Thus, the DBI algorithm presented here is designed to segregate aggregate sets in terms of their differences, and to quantify the worth of these differences.

The DBI algorithm builds a decision tree whose leaf-level nodes each contain a group of identical aggregate sets. In many ways, this approach is similar to ID3 and its successors [Quinlan 1983, 1986, 1987, 1993]. Quinlan's algorithms induce a classification tree from a set of training examples that are described in terms of a collection of attributes, where each example belongs to one of several mutually exclusive classes. A set of rules is produced from this tree that is capable of classifying examples similar to those from which the rules were induced. Each rule is represented by a path in the classification tree that goes from the root node to a leaf node indicating a specific classification. Each level in that path specifies one attribute, and the branches emanating from the nodes in a level represent its possible values. The differences between the DBI algorithm and ID3, C4, C4.5 and other similar algorithms lie not so much in how each builds the classification tree, but rather in how the tree is used after it is built, and in the source of the examples.

Objective of the DBI Algorithm

In order to understand how the DBI algorithm works, it is important to first understand what it attempts to achieve in the context of the definitions provided in the previous section. The objective of the DBI algorithm is to calculate a general estimate of the incremental worth of each attribute/value combination when compared with another value for the same attribute. More specifically, it empirically computes the average incremental worth, Iw_j,x,y, of a specific value pv_x of attribute j when compared with value pv_y for the same attribute. This estimate can then be used to estimate the total worth of a new aggregate set when it is compared to other aggregate sets in the database which differ by this attribute/value combination. Ideally, the empirically discovered Iw_j,x,y is constant for all aggregate sets, but this is not a necessary condition.

It should be noted that the DBI algorithm does not attempt to determine the individual values of w_ij for each aggregate set in the database. While these values are typically unknown, they are often not necessary, and finding them is thus not our main objective. They could be computed by making some simple modifications to the algorithm; however, we have left this as a subject of future work.

Description of DBI Algorithm

Each level in a classification tree, as in conventional induction algorithms, represents one attribute. The exception to this is the leaf level, which simply acts as a repository of aggregate sets to be used in further analysis. With this exception, all nodes at the same level represent the same attribute. Each branch emanating from a node represents a specific possible value or a "discretized" range of values for that attribute.

Aggregate sets found in DB are distributed throughout the tree according to their attributes and their values. Beginning with the root node, distribution takes place by repeatedly passing each aggregate set from a parent node at one level to a child node at the next level, through the branch corresponding to the aggregate set's value for the attribute represented by the parent node. Therefore, all aggregate sets collected in any one node are identical insofar as the attributes considered at that level and at all ancestor levels of the tree. As new successor levels are added and the aggregate sets are further distributed, they are further segregated, with fewer and fewer sets populating any one node. This process continues until all attributes have been represented in the tree, up to and including the critical attribute, which is represented at the level above the leaf level.

Aggregate sets found at the same leaf node are identical to each other. They are, however, similar to the sets in a sibling leaf node in all attribute values except one: the attribute at the level of the leaves' parent node. Thus, any difference in worth between aggregate sets in sibling leaves can be inferred to be attributable to the difference in the value of that single attribute.
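Although the paper builds this segregation level by level as a tree, every non-critical attribute lies on the path from the root to a leaf, so the resulting leaf-level grouping can be sketched equivalently by keying on all non-critical attribute values at once. The flattened view below is our simplification of the distribution process, not the paper's actual tree-building procedure:

```python
from collections import defaultdict

def leaf_groups(db, critical):
    """Group aggregate sets as the DBI tree's leaves would: sets that agree on
    every attribute except the critical one share a parent node, and the value
    of the critical attribute selects the sibling leaf under that parent."""
    parents = defaultdict(lambda: defaultdict(list))
    for ag in db:
        # Identity of the path from the root down to the critical attribute's level.
        path = tuple(sorted((name, attr.value)
                            for name, attr in ag.attributes.items()
                            if name != critical))
        parents[path][ag.attributes[critical].value].append(ag)
    return parents  # {path: {critical value: [identical aggregate sets]}}
```

Two leaves sharing a path key are exactly the sibling leaves whose worth differences the paired-leaf analysis, described below, exploits.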
The classification of the groupings to which aggregate sets in the leaf nodes belong is not important in the DBI algorithm. It is only important to know that the sets are similar among themselves, and minimally different from the sets populating sibling leaf nodes. Identifying and quantifying these differences is the final objective of the DBI algorithm. The following section describes the proposed technique in greater detail.

Construction of the DBI Algorithm Tree

This section describes the procedure used in the construction of the DBI algorithm tree. This procedure is composed of the following steps: 1) data preparation and preprocessing, 2) selection of the critical attribute, 3) tree construction, 4) aggregate set pruning, and 5) paired-leaf analysis. The DBI algorithm itself is described later in this section.

Data Preparation and Preprocessing. This step ensures that all the aggregate sets to be used as "training examples" are similar in structure.

Selection of the Critical Attribute. The first step in the DBI algorithm is to select the critical attribute for each of the classification trees to be built. The critical attribute is an important one, as it is the one whose incremental worth is to be induced from the data. To determine the incremental worths for the values of more than one attribute (as would typically be the case), distinct classification trees are built by the DBI algorithm, one for each attribute whose worth for various values is to be determined.

Tree Construction Procedure. The DBI algorithm builds the tree one level at a time, starting with the root node. The number of distinct values represented by the aggregate sets at the current node determines the number of branches to be created at that node. If there is at least one aggregate set supporting a branch, one will be created for that aggregate set. Only the branches that are supported by aggregate sets are incorporated into the node; thus no "null" or redundant branches are ever created using this technique. The attribute assigned to the root node can be any attribute except the critical one. However, in the interest of efficiency, some heuristics can be used to select the attribute represented at the root node or any other non-leaf node. Please refer to (Daroszewski, 1995) for a description of these heuristics.

Pruning Heuristics. After a complete classification tree is constructed for each critical attribute, the algorithm enters the case pruning phase. Here, heuristics are applied to identify and discard those aggregate sets whose worths are not consistent with those of the other aggregate sets in the leaf-level group. See (Daroszewski, 1995) for more details.
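The specific heuristics are deferred to the thesis; purely as an illustration of their stated intent, a simple consistency rule might discard sets whose worths stray too far from the rest of their leaf-level group. The rule and threshold below are our assumptions, not the actual heuristics:

```python
from statistics import mean, stdev

def prune_leaf(leaf, max_dev=2.0):
    """Discard aggregate sets whose total worth is inconsistent with the other
    sets in the same leaf-level group (here: more than max_dev standard
    deviations from the group mean; both rule and threshold are assumptions)."""
    if len(leaf) < 3:
        return leaf  # too few sets to judge consistency
    mu = mean(ag.worth for ag in leaf)
    sigma = stdev(ag.worth for ag in leaf)
    if sigma == 0.0:
        return leaf
    return [ag for ag in leaf if abs(ag.worth - mu) <= max_dev * sigma]
```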
Computation of Incremental Worth: The Paired-Leaf Analysis. The final phase of the DBI algorithm is the incremental worth calculation, the process of determining the worth for the critical attribute of a DBI tree. This process involves a technique referred to as paired-leaf analysis.

Paired-leaf analysis is applied to two sibling leaf nodes. All aggregate sets populating these leaves have identical values for all attributes except for the critical attribute (the attribute of the parent node of the leaf-level nodes). Therefore, it can be inferred that any difference in worth between aggregate sets in two sibling leaves is due to the difference in their values of the critical attribute.

The paired-leaf analysis, therefore, computes the incremental worth Iw_j,x,y (the incremental worth of the j-th attribute having value pv_x versus having value pv_y) by comparing W_pv_x, the average of the overall worths of the aggregate sets populating a specific leaf (having value pv_x for the leaf node value), with W_pv_y, the average worth of the aggregate sets populating a sibling leaf (having value pv_y for the leaf node value). For discrete values, the incremental worth is

Iw_j,x,y = W_pv_x − W_pv_y

For continuous-value attributes that have been mapped to intervals, the underlying continuous values should be used, rather than the discrete interval names. For these values, the difference between the averages should be further divided by the difference between the values themselves. The formula then looks as follows:

Iw_j,x,y = (W_pv_x − W_pv_y) / (pv_x − pv_y)

For example, suppose continuous values for automobile acceleration times, which range from 1 to 20 seconds, have been mapped into two-second intervals {[1,2], [3,4], [5,6], ..., [19,20]}. Thus, if one car has an acceleration time of 6 seconds (interval [5,6]) and another 13 (interval [13,14]), the difference used should be (13 − 6) = 7, not the fact that they are four intervals apart.

Finally, the same average difference found between other pairs of sibling leaf nodes can be used to arrive at a final average of the incremental worth for that critical attribute.

Tree Construction Algorithm. The DBI algorithm organizes the aggregate sets in the database in the form of a classification tree. The tree construction algorithm is as follows:

1. Read all aggregate sets from the database into memory.
2. Select the attributes whose incremental worth is to be determined from the set of all possible attributes occurring in each aggregate set. This will typically be all of them, but the option exists to determine the worth of only a subset of the attributes.
3. Put these attributes in a list and call it Total_Attribute_List.
4. Make a list called Critical_Attribute_List and set it initially to Total_Attribute_List.
5. Until Critical_Attribute_List is empty, do the following:
   a) Remove the first element of the Critical_Attribute_List and store it in a variable critical_attribute.
   b) Make a list called Test_Attribute_List and set it to Total_Attribute_List.
   c) Remove the critical attribute from the Test_Attribute_List.
   d) Create a classification tree consisting of an unlabeled root node.
   e) Until the Test_Attribute_List is empty, do:
      i) Using the criteria above, select the next test attribute from the Test_Attribute_List.
      ii) Assign the test attribute to the current node of the classification tree.
      iii) Create branches corresponding to the possible values of the currently selected test attribute.
      iv) Distribute each aggregate set into the next level by assigning it to the corresponding branch, according to its value for the attribute being tested at the current node.
      v) Delete the test attribute from the Test_Attribute_List.
      vi) Create a new unlabeled node level at the end of the branches generated.
   f) Assign critical_attribute to the last node level created.
   g) Create branches corresponding to the possible values of the critical attribute.
   h) Distribute each aggregate set into the leaf level by assigning it to the corresponding branch, according to its value for the attribute being tested at the current node.
   i) Prune the aggregate sets irrelevant for analysis through the pruning heuristics described above.
   j) Apply paired-leaf analysis to determine the partial worth estimate for the critical attribute.
   k) Compute the final worth for the critical attribute.
6. End
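Combining the sketches above, steps i) through k) might be rendered as follows. This is our illustrative reading, not the paper's implementation; in particular, dividing by (pv_x − pv_y) assumes that continuous attributes are stored as their underlying numeric values rather than as interval names, as the text prescribes:

```python
from collections import defaultdict
from itertools import combinations
from statistics import mean

def paired_leaf_analysis(db, critical, continuous=False):
    """Estimate Iw_j,x,y for each value pair (pv_x, pv_y) of the critical
    attribute j, averaged over all pairs of sibling leaves that exhibit it."""
    estimates = defaultdict(list)
    for siblings in leaf_groups(db, critical).values():
        # Every pair of sibling leaves differs only in the critical attribute.
        for (pv_y, lo), (pv_x, hi) in combinations(sorted(siblings.items()), 2):
            leaf_x, leaf_y = prune_leaf(hi), prune_leaf(lo)
            if not leaf_x or not leaf_y:
                continue  # pruning emptied a leaf; no usable pair here
            diff = mean(ag.worth for ag in leaf_x) - mean(ag.worth for ag in leaf_y)
            if continuous:
                # Use the underlying numeric values, not interval names.
                diff /= (pv_x - pv_y)
            estimates[(pv_x, pv_y)].append(diff)
    # Final average over all sibling-leaf pairs, per value combination.
    return {pair: mean(diffs) for pair, diffs in estimates.items()}
```

Under these assumptions, a call such as paired_leaf_analysis(db, "pool") would return, for each pair of values of the pool attribute, the average incremental worth of one versus the other — the kind of estimate evaluated by the prototype described next.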
A more detailed description of the system in pseudocode can be found in (Daroszewski, 1995).

Implementation and Evaluation of the Difference-Based Induction Algorithm

A prototype was built which implements the DBI algorithm. The domain used for the prototype was residential property appraisal. This domain lends itself very well to this approach because determining the incremental worth of the specific attributes of a house (the aggregate set) has a significant impact on its appraised value (its overall worth). Large databases of sold houses are typically available to be used to determine the incremental worth of several attributes.

We developed a DBI algorithm prototype and extensively tested it on a database of 84 single-family houses sold during 1994 in the Alafaya Woods subdivision, Oviedo, Florida. Sales data was obtained from the Multiple Listing Service (MLS) database. The prototype accepts the entry of a feature of a house whose incremental worth is to be computed from the MLS database for that section of the city at that time. It returns a single number indicating what the market considers to be the incremental worth of that property feature when compared to one of lesser worth.

Results of Prototype Evaluation

To evaluate its effectiveness, the prototype's results were compared to those developed independently by an expert for the same neighborhood area during the same period of time. The attributes compared were the number of bedrooms, the number of bathrooms, the living area, the existence of a pool, the existence of a garage, the existence of a fireplace, and the age of the house. The DBI prototype results were in the form of a range of values, something typically done in the appraisal business. The percent differences between the maximum and minimum of said ranges, when compared to the corresponding ranges computed by the expert, were as follows (the comments in parentheses represent the expert's evaluation of the comparison): living area: 0-2.4% ("acceptable"); bedrooms: 49-72% ("marginally acceptable"); bathrooms: 154-186% ("unacceptable"); garage: 0.3-49% ("acceptable"); swimming pool: 2.5-19% ("acceptable"); fireplace: 2.5-109% ("low end acceptable, high end not acceptable"); age of house: 20-42% ("acceptable"). The complete data and a more detailed discussion of the results can be found in (Daroszewski, 1995).

Conclusion

The results obtained indicate that the procedure worked well with a relatively small database. However, unacceptable results were obtained for the number of bedrooms, the fireplace, and the number of bathrooms. These discrepancies could be attributed to the limited number of houses populating the leaves used in the paired-leaf analysis.

References

Daroszewski, S. G. 1995. Mining Metric Knowledge from Databases Using Feature-Oriented Induction. Master's thesis, Department of Electrical and Computer Engineering, University of Central Florida, Orlando, FL.

Quinlan, J. R. 1983. Learning Efficient Classification Procedures and their Application to Chess End Games. In Michalski, R. S., J. G. Carbonell, and T. M. Mitchell, eds., Machine Learning: An Artificial Intelligence Approach. San Mateo, California: Morgan Kaufmann.

Quinlan, J. R. 1986. Induction of Decision Trees. Machine Learning 1(1): 81-106.

Quinlan, J. R. 1987. Decision Trees as Probabilistic Classifiers. In Proceedings of the Fourth International Workshop on Machine Learning, 31-37. Irvine, CA.

Quinlan, J. R. 1993. C4.5: Programs for Machine Learning. San Mateo, California: Morgan Kaufmann.