Determining the Incremental Worth of Members of an Aggregate Set through Difference-Based Induction

Avelino J. González
University of Central Florida
PO Box 162450
Orlando, FL 32816-2450
ajg@ece.engr.ucf.edu
Sylvia Daroszewski
Harris Corporation
Palm Bay, FL

Howard J. Hamilton
Dept. of Computer Science
University of Regina
Regina, Canada
hamilton@cs.uregina.ca
Abstract
Calculating the incremental worth or weight of the individual components of an aggregate set when only the total worth or weight of the whole set is known is a problem common to several domains. In this article we describe an algorithm capable of inducing such incremental worth from a database of similar (but not identical) aggregate sets. The algorithm focuses on finding aggregate sets in the database that exhibit minimal differences in their corresponding components (referred to here as attributes and their values). This procedure isolates the dissimilarities between nearly similar aggregate sets so that any difference in worth between the sets is attributed to these dissimilarities. In effect, this algorithm serves as a mapping function that maps the makeup and overall worth of an aggregate set into the incremental worth of its individual attributes. It could also be categorized as a way of calculating interpolation vectors for the attributes in the aggregate set.
Introduction
Knowing the relative contribution of individual components of an aggregate set in terms of their worth (or cost, weight, etc.) to the total worth (or cost, weight, etc.) of the whole set can be important in domains such as engineering analysis, scientific discovery, accounting, and real estate appraisal. A slightly different but very similar problem is that of obtaining the incremental worth of such individual components. This task can be a difficult one when only the total worth of the whole set is known.
The work described in this article focuses on an approach to calculate the incremental worth of components of aggregate sets. This is possible when a database contains many aggregate sets that are homogeneous (i.e., have the same attributes) but dissimilar (i.e., have different values). An algorithm is formulated which finds aggregate sets in the database that exhibit minimal differences in their attributes' values. This procedure strives to isolate the single differentiating attribute/value combination of nearly similar aggregate sets.
Any difference in worth between the sets can be attributed to this singular difference. For this reason, the algorithm is called the difference-based induction algorithm, or DBI algorithm for short.
Definition of Terms
An aggregate set represents a collection, assembly, or grouping of member components which have a common nature or purpose. An aggregate set is made up of these aforementioned components. The term attribute will be used to signify a component of an aggregate set, and each attribute is said to have a particular value. A group of such attribute/value combinations can singularly describe a specific aggregate set.

The components of an aggregate set act as integral members, which individually add worth to the aggregate set. However, their individual contributions to the total worth of the aggregate set may not be explicitly known. We will use the term worth here generically to mean a real number representing the worth, weight, cost, price, or any other similarly quantifiable factor.
An aggregate set, AG, is therefore defined as a pair consisting of a set containing its components (called attributes, A) and the total worth of the aggregate set (worth). Therefore, the i-th aggregate set AG_i (of n total aggregate sets) can be represented by a pair as

    AG_i = <A_i, worth_i>.

A_i is further defined as the group of attributes for the set AG_i, such that for m attributes in AG_i:

    A_i = {a_{i,1}, a_{i,2}, a_{i,3}, ..., a_{i,m}}

where a_{i,j} is the name of the j-th attribute of the i-th aggregate set.

Furthermore, each attribute, a_{i,j}, represents another pair depicting the value of the attribute (v_{i,j}) and its individual worth (w_{i,j}). The individual worth of an attribute is the amount that this attribute contributes to the overall worth of the aggregate set (worth_i). In the problems described here, w_{i,j} is typically not known.
Care should be taken not to confuse the terms "value" and "worth". Value is the value inherent in that attribute, much like attribute-value pairs in frames or the value of an attribute in a relational database. The value of an attribute can be either discrete or continuous. Discrete values can be binary, such as present or not-present, Yes or No, etc., or non-binary. A non-binary attribute can take on one of several possible values, which may be numerical or symbolic in nature. Non-binary discrete attributes (such as the engine-size of a car) typically have a relatively small set of possible values (4-cyl, 6-cyl, 8-cyl). Continuous attributes, on the other hand, such as the time to accelerate from 0 to 60 MPH, can take on a value from a much larger domain. The range of the continuous domain values can be divided into several sub-intervals (e.g., 5-6 secs, 7-8 secs, 9-10 secs, 11-12 secs, and so on) to convert a continuous value into a discrete one.
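By way of illustration, the short Python sketch below (our own illustrative code, not part of the original system) maps a continuous acceleration time onto the two-second intervals used in an example later in this paper:

    # Map a continuous acceleration time onto the two-second intervals
    # [1,2], [3,4], ..., [19,20] used later in this paper.
    def interval_label(seconds):
        lower = 1 + 2 * int((seconds - 1) // 2)
        return (lower, lower + 1)

    print(interval_label(6))    # (5, 6)
    print(interval_label(13))   # (13, 14)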
The pair defining the j-th attribute of the aggregate set AG_i is thus

    a_{i,j} = <v_{i,j}, w_{i,j}>.

The value of an attribute, v_{i,j}, must come from a pre-defined set of possible values or range:

    v_{i,j} ∈ {pv_1, pv_2, pv_3, pv_4, pv_5, ..., pv_k}   (if discrete)

    pv_min <= v_{i,j} <= pv_max   (if a continuous range)

We furthermore define the subtle difference between relative worth and incremental worth. If w_{i,j} were to represent the relative worth, it must be true that

    worth_i = w_{i,1} + w_{i,2} + w_{i,3} + ... + w_{i,m}.

However, this is not a necessary condition if w_{i,j} represents the incremental worth of an attribute/value combination. Our algorithm does not require the above expression to hold.

A database, DB, containing n aggregate sets can be defined as

    DB = {AG_1, AG_2, AG_3, ..., AG_n}.

The DBI algorithm, however, assumes that all aggregate sets in DB have exactly the same attributes. Any difference between the aggregate sets is reflected by the values of their attributes. However, if an aggregate set were to lack a particular attribute, this can be represented by assigning a value of false or none to the missing attribute. Using the same definition formalism, this homogeneity of aggregate sets could be represented as:

    for all i, A_i = A_{i+1}.

This would allow the following simplification in our definition:

    A = {a_1, a_2, a_3, a_4, ..., a_j, ..., a_m}.

However, since the value of an attribute can differ between the various aggregate sets, the value and its worth remain as previously defined:

    a_j = <v_{i,j}, w_{i,j}>.

Thus, it is not required that

    for all i, v_{i,j} = v_{i+1,j} and w_{i,j} = w_{i+1,j}.

It is the goal of our work to determine the incremental worth of a specific attribute/value combination in an aggregate set. Since the domains where this information would be highly useful are not fields of exact sciences, the relative or incremental worth of an individual attribute/value combination can vary somewhat between the various aggregate sets in the database. Our goal, therefore, is to find the average incremental or relative worth of a particular attribute/value combination based on the overall worths of several aggregate sets.

Lastly, the critical attribute is defined as the attribute whose incremental worth is to be discovered for different value assignments.

To put these definitions in proper perspective, we will apply them to a description of the value of cars in an automobile dealership. A dealership may have a large number of automobiles (aggregate sets) in its inventory (database). Each automobile has a set of attributes (engine size, tires, color, doors, power steering, power brakes, etc.) which, when assigned specific values, describe it. Each of these attribute/value combinations contributes to the car's total worth. Each attribute is defined by its name (e.g., a_3 = engine-size), its value (e.g., v_{i,3} = 6-cyl), and its incremental worth (in this case it represents the price) (e.g., w_{i,3} = 500). In some cases, the difference between cars is in the values of their attributes (e.g., 4-cyl vs. 6-cyl engine-size). In others, it is in whether they even exist (a car has or does not have air conditioning).

The relative versus incremental worth of each attribute can be explained by the difference between buying a new car and a used car. In a new car purchase, the final price of the car is (ideally!) determined by the car's cost of manufacture, plus the overhead and profit set by the manufacturer, plus that of each option ordered by the buyer. A reduction in any of these directly affects the final price of the car. In a used car purchase, however, the price of the car is set by the market rather than by the manufacturer's cost. While increments to the retail price are suggested because of things like low mileage or year of make, these do not add up to the total price of the car. Thus, these represent incremental values.
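These definitions translate directly into simple data structures. The following minimal Python sketch (our illustration, not code from the original system) mirrors the formalism: an aggregate set pairs an attribute/value mapping with a total worth, the individual worths w_{i,j} being unknown, and the database is a list of homogeneous sets:

    from dataclasses import dataclass

    @dataclass
    class AggregateSet:
        values: dict    # attribute name -> value (the v_{i,j}); w_{i,j} unknown
        worth: float    # total worth of the whole set (worth_i)

    # A homogeneous database DB: every set has the same attribute names.
    DB = [
        AggregateSet({"engine-size": "6-cyl", "air-cond": True},  21500.0),
        AggregateSet({"engine-size": "4-cyl", "air-cond": True},  19900.0),
        AggregateSet({"engine-size": "6-cyl", "air-cond": False}, 20400.0),
    ]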
The Difference-Based Induction Algorithm and Associated Procedures
The basic concept of our approach is that by isolating the differences (as represented by attribute/value combinations) between nearly similar aggregate sets in a database, the incremental or relative worth of the differentiating attribute/value combination can be said to account for the difference in the total worth of the aggregate sets. The average incremental worth of an attribute/value combination can then be computed from several specific computations of these differences in total worth. We propose a technique that can efficiently identify the differences between aggregate sets that have multiple attributes, and quantify these differences. This quantification of differences is not difficult when the attributes are few, but it can be a daunting task when they are numerous. Thus, the DBI algorithm presented here is designed to segregate aggregate sets in terms of their differences, and to quantify the worth of these differences.
The DBI algorithm builds a decision tree whose leaf-level nodes each contain a group of identical aggregate sets. In many ways, this approach is similar to ID3 and its successors [Quinlan 1983, 1986, 1987, 1993]. Quinlan's algorithms induce a classification tree from a set of training examples that are described in terms of a collection of attributes, where each example belongs to one of several mutually exclusive classes. A set of rules is produced from this tree that is capable of classifying examples similar to those from which the rules were induced. Each rule is represented by a path in the classification tree that goes from the root node to a leaf node indicating a specific classification. Each level in that path specifies one attribute, and the branches emanating from the nodes in a level represent their possible values.

The differences between the DBI algorithm and ID3, C4.5, and other similar algorithms are not so much in how each builds the classification tree, but rather in how the tree is used after it is built, and in the source of the examples.
Objective of the DBI Algorithm
In order to understand how the DBI algorithm works, it is important to first understand what it attempts to achieve in the context of the definitions provided in the previous section. The objective of the DBI algorithm is to calculate a general estimate for the incremental worth of each attribute/value combination when compared with another value for the same attribute. More specifically, it empirically computes the average incremental worth, Iw_{j,x,y}, of a specific value pv_x of attribute j when compared with value pv_y for the same attribute. This estimate can then be used to estimate the total worth of a new aggregate set when it is compared to other aggregate sets in the database which differ by this attribute/value combination. Ideally, the empirically discovered Iw_{j,x,y} is constant for all aggregate sets, but this is not a necessary condition.

It should be noted that the DBI algorithm does not attempt to determine the individual values of w_{i,j} for each aggregate set in the database. While these values are typically unknown, they are often not necessary, and thus determining them is not our main objective. They could be computed easily by making some simple modifications to the algorithm. However, we have left this as a subject of future work.
Description of DBI Algorithm
Each level in a classification tree, as in conventional induction algorithms, represents one attribute. The exception to this is the leaf level, which simply acts as a repository of aggregate sets to be used in further analysis. With this exception, all nodes at the same level represent the same attribute. Each branch emanating from a node represents a specific possible value or a "discretized" range of values for that attribute.

Aggregate sets found in DB are distributed throughout the tree according to their attributes and their values. Beginning with the root node, distribution takes place by repeatedly passing each aggregate set from a parent node at one level to a child node at the next level through the branch corresponding to the aggregate set's value for the attribute represented by the parent node. Therefore, all aggregate sets collected in any one node are identical insofar as the attributes considered at that level and at all ancestor levels of the tree. As new successor levels are added and the aggregate sets are further distributed, they will be further segregated, with fewer and fewer sets populating any one node. This process continues until all attributes have been represented in the tree, up to and including the critical attribute, which is represented at the level above the leaf level.

Aggregate sets found at the same leaf node are identical to each other. They are, however, similar to the sets in a sibling leaf node in all attribute values except one: the attribute at the level of the leaves' parent node. Thus, any difference in worth between aggregate sets in sibling leaves can be inferred to be attributable to the difference in the value of that single attribute.

The classification of the groupings to which aggregate sets in the leaf nodes belong is not important in the DBI algorithm. It is only important to know that the sets are similar among themselves, and minimally different from sets populating sibling leaf nodes. Identifying and quantifying these differences is the final objective of the DBI algorithm. The following section describes the proposed technique in greater detail.
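The level-by-level distribution just described can be sketched in a few lines of Python (our illustration, not the original implementation; nodes are nested dictionaries, and only branches supported by at least one aggregate set are created):

    # Distribute aggregate sets down one tree level per attribute; the
    # critical attribute should come last so that leaves hold identical sets.
    def distribute(sets, attribute_order):
        if not attribute_order:
            return sets                  # leaf: a group of identical sets
        attr, rest = attribute_order[0], attribute_order[1:]
        branches = {}
        for values, worth in sets:       # each set is (values dict, worth)
            branches.setdefault(values[attr], []).append((values, worth))
        return {v: distribute(group, rest) for v, group in branches.items()}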
Construction of the DBI Algorithm Tree

This section describes the procedure used in the construction of the DBI algorithm tree. The procedure is composed of the following steps: 1) data preparation and preprocessing, 2) selection of the critical attribute, 3) tree construction, 4) aggregate set pruning, and 5) paired-leaf analysis.
Data Preparation and Preprocessing. This step ensures that all the aggregate sets to be used as "training examples" are similar in structure.
Selection of the Critical Attribute. The first step in the DBI algorithm is to select the critical attribute for each of the classification trees to be built. The critical attribute is an important one, as it is the one whose incremental worth is to be induced from the data. To determine the incremental worths for the values of more than one attribute (as would typically be the case), distinct classification trees are built by the DBI algorithm, one for each attribute whose worth for various values is to be determined.
Tree Construction Procedure. The DBI algorithm builds the tree one level at a time, starting with the root node. The number of distinct values represented by the aggregate sets at the current node determines the number of branches to be created at that node. If there is at least one aggregate set supporting a branch, one will be created for that aggregate set. Only the branches that are supported by aggregate sets are incorporated into the node; thus no "null" or redundant branches are ever created using this technique.

The attribute assigned to the root node can be any attribute except the critical one. However, in the interest of efficiency, some heuristics can be used to select the attribute represented at the root node or any other non-leaf node. Please refer to (Daroszewski, 1995) for a description of these heuristics.
Pruning Heuristics. After a complete classification tree is constructed for each critical attribute, the algorithm enters the case pruning phase. Here, heuristics are applied to identify and discard those aggregate sets whose worths are not consistent with those of other aggregate sets in the leaf-level group. See (Daroszewski, 1995) for more details.
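The pruning heuristics themselves are given in (Daroszewski, 1995); the fragment below is only a generic stand-in, showing where such a filter would sit by discarding leaf members whose worth strays too far from the leaf's median worth:

    # Generic stand-in for the pruning phase (NOT the heuristics of
    # Daroszewski, 1995): drop worths that deviate from the leaf median
    # by more than a chosen fraction. Assumes positive worths.
    def prune_leaf(worths, tolerance=0.25):
        median = sorted(worths)[len(worths) // 2]
        return [w for w in worths if abs(w - median) <= tolerance * median]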
Computation of Incremental Worth - The Paired-Leaf Analysis. The final phase of the DBI algorithm is the incremental worth calculation, the process of determining the worth for the critical attribute of a DBI tree. This process involves a technique referred to as paired-leaf analysis.
Paired-leaf analysis is applied to two sibling leaf nodes. All aggregate sets populating these leaves have identical values for all attributes except for the critical attribute (the attribute at the parent node of the leaf-level nodes). Therefore, it can be inferred that any difference in worth between aggregate sets in two sibling leaves is due to the difference in their values of the critical attribute.

The paired-leaf analysis, therefore, computes the incremental worth W_{j,x,y} (the incremental worth of the j-th attribute having value pv_x versus having value pv_y) by comparing W_{pv_x}, the average of the overall worths of the aggregate sets populating a specific leaf (having value pv_x for the leaf node value), with W_{pv_y}, the average worth of the aggregate sets populating a sibling leaf (having value pv_y for the leaf node value).

For discrete values, the incremental worth is

    W_{j,x,y} = W_{pv_x} - W_{pv_y}

For continuous-valued attributes that have been mapped to intervals, the underlying continuous values should be used rather than the discrete interval names. For these values, the difference between the averages should be further divided by the difference between the values themselves. The formula then becomes:

    W_{j,x,y} = (W_{pv_x} - W_{pv_y}) / (pv_x - pv_y)

For example, suppose continuous values for automobile acceleration times, which range from 1 to 20 seconds, have been mapped into two-second intervals {[1,2], [3,4], [5,6], ..., [19,20]}. Thus, if one car has an acceleration time of 6 seconds (interval [5,6]) and another of 13 seconds (interval [13,14]), the difference used should be (13 - 6) = 7, not the fact that they are four intervals apart.

Finally, the same average difference found between other pairs of sibling leaf nodes can be used to arrive at a final average of the incremental worth for that critical attribute.
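The arithmetic for a single leaf pair is summarized by the sketch below (Python, our own illustration; the worth figures are invented for the example):

    def mean(xs):
        return sum(xs) / len(xs)

    # W_{j,x,y}: difference of the average overall worths of two sibling
    # leaves; for continuous attributes, divide by (pv_x - pv_y) using the
    # underlying values (13 - 6 = 7 seconds), not the interval count.
    def paired_leaf_worth(worths_x, worths_y, pv_x=None, pv_y=None):
        diff = mean(worths_x) - mean(worths_y)
        if pv_x is None or pv_y is None:      # discrete attribute
            return diff
        return diff / (pv_x - pv_y)           # continuous attribute

    # Invented example: cars at 6 s are worth more, so the slope comes out
    # negative, i.e. worth lost per extra second of acceleration time.
    print(paired_leaf_worth([21000.0, 20600.0], [18900.0], pv_x=6.0, pv_y=13.0))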
Tree Construction Algorithm. The DBI algorithm organizes the aggregate sets in the database in the form of a classification tree. The tree construction algorithm is as follows:

1. Read all aggregate sets from the database into memory.
2. Select the attributes whose incremental worth is to be determined from the set of all possible attributes occurring in each aggregate set. This will typically be all of them, but the option exists to determine the worth of only a subset of the attributes.
3. Put these attributes in a list and call it Total_Attribute_List.
4. Make a list called Critical_Attribute_List and set it initially to Total_Attribute_List.
5. Until Critical_Attribute_List is empty, do the following:
   a) Remove the first element of the Critical_Attribute_List and store it in a variable critical_attribute.
   b) Make a list called Test_Attribute_List and set it to Total_Attribute_List.
   c) Remove the critical attribute from the Test_Attribute_List.
   d) Create a classification tree consisting of an unlabeled root node.
   e) Until the Test_Attribute_List is empty, do:
      i) Using the selection criteria above, select the next test attribute from the Test_Attribute_List.
      ii) Assign the test attribute to the current node of the classification tree.
      iii) Create branches corresponding to the possible values of the currently selected test attribute.
      iv) Distribute each aggregate set into the next level by assigning it to the corresponding branch, according to its value for the attribute being tested at the current node.
      v) Delete the test attribute from the Test_Attribute_List.
      vi) Create a new unlabeled node level at the end of the branches generated.
   f) Assign critical_attribute to the last node level created.
   g) Create branches corresponding to the possible values of the critical attribute.
   h) Distribute each aggregate set into the leaf level by assigning it to the corresponding branch, according to its value for the attribute being tested at the current node.
   i) Prune the aggregate sets irrelevant for the analysis using the pruning heuristics described above.
   j) Apply paired-leaf analysis to determine the partial worth estimate for the critical attribute.
   k) Compute the final worth for the critical attribute.
6. End.

A more detailed description of the system in pseudocode can be found in (Daroszewski, 1995).
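Read end to end, steps 1-6 amount to the following: for each critical attribute, group the database by all other attribute values, compare sibling leaves, and average the results. The compact Python sketch below is our reading of the algorithm under that flattening, with the attribute-selection and pruning heuristics of (Daroszewski, 1995) omitted:

    from collections import defaultdict
    from itertools import combinations

    def mean(xs):
        return sum(xs) / len(xs)

    def dbi(db, critical_attributes):
        """db: list of (values dict, worth) pairs. Returns a mapping
        (attribute, pv_x, pv_y) -> averaged incremental worth. Pruning and
        attribute-ordering heuristics (Daroszewski, 1995) are omitted."""
        estimates = defaultdict(list)
        for critical in critical_attributes:
            # Sets agreeing on every attribute except `critical` share a
            # parent node; within it they split into sibling leaves.
            parents = defaultdict(lambda: defaultdict(list))
            for values, worth in db:
                key = tuple(sorted((a, v) for a, v in values.items()
                                   if a != critical))
                parents[key][values[critical]].append(worth)
            # Paired-leaf analysis on every pair of sibling leaves.
            for leaves in parents.values():
                for pv_x, pv_y in combinations(sorted(leaves), 2):
                    diff = mean(leaves[pv_x]) - mean(leaves[pv_y])
                    estimates[(critical, pv_x, pv_y)].append(diff)
        # Final worth: average of the partial estimates per combination.
        return {k: mean(v) for k, v in estimates.items()}

    # Invented data: two parent groups (air-cond True/False), each split
    # into 4-cyl and 6-cyl sibling leaves.
    cars = [
        ({"engine-size": "4-cyl", "air-cond": True},  19900.0),
        ({"engine-size": "6-cyl", "air-cond": True},  21500.0),
        ({"engine-size": "4-cyl", "air-cond": False}, 18700.0),
        ({"engine-size": "6-cyl", "air-cond": False}, 20100.0),
    ]
    print(dbi(cars, ["engine-size"]))
    # {('engine-size', '4-cyl', '6-cyl'): -1500.0}: a 4-cyl engine is worth
    # 1500 less than a 6-cyl one in this invented data.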
Implementation and Evaluation of the Difference-Based Induction Algorithm

A prototype was built which implements the DBI algorithm. The domain used for the prototype was residential property appraisal. This domain lends itself very well to this approach because determining the incremental worth of the specific attributes of a house (the aggregate set) has a significant impact on its appraised value (its overall worth). Large databases of sold houses are typically available to be used to determine the incremental worth of several attributes.

We developed a DBI algorithm prototype and extensively tested it on a database of 84 single-family houses sold during 1994 in the Alafaya Woods subdivision, Oviedo, Florida. Sales data was obtained from the Multiple Listing Service (MLS) database.

The prototype accepts the entry of a feature of a house whose incremental worth is to be computed from the MLS database for that section of the city at that time. It returns a single number indicating what the market considers to be the incremental worth of that property feature when compared to one of lesser worth.
Results of Prototype Evaluation
To evaluate its effectiveness, the prototype's results were compared to those developed internally by an expert for the same neighborhood area during the same period of time. The attributes compared were the number of bedrooms, the number of bathrooms, the living area, the existence of a pool, the existence of a garage, the existence of a fireplace, and the age of the house. The DBI prototype results were in the form of a range of values, something typically done in the appraisal business. The percent differences between the maximum and minimum of said ranges, when compared to the corresponding ranges computed by the expert, were as follows (the comments in parentheses represent the expert's evaluation of the comparison): living area: 0-2.4% ("acceptable"); bedrooms: 49-72% ("marginally acceptable"); bathrooms: 154-186% ("unacceptable"); garage: 0.3-49% ("acceptable"); swimming pool: 2.5-19% ("acceptable"); fireplace: 2.5-109% ("low end acceptable, high end not acceptable"); age of house: 20-42% ("acceptable"). The complete data and a more detailed discussion of the results can be found in (Daroszewski, 1995).
Conclusion
The results obtained indicate that the procedure worked well with a relatively small database. However, unacceptable results were obtained for the number of bedrooms, the fireplace, and the number of bathrooms. These discrepancies could be attributed to the limited number of houses populating the leaves used in the paired-leaf analysis.
References

Daroszewski, S. G. 1995. Mining Metric Knowledge from Databases Using Feature-Oriented Induction. Master's thesis, Department of Electrical and Computer Engineering, University of Central Florida, Orlando, FL.

Quinlan, J. R. 1983. Learning Efficient Classification Procedures and their Application to Chess End Games. In Michalski, R. S.; Carbonell, J. G.; and Mitchell, T. M., eds., Machine Learning: An Artificial Intelligence Approach. San Mateo, California: Morgan Kaufmann.

Quinlan, J. R. 1986. Induction of Decision Trees. Machine Learning 1(1): 81-106.

Quinlan, J. R. 1987. Decision Trees as Probabilistic Classifiers. In Proceedings of the Fourth International Workshop on Machine Learning, 31-37. Irvine, CA.

Quinlan, J. R. 1993. C4.5: Programs for Machine Learning. San Mateo, California: Morgan Kaufmann.