From: AAAI Technical Report SS-93-07. Compilation copyright © 1993, AAAI (www.aaai.org). All rights reserved.
Classification Learning: From Paradigm Conflicts to Engineering Choices
David L. Waltz
Thinking Machines Corporation and
Brandeis University
Abstract

Classification learning applies to a wide range of tasks, from diagnosis and troubleshooting to pattern recognition and keyword assignment. Many methods have been used to build classification systems, including artificial neural networks, rule-based expert systems (both hand-built and inductively learned), fuzzy rule systems, memory-based and case-based systems and nearest neighbor systems, generalized radial basis functions, classifier systems, and others. Research subcommunities have tended to specialize in one or another of these mechanisms, and many papers have argued for the superiority of one method vis-a-vis others. I will argue that none of these methods is universal, nor does any one method have a priori superiority over all others. To support this argument, I show that all these methods are related, and in fact can be viewed as lying at points along a continuous spectrum, with memory-based methods occupying a pivotal position. I further argue that the selection of one or another of these methods should generally be seen as an engineering choice, even when the research goal is to explore the potential of some method for explaining aspects of cognition; methods and problem areas must be considered together. Finally, a set of properties is identified that can be used to characterize each of the classification methods, and to begin to build an engineering science for classification tasks.
1.0 Unified Framework for Classification Learning
A wide variety of classification learning methods can be seen as related, as points on a spectrum of methods. Memory-Based Reasoning (MBR) is the key to this analysis. The idea of MBR is to use a training set without modification as the basis of a nearest neighbor classification method. Any new example to be classified is compared to each element in the training set, and the distance from the new example is computed for each training set element. The nearest neighbor (or nearest k neighbors) are found in the training set, and their classifications are used to decide on the classification for the new example. In a single nearest neighbor version of MBR, the class of the closest training set neighbor is assigned to the new example. In a k-nearest neighbor version, if all k nearest neighbors have the same class, it is assigned to the new example; if more than one class appears within the nearest k neighbors, then a voting or distance-weighted voting scheme is used to classify the new example. As stated, MBR has no learning. (It is certainly possible -- and for real world problems generally a good idea -- to include learning with MBR; we will come back to this issue later.)
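As a concrete illustration, the following minimal sketch implements the single- and k-nearest-neighbor procedure just described. The Hamming distance over symbolic feature tuples and all the names here are illustrative choices, not details of any system discussed in this paper.

    from collections import Counter

    def mbr_classify(example, training_set, k=1, distance=None):
        """Classify `example` by its k nearest neighbors in `training_set`,
        a list of (features, label) pairs."""
        if distance is None:
            # Default: count the features on which two cases disagree.
            distance = lambda a, b: sum(x != y for x, y in zip(a, b))

        # Compare the new example to every stored case and rank by distance.
        ranked = sorted(training_set, key=lambda case: distance(example, case[0]))
        neighbors = ranked[:k]

        # Single nearest neighbor: adopt the closest case's class directly.
        if k == 1:
            return neighbors[0][1]

        # k > 1: vote among the nearest neighbors; a distance-weighted
        # scheme, as mentioned above, would weight each vote by nearness.
        votes = Counter(label for _, label in neighbors)
        return votes.most_common(1)[0][0]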
First, we can relate MBR to rule-based systems; in particular, if looked at the right way, a single-nearest-neighbor MBR system is already a rule-based system. To see this, note that MBR cases consist of situations and actions, like production rules. There are as many "rules" as there are cases in the MBR training set database. Each "left hand side" is the conjunction of all the features of the case. Each "right hand side" is the classification.
Using this observation, we can see that there is a spectrum of rule-based systems between MBR and an "ordinary" rule-based system with a relatively small number of rules. We can move along this spectrum by using AI learning techniques: for example, we can find irrelevant features by noting that certain left-hand side variables have no correlation with classifications, and can thus be eliminated, yielding shorter rules. Also, some cases may be repeated, and as variables are eliminated, more cases will become identical, and can thus be merged. Moreover, if ranges or sets are used instead of specific variables, more cases can be collapsed. (Relevant AI methods include ID-3, AQVAL, version spaces, COBWEB, etc.)
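To make the case-as-rule view concrete, here is a small sketch of how eliminating irrelevant features turns cases into shorter, mergeable rules. The function and representation are hypothetical illustrations, assuming a correlation test of the kind described above has already chosen the relevant features; they are not drawn from any of the systems named.

    def cases_to_rules(cases, relevant):
        """View each training case as a production rule and merge duplicates.
        `cases` is a list of (feature_dict, label) pairs; `relevant` is the
        set of feature names that survived the correlation test (that test
        itself is omitted here)."""
        rules = {}
        for features, label in cases:
            # Left-hand side: the conjunction of the remaining feature values.
            lhs = frozenset((f, v) for f, v in features.items() if f in relevant)
            # Cases made identical by feature elimination collapse into one
            # rule; a real system would also have to resolve label conflicts.
            rules.setdefault(lhs, label)
        return rules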
Second,wecan also relate MBR
to neural nets. If lookedat the right way, single-nearestneighbor MBR
systems are a kind of neural net, namely one with as manyhidden units as
there are examplesin the training set. Eachinput unit codes for a feature/variable, and is
fully connectedto each hidden unit. There are as manyoutput units as there are possible
classifications, and each hiddenunit has a single link to the appropriateoutput unit. All
the hiddenunits are connectedin a "winner-take-all" networkconfiguration. It is also
possible to define a spectrum of methodsbetween MBR
and "ordinary" neural nets. We
can reduce the numberof MBRhidden units by a numberof methods: we can remove
duplicate cases; wecan replace similar cases with a single case, using methodssimilar to
those of generalized radial basis functions, whichyield a set of Khiddenunits, each with
a central pseudo-case,together with radii of influence (this also has close relations to
G-rossberg’s ARTsystem); we can use methodslike those of Kibler and Aha, whoremove
cases whoseneighbors all have the sameclasification, leaving only those cases that are
on the boundariesbetweenregions of similar categorizations; etc.
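The network reading of single-nearest-neighbor MBR can be spelled out in a few lines. This is a minimal sketch, assuming negative squared Euclidean distance as the hidden-unit activation, so that the winner-take-all layer selects exactly the nearest stored case.

    import numpy as np

    def mbr_as_network(train_X, train_y, x):
        """One hidden unit per training case, fully connected to the inputs;
        a winner-take-all layer picks the most active unit, whose single
        output link supplies the classification."""
        # Hidden layer: each unit's weight vector is one stored case, and
        # its activation grows as the input approaches that case.
        activations = -np.sum((train_X - x) ** 2, axis=1)
        # Winner-take-all: only the most strongly activated unit fires.
        winner = int(np.argmax(activations))
        # Its single link to an output unit yields the class.
        return train_y[winner]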
2. Relative advantages and disadvantages of various methods
For the past few years, I and a number of my colleagues have been involved with several projects that have allowed us to compare various classification learning methods. I will concentrate here on examples where we have been able to compare the results of various systems quantitatively. Examples include MBRTalk and the research of Wolpert, which compared neural nets with MBR; PACE, a system that classified Census Bureau returns, and allowed direct comparison with AIOCS, an expert system; a memory-based system for assigning keywords to news articles that can be compared with CONSTRUE, an expert system for a very similar task, as well as with human indexers; and work on PHIPSI, a system for protein secondary structure prediction, which let us compare neural nets with MBR as well as with statistical techniques. Other projects have let us compare MBR with CART and various statistical regression methods.
These various systems can be compared along a number of dimensions. These include 1) quality of classification decisions; 2) computational cost; 3) programmer effort required; 4) memory requirements; 5) learning cost; 6) update cost; 7) ability to generalize or extrapolate beyond a training set; 8) ability to scale to very large training sets and/or to very large numbers of categories; 9) ability to provide an explanation for classifications; 10) range of inputs handled (e.g. numbers, symbolic variables, free text, etc.); and 11) ability to provide a confidence score for classifications. The properties of each of the various classification methods are characterized below, and compared with the requirements for various classification tasks.
Many of these dimensions are summarized in the following table:

             Accu-  Learn  Decsn  Progr  Updte  Mem    Expl?  Conf   Text?  Noise  Scales?
             racy   Cost   Cost   Cost   Cost   Reqts         meas?         Tol?
    MBR      ++     ++     --     +      ++     --     +      +      +      +      +
    ANNs     +      -      ++     ++     -      +      --     --     --     +      ?
    ID-3     +      -      ++     +      -?     +      +      ?      --     +      +?
    GRBF     ++     -      +      ~?     +      -?     ?      ?      ?      +      ?
    RBSs     +?     -      ++     --     --     +      +      ~?     +      --     --
Key to abbreviations

MBR = Memory-Based Reasoning
ANN = Artificial Neural Nets
ID-3 = Quinlan's system or CART
GRBF = Generalized Radial Basis Functions (Poggio)
RBSs = Rule-Based Systems (Expert Systems)

Accuracy = level of correctness of classifications
Learn Cost = computational cost of learning
Decsn Cost = cost of making a classification, once the system has been trained
Progr Cost = programmer cost, the amount of human effort required to build a system
Updte Cost = computational and/or human cost to train/reprogram the system to correctly classify new cases
Mem Reqts = memory space requirements
Expl? = does the system provide an explanation of its behavior?
Conf meas? = does the system provide a measure of its confidence in its classification?
Text? = can free text be handled as an input?
Noise Tol? = is the system tolerant of noise? (or is it "brittle"?)
Scales? = can the system be scaled up to deal with really large training sets and/or high numbers of classifications per second?

++ = very strong point
+ = strong point
~ = a mixed bag
- = weak point/system not good at this
-- = very weak point/system can't do this, or does badly
? = hard to judge
Highlights of the table
MBR has been the most accurate -- or among the two or three most accurate -- methods on every domain we've explored, as long as the number of cases is very large (the larger the number of cases, the more likely that one of them will be an exact or nearly exact match for a new case). MBR does not require learning in its simplest form; all other methods either require considerable resources for learning, or else can't be learned at all and must therefore be hand-coded. Updates cost little -- one simply adds new items to the case-base, removes old ones, and the system immediately begins making classifications based on the new information. MBR provides explanations for its classifications (in terms of precedents). MBR provides confidence measures (i.e. nearness of nearest neighbors), unlike any other methods. And MBR can use text as inputs (e.g. by using methods borrowed from Information Retrieval to judge the relative similarity of free text passages). An MBR system's great drawback is that it requires large amounts of memory, and large computational power for decision-making.
Artificial Neural Nets are easy to program and require very little computational power in order to make classifications, once a network is trained. They are, however, expensive to train or update, provide no explanations or justifications, provide no confidence measures, and cannot deal with free text inputs.
Decision trees (ID-3, CART, ...) require only small computational power to make decisions, once the tree is learned.
Generalized Radial Basis Functions have demonstrated high accuracy.
Rule-based Systems are computationally cheap once built, can deal with free text inputs, can supply explanations (in the form of the sequence of rules applied to obtain the classification), and, at least in some forms, can give a confidence measure. However, they require large programmer ("knowledge engineer") effort to build or update, are intolerant of noise ("brittle") -- unlike the rest of the methods considered here -- and they scale poorly: few, if any, rule-based systems exceed 1,000 rules.
3. Understanding the properties of various methods -- toward an engineering science for classification tasks
In the long run, hardware costs will become negligible, designs will be completed and implemented, and systems will be fielded. The most critical issue then remaining is accuracy: can a system actually classify reliably? Can it track changes? The accuracy that can be achieved by any given system depends a great deal on the underlying nature of the classification domain.
AI has commonly assumed that a very small number of rules or principles underlie each domain. The goal of a learning system is to find the compact, concise rule set that captures the domain. The most extreme example is a physical law, e.g. Newton's law of motion, F = ma, as found by a Bacon-like system. A similar spirit motivates MDL (Minimum Description Length) learning, AQVAL, etc. Believers in such systems point to evidence of bad performance due to "overtraining" of neural nets.
But many domains are not simple, and for these, rules may simply be inappropriate (or at best only appropriate for some fraction of the domain). The first example we stumbled on was NETtalk, Terry Sejnowski and Charlie Rosenberg's neural net system for pronouncing English words: MBR gave dramatically higher generalization accuracy than did NETtalk. The reason is, I think, closely related to the fact that English pronunciation is patterned, but has a vast number of exceptions. Neural nets (or rule-based systems) can capture the regularities, but only a finite portion of the domain is regular, and after some point, one may need to add a new rule (or hidden unit) for each new example. The domain where the number of rules is proportional to the number of examples is, by definition, MBR. To give one concrete example: NETtalk never learned to pronounce "psychology" correctly because there was only one item (out of 4000) in the training set that had initial "psy..." and therefore no hidden unit developed to recognize this possibility. With MBR, even a single example that matches exactly dominates all others, so MBR was able to find and respond appropriately to "psy..." I suspect that many domains have this property of a few common patterns that can be captured by rules, shading into large numbers of exceptions and idiosyncratic examples: medical diagnosis, software or hardware troubleshooting, linguistic structures of actual utterances, motifs for protein structures (another domain we've looked at closely), etc. all have this form. This form is similar to Zipf's law for the relative occurrence of various words in a language's vocabulary (Zipf's law states that "The relative frequency of a word is inversely proportional to the rank of the word." "Rank" refers to a word's order of frequency in the language; "a" and "the" have ranks 1 and 2, whereas very rare words have high rank.) The upshot is that, for many domains of interest, we may need MBR-like abilities to handle the residue of the domain that can't be captured by a small number of rules. Of course, since MBR can work as a rule-based system as well, it is useful as a uniform, general method. Perhaps the best option would be an MBR that learned to find the minimum number of examples that would cover a domain, using clustering, duplicate removal, and other learning methods.
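One step in that direction can be sketched directly: grow a condensed case base that keeps only the examples the current case base would misclassify, in the spirit of Hart's condensed nearest neighbor rule and the Kibler and Aha editing mentioned earlier. The details below are an illustrative guess, not an algorithm from any of the systems discussed.

    def condense_case_base(cases, classify):
        """Keep a case only if the cases retained so far would already
        misclassify it; boundary and exceptional cases survive, while
        cases in the interior of uniform regions are discarded."""
        kept = [cases[0]]
        for features, label in cases[1:]:
            if classify(features, kept) != label:
                kept.append((features, label))
        return kept

Run with the earlier mbr_classify sketch as the classifier, repeated passes over the training set would shrink the case base while preserving exactly the exceptions (like "psy...") that compact rule sets miss.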
Until we better understand learning methods and the domains to which we want to apply them, the best methodology may well be to try a variety of learning methods, and then keep the one that gives the best results within budget limitations. I don't want to leave the impression that MBR is a panacea: far from it. Any given learning method will be best for at least some domains. It is an important research goal to understand and generate a priori guidelines for matching a learning method with a particular domain.
References will be provided in a future, fuller account of these ideas.