Using Cases to Represent Context for Text Classification

From: AAAI Technical Report SS-93-07. Compilation copyright © 1993, AAAI (www.aaai.org). All rights reserved.
Ellen Riloff
Department of Computer Science
University of Massachusetts
Amherst, MA 01003
riloff@cs.umass.edu
Abstract
Research on text classification has typically focused on keyword searches and statistical techniques. Keywords alone cannot always distinguish the relevant from the irrelevant texts, and some relevant texts do not contain any reliable keywords at all. Our approach to text classification uses case-based reasoning to represent natural language contexts that can be used to classify texts with extremely high precision. The case base of natural language contexts is acquired automatically during sentence analysis using a training corpus of texts and their correct relevancy classifications. A text is represented as a set of cases and we classify a text as relevant if any of its cases are deemed to be relevant. We rely on the statistical properties of the case base to determine whether similar cases are highly correlated with relevance for the domain. Preliminary experiments suggest that case-based text classification can achieve very high levels of precision and outperforms our previous algorithms based on relevancy signatures.
Introduction
Text classification has received considerable attention from the information retrieval community, which has typically approached the problem using keyword searches and statistical techniques [Salton 1989]. However, text classification can be difficult because keywords are often not enough to distinguish the relevant from the irrelevant texts. In previous work [Riloff and Lehnert 1992], we showed that natural language processing techniques can be used to generate relevancy signatures which are highly effective at retrieving relevant texts with extremely high precision. Although relevancy signatures capture more information than keywords alone, they still represent only a local context surrounding important words or phrases in a sentence. As a result, they are susceptible to false hits when a correct relevancy discrimination depends upon additional context. Furthermore, many relevant texts cannot be recognized by searching for only key words or phrases. A sentence may contain an abundance of information that clearly makes a text relevant, but does not contain any particular word or phrase that is highly indicative of relevance on its own.
In response to these limitations, we began to explore case-based reasoning as an approach for naturally representing larger pieces of context. Given a training corpus of texts and their correct relevancy classifications, we create a case base that serves as a database of contexts for thousands of sentences from hundreds of texts. Each case represents the natural language context associated with a single sentence. The correct relevancy classifications associated with our training corpus give us the correct classification for each text, but not for each case. Therefore we cannot merely retrieve the most similar case and apply its classification to the current case. Instead, we retrieve many cases and rely on the statistical properties of the retrieved cases to make an informed decision. Our approach therefore differs from many other CBR systems in that we do not retrieve one or a few very similar cases and apply them directly to the current case (e.g., [Ashley 1990; Hammond 1986]). Instead, we retrieve many cases that are similar to the current case in specific ways and use the statistical properties of the cases as part of a classification algorithm.
In this paper, we show that case-based reasoning is a natural technique for representing and reasoning with natural language contexts for the purpose of text classification. First, we briefly describe the MUC performance evaluations of text analysis systems that stimulated our research on text classification. Second, we introduce relevancy signatures and explain how natural language processing techniques can be used effectively for high-precision text classification. Third, we describe our case representation and present the details of the CBR algorithm. We conclude with preliminary test results from an experiment to evaluate the performance of case-based text classification.
Text Classification in MUC-4
Although text classification has typically been addressed by researchers in the information retrieval community, it is an interesting and important problem for natural language processing (NLP) researchers as well. In 1991 and 1992, the UMass natural language processing group participated in the Third and Fourth Message Understanding Conferences (MUC-3 and MUC-4) sponsored by DARPA. The task for both MUCs was to extract information about terrorism from news wire articles. The information extraction task, however, subsumes the more general problem of text classification. That is, we should know whether a text contains relevant information before we extract anything. We could depend on the information extraction system to do text classification as a side effect: if the system extracts any information at all then we classify the text as relevant. But the information extraction task requires natural language processing capabilities that are costly to apply, even within the constraints of a limited domain. In conjunction with a highly reliable text classification algorithm, however, the text extraction system could be invoked only when a text was classified as relevant, thereby saving the costs associated with applying the full NLP system to each and every text.1 In addition, information extraction systems generate false hits when they extract information that appears to be relevant to the domain, but is not relevant when the larger context of the text is taken into account. By employing a text classification algorithm at the front end, we could ease the burden on the information extraction system by filtering out many irrelevant texts that might lead it astray.
Motivated by our newfound appreciation for text classification, we began working on this task using the MUC-4 corpus as a testbed. The MUC-4 corpus consists of 1700 texts in the domain of Latin American terrorism, as well as a set of templates (answer keys) for each of the texts. These texts were drawn from a database constructed and maintained by the Foreign Broadcast Information Service (FBIS) of the U.S. government and come from a variety of news sources including wire stories, transcripts of speeches, radio broadcasts, terrorist communiqués, and interviews (see [MUC-4 Proceedings 1992]). The templates (answer keys) were generated by the participants of MUC-3 and MUC-4. Each template contains the correct information corresponding to a relevant incident described in a text, i.e. the information that should be extracted for the incident. If a text describes multiple relevant incidents, then the answer keys contain one template for each incident; if a text describes no relevant incidents then the answer keys contain only a dummy template for the text with no information inside. For the purpose of text classification, we use these answer keys to determine the correct relevancy classification of a text. If we find any filled templates among the answer keys for a text then we know that the text contains at least one relevant incident. However, if we find only a dummy template then the text contains no relevant incidents and is irrelevant. Roughly 50% of the texts in the MUC-4 corpus are relevant.
1 Although our text classification algorithm relies on the sentence analyzer, a complete natural language processing system usually has additional components, such as discourse analysis, that would not need to be applied.
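
The answer keys are used here only to derive training labels. A minimal sketch of that bookkeeping is shown below; the template representation and the "is_dummy" flag are assumptions made for illustration, not the actual MUC-4 key file format.

    def text_is_relevant(answer_key_templates):
        """A text is relevant if its answer keys contain at least one filled
        (non-dummy) template; a text whose keys hold only a dummy template is
        irrelevant. The 'is_dummy' flag is a hypothetical stand-in for however
        dummy templates are marked in the key files."""
        return any(not t["is_dummy"] for t in answer_key_templates)

    # Hypothetical usage on two texts (document IDs are illustrative only):
    answer_keys = {
        "DOC-0001": [{"is_dummy": False, "incident_type": "BOMBING"}],
        "DOC-0002": [{"is_dummy": True}],
    }
    labels = {doc: text_is_relevant(keys) for doc, keys in answer_keys.items()}
    # labels == {"DOC-0001": True, "DOC-0002": False}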
Relevancy Discriminations and Selective Concept Extraction
During the course of MUC-3 and MUC-4, we manually skimmed hundreds of texts in an effort to better understand the information extraction task and to carefully monitor the performance of our system. As we scanned the texts, we noticed that some texts are much easier to classify than others. Although the MUC-4 domain is Latin American terrorism, an extensive set of domain guidelines specifies the details of the domain. For example, events that occurred more than 2 months before the wire date, general event descriptions, and incidents that involve only military targets were not considered to be relevant incidents.

There will always be some texts that are difficult to classify because of exceptions or because they fall into gray areas that are not covered by the domain guidelines. But despite detailed criteria, many relevant texts can be identified quickly and accurately because they contain an event description that clearly falls within the domain guidelines. Often, the presence of a key phrase or sentence is sufficient to confidently classify a text as relevant. For example, the phrase "was assassinated" almost always describes a terrorist event in the MUC-4 corpus. Furthermore, as soon as a relevant incident is identified, the text can be classified as relevant regardless of what appears in the remainder of the text. Therefore, once we hit on a key phrase or sentence, we can completely ignore the rest of the text.2 This property makes the technique of selective concept extraction particularly well-suited for the task of relevancy discriminations.

2 This is not always the case, particularly when the domain description contains exceptions such as the ones mentioned above. Our approach assumes that these exceptional cases are relatively infrequent in the corpus.
Selective concept extraction is a form of sentence analysis realized by the CIRCUS parser [Lehnert 1990]. The strength behind selective concept extraction is its ability to ignore large portions of text that are not relevant to the domain while selectively bringing in domain knowledge when a key word or phrase is identified. In CIRCUS, selective concept extraction is achieved on the basis of a domain-specific dictionary of concept nodes that are used to recognize phrases and expressions that are relevant to the domain of interest. Concept nodes are case frames that are triggered by individual lexical items but activated only in the presence of specific linguistic expressions. For example, in our dictionary the word "dead" triggers two concept nodes: $found-dead-pass$ and $left-dead$. When the word "dead" is found in a text, both concept nodes are triggered. However, $found-dead-pass$ is activated only if the word preceding "dead" is the verb "found" and the verb appears in a passive construction. Similarly, the concept node $left-dead$ is activated only if the word "dead" is preceded by the verb "left" and the verb appears in an active construction. Therefore, at most one of these concept nodes will be activated depending upon the surrounding linguistic context. In general, if a sentence contains multiple triggering words then multiple concept nodes may be activated; however, if a sentence contains no triggering words then no output is generated for the sentence. The UMass/MUC-4 dictionary was constructed specifically for MUC-4 and contains 389 concept node definitions (see [Lehnert et al. 1992] for a description of the UMass/MUC-4 system).
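
As a rough illustration of the trigger/activation distinction (not the actual CIRCUS data structures, which are not described here), a concept node might be sketched as follows; the class layout and the crude activation test are assumptions made only for this example.

    class ConceptNode:
        """A hypothetical, simplified concept node: triggered by a lexical item,
        but activated only when a linguistic test succeeds on the sentence."""
        def __init__(self, name, trigger, activation_test):
            self.name = name                      # e.g. "$found-dead-pass$"
            self.trigger = trigger                # lexical item that triggers the node
            self.activation_test = activation_test

        def activate(self, tokens, position):
            return self.activation_test(tokens, position)

    # "dead" triggers $found-dead-pass$, which is activated only when "dead" is
    # preceded by "found" in a passive construction (crudely approximated here).
    found_dead_pass = ConceptNode(
        "$found-dead-pass$", "dead",
        lambda toks, i: i >= 2 and toks[i-1] == "found" and toks[i-2] in ("was", "were"))

    tokens = "three peasants were found dead".split()
    print(found_dead_pass.activate(tokens, tokens.index("dead")))   # True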
Relevancy Signatures
As we noted earlier, many relevant texts can be identified by recognizing a single but highly relevant phrase or expression, e.g. "X was assassinated". However, keywords are often not enough to reliably identify relevant texts. For example, the word "dead" is a very strong relevancy cue for the terrorism domain in some contexts but not in others. Phrases of the form "<person> was found dead" are highly associated with terrorism in the MUC-4 corpus because "found dead" has an implicit connotation of foul play; in fact, every text in the MUC-4 corpus that contains an expression of this form is a relevant text. However, expressions of the form "left <number> dead", as in "the attack left 49 dead", are often used in military event descriptions that are not terrorist in nature. Therefore these phrases are not highly correlated with relevant texts in the MUC-4 corpus. As a consequence, the keyword "dead" is not a good indicator of relevance on its own, even though certain expressions containing the word "dead" are very strong relevancy cues that can be used reliably to identify relevant texts.
In previous work, we introduced the notion of a relevancy signature to recognize key phrases and expressions that are strongly associated with relevance for a domain. A signature is a pair consisting of a lexical item and a concept node that it triggers, e.g. <placed, $loc-val-pass-1$>. Together, this pair recognizes the set of expressions of the form "<weapon> <auxiliary> placed", such as "a bomb was placed" or "dynamite sticks were placed". The word "placed" triggers the concept node $loc-val-pass-1$ which is activated only if it appears in a passive construction and the subject of the clause is a weapon. Note that different words can trigger the same concept node and the same word can trigger multiple concept nodes. For example, the signature <planted, $loc-val-pass-1$> recognizes passive constructions such as "a bomb was planted" and <planted, $loc-val-1$> recognizes active constructions such as "the terrorists planted a bomb". Finally, relevancy signatures are signatures that are highly correlated with relevant texts in a training corpus. The presence of a single relevancy signature in a text is often enough to classify the text as relevant.
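
The signature-extraction procedure is not given in code here, but a minimal sketch of the idea (count how often each <word, concept node> pair occurs in relevant texts versus all training texts, keep the strongly correlated ones) could look like this; the input format and the threshold values are placeholders, not the settings used in [Riloff and Lehnert 1992].

    from collections import defaultdict

    def relevancy_signatures(training_texts, min_ratio=0.9, min_count=5):
        """Sketch: a relevancy signature is a <trigger word, concept node> pair
        that is highly correlated with relevant texts in the training corpus.
        Each element of training_texts is assumed to be
        (is_relevant, [(word, concept_node_name), ...])."""
        totals = defaultdict(int)
        relevant = defaultdict(int)
        for is_relevant, signatures in training_texts:
            for sig in set(signatures):          # count each signature once per text
                totals[sig] += 1
                if is_relevant:
                    relevant[sig] += 1
        return {sig for sig, n in totals.items()
                if n >= min_count and relevant[sig] / n >= min_ratio}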
In a set of experiments with the MUC-3 corpus [Riloff and Lehnert 1992], we showed that relevancy signatures can be highly effective at identifying relevant texts with high precision. However, we noticed that many of their errors were due to relevancy classifications that depended upon additional context surrounding the key phrases. For example, two very similar sentences, (1) "a civilian was killed by guerrillas" and (2) "a soldier was killed by guerrillas", are represented by the same signature <killed, $kill-pass-1$>. However, (1) describes a relevant incident and (2) does not because our domain definition specifies that incidents involving military targets are not acts of terrorism. Relevancy signatures alone cannot make distinctions that depend on the slot fillers inside the concept nodes, such as the victims and perpetrators.
With this in mind, we extended our work on relevancy signatures by augmenting them with slot filler information from inside the concept nodes. Whereas relevancy signatures represent the existence of concept nodes, augmented relevancy signatures represent entire concept node instantiations. The augmented relevancy signatures algorithm [Riloff and Lehnert 1992] classifies texts by looking for both a strong relevancy signature and a reliable slot filler inside the concept node. For example, "a civilian was killed" activates a murder concept node that extracts the civilian as its victim. The augmented relevancy signatures algorithm will classify this as a highly relevant expression only if its signature <killed, $kill-pass-1$> is highly correlated with relevance and its slot filler (the civilian victim) is highly correlated with relevance as well.

Additional experiments with augmented relevancy signatures demonstrated that they can achieve higher precision text classification than relevancy signatures alone [Riloff and Lehnert 1992]. However, augmented relevancy signatures can still fall short when relevancy discriminations depend on multiple slot fillers in a concept node or on information that is scattered throughout a sentence. Furthermore, some relevant sentences do not contain any key phrases that are highly associated with relevance for a domain. Instead, a sentence may contain many pieces of information, none of which is individually compelling, but which in total clearly describe a relevant incident. For example, consider this sentence:

Police sources have confirmed that a guerrilla was killed and two civilians were wounded this morning during an attack by urban guerrillas.
This sentence clearly describes a terrorist incident even though it does not contain any key words or phrases that are highly indicative of terrorism individually. The important action words, "killed", "wounded", and "attack", all refer to a violent event but not necessarily a terrorist event; e.g., people are often killed, wounded, and attacked in military incidents (which are prevalent in the MUC-4 corpus). Similarly, "guerrillas" is a reference to terrorists, but "guerrillas" are mentioned in many texts that do not describe a specific terrorist act. Collectively, however, the entire context clearly depicts a terrorist event because it contains relevant action words, relevant victims (civilians), and relevant perpetrators (guerrillas). We use all of this information to conclude that the incident is relevant. These observations prompted us to pursue a different approach to text classification using case-based reasoning. By representing the contexts of entire sentences with cases, we can begin to address many of the issues that were out of our reach using only relevancy signatures.
Case-based Text Classification
The motivation behind our case-based approach is twofold: (1) to more accurately classify texts and (2) to classify texts that are inaccessible via techniques that focus only on key words or phrases. We will present the CBR algorithm in several steps. First, we describe our case representation and explain how we generate cases from texts. Second, we introduce the concept of a relevancy index and show how these indices allow us to retrieve cases that might have been inaccessible via key words or phrases alone. And finally, we present the details of the algorithm itself.

The Case Representation
We represent a text as a set of cases using the following procedure. Given a text to classify, we analyze each sentence using CIRCUS and collect the concept nodes that are produced during sentence analysis. Then we create a case for each sentence by merging all of the concept nodes produced by the sentence into a single structure. A case is represented as a frame with five slots: signatures, perpetrators, victims, targets, and instruments. The signatures slot contains the signatures associated with each concept node in the sentence. The remaining four slots represent the slot fillers that were picked up by these concept nodes.3 Although the concept nodes extract specific strings from the text, e.g. "the guerrillas", a case retains only the semantic features of the fillers, e.g. terrorist, in order to generalize over the specific strings that appeared in the text. A sample sentence, the concept nodes produced by the sentence, and its corresponding case are shown below:

Two vehicles were destroyed and an unidentified office of the agriculture and livestock ministry was heavily damaged following the explosion of two bombs yesterday afternoon.

This sentence generates three concept nodes:

(1) $DESTRUCTION-PASS-1$ (triggered by "destroyed")
    TARGET = two vehicles
(2) $DAMAGE-PASS-1$ (triggered by "damaged")
    TARGET = an unidentified office of the agriculture and livestock ministry
(3) $WEAPON-BOMB$ (triggered by "bombs")

These concept nodes result in the following case:

CASE
  SIGNATURES = (<destroyed, $destruction-pass-1$>
                <damaged, $damage-pass-1$>
                <bombs, $weapon-bomb$>)
  PERPETRATORS = nil
  VICTIMS = nil
  TARGETS = (govt-office-or-residence transport-vehicle)
  INSTRUMENTS = (bomb)

Note that our case representation does not preserve the associations between the concept nodes and their fillers, e.g. the case doesn't tell us whether the govt-office-or-residence was damaged and the transport-vehicle destroyed or vice versa. We purposely disassociated the fillers from their concept nodes so that we can look for associations between concept nodes and their own fillers as well as fillers from other concept nodes. For example, every case in our case base that contains the signature <killing, $murder-1$> and a govt-office-or-residence target came from a relevant text. The word "killing" does not necessarily describe a terrorist event, but if the incident also involves a govt-office-or-residence target4 then the target seems to provide additional evidence that suggests terrorism. But only human targets can be murdered, so the concept node $murder-1$ will never contain any physical targets. Therefore, the govt-office-or-residence must have been extracted by a different concept node from elsewhere in the sentence. The ability to pair slot fillers with any signature allows us to recognize associations that may exist between pieces of information from different parts of the sentence.

3 We chose these four slots because the concept nodes in our UMass/MUC-4 system [Lehnert et al. 1992] use these four types of slots to extract information about terrorism. In general, a case should have one slot for each possible type of concept node slot.

4 Government officials and facilities are frequently the targets of terrorist attacks in the MUC-4 corpus.
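
To make the case representation concrete, here is a minimal sketch (with hypothetical data structures; CIRCUS itself is not reproduced) of how a sentence's concept nodes could be merged into a single case frame with the five slots described above.

    SLOT_NAMES = ("perpetrators", "victims", "targets", "instruments")

    def build_case(concept_nodes):
        """Merge the concept nodes produced for one sentence into a single case.
        Each concept node is assumed to look like:
          {"signature": ("destroyed", "$destruction-pass-1$"),
           "slots": {"targets": ["transport-vehicle"]}}
        where slot fillers have already been reduced to semantic features."""
        case = {"signatures": [], **{slot: [] for slot in SLOT_NAMES}}
        for node in concept_nodes:
            case["signatures"].append(node["signature"])
            for slot, fillers in node.get("slots", {}).items():
                case[slot].extend(fillers)   # fillers are disassociated from their nodes
        return case

    # The sample sentence above would yield:
    nodes = [
        {"signature": ("destroyed", "$destruction-pass-1$"),
         "slots": {"targets": ["transport-vehicle"]}},
        {"signature": ("damaged", "$damage-pass-1$"),
         "slots": {"targets": ["govt-office-or-residence"]}},
        {"signature": ("bombs", "$weapon-bomb$"),
         "slots": {"instruments": ["bomb"]}},
    ]
    case = build_case(nodes)
    # case["targets"] == ["transport-vehicle", "govt-office-or-residence"]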
Relevancy Indices
Since a text is represented as a set of cases, our goal is to determine if any of its cases are relevant to the domain. We will deem a case to be relevant if it is similar to other cases in the case base that were found in relevant texts. However, an individual case contains many signatures and slot fillers that could be used to index the case base. It is unlikely that we will find many exact matches by indexing on all of these features, so we use the notion of a relevancy index to retrieve cases that share particular properties with the current case. A relevancy index is a triple of the form: <signature, slot filler pair, case outline>. A slot filler pair consists of a slot name and a semantic feature associated with legitimate slot fillers for the slot, e.g. (perpetrators terrorist). A case outline is a list of slots in the case that have fillers. For example, the case outline (perpetrators victims) means that these two slots are filled but the remaining slots, targets and instruments, are not. The signature slot is always filled so we do not include it as part of the case outline.
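
A rough sketch of how the relevancy indices for a case could be enumerated (one triple per signature/slot-filler combination, all sharing the case's outline) follows; it uses the same assumed representation as the build_case sketch above, not the actual implementation.

    def case_outline(case):
        """The case outline lists which of the four filler slots are non-empty;
        the signature slot is always filled and is not included."""
        return tuple(slot for slot in SLOT_NAMES if case[slot])

    def relevancy_indices(case):
        """Generate every <signature, (slot, semantic feature), case outline>
        triple that can be built from the case."""
        outline = case_outline(case)
        return [(signature, (slot, feature), outline)
                for signature in case["signatures"]
                for slot in SLOT_NAMES
                for feature in case[slot]]

    # For the sample case above, one of the generated indices would be
    # (("damaged", "$damage-pass-1$"), ("targets", "transport-vehicle"),
    #  ("targets", "instruments"))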
This index represents much of the context of the sentence. As we explained earlier, a signature represents a set of phrases that activate a specific concept node. By indexing on both a signature and a slot filler, we are retrieving cases that represent similar phrases and similar slot fillers. Intuitively, this means that the retrieved cases share both a key phrase and a key piece of information that was extracted from the text. The ability to index on both signatures and slot fillers simultaneously allows us to recognize relevant cases that might be missed by indexing on only one of them. For example, the word "assassination" is highly indicative of relevance in our domain because assassinations in Latin America are almost always terrorist in nature. However, the word "assassination" by itself is not necessarily a good keyword because many texts contain generic references to an assassination but do not describe a specific event. Therefore we also need to know that the text mentioned a specific victim, although almost any reference to a victim is good enough. Even a vague description of a victim such as "a person" is enough to confidently classify the text as relevant because the word "assassination" carries so much weight. On the other hand, a phrase such as "the killing of a person" is not enough to classify a text as relevant because the word "killing" is not a strong enough relevancy cue for terrorism to offset the vagueness of the victim description.
Similarly, some slot fillers are highly indicative of relevance in the MUC-4 corpus regardless of the event type. For example, when a government official is the victim of a crime in Latin America, it is almost always the result of a terrorist action. Of course, not every mention of a government official in a text signals relevance, although almost any reference to a government official as the victim of a crime is generally good enough. Therefore a phrase such as "a government official was killed" is enough to confidently classify a text as relevant because government officials are so often the victims of terrorism. Alternatively, a phrase such as "a person was killed" is not enough to classify a text as relevant since the person could have been killed in many ways that had nothing to do with terrorism. The augmented relevancy signatures algorithm also classifies texts on the basis of both signatures and slot fillers, but both the signature and slot filler must be highly correlated with relevance independently. As a result, augmented relevancy signatures cannot recognize relationships in which a signature and slot filler together represent a relevant situation even though one or both of them are not highly associated with relevance by themselves.
Finally, the third part of a relevancy index is the case outline. A case outline represents the set of slots that are filled in a case. The purpose of the case outline is to allow different signatures to require varying amounts of information. For example, some words such as "assassination" are so strongly correlated with relevance that we don't need much additional information to confidently assume that a terrorist event is being described. As we described above, the presence of almost any victim is enough to imply relevance. In particular, we don't need to verify that the perpetrator was a terrorist -- the connotations associated with "assassination" are strong enough to make that assumption safely. However, other words, e.g. "killed", do not necessarily refer to a terrorist act and are often used in generic event descriptions such as "many people have been killed in the last year". The presence of a perpetrator, therefore, can signal the difference between a specific incident and a general event description. To illustrate the power of the case outline, consider the following statistics drawn from a case base constructed from 1500 texts. We retrieved all cases from the case base that contained the following relevancy indices and calculated the percentage of the retrieved cases that came from relevant texts:
(<assassination, $murder-1$>, (victims civilian), (victims))                 100%
(<killed, $kill-pass-1$>, (victims civilian), (victims))                      68%
(<killed, $kill-pass-1$>, (victims civilian), (victims perpetrators))        100%
These statistics clearly show that the "assassination" of civilian victim(s) is strongly correlated with relevance (100%) even without any reference to a perpetrator. But texts that contain civilian victim(s) who are "killed" are not highly correlated with relevance (68%) since the word "killed" is much less suggestive of terrorism. However, texts that contain civilian victim(s) who are "killed" and name a specific perpetrator are highly correlated with relevance (100%). To reliably depend on the word "killed", the presence of a known perpetrator is critical.

By combining a signature, slot filler pair, and case outline into a single index, we retrieve cases that share similar key phrases, at least one similar piece of extracted information, and contain roughly the same amount of information. However, most cases contain many signatures and many slot fillers so we are still ignoring many potentially important differences between the cases. The CBR algorithm described below shows how we try to ensure that these differences are not likely to be important by (1) relying on statistics to assure us that we've hit on a highly reliable index and (2) employing secondary signature and slot filler checks to ferret out any obvious irrelevancy cues that might turn an otherwise relevant case into an irrelevant one.
The Case-Based Text Classification Algorithm
A text is represented as a set of cases, one case for each sentence that produced at least one concept node. To classify a text, we classify each of its cases in turn until we find a case that is deemed to be relevant or until we exhaust all of its cases. As soon as we identify a relevant case, we immediately classify the entire text as relevant. The strength of the algorithm, therefore, rests on its ability to identify a relevant case. We classify a case as relevant if and only if the following three conditions are satisfied:

Condition 1: The case contains a strong relevancy index.
Condition 2: The case does not have any "bad" signatures.
Condition 3: The case does not have any "bad" slot fillers.
Condition 1 is the heart of the algorithm; conditions 2 and 3 are merely secondary checks to recognize irrelevancy cues, i.e. signatures or slot fillers that might turn an otherwise relevant case into an irrelevant one. We'll give some examples of these situations later. First, we'll explain how we use the relevancy indices. Given a target case to classify, we generate each possible relevancy index for the case and, for each index, retrieve all cases from the case base that share the same index. Then we calculate two statistics over the set of retrieved cases: [1] the total number of retrieved cases (N) and [2] the number of retrieved cases that are relevant (NR). The ratio NR/N gives us a measure of the association between the retrieved cases and relevance, e.g. .85 means that 85% of the retrieved cases are relevant. If this relevancy ratio is high then we assume that the index is a strong indicator of relevance and therefore the target case is likely to be relevant as well. More specifically, we use two thresholds to determine whether the retrieved cases satisfy our relevancy criteria: a relevancy threshold (R) and a minimum number of occurrences threshold (M). If the ratio (NR/N) >= R and N >= M then the cases, and therefore the relevancy index, satisfy Condition 1.
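
A sketch of the Condition 1 test, under the same assumed data structures as the earlier sketches; R and M are the relevancy and minimum-occurrences thresholds, and the concrete default values shown are placeholders, not the parameter settings used in the experiments.

    def satisfies_condition_1(target_case, case_base, R=0.9, M=10):
        """Condition 1: some relevancy index generated from the target case is
        shared by at least M stored cases, of which at least a fraction R came
        from relevant texts. case_base is assumed to map each index to a list of
        booleans, one per stored case containing that index (True = from a
        relevant text)."""
        for index in relevancy_indices(target_case):
            retrieved = case_base.get(index, [])
            N = len(retrieved)
            NR = sum(retrieved)
            if N >= M and NR / N >= R:
                return True        # a strong relevancy index was found
        return False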
The second threshold, M, is necessary because we must retrieve a reasonably large number of cases to feel confident that our statistics are meaningful. Each case in the case base is tagged with a relevancy classification denoting whether it was found in a relevant or irrelevant text. However, this classification does not necessarily apply to the case itself. If the case is tagged as irrelevant (i.e., it was found in an irrelevant text), then the case itself must be irrelevant. However, if the case is tagged as relevant (i.e., it was found in a relevant text) then we can't be sure whether the case is relevant or not. It might very well be a relevant case. On the other hand, it could be an irrelevant case that happened to be in a relevant text because another case in the text was relevant. This is a classic example of the credit assignment problem where we know the correct classification of the text but do not know which case(s) were responsible for that classification.
Because of the fuzziness associated with the case classifications, we cannot rely on a single case or even a small set of cases to give us definitive guidance. Therefore we cannot merely retrieve the most similar case, or a small set of extremely similar cases, as do many other CBR systems. Instead, we retrieve a large number of cases that share certain features with the target case and rely on the statistical properties of these cases to tell us whether these features are correlated with relevance. If a high percentage of the retrieved cases are classified as relevant, then we assume that these common features are responsible for their relevance and we should therefore classify the target case as relevant as well.
Finally, two additional conditions must be satisfied by the target case before we classify it as relevant. Conditions 2 and 3 make sure that the case does not contain any "bad" signatures or slot fillers that might indicate that the case is irrelevant despite other apparently relevant information. For example, the MUC-4 domain guidelines dictate that specific terrorist incidents are relevant but general descriptions of incidents are not. To illustrate the difference, the following sentence is irrelevant to our domain because it contains a summary description of many events but does not describe any single incident:

More than 100 people have died in Peru since 1980, when the Maoist Shining Path organization began its attacks and its wave of political violence.
CASE
  SIGNATURES = (<died, $die$>
                <wave, $generic-event-marker$>
                <attacks, $attack-noun-1$>)
  PERPETRATORS = nil
  VICTIMS = (human)
  TARGETS = nil
  INSTRUMENTS = nil
However, a similar sentence is relevant:

More than 100 people have died in Peru during 2 attacks yesterday.

CASE
  SIGNATURES = (<died, $die$>
                <attacks, $attack-noun-1$>)
  PERPETRATORS = nil
  VICTIMS = (human)
  TARGETS = nil
  INSTRUMENTS = nil
These sentences generate almost identical cases except that the first sentence contains a signature for a $generic-event-marker$ triggered by the word "wave" and the second one does not. Our system relies on a small set of special concept nodes like this one to recognize textual cues that signal a general event description. The presence of this single concept node indicates that the case is irrelevant, even though the rest of the case contains perfectly good information that would otherwise be relevant. Condition 2 of the algorithm recognizes signatures for concept nodes, such as this one, that are only weakly correlated with relevance and therefore might signal that the case is irrelevant.

Similarly, a single slot filler can turn an otherwise relevant case into an irrelevant one. For example, according to the MUC-4 domain guidelines, if the perpetrator of a crime is a civilian then the incident is not terrorist in nature. Therefore an incident with a terrorist perpetrator is relevant even though a similar incident with a civilian perpetrator is irrelevant. Condition 3 of the algorithm recognizes slot fillers that are weakly correlated with relevance and therefore might indicate that an event is irrelevant.
We use two additional thresholds, Isig and Islot, to identify "bad" signatures and slot fillers. To evaluate Condition 2, for each signature in the target case, we retrieve all cases that also contain that signature and compute NR and N. If the ratio (NR/N) >= Isig and N >= M then the cases satisfy Condition 2. To evaluate Condition 3, for each slot filler in the case, we retrieve all cases that also contain that slot filler and the same case outline. Similarly, we compute NR and N for the retrieved cases and if the ratio (NR/N) >= Islot and N >= M then the cases satisfy Condition 3.
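
Putting the three conditions together, a sketch of the overall case-based classification loop might look as follows, again using the hypothetical structures from the earlier sketches. The per-signature and per-slot-filler (NR, N) statistics are assumed to be precomputed from the case base, and signatures or fillers with fewer than M supporting cases are treated as inconclusive rather than "bad", which is one possible reading of the thresholds described above.

    def case_is_relevant(case, case_base, sig_stats, filler_stats,
                         R=0.9, M=10, Isig=0.5, Islot=0.5):
        """A case is relevant iff it has a strong relevancy index (Condition 1)
        and no 'bad' signatures (Condition 2) or 'bad' slot fillers (Condition 3).
        sig_stats[signature] and filler_stats[(slot, feature, outline)] are
        assumed to hold (NR, N) counts gathered from the case base."""
        if not satisfies_condition_1(case, case_base, R, M):
            return False
        outline = case_outline(case)
        for signature in case["signatures"]:                      # Condition 2
            NR, N = sig_stats.get(signature, (0, 0))
            if N >= M and NR / N < Isig:
                return False
        for slot in SLOT_NAMES:                                   # Condition 3
            for feature in case[slot]:
                NR, N = filler_stats.get((slot, feature, outline), (0, 0))
                if N >= M and NR / N < Islot:
                    return False
        return True

    def text_is_relevant_cbr(cases, case_base, sig_stats, filler_stats):
        """Classify a text as relevant as soon as any of its cases is relevant."""
        return any(case_is_relevant(c, case_base, sig_stats, filler_stats)
                   for c in cases)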
Conclusions
We have conducted preliminary experiments to compare the performance of case-based text classification with relevancy signatures. Using a case base containing 7032 cases derived from 1500 training texts and their associated answer keys from the MUC-4 corpus, we applied the algorithm to two blind test sets of 100 texts each. These are the same test sets, DEV-0401 and DEV-0801, that we used to evaluate the performance of relevancy signatures [Riloff and Lehnert 1992]. The most notable improvement was that our case-based algorithm achieved much higher precision for DEV-0801: for some parameter settings, case-based text classification correctly identified 41% of the relevant texts with 100% precision, whereas augmented relevancy signatures could correctly identify only 8% of the relevant texts with 100% precision and relevancy signatures alone could not obtain 100% precision on this test set at all. Performance on the other test set, DEV-0401, was comparable for both augmented relevancy signatures and the case-based algorithm, both of which slightly outperformed relevancy signatures alone. We suspect that we are seeing a ceiling effect on this test set since all of the algorithms do extremely well with it.
Our work differs from many CBR systems in that we do not retrieve a single best case or a few highly similar cases and apply them directly to the new case (e.g., [Ashley 1990; Hammond 1986]). Instead, we retrieve a large number of cases that share specific features with the new case and use the statistical properties of the retrieved cases as part of a classification algorithm. MBRtalk [Stanfill and Waltz 1986] and PRO [Lehnert 1987] are examples of other systems that have worked with large case bases. MBRtalk is similar to many CBR systems in that it applies a similarity metric to retrieve a few best matches (10) and uses these cases to determine a response. PRO, on the other hand, is closer in spirit to our approach because it relies on frequency data from the case base to drive the algorithm. PRO builds a network from the entire case base and applies a relaxation algorithm to the network to generate a response. The activation levels of certain nodes in the network are based upon frequency data from the case base. Although both PRO and our system rely on statistics from the case base, our approach differs in that we depend on the statistics, not only to find the best response, but also to resolve the credit assignment problem because the classifications of our cases are not always certain.
Finally, our case-based text classification algorithm is domain-independent so the system can be easily scaled up and ported to new domains by simply extending or replacing the case base. To generate a case base for a new domain, CIRCUS needs a domain-specific dictionary of concept nodes. Given a training corpus for the domain, we have shown elsewhere that the process of creating a domain-specific dictionary of concept nodes for text extraction can be successfully automated [Riloff 1993; Riloff and Lehnert 1993].5 However, our approach does assume that the lexicon contains semantic features, at least for the words that can be legitimate slot fillers for the concept nodes. Given only a set of training texts, their correct relevancy classifications, and a domain-specific dictionary, the case base is acquired automatically as a painless side effect of sentence analysis.
We should also point out that this approach does not depend on an explicit set of domain guidelines. The case base implicitly captures domain specifications which are automatically derived from the training corpus. This strategy simplifies the job of the knowledge engineer considerably. The domain expert only needs to generate a training corpus of relevant and irrelevant texts, as opposed to a complex set of domain guidelines and specifications.
As storage capacities grow and memory becomes cheaper, we believe that text classification will become a central problem for an increasing number of computer applications and users. And as more and more documents become accessible on-line, we expect that high-precision text classification will become paramount. Using relevancy indices, we retrieve texts that share similar key phrases, at least one similar piece of extracted information, and contain roughly the same amount of information. We are encouraged by our results so far which demonstrate that these rich natural language contexts can be used to achieve extremely accurate text classification.

5 Although the special concept nodes used to identify "bad" signatures for Condition 2 are an exception. The concept nodes that we acquire automatically are not designed to recognize expressions that are negatively correlated with the domain.
Acknowledgments
The author would like to thank Claire Cardie and Wendy Lehnert for providing helpful comments on earlier drafts of this paper. This research was supported by the Office of Naval Research Contract N00014-92-J-1427 and NSF Grant no. EEC-9209623, State/Industry/University Cooperative Research on Intelligent Information Retrieval.
Bibliography
Ashley, K. 1990. Modelling Legal Argument: Reasoning with Cases and Hypotheticals. Cambridge, MA. The MIT Press.

Hammond, K. 1986. CHEF: A Model of Case-Based Planning. In Proceedings of the Fifth National Conference on Artificial Intelligence, pp. 267-271. Morgan Kaufmann.

Lehnert, W. G. 1987. Case-Based Problem Solving with a Large Knowledge Base of Learned Cases. In Proceedings of the Sixth National Conference on Artificial Intelligence, pp. 301-306. Morgan Kaufmann.

Lehnert, W. G. 1990. Symbolic/Subsymbolic Sentence Analysis: Exploiting the Best of Two Worlds. In Advances in Connectionist and Neural Computation Theory. (Eds: J. Pollack and J. Barnden), pp. 135-164. Norwood, NJ. Ablex Publishing.

Lehnert, W., Cardie, C., Fisher, D., McCarthy, J., Riloff, E., and Soderland, S. 1992. University of Massachusetts: Description of the CIRCUS System as Used for MUC-4. In Proceedings of the Fourth Message Understanding Conference, pp. 282-288.

Proceedings of the Fourth Message Understanding Conference. 1992. San Mateo, CA. Morgan Kaufmann.

Riloff, E. 1993. Automatically Constructing a Dictionary for Information Extraction Tasks. Submitted to the Eleventh National Conference on Artificial Intelligence.

Riloff, E. and Lehnert, W. 1992. Classifying Texts Using Relevancy Signatures. In Proceedings of the Tenth National Conference on Artificial Intelligence, pp. 329-334. AAAI Press/The MIT Press.

Riloff, E. and Lehnert, W. 1993. Automated Dictionary Construction for Information Extraction from Text. To appear in Proceedings of the Ninth IEEE Conference on Artificial Intelligence for Applications.

Salton, G. 1989. Automatic Text Processing: The Transformation, Analysis, and Retrieval of Information by Computer. Reading, MA. Addison-Wesley.

Stanfill, C. and Waltz, D. 1986. Toward Memory-Based Reasoning. Communications of the ACM, Vol. 29, No. 12, pp. 1213-1228.