From: AAAI Technical Report SS-95-01. Compilation copyright © 1995, AAAI (www.aaai.org). All rights reserved.

Dictionary Requirements for Text Classification: A Comparison of Three Domains

Ellen Riloff
Department of Computer Science
University of Utah
Salt Lake City, UT 84112
riloff@cs.utah.edu
Abstract
The type of dictionary required for a natural language processing system depends on both the nature of the task and the domain. For example, an in-depth comprehension task probably requires more knowledge than an information retrieval task. Similarly, technical domains are fundamentally different from event-based domains and require different types of lexical knowledge. We explore these issues by comparing the performance of four text classification algorithms that use varying amounts of lexical knowledge. We tested the algorithms on three different domains: terrorism, joint ventures, and microelectronics. We found that the algorithms produced dramatically different results on each domain, suggesting that the nature of the domain strongly influences the types of knowledge required to achieve good performance.
Introduction
Knowledge-based natural language processing (NLP) systems use a dictionary to represent lexical knowledge associated with one or more domains. But what types of knowledge need to be represented? The answer to this question depends on both the task (the problem) and the domain (the subject). An information retrieval task may require substantially different types of lexical knowledge than an in-depth comprehension task. Similarly, domains have different properties that require varying levels of representation. For example, medical domains are typically dominated by technical words and jargon, while business-oriented domains generally use more common but varied language.
Text classification is an information retrieval task in which texts are assigned to one or more categories. We have developed three text classification algorithms that automatically categorize texts as either "relevant" to a given domain or "irrelevant". Our approach is based on a type of natural language processing called information extraction. Information extraction (IE) systems extract domain-specific information from text (see [MUC-3 Proceedings, 1991; MUC-4 Proceedings, 1992; MUC-5 Proceedings, 1993]). For example, an IE system developed for a terrorism domain might extract the names of perpetrators, victims, targets, weapons, dates, and locations of terrorist incidents. IE systems typically use a dictionary of domain-specific structures or patterns to identify and extract relevant information. The dictionary may be manually constructed, or automatically constructed using a training corpus (e.g., [Riloff, 1994; Riloff, 1993a; Kim and Moldovan, 1993]).
We conducted a set of experiments to compare the dictionary requirements for IE-based text classification in three different domains: terrorism, joint ventures, and microelectronics. First, we briefly describe three IE-based text classification algorithms that use varying amounts of extracted information to classify texts: the relevancy signatures algorithm, the augmented relevancy signatures algorithm, and a case-based text classification algorithm. The relevancy signatures algorithm classifies texts using simple phrases, the augmented relevancy signatures algorithm uses phrases and local semantic context, and the case-based algorithm uses larger pieces of context. Then we compare the results of these algorithms and a word-based algorithm across the three domains. The algorithms behaved very differently on each domain, suggesting that both the nature of the task and the domain play a crucial role in determining what types of lexical knowledge are most important.
IE-Based Text Classification
The IE-based text classification algorithms use a conceptual sentence analyzer called CIRCUS [Lehnert, 1991] to extract domain-specific information from text. CIRCUS uses a dictionary of structures called concept nodes to extract relevant information. The concept node dictionary used in the experiments with the terrorism domain was built manually, but the concept node dictionaries for the joint ventures and microelectronics domains were constructed automatically by a system called AutoSlog [Riloff, 1994; Riloff, 1993a] using a training corpus. In general, concept nodes recognize local linguistic patterns, including simple verb forms (active, passive, infinitive), and noun or verb phrases followed by prepositional phrases. Figure 1 shows a sample sentence and a concept node activated by the sentence. The concept node $MURDER-PASSIVE$ is activated by passive forms of verbs associated with murder, such as "X was murdered by Y", "X1 and X2 were murdered by Y", and "X has been murdered by Y". In this case, $MURDER-PASSIVE$ was activated by the verb "murdered" and extracted the three peasants as murder victims and the guerrillas as perpetrators.

Sentence: Three peasants were murdered by guerrillas.

$MURDER-PASSIVE$
    victim = "three peasants"
    perpetrator = "guerrillas"

Figure 1: An instantiated concept node
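To make this concrete, here is a minimal Python sketch of how a concept node like $MURDER-PASSIVE$ could be represented and instantiated. The data structures and trigger logic are illustrative assumptions, not the actual CIRCUS machinery.

```python
from dataclasses import dataclass

@dataclass
class ConceptNode:
    """A simplified concept node: a trigger word plus the voice of the
    verb form it recognizes."""
    name: str     # e.g., "$MURDER-PASSIVE$"
    trigger: str  # verb that activates the node, e.g., "murdered"
    voice: str    # "passive" or "active"

def activate(node, subject, by_or_object):
    """Instantiate a concept node for a clause built around the trigger:
    passive 'SUBJ was TRIGGER by X' or active 'SUBJ TRIGGER X'."""
    if node.voice == "passive":
        # Passive voice: the subject is the victim, the by-phrase the perpetrator.
        return {"concept": node.name,
                "victim": subject, "perpetrator": by_or_object}
    # Active voice: the subject is the perpetrator, the object the victim.
    return {"concept": node.name,
            "victim": by_or_object, "perpetrator": subject}

murder_passive = ConceptNode("$MURDER-PASSIVE$", "murdered", "passive")

# "Three peasants were murdered by guerrillas."
print(activate(murder_passive, "three peasants", "guerrillas"))
# {'concept': '$MURDER-PASSIVE$', 'victim': 'three peasants',
#  'perpetrator': 'guerrillas'}
```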
The first text classification algorithm is the relevancy signatures algorithm [Riloff and Lehnert, 1994; Riloff and Lehnert, 1992]. A signature is a word paired with a concept node; together, this pair represents a unique set of linguistic expressions. For example, the word "murdered" could be paired with one concept node to recognize passive verb forms (i.e., "X was murdered by Y") and a different concept node to represent active verb forms (i.e., "X murdered Y").

The first step of the algorithm is to analyze a preclassified training corpus using CIRCUS. The instantiated concept nodes produced by CIRCUS are transformed into signatures, and statistics are calculated to determine how often each signature appears in relevant texts. Thresholds are used to identify signatures that occur much more frequently in relevant texts than in irrelevant texts, and these are labeled as relevancy signatures. For example, we might assume that a signature represents an expression that is a good indicator for the domain if 90% of the occurrences of the signature are in relevant texts.
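As a rough sketch of the training statistics, assuming each training text has already been reduced to a set of signatures (the 90% threshold mirrors the example above; the minimum frequency cutoff is an assumption):

```python
from collections import Counter

def find_relevancy_signatures(training_docs, rel_threshold=0.90, min_freq=5):
    """Select signatures that occur in relevant texts at least
    rel_threshold of the time (and at least min_freq times overall).
    training_docs: iterable of (set_of_signatures, is_relevant) pairs."""
    total, relevant = Counter(), Counter()
    for signatures, is_relevant in training_docs:
        for sig in signatures:
            total[sig] += 1
            if is_relevant:
                relevant[sig] += 1
    return {sig for sig, n in total.items()
            if n >= min_freq and relevant[sig] / n >= rel_threshold}

# A signature pairs a word with a concept node: 9 relevant occurrences
# out of 10 puts this signature exactly at the 90% threshold.
docs = [({("murdered", "$MURDER-PASSIVE$")}, True)] * 9 + \
       [({("murdered", "$MURDER-PASSIVE$")}, False)]
print(find_relevancy_signatures(docs))
```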
The second algorithm uses both linguistic expressions and local semantic context to classify texts. The augmented relevancy signatures algorithm collects frequency statistics for signatures as well as semantic features associated with extracted information. Figure 2 shows the signatures and slot triples that represent the instantiated concept node $MURDER-PASSIVE$ in Figure 1. Each slot triple contains the event type, slot type, and semantic feature associated with a piece of extracted information. The first slot triple in Figure 2 represents the peasants, and the second slot triple represents the guerrillas. A domain-specific semantic feature dictionary is used to find the features associated with a noun phrase.
Signature: <murdered, $MURDER-PASSIVE$>

Slot Triples:
    (murder, victim, CIVILIAN)
    (murder, perpetrator, TERRORIST)

Figure 2: Signatures and Slot Triples
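A minimal sketch of how slot triples might be built from an instantiated concept node; the tiny semantic feature dictionary and the head-noun heuristic are assumptions for illustration:

```python
# Hypothetical domain-specific semantic feature dictionary mapping
# head nouns to semantic features.
SEMANTIC_FEATURES = {"peasants": "CIVILIAN", "guerrillas": "TERRORIST"}

def slot_triples(event_type, extracted):
    """Build (event, slot, feature) triples from extracted noun phrases,
    looking up the semantic feature of each phrase's head noun."""
    triples = []
    for slot, phrase in extracted.items():
        head = phrase.split()[-1]            # crude head-noun heuristic
        feature = SEMANTIC_FEATURES.get(head)
        if feature:
            triples.append((event_type, slot, feature))
    return triples

print(slot_triples("murder", {"victim": "three peasants",
                              "perpetrator": "guerrillas"}))
# [('murder', 'victim', 'CIVILIAN'), ('murder', 'perpetrator', 'TERRORIST')]
```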
The augmented relevancy signatures algorithm identifies signatures and slot triples that are highly correlated with relevant texts in the training corpus [Riloff and Lehnert, 1994; Riloff and Lehnert, 1992]. A new text is classified as relevant if it produces an instantiated concept node that represents both a signature and a slot triple that were independently highly correlated with relevance. Intuitively, a text is classified as relevant if it contains a linguistic expression and a piece of extracted information that are both strongly associated with the domain.
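The decision rule itself is a conjunction, which a short sketch makes explicit (the data layout is assumed):

```python
def classify_augmented(concept_nodes, relevancy_signatures, relevancy_triples):
    """Relevant iff some instantiated concept node yields BOTH a signature
    and a slot triple that were independently correlated with relevance
    during training. concept_nodes: list of (signature, [triple, ...])."""
    for signature, triples in concept_nodes:
        if signature in relevancy_signatures and \
                any(t in relevancy_triples for t in triples):
            return "relevant"
    return "irrelevant"

# Using the concept node from Figures 1 and 2:
nodes = [(("murdered", "$MURDER-PASSIVE$"),
          [("murder", "victim", "CIVILIAN"),
           ("murder", "perpetrator", "TERRORIST")])]
print(classify_augmented(nodes,
                         {("murdered", "$MURDER-PASSIVE$")},
                         {("murder", "victim", "CIVILIAN")}))  # relevant
```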
The third text classification algorithm uses multiple pieces of extracted information to represent larger natural language contexts. The case-based text classification algorithm [Riloff and Lehnert, 1994; Riloff, 1993b] collects all of the information extracted from a sentence and merges it into a single case structure. Figure 3 shows a sample sentence and the resulting case structure.
SENTENCE: Two vehicles were destroyed and an unidentified office of the agriculture and livestock ministry was heavily damaged following the explosion of two bombs yesterday afternoon.

CASE:
    Signatures: (<destroyed, $DESTRUCTION-PASSIVE$>,
                 <damaged, $DAMAGE-PASSIVE$>,
                 <bombs, $WEAPON-BOMB$>)
    Perpetrators: nil
    Victims: nil
    Targets: (GOVT-OFFICE-OR-RESIDENCE, VEHICLE)
    Instruments: (BOMB)

Figure 3: A sample sentence and corresponding case
In the training phase, the training corpus is analyzed by CIRCUS, the concept nodes produced by each sentence are merged into a case, and each case is stored in the case base. In the classification phase, a new text is processed by CIRCUS, a case is generated for each sentence, and the case base is consulted to determine whether it contains any similar cases. A case is considered to be similar if it shares a signature, a semantic feature representing extracted information, and the same types of information (e.g., a victim and a perpetrator). If a high percentage of similar cases came from relevant texts, then the case probably contains relevant information and is classified as relevant.

The case-based algorithm is similar to the augmented relevancy signatures algorithm except that the frequency statistics are computed for several pieces of information together. In contrast, the augmented relevancy signatures algorithm computes statistics for signatures and slot triples separately. By using rich natural language contexts, the case-based algorithm can classify texts that are inaccessible to the previous algorithms, because some sentences contain multiple pieces of information that are relevant together but not independently.
Experimental Results in Three Domains
We compared the performance of these algorithms, and a fourth word-based algorithm, on three different domains. The fourth algorithm, called the relevant words algorithm, is exactly the same as the relevancy signatures algorithm except that the statistics are computed for individual words instead of signatures.

Figure 4 shows the results in the terrorism domain on 1700 texts from the MUC-4 corpus. 53% of the texts were relevant according to the MUC-4 domain guidelines [MUC-4 Proceedings, 1992]. Each algorithm was evaluated using a 10-fold cross-validation design (see [Riloff, 1994; Riloff and Lehnert, 1994] for details). We used 7 different sets of threshold values to produce a spectrum of recall/precision results, so each data point represents one recall/precision point achieved by the algorithm. Recall and precision are standard metrics used in the information retrieval community. Recall measures the percentage of relevant texts that were correctly classified as relevant by the algorithm. Precision measures the percentage of texts classified as relevant that actually are relevant.

[Figure 4: Text classification results for the terrorism domain. Each point is one recall/precision result; W = relevant words, R = relevancy signatures, A = augmented relevancy signatures, C = case-based algorithm.]
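In code, the two metrics reduce to simple ratios over sets of document ids (a minimal helper, scaled to percentages as in the figures):

```python
def recall_precision(predicted, gold):
    """predicted: ids of texts classified as relevant;
    gold: ids of texts that actually are relevant."""
    true_pos = len(predicted & gold)
    recall = 100.0 * true_pos / len(gold) if gold else 0.0
    precision = 100.0 * true_pos / len(predicted) if predicted else 0.0
    return recall, precision

print(recall_precision({1, 2, 3, 4}, {1, 2, 3, 5}))  # (75.0, 75.0)
```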
The augmented relevancy signatures algorithm and the case-based algorithm achieved considerably higher combined recall and precision scores than the word-based and relevancy signatures algorithms. This is because context is crucially important for the terrorism domain. The definition of terrorism for this experiment stated that the perpetrator must be a terrorist and the target must be civilian.¹ To make these distinctions, two types of semantic information are essential: (1) the role of the object, e.g., whether someone is the perpetrator or victim of a crime, and (2) the semantics of the object itself, e.g., whether someone is a guerrilla or a civilian. The role of an object can be identified through local linguistic expressions; for example, "X was murdered" indicates that X is a murder victim. Isolated keywords (e.g., the word "murdered" alone) do not provide enough context to identify role objects, which explains why the word-based algorithm had difficulty. Furthermore, role objects can only be identified with context. For example, it is impossible to identify a victim or perpetrator simply by looking at their name: "John was murdered" indicates that John is a victim, while "John murdered Steve" indicates that John is a perpetrator.
Relevancy signatures also had some trouble because they don't represent the semantic features associated with role objects, but the augmented relevancy signatures algorithm and the case-based algorithm do. For example, both algorithms used a semantic feature dictionary to recognize "peasants" as CIVILIAN and "guerrillas" as TERRORIST. The additional context allows these algorithms to distinguish terrorist perpetrators (e.g., "a guerrilla") from civilian perpetrators (e.g., "a burglar"), and military targets (e.g., "an army base") from civilian targets (e.g., "the U.S. embassy").

¹The guidelines for the terrorism domain were specified by the MUC-4 organizers (see [MUC-4 Proceedings, 1992]).

[Figure 5: Text classification results for the joint ventures domain. Each point is one recall/precision result; W = relevant words, R = relevancy signatures, A = augmented relevancy signatures, C = case-based algorithm.]
Figure 5 shows the results for the joint ventures domain using a training corpus of 1200 texts.² 54% of the 1200 texts were relevant to the domain.³ In the joint ventures domain, all three IE-based algorithms performed well, and considerably better than the word-based algorithm.

The joint ventures domain is characterized by key words and phrases that are associated with joint venture activities, such as "joint-venture" and "tie-up". These words are strong indicators of joint venture activities, but the classification task was to label a text as relevant only if it mentioned a specific joint venture between two named partners. Therefore finding the word "joint-venture" was not enough to warrant a relevant classification, and consequently the word-based algorithm did not perform well.
To illustrate, we probed the corpus with several keywords that are associated with joint ventures and calculated recall and precision scores based upon how many relevant and irrelevant texts were retrieved. Figure 6 shows that some keywords and phrases performed well; for example, the phrase "joint venture"⁴ achieved 93.3% recall and 88.9% precision. However, many seemingly good keywords produced poor results. For example, the hyphenated word "joint-venture" achieved only 73.2% precision because it is often used as a modifier where the main concept is not a joint venture (as in "joint-venture legislation"). Interestingly, the plural word "ventures" produced only 58.8% precision while the singular form "venture" achieved 82.8% precision. This is because the plural form often refers to joint ventures in general (as in "there have been many joint ventures this year") and the singular form is usually used to describe a specific joint venture (as in "Toyota formed a joint venture with Nissan"). We observed this phenomenon in the terrorism domain as well, where words such as "assassinations" and "bombings" referred to these types of events in general but did not describe any specific incident.
Words              Recall   Precision
joint, venture     93.3%    88.9%
tie-up              2.5%    84.2%
venture            95.5%    82.8%
jointly            11.0%    78.9%
joint-venture       6.4%    73.2%
consortium          3.6%    69.7%
joint, ventures    19.3%    66.7%
partnership         7.0%    64.3%
ventures           19.8%    58.8%

Figure 6: Recall and precision scores for joint venture keywords
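Each keyword probe in Figure 6 amounts to a one-term classifier: retrieve every text containing the word(s) and score the retrieved set. A hedged sketch, assuming the corpus is a list of (text, is_relevant) pairs:

```python
def probe(corpus, words):
    """Score a keyword probe: a text matches if it contains every word
    in `words` (not necessarily adjacent, per Figure 6)."""
    retrieved = [(text, rel) for text, rel in corpus
                 if all(w in text.lower() for w in words)]
    relevant_total = sum(rel for _, rel in corpus)
    true_pos = sum(rel for _, rel in retrieved)
    recall = 100.0 * true_pos / relevant_total if relevant_total else 0.0
    precision = 100.0 * true_pos / len(retrieved) if retrieved else 0.0
    return recall, precision

corpus = [("toyota formed a joint venture with nissan", True),
          ("there have been many joint ventures this year", False)]
print(probe(corpus, ["joint", "venture"]))
```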
The relevancy signatures algorithm, however, represents linguistic expressions that include prepositions, such as "venture with" and "project between". Figure 7 shows the precision rates associated with similar patterns recognized by signatures. The most interesting result was that the presence of the prepositions produced much more effective classification terms. For example, "venture between" achieved 100% precision while "venture" only achieved 82.8% precision, and "tie-up with" achieved 100% precision while "tie-up" alone achieved only 84.2% precision. In this domain, the prepositions often indicate that a partner is mentioned, so the phrase usually refers to a specific joint venture. In some sense, prepositions such as "with" and "between" act as implicit pointers to the partners.

²The corpus contained 719 texts from the MUC-5 corpus [MUC-5 Proceedings, 1993] and 481 texts from the Tipster detection corpus [Tipster Proceedings, 1993] (see [Riloff, 1994] for details of how these texts were selected).
³According to the MUC-5 domain guidelines [MUC-5 Proceedings, 1993].
⁴Not necessarily in adjacent positions.
Signature Pattern              Precision
venture between <entity>       100.0%
venture with <entity>           95.9%
venture of <entity>             95.4%
venture by <entity>             90.9%
project between <entity>       100.0%
project with <entity>           75.0%
set up with <entity>            94.7%
set up by <entity>              66.7%
tie-up with <entity>           100.0%
<entity> construct             100.0%
<facility> was constructed      63.6%
<entity> form company          100.0%
<entity> set up company        100.0%
<entity> join forces           100.0%
<entity> agreed to form        100.0%

Figure 7: Relevancy percentages for joint venture signature patterns
The signatures also represented several expressions that do not contain any words strongly associated with joint ventures individually, but that are strongly associated with joint ventures collectively. For example, the phrases "form company", "set up company", "join forces", and "agreed to form" all achieved 100% precision. Consequently, the relevancy signatures algorithm performed much better than the word-based algorithm. However, the additional context used by the augmented relevancy signatures algorithm and the case-based algorithm did not greatly improve performance. For this classification task, it doesn't matter who the partners are, as long as partners exist.
Figure 8 shows the results for the microelectronics domain on a corpus of 500 texts, of which 57% were relevant.⁵ We did not evaluate the augmented relevancy signatures algorithm and the case-based algorithm on this domain because we did not have a semantic feature dictionary for microelectronics. In contrast to the previous experiments, the word-based algorithm outperformed the relevancy signatures algorithm in the microelectronics domain. In microelectronics, and presumably other technical domains, specialized words and jargon play a crucial role in representing the meaning of a text. For example, words like "multichip", "megabit", and "bipolar" are strongly indicative of microelectronics and performed very well as keywords. Furthermore, technical jargon is usually unambiguous and self-contained; that is, the surrounding words do not usually change its meaning. As a result, many microelectronics concepts can be recognized by looking for individual words.

⁵According to the MUC-5 guidelines. The texts were part of the MUC-5 corpus [MUC-5 Proceedings, 1993].

[Figure 8: Text classification results for the microelectronics domain. Each point is one recall/precision result; W = relevant words, R = relevancy signatures.]

Conclusions

Traditional information retrieval systems (e.g., [Turtle and Croft, 1991; Salton, 1971]) and text classification systems [Maron, 1961; Borko and Bernick, 1963; Hoyle, 1973] use isolated words or phrases to classify texts, but some classification tasks can benefit from additional linguistic information. Our experiments indicate that both the domain and the task influence the kinds of knowledge that a system needs. In both the terrorism and joint ventures domains, the IE-based text classification algorithms recognized role objects using simple linguistic patterns, which produced substantially better results than isolated words and phrases. Furthermore, semantic information associated with role objects must be represented if the task places semantic constraints on the domain. For example, the classification task in the terrorism domain specified that only certain types of perpetrators and victims were relevant. Finally, the experiments with the microelectronics domain suggest that isolated words can be effective in technical domains. We conclude that the dictionary requirements of an NLP system are strongly influenced by the domain and task to which it will be applied. Shallow lexical knowledge can produce good results for some problems, but more complex syntactic and semantic information must be represented to achieve good performance on others.
References

Borko, H. and Bernick, M. 1963. Automatic Document Classification. Journal of the ACM 10(2):151-162.

Hoyle, W. 1973. Automatic Indexing and Generation of Classification Systems by Algorithm. Information Storage and Retrieval 9(4):233-242.

Kim, J. and Moldovan, D. 1993. Acquisition of Semantic Patterns for Information Extraction from Corpora. In Proceedings of the Ninth IEEE Conference on Artificial Intelligence for Applications, Los Alamitos, CA. IEEE Computer Society Press. 171-176.

Lehnert, W. 1991. Symbolic/Subsymbolic Sentence Analysis: Exploiting the Best of Two Worlds. In Barnden, J. and Pollack, J., editors, Advances in Connectionist and Neural Computation Theory, Vol. 1. Ablex Publishers, Norwood, NJ. 135-164.

Maron, M. 1961. Automatic Indexing: An Experimental Inquiry. Journal of the ACM 8:404-417.

Proceedings of the Third Message Understanding Conference (MUC-3), 1991. San Mateo, CA: Morgan Kaufmann.

Proceedings of the Fourth Message Understanding Conference (MUC-4), 1992. San Mateo, CA: Morgan Kaufmann.

Proceedings of the Fifth Message Understanding Conference (MUC-5), 1993. San Francisco, CA: Morgan Kaufmann.
Riloff, E. and Lehnert, W. 1992. Classifying Texts Using Relevancy Signatures. In Proceedings of the Tenth National Conference on Artificial Intelligence. AAAI Press/The MIT Press. 329-334.

Riloff, E. and Lehnert, W. 1994. Information Extraction as a Basis for High-Precision Text Classification. ACM Transactions on Information Systems 12(3):296-333.

Riloff, E. 1993a. Automatically Constructing a Dictionary for Information Extraction Tasks. In Proceedings of the Eleventh National Conference on Artificial Intelligence. AAAI Press/The MIT Press. 811-816.

Riloff, E. 1993b. Using Cases to Represent Context for Text Classification. In Proceedings of the Second International Conference on Information and Knowledge Management (CIKM-93), New York, NY. ACM Press. 105-113.

Riloff, E. 1994. Information Extraction as a Basis for Portable Text Classification Systems. Ph.D. Dissertation, Department of Computer Science, University of Massachusetts, Amherst.

Salton, G., editor. 1971. The SMART Retrieval System: Experiments in Automatic Document Processing. Prentice Hall, Englewood Cliffs, NJ.

Proceedings of the TIPSTER Text Program (Phase I), 1993. San Francisco, CA: Morgan Kaufmann.

Turtle, H. and Croft, W. B. 1991. Efficient Probabilistic Inference for Text Retrieval. In Proceedings of RIAO 91.