Issues in Corpus-Trained Information Extraction

Ralph Grishman and Roman Yangarber
Computer Science Department
New York University
New York, NY
{grishman, roman}@cs.nyu.edu
1. Introduction
What is Information Extraction?
Information extraction is the process of
automatically identifying specified types of
entities, relations, or events in free text, and
recording this information in a structured form.
Particular examples include finding the names of
all the people and places in a document;
retrieving information on people's ages from a
document; and finding all mentions of murders
from a newspaper (including tabulating who was
killed, by whom, when, and where). The focus
of our interest in this paper will be on relations
and events, although (since events often involve
named people and groups) this is often
dependent on good name recognition.
Why do Information Extraction?
Information extraction means getting specific
facts from a document; it provides a qualitatively
different capability from the much more widely
used information retrieval technology, which can
find relevant documents, web pages, or (at best)
passages.
In some cases automated extraction can replace
(or speed) an extraction task which is currently
performed manually. For example, for various
types of medical research and quality assurance,
large numbers of medical reports containing
narrative information must be reduced to tabular
form, and some of this can be done through
automated extraction [Friedman 1995].
In many other areas, extraction has the potential
of providing much more precise access to
information in the literature than is possible with
keyword-based techniques: documents can be
retrieved based on specific relationships or
interactions.
Several such studies have been
conducted in the areas of biomedicine and
genomics [Humphreys 2000, Rindflesch 2000,
Sekimizu 1998]. By extracting information on a
range of common event types, it should similarly
be possible to provide detailed indexing of
current news stories.
2. How is Extraction Performed?
In simplest terms, the basic idea behind
information extraction is to look for one of a set
of patterns which convey the relation or event of
interest [Grishman 1997]. For example, if we
are looking for reports of people's income, we
might look for patterns of the form "person made
amount-of-money", or "person earned amount-ofmoney". Such patterns assume that we have preprocessed the text to recognize people's names
and currency amounts as units.
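As an illustration (a minimal sketch, not taken from any particular extraction system), such a word-level pattern can be approximated by a regular expression over text in which the name recognizer has already wrapped entity mentions in typed tags; the tag format and the pattern below are assumptions made for the sketch.

    import re

    # Assume name recognition has already wrapped entities in typed tags.
    tagged = "<PERSON>Fred Smith</PERSON> earned <MONEY>$3,000,000</MONEY> last year."

    # Word-level income pattern: "person made/earned amount-of-money".
    income = re.compile(
        r"<PERSON>(?P<person>.+?)</PERSON>\s+(?:made|earned)\s+"
        r"<MONEY>(?P<amount>[^<]+)</MONEY>")

    m = income.search(tagged)
    if m:
        # Record the relation in structured (tabular) form.
        print({"person": m.group("person"), "income": m.group("amount")})
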
Beyond such name recognition, the amount of
preprocessing varies among extraction systems.
Some extraction systems basically operate at the
level of sequences of words. Others do a limited
amount of syntactic analysis, such as identifying
verb groups and noun groups (non-recursive
noun phrases) [Appelt 1993]. Such analysis
allows a system to easily handle other tense
forms, such as "Fred Smith will be earning
$3,000,000" and noun phrases such as "the new
internet guru will make $800,000 next week".
Some systems go further, performing some
degree of clause-level syntactic analysis in order
to identify subject-verb-object relations; the
patterns are then stated in terms of those
relations (something like: "subject: person;
verb: earn; object: amount-of-money").
As the patterns are elaborated to cover more
examples, we find that they can involve many
different nouns and verbs. For example, in the
pattern "company earns amount-of-money", the
company position may be filled by the name of a
company or any of a large number of words that
can be used to refer to a company: "the firm",
"the manufacturer", "the clothes retailer", etc.
Since the same set of words may appear in
multiple patterns, it is convenient to introduce a
set of semantic word classes, arranged in a
hierarchy, and have the patterns refer to these
word classes.
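As a sketch of this idea (the hierarchy and class names below are invented for illustration), a pattern slot can name a word class, and any word subsumed by that class in the hierarchy is allowed to fill the slot:

    # Hypothetical word-class hierarchy; each word points to its parent class.
    PARENT = {
        "firm": "company",
        "manufacturer": "company",
        "retailer": "company",
        "company": "organization",
    }

    def is_a(word, word_class):
        """True if `word` falls under `word_class` anywhere up the hierarchy."""
        while word is not None:
            if word == word_class:
                return True
            word = PARENT.get(word)
        return False

    # The pattern "company earns amount-of-money" can then match
    # "the manufacturer earned $2 million":
    print(is_a("manufacturer", "company"))   # True
    print(is_a("linguist", "company"))       # False
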
Once these patterns have operated to find
information from individual phrases or
sentences, this information must be combined to
produce an information extract for the entire
document.
At the very least, anaphoric
references must be resolved in order to make the
event participants explicit. Thus if it says "Fred
Smith is the newly hired linguist. He will be
making $59,000 this year." we need to determine
that "He" refers to "Fred Smith". The system
also needs to identify multiple mentions of the
same event; this may occur, for example, in a
newspaper story, where the main event is
described in the opening paragraph and then
elaborated later in the story. Finally, the system
may need to make inferences in order to
determine the appropriate facts: "Fred Smith
earned $60,000 last year and got a $5,000 raise
this year." While we cannot expect an extraction
system to make open-ended inference, selected
inferences may be incorporated if they are
frequently relevant to the extraction of a
particular type of relation.
3. Problems of Information Extraction
There are several obstacles to the wider use of
information extraction.
One obstacle is the cost of developing extraction
systems for new tasks.
Although system
designers have been quite successful in
separating a task-independent core from the task-specific knowledge sources, the cost of
customization remains considerable. Several
knowledge sources need to be adapted: the
largest are typically the patterns for identifying
the events of interest, although additions will
typically also be required to the lexicon and
patterns for name recognition; specialized rules
for filling the data base and for inference will
also generally be needed. The cost is particularly
high if these knowledge sources need to be built
by the extraction system developers. There has
therefore been a push towards allowing user
customization, through special interfaces,
example-based methods, and corpus-trained
systems.
Another problem has been the performance
levels obtained for event extraction. A number
of different extraction tasks, of widely varying
complexity, have been tested in the Message
Understanding Conferences.[1] Performance on
these tasks (after one month of customization, in
the recent evaluations) has rarely exceeded
F=0.60, where F is an average of slot recall and
slot precision.[2] These performance limitations
represent a combination of several factors, since
successful extraction may require a combination
of successful name recognition, event
recognition, reference resolution, and in some
cases inference. One problem in particular is the
long 'tail' of infrequent patterns, which is typical
of natural language phenomena. While it is
relatively easy to identify the frequent linguistic
patterns which account for the bulk of instances
of a particular type of event, it is very hard to get
good coverage of the less frequent patterns.

[1] The Proceedings of the MUC-7 Conference and other information about the
MUC evaluations are available at http://muc.www.saic.com/
[2] F = (2 × precision × recall) / (precision + recall)
4. Learning Methods
The key to improving portability to new tasks
has been a shift from the explicit manual
construction of patterns, which requires
considerable expertise, to the automated, or
semi-automated, construction of such patterns
from examples of relevant text and the
information to be extracted.
Semi-automated learning methods
Several systems have been based on the
automated creation of patterns from individual
extraction examples, followed by their manual
review and possible generalization. One of the
first was the AutoSlog system at the Univ. of
Massachusetts, which constructed a pattern for
each instance of each slot, and then had these
patterns manually reviewed [Riloff 1993].
The PET system, developed at New York
University, is intended to provide an integrated
interface for the user customization of extraction
[Yangarber 1997]. It creates an extraction pattern
from an individual example of a phrase along
with the information to be extracted, and then
allows the user to generalize or specialize the
pattern using a semantic hierarchy. The user can
also modify the hierarchy, add lexical items, etc.
In this way the user is given broad control over
the customization process without becoming
involved in the system internals.
Learning from fully annotated corpora
A number of systems over the past several years
have moved to fully automated creation of
extraction patterns from an annotated corpus,
including specifically the creation of generalized
rules from multiple examples. Experiments have
been done both with probabilistic (HMM)
models [Miller 1998, Freitag 1999] and with
symbolic rules for identifying slot fillers [Califf
1998], using methods such as Inductive Logic
Programming to learn the rules.
The
shortcoming of this approach is the need for
relatively large amounts of annotated training
data, even for relatively simple extraction tasks.
With the MUC-6 corpus, for example (100
articles), the University of Massachusetts
automated training could not do as well as
traditionally-developed systems [Fisher 1995].
One reason for the need for large training
corpora is that most such systems learn template-filling rules stated in terms of individual lexical
items, although recent work from NTT describes
a system capable of specializing and generalizing
rules stated in terms of a pre-existing hierarchy
of word classes [Sasaki 1999].
Active learning
The methods just described require a corpus
annotated with the information to be extracted.
Such approaches can quickly learn the most
common patterns, but require a large corpus in
order to achieve good coverage of the less
frequent patterns. Given the skewed distribution
of patterns, the typical process of sequential
corpus annotation can be quite inefficient, not to
say frustrating, as the user annotates similar
examples again and again before finding
something new.
Active learning methods try to make this process
more efficient by selecting suitable candidates
for the user to annotate. Some gains in learning
efficiency have been reported by selecting
examples which match patterns which are
'similar' to good patterns [Soderland 1999], or
examples which match patterns about which
there is considerable uncertainty (patterns
supported by few examples) [Thompson 1999].
Learning from relevance-marked corpora
In order to reduce the burden of annotation,
Riloff [1996] has developed methods for
learning from documents or passages marked
only with regard to task relevance. In her
approach, all possible patterns (a word and its
immediate syntactic context) are generated for a
collection containing both relevant and irrelevant
documents. Patterns are then ranked based on
their frequency in the two segments of the
collection, preferring patterns which occur more
often in the relevant documents. The top-ranked
patterns using this approach turn out to be
effective extraction patterns for the task.
Learning from unannotated corpora
Brin's DIPRE system [Brin 1998] uses a
bootstrapping method to find patterns without
any pre-annotation of the data. The process is
initiated with a seed set of pairs in some given
relation, such as people and their home town or
their age. It then searches a large corpus for
patterns in which one of these pairs appears.
Given these patterns, it can then find additional
examples which are added to the seed, and the
process can then be repeated. This approach
takes advantage of facts or events which are
stated in multiple forms within the corpus.
Similar bootstrapping methods have been used
recently to find patterns for name classification.
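A toy sketch of this bootstrapping cycle follows; the corpus, the choice of (person, age) pairs, and the way contexts are turned into patterns are simplifications for illustration, not Brin's implementation.

    import re

    # Toy corpus and a single seed (person, age) pair.
    corpus = [
        "John Doe, 47, was appointed president.",
        "Mary Poe, 51, was appointed president.",
        "Mary Poe, 51, joined the board.",
        "Jane Roe, 32, joined the board.",
    ]
    pairs = {("John Doe", "47")}

    def contexts_of(pairs, sentences):
        """Patterns = sentences with a known pair replaced by placeholders."""
        found = set()
        for person, age in pairs:
            for s in sentences:
                if person in s and age in s:
                    found.add(s.replace(person, "<PERSON>").replace(age, "<AGE>"))
        return found

    def apply_patterns(patterns, sentences):
        """Turn each context back into a regex and harvest new pairs."""
        new_pairs = set()
        for c in patterns:
            pre, rest = c.split("<PERSON>")
            mid, post = rest.split("<AGE>")
            rx = (re.escape(pre) + r"(?P<person>[A-Z][a-z]+ [A-Z][a-z]+)" +
                  re.escape(mid) + r"(?P<age>\d+)" + re.escape(post))
            for s in sentences:
                m = re.fullmatch(rx, s)
                if m:
                    new_pairs.add((m.group("person"), m.group("age")))
        return new_pairs

    for _ in range(2):                     # two bootstrapping rounds
        pairs |= apply_patterns(contexts_of(pairs, corpus), corpus)
    print(pairs)   # picks up ("Mary Poe", "51"), then ("Jane Roe", "32")

Because one fact ("Mary Poe, 51") is stated in two different forms, the second round learns the "joined the board" context and with it a pair ("Jane Roe", "32") that the seed pattern alone would never have reached.
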
The challenge: finding promising patterns
As we noted earlier, one of the obstacles to
portability and performance is the difficulty of
finding the many infrequent patterns which may
be used to express the events of interest. The
methods described above point to at least three
different properties of patterns which may be
used to aid in their search, using active learning
or bootstrapping methods:
• patterns relevant to a given topic are likely to co-occur within a given
  document. Riloff [1996] used the connection from documents of known
  relevance to patterns in her discovery work, and we extend this (as
  described below) to a bootstrapping method.
• some relevant patterns are likely to be 'similar' to other relevant
  patterns. The active learning methods described above make use of a
  simple notion of similarity to enhance the search for patterns. By
  employing more sophisticated notions of similarity, using word similarity
  within a thesaurus, it may be possible to make the search even more
  effective.
• facts are likely to be mentioned several times in different forms, so we
  can look for related patterns by looking for patterns involving the same
  arguments as known relevant patterns.
Not all of these methods are likely to be effective
in any particular case. Some relations, such as
people's ages, are not likely to be very topic-specific, so bootstrapping based on relevant
documents may not be effective. If the analysis
is based on a single news source, information is
not as likely to be repeated, so the DIPRE
strategy will be less effective. However, through
a combination of methods, it may be possible to
formulate an efficient search for a broad range of
relevant patterns.
5. Our Work
Our own recent work has examined the
possibility of using topic relevance for active
learning or bootstrapped search for patterns. Our
goal is to be able to locate patterns in
unannotated corpora, and in that way to be able
to make use of very large corpora for our
searches --- much larger corpora than we could
hope to annotate exhaustively for a new topic.
Our basic approach is as follows:
1. create a few seed patterns which characterize the topic of interest
2. retrieve the documents containing these patterns
3. take the patterns which appear in at least one relevant document and
   rank them based on their frequency in the relevant set compared to
   their frequency in the remainder of the corpus
4. select the top-ranked pattern and add it to the set of seed patterns
5. repeat from step 2
This method can be used as fully-automatic
bootstrapping or we can present the selected
pattern in step 4 to the user for approval, creating
a form of active learning.
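The loop can be sketched compactly as follows; documents are represented simply as the set of candidate patterns they contain, and the scoring function here is only a stand-in (the score we actually use is described under Methods below).

    import math

    def score(p, relevant, corpus):
        """Stand-in ranking: favor patterns frequent in the relevant set
        and comparatively rare in the rest of the corpus."""
        in_rel = sum(1 for d in relevant if p in d)
        in_all = sum(1 for d in corpus if p in d)
        return (in_rel / in_all) * math.log(1 + in_rel)

    def discover(seeds, corpus, iterations=50):
        """corpus: list of documents, each given as the set of patterns in it."""
        accepted = set(seeds)
        for _ in range(iterations):
            # Step 2: retrieve documents containing any accepted pattern.
            relevant = [d for d in corpus if d & accepted]
            # Step 3: candidates are patterns seen in at least one relevant document.
            candidates = set().union(*relevant) - accepted if relevant else set()
            if not candidates:
                break
            # Step 4: accept the top-ranked pattern (or show it to the user
            # first, which turns the loop into a form of active learning).
            accepted.add(max(candidates, key=lambda p: score(p, relevant, corpus)))
            # Step 5: repeat from step 2.
        return accepted

    docs = [{"company appoints person", "person succeeds person"},
            {"person resigns", "person succeeds person", "company names person"},
            {"company reports earnings"}]
    print(discover({"company appoints person", "person resigns"}, docs, iterations=3))
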
Methods
Our corpus is pre-processed as follows: first, a
named-entity tagger marks all instances of names
of people, companies, and locations. Then a
parser is used to extract all the clauses from each
document;[3] for each clause we build a tuple
consisting of the head of the subject, the verb,
the head of the object, locative and temporal
modifiers, and certain other verb arguments.
Different clause structures, including passives
and relative clauses, are normalized to produce
tuples with a uniform structure. These tuples
will be the basis for our pattern extraction
process.

[3] We have used the parser from Conexor Oy, Helsinki, Finland, and we
gratefully acknowledge their assistance in our research.
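For concreteness, a clause such as "Yesterday IBM appointed Fred Smith" and its passive variant "Fred Smith was appointed by IBM yesterday" would both normalize (after name classification) to the same tuple; the field names below are illustrative, not the exact structure used in our system.

    from dataclasses import dataclass
    from typing import Optional

    @dataclass(frozen=True)
    class ClauseTuple:
        subject: Optional[str]        # head of the subject, e.g. "company"
        verb: str                     # main verb, e.g. "appoint"
        obj: Optional[str]            # head of the object, e.g. "person"
        locative: Optional[str] = None
        temporal: Optional[str] = None

    t = ClauseTuple(subject="company", verb="appoint", obj="person",
                    temporal="yesterday")
    print(t)
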
We then begin the customization process by
selecting a few seed patterns which characterize
the topic of interest. For example, for the topic
of management succession, the seed patterns
included "company appoints person" and
"person resigns" (where company and person
would match any company name and person
name, respectively). The system then retrieves
all the documents containing any instance of one
of the seed patterns. These constitute the initial
relevant document set, R.
For each pattern p (tuple) in R, we compute a
score, proportional to the pattern’s precision and
the log of its coverage. We then select the
pattern p with the highest score, subject to two
filtering constraints: the pattern must appear in
at least 2 documents in R, and it must appear in
no more than 10% of the documents in the entire
collection (patterns which are more frequent are
considered uninformative). The selected pattern
p is added to the seed patterns.
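Using the same document-as-pattern-set representation as in the earlier sketch, the selection step with these two filtering constraints might look like the following; this is one interpretation of "proportional to the pattern's precision and the log of its coverage", not the exact code.

    import math

    def select_pattern(candidates, relevant, corpus):
        """Pick the candidate with the highest precision * log(coverage) score,
        requiring at least 2 occurrences in R and at most 10% of the collection."""
        def freq(p, docs):
            return sum(1 for d in docs if p in d)

        best, best_score = None, 0.0
        for p in candidates:
            in_rel, in_all = freq(p, relevant), freq(p, corpus)
            if in_rel < 2 or in_all > 0.10 * len(corpus):
                continue                      # filtered out
            s = (in_rel / in_all) * math.log(in_rel)
            if s > best_score:
                best, best_score = p, s
        return best
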
The documents containing pattern p are added to
the relevant set R. However, we do not place the
same total confidence in the relevance of these
new documents. Instead, we define a measure of
graded relevance for the documents, Rel_i(d), the
estimated relevance of document d after the i-th
iteration (after the i-th pattern has been added to
the pattern set). Initially, Rel_0(d) = 1 if the
document contains a seed pattern, and = 0
otherwise. Let A(d) be the set of accepted
patterns which appear in document d. For a set
of patterns P, let D(P) be the set of documents
which contain all of the patterns in P. Then
Rel_{i+1}(d) = max( Rel_i(d), ( Σ_{d' ∈ D(A(d))} Rel_i(d') ) / |D(A(d))| )
These graded relevance values are then used as
document weights in the selection of additional
patterns. These procedures will be described in
more detail in [Yangarber 2000].
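A direct transcription of this update into code is given below; documents are keyed by an identifier and mapped to the set of accepted patterns they contain, and documents containing no accepted pattern keep their current estimate. Both conventions are representational assumptions rather than a description of our implementation.

    def update_relevance(rel, accepted, docs):
        """One iteration: Rel_{i+1}(d) = max(Rel_i(d), average of Rel_i(d') over
        the documents d' containing every accepted pattern that occurs in d)."""
        new_rel = {}
        for d, patterns in docs.items():
            a_d = patterns & accepted                    # A(d)
            if not a_d:
                new_rel[d] = rel[d]                      # no accepted patterns in d
                continue
            support = [d2 for d2, p2 in docs.items() if a_d <= p2]   # D(A(d))
            avg = sum(rel[d2] for d2 in support) / len(support)
            new_rel[d] = max(rel[d], avg)
        return new_rel
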
Because full patterns may not repeat sufficiently
often to obtain reliable statistics, we have
employed in our experiments 'generalized
patterns' consisting of pairs from the original
tuples: a subject-verb pair, a subject-object pair,
etc. Once we have identified pairs relevant to
the extraction task, we can use them to build
semantic word classes, by grouping tuples
containing the same (task-relevant) pair. For
example, if "X appoints person" is a relevant
pair, we could collect all the values of X as a
word class. This will allow us to build the
semantic classes concurrently with the patterns.[4]

[4] [Riloff 1999] also reports work on creating semantic word classes
from extraction patterns.
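For illustration, grouping tuples on a relevant pair to collect a candidate word class might look like the following; tuples are given here as plain (subject, verb, object) triples, and treating the relevant pair as (verb, object) is an assumption.

    from collections import defaultdict

    def collect_word_classes(tuples, relevant_pairs):
        """For each relevant (verb, object) pair, collect the subjects that
        co-occur with it; each resulting set is a candidate word class."""
        classes = defaultdict(set)
        for subj, verb, obj in tuples:
            if (verb, obj) in relevant_pairs:
                classes[(verb, obj)].add(subj)
        return dict(classes)

    tuples = [("company", "appoint", "person"),
              ("board", "appoint", "person"),
              ("committee", "appoint", "person"),
              ("person", "eat", "lunch")]
    print(collect_word_classes(tuples, {("appoint", "person")}))
    # {('appoint', 'person'): {'company', 'board', 'committee'}}  (set order may vary)
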
Results
We report here on our initial results in applying
this procedure to the "management succession"
extraction task. This task involves the extraction
of information on the job changes of corporate
executives, and was the basis of the evaluation
for Message Understanding Conference - 6
[MUC-6]. We used two seed patterns for this
task,
company appoint-verb person
person resign-verb
where appoint-verb consists of the four verbs
{appoint, elect, promote, and name} and resign-verb consists of the verbs {resign, depart, quit,
and step down}. We performed 50 iterations of
our pattern discovery procedure.
We evaluated the effectiveness of our pattern
discovery procedure in two ways, as a text filter
and as a source for event extraction patterns.
We can treat the set of patterns as a document
filter, where a document is selected if it matches
any one of the patterns. We can then ask how
effective the patterns are at selecting precisely
those documents which are relevant to the topic
of management succession (as defined for MUC-6). For this
evaluation, we used a set of 250 documents which had been
classified for task relevance (100 documents from the MUC
evaluation and an additional 150 chosen using a
conventional information retrieval system). The
set of seed patterns filtered documents with a
recall of 11% and a precision of 93%; the final
set of patterns filtered documents with a recall of
80% and a precision of 72%. This is a large
improvement in filtering performance using
purely automatic enhancements.
Evaluating the patterns with respect to the final
goal of event extraction is less direct because
additional information must be provided (at
present, manually) before these patterns can be
used within the extraction system: the patterns
need to be extended to incorporate certain
modifiers and paired with domain predicates.
Patterns which do not correspond to domain
predicates must be discarded (again, manually).
In addition, they need to be combined with other
patterns, such as those for the names of executive
positions. The evaluation must therefore be seen
as a preliminary attempt to isolate the gain
achieved by the pattern discovery procedure.
We compared extraction results using four
different pattern sets: the seed patterns described
above; the seed patterns plus the patterns
discovered by the automatic procedure; the
patterns created manually over a period of one
month for the MUC-6 evaluation; and the larger
set of patterns, manually developed, which are
used in our current system. All of these results
were obtained on the official MUC-6 training
corpus.
Pattern Base                     Recall   Precision     F
seed patterns                      28        78       41.32
seed + discovered patterns         51        76       61.18
manual - MUC                       54        71       61.93
manual - now                       69        79       73.91
The results indicate that the coverage of the
automatically discovered patterns is comparable
to that of the patterns developed by hand during the one-month MUC evaluation. This is certainly an
encouraging result in terms of our ability to
rapidly develop patterns for new tasks, but it will
need to be validated on additional extraction
tasks.
References
[Appelt 1993] Doug Appelt et al. FASTUS: a finite-state processor for
information extraction from real-world text. Proc. 13th Int'l Joint Conf. on
Artificial Intelligence (IJCAI-93), 1172-1178.
[Brin 1998] Sergey Brin. Extracting patterns and relations from the
World-Wide Web. Proc. 1998 Int'l Workshop on the Web and Databases
(WebDB '98), March 1998.
[Califf 1998] Mary Elaine Califf. Relational Learning Techniques for Natural
Language Information Extraction. PhD Thesis, Univ. of Texas, Austin, August
1998.
[Fisher 1995] David Fisher, Stephen Soderland, Joseph McCarthy, Fangfang
Feng, and Wendy Lehnert. Description of the UMass system as used for MUC-6.
Proc. Sixth Message Understanding Conf., Morgan Kaufmann, 1995.
[Freitag 1999] Dayne Freitag and Andrew McCallum. Information extraction
with HMMs and shrinkage. Proc. Workshop on Machine Learning and Information
Extraction, AAAI-99, Orlando, FL, July 1999.
[Friedman 1995] C. Friedman, G. Hripcsak, W. DuMouchel, S. B. Johnson, and
P. D. Clayton. Natural language processing in an operational clinical
environment. Natural Language Engineering, 1(1): 83-108, 1995.
[Grishman 1997] Ralph Grishman. Information extraction: techniques and
challenges. In Information Extraction, ed. Maria Teresa Pazienza, Springer
Notes in Artificial Intelligence, Springer-Verlag, 1997.
[Humphreys 2000] Kevin Humphreys, George Demetriou, and Robert Gaizauskas.
Two applications of information extraction to biological science journal
articles: enzyme interactions and protein structure. Proc. Pacific Symposium
on Biocomputing 2000.
[Miller 1998] Scott Miller, Michael Crystal, Heidi Fox, Lance Ramshaw,
Richard Schwartz, Rebecca Stone, Ralph Weischedel, and the Annotation Group.
Algorithms that learn to extract information -- BBN: description of the SIFT
system as used for MUC-7. In Proceedings of the MUC-7 Conference, available
at http://muc.www.saic.com/
[MUC-6] Proc. Sixth Message Understanding Conf., Morgan Kaufmann, 1995.
[Riloff 1993] Ellen Riloff. Automatically constructing a dictionary for
information extraction tasks. Proc. Eleventh National Conf. on Artificial
Intelligence, 811-816.
[Riloff 1996] Ellen Riloff. Automatically generating extraction patterns
from untagged text. Proc. Thirteenth National Conf. on Artificial
Intelligence (AAAI-96), 1996, 1044-1049.
[Riloff 1999] E. Riloff and R. Jones. Learning dictionaries for information
extraction by multi-level bootstrapping. Proc. Sixteenth National Conf. on
Artificial Intelligence (AAAI-99), Orlando, FL, 1999.
[Rindflesch 2000] Thomas Rindflesch, Lorraine Tanabe, John Weinstein, and
Lawrence Hunter. EDGAR: extraction of drugs, genes, and relations from the
biomedical literature. Proc. Pacific Symposium on Biocomputing 2000.
[Sasaki 1999] Yutaka Sasaki. Applying type-oriented ILP to IE rule
generation. Proc. Workshop on Machine Learning and Information Extraction,
AAAI-99, Orlando, FL, July 1999.
[Sekimizu 1998] Takeshi Sekimizu, Hyun Park, and Jun'ichi Tsujii. Identifying
the interaction between genes and gene products based on frequently seen
verbs in Medline abstracts. Genome Informatics 1998, Tokyo, December 1998.
Universal Academy Press, Inc.
[Soderland 1999] S. Soderland. Learning information extraction rules for
semi-structured and free text. Machine Learning 34, 233-272.
[Thompson 1999] C. Thompson, M. E. Califf, and R. J. Mooney. Active learning
for natural language parsing and information extraction. Proc. Sixteenth
Int'l Machine Learning Conf., 406-414.
[Yangarber 1997] Roman Yangarber and Ralph Grishman. Customization of
information extraction systems. In Paola Velardi, ed., International Workshop
on Lexically Driven Information Extraction, 1-11, Frascati, Italy, July 1997.
Universita di Roma.
[Yangarber 2000] Roman Yangarber, Ralph Grishman, Pasi Tapanainen, and Silja
Huttunen. Unsupervised discovery of scenario-level patterns for information
extraction. To appear in Proc. Conf. Applied Natural Language Processing,
2000.