Issues in Corpus-Trained Information Extraction

Ralph Grishman and Roman Yangarber
Computer Science Department
New York University
New York, NY
{grishman, roman}@cs.nyu.edu

1. Introduction

What is Information Extraction?

Information extraction is the process of automatically identifying specified types of entities, relations, or events in free text, and recording this information in a structured form. Particular examples include finding the names of all the people and places in a document; retrieving information on people's ages from a document; and finding all mentions of murders in a newspaper (including tabulating who was killed, by whom, when, and where). The focus of our interest in this paper will be on relations and events, although (since events often involve named people and groups) this is often dependent on good name recognition.

Why do Information Extraction?

Information extraction means getting specific facts from a document; it provides a qualitatively different capability from the much more widely used information retrieval technology, which can find relevant documents, web pages, or (at best) passages. In some cases automated extraction can replace (or speed) an extraction task which is currently performed manually. For example, for various types of medical research and quality assurance, large numbers of medical reports containing narrative information must be reduced to tabular form, and some of this can be done through automated extraction [Friedman 1995]. In many other areas, extraction has the potential of providing much more precise access to information in the literature than is possible with keyword-based techniques: documents can be retrieved based on specific relationships or interactions. Several such studies have been conducted in the areas of biomedicine and genomics [Humphreys 2000, Rindflesch 2000, Sekimizu 1998]. By extracting information on a range of common event types, it should similarly be possible to provide detailed indexing of current news stories.

2. How is Extraction performed?

In simplest terms, the basic idea behind information extraction is to look for one of a set of patterns which convey the relation or event of interest [Grishman 1997]. For example, if we are looking for reports of people's income, we might look for patterns of the form "person made amount-of-money" or "person earned amount-of-money". Such patterns assume that we have preprocessed the text to recognize people's names and currency amounts as units.

Beyond such name recognition, the amount of preprocessing varies among extraction systems. Some extraction systems operate essentially at the level of sequences of words. Others do a limited amount of syntactic analysis, such as identifying verb groups and noun groups (non-recursive noun phrases) [Appelt 1993]. Such analysis allows a system to easily handle other tense forms, such as "Fred Smith will be earning $3,000,000", and noun phrases such as "the new internet guru will make $800,000 next week". Some systems go further, performing some degree of clause-level syntactic analysis in order to identify subject-verb-object relations; the patterns are then stated in terms of those relations (something like: "subject: person; verb: earn; object: amount-of-money").

As the patterns are elaborated to cover more examples, we find that they can involve many different nouns and verbs. For example, in the pattern "company earns amount-of-money", the company position may be filled by the name of a company or any of a large number of words that can be used to refer to a company: "the firm", "the manufacturer", "the clothes retailer", etc. Since the same set of words may appear in multiple patterns, it is convenient to introduce a set of semantic word classes, arranged in a hierarchy, and have the patterns refer to these word classes.
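To make the clause-level matching concrete, the following minimal Python sketch (our own illustration, not the implementation of any system cited here) matches normalized subject-verb-object tuples against patterns whose argument positions refer to word classes; the class names, patterns, and tuple encoding are all hypothetical:

    # Illustrative sketch: matching clause-level extraction patterns
    # against subject-verb-object tuples, with semantic word classes.
    # The word classes and patterns below are hypothetical examples.

    WORD_CLASSES = {
        "company": {"company", "firm", "manufacturer", "retailer"},
        "person": {"person"},           # filled by the name recognizer
        "money": {"amount-of-money"},   # filled by the currency recognizer
    }

    # A pattern constrains the subject and object to word classes
    # and the verb to a small set of lexical items.
    PATTERNS = [
        {"subject": "person", "verbs": {"earn", "make"}, "object": "money"},
        {"subject": "company", "verbs": {"appoint", "name"}, "object": "person"},
    ]

    def matches(tuple_, pattern):
        """Check one (subject, verb, object) tuple against one pattern."""
        subj, verb, obj = tuple_
        return (subj in WORD_CLASSES[pattern["subject"]]
                and verb in pattern["verbs"]
                and obj in WORD_CLASSES[pattern["object"]])

    # A preprocessor would normalize "Fred Smith will be earning
    # $3,000,000" to the tuple ("person", "earn", "amount-of-money").
    print(matches(("person", "earn", "amount-of-money"), PATTERNS[0]))  # True

Stating the argument constraints as word classes, rather than as individual lexical items, is what lets a single pattern cover "the firm", "the manufacturer", and so on.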
Once these patterns have identified information in individual phrases or sentences, this information must be combined to produce the extracted facts for the entire document. At the very least, anaphoric references must be resolved in order to make the event participants explicit. Thus, if the text says "Fred Smith is the newly hired linguist. He will be making $59,000 this year.", we need to determine that "He" refers to "Fred Smith". The system also needs to identify multiple mentions of the same event; this may occur, for example, in a newspaper story, where the main event is described in the opening paragraph and then elaborated later in the story. Finally, the system may need to make inferences in order to determine the appropriate facts: from "Fred Smith earned $60,000 last year and got a $5,000 raise this year." we should be able to conclude that he is earning $65,000 this year. While we cannot expect an extraction system to make open-ended inferences, selected inferences may be incorporated if they are frequently relevant to the extraction of a particular type of relation.

3. Problems of Information Extraction

There are several obstacles to the wider use of information extraction. One obstacle is the cost of developing extraction systems for new tasks. Although system designers have been quite successful in separating a task-independent core from the task-specific knowledge sources, the cost of customization remains considerable. Several knowledge sources need to be adapted: the largest are typically the patterns for identifying the events of interest, although additions will typically also be required to the lexicon and to the patterns for name recognition; specialized rules for filling the database and for inference will also generally be needed. The cost is particularly high if these knowledge sources need to be built by the extraction system developers. There has therefore been a push towards allowing user customization, through special interfaces, example-based methods, and corpus-trained systems.

Another problem has been the performance levels obtained for event extraction. A number of different extraction tasks, of widely varying complexity, have been tested in the Message Understanding Conferences.[1] Performance on these tasks (after one month of customization, in the recent evaluations) has rarely exceeded F=0.60, where F is an average of slot recall and slot precision.[2]

[1] The Proceedings of the MUC-7 Conference and other information about the MUC evaluations are available at http://muc.www.saic.com/
[2] $F = \frac{2 \cdot precision \cdot recall}{precision + recall}$

These performance limitations represent a combination of several factors, since successful extraction may require a combination of successful name recognition, event recognition, reference resolution, and in some cases inference. One problem in particular is the long 'tail' of infrequent patterns, which is typical of natural language phenomena. While it is relatively easy to identify the frequent linguistic patterns which account for the bulk of instances of a particular type of event, it is very hard to get good coverage of the less frequent patterns.
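The practical effect of this skew can be seen with a few lines of code; the pattern names and counts below are invented purely for illustration:

    # Illustrative sketch: cumulative coverage of event instances by the
    # k most frequent patterns. The counts are invented for the example.
    from collections import Counter

    # pattern -> number of event instances it accounts for in some corpus
    pattern_counts = Counter({
        "person-earn-money": 120, "company-appoint-person": 90,
        "person-resign": 60, "company-promote-person": 12,
        "person-step-down": 5, "board-oust-person": 2,
    })

    total = sum(pattern_counts.values())
    covered = 0
    for k, (_pattern, count) in enumerate(pattern_counts.most_common(), start=1):
        covered += count
        print(f"top {k} patterns cover {covered / total:.0%} of instances")
    # The first few patterns cover most instances; the remainder form
    # the long tail that is hard to enumerate by hand.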
4. Learning Methods

The key to improving portability to new tasks has been a shift from the explicit manual construction of patterns, which requires considerable expertise, to the automated, or semi-automated, construction of such patterns from examples of relevant text and the information to be extracted.

Semi-automated learning methods

Several systems have been based on the automated creation of patterns from individual extraction examples, followed by their manual review and possible generalization. One of the first was the AutoSlog system at the Univ. of Massachusetts, which constructed a pattern for each instance of each slot, and then had these patterns manually reviewed [Riloff 1993]. The PET system, developed at New York University, is intended to provide an integrated interface for the user customization of extraction [Yangarber 1997]. It creates an extraction pattern from an individual example of a phrase along with the information to be extracted, and then allows the user to generalize or specialize the pattern using a semantic hierarchy. The user can also modify the hierarchy, add lexical items, etc. In this way the user is given broad control over the customization process without becoming involved in the system internals.

Learning from fully annotated corpora

A number of systems over the past several years have moved to fully automated creation of extraction patterns from an annotated corpus, including specifically the creation of generalized rules from multiple examples. Experiments have been done both with probabilistic (HMM) models [Miller 1998, Freitag 1999] and with symbolic rules for identifying slot fillers [Califf 1998], using methods such as Inductive Logic Programming to learn the rules. The shortcoming of this approach is the need for relatively large amounts of annotated training data, even for relatively simple extraction tasks. With the MUC-6 corpus, for example (100 articles), the University of Massachusetts automated training could not do as well as traditionally developed systems [Fisher 1995]. One reason for the need for large training corpora is that most such systems learn template-filling rules stated in terms of individual lexical items, although recent work from NTT describes a system capable of specializing and generalizing rules stated in terms of a pre-existing hierarchy of word classes [Sasaki 1999].

Active learning

The methods just described require a corpus annotated with the information to be extracted. Such approaches can quickly learn the most common patterns, but require a large corpus in order to achieve good coverage of the less frequent patterns. Given the skewed distribution of patterns, the typical process of sequential corpus annotation can be quite inefficient, not to say frustrating, as the user annotates similar examples again and again before finding something new. Active learning methods try to make this process more efficient by selecting suitable candidates for the user to annotate. Some gains in learning efficiency have been reported by selecting examples which match patterns which are 'similar' to good patterns [Soderland 1999], or examples which match patterns about which there is considerable uncertainty (patterns supported by few examples) [Thompson 1999].
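As a rough sketch of the second selection criterion (favoring examples matched only by weakly supported patterns), with data structures invented for the example:

    # Illustrative sketch of uncertainty-driven candidate selection for
    # active learning: prefer unannotated examples whose matched patterns
    # are supported by few annotated examples. Data is hypothetical.

    def select_candidates(examples, pattern_support, n=10):
        """examples: list of (example_id, set_of_matched_patterns).
        pattern_support: pattern -> number of annotated supporting examples.
        Returns the n examples whose weakest pattern has the least support."""
        def uncertainty(item):
            _, patterns = item
            return min(pattern_support.get(p, 0) for p in patterns)
        return sorted(examples, key=uncertainty)[:n]

    examples = [("d1", {"person-resign"}), ("d2", {"board-oust-person"})]
    support = {"person-resign": 40, "board-oust-person": 1}
    print(select_candidates(examples, support, n=1))  # d2 is most uncertain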
Learning from relevance-marked corpora

In order to reduce the burden of annotation, Riloff [1996] has developed methods for learning from documents or passages marked only with regard to task relevance. In her approach, all possible patterns (a word and its immediate syntactic context) are generated for a collection containing both relevant and irrelevant documents. Patterns are then ranked based on their frequency in the two segments of the collection, preferring patterns which occur more often in the relevant documents. The top-ranked patterns found by this approach turn out to be effective extraction patterns for the task.

Learning from unannotated corpora

Brin's DIPRE system [Brin 1998] uses a bootstrapping method to find patterns without any pre-annotation of the data. The process is initiated with a seed set of pairs in some given relation, such as people and their home town or their age. It then searches a large corpus for patterns in which one of these pairs appears. Given these patterns, it can then find additional examples, which are added to the seed, and the process can then be repeated (a simplified sketch appears at the end of this section). This approach takes advantage of facts or events which are stated in multiple forms within the corpus. Similar bootstrapping methods have been used recently to find patterns for name classification.

The challenge: finding promising patterns

As we noted earlier, one of the obstacles to portability and performance is the difficulty of finding the many infrequent patterns which may be used to express the events of interest. The methods described above point to at least three different properties of patterns which may be used to aid in their search, using active learning or bootstrapping methods:

- Patterns relevant to a given topic are likely to co-occur within a given document. Riloff [1996] used the connection from documents of known relevance to patterns in her discovery work, and we extend this (as described below) to a bootstrapping method.

- Some relevant patterns are likely to be 'similar' to other relevant patterns. The active learning methods described above make use of a simple notion of similarity to enhance the search for patterns. By employing more sophisticated notions of similarity, using word similarity within a thesaurus, it may be possible to make the search even more effective.

- Facts are likely to be mentioned several times in different forms, so we can look for related patterns by looking for patterns involving the same arguments as known relevant patterns.

Not all of these methods are likely to be effective in any particular case. Some relations, such as people's ages, are not likely to be very topic-specific, so bootstrapping based on relevant documents may not be effective. If the analysis is based on a single news source, information is not as likely to be repeated, so the DIPRE strategy will be less effective. However, through a combination of methods, it may be possible to formulate an efficient search for a broad range of relevant patterns.
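Here is the deliberately simplified sketch of a DIPRE-style bootstrapping loop promised above. A 'pattern' is reduced to the literal text between the two arguments of the relation, whereas Brin's system induced richer patterns from the surrounding context; the corpus and seed data are invented for illustration:

    # Highly simplified sketch of DIPRE-style bootstrapping over raw text.
    # A "pattern" here is just the text between the two arguments; the
    # real system used richer context. All data below is illustrative.
    import re

    def bootstrap(corpus, seed_pairs, iterations=3):
        pairs = set(seed_pairs)
        patterns = set()
        for _ in range(iterations):
            # 1. Find contexts in which known pairs occur.
            for x, y in pairs:
                for doc in corpus:
                    m = re.search(re.escape(x) + r"(.{1,30}?)" + re.escape(y), doc)
                    if m:
                        patterns.add(m.group(1))
            # 2. Use the harvested contexts to find new pairs.
            for middle in patterns:
                for doc in corpus:
                    for m in re.finditer(
                            r"(\w[\w ]*?)" + re.escape(middle) + r"(\w[\w ]*)", doc):
                        pairs.add((m.group(1).strip(), m.group(2).strip()))
        return pairs, patterns

    corpus = ["Fred Smith, 42, lives in Boston.", "Mary Jones, 37, lives in Austin."]
    print(bootstrap(corpus, {("Fred Smith", "42")})[0])
    # both (name, age) pairs are recovered; set order may vary

Each pass alternates between harvesting contexts for known pairs and harvesting pairs for known contexts; on real corpora, filtering out overly general patterns becomes essential.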
5. Our Work

Our own recent work has examined the possibility of using topic relevance for active learning or bootstrapped search for patterns. Our goal is to be able to locate patterns in unannotated corpora, and in that way to be able to make use of very large corpora for our searches: much larger corpora than we could hope to annotate exhaustively for a new topic. Our basic approach is as follows:

1. create a few seed patterns which characterize the topic of interest
2. retrieve the documents containing these patterns
3. take the patterns which appear in at least one relevant document and rank them based on their frequency in the relevant set compared to their frequency in the remainder of the corpus
4. select the top-ranked pattern and add it to the set of seed patterns
5. repeat from step 2

This method can be used for fully automatic bootstrapping, or we can present the selected pattern in step 4 to the user for approval, creating a form of active learning.

Methods

Our corpus is pre-processed as follows: first, a named-entity tagger marks all instances of names of people, companies, and locations. Then a parser is used to extract all the clauses from each document;[3] for each clause we build a tuple consisting of the head of the subject, the verb, the head of the object, locative and temporal modifiers, and certain other verb arguments. Different clause structures, including passives and relative clauses, are normalized to produce tuples with a uniform structure. These tuples will be the basis for our pattern extraction process.

[3] We have used the parser from Conexor Oy, Helsinki, Finland, and we gratefully acknowledge their assistance in our research.

We then begin the customization process by selecting a few seed patterns which characterize the topic of interest. For example, for the topic of management succession, the seed patterns included "company appoints person" and "person resigns" (where company and person would match any company name and person name, respectively). The system then retrieves all the documents containing any instance of one of the seed patterns. These constitute the initial relevant document set, R.

For each pattern p (tuple) in R, we compute a score proportional to the pattern's precision and the log of its coverage. We then select the pattern p with the highest score, subject to two filtering constraints: the pattern must appear in at least 2 documents in R, and it must appear in no more than 10% of the documents in the entire collection (patterns which are more frequent are considered uninformative). The selected pattern p is added to the seed patterns.

The documents containing pattern p are added to the relevant set R. However, we do not place the same total confidence in the relevance of these new documents. Instead, we define a measure of graded relevance for the documents, Rel_i(d), the estimated relevance of document d after the i-th iteration (after the i-th pattern has been added to the pattern set). Initially, Rel_0(d) = 1 if the document contains a seed pattern, and Rel_0(d) = 0 otherwise. Let A(d) be the set of accepted patterns which appear in document d. For a set of patterns P, let D(P) be the set of documents which contain all of the patterns in P. Then

    $Rel_{i+1}(d) = \max\left( Rel_i(d),\ \frac{\sum_{d' \in D(A(d))} Rel_i(d')}{|D(A(d))|} \right)$

These graded relevance values are then used as document weights in the selection of additional patterns. These procedures will be described in more detail in [Yangarber 2000].

Because full patterns may not repeat sufficiently often to obtain reliable statistics, we have employed in our experiments 'generalized patterns' consisting of pairs from the original tuples: a subject-verb pair, a subject-object pair, etc.

Once we have identified pairs relevant to the extraction task, we can use them to build semantic word classes, by grouping tuples containing the same (task-relevant) pair. For example, if "X appoints person" is a relevant pair, we could collect all the values of X as a word class. This will allow us to build the semantic classes concurrently with the patterns.[4]

[4] [Riloff 1999] also reports work on creating semantic word classes from extraction patterns.
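To spell out the iteration in executable form, here is a compact Python sketch under our reading of the definitions above; score() is one plausible instantiation of "precision times the log of coverage", and the exact formulation appears in [Yangarber 2000]:

    # Sketch of one iteration of the pattern discovery loop, following
    # the definitions in the text. score() is our guess at a concrete
    # form of "precision * log(coverage)"; details are in [Yangarber 2000].
    import math

    def score(p, docs_with, rel):
        """docs_with[p] = set of documents containing pattern p (nonempty);
        rel[d] = current graded relevance of document d."""
        matched = docs_with[p]
        precision = sum(rel.get(d, 0.0) for d in matched) / len(matched)
        coverage = sum(1 for d in matched if rel.get(d, 0.0) > 0)
        return precision * math.log(coverage) if coverage > 1 else 0.0

    def update_relevance(rel, accepted, docs_with, doc_patterns):
        """Rel_{i+1}(d) = max(Rel_i(d), average of Rel_i(d') over the
        documents d' containing all accepted patterns that appear in d)."""
        new_rel = {}
        for d, patterns in doc_patterns.items():
            a = patterns & accepted          # A(d): accepted patterns in d
            if not a:
                new_rel[d] = rel.get(d, 0.0)
                continue
            # D(A(d)): documents containing every pattern in A(d);
            # this set always includes d itself, so it is nonempty.
            shared = set.intersection(*(docs_with[p] for p in a))
            avg = sum(rel.get(x, 0.0) for x in shared) / len(shared)
            new_rel[d] = max(rel.get(d, 0.0), avg)
        return new_rel

In the full procedure, update_relevance would be applied once per iteration, after each newly accepted pattern, and the resulting document weights feed back into score() for the next selection.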
Results

We report here on our initial results in applying this procedure to the "management succession" extraction task. This task involves the extraction of information on the job changes of corporate executives, and was the basis of the evaluation for Message Understanding Conference-6 [MUC-6]. We used two seed patterns for this task,

    company appoint-verb person
    person resign-verb

where appoint-verb consists of the four verbs {appoint, elect, promote, name} and resign-verb consists of the verbs {resign, depart, quit, step down}. We performed 50 iterations of our pattern discovery procedure.

We evaluated the effectiveness of our pattern discovery procedure in two ways: as a text filter and as a source of event extraction patterns.

We can treat the set of patterns as a document filter, where a document is selected if it matches any one of the patterns. We can then ask how effective the patterns are at selecting precisely those documents which are relevant to the topic of management succession (as defined for MUC-6). For this evaluation, we used a set of 250 documents which had been classified for task relevance (100 documents from the MUC evaluation and an additional 150 chosen using a conventional information retrieval system). The set of seed patterns filtered documents with a recall of 11% and a precision of 93%; the final set of patterns filtered documents with a recall of 80% and a precision of 72%. This is a large improvement in filtering performance, obtained through purely automatic enhancements.

Evaluating the patterns with respect to the final goal of event extraction is less direct, because additional information must be provided (at present, manually) before these patterns can be used within the extraction system: the patterns need to be extended to incorporate certain modifiers and paired with domain predicates. Patterns which do not correspond to domain predicates must be discarded (again, manually). In addition, they need to be combined with other patterns, such as those for the names of executive positions. The evaluation must therefore be seen as a preliminary attempt to isolate the gain achieved by the pattern discovery procedure.

We compared extraction results using four different pattern sets: the seed patterns described above; the seed patterns plus the patterns discovered by the automatic procedure; the patterns created manually over a period of one month for the MUC-6 evaluation; and the larger set of manually developed patterns which are used in our current system. All of these results were obtained on the official MUC-6 training corpus.

    Pattern base                  Recall (%)   Precision (%)   F
    Seed patterns                     28            78         41.32
    Seed + discovered patterns        51            76         61.18
    Manual (MUC-6)                    54            71         61.93
    Manual (current)                  69            79         73.91

The results indicate that the coverage of the automatically discovered patterns is comparable to that of the patterns developed by hand during the one-month MUC-6 evaluation. This is certainly an encouraging result in terms of our ability to rapidly develop patterns for new tasks, but it will need to be validated on additional extraction tasks.

References

[Appelt 1993] Doug Appelt et al. FASTUS: a finite-state processor for information extraction from real-world text. Proc. 13th Int'l Joint Conf. on Artificial Intelligence (IJCAI-93), 1172-1178.

[Brin 1998] Sergey Brin. Extracting patterns and relations from the World-Wide Web. Proc. 1998 Int'l Workshop on the Web and Databases (WebDB '98), March 1998.

[Califf 1998] Mary Elaine Califf. Relational Learning Techniques for Natural Language Information Extraction. PhD thesis, Univ. of Texas, Austin, August 1998.

[Fisher 1995] David Fisher, Stephen Soderland, Joseph McCarthy, Fangfang Feng, and Wendy Lehnert. Description of the UMass system as used for MUC-6. Proc. Sixth Message Understanding Conf., Morgan Kaufmann, 1995.

[Freitag 1999] Dayne Freitag and Andrew McCallum. Information extraction with HMMs and shrinkage. Proc. Workshop on Machine Learning and Information Extraction, AAAI-99, Orlando, FL, July 1999.

[Friedman 1995] C. Friedman, G. Hripcsak, W. DuMouchel, S. B. Johnson, and P. D. Clayton. Natural language processing in an operational clinical environment. Natural Language Engineering, 1(1): 83-108, 1995.

[Grishman 1997] Ralph Grishman. Information extraction: techniques and challenges. In Maria Teresa Pazienza, ed., Information Extraction, Lecture Notes in Artificial Intelligence, Springer-Verlag, 1997.

[Humphreys 2000] Kevin Humphreys, George Demetriou, and Robert Gaizauskas. Two applications of information extraction to biological science journal articles: enzyme interactions and protein structure. Proc. Pacific Symposium on Biocomputing 2000.

[Miller 1998] Scott Miller, Michael Crystal, Heidi Fox, Lance Ramshaw, Richard Schwartz, Rebecca Stone, Ralph Weischedel, and the Annotation Group. Algorithms that learn to extract information -- BBN: description of the SIFT system as used for MUC-7. Proc. Seventh Message Understanding Conf. (MUC-7). Available at http://muc.www.saic.com/

[MUC-6] Proc. Sixth Message Understanding Conf., Morgan Kaufmann, 1995.

[Riloff 1993] Ellen Riloff. Automatically constructing a dictionary for information extraction tasks. Proc. Eleventh National Conf. on Artificial Intelligence (AAAI-93), 811-816.

[Riloff 1996] Ellen Riloff. Automatically generating extraction patterns from untagged text. Proc. Thirteenth National Conf. on Artificial Intelligence (AAAI-96), 1044-1049.

[Riloff 1999] E. Riloff and R. Jones. Learning Dictionaries for Information Extraction by Multi-Level Bootstrapping. Proc. Sixteenth National Conf. on Artificial Intelligence (AAAI-99), Orlando, FL, 1999.

[Rindflesch 2000] Thomas Rindflesch, Lorraine Tanabe, John Weinstein, and Lawrence Hunter. EDGAR: Extraction of Drugs, Genes, and Relations from the Biomedical Literature. Proc. Pacific Symposium on Biocomputing 2000.

[Sasaki 1999] Yutaka Sasaki. Applying type-oriented ILP to IE rule generation. Proc. Workshop on Machine Learning and Information Extraction, AAAI-99, Orlando, FL, July 1999.

[Sekimizu 1998] Takeshi Sekimizu, Hyun Park, and Jun'ichi Tsujii. Identifying the Interaction between Genes and Gene Products Based on Frequently Seen Verbs in Medline Abstracts. Genome Informatics 1998, Tokyo, December 1998. Universal Academy Press.

[Soderland 1999] S. Soderland. Learning information extraction rules for semi-structured and free text. Machine Learning, 34, 233-272.

[Thompson 1999] C. Thompson, M. E. Califf, and R. J. Mooney. Active learning for natural language parsing and information extraction. Proc. Sixteenth Int'l Machine Learning Conf., 406-414.

[Yangarber 1997] Roman Yangarber and Ralph Grishman. Customization of information extraction systems. In Paola Velardi, ed., International Workshop on Lexically Driven Information Extraction, 1-11, Frascati, Italy, July 1997. Universita di Roma.
[Yangarber 2000] Roman Yangarber, Ralph Grishman, Pasi Tapanainen, and Silja Huttunen. Unsupervised discovery of scenario-level patterns for information extraction. To appear in Proc. Conf. Applied Natural Language Processing, 2000.