
Extraction Pattern Discovery through Corpus Analysis

Roman Yangarber and Ralph Grishman

Computer Science Department

New York University

New York, NY 10003

{yangarber, grishman}@cs.nyu.edu

Abstract

If we examine texts concerning different topics or subject domains, one way in which they can be characterized is by the classes of entities, and the relations between these entities, which are expressed in the texts. Such characterizations provide a valuable complement to the syntactic and lexical analyses which are primarily used in corpus linguistics. These characterizations are also crucial to the development of information extraction systems for new topics and domains. To develop a new extraction system, we must determine what relations need to be captured, and — as comprehensively as possible — how these relations are expressed. We describe in this paper a method for acquiring this information automatically from unannotated text, through a bootstrapping procedure which identifies both a set of texts about a topic and a set of relations which occur primarily in these texts. We present several examples of topics and retrieved relations, as well as an evaluation in terms of text filtering.

The call for participation for this workshop, “Information Extraction meets Corpus Linguistics,” notes that “The goals of information extraction and corpus linguistics have thus far had little in common.” It goes on to observe that corpus linguistics aims to understand the differences among texts in heterogeneous collections, largely through statistical analysis of lexical, syntactic, and discourse structures (Biber et al., 1998). Information extraction, on the other hand, has generally been applied to more homogenous collections, and focused on the “engineering” problems of developing finite-state patterns to extract the desired information.

There has been, however, a thread of research in computational linguistics in which these two areas have been closely interwoven — a thread that goes back to the 1950's, almost the beginnings of computational linguistics. In this paper, we shall briefly look at this research thread, and see how it connects to the new interest in machine learning for information extraction. In particular, we shall describe a linguistic pattern discovery procedure which can serve a role both in information extraction and in corpus linguistics.

Sublanguages and Relations

Information extraction covers a range of tasks, including name identification and classification, entity tracking, and the capture of relations and events. In this paper we shall focus primarily on the extraction of relations and events.

Corpus linguistics aims to identify differences in language characteristics between texts — differences which can be attributed to differences in authorship, differences in purpose, and differences in the domain of discourse. When differences in domain or purpose lead to clear-cut differences in language, the different forms of language are sometimes referred to as sublanguages (Grishman & Kittredge, 1986). We speak, for example, of the sublanguage of weather reports, the sublanguage of a class of medical reports, or a science sublanguage. One characteristic of such domain-based sublanguages is that they talk about different, sublanguage-specific sets of concepts, relations, and events. Weather reports will discuss temperature and precipitation; medical reports will discuss symptoms, diagnoses, and treatments.

Zellig Harris (1968) provided a formal description of such sublanguages: they are characterized by a set of sublanguage word classes, and constraints on the way in which these classes can be combined in sentences to express relations and events. Thus, for example, medical reports may have a class of symptom nouns, N_symptom, and a class of disease nouns, N_disease, which can be combined in the form “N_symptom indicates N_disease”, but not “N_disease indicates N_symptom”. By cataloging the relations of a sublanguage, Harris suggested, it was possible to create a formal structure for the information content of the texts.

Information extraction can be seen as the task of pulling a single, externally specified relationship from some texts. In a more expansive view, however, we can envision analyzing a sublanguage and then extracting the principal relationships expressed in this sublanguage. This may be a small set for weather reports, but potentially a larger and quite complex set for a science sublanguage. Harris (1958) indicated how such extraction procedures, if they could be implemented, could have a major effect on access to the scientific literature. The information processing and language processing technology of the time was not ready for such applications, but the ideas advanced then are still appropriate today.

Starting with these initial proposals, efforts were made to apply these techniques to several types of text, both scientific articles and later medical reports (Sager et al., 1987). This was initially done by hand, but later automated in a procedure called “information formatting.”

This approach raised the issue of how the relations can be identified in a systematic way from a sample of text in the corpus. Again, the work was initially done by hand, identifying word classes and then aligning different sentence structures involving the same classes of words.

Some of these steps were then automated, using the limited sublanguage corpora then available.

Distributional analysis, applied to parsed text, was used to identify the major sublanguage word classes (Hirschman et al., 1975). Once the word classes had been identified, a statistical analysis of clause and noun phrase patterns was used to identify the dominant sublanguage relationships (Grishman et al., 1986).

If, following Harris, we define a sublanguage as a set of texts obeying certain constraints on co-occurrence — only certain classes of words appear in particular sentential relations — then we face some circularity in the discovery procedure: we need to identify the texts to find the relations, and need to identify the relations to find the texts. In a later section, we consider how a discovery procedure can operate by discovering texts and relations in parallel.

Extraction and Pattern Discovery

Over the past decade there has been a rapid growth in interest in information extraction. This has been prompted in part by opportunity — the explosive growth in on-line text data available through the web. It has also been driven by the recognition that simple, robust language analyzers — including, in particular, finite state transducers — can be quite effective for extraction purposes. Extraction system development has therefore become, in large part, a task of developing the (sometimes large) set of linguistic patterns required to recognize the events of interest. Limitations of extraction performance and concerns about the cost of porting systems to new domains have focused attention on the problem of pattern discovery.

The challenge here is to find all the possible linguistic expressions of information relevant to a particular specified relation or event. For a given event type, we can quickly think of the most common forms, but finding the rarer forms — the long "tail" in the distribution of expressions — is much harder, and is a primary source of the limitations in performance of extraction systems.

The “traditional” approach to IE system development has involved a manual analysis of a modest-size corpus, looking for relevant expressions and reducing them to linguistic patterns. The difficulty of developing large pattern sets by hand has led, over the past few years, to the development of supervised learning methods (Califf, 1998; Sasaki, 1999; Soderland, 1999). These methods rely on corpora which have been annotated, by hand, with the information to be extracted. Using such a corpus, a system can develop a pattern corresponding to each annotated example, and can then attempt generalizations, combining individual (specific) patterns into more general ones. While annotation of individual examples (marking the information to be extracted) may require less skill than linguistic analysis, there remains the burden of annotating hundreds or even thousands of documents in order to collect a range of the rarer linguistic expressions.

One alternative to such extensive annotation is the use of statistical methods to identify combinations of words or word classes which are likely to be relevant to the task. This can be seen as an analog of the sublanguage studies described earlier, although focused more narrowly on a specific extraction task. One such experiment was conducted by Riloff (1996) using a corpus of (translated) foreign broadcast news reports. The goal was to extract information about terrorism, and each article was classified (by hand) to indicate whether or not it was relevant to the extraction task. Each article was decomposed into a set of patterns (such as subject-verb pairs, verb-object pairs, etc.), and then the patterns were ranked based on their frequency in the relevant articles, and their relative frequency in the relevant articles when compared to their frequency in the entire corpus. By selecting high-ranking patterns, it was possible to create a pattern set which performed almost as well at extracting selected information from the texts as a manually constructed pattern set.

Bootstrapping

The approach just described still requires manual relevance judgements for a substantial corpus (for the terrorist corpus, Riloff used 1500 texts, of which about 50% were relevant). While manual relevance judgements are less labor-intensive than full event annotation, they are still a significant burden when dealing with a large collection.

We have recently conducted a series of experiments which acquire such relations without any pre-annotation of the corpus, and thus can take advantage of much larger collections of tens or hundreds of thousands of documents.

This approach requires only a small set of seed patterns to start, and builds the set of patterns and the corpus of relevant articles in tandem. Our method is based on the observation that the patterns of interest tend to co-occur at the document level: a document containing one pattern about a topic is more likely to contain other relevant patterns than a document taken at random from the collection. This is the analog at the level of patterns of techniques of word co-occurrence analysis, which have been used for many years to build word clusters and thesauri for information retrieval.

To begin our discovery procedure, we select a small set of “seed” patterns typical of the class of events of interest. For example, to extract information about the hiring and firing of personnel (the MUC-6 “management succession” task), one of the seed patterns was “company appointed person as president”. The system labels as relevant all the documents containing one of these patterns. Then, for each clause pattern appearing in the relevant documents, it determines its frequency in the relevant documents and in the rest of the collection, and ranks the patterns based both on the frequency in the relevant documents and on the ratio of frequencies. The system selects the highest-ranking pattern, adds it to the set of patterns, and repeats the process (retrieving a new set of relevant documents).

In the remainder of this section we describe the steps of this process, and an evaluation for one extraction task; further details are available in (Yangarber et al., 2000).
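The iteration just outlined can be sketched in a few lines of Python. This is an illustrative sketch only, not the authors' implementation: the document representation (a mapping from document id to the set of clause tuples it contains), the function names, and the cutoffs are all assumptions made here.

```python
import math

def score(pattern, docs_with, relevant):
    """Riloff-style rank: coverage of relevant documents, weighted by log overlap."""
    overlap = len(docs_with[pattern] & relevant)
    if overlap < 2:                           # a single relevant occurrence is too rare
        return 0.0
    return overlap / len(docs_with[pattern]) * math.log(overlap)

def bootstrap(documents, seeds, iterations=80, max_df=0.10):
    """Grow a pattern set from seeds; `documents` maps doc_id -> set of patterns."""
    docs_with = {}                            # pattern -> ids of documents containing it
    for doc_id, tuples in documents.items():
        for t in tuples:
            docs_with.setdefault(t, set()).add(doc_id)

    patterns = {s for s in seeds if s in docs_with}
    for _ in range(iterations):
        # every document matched by an accepted pattern counts as relevant
        relevant = set().union(*(docs_with[p] for p in patterns))
        candidates = [
            p for p in docs_with
            if p not in patterns
            and docs_with[p] & relevant                       # occurs in a relevant doc
            and len(docs_with[p]) <= max_df * len(documents)  # not too frequent
        ]
        scored = [(score(p, docs_with, relevant), p) for p in candidates]
        best_score, best = max(scored, default=(0.0, None))
        if best is None or best_score <= 0.0:
            break
        patterns.add(best)                    # accept one pattern, then re-retrieve
    return patterns
```

On a toy collection where a seed pattern and a second pattern co-occur in the same documents, the second pattern is discovered in one iteration, while patterns confined to other documents are never pulled in.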

Method

We begin by preprocessing the corpus. The text is first processed by a ‘named entity’ tagger which identifies names of people, organizations, and locations. Then a parser is used to extract all the clauses from each document;[1] for each clause we build a tuple consisting of the head of the subject, the verb, the head of the object, locative and temporal modifiers, and certain other verb arguments. Different clause structures, including passives and relative clauses, are normalized to produce tuples with a uniform structure. These tuples will be the basis for our pattern extraction process.[2]

Next, we create a small set of ‘seed’ patterns relevant to the extraction task. These patterns will be used to initiate the discovery process.
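The clause tuples described above might be represented as follows. The field inventory follows the text (subject head, verb, object head, locative and temporal modifiers); the passive-normalization rule shown is an illustrative assumption about the kind of regularization applied, not the parser's actual interface.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class ClauseTuple:
    subject: Optional[str]          # head noun of the subject
    verb: str                       # base form of the main verb
    obj: Optional[str] = None       # head noun of the direct object
    locative: Optional[str] = None  # e.g. a "in ..." modifier head
    temporal: Optional[str] = None  # e.g. a "on Tuesday" modifier head

def normalize_passive(surface_subject: str, verb: str, by_agent: str) -> ClauseTuple:
    """Map a passive clause ("X was appointed by Y") onto the same
    uniform tuple as its active counterpart ("Y appointed X")."""
    return ClauseTuple(subject=by_agent, verb=verb, obj=surface_subject)
```

Under this sketch, both “IBM appointed Smith” and “Smith was appointed by IBM” yield the single tuple ClauseTuple("IBM", "appoint", "Smith").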

We retrieve all documents from the corpus containing at least one seed pattern. These form our initial set of relevant documents, R. Each tuple p which appears in some document in R is then ranked using the following metric:

    L(p) = |H(p) ∩ R| / |H(p)| · log |H(p) ∩ R|

where H(p) is the set of documents containing tuple p.

(This is similar to the metric used by Riloff (1996).) Very frequent tuples, occurring in over 10% of the documents, are rejected as uninformative, as are rare patterns which appear only once among the relevant documents. The top-ranked tuple based on this metric is added to the set of patterns and the process is then repeated.
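The ranking metric, together with the two cutoffs from the text (over-10% document frequency, single relevant occurrence), can be computed directly. The function name and argument layout below are our own; this is a sketch, not the authors' code.

```python
import math

def pattern_score(h_p, relevant, n_docs, max_df=0.10):
    """Rank a candidate tuple p given H(p), the set of documents
    containing p, and R (`relevant`), the current relevant set."""
    overlap = len(h_p & relevant)        # |H(p) ∩ R|
    if len(h_p) > max_df * n_docs:       # too frequent: uninformative
        return 0.0
    if overlap <= 1:                     # appears only once among relevant docs
        return 0.0
    return overlap / len(h_p) * math.log(overlap)
```

For a tuple occurring in four documents, three of them relevant, in a 1,000-document collection, the score is 3/4 · log 3 ≈ 0.82.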

The discovery procedure takes into account our confidence about the relevance of discovered patterns and retrieved documents. Each pattern is assigned a ‘precision’ measure between 0 and 1, and each document is assigned a relevance score between 0 and 1. Seed patterns have a precision of 1, and documents matched by these seed patterns have a relevance of 1. For discovered patterns, the precision is the average relevance measure of the documents which contain that pattern. For documents which are added to the set of relevant documents in subsequent iterations, the relevance score is determined by the precision of the patterns which appear in that document. The details are presented in (Yangarber et al., 2000).
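This confidence propagation can be sketched as below. Pattern precision as the mean relevance of the pattern's documents follows the text directly; how several pattern precisions combine into one document relevance score is detailed in (Yangarber et al., 2000), and taking the maximum matching precision here is an assumption made purely for illustration.

```python
def pattern_precision(docs_of_pattern, relevance):
    """Precision of a discovered pattern: the average relevance of the
    documents containing it."""
    return sum(relevance[d] for d in docs_of_pattern) / len(docs_of_pattern)

def document_relevance(patterns_in_doc, precision):
    """Illustrative combination rule (an assumption): the relevance of a
    newly added document is the precision of its best matching pattern."""
    return max((precision[p] for p in patterns_in_doc), default=0.0)
```

Seeds start with precision 1 and their documents with relevance 1; as the sets grow, later patterns and documents receive attenuated scores.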

[1] We have used the parser from Conexor oy, Helsinki, Finland, and we gratefully acknowledge their assistance in our research.

[2] We expect later to extend this approach to noun phrase patterns as well.

The pattern selection process is somewhat more complicated than just described. In selecting patterns, we consider as candidates only generalized patterns, in which one or more of the slots are filled with ‘wild cards.’ In computing the score of a generalized pattern, we do not take into consideration all possible values of the wild card; instead, we constrain the wild card to those values which themselves in turn produce patterns with high scores. The effect of this process is evident in some of the discovered patterns listed below, where some of the slot positions are filled with sets of values.

This process is a very selective form of distributional analysis (where we only use task-relevant patterns as context), and the sets of values obtained are natural candidates for domain-specific noun and verb classes, akin to those obtained in (Riloff & Jones, 1999). We do not presently make use of these classes in subsequent iterations of the discovery procedure, but we intend to do so in the near future.
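One way to picture the slot generalization: replace one slot of a clause tuple with a wild card, then restrict the wild card to those fillers whose own (ungeneralized) patterns score well. In this sketch, `score_of` stands in for the ranking metric and the 0.5 threshold is an illustrative assumption.

```python
WILD = "*"

def generalize(tuples, slot, score_of, threshold=0.5):
    """Group tuples differing only in `slot`; keep only high-scoring fillers."""
    groups = {}
    for t in tuples:
        key = t[:slot] + (WILD,) + t[slot + 1:]   # the generalized pattern
        groups.setdefault(key, set()).add(t[slot])
    return {
        key: {v for v in fillers
              if score_of(key[:slot] + (v,) + key[slot + 1:]) >= threshold}
        for key, fillers in groups.items()
    }
```

For example, given (company, appoint, person), (company, name, person), and (company, lose, person) with scores 0.9, 0.8, and 0.1, the verb slot of the generalized pattern (company, *, person) collapses to the high-scoring set {appoint, name}.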

Evaluation

We have tried this method initially to acquire patterns for the “management succession” task, which was used for MUC-6 (1995). We began with a seed containing the patterns:

Subject      Verb        Direct Object
C-company    C-appoint   C-person
C-person     C-resign

where C-company and C-person are the classes containing the named entities for companies and people. C-appoint is the set of verbs {appoint, elect, promote, name}; C-resign is the set {resign, depart, quit, step-down}. We used a collection of 6000 news articles from the Wall Street Journal, including those from the MUC-6 training corpus.

The discovered patterns cannot be used directly for information extraction; we need to add information about which predicate to associate with each pattern (that appoint involves starting a job, while remove involves leaving a job), and which position in the pattern to associate with each predicate argument. However, it is possible to evaluate the discovered patterns by themselves, automatically, by looking at their effectiveness at text filtering — at differentiating relevant from irrelevant documents.

To do so, we used a test corpus consisting of 100 MUC-6 formal training documents (which were included in the development corpus), plus another 150 documents picked at random from the main corpus and judged for relevance by hand. (Note that these relevance judgements were only used for evaluation, and not in the discovery procedure.)

The seed patterns by themselves retrieved 184 documents. Measured against the test corpus, they produced an initial document recall of 0.11 and a precision of 0.93. After 80 iterations of the discovery procedure, the set of patterns yielded a non-zero relevance for 982 documents. On the test corpus, it produced a text filtering recall of 0.78 with a precision of 0.80.[3] This is a high level of retrieval performance, considering that only automatic methods were used to expand the initial pattern set.
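The text-filtering scores above compare the documents the pattern set marks relevant against the hand relevance judgments; as a straightforward sketch:

```python
def filtering_scores(retrieved, relevant):
    """Return (recall, precision) of a retrieved document set measured
    against the set of documents judged relevant by hand."""
    hits = len(retrieved & relevant)
    recall = hits / len(relevant) if relevant else 0.0
    precision = hits / len(retrieved) if retrieved else 0.0
    return recall, precision
```

For instance, retrieving 5 documents of which 4 are among 10 judged relevant gives recall 0.4 and precision 0.8.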

It is also interesting to examine the patterns that were discovered. NYU produced an extraction system for MUC-6, developing the patterns by hand, and has since continued to extend its pattern set through the review of additional documents, so one might expect the pattern set to be relatively complete. Nonetheless, the discovery procedure yielded a number of patterns which were not present in the hand-coded system, including:[4]

person came [to company] [as officer]
person rejoined [company] [as officer]
person returned [as officer]
company brought person [in] [as officer]
person tapped person [to be officer]

This is evidence of the difficulty of preparing a pattern set with good coverage by hand, and of the value of this discovery procedure for information extraction.

Clusters of Relations

As we noted above, most extraction tasks define in advance a particular relation to be extracted. However, if we are given the goal of extracting all the information ‘about’ a given class of events, we may not be sure in advance of all the relations involved. The bootstrapping approach we have just described allows us to automatically build up a set of interconnected relations — the set of relations that co-occur in the articles about a given topic.

We have used the procedure just described to explore various news topics. In each case we created a small seed, and then collected the patterns discovered by our procedure.

The first topic was ‘natural disasters’. We defined a set of disasters,

disaster = {earthquake tornado flood hurricane landslide snowstorm avalanche}

and some seed patterns involving these disasters:

disaster caused ...
disaster {hit | destroyed | ravaged | damaged} {street | bridge | house | home | - }

[3] More details of this evaluation are reported in (Yangarber et al., 2000).

[4] Items in square brackets are constituents outside of the subject-verb-object pattern, included here for clarity; person and company refer to the corresponding classes of named entities. For ease of reading, patterns are shown with singular nouns and past tense verbs; the actual patterns involve base noun and verb forms.

The set of patterns discovered[5] included:

3. quake measured (number)

4. quake was felt

5. earthquake shook area

7. earthquake was centered

9. quake struck

11. quake was considered

12. quake occurred

14. quake knocked [out] power

22. storm knocked [out] power

23. damage was reported

28. aftershock injured –

29. aftershock killed people

30. the quake registered

38. it caused damage

40. quake {killed | injured} people

Here the number before the pattern indicates its relative position among the patterns discovered by our procedure.

One other topic we tried was “lawsuits”, for which the seed was

... {sued | litigated} organization

... brought suit

This small seed was sufficient to discover a large number of relevant patterns, including

2. person heard case

3. plaintiff sought damage

4. group charged [that ... ]

6. ... faced suit

7. company appealed

9. we marshalled argument

14. they recouped loss

15. they alerted regulator

16. company abetted violation

19. lawsuit alleged ...

20. suit sought ...

22. assets were frozen

23. {person government-org} filed suit

24. {person court} rejected argument

26. person ruled [that ...]

27. company denied {allegation charge wrongdoing}

28. company settled charge

30. court ruled

Good results were also found for several other topics, including corporate acquisitions, labor strikes, and terrorism, although for some topics the proportion of relevant relations among the discovered set was smaller.

[5] A larger corpus of 14,000 articles from the Associated Press was used for this topic.

While the current results are quite encouraging, we expect to improve these results by

 having human review of the patterns during the discovery procedure (in the fully automatic procedure, an irrelevant pattern discovered early on can bring in irrelevant documents and so lead to several other irrelevant patterns)

 using the noun and verb classes which are developed as part of the discovery procedure in order to aid further pattern discovery

We may not want to call the texts involving each of these topics a sublanguage, in particular since the boundaries between topics may not be so clear-cut, and some texts may involve several topics. However, we can observe, at a smaller scale, the same phenomena noted earlier for sublanguages: a set of texts which are characterized by the predominant use of a limited set of semantic relations.

Concluding remarks

As we look beyond surface phenomena, one of the basic ways in which texts differ is in:

- the relations that they express

- the linguistic constructs which they use to express these relations

These characteristics have been important for sublanguage studies and are properly part of the domain of corpus linguistics.

As information extraction sets more ambitious goals, it too must address these questions: for a given collection of texts, it must decide which relations to make explicit, and must identify all the relevant linguistic constructs for these relations. In this way, there is a clear overlap of corpus linguistics and information extraction. The methods which are being developed to analyze these corpus characteristics will play a central role in allowing us to automatically adapt extraction systems to new domains.

References

Biber, D., Conrad, S., and Reppen, R. (1998). Corpus Linguistics: Investigating Language Structure and Use. Cambridge University Press.

Califf, M. E. (1998). Relational Learning Techniques for Natural Language Information Extraction. PhD Thesis, Univ. of Texas, Austin.

Grishman, R., Hirschman, L., and Nhan, N. T. (1986). Discovery Procedures for Sublanguage Selectional Patterns: Initial Experiments. Computational Linguistics 12(3), 205-215.

Grishman, R., and Kittredge, R., eds. (1986). Analyzing Language in Restricted Domains: Sublanguage Description and Processing. Lawrence Erlbaum Assoc., Hillsdale, NJ.

Harris, Z. S. (1958). Linguistic transformations for information retrieval. Proceedings International Conference on Scientific Information.

Harris, Z. S. (1968). Mathematical Structures of Language. Wiley-Interscience, New York.

Hirschman, L., Grishman, R., and Sager, N. (1975). Grammatically-Based Automatic Word Class Formation. Information Processing and Management 11, 39-57.

MUC-6. (1995). Proc. Sixth Message Understanding Conf., Morgan Kaufmann.

Riloff, E. (1996). Automatically generating extraction patterns from untagged text. Proc. Thirteenth National Conf. on Artificial Intelligence (AAAI-96), 1044-1049.

Riloff, E. and Jones, R. (1999). Learning Dictionaries for Information Extraction by Multi-Level Bootstrapping. Proc. Sixteenth National Conf. on Artificial Intelligence (AAAI-99), Orlando, FL.

Sager, N., Friedman, C., and Lyman, M. (1987). Medical Language Processing: Computer Management of Narrative Data. Addison-Wesley, Reading, MA.

Sasaki, Y. (1999). Applying type-oriented ILP to IE rule generation. Proc. Workshop on Machine Learning and Information Extraction, AAAI-99, Orlando, FL.

Soderland, S. (1999). Learning information extraction rules for semi-structured and free text. Machine Learning 34, 233-272.

Yangarber, R., Grishman, R., Tapanainen, T., and Huttunen, S. (2000). Unsupervised discovery of scenario-level patterns for information extraction. Proc. Sixth Conf. Applied Natural Language Processing.
