G22.2590 - Natural Language Processing - Spring 2001
Lecture 6 Outline
Prof. Grishman
February 21, 2001

Partial Parsers, cont'd

Part-of-speech tagging: English words are quite ambiguous with regard to part of speech. In full sentence parsing, these ambiguities are (usually) resolved as part of the parsing process. In partial parsing, other methods are required ... part-of-speech taggers. The crudest approximation is to gather statistics (from a large corpus) on the part-of-speech distribution of each word, and then to assign each word (in new text we must analyze) its most common part of speech. This crude guess is already more than 90% correct. Improved guesses can then be obtained by using local context: either specific correction rules, or statistics on how often one part of speech follows another. (Allen sec. 7.3)

JET incorporates a very simple tagging mechanism to resolve noun/verb ambiguities, which we will use in later pattern sets.

Corpus-trained Parsing

A quite different approach to the problems of grammar construction is to build a grammar from a large, hand-annotated corpus. A treebank is a large collection of sentences which have been parsed by hand. From these sentences one can automatically derive a grammar in a straightforward manner. More interestingly, it is possible to derive a probabilistic grammar, in which the probability of each production reflects the frequency with which it was used in the corpus. This can then be used in a probabilistic parser, which seeks the most probable analysis of a sentence. By ranking parses, it addresses the ambiguity problems which would otherwise arise with a broad-coverage grammar. (Allen sec. 7.5 - 7.6) For newspaper texts, which have been the most extensively studied, parsing accuracy of almost 90% can now be achieved.
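Deriving a probabilistic grammar from a treebank amounts to counting productions: the probability of a rule LHS -> RHS is estimated as its relative frequency among all rules with that left-hand side. The sketch below illustrates this counting; the tuple encoding of trees and the toy two-sentence treebank are invented for illustration and are not the format of any real treebank.

```python
from collections import defaultdict

def count_productions(tree, counts):
    """Recursively count productions (LHS -> RHS) in one parse tree.

    A tree is a (label, child, child, ...) tuple; a leaf is a word string.
    (Hypothetical encoding, chosen only for this sketch.)
    """
    label, *children = tree
    rhs = tuple(c if isinstance(c, str) else c[0] for c in children)
    counts[label][rhs] += 1
    for c in children:
        if not isinstance(c, str):
            count_productions(c, counts)

def rule_probabilities(treebank):
    """Estimate P(LHS -> RHS) by relative frequency over the treebank."""
    counts = defaultdict(lambda: defaultdict(int))
    for tree in treebank:
        count_productions(tree, counts)
    probs = {}
    for lhs, rhss in counts.items():
        total = sum(rhss.values())
        for rhs, n in rhss.items():
            probs[(lhs, rhs)] = n / total
    return probs

# Two tiny hand-parsed sentences:
treebank = [
    ("S", ("NP", "John"), ("VP", ("V", "sleeps"))),
    ("S", ("NP", "Mary"), ("VP", ("V", "reads"), ("NP", "books"))),
]
probs = rule_probabilities(treebank)
# Here VP -> V and VP -> V NP each occur once, so each gets probability 0.5.
```

A probabilistic parser then scores a candidate parse by multiplying the probabilities of the rules it uses, and prefers the highest-scoring analysis.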
Semantic analysis

Syntactic analysis has made some relationships in the sentence explicit, and performed some regularization of structure, but there is still much to be done to determine what a sentence "means". For example:
- the syntactic structure is based on words, which are frequently ambiguous
- it doesn't indicate the quantificational structure of a sentence
- it provides many representations for the same meaning
How much of this needs to be resolved depends on the application, but almost all applications require some semantic analysis.

Formal Languages for Meaning Representation

We want to transform the sentences into a language which
- is unambiguous
- has simple rules of interpretation and inference, and in particular
- has a logical structure determined by its form
These are the properties of the languages of logic. Actual systems may use different representations, but they are generally equivalent to the formal language (extensions of predicate calculus) we will use for presentation.

The simplest form is propositional logic, but it is not powerful enough for our purposes. Predicate logic combines predicates and their arguments. [See Allen Appendix A for a review of predicate calculus.]

Predicate calculus is intended for representing "eternal truths" (like the facts of mathematics). There are a number of notions common in natural language which are not in predicate logic:
- generalized quantifiers (for 'some', 'most', ...)
- tense (though we can add a time argument to each time-dependent predicate)
- modality and belief (we need to allow formulas as arguments: "John believes Fred likes Mary" = believe(John,like(Fred,Mary)))
- presupposition ("All the men on Mars drink Coca-Cola.")
- fuzziness ("The milk is warm.")

Context-independent meaning (Allen sec. 6.1)

We will first translate each sentence, independent of context, into an intermediate meaning representation, called logical form.
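The nested predicate-argument notation used above, e.g. believe(John,like(Fred,Mary)), can be modeled as terms that may themselves contain formulas as arguments. This is a minimal sketch of such a term representation; it is not JET's actual logical form, and the helper names are invented for illustration.

```python
def pred(name, *args):
    """Build a predicate application as a nested tuple (hypothetical encoding)."""
    return (name, *args)

def to_string(term):
    """Render a term in conventional predicate-argument notation."""
    if isinstance(term, str):
        return term  # a constant such as 'John'
    name, *args = term
    return f"{name}({','.join(to_string(a) for a in args)})"

# "John believes Fred likes Mary": a formula appears as an argument of 'believe',
# which plain first-order predicate calculus does not allow.
lf = pred("believe", "John", pred("like", "Fred", "Mary"))
# to_string(lf) == "believe(John,like(Fred,Mary))"
```

A real logical form is richer than this (quantifiers, tense, indexicals); the sketch only illustrates how formulas can appear as arguments.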
This logical form will then be interpreted, in the context of the current situation and the prior discourse (if any), to produce a final knowledge representation. The logical form will include indexicals, which are references to objects in the context ('the ball', 'Fred', etc.).

Choice of predicates: the problem of word senses (Allen sec. 6.2)

Informally, we shall use words as predicates. This is really not satisfactory, however, because most words have many senses. Thus "bank" may be a financial institution or the side of a river; "know" may mean to know a fact or to be acquainted with a person; a "sentence" can be a sequence of words or a term of confinement ("a 10-year sentence in the jail"). In such cases, we can imagine enumerating the senses and qualifying each word by a sense number ... bank1, sentence2, etc. However, it is not always easy to know 'when to stop' in dividing senses. (Allen suggests one criterion based on coreference (p. 232).) Consider, for example, "He walked through the door." and "He knocked on the door." The two senses of "door" are clearly different, but at the same time related (they would count as one sense by Allen's test). There are a variety of such systematic polysemies, which are not accounted for by simply enumerating senses. (This is discussed at length by Pustejovsky in his book The Generative Lexicon.)

Assignment #6 (due March 7th)

a. Allen, Chapter 8, Exercise 1. (state why the sentences are ambiguous)
b. Allen, Chapter 8, Exercise 5. (write the logical form)
c. Using predicate calculus or quasi-logical form, formalize the difference between the following two sentences:
   John read a book and Mary read it too.
   John read a book and Mary read one too.
d. For assignment #5, you were asked to make one fix to the noun group / verb group patterns.
Evaluate this fix quantitatively:
- you will be provided with a Washington Square News article
- run it with and without your fix
- score it with the SGML scorer
- report the scores (and maybe fix the fix if the score went down!)
Instructions on running a document through JET and scoring it will be provided next week.