G22.2590 - Natural Language Processing - Spring 2001
Lecture 6 Outline
Prof. Grishman
February 21, 2001
Partial Parsers, cont’d
Part-of-speech tagging:
English words are quite ambiguous with regard to part of speech. In full sentence
parsing, these ambiguities are (usually) resolved as part of the parsing process. In partial
parsing, other methods are required: part-of-speech taggers. The crudest
approximation is to gather statistics (from a large corpus) of the part-of-speech
distribution of each word, and then to assign each word (in new text we must analyze) its
most common part of speech. This crude guess is already more than 90% correct.
Improved guesses can then be obtained by using local context: either specific correction
rules, or statistics on how often one part of speech follows another. (Allen sec. 7.3).
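The "most common tag" baseline can be sketched in a few lines. This is an illustration only: the toy corpus, the tag names (DT, MD, NN, VBZ), and the noun fallback for unknown words are assumptions, not part of any particular tagger.

```python
from collections import Counter, defaultdict

def train(tagged_corpus):
    """tagged_corpus: iterable of (word, tag) pairs from a hand-tagged corpus."""
    counts = defaultdict(Counter)
    for word, tag in tagged_corpus:
        counts[word.lower()][tag] += 1
    # For each word, keep only its most frequent part of speech.
    return {w: c.most_common(1)[0][0] for w, c in counts.items()}

def tag(words, most_common_tag, default="NN"):
    # Words never seen in training get a default tag (noun is a common fallback).
    return [(w, most_common_tag.get(w.lower(), default)) for w in words]

# Tiny invented corpus: "can" occurs more often as a modal (MD) than a noun (NN).
corpus = [("the", "DT"), ("can", "MD"), ("can", "MD"), ("can", "NN"),
          ("rusts", "VBZ"), ("the", "DT")]
model = train(corpus)
print(tag("the can rusts".split(), model))
# "can" comes out MD even though NN is correct here -- exactly the kind of
# error that correction rules or tag-sequence statistics are meant to fix.
```

The example also shows why the baseline tops out around 90%: it can never get a word's less frequent tags right, no matter the context.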
JET incorporates a very simple tagging mechanism to resolve noun/verb ambiguities,
which we will use in later pattern sets.
Corpus-trained Parsing
A quite different approach to tackling the problems of grammar construction is to build
a grammar from a large, hand-annotated corpus. A treebank is a large collection of
sentences which have been parsed by hand. From these sentences one can automatically
derive a grammar in a straightforward manner. More interestingly, it is possible to derive
a probabilistic grammar, in which the probability of each production reflects the
frequency with which it was used in the corpus. This can then be used in a probabilistic
parser, which seeks the most probable sentence analysis. By ranking parses, it addresses
the ambiguity problems which would otherwise arise with a broad-coverage grammar.
(Allen sec. 7.5 – 7.6).
For newspaper texts, which have been the most extensively studied, parsing accuracy
of almost 90% can now be achieved.
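Deriving a probabilistic grammar from a treebank amounts to counting: each production's probability is its count divided by the count of all productions with the same left-hand side. A minimal sketch, assuming trees are represented as nested tuples (the two example trees are invented):

```python
from collections import Counter

# Trees as nested tuples: (label, child, child, ...); leaves are strings.
trees = [
    ("S", ("NP", "John"), ("VP", ("V", "sleeps"))),
    ("S", ("NP", "Mary"), ("VP", ("V", "reads"), ("NP", "books"))),
]

def productions(tree):
    """Yield (lhs, rhs) pairs for every production used in the tree."""
    label, *children = tree
    rhs = tuple(c[0] if isinstance(c, tuple) else c for c in children)
    yield (label, rhs)
    for c in children:
        if isinstance(c, tuple):
            yield from productions(c)

rule_counts = Counter(p for t in trees for p in productions(t))
lhs_counts = Counter()
for (lhs, rhs), n in rule_counts.items():
    lhs_counts[lhs] += n

# Relative-frequency estimate: P(lhs -> rhs) = count(lhs -> rhs) / count(lhs)
pcfg = {(lhs, rhs): n / lhs_counts[lhs] for (lhs, rhs), n in rule_counts.items()}
print(pcfg[("VP", ("V",))])       # 0.5: VP -> V occurs in 1 of 2 VP productions
print(pcfg[("S", ("NP", "VP"))])  # 1.0: every S in the treebank expands this way
```

A probabilistic parser then scores each candidate analysis by the product of its production probabilities and returns the highest-scoring parse.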
Semantic analysis
Syntactic analysis has made some relationships in the sentence explicit, and performed
some regularization of structure, but there is still much to be done to determine what a
sentence “means”. For example, the syntactic structure is based on words, which are
frequently ambiguous. It doesn’t indicate the quantificational structure of a sentence.
And it provides many representations for the same meaning. How much of this needs to
be resolved depends on the application, but almost all applications require some semantic
analysis.
Formal Languages for Meaning Representation
We want to transform the sentences into a language which
- is unambiguous,
- has simple rules of interpretation and inference, and in particular
- has a logical structure determined by its form.
These are the properties of the languages of logic. Actual systems may use different
representations, but they are generally equivalent to the formal language (extensions of
predicate calculus) we will use for presentation.
The simplest form is propositional logic, but it is not powerful enough for our purposes.
Predicate logic combines predicates and their arguments. [See Allen Appendix A for a
review of predicate calculus.] Predicate calculus is intended for representing “eternal
truths” (like the facts of mathematics). There are a number of notions common to natural
language which are not in predicate logic:
- generalized quantifiers (for ‘some’, ‘most’, …)
- tense (though we can add a time argument to each time-dependent predicate)
- modality and belief (need to allow formulas as arguments: “John believes Fred likes Mary” = believe(John,like(Fred,Mary)))
- presupposition (“All the men on Mars drink Coca-Cola.”)
- fuzziness (“The milk is warm.”)
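Formulas-as-arguments are easy to model computationally: a predicate application can be a nested structure whose arguments are either constants or further applications. A small sketch (the tuple encoding and `render` helper are inventions for illustration, not part of any standard system):

```python
def render(term):
    """Print a nested (predicate, arg, arg, ...) tuple in predicate-calculus notation."""
    if isinstance(term, tuple):
        pred, *args = term
        return pred + "(" + ",".join(render(a) for a in args) + ")"
    return term  # a constant such as John or Mary

# "John believes Fred likes Mary": a formula appears as an argument of believe.
lf = ("believe", "John", ("like", "Fred", "Mary"))
print(render(lf))  # believe(John,like(Fred,Mary))
```

Note that plain first-order predicate calculus does not allow this nesting; it is exactly the extension that modality and belief require.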
Context-independent meaning (Allen sec. 6.1)
We will first translate each sentence, independent of context, into an intermediate
meaning representation, called logical form. This logical form will then be interpreted, in
the context of the current situation and the prior discourse (if any), to produce a final
knowledge representation. The logical form will include indexicals, which are references
to objects in the context (‘the ball’, ‘Fred’, etc.).
Choice of predicates: the problem of word senses (Allen sec. 6.2)
Informally, we shall use words as predicates. This is really not satisfactory, however,
because most words have many senses. Thus “bank” may be a financial institution or the
side of a river. “know” may mean to know a fact or to be acquainted with a person. A
“sentence” can be a sequence of words or a term of confinement (“a 10-year sentence in
jail”). In such cases, we can imagine enumerating the senses and qualifying the word
by a sense number: bank1, sentence2, etc.
However, it’s not always easy to know “when to stop” in dividing senses. (Allen suggests
one criterion based on coreference (p. 232).) Consider, for example, “He walked through
the door.” and “He knocked on the door.” The two senses of “door” are clearly different,
but at the same time related (they would be one sense by Allen’s test). There are a
variety of such systematic polysemies, which are not accounted for by simply
enumerating senses. (This is discussed at length by Pustejovsky in his book The
Generative Lexicon.)
Assignment #6
Due March 7th
a. Allen Chapter 8, Exercise 1. (state why sentences are ambiguous)
b. Allen Chapter 8, Exercise 5. (write logical form)
c. Using predicate calculus or quasi-logical form, formalize the difference between
the following two sentences:
John read a book and Mary read it too.
John read a book and Mary read one too.
d. For assignment #5, you were asked to make one fix to the noun group / verb
group patterns. Evaluate this fix quantitatively:
- you will be provided with a Washington Square News article
- run it with and without your fix
- score it with the SGML scorer
- report the scores (and maybe fix the fix if the score went down!)
Instructions on running a document through JET and scoring it will be provided
next week.