
Introduction to Computational Natural
Language Learning
Linguistics 79400 (Under: Topics in Natural Language Processing)
Computer Science 83000 (Under: Topics in Artificial Intelligence)
The Graduate School of the City University of New York
Fall 2001
William Gregory Sakas
Hunter College, Department of Computer Science
Graduate Center, PhD Programs in Computer Science and Linguistics
The City University of New York
1
Someone shot the servant of the actress who was on the balcony.
Who was on the balcony and how did they get there?
Lecture 9:
Learning the "best" parse from corpora and tree-banks
READING :
Charniak, E. (1997) Statistical techniques for natural language parsing,
AI Magazine.
http://www.cs.brown.edu/people/ec/papers/aimag97.ps
This is a wonderfully easy-to-read introduction to how simple patterns in corpora can be used to
resolve ambiguities in tagging and parsing. (This is a must-read.)
Costa et al. (2001) Wide coverage incremental parsing by learning attachment
preferences
http://www.dsi.unifi.it/~costa/online/AIIA2001.pdf
A novel approach to learning parsing preferences that incorporates an artificial neural network.
Read Charniak first, but try to get started on this before the meeting after Thanksgiving.
2
Review: Context-free Grammars
1. S  -> NP VP
2. VP -> V NP
3. VP -> V NP NP
4. NP -> det N
5. NP -> N
6. NP -> det N N
7. NP -> NP NP

Example sentences:
The dog ate.
The diner ate seafood.
The boy ate the fish.
10 dollars a share.

Order of rules in a top-down parse with one-word look-ahead:
(see blackboard)
Just as with our first language models, ambiguity plays an important role.
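
As an aside (not from the original slides), here is one minimal way the toy grammar above could be written down in Python, as a mapping from each left-hand side to its possible right-hand sides; the names are illustrative only.

# A minimal sketch (illustration only): the toy CFG above as a Python dict.
toy_cfg = {
    "S":  [["NP", "VP"]],
    "VP": [["V", "NP"], ["V", "NP", "NP"]],
    "NP": [["det", "N"], ["N"], ["det", "N", "N"], ["NP", "NP"]],
}

# List every expansion of NP (rules 4-7 above):
for rhs in toy_cfg["NP"]:
    print("NP ->", " ".join(rhs))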
3
Salespeople sold the dog biscuits.
At least three parses. (See Charniak, see blackboard).
Wide-coverage parsers can generate hundreds of parses for every sentence.
See Costa et al. for some numbers from the Penn tree-bank. Most are
pretty senseless.
Traditionally, non-statistically-minded NLP engineering types thought of
disambiguation as a post-parsing problem.
Statistically-minded NLP engineering folk think of it more as a continuum: parsing
and disambiguation go together; it's just that some parses are more
reasonable than others.
4
POS tagging:
The     can     will    rust.
det     aux     aux     noun
        noun    noun    verb
        verb    verb
Learning algorithm I
Input for training: A pre-tagged training corpus.
1)
record the frequencies, for each word, of parts of speech. E.g.:
The   det   1,230
can   aux   534
      noun  56
      verb  6
etc.
2)
on an unseen corpus, apply, for each word, the most frequent POS
observed from step (1). For words with frequency 0 (they
didn't appear in the training corpus), guess proper-noun.
ACHIEVES 90% (in English)!
5
Learning algorithm I
Input for training: A pre-tagged training corpus.
1)
record the frequencies, for each word, of parts of speech.
The   det   1,230
can   aux   534
      noun  56
      verb  6
etc.
2)
on an unseen corpus, apply, for each word, the most frequent POS
observed from step (1). For words with frequency 0 (they didn't
appear in the training corpus), guess proper-noun.
ACHIEVES 90%! (But remember that totally unambiguous words like "the" are
relatively frequent in English, which pushes the number way up.)
Easy to turn frequencies into approximations of the probability that a
POS tag is correct given the word: p(t | w). For can these would be:
p( aux | can) = 534 / (534 + 56 + 6) = .90
p(noun | can) = 56 / (534 + 56 + 6) = .09
p(verb | can) = 6 / (534 + 56 + 6) = .01
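
To make this concrete, here is a minimal Python sketch (not from the slides) of Learning Algorithm I; the function names are illustrative, and the toy training data reuses the "can" counts from above.

from collections import Counter, defaultdict

def train(tagged_corpus):
    """tagged_corpus: an iterable of (word, tag) pairs."""
    counts = defaultdict(Counter)
    for word, tag in tagged_corpus:
        counts[word][tag] += 1
    return counts

def tag(words, counts):
    # Most frequent tag per word; back off to proper-noun for unseen words.
    return [counts[w].most_common(1)[0][0] if w in counts else "proper-noun"
            for w in words]

# The same counts also give the estimates of p(t | w) shown above:
counts = train([("can", "aux")] * 534 + [("can", "noun")] * 56 + [("can", "verb")] * 6)
total = sum(counts["can"].values())
for t, c in counts["can"].items():
    print("p(%s | can) = %.2f" % (t, c / total))   # .90, .09, .01
print(tag(["can", "Secretariat"], counts))          # ['aux', 'proper-noun']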
6
Notation for the tag t (out of all possible t's) that maximizes the
probability of t given the current word w_i under consideration:

\arg\max_{t} \; p(t \mid w_i)
\arg\max_{t} \; p(t \mid \text{``can''}) = the tag t that gives the maximum of
p(aux | can)  = 534 / (534 + 56 + 6) = .90
p(noun | can) = 56 / (534 + 56 + 6) = .09
p(verb | can) = 6 / (534 + 56 + 6) = .01
= aux
Extending the notation to a sequence of tags:

\arg\max_{t_{1,n}} \prod_{i=1}^{n} p(t_i \mid w_i)
This means: give the sequence of tags t_{1,n} that maximizes the product of
the probabilities that each tag is correct for its word.
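
One point worth making explicit (my gloss, not the slide's): because each factor p(t_i | w_i) depends only on its own word, the product is maximized simply by picking the best tag for each word independently, i.e. this is exactly Learning Algorithm I. A tiny sketch with illustrative numbers:

# Illustration only: per-word argmax maximizes the product when the factors
# are independent.  The "can" row uses the counts above; "will" is assumed.
p_tag_given_word = {
    "can":  {"aux": 0.90, "noun": 0.09, "verb": 0.01},
    "will": {"aux": 0.80, "noun": 0.15, "verb": 0.05},
}

def best_tags(words):
    return [max(p_tag_given_word[w], key=p_tag_given_word[w].get) for w in words]

print(best_tags(["can", "will"]))   # ['aux', 'aux']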
7
Hidden Markov Models (HMM)
An HMM has two kinds of parameters: emission probabilities p(w_i | t_i) and
transition probabilities p(t_i | t_{i-1}).

[Figure: a fragment of a tagging HMM drawn as a state graph over the tags det,
adj, and noun. The arcs carry transition probabilities p(t_i | t_{i-1}), e.g.
p(adj | det); each state carries emission probabilities p(w_i | t_i):
det: a .245, the .586; adj: large .004, small .005; noun: house .001, stock .001.
The transition values shown in the figure are .218, .0016, .45, and .475.]
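
As a data-structure sketch (not from the slides), the fragment in the figure could be stored as two Python tables; the emission numbers are the ones in the figure, but the pairing of the transition values with particular arcs is my assumption.

emissions = {                     # p(w_i | t_i), values from the figure
    "det":  {"a": 0.245, "the": 0.586},
    "adj":  {"large": 0.004, "small": 0.005},
    "noun": {"house": 0.001, "stock": 0.001},
}
transitions = {                   # p(t_i | t_{i-1}); arc assignments assumed
    ("det", "adj"):  0.218,
    ("det", "noun"): 0.475,
    ("adj", "noun"): 0.45,
}

def sequence_probability(tags, words):
    """Product of the emissions and of the transitions between adjacent tags."""
    p = emissions[tags[0]].get(words[0], 0.0)
    for (t1, t2), w in zip(zip(tags, tags[1:]), words[1:]):
        p *= transitions.get((t1, t2), 0.0) * emissions[t2].get(w, 0.0)
    return p

print(sequence_probability(["det", "adj", "noun"], ["the", "large", "house"]))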
8
• Secretariat is expected to race tomorrow
• People continue to inquire the reason for the race for outer
space
Consider:
to/TO
race/????
the/Det race/????
• The naive Learning Algorithm I would simply assign the most
probable tag, ignoring the preceding word, which would
obviously be wrong for one of the two sentences above.
9
Consider the first sentence, with the "to race."
The (simple bigram) HMM model would choose the greater of these
two probabilities:
p(Verb | TO) p(race | Verb)
p(Noun | TO) p(race | Noun)
Let's look at the first expression:
How likely are we to find a verb given the previous tag TO?
From a training corpus (or corpora) we can calculate that a verb following the
tag TO is roughly 16 times more likely than a noun:
p(Verb | TO) = .34
p(Noun | TO) = .021
The second expression:
Given each tag (Verb and Noun), ask "if we were expecting the tag Verb, how
likely is the lexical item to be 'race'?" and "if we were expecting the tag
Noun, how likely is the lexical item to be 'race'?" I.e., we want the
likelihoods:
p(race | Verb) = .00003 and p(race | Noun) = .00041
10
Putting them together:
The bigram HMM correctly predicts that race should be a Verb,
despite the fact that race as a Noun is more common:
p(Verb | TO) p(race|Verb) = 0.00001
p(Noun | TO) p(race|Noun) = 0.000007
So a bigram HMM tagger chooses the tag sequence that
maximizes (it's easy to increase the number of tags looked at):
p(word | tag) p(tag | previous tag)
A bit more formally, for a single word:

t_i = \arg\max_{j} P(t_j \mid t_{j-1}) \, P(w_i \mid t_j)

and, over the whole sequence:

\arg\max_{t_{1,n}} \prod_{i=1}^{n} p(t_i \mid t_{i-1}) \, p(w_i \mid t_i)
11
After some math (application of Bayes' theorem and the chain rule) and
two important simplifying assumptions (the probability of a word
depends only on its tag, and the previous tag alone is enough to
approximate the current tag),
we have, for a whole sequence of tags t_{1,n}, an HMM bigram model for the
predicted tag sequence, given words w_1 . . . w_n:
\arg\max_{t_{1,n}} \prod_{i=1}^{n} p(t_i \mid t_{i-1}) \, p(w_i \mid t_i)
12
We want the most likely path through this graph.

[Figure: a tag trellis for "the can will rust": one column of candidate tags
per word (DT for "the"; noun, aux, and verb candidates for the following
words), with arcs between candidates in adjacent columns.]
13
This is done by the Viterbi algorithm.
Accuracy of this method is around 96%.
But what if there is no training data from which to calculate the
likelihoods of tags and of words given tags?
We can estimate them using the forward-backward algorithm, though it
doesn't work too well without at least a small training set to get it
started.
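
The slides leave Viterbi to the reading; a minimal sketch in Python (names and toy tables are illustrative, not from the lecture) that keeps, for each tag, the best path ending in that tag:

def viterbi(words, tags, p_trans, p_emit, p_start):
    # best[t] = (probability of the best path ending in tag t, that path)
    best = {t: (p_start.get(t, 0.0) * p_emit.get((words[0], t), 0.0), [t])
            for t in tags}
    for w in words[1:]:
        new_best = {}
        for t in tags:
            prob, path = max((best[prev][0] * p_trans.get((t, prev), 0.0),
                              best[prev][1]) for prev in tags)
            new_best[t] = (prob * p_emit.get((w, t), 0.0), path + [t])
        best = new_best
    return max(best.values())     # (probability, best tag sequence)

# Toy usage for "the can" with made-up tables:
tags = ["det", "aux", "noun", "verb"]
p_start = {"det": 0.8, "noun": 0.2}
p_trans = {("aux", "det"): 0.3, ("noun", "det"): 0.6, ("verb", "det"): 0.1}  # p(t | prev)
p_emit = {("the", "det"): 0.6, ("can", "aux"): 0.01,
          ("can", "noun"): 0.008, ("can", "verb"): 0.001}
print(viterbi(["the", "can"], tags, p_trans, p_emit, p_start))   # -> (prob, ['det', 'noun'])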
14
PCFGs
Given that there can be many, many parses of a
sentence given a typical, large CFG, we should pick the
most probable parse as defined by:
Probability of a parse of a sentence = the product of
the probabilities of all the rules that were applied to
expand each constituent.
p(s, \pi) = \prod_{c \in \pi} p(r(c))

where p(s, \pi) is the probability of a parse \pi of sentence s, p(r(c)) is the
probability of expanding constituent c by the context-free rule r(c), and the
product runs over all constituents c in the parse \pi.
15
Some example "toy" probabilities attached to CF rules.
1. S  -> NP VP      (1.0)
2. VP -> V NP       (0.8)
3. VP -> V NP NP    (0.2)
4. NP -> det N      (0.5)
5. NP -> N          (0.3)
6. NP -> det N N    (0.15)
7. NP -> NP NP      (0.05)
How do we come by these probabilities? Simply count the number of times
they are applied in a tree-bank. For example, if the rule NP -> det N is
used 1,000 times, and overall NP -> X (i.e. any NP rule) is applied 2,000
times, then the probability of NP -> det N = 0.5.
Apply this to "Salespeople sold the dog biscuits" (see Charniak and blackboard).
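
A minimal sketch (illustration only) of both steps: estimating rule probabilities by relative frequency from tree-bank counts, and scoring a parse as the product of its rule probabilities. The counts other than NP -> det N are invented.

from collections import Counter, defaultdict

def estimate_rule_probs(rule_counts):
    """rule_counts: Counter mapping (lhs, rhs) -> times the rule is used."""
    lhs_totals = defaultdict(int)
    for (lhs, _), n in rule_counts.items():
        lhs_totals[lhs] += n
    return {rule: n / lhs_totals[rule[0]] for rule, n in rule_counts.items()}

def parse_probability(rules_used, rule_probs):
    p = 1.0
    for rule in rules_used:          # one rule application per constituent
        p *= rule_probs[rule]
    return p

# NP -> det N used 1,000 out of 2,000 NP expansions, as on the slide;
# all other counts are invented for illustration.
counts = Counter({("NP", ("det", "N")): 1000, ("NP", ("N",)): 600,
                  ("NP", ("det", "N", "N")): 300, ("NP", ("NP", "NP")): 100,
                  ("S", ("NP", "VP")): 500,
                  ("VP", ("V", "NP")): 400, ("VP", ("V", "NP", "NP")): 100})
probs = estimate_rule_probs(counts)
print(probs[("NP", ("det", "N"))])   # 0.5

# Probability of the toy parse [S [NP det N] [VP V [NP det N]]]:
print(parse_probability([("S", ("NP", "VP")), ("NP", ("det", "N")),
                         ("VP", ("V", "NP")), ("NP", ("det", "N"))], probs))   # 0.2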
16
Basic tree-bank grammars parse surprisingly well (at around 75%),
but often mis-predict the correct parse (according to humans).
The most troublesome report may be the August merchandise
trade deficit due out tomorrow.
Tree-bank grammar gets the (incorrect) reading:
His worst nightmare may be [the telephone bill due] over $200.
deficit/N due/ADJ out/PP tomorrow/NP
Trees on blackboard, or you can construct them from Charniak.
17
Preferences depend on many factors
• On the type of verb:
The women kept the dogs on the beach
The women discussed the dogs on the beach
Kept:
(1) Kept the dogs which were on the beach
(2) Kept them (the dogs), while on the beach
Discussed:
(1) Discussed the dogs which were on the beach
(2) Discussed them (the dogs), while on the beach
18
Verb-argument relations
• Subcategorization
• But the verb also selects for the type of the Prepositional Phrase
(Hindle and Rooth)
• Or, even more deeply, seems to depend on the frequency of
semantic associations:
The actress delivered flowers threw them in the trash
The postman delivered flowers threw them in the trash
19
On ‘selectional’ restrictions
• "walking on air"; "skating on ice", vs. "eating on ice"
• Verb takes a certain kind of argument
• Subject sometimes must be of a certain type:
John admires honesty vs. ?? Honesty admires John
20