NYU
Part-of-Speech Tagging
CSCI-GA.2590 – Lecture 4
Ralph Grishman
Parts of Speech
Grammar is stated in terms of parts of speech
(‘preterminals’):
– classes of words sharing syntactic properties:
noun
verb
adjective
…
POS Tag Sets
Most influential tag sets were those defined for
projects to produce large POS-annotated corpora:
• Brown corpus
– 1 million words from a variety of genres
– 87 tags
• UPenn Tree Bank
– initially 1 million words of Wall Street Journal
– later retagged Brown
– first POS tags, then full parses
– 45 tags (some distinctions captured in parses)
The Penn POS Tag Set
• Noun categories
• NN (common singular)
• NNS (common plural)
• NNP (proper singular)
• NNPS (proper plural)
Penn POS tags
• Verb categories
• VB (base form)
• VBZ (3rd person singular present tense)
• VBP (present tense, other than 3rd person singular)
• VBD (past tense)
• VBG (present participle)
• VBN (past participle)
some tricky cases
• present participles which act as prepositions:
– according/JJ to
• nationalities:
– English/JJ cuisine
– an English/NNP sentence
• adjective vs. participle
– the striking/VBG teachers
– a striking/JJ hat
– he was very surprised/JJ
– he was surprised/VBN by his wife
Tokenization
• any annotated corpus assumes some
tokenization
• relatively straightforward for English
– generally defined by whitespace and punctuation
– treat negative contraction as separate token:
do | n’t
– treat possessive as separate token: cat | ’s
– do not split hyphenated terms: Chicago-based
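A minimal sketch of these conventions in Python; the function name and regular expressions are illustrative, and the real Treebank tokenizer has many more rules:

```python
import re

def tokenize(text):
    """Whitespace/punctuation tokenizer sketch: splits off "n't" and
    possessive "'s" as separate tokens, keeps hyphenated terms whole."""
    # negative contractions: "don't" -> "do", "n't"
    text = re.sub(r"\b(\w+)(n't)\b", r"\1 \2", text)
    # possessives: "cat's" -> "cat", "'s"
    text = re.sub(r"(\w)('s)\b", r"\1 \2", text)
    # detach sentence punctuation (intra-word hyphens are untouched)
    text = re.sub(r"([.,!?;:()\"])", r" \1 ", text)
    return text.split()

print(tokenize("The Chicago-based firm doesn't own the cat's toys."))
# ['The', 'Chicago-based', 'firm', 'does', "n't", 'own',
#  'the', 'cat', "'s", 'toys', '.']
```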
the Tagging Task
Task: assigning a POS to each word
• not trivial: many words have several tags
• dictionary only lists possible POS, independent of
context
• how about using a parser to determine tags?
– some analyzers (e.g., partial parsers) assume the input is already tagged
Why tag?
• POS tagging can help parsing by reducing
ambiguity
• Can resolve some pronunciation ambiguities
for text-to-speech (“desert”)
• Can resolve some semantic ambiguities
Simple Models
• Natural language is very complex
– we don't know how to model it fully,
so we build simplified models which provide some
approximation to natural language
Corpus-Based Methods
How can we measure 'how good' these models
are?
• we assemble a text corpus
• annotate it by hand with respect to the
phenomenon we are interested in
• compare it with the predictions of our model
– for example, how well the model predicts part-of-speech tags or syntactic structure
Preparing a Good Corpus
• To build a good corpus
– we must define a task people can do reliably
(choose a suitable POS set, for example)
– we must provide good documentation for the task
• so annotation can be done consistently
– we must measure human performance (through
dual annotation and inter-annotator agreement)
• Often requires several iterations of refinement
Training the model
How to build a model?
– need a goodness metric
– train by hand, by adjusting rules and analyzing errors (e.g., Constraint Grammar)
– train automatically
• develop new rules
• build probabilistic model (generally very hard to do by
hand)
• choice of model affected by ability to train it (NN)
The simplest model
• The simplest POS model considers each word
separately:
$$P(T \mid W) = \prod_i P(t_i \mid w_i)$$
• We tag each word with its most likely part-of-speech
– this works quite well: about 90% accuracy when trained and tested on similar texts
– although many words have multiple parts of speech, one POS typically dominates within a single text type
• How can we take advantage of context to do better?
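A minimal sketch of this baseline (function and variable names are illustrative): record tag counts per word from a tagged corpus, then always emit each word's most frequent tag, ignoring context entirely.

```python
from collections import Counter, defaultdict

def train_unigram(tagged_sentences):
    """tagged_sentences: list of sentences, each [(word, tag), ...]."""
    counts = defaultdict(Counter)
    for sent in tagged_sentences:
        for word, tag in sent:
            counts[word][tag] += 1
    # keep only the most frequent tag for each word
    return {w: c.most_common(1)[0][0] for w, c in counts.items()}

def tag_unigram(words, model, default="NN"):
    # unseen words fall back to a default tag (NN is a common choice)
    return [(w, model.get(w, default)) for w in words]
```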
A Language Model
• To see how we might do better, let us consider
a related problem: building a language model
– a language model can generate sentences
following some probability distribution
Markov Model
• In principle each word we select depends on
all the decisions which came before (all
preceding words in the sentence)
• But we’ll make life simple by assuming that
the decision depends on only the immediately
preceding decision
• [first-order] Markov Model
• representable by a finite state transition network
• Tij = probability of a transition from state i to state j
Finite State Network
[Figure: finite-state transition network with states start, dog (says “woof”), cat (says “meow”), and end; arcs labeled with transition probabilities between 0.30 and 0.50]
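A sketch of generating from such a model; the transition table below is an illustrative reading of the figure (its exact arcs are ambiguous in this rendering), with each state's outgoing probabilities summing to 1.

```python
import random

# T[i][j] = probability of a transition from state i to state j
T = {
    "start": {"dog": 0.50, "cat": 0.50},
    "dog":   {"dog": 0.30, "cat": 0.30, "end": 0.40},
    "cat":   {"dog": 0.30, "cat": 0.30, "end": 0.40},
}
SAYS = {"dog": "woof", "cat": "meow"}

def generate():
    """Random walk from start to end, emitting one word per state."""
    state, output = "start", []
    while True:
        nxt = random.choices(list(T[state]),
                             weights=list(T[state].values()))[0]
        if nxt == "end":
            return output
        output.append(SAYS[nxt])
        state = nxt
```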
Our bilingual pets
• Suppose our cat learned to say “woof” and
our dog “meow”
• … they started chatting in the next room
• … and we wanted to know who said what
Hidden State Network
[Figure: the same network with hidden states start, dog, cat, and end, where dog and cat can each emit “woof” or “meow”]
• How do we predict
– when the cat is talking: ti = cat
– when the dog is talking: ti = dog
• We construct a probabilistic model of the phenomenon
• And then seek the most likely state sequence S
$$S = \arg\max_{t_1 \ldots t_n} P(t_1 \ldots t_n \mid w_1 \ldots w_n)$$
Hidden Markov Model
• Assume current word depends only on current tag
By Bayes' rule (the denominator $P(w_1, \ldots, w_n)$ is constant and can be dropped) and the independence assumptions above:

$$
\begin{aligned}
S &= \arg\max_{t_1 \ldots t_n} P(t_1 \ldots t_n \mid w_1 \ldots w_n) \\
  &= \arg\max_{t_1 \ldots t_n} P(w_1, \ldots, w_n \mid t_1, \ldots, t_n)\, P(t_1, \ldots, t_n) \\
  &\approx \arg\max_{t_1 \ldots t_n} \prod_{i=1}^{n} P(w_i \mid t_i)\, P(t_i \mid t_{i-1})
\end{aligned}
$$
HMM for POS Tagging
• We can use the same formulas for POS tagging
– states correspond to POS tags
Training an HMM
• Training an HMM is simple if we have a
completely labeled corpus:
– i.e., the POS of each word has been marked
– we can directly estimate both P(ti | ti-1) and P(wi | ti) from corpus counts
• using the Maximum Likelihood Estimator.
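A sketch of these count-based estimates in Python (the <s> sentence-start pseudo-tag and all names are illustrative):

```python
from collections import Counter

def train_hmm(tagged_sentences):
    """MLE estimates of P(tag | prev_tag) and P(word | tag) from a
    fully tagged corpus: list of sentences, each [(word, tag), ...]."""
    bigram, context, emission, tag_count = (
        Counter(), Counter(), Counter(), Counter())
    for sent in tagged_sentences:
        prev = "<s>"                     # sentence-start pseudo-tag
        for word, tag in sent:
            bigram[(prev, tag)] += 1     # for P(tag | prev_tag)
            context[prev] += 1
            emission[(tag, word)] += 1   # for P(word | tag)
            tag_count[tag] += 1
            prev = tag
    p_trans = {pt: c / context[pt[0]] for pt, c in bigram.items()}
    p_emit = {tw: c / tag_count[tw[0]] for tw, c in emission.items()}
    return p_trans, p_emit
```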
Greedy Decoder
• the simplest decoder (tagger) assigns tags deterministically from left to right
• selects ti to maximize P(wi|ti) * P(ti|ti-1)
• does not take advantage of right context
• can we do better?
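A sketch of the greedy decoder, reusing the p_trans and p_emit dictionaries from the training sketch above:

```python
def greedy_decode(words, p_trans, p_emit, tagset):
    """Left-to-right greedy tagging: choose the best tag for each word
    given only the previous tag; no right context, no backtracking."""
    tags, prev = [], "<s>"
    for w in words:
        best = max(tagset,
                   key=lambda t: p_emit.get((t, w), 0.0) *
                                 p_trans.get((prev, t), 0.0))
        tags.append(best)
        prev = best
    return tags
```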
< Viterbi decoder >
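(The slide defers to a separate Viterbi walkthrough.) For reference, a compact sketch of Viterbi decoding under the same bigram HMM; unlike the greedy decoder, it finds the globally best tag sequence by dynamic programming:

```python
def viterbi(words, p_trans, p_emit, tagset):
    """best[i][t] = probability of the best tag sequence for
    words[:i+1] ending in tag t; back[i][t] = its predecessor tag."""
    best = [{t: p_trans.get(("<s>", t), 0.0) *
                p_emit.get((t, words[0]), 0.0) for t in tagset}]
    back = [{}]
    for i, w in enumerate(words[1:], start=1):
        best.append({})
        back.append({})
        for t in tagset:
            s = max(tagset, key=lambda p: best[i - 1][p] *
                    p_trans.get((p, t), 0.0))
            best[i][t] = (best[i - 1][s] * p_trans.get((s, t), 0.0) *
                          p_emit.get((t, w), 0.0))
            back[i][t] = s
    # recover the best sequence by following back-pointers
    last = max(tagset, key=lambda t: best[-1][t])
    tags = [last]
    for i in range(len(words) - 1, 0, -1):
        tags.append(back[i][tags[-1]])
    return list(reversed(tags))
```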
Performance
• Accuracy with good unknown-word model
trained and tested on WSJ is 96.5% to 96.8%
Unknown words
• Problem (as with Naive Bayes) of zero counts: words not in the training corpus
– simplest: assume all POS equally likely for
unknown words
– can make better estimate by observing unknown
words are very likely open class words, and most
likely nouns
• base P(t|w) of unknown word on probability
distribution of words which occur once in corpus
Unknown words, cont’d
– can do even better by taking into account the
form of a word
• whether it is capitalized
• whether it is hyphenated
• its last few letters
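A sketch combining both ideas, estimating P(tag | word shape) from words that occur exactly once in the corpus (the particular shape features are an illustrative choice):

```python
from collections import Counter

def shape(word):
    """Coarse word-shape features: capitalized? hyphenated? suffix."""
    return (word[0].isupper(), "-" in word, word[-3:].lower())

def unknown_word_model(tagged_sentences):
    """Estimate P(tag | shape) from hapax legomena (words occurring
    exactly once), which behave most like truly unseen words."""
    freq = Counter(w for sent in tagged_sentences for w, _ in sent)
    by_shape = {}
    for sent in tagged_sentences:
        for word, tag in sent:
            if freq[word] == 1:
                by_shape.setdefault(shape(word), Counter())[tag] += 1
    return {s: {t: n / sum(c.values()) for t, n in c.items()}
            for s, c in by_shape.items()}
```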
Trigram Models
• in some cases we need to look two tags back
to find an informative context
– e.g., conjunction (N and N, V and V, …)
• but there’s not enough data for a pure trigram
model
• so combine unigram, bigram, and trigram
– linear interpolation
– backoff
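A sketch of the linear-interpolation combination; the lambda weights are illustrative and would normally be tuned on held-out data (they must sum to 1):

```python
def interp_trans(t, prev1, prev2, p_uni, p_bi, p_tri,
                 lambdas=(0.1, 0.3, 0.6)):
    """P(t | prev2, prev1) as a weighted mix of unigram, bigram,
    and trigram estimates, so unseen trigrams still get mass."""
    l1, l2, l3 = lambdas
    return (l1 * p_uni.get(t, 0.0) +
            l2 * p_bi.get((prev1, t), 0.0) +
            l3 * p_tri.get((prev2, prev1, t), 0.0))
```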
Domain adaptation
• Substantial accuracy loss when shifting to a new domain
– 8–10% loss in shift from WSJ to the biology domain
– adding small annotated sample (200-500
sentences) in new domain greatly reduces error
– some reduction possible without annotated target
data (Blitzer, Structured Correspondence Learning)
Jet Tagger
• HMM–based
• trained on WSJ
• file pos_hmm.txt
Transformation-Based Learning
• TBL provides a very different corpus-based
approach to part-of-speech tagging
• It learns a set of rules for tagging
– the result is inspectable
TBL Model
• TBL starts by assigning each word its most likely
part of speech
• Then it applies a series of transformations to the
corpus
– each transformation states some condition and some
change to be made to the assigned POS if the
condition is met
– for example:
• Change NN to VB if the preceding tag is TO.
• Change VBP to VB if one of the previous 3 tags is MD.
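A sketch of one such transformation as code (the rule encoding is an illustrative simplification covering only the “previous k tags” templates):

```python
def apply_rule(tags, rule):
    """rule = (from_tag, to_tag, window, trigger): change from_tag to
    to_tag wherever trigger appears among the previous `window` tags."""
    frm, to, window, trigger = rule
    for i, t in enumerate(tags):
        if t == frm and trigger in tags[max(0, i - window):i]:
            tags[i] = to
    return tags

tags = ["TO", "NN"]                        # "to fly" mis-tagged
apply_rule(tags, ("NN", "VB", 1, "TO"))    # -> ["TO", "VB"]
```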
Transformation Templates
• Each transformation is based on one of a small
number of templates, such as
• Change tag x to y if the preceding tag is z.
• Change tag x to y if one of the previous 2 tags is z.
• Change tag x to y if one of the previous 3 tags is z.
• Change tag x to y if the next tag is z.
• Change tag x to y if one of the next 2 tags is z.
• Change tag x to y if one of the next 3 tags is z.
Training the TBL Model
• To train the tagger, using a hand-tagged
corpus, we begin by assigning each word its
most common POS.
• We then try all possible rules (all
instantiations of one of the templates) and
keep the best rule -- the one which corrects
the most errors.
• We do this repeatedly until we can no longer
find a rule which corrects some minimum
number of errors.
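A sketch of this greedy loop, reusing apply_rule from the earlier sketch; flattening the corpus to one tag list and using a min_gain stopping threshold are both simplifications:

```python
def train_tbl(current_tags, gold_tags, candidate_rules, min_gain=2):
    """Greedy TBL training: repeatedly keep the rule that corrects
    the most errors, net of any new errors it introduces."""
    def errors(tags):
        return sum(a != g for a, g in zip(tags, gold_tags))

    learned = []
    while True:
        scored = [(errors(current_tags) -
                   errors(apply_rule(list(current_tags), rule)), rule)
                  for rule in candidate_rules]
        gain, best = max(scored, key=lambda gr: gr[0])
        if gain < min_gain:
            return learned
        learned.append(best)
        current_tags = apply_rule(current_tags, best)
```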
Some Transformations
The first 9 transformations found for the WSJ corpus:

Change   to    if
NN       VB    previous tag is TO
VBP      VB    one of previous 3 tags is MD
NN       VB    one of previous 2 tags is MD
VB       NN    one of previous 2 tags is DT
VBD      VBN   one of previous 3 tags is VBZ
VBN      VBD   previous tag is PRP
VBN      VBD   previous tag is NNP
VBD      VBN   previous tag is VBD
VBP      VB    previous tag is TO
TBL Performance
• Performance competitive with good HMM
• accuracy 96.6% on WSJ
• Compared to HMM, much slower to train, but
faster to apply