Word Classes and POS Tagging

Read J & M Chapter 8.
You may also want to look at:
http://www.georgetown.edu/faculty/ballc/ling361/tagging_overview.html
Why Do We Care about Parts of Speech?
•Pronunciation
Hand me the lead pipe.
•Predicting what words can be expected next
Personal pronoun (e.g., I, she) ____________
•Stemming
-s means singular for verbs, plural for nouns
•As the basis for syntactic parsing and then meaning extraction
I will lead the group into the lead smelter.
•Machine translation
• (E) content +N → (F) contenu +N
• (E) content +Adj → (F) content +Adj or satisfait +Adj
Remember the Mapping Problem
We’ve sort of ignored this issue as we’ve looked at:
•Dealing with a noisy channel,
•Probabilistic techniques we can use for various subproblems
•Corpora we can analyze to collect our facts.
We need to return to it now. POS tagging is the first step.
Understanding – the Big Picture
Morphology
POS Tagging
Syntax
Semantics
Discourse Integration
Generation goes backwards. For this reason, we generally want
declarative representations of the facts. POS tagging is an
exception to this.
Two Kinds of Issues
•Linguistic – what are the facts about language?
•Algorithmic – what are effective computational procedures
for dealing with those facts?
What is a Part of Speech?
Is this a semantic distinction? For example, maybe Noun is the
class of words for people, places and things. Maybe Adjective
is the class of words for properties of nouns.
Consider:
green book
book is a Noun
green is an Adjective
Now consider:
book worm
This green is very soothing.
Morphological and Syntactic Definition of POS
An Adjective is a word that can fill the blank in:
It’s so __________.
A Noun is a word that can be marked as plural.
A Noun is a word that can fill the blank in:
the __________ is
What is green?
It’s so green.
Both greens could work for the walls.
The green is a little much given the red rug.
How Many Parts of Speech Are There?
A first cut at the easy distinctions:
Open classes:
•nouns, verbs, adjectives, adverbs
Closed classes: function words
•conjunctions: and, or, but
•pronouns: I, she, him
•prepositions: with, on
•determiners: the, a, an
But It Gets Harder
provided, as in “I’ll go provided John does.”
there, as in “There aren’t any cookies.”
might, as in “I might go.” or “I might could go.”
no, as in “No, I won’t go.”
What’s a Preposition?
From the CELEX online dictionary. Frequencies are from the
COBUILD 16 million word corpus.
What’s a Pronoun?
CELEX dictionary list of pronouns:
Tagsets
Brown corpus tagset (87 tags):
http://www.scs.leeds.ac.uk/amalgam/tagsets/brown.html
Penn Treebank tagset (45 tags):
http://www.cs.colorado.edu/~martin/SLP/Figures/ (8.6)
C7 tagset (146 tags)
http://www.comp.lancs.ac.uk/ucrel/claws7tags.html
Algorithms for POS Tagging
Why can’t we just look them up in a dictionary?
•Ambiguity – In the Brown corpus, 11.5% of the word
types are ambiguous (using 87 tags).
Worse, 40% of the tokens are ambiguous.
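The type/token distinction above can be made concrete with a small sketch. The corpus here is made-up toy data, not the Brown corpus; the point is only how the two rates are computed.

```python
# Toy illustration: measure what fraction of word types and word tokens
# are POS-ambiguous in a tagged corpus (made-up data, not Brown).
from collections import defaultdict

tagged_corpus = [
    ("the", "DET"), ("race", "NN"), ("is", "VBZ"), ("over", "RB"),
    ("I", "PRP"), ("race", "VB"), ("over", "IN"), ("the", "DET"),
    ("hill", "NN"),
]

tags_for = defaultdict(set)
for word, tag in tagged_corpus:
    tags_for[word].add(tag)

ambiguous_types = {w for w, tags in tags_for.items() if len(tags) > 1}
type_rate = len(ambiguous_types) / len(tags_for)
token_rate = (sum(1 for w, _ in tagged_corpus if w in ambiguous_types)
              / len(tagged_corpus))
print(f"ambiguous types: {type_rate:.0%}, ambiguous tokens: {token_rate:.0%}")
```

Because ambiguous words (like "race" and "over") tend to be frequent, the token rate comes out well above the type rate, just as in the Brown figures.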
Algorithms for POS Tagging
Why can’t we just look them up in a dictionary?
•Words that aren’t in the dictionary
http://story.news.yahoo.com/news?tmpl=story&cid=578&ncid
=578&e=1&u=/nm/20030922/ts_nm/iraq_usa_dc
•One idea: estimate P(ti | unknown word) as the probability that a
random hapax legomenon (a word occurring exactly once) in the
corpus has tag ti.
Nouns are more likely than verbs, which are more likely
than pronouns.
•Another idea: use morphology.
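The hapax-legomenon idea above can be sketched in a few lines: words seen exactly once in the training corpus stand in for unseen words. The corpus here is toy data invented for illustration.

```python
# Sketch of the hapax-legomenon idea: estimate P(tag) for unknown words
# from the tags of words that occur exactly once in a tagged corpus.
from collections import Counter

tagged_corpus = [  # made-up data, not a real corpus
    ("the", "DET"), ("aardvark", "NN"), ("slept", "VBD"),
    ("the", "DET"), ("ocelot", "NN"), ("yawned", "VBD"),
    ("a", "DET"), ("ferret", "NN"), ("ran", "VBD"),
    ("she", "PRP"), ("slept", "VBD"),
]

word_counts = Counter(w for w, _ in tagged_corpus)
hapax_tags = Counter(t for w, t in tagged_corpus if word_counts[w] == 1)
total = sum(hapax_tags.values())
p_tag_unknown = {t: n / total for t, n in hapax_tags.items()}
print(p_tag_unknown)  # nouns dominate, matching the observation above
```

Even in this tiny sample the noun probability comes out highest, consistent with the claim that unknown words are more likely nouns than verbs, and more likely verbs than pronouns.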
Algorithms for POS Tagging - Knowledge
•Dictionary
•Morphological rules, e.g.,
•_____-tion
•_____-ly
•capitalization
•N-gram frequencies
•to _____
•DET _____ N
•But what about rare words, e.g., smelt (two verb forms: to smelt
ore, i.e., melt it, and the past tense of smell; and one noun form,
a small fish)?
•Combining these
• V _____-ing
I was gracking vs. Gracking is fun.
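The morphological cues listed above (-tion, -ly, capitalization, a preceding auxiliary before -ing) can be combined into a simple guesser for unknown words. This is a hypothetical sketch using Penn-style tags; the rules and their ordering are illustrative, not a real tagger's rule set.

```python
# Hypothetical unknown-word guesser built from the morphological and
# contextual cues above (suffixes, capitalization, "V ___-ing" context).
def guess_tag(word, prev_word=None):
    if word.endswith("tion"):
        return "NN"      # station, creation, ...
    if word.endswith("ly"):
        return "RB"      # quickly, softly, ...
    if word.endswith("ing") and prev_word in {"am", "is", "was", "were"}:
        return "VBG"     # "I was gracking."
    if word[0].isupper() and prev_word is not None:
        return "NNP"     # mid-sentence capitalization suggests a proper noun
    return "NN"          # default: noun is the most likely open class

print(guess_tag("gracking", prev_word="was"))  # VBG
print(guess_tag("Gracking"))                   # NN ("Gracking is fun.")
```

Note how the same unknown word "gracking" gets different tags depending on context, exactly the contrast in the example sentences above.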
Algorithms for POS Tagging - Approaches
•Basic approaches
•Rule-Based
•Stochastic
•Do we return one best answer or several answers and let later
steps decide?
•How does the requisite knowledge get entered?
Training/Teaching an NLP Component
Each step of NLP analysis requires a module that knows what
to do. How do such modules get created?
•By hand
•By training
Advantages of hand creation: based on sound linguistic
principles, sensible to people, explainable
Advantages of training from a corpus: less work, extensible to
new languages, customizable for specific domains.
Training/Teaching a POS Tagger
The problem is tractable. We can do a very good job with just:
•a dictionary
•a tagset
•a large corpus, usually tagged by hand
There are only somewhere between 50 and 150 possibilities for
each word and 3 or 4 words of context is almost always enough.
The task:
____ _ __ ______ __ _ _____
What is the weather like in Austin?
Contrast with Training Other NLP Parts
The task:
____ _ __ ______ __ _ _____
What is the weather like in Austin?
The weather in Austin is like what?
[Figure: a target database schema for the weather domain, with
tables such as RainfallByStation, Months, and Stations and fields
including station, month, year, days, rainfall, and city]
Rule-Based POS Tagging
Step 1: Using a dictionary, assign to each word a list of possible
tags.
Step 2: Figure out what to do about words that are unknown or
ambiguous. Two approaches:
•Rules that specify what to do.
•Rules that specify what not to do:
Example: Adverbial “that” rule
Given input: “that”
If
(+1 A/ADV/QUANT)
(+2 SENT-LIM)
(NOT -1 SVOC/A)
Then eliminate non-ADV tags
Else eliminate ADV
It isn’t that odd
vs
I consider that odd vs
I believe that he is right.
From ENGTWOL
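The constraint style of the rule above can be sketched directly: each token carries a *set* of candidate tags, and rules eliminate tags rather than pick one. This is a toy rendering of the ENGTWOL rule, with the tag names taken from the slide; the data-structure choices are assumptions.

```python
# Sketch of the adverbial-"that" constraint rule. tokens is a list of
# words; tags is a parallel list of candidate-tag sets. The rule prunes
# tags in place rather than committing to a single answer.
def adverbial_that_rule(tokens, i, tags):
    if tokens[i].lower() != "that":
        return
    next_is_a_adv_quant = (i + 1 < len(tokens)
                           and tags[i + 1] & {"ADJ", "ADV", "QUANT"})
    then_sentence_end = i + 2 >= len(tokens) or "SENT-LIM" in tags[i + 2]
    prev_is_svoc_a = i > 0 and "SVOC/A" in tags[i - 1]  # verbs like "consider"
    if next_is_a_adv_quant and then_sentence_end and not prev_is_svoc_a:
        tags[i] = {"ADV"}        # "It isn't that odd."
    else:
        tags[i] -= {"ADV"}       # "I consider that odd."

tokens = ["It", "is", "not", "that", "odd", "."]
tags = [{"PRP"}, {"VBZ"}, {"RB"}, {"ADV", "DET", "CS"}, {"ADJ"}, {"SENT-LIM"}]
adverbial_that_rule(tokens, 3, tags)
print(tags[3])  # {'ADV'}
```

The "NOT -1 SVOC/A" clause is what blocks the adverbial reading after verbs like *consider*, which license an object-plus-adjective complement.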
Stochastic POS Tagging
First approximation: choose the tag that is most likely for the
given word.
Next try: consider N-gram frequencies and choose the tag that is
most likely in the current context. Should the context be the last
N words or the last N classes?
Next try: combine the two:
By Bayes' rule:

P(ti in context | wi) = P(wi | ti in context) P(ti in context) / P(wi)

Since P(wi) is the same for every candidate tag, pick:

ti = argmax_j P(tj | t(i-1)) P(wi | tj)
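The argmax can be decoded greedily, left to right, as in this minimal sketch. The probabilities are made-up toy numbers, not corpus estimates, and a real tagger would use Viterbi search over the whole sentence rather than a greedy pass.

```python
# Greedy decoder for  t_i = argmax_j P(t_j | t_{i-1}) * P(w_i | t_j),
# using hand-made toy probabilities (illustrative, not corpus-derived).
trans = {("TO", "VB"): 0.8, ("TO", "NN"): 0.2,    # P(tag | previous tag)
         ("DET", "NN"): 0.9, ("DET", "VB"): 0.1}
emit = {("race", "NN"): 0.001, ("race", "VB"): 0.003}  # P(word | tag)

def best_tag(word, prev_tag, tagset=("NN", "VB")):
    return max(tagset,
               key=lambda t: trans.get((prev_tag, t), 0.0)
                             * emit.get((word, t), 0.0))

print(best_tag("race", "TO"))   # VB: "to race tomorrow"
print(best_tag("race", "DET"))  # NN: "the race is over"
```

Note that the word-level preference (race is usually a noun) is overridden after TO, because the transition probability dominates.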
Hybrids – the Brill Tagger
Learning rules stochastically: Transformation Based Learning
Step 1: Assign each word the tag that is most likely given no
contextual information.
Race example: P(NN|race) = .98
P(VB|race) = .02
Step 2: Apply transformation rules that use the context that
was just established.
Race example: Change NN to VB when the previous tag is TO.
Secretariat is expected to race tomorrow.
The race is already over.
Learning Brill Tagger Transformations
Three major stages:
1. Label every word with its most-likely tag.
2. Examine every possible transformation and select the one
that most improves tagging accuracy.
3. Retag the data according to this rule.
These three stages are repeated until some stopping point is
reached.
The output of TBL is an ordered list of transformations, which
constitute a tagging procedure that can be applied to a new
corpus.
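The three stages can be sketched with a single rule template, "change tag A to B when the previous tag is Z." Brill's actual tagger uses many templates and a large corpus; the data and tag names here are toy assumptions.

```python
# Sketch of Transformation-Based Learning with one rule template:
# (a, b, z) means "change tag a to b when the previous tag is z".
def apply_rule(tags, a, b, z):
    return [b if (t == a and i > 0 and tags[i - 1] == z) else t
            for i, t in enumerate(tags)]

def tbl_learn(words, gold, most_likely, max_rules=3):
    tags = [most_likely[w] for w in words]        # stage 1: most-likely tags
    rules = []
    tagset = sorted(set(gold) | set(tags))

    def gain(rule):                               # accuracy improvement
        new = apply_rule(tags, *rule)
        return (sum(n == g for n, g in zip(new, gold))
                - sum(t == g for t, g in zip(tags, gold)))

    for _ in range(max_rules):
        candidates = [(a, b, z) for a in tagset for b in tagset
                      for z in tagset if a != b]
        best = max(candidates, key=gain)          # stage 2: best transformation
        if gain(best) <= 0:
            break                                 # stopping point reached
        rules.append(best)
        tags = apply_rule(tags, *best)            # stage 3: retag the data
    return rules, tags

words = ["to", "race", "the", "race"]
gold = ["TO", "VB", "DET", "NN"]
most_likely = {"to": "TO", "race": "NN", "the": "DET"}
rules, tags = tbl_learn(words, gold, most_likely)
print(rules)  # learns the rule: change NN to VB after TO
```

On this toy data the learner recovers exactly the transformation from the race example: NN becomes VB when the previous tag is TO.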
The Universe of Possible Transformations?
One or Many Answers
Example:
I’m going to water ski.
I’m going to water the lawn.
The architecture issue:
•If we just give one answer, we can follow a single path.
•If we don’t decide yet, we’ll need to manage search.
Search
•Managing search:
•Depth-first
•Breadth-first – chart parsing
[Figure: two parse trees for "I hit the boy with a bat": one attaches
the PP "with a bat" to the VP (the bat is the instrument of hitting);
the other attaches it to the NP "the boy" (the boy has the bat)]
Evaluation
•Given an algorithm, how good is it?
•What is causing the errors? Can anything be done about
them?
How Good is An Algorithm?
•How good is the algorithm?
•What’s the maximum performance we have any reason to
believe is achievable? (How well can people do?)
•How good is good enough?
Is 97% good enough?
•Example 1: A speech dialogue system correctly
assigns a meaning to a user’s input 97% of the time.
•Example 2: An OCR system correctly determines
letters 97% of the time.
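A quick back-of-the-envelope calculation shows why 97% per unit can still be disappointing at the sentence or word level. The sentence length of 20 and the independence of errors are simplifying assumptions for illustration.

```python
# If each word (or letter) is right 97% of the time and errors are
# independent, the chance a 20-unit sequence is entirely correct is:
per_unit = 0.97
units = 20
per_sequence = per_unit ** units
print(f"{per_sequence:.2f}")  # 0.54: nearly half of sequences contain an error
```

So a tagger that looks excellent per word may still get something wrong in almost every other sentence, which matters a great deal to a parser or a dialogue system consuming its output.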