Lecture 14

Ch 9 Part of Speech Tagging
(slides adapted from Dan Jurafsky, Jim Martin, Dekang Lin, Rada
Mihalcea, and Bonnie Dorr and Mitch Marcus.)
Parts of Speech
 8 (ish) traditional parts of speech
• Noun, verb, adjective, preposition, adverb, article,
interjection, pronoun, conjunction, etc
• This idea has been around for over 2000 years
(Dionysius Thrax of Alexandria, c. 100 B.C.)
• Called: parts-of-speech, lexical category, word classes,
morphological classes, lexical tags, POS
• We’ll use POS most frequently
POS examples for English
Tag   Category      Examples
N     noun          chair, bandwidth, pacing
V     verb          study, debate, munch
ADJ   adjective     purple, tall, ridiculous
ADV   adverb        unfortunately, slowly
P     preposition   of, by, to
PRO   pronoun       I, me, mine
DET   determiner    the, a, that, those
Open Class Words
 Every known human language has nouns
and verbs
 Nouns: people, places, things
• Classes of nouns
—proper vs. common
—count vs. mass
 Verbs: actions and processes
 Adjectives: properties, qualities
 Adverbs: hodgepodge!
• Unfortunately, John walked home extremely
slowly yesterday
Definition:
An adverb is a part of speech. It is any word that
modifies any other part of language: verbs,
adjectives (including numbers), clauses, sentences,
and other adverbs. Nouns are the exception; their
modifiers are primarily determiners and adjectives.
Closed Class Words
 Differ more from language to language
than open class words
 Examples:
• prepositions: on, under, over, …
• particles: up, down, on, off, …
• determiners: a, an, the, …
• pronouns: she, who, I, …
• conjunctions: and, but, or, …
• auxiliary verbs: can, may, should, …
• numerals: one, two, three, third, …
Prepositions from CELEX
Pronouns in CELEX
Conjunctions
Auxiliaries
NLP Task I – Determining Part of Speech Tags
 The Problem:
Word    POS listing in Brown Corpus
heat    noun, verb
oil     noun
in      prep, noun, adv
a       det, noun, noun-proper
large   adj, noun, adv
pot     noun
POS Tagging: Definition
 The process of assigning a part-of-speech or lexical
class marker to each word in a corpus:
WORDS: the koala put the keys on the table
TAGS: N, V, P, DET
POS Tagging example
WORD    tag
the     DET
koala   N
put     V
the     DET
keys    N
on      P
the     DET
table   N
What is POS tagging good for?
 Speech synthesis:
• How to pronounce “lead”?
• INsult / inSULT
• OBject / obJECT
• OVERflow / overFLOW
• DIScount / disCOUNT
• CONtent / conTENT
 Stemming for information retrieval
• Knowing that a word is a noun (N) tells you it takes plural forms
• So a search for “aardvarks” can also return “aardvark”
 Parsing, speech recognition, etc.
• Possessive pronouns (my, your, her) followed by nouns
• Personal pronouns (I, you, he) likely to be followed by verbs
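The speech-synthesis case above can be sketched as a lookup keyed on both the word and its tag. This is a minimal illustration, not a real pronunciation lexicon: the `STRESS` table and the `pronounce` function are hypothetical names, and the tag assigned to each stress pattern is the usual English pattern for these pairs rather than something stated on the slide.

```python
# Hypothetical stress lexicon: (word, POS tag) -> stress pattern.
STRESS = {
    ("insult", "N"): "INsult",
    ("insult", "V"): "inSULT",
    ("object", "N"): "OBject",
    ("object", "V"): "obJECT",
    ("overflow", "N"): "OVERflow",
    ("overflow", "V"): "overFLOW",
    ("discount", "N"): "DIScount",
    ("discount", "V"): "disCOUNT",
    ("content", "N"): "CONtent",
    ("content", "ADJ"): "conTENT",
}

def pronounce(word, tag):
    """Return the stress pattern for a tagged word, or the word unchanged."""
    return STRESS.get((word.lower(), tag), word)

tagged = [("they", "PRO"), ("object", "V"), ("to", "P"),
          ("the", "DET"), ("object", "N")]
print(" ".join(pronounce(w, t) for w, t in tagged))
# they obJECT to the OBject
```

Without the tag, the two occurrences of "object" would be indistinguishable to the synthesizer.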
Related Problem in Bioinformatics
 Durbin et al. Biological Sequence Analysis, Cambridge University Press.
 Several applications, e.g. proteins
 From primary structure ATCPLELLLD
 Infer secondary structure HHHBBBBBC..
History: From Yair Halevi (Bar-Ilan U.)
Timeline figure, 1960 to 2000 (accuracy figures as given on the slide).
Taggers:
• Greene and Rubin: Rule Based, 70%
• HMM Tagging (CLAWS): 93%-95%
• DeRose/Church: Efficient HMM, Sparse Data, 95%+
• Transformation Based Tagging (Eric Brill): Rule Based, 95%+
• Tree-Based Statistics (Helmut Schmid): Rule Based, 96%+
• Trigram Tagger (Kempe): 96%+
• Neural Network: 96%+
• Combined Methods: 98%+
Corpora and milestones:
• Brown Corpus created (EN-US), 1 million words
• Brown Corpus tagged
• LOB Corpus created (EN-UK), 1 million words
• LOB Corpus tagged
• POS tagging separated from other NLP
• Penn Treebank Corpus (WSJ, 4.5M)
• British National Corpus (tagged by CLAWS)
British National Corpus
What is it used for?
Ultimately, its use is limited only by our imagination; if you have any need for up
to 100 million words of modern British English, you can make use of the British
National Corpus.
 The main uses of the corpus are as follows:
 Reference Book Publishing
• Dictionaries, grammar books, teaching materials, usage guides, thesauri.
Increasingly, publishers are referring to the use they make of corpus facilities: it's
important to know how well their corpora are planned and constructed.
 Linguistic Research
• Raw data for studying lexis, syntax, morphology, semantics, discourse analysis,
stylistics, sociolinguistics...
 Natural language processing
• Taggers, parsers, natural language understanding programs, spell checking word
lists...
 Artificial Intelligence
• Extensive data test bed for program development.
 English Language Teaching
• Syllabus and materials design, classroom reference, independent learner
research.
Penn Treebank Tagset
A Simplified Tagset for English
 Tagsets for English have grown progressively larger
since the Brown Corpus until the Penn Treebank
project.
Brown Corpus: 87 tags
LOB Corpus: 135 tags
Lancaster UCREL group: 166 tags
London-Lund Corpus: 197 tags
UPenn Treebank: 34 tags + punctuation
Rationale behind British & European tag sets
To provide “distinct codings for all classes of words
having distinct grammatical behaviour” – Garside
et al. 1987

The Lund tagset for adverbs distinguishes between
• Adjunct – Process, Space, Time
• Wh-type – Manner, Reason, Space, Time, Wh-type + ‘S
• Conjunct – Appositional, Contrastive, Inferential, Listing, …
• Disjunct – Content, Style
• Postmodifier – “else”
• Negative – “not”
• Discourse Item – Appositional, Expletive, Greeting, Hesitator, …
Reasons for a Smaller Tagset
 Many tags are unique to particular lexical items,
and can be recovered automatically if desired.
Brown Tags For Verbs
be/BE       have/HV      sing/VB
is/BEZ      has/HVZ      sings/VBZ
was/BED     had/HVD      sang/VBD
being/BEG   having/HVG   singing/VBG
been/BEN    had/HVN      sung/VBN
Penn Treebank Tags For Verbs
be/VB       have/VB      sing/VB
is/VBZ      has/VBZ      sings/VBZ
was/VBD     had/VBD      sang/VBD
being/VBG   having/VBG   singing/VBG
been/VBN    had/VBN      sung/VBN
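The claim that lexically specific tags can be recovered automatically can be sketched with the verb tags listed above. The names `BROWN_TO_PENN`, `BE_FORMS`, `HAVE_FORMS`, and `recover_brown` are illustrative assumptions, not part of either tagset's tooling, and only the forms shown in the listings are covered.

```python
# Collapse the Brown verb tags listed above into Penn Treebank tags.
BROWN_TO_PENN = {
    "BE": "VB", "HV": "VB", "VB": "VB",
    "BEZ": "VBZ", "HVZ": "VBZ", "VBZ": "VBZ",
    "BED": "VBD", "HVD": "VBD", "VBD": "VBD",
    "BEG": "VBG", "HVG": "VBG", "VBG": "VBG",
    "BEN": "VBN", "HVN": "VBN", "VBN": "VBN",
}

# Forms of "be" and "have" from the listings; Brown gives them their own tags.
BE_FORMS = {"be", "is", "was", "being", "been"}
HAVE_FORMS = {"have", "has", "had", "having"}

def recover_brown(word, penn_tag):
    """Rebuild the finer Brown tag from the word form and its Penn tag."""
    suffix = penn_tag[2:]  # "", "Z", "D", "G", or "N"
    w = word.lower()
    if w in BE_FORMS:
        return "BE" + suffix
    if w in HAVE_FORMS:
        return "HV" + suffix
    return penn_tag

for word, brown in [("is", "BEZ"), ("had", "HVN"), ("sang", "VBD")]:
    penn = BROWN_TO_PENN[brown]
    print(word, brown, "->", penn, "->", recover_brown(word, penn))
```

The round trip works because the extra Brown distinctions depend only on which word is being tagged, so the smaller tagset loses no information here.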
Task I – Determining Part of Speech Tags
 The Problem:
Word    POS listing in Brown
heat    noun, verb
oil     noun
in      prep, noun, adv
a       det, noun, noun-proper
large   adj, noun, adv
pot     noun
 The Old Solution: Combinatorial search.
• If each of n words has k tags on average, try the k^n
combinations until one works.
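To get a feel for the size of that search space, the sketch below enumerates every tag sequence for the example sentence, using the candidate tags from the Brown listing. `CANDIDATES` and `all_taggings` are hypothetical names; the number of sequences is the product of the per-word tag counts.

```python
from itertools import product
from math import prod

# Candidate tags per word, as listed for the Brown Corpus above.
CANDIDATES = {
    "heat": ["noun", "verb"],
    "oil": ["noun"],
    "in": ["prep", "noun", "adv"],
    "a": ["det", "noun", "noun-proper"],
    "large": ["adj", "noun", "adv"],
    "pot": ["noun"],
}

def all_taggings(words):
    """Yield every possible tag sequence for the given words."""
    return product(*(CANDIDATES[w] for w in words))

words = "heat oil in a large pot".split()
count = prod(len(CANDIDATES[w]) for w in words)
taggings = list(all_taggings(words))
assert len(taggings) == count
print(count)  # 2 * 1 * 3 * 3 * 3 * 1 = 54
```

Six words already give 54 candidates, and the count grows exponentially with sentence length.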
NLP Task I – Determining Part of Speech Tags
 Machine Learning Solutions: Automatically learn
Part of Speech (POS) assignment.
• The best techniques achieve 96-97% accuracy per word
on new materials, given large training corpora.
Simple Statistical Approaches: Idea 1
Simple Statistical Approaches: Idea 2
For a string of words
W = w_1 w_2 w_3 … w_n
find the string of POS tags
T = t_1 t_2 t_3 … t_n
which maximizes P(T|W)
• i.e., the probability of tag string T given that the
word string was W
• i.e., that W was tagged T
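Written as a single equation, Idea 2 asks for the following, where $\hat{T}$ is notation introduced here for the chosen tag string:

```latex
\hat{T} = \arg\max_{T} P(T \mid W),
\qquad W = w_1 w_2 \dots w_n, \quad T = t_1 t_2 \dots t_n
```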
Again, The Sparse Data Problem …
A Simple, Impossible Approach to Compute P(T|W):
Count up instances of the string "heat oil in a large pot"
in the training corpus, and pick the most common
tag assignment to the string.
A Practical Statistical Tagger
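A standard first step, given here as the usual textbook derivation rather than as the slide's own equation, is Bayes' rule. Because $P(W)$ is the same for every candidate $T$, it can be dropped from the maximization:

```latex
P(T \mid W) = \frac{P(T)\,P(W \mid T)}{P(W)}
\quad\Longrightarrow\quad
\arg\max_{T} P(T \mid W) = \arg\max_{T} P(T)\,P(W \mid T)
```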
A Practical Statistical Tagger II
But we can't accurately estimate more than tag
bigrams or so…
We change to a model that we CAN estimate:
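One model of this kind, assumed here to be the standard bigram hidden Markov model formulation, conditions each tag only on the previous tag and each word only on its own tag. The symbol $t_0$ is a hypothetical start-of-sentence marker:

```latex
P(T) \approx \prod_{i=1}^{n} P(t_i \mid t_{i-1}),
\qquad
P(W \mid T) \approx \prod_{i=1}^{n} P(w_i \mid t_i)
```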
A Practical Statistical Tagger III
So, for a given string W = w_1 w_2 w_3 … w_n, the tagger needs to
find the string of tags T which maximizes
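Assuming the standard bigram model, the quantity being maximized is the product over i of P(t_i | t_{i-1}) · P(w_i | t_i). The sketch below finds the maximizing tag string with the Viterbi dynamic program instead of enumerating every sequence. All probabilities are invented toy values, and `viterbi`, `TRANS`, `EMIT`, and the `<s>` start marker are hypothetical names chosen for illustration.

```python
# Toy model: every probability below is invented for illustration only.
TAGS = ["DET", "N", "V"]
TRANS = {  # TRANS[prev][tag] = P(tag | prev); "<s>" marks sentence start
    "<s>": {"DET": 0.6, "N": 0.3, "V": 0.1},
    "DET": {"DET": 0.05, "N": 0.9, "V": 0.05},
    "N": {"DET": 0.1, "N": 0.3, "V": 0.6},
    "V": {"DET": 0.5, "N": 0.4, "V": 0.1},
}
EMIT = {  # EMIT[tag][word] = P(word | tag); unseen words get 0.0
    "DET": {"the": 0.7},
    "N": {"koala": 0.2, "keys": 0.2, "put": 0.01},
    "V": {"put": 0.3, "keys": 0.01},
}

def viterbi(words):
    """Return (best tag sequence, its probability) under the bigram model."""
    # best[tag] = (probability, path) of the best path ending in that tag
    best = {
        t: (TRANS["<s>"][t] * EMIT[t].get(words[0], 0.0), [t]) for t in TAGS
    }
    for word in words[1:]:
        new_best = {}
        for t in TAGS:
            emit = EMIT[t].get(word, 0.0)
            # Choose the previous tag whose best path extends best into t.
            prob, path = max(
                (best[p][0] * TRANS[p][t] * emit, best[p][1]) for p in TAGS
            )
            new_best[t] = (prob, path + [t])
        best = new_best
    prob, path = max(best.values())
    return path, prob

tags, prob = viterbi("the koala put the keys".split())
print(tags)  # ['DET', 'N', 'V', 'DET', 'N']
```

The table `best` holds one entry per tag at each step, so the work grows linearly with sentence length rather than exponentially.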
Training and Performance
 To estimate the parameters of this model, given an
annotated training corpus:
 Because many of these counts are small, smoothing is
necessary for best results…
 Such taggers typically achieve about 95-96% correct
tagging, for tag sets of 40-80 tags.
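The usual estimates are relative frequencies: P(t_i | t_{i-1}) as count(t_{i-1} t_i) / count(t_{i-1}), and P(w | t) as count(w tagged t) / count(t). These formulas are the standard ones and are assumed here rather than taken from the slide. The sketch below adds add-one (Laplace) smoothing as one simple choice among many; `train`, `p_trans`, `p_emit`, `START`, and the two-sentence corpus are hypothetical.

```python
from collections import Counter

START = "<s>"  # hypothetical start-of-sentence marker

def train(tagged_sentences):
    """Collect the counts needed for the bigram tagger."""
    bigrams = Counter()     # (previous tag, tag) -> count
    contexts = Counter()    # previous tag -> times it preceded some tag
    emissions = Counter()   # (tag, word) -> count
    tag_totals = Counter()  # tag -> count
    for sentence in tagged_sentences:
        prev = START
        for word, tag in sentence:
            bigrams[(prev, tag)] += 1
            contexts[prev] += 1
            emissions[(tag, word)] += 1
            tag_totals[tag] += 1
            prev = tag
    vocab = {word for _, word in emissions}
    return bigrams, contexts, emissions, tag_totals, vocab

def p_trans(tag, prev, bigrams, contexts, n_tags):
    """Add-one smoothed estimate of P(tag | prev)."""
    return (bigrams[(prev, tag)] + 1) / (contexts[prev] + n_tags)

def p_emit(word, tag, emissions, tag_totals, vocab_size):
    """Add-one smoothed estimate of P(word | tag)."""
    return (emissions[(tag, word)] + 1) / (tag_totals[tag] + vocab_size)

corpus = [
    [("the", "DET"), ("koala", "N"), ("put", "V"),
     ("the", "DET"), ("keys", "N")],
    [("the", "DET"), ("keys", "N"), ("fell", "V")],
]
bigrams, contexts, emissions, tag_totals, vocab = train(corpus)
n_tags = len(tag_totals)

print(p_trans("N", "DET", bigrams, contexts, n_tags))           # (3+1)/(3+3)
print(p_emit("the", "DET", emissions, tag_totals, len(vocab)))  # (3+1)/(3+5)
```

Smoothing keeps unseen tag bigrams and unseen word/tag pairs from receiving probability zero, which would otherwise rule out any tag sequence containing them.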