Part-of-Speech Tagging - user.phil.uni

Introduction to (still more)
Computational Linguistics
Pawel Sirotkin
28.11-01.12.2008, Riga
Rule-based CL
• Rules have to be generated by hand
• Easily tailored to fit (or test) a particular theory
• First results with just a handful of rules
• But:
– Very hard to get “all” the rules
– Rules may conflict
– Rules are language- and domain-specific
Computational Linguistics, NLL Riga
2008, by Pawel Sirotkin
Statistical CL
• Needed: an algorithm that can create rules
• Algorithm needs training data to learn
– More and more data around
– Digitized literature, official documents, corpora
• These rules can be applied to new texts
• Good points:
– Largely independent of language, domain etc.
– Computational power available in abundance
A brief aside: Corpora
• First major corpus: Brown Corpus (mid-1960s)
– 500 samples of 2000 words each
– From newspapers, fiction and non-fiction books
– Around 80 part-of-speech tags
• Tagging took over 15 years to complete
• Modern corpora: BNC, COCA, ...
– Sometimes hundreds of millions of words
– Written and spoken texts
– Varying amounts of syntactic and semantic annotation
Part-of-Speech Tagging
• Linguistic background
– What are parts of speech?
– How do we recognize them?
• Practical usage
– What are POS taggers good for?
– What should they do?
• Implementation
– What are the possible problems?
– What are the possible solutions?
Parts of speech
Nouns, verbs, adjectives…

I have a dream that my four little children will one day live
in a nation where they will not be judged by the color of
their skin but by the content of their character.
(Martin Luther King)
How many nouns are there in this text?

What defines a noun?

What defines a part of speech?

Noun
• a word (other than a pronoun) used to identify any of a class of
people, places, or things (common noun), or to name a particular
one of these (proper noun) [OED]
– Semantic definition
• any member of a class of words that typically can be combined with
determiners to serve as the subject of a verb, can be interpreted as
singular or plural, can be replaced with a pronoun, and refer to an
entity, quality, state, action, or concept [Merriam-Webster]
– Syntactic and semantic definition
What parts of speech are there?
Open word classes
• Nouns (table, time, Wiebke)
• Verbs (go, use, sleep)
• Adjectives (nice, white, absent)
• Adverbs (quickly, clockwise, yesterday)
• Interjections (wow, ouch, er)
Closed word classes
• Determiners (the, some, what)
• Auxiliary verbs (be, have, must)
• Pronouns (I, ourselves, his)
• Prepositions (on, by, after)
• Conjunctions (and, while, either ... or ...)

• More (closed) word classes in English
• More (or fewer, or different) word classes in other languages
• Different word classes in different linguistic models
How to recognize word classes?


Substitution test
The small boy sits in a car.
• The, a, this: determiners
• Small, big, angry, clever: adjectives
• Boy, girl, cat, doll: nouns
• Sits, cries, sleeps: verbs
• In, on, outside: prepositions
Why do we need POS tags?
• Main aim: disambiguation
• Useful for most advanced CLP applications
– Machine translation
– Named Entity Recognition/Extraction
– Anaphora resolution
– etc.
Part-of-Speech Tagger

Not surprisingly, an application for determining parts of
speech in a text

Not/ADV surprisingly/ADV, an/DET application/N for/PREP
determining/V parts/N of/PREP speech/N in/PREP a/DET text/N
Part-of-Speech Tagging – rules?

Rule-based POS Tagging?
Possible rules (simplified):
• If a word ends in “est”, it's an adjective (superlative form)
– Pest? Rest?
• If a word ends in “ed”, it's a verb (past or participle form)
– Bed? Sled?
• Rules of this kind are few and unreliable
• Largest problem: they don't help with the ambiguous words!
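
The two suffix rules can be tried out in a few lines. This is only a sketch: the function name `tag_by_suffix` and the tag labels are made up for illustration, and the word list is chosen to show the counterexamples from the slide.

```python
# Naive suffix rules as on the slide; names and tag labels are illustrative.
SUFFIX_RULES = [
    ("est", "ADJ"),   # superlative form
    ("ed", "VERB"),   # past or participle form
]

def tag_by_suffix(word):
    """Return the tag of the first matching suffix rule, or None."""
    lower = word.lower()
    for suffix, tag in SUFFIX_RULES:
        if lower.endswith(suffix):
            return tag
    return None

for word in ["biggest", "walked", "pest", "rest", "bed", "sled", "wind"]:
    print(word, tag_by_suffix(word))
```

“Pest” and “bed” are tagged as adjective and verb, and an ambiguous word like “wind” gets no tag at all.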
Part-of-Speech Tagging – rules!
• The wind is blowing.
– How do we know wind is a noun and not a verb?
– Because it appears after an article and before a verb
• ART ___ VERB → ART NOUN VERB
• We need rules about inter-word relations
• Hard to say what the rules are:
– The cromulent wind
– The cromulent wind up
Part-of-Speech Tagging: Stats
• Wind: 76% noun usage, 24% verb usage
• ART ___ VERB: 72% noun, 1% adverb
• The wind blows:
– Verb probability: 24% x 0% = 0%
– Adverb probability: 0% x 1% = 0%
– Noun probability: 76% x 72% = 55%
Careful!
The numbers are invented, and the calculation is more complex than that.
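
The simplified calculation above, written out as a sketch. The dictionary names are assumptions; the numbers are the invented ones from the slide, given as fractions.

```python
# Invented numbers from the slide, as fractions.
lexical = {"noun": 0.76, "verb": 0.24, "adverb": 0.0}     # how often "wind" takes each tag
contextual = {"noun": 0.72, "verb": 0.0, "adverb": 0.01}  # tag found in the slot ART ___ VERB

# Simplified score: word-based probability times context-based probability.
scores = {tag: lexical[tag] * contextual[tag] for tag in lexical}
best = max(scores, key=scores.get)
print(best, round(scores[best], 2))
```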
What do we need?
This is a simple sentence.
This text, excogitated by Dr. Samākslots of New York, is a
bit more complicated. It consists of a few longer-than-usual
sentences; also, it has punctuation etc. It will help us
to learn the complexities of part-of-speech tagging, or
POST.
We need…


• A tokenizer to split the text into tokens
• Tag probabilities for the tokens
– E.g. left: 46% adjective, 31% noun, 23% verb
• Tag sequence probabilities
– E.g. ADJ ___ NOUN: 57% noun, 43% adjective
– How long should the sequences be?
• Methods for estimating unknown words
– E.g. 80% proper noun probability if capitalized
– No closed word classes
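
A minimal tokenizer sketch for the first item on the list. The regular expression and the abbreviation list are assumptions made for illustration; a real tokenizer needs far more cases. It keeps hyphenated words such as “part-of-speech” together and does not split “Dr.” into a word and a full stop.

```python
import re

# Hypothetical and far from complete list of abbreviations that keep their full stop.
ABBREVIATIONS = {"Dr.", "Mr.", "Mrs.", "etc."}

# A word (optionally with internal hyphens) or a single punctuation character.
TOKEN = re.compile(r"\w+(?:-\w+)*|[^\w\s]")

def tokenize(text):
    """Split text into word and punctuation tokens."""
    tokens = []
    for match in TOKEN.finditer(text):
        token = match.group()
        # Re-attach the full stop to a known abbreviation such as "Dr."
        if token == "." and tokens and tokens[-1] + "." in ABBREVIATIONS:
            tokens[-1] += "."
        else:
            tokens.append(token)
    return tokens

print(tokenize("This text, excogitated by Dr. Samākslots of New York, is a bit more complicated."))
```

Note that an abbreviation at the end of a sentence swallows the sentence-final full stop in this sketch.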
Tag probabilities
The wind blows.
• The: 98% article, 2% adverb
• Wind: 76% noun, 24% verb
• Blows: 53% verb, 47% noun
• Article → Noun 72%, Article → Verb 1%
• Adverb → Noun 0%, Adverb → Verb 6%
• Noun → Verb 61%, Noun → Noun 4%
• Verb → Verb 3%, Verb → Noun 59%
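
Where do such numbers come from? One possibility, sketched here under the assumption that a hand-tagged training corpus is available, is to count relative frequencies. The three-sentence corpus below is invented for illustration, and so are all variable names.

```python
from collections import Counter, defaultdict

# Tiny invented training corpus of tagged sentences; a real one would be far larger.
corpus = [
    [("the", "ART"), ("wind", "NOUN"), ("blows", "VERB")],
    [("the", "ART"), ("blows", "NOUN"), ("hurt", "VERB")],
    [("they", "PRON"), ("wind", "VERB"), ("the", "ART"), ("clock", "NOUN")],
]

tag_counts = defaultdict(Counter)         # token -> counts of its tags
transition_counts = defaultdict(Counter)  # previous tag -> counts of the following tag

for sentence in corpus:
    previous = None
    for token, tag in sentence:
        tag_counts[token][tag] += 1
        if previous is not None:
            transition_counts[previous][tag] += 1
        previous = tag

def relative(counter):
    """Turn raw counts into relative frequencies."""
    total = sum(counter.values())
    return {key: count / total for key, count in counter.items()}

print(relative(tag_counts["wind"]))
print(relative(transition_counts["ART"]))
```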
Tag probability calculation
The wind blows.
• Article – noun – verb: 98% x 72% x 76% x 61% x 53% = 17%
• Article – noun – noun: 98% x 72% x 76% x 4% x 47% = 1%
• Article – verb – noun: 98% x 1% x 24% x 59% x 47% = 0.07%
• Article – verb – verb: 98% x 1% x 24% x 3% x 53% = 0.004%
• …
• The complexity of calculations explodes when the length of
the sentences and the number of tags increase.
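
The brute-force approach can be sketched as follows: enumerate every possible tag sequence and multiply the probabilities along it. The table names are assumptions; the numbers are the invented ones from the “Tag probabilities” slide, as fractions.

```python
from itertools import product

# Invented numbers from the "Tag probabilities" slide, as fractions.
initial = {"ART": 0.98, "ADV": 0.02}        # tags of "The"
lexical = [
    {"NOUN": 0.76, "VERB": 0.24},           # wind
    {"VERB": 0.53, "NOUN": 0.47},           # blows
]
transition = {
    ("ART", "NOUN"): 0.72, ("ART", "VERB"): 0.01,
    ("ADV", "NOUN"): 0.00, ("ADV", "VERB"): 0.06,
    ("NOUN", "VERB"): 0.61, ("NOUN", "NOUN"): 0.04,
    ("VERB", "VERB"): 0.03, ("VERB", "NOUN"): 0.59,
}

def score(tags):
    """Multiply tag and tag sequence probabilities along one tag sequence."""
    p = initial[tags[0]]
    for i in range(1, len(tags)):
        p *= transition[(tags[i - 1], tags[i])] * lexical[i - 1][tags[i]]
    return p

# Every combination of candidate tags: 2 x 2 x 2 = 8 sequences for three words.
candidates = list(product(initial, lexical[0], lexical[1]))
ranked = sorted(candidates, key=score, reverse=True)
for tags in ranked[:4]:
    print(" – ".join(tags), f"{score(tags):.4%}")
```

With T candidate tags per word and n words there are T^n sequences to score, which is why this approach does not scale.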
Hidden Markov Models
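[Figure: lattice for “The wind blows” with the possible tags of each word as nodes; nodes and arrows are labelled with tag and tag sequence probabilities (98%, 2%, 72%, 76%, 24%, 6%, 61%, 59%, 52%, 47%).]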
Viterbi Algorithm
[Figure: the same lattice for “The wind blows”, filled in from left to right with the probability of each partial tag sequence.]
• The: article: 98%, adverb: 2%
• wind: article – noun: 54%, article – verb: 0.22%, adverb – noun: 0%, adverb – verb: 0.02%
• blows: article – noun – verb: 17%, article – noun – noun: 1%, article – verb – noun: 0.05%, article – verb – verb: 0.02%
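
A compact sketch of the Viterbi idea: for every word and every candidate tag, keep only the most probable way of getting there and discard the rest. Function and variable names are assumptions; the numbers are the invented ones from the “Tag probabilities” slide, so the results may differ slightly from the rounded values in the figure.

```python
def viterbi(words, initial, lexical, transition):
    """Return (probability, tag sequence) of the most probable path.

    initial:    tag -> probability for the first word
    lexical:    word -> {tag: probability}
    transition: (previous tag, tag) -> probability
    """
    # best[tag] = (probability, path) of the best partial sequence ending in tag
    best = {tag: (p, [tag]) for tag, p in initial.items()}
    for word in words[1:]:
        step = {}
        for tag, p_lex in lexical[word].items():
            # Keep only the best predecessor for this tag.
            step[tag] = max(
                (p * transition.get((prev, tag), 0.0) * p_lex, path + [tag])
                for prev, (p, path) in best.items()
            )
        best = step
    return max(best.values())

words = ["The", "wind", "blows"]
initial = {"ART": 0.98, "ADV": 0.02}
lexical = {
    "wind": {"NOUN": 0.76, "VERB": 0.24},
    "blows": {"VERB": 0.53, "NOUN": 0.47},
}
transition = {
    ("ART", "NOUN"): 0.72, ("ART", "VERB"): 0.01,
    ("ADV", "NOUN"): 0.00, ("ADV", "VERB"): 0.06,
    ("NOUN", "VERB"): 0.61, ("NOUN", "NOUN"): 0.04,
    ("VERB", "VERB"): 0.03, ("VERB", "NOUN"): 0.59,
}

probability, tags = viterbi(words, initial, lexical, transition)
print(tags, round(probability, 2))
```

Per word, the work is proportional to the square of the number of candidate tags, instead of growing exponentially with sentence length.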
HMMs – the theory

A five-tuple (S, K, Π, A, B)
• Set of states S
– here: the possible tags at any point
• Output alphabet K
– here: the possible tokens
• Initial probabilities Π
– here: probabilities for first item in a sentence/text
• State transition probabilities A
– here: tag sequence probabilities
• Symbol emission probabilities B
– here: token-tag probabilities
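
In the usual textbook formulation (the notation below is an assumption, not taken from the slides), the five-tuple gives the joint probability of a tag sequence t_1 … t_n and a token sequence w_1 … w_n, and the tagger looks for the tag sequence that maximizes it:

```latex
P(t_1 \dots t_n,\, w_1 \dots w_n)
  = \pi_{t_1}\, b_{t_1}(w_1) \prod_{i=2}^{n} a_{t_{i-1} t_i}\, b_{t_i}(w_i),
\qquad
\hat{t}_1 \dots \hat{t}_n
  = \operatorname*{arg\,max}_{t_1 \dots t_n \in S^n} P(t_1 \dots t_n,\, w_1 \dots w_n)
```

Here π comes from Π, a from A and b from B; in this formulation b_t(w) is the probability of token w given tag t.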
POST: Current state


• Baseline approach (tagging each token with its most frequent
tag) delivers up to 90% accuracy
• State-of-the-art taggers reach 96-97% accuracy
– But: given an average sentence length of 20 words in a
newspaper text, we still get errors in roughly half of all sentences!
• POS taggers are used as a first step in most complex CL
applications
• Some free online taggers: CLAWS, CST, CCG…
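
The step from per-token accuracy to sentence-level errors can be checked with a short calculation. It is a rough sketch that assumes tagging errors are independent of each other, which real errors are not; the function name is made up.

```python
def error_free_share(token_accuracy, sentence_length=20):
    """Share of sentences tagged without any error, assuming independent token errors."""
    return token_accuracy ** sentence_length

for accuracy in (0.90, 0.96, 0.97):
    share = error_free_share(accuracy)
    print(f"{accuracy:.0%} per token: {1 - share:.0%} of sentences contain an error")
```

Under this assumption, about 56% of 20-word sentences contain at least one error at 96% accuracy, and about 46% at 97%.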