Morphology and POS Tagging

advertisement
Language Processing and Computational Linguistics
EDA171/DATN06 – Lecture 6
Part-of-Speech Tagging Using Rules
Pierre Nugues
Pierre.Nugues@cs.lth.se
www.cs.lth.se/~pierre
© Pierre Nugues, Lecture 6, September 2009
1
© Pierre Nugues, Lecture 6, September 2009
2
Word Ambiguity
Part of speech
Semantic
English
French
German
can modal
le article
der article
can noun
le pronoun
der pronoun
great big
grand big
groß
great notable grand notable groß
© Pierre Nugues, Lecture 6, September 2009
3
POS Tagging
Words
Possible Tags
that
subordinating conjunction that he can swim is good
round
Example of Use
determiner
that white table
adverb
it is not that easy
pronoun
that is the table
relative pronoun
the table that collapsed
verb
round the usual suspects
preposition
turn round the corner
noun
a big round
adjective
a round box
adverb
he went round
© Pierre Nugues, Lecture 6, September 2009
4
table
might
noun
that white table
verb
I table that
noun
the might of the wind
modal verb
she might come
collapse noun
verb
© Pierre Nugues, Lecture 6, September 2009
the collapse of the empire
the empire can collapse
5
Part-of-Speech Ambiguity in Swedish
The word som in the Norstedts svenska ordbok, 1999. Three
entries:
1. Om jag vore lika vacker som du, skulle jag vara lycklig.
(konjunktion)
2. Bilen som jag köpte ifjol. (pronomen)
3. Som jag har saknat dig. (adverb)
POS difference is significant:
© Pierre Nugues, Lecture 6, September 2009
6
Compare the pronunciation of vaken, adjective and vaken,
noun in Swedish
In English, compare object in I object to violence, verb, or I
could see an object, noun.
© Pierre Nugues, Lecture 6, September 2009
7
Phrase Structure Rules are not Satisfying
I/noun see/noun a/noun bird/noun
Because of city school committee meeting.
The disambiguation methods are based on
 Handcrafted rules
 Automatically learned rules
 Statistical methods
Currently accuracy is greater than 95% for many languages
© Pierre Nugues, Lecture 6, September 2009
8
POS Annotation with Rules
The phrase The can rusted has two readings
Let’s suppose that can/modal is more frequent than can/noun
in our corpus
First step: Assign the most likely POS
The/art can/modal rusted/verb
Second step: Apply rules
Change the tag from modal to noun if one of the
two previous words is an article
© Pierre Nugues, Lecture 6, September 2009
9
This is the idea of the Brill tagger.
© Pierre Nugues, Lecture 6, September 2009
10
Rule Templates
Rules
Explanation
alter(A, B, prevtag(C))
Change A to B if previous tag is C
alter(A, B, prev1or2or3tag(C))
Change A to B if previous one or two
or three tag is C
alter(A, B, prev1or2tag(C))
Change A to B if previous one or two
tag is C
alter(A, B, next1or2tag(C))
Change A to B if next one or two tag
is C
alter(A, B, nexttag(C))
Change A to B if next tag is C
alter(A, B, surroundingtag(C, D)) Change A to B if surrounding tags are
© Pierre Nugues, Lecture 6, September 2009
11
C and D
alter(A, B, nextbigram(C, D))
Change A to B if next bigram tag is C
D
alter(A, B, prevbigram(C, D))
Change A to B if previous bigram tag
is C D
© Pierre Nugues, Lecture 6, September 2009
12
Learning Rules Automatically
Compare the hand-annotation of the reference corpus with the
automatic one
Automatic tagging
Hand annotation: gold
standard
The/art can/modal
The/art can/noun rusted/verb
rusted/verb
For each error instantiate the templates
Rules correcting the error
© Pierre Nugues, Lecture 6, September 2009
13
alter(modal, noun, prevtag(art)).
alter(modal, noun, prev1or2tag(art)).
alter(modal, noun, nexttag(verb))
alter(modal, noun, surroundingtag(art, verb))
Rules introduce good and bad transformations
Select the rule that has the greatest error reduction and apply it
© Pierre Nugues, Lecture 6, September 2009
14
Brill’s Learning Algorithm
Step
1.
Operation
Annotate each word of
Input
Corpus
Output
AnnotatedCorpus(1)
the corpus with its most
likely part-of-speech
2.
Compare pairwise part-
AnnotationReference List of errors
of-speech of each word
AnnotatedCorpus(i)
of the
AnnotationReference and
AnnotatedCorpus(i)
3.
For each error instantiate List of errors
© Pierre Nugues, Lecture 6, September 2009
List of tentative rules
15
the rule templates to
correct the error
4.
For each instantiated rule, AnnotatedCorpus(i) Scored tentative rules
compute on
Tentative rules
AnnotatedCorpus(i) the
number of good
transformations minus
the number of bad
transformations the rule
yields
5.
Select the rule that has
Tentative rules
Rule(i)
the greatest error
© Pierre Nugues, Lecture 6, September 2009
16
reduction and append it
to the ordered list of
transformations
6.
7.
Apply Rule(i) to
AnnotatedCorpus(i) AnnotatedCorpus(i+1)
AnnotatedCorpus(i)
Rule(i)
If number of errors is
List of rules
under predefined
threshold end the
algorithm else go to step
2.
© Pierre Nugues, Lecture 6, September 2009
17
First Brill’s Rules
Change
# From To
1 NN
VB
Condition
Previous tag is TO
2 VBP VB
One of the previous three tags is MD
3 NN
VB
One of the previous two tags is MD
4 VB
NN One of the previous two tags is DT
© Pierre Nugues, Lecture 6, September 2009
18
5 VBD VBN One of the previous three tags is VBZ
© Pierre Nugues, Lecture 6, September 2009
19
Standard POS Tagsets
1. CC
Coordinating conjunction
17. POS Possessive ending
34. WP wh-pronoun
2. CD
Cardinal number
18. PRP Personal pronoun
35. WP$ Possessive wh-pronoun
3. DT
Determiner
19. PRP$ Possessive pronoun
36. WRB wh-adverb
4. EX
Existential there
20. RB
37. #
Pound sign
5. FW
Foreign word
21. RBR Adverb, comparative
38. $
Dollar sign
6. IN
Preposition /subordinating
22. RBS Adverb, superlative
39. .
Sentence final
conjunction
23. RP
7. JJ
Adjective
24. SYM Symbol
40. ,
Comma
8. JJR
Adjective, comparative
25. TO
to
41. :
Colon, semi-colon
9. JJS
Adjective, superlative
26. UH Interjection
42. (
Left bracket character
10. LS
List item marker
27. VB
Verb, base form
43. )
Right bracket character
11. MD Modal
28. VBD Verb, past tense
44. "
Straight double quote
12. NN Noun, singular or mass
29. VBG Verb, gerund/present
45. `
Left open single quote
46. ``
Left open double quote
13. NNS
Noun, plural
© Pierre Nugues, Lecture 6, September 2009
Adverb
punctuation
Particle
participle
20
14. NNP Proper noun, singular
30. VBN Verb, past participle
47. '
Right close single quote
15. NNPS
31. VBP Verb, non-3rd ps. sing.
48. ''
Right close double
Proper noun, plural
Present
16. PDT Predeterminer
quote
32. VBZ Verb, 3rd ps. sing. Present
33. WDT wh-determiner
© Pierre Nugues, Lecture 6, September 2009
21
An Example
Battle-tested/JJ industrial/JJ managers/NNS here/RB always/RB
buck/VBP up/RP nervous/JJ newcomers/NNS with/IN the/DT tale/ NN
of/IN the/DT first/JJ of/IN their/PP$ countrymen/NNS to/TO visit/VB
Mexico/NNP ,/, a/DT boatload/NN of/IN samurai/FW warriors/NNS
blown/VBN ashore/RB 375/CD years/NNS ago/RB ./.
“/“ From/IN the/DT beginning/NN ,/, it/PRP took/VBD a/DT man/NN
with/IN extraordinary/JJ qualities/NNS to/TO succeed/VB in/IN
Mexico/NNP ”/” says/VBZ Kimihide/NNP Takimura/NNP ,/,
© Pierre Nugues, Lecture 6, September 2009
22
president/NN of/IN the/DT Mitsui/NNP group/NN ’s/POS
Kensetsu/NNP Engineering/NNP Inc./NNP unit/NN ./.
© Pierre Nugues, Lecture 6, September 2009
23
Measuring Quality: The Confusion Matrix
From Franz (1996, p. 124)
© Pierre Nugues, Lecture 6, September 2009
24
© Pierre Nugues, Lecture 6, September 2009
25
Recognizing Parts of Speech
Parts of speech denomination is comparable in Western
European languages and roughly corresponds
They follow Donatus’ teaching
(http://ccat.sas.upenn.edu/jod/texts/donatus.4.html)
If you are not sure, look up in a dictionary
Two common mistakes in the labs:
 Confusion between noun and the Swedish word namn.
© Pierre Nugues, Lecture 6, September 2009
26
o A common noun, or more simply a noun, corresponds to
substantiv
o Proper noun, or name, (or proper name) corresponds to namn.
 Possessive pronouns like my, your, his, her, … are not pronouns.
They should be called possessive adjectives or determiners.
© Pierre Nugues, Lecture 6, September 2009
27
Multext
Categories
Code Categories
Code
Noun
N
Adposition (Preposition)
S
Verb
V
Conjunction
C
Adjective
A
Numeral
M
Pronoun
P
Interjection
I
Determiner
D
Residual
X
Adverb
R
© Pierre Nugues, Lecture 6, September 2009
28
Attributes for Nouns
Position
Attribute
Value
Code
1
Type
common
c
proper
p
masculine
m
feminine
f
neuter
n
singular
s
plural
p
nominative
n
2
3
4
Gender
Number
Case
© Pierre Nugues, Lecture 6, September 2009
29
© Pierre Nugues, Lecture 6, September 2009
genitive
g
dative
d
accusative
a
30
Annotation for Swedish: Tokens
Bilen framför justitieministern svängde fram och tillbaka över
vägen så att hon blev rädd.
‘The car in front of the Justice Minister swung back and forth
and she was frightened.’
<tokens>
<token id="8">över</token>
<token id="1">Bilen</token>
<token id="9">vägen</token>
<token id="2">framför</token>
<token id="10">så</token>
<token id="3">justitieministern</token>
<token id="11">att</token>
<token id="4">svängde</token>
<token id="12">hon</token>
<token id="5">fram</token>
<token id="13">blev</token>
<token id="6">och</token>
<token id="14">rädd</token>
<token id="7">tillbaka</token>
<token id="15">.</token>
© Pierre Nugues, Lecture 6, September 2009
31
</tokens>
© Pierre Nugues, Lecture 6, September 2009
32
Parts of Speech for Swedish
<taglemmas>
<taglemma id="1" tag="nn.utr.sin.def.nom" lemma="bil"/>
<taglemma id="2" tag="pp" lemma="framför"/>
<taglemma id="3" tag="nn.utr.sin.def.nom" lemma="justitieminister"/>
<taglemma id="4" tag="vb.prt.akt" lemma="svänga"/>
<taglemma id="5" tag="ab" lemma="fram"/>
<taglemma id="6" tag="kn" lemma="och"/>
<taglemma id="7" tag="ab" lemma="tillbaka"/>
<taglemma id="8" tag="pp" lemma="över"/>
<taglemma id="9" tag="nn.utr.sin.def.nom" lemma="väg"/>
<taglemma id="10" tag="ab" lemma="så"/>
<taglemma id="11" tag="sn" lemma="att"/>
<taglemma id="12" tag="pn.utr.sin.def.sub" lemma="hon"/>
<taglemma id="13" tag="vb.prt.akt.kop" lemma="bli"/>
<taglemma id="14" tag="jj.pos.utr.sin.ind.nom" lemma="rädd"/>
<taglemma id="15" tag="mad" lemma="."/>
© Pierre Nugues, Lecture 6, September 2009
33
</taglemmas>
© Pierre Nugues, Lecture 6, September 2009
34
Download