POS Tagging and Context-Free Grammars
CS 4705
From Words to Syntactic Structures
• Words → morphological structures
Cows, cowed, reformulation
• Ngrams → statistical co-occurrences
The man/unicorn said, The man the
• POS → word classes
DET N V, DET N DET
• Syntactic Constituents → word relationships
S → NP VP, S → S conj S
POS
• Review: Words of the same class behave similarly
– Are these words nouns or adjectives?
• a blue seat
• a very blue seat
• this seat is blue
• a child seat
• *a very child seat
• *this seat is child
– What are the word classes?
– How do we identify a word’s POS?
How many word classes are there?
• A basic set:
– Open class/content words: Noun, Verb,
Adjective, Adverb
– Closed class/function words: Preposition,
Pronoun, Determiner, Auxiliary and copular
verbs, Particle, Conjunction
Nouns
• Words that describe people, places, things, events,
abstractions, activities, …
Hector, mirror, New York, cat, love, government
– Can take the possessive: Hector’s, the cat’s
– Can (usually) occur as plurals: governments, many
New Yorks
– Can occur with determiners: the cat, Homer’s Hector
• Subclasses:
– Proper Nouns: Hector, New York
– Common Nouns: cat, dog, football
• Mass vs. count nouns: mass nouns are not enumerable (sand), count nouns are (cat)
Verbs
• Refer to actions, events, conditions, processes
Go, kicked, think, manage, trying
• Tense: when did the action, event, … take place?
– Present: I kick (simple), I am kicking (progressive), I
have kicked (perfect), I have been kicking (perfect
progressive)
– Past: I kicked, I was kicking, I had kicked, I had been
kicking
– Future: I will kick, I will be kicking, I will have kicked,
I will have been kicking
• Aspect: the nature of the action or event
(simple/indefinite, complete, continuing)
Adjectives
• Describe properties or qualities
Pretty, red, careful, cat-like, wishful, silly
Adverbs
• Modify verbs or adverbs or ….
• Directional or locative: here, upward
• Degree modifiers: very, too
• Manner: slowly
• Temporals: today, now
• Are they adverbs or nouns?
Prepositions and Particles
• Prepositions: indicate spatial or temporal relations
To Boston, From Boston
In, for, with, toward, into, by
• Particles: act like prepositions or adverbs but
behave like semantic units with their verbs
• Test: can you move the prep/part and what follows
to the front of the sentence?
Prep: We ran up the hill. Up the hill we ran.
Part: We ran up the bill. *Up the bill we ran.
• Some particles with their verbs:
Run into (*Into Bill we ran)
Find out (*Out the truth we found)
Turn on (*On the light we turned)
Throw up (*Up his dinner he threw)
Determiners
• Articles: the cat, a cat, an idiot
• Possessive nouns/pronouns: her cat, Sally’s cat
• Numbers: five cats
• Indefinite pronouns: each cat, some cats
• Demonstrative pronouns: that cat, those cats
Conjunctions
• Coordinate:
and, but
• Subordinate/complementizers:
…that the war is over,
…because I love you,
…unless you change your ways
Pronouns
– Personal: I, he,...
– Possessive: my, his,…
– Indefinite: someone, everyone, anybody,
nothing
– Interrogative or wh: who, whom,...
– And many more…
Auxiliary Verbs
• Indicate features of a main verb, such as tense and
aspect
Be (copula), have, do, can/will/shall/may (modal)
He is silent, She has done that, We can help
• And more…
– Interjections/discourse markers
– Existential there: There is a unicorn in the
garden
– Greetings, politeness terms
Part-of-Speech Tagging
• It’s useful to know the POS of words in a sentence
– Time/N flies/V like/Prep an/Det arrow/N
– Fruit/N flies/N like/V a/DET banana/N
POS can disambiguate
• Some words have only one POS tag: is, Mary,
very, smallest
• Others have a single most likely tag: a, dog
• Many are more ambiguous: likes, bass
• But luckily….tags tend to co-occur regularly with
other tags (e.g. DET N more likely than N DET)
– We can learn POS ngram probabilities P(t_i | t_{i-1})
from a tagged corpus just as we learn word
ngram probabilities
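A minimal sketch of that counting, assuming a toy corpus of (word, tag) pairs and a plain MLE estimate with no smoothing (the corpus data here is hypothetical):

from collections import Counter

# Toy tagged corpus: each sentence is a list of (word, tag) pairs.
corpus = [
    [("the", "DET"), ("cat", "N"), ("sleeps", "V")],
    [("a", "DET"), ("dog", "N"), ("barks", "V")],
    [("the", "DET"), ("dog", "N"), ("likes", "V"), ("the", "DET"), ("cat", "N")],
]

bigram_counts = Counter()
context_counts = Counter()
for sent in corpus:
    tags = ["<s>"] + [tag for _, tag in sent]      # sentence-start marker
    for prev, cur in zip(tags, tags[1:]):
        bigram_counts[(prev, cur)] += 1
        context_counts[prev] += 1

def p(cur, prev):
    """MLE estimate: P(t_i | t_{i-1}) = C(t_{i-1}, t_i) / C(t_{i-1})."""
    return bigram_counts[(prev, cur)] / context_counts[prev]

print(p("N", "DET"))   # DET N is frequent  -> 1.0 on this toy corpus
print(p("DET", "N"))   # N DET never occurs -> 0.0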
Approaches to POS Tagging
• Hand-written rules
• Statistical approaches (e.g. HMM-based taggers)
• Hybrid systems (e.g. Brill’s TBL: transformation-based learning)
Brill Tagging: TBL
• Start with simple rules…learn better ones from
tagged corpus
– Init: Start with a (hand) tagged corpus and
remove the tags from a copy
– Tag each word in the copy with most likely
POS (obtained from the original or another
tagged corpus)
– Select a transformation that most improves
tagging accuracy (compared to original)
– Re-tag the whole corpus applying just this
transformation and put it on the list of
transformations
– Compare the new tags of the copy to the
original
– Again, select the transformation that most
improves the accuracy of the (better) tags on
the copy compared to the original
– Iterate until performance doesn’t improve (no
transformation improves tagging accuracy)
• Result: tagging procedure (set of transformations)
which can be applied to new, untagged text
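A schematic sketch of this training loop. The single rule template ("change tag X to Y when the preceding word is tagged Z"), the scoring, and the data format are simplifying assumptions for illustration, not Brill's actual implementation:

# Rules are (from_tag, to_tag, prev_tag) triples under one assumed template.

def apply_rule(tags, rule):
    """Apply one transformation to a tag sequence."""
    frm, to, prev = rule
    out = list(tags)
    for i in range(1, len(out)):
        if out[i] == frm and out[i - 1] == prev:
            out[i] = to
    return out

def accuracy(tags, gold):
    """Number of tags that agree with the hand-tagged original."""
    return sum(t == g for t, g in zip(tags, gold))

def learn(words, gold, most_likely_tag, candidates, max_rules=100):
    # Init: tag every word with its most likely POS ("N" if unseen).
    tags = [most_likely_tag.get(w, "N") for w in words]
    learned = []
    for _ in range(max_rules):
        # Select the candidate transformation that most improves accuracy.
        best = max(candidates, key=lambda r: accuracy(apply_rule(tags, r), gold))
        if accuracy(apply_rule(tags, best), gold) <= accuracy(tags, gold):
            break                       # no rule helps: stop iterating
        tags = apply_rule(tags, best)   # re-tag with just this rule
        learned.append(best)            # add it to the transformation list
    return learned                      # ordered rules for tagging new text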
Transformations
Change tag a to tag b when….
An Example
Time flies like an arrow.
1) Tag every word with most likely tag and score
Time/N flies/V like/V an/DET arrow/N
2) For each template, try every instantiation and
apply to tagged corpus and score
e.g. Change V to N when the preceding word is
tagged V
Time/N flies/V like/N an/DET arrow/N
e.g. Change V to Prep when the preceding word
is tagged V
Time/N flies/V like/Prep an/DET arrow/N
3) Select the transformation rule that most improves
the overall accuracy of POS assignments on the
training corpus
4) Add the new rule to the tagging procedure list
5) Iterate from (2) until no transformation improves
score
• Result: ordered list of transformation rules
which can be applied sequentially to new,
untagged data (after initializing with most
common tag)
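So tagging new text is just the initialization step plus the learned rules applied in order. A self-contained sketch on the running example (the most-likely-tag table and the single rule are assumed, matching the example above):

most_likely = {"Time": "N", "flies": "V", "like": "V", "an": "DET", "arrow": "N"}
rules = [("V", "Prep", "V")]   # change V to Prep when the previous tag is V

words = "Time flies like an arrow".split()
tags = [most_likely.get(w, "N") for w in words]    # step 1: most likely tags
for frm, to, prev in rules:                        # step 2: rules, in order
    for i in range(1, len(tags)):
        if tags[i] == frm and tags[i - 1] == prev:
            tags[i] = to
print(list(zip(words, tags)))
# [('Time', 'N'), ('flies', 'V'), ('like', 'Prep'), ('an', 'DET'), ('arrow', 'N')]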
Basic Constituents and Rewrite Rules
• S → NP VP
• NP → DET NOM
• NP → PropN
• NOM → N | NOM N
• DET → a | an | the
• PropN → George | Morocco
• N → cat | box
• VP → V NP
• VP → V
• V → exploded
More Constituents and Rules
• VP → V PP
• PP → Prep NP
• Prep → at | over | under | in | by
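These rules can be tried out directly, for example with NLTK's CFG tools (a sketch assuming NLTK is installed; the chart parser is a preview of how we'll use CFGs next class):

import nltk

# The toy grammar above in NLTK notation (terminals quoted).
grammar = nltk.CFG.fromstring("""
S -> NP VP
NP -> DET NOM | PropN
NOM -> N | NOM N
VP -> V NP | V | V PP
PP -> Prep NP
DET -> 'a' | 'an' | 'the'
PropN -> 'George' | 'Morocco'
N -> 'cat' | 'box'
V -> 'exploded'
Prep -> 'at' | 'over' | 'under' | 'in' | 'by'
""")

parser = nltk.ChartParser(grammar)
for tree in parser.parse("the cat exploded in Morocco".split()):
    print(tree)
# (S (NP (DET the) (NOM (N cat)))
#    (VP (V exploded) (PP (Prep in) (NP (PropN Morocco)))))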
How to write a grammar
• Scenario: You are a lowly programmer in IT at a
major financial institution in NYC. Your boss tells
you the department needs to port data from an old
database in which the person name field was not
divided into multiple fields (title, first name,
middle name, surname, suffix) to a new, modern
database
• Your task: Separate these names into their proper
fields for the new database
• What do you do?
Solutions
• Go through the old database names one at a time
and type them into the new db
• Create a script with regular expressions to search
for names with different components and write
each out into a standard set of fields (sketched below)
• Build an FST to process the names and output
field-separated components
• Write a Context Free Grammar to parse the names
into their constituents
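For instance, the regular-expression option might look like this (a sketch; the assumed field layout and the small title/suffix lists are illustrative, not a complete solution):

import re

# Hypothetical pattern for "Title? First Middle? Last Suffix?" names.
NAME = re.compile(
    r"^(?:(?P<title>Mr\.|Mrs\.|Ms\.|Miss|Dr\.|Gen\.)\s+)?"
    r"(?P<first>\w+)\s+"
    r"(?:(?P<middle>\w+\.?)\s+)?"
    r"(?P<last>\w+)"
    r"(?:,?\s+(?P<suffix>Jr\.|Sr\.|Esq\.|DDS))?$"
)

m = NAME.match("Dr. John Q. Public Jr.")
if m:
    print(m.groupdict())
# {'title': 'Dr.', 'first': 'John', 'middle': 'Q.', 'last': 'Public', 'suffix': 'Jr.'}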
A Name Grammar
• Name → Title Firstname Middlename Surname Honorific
• Name → Title Firstname Middlename Surname
• Name → Firstname Middlename Surname Honorific
• Name → Firstname Middlename Surname
• Name → Title Firstname MiddleInitial Surname Honorific
• …….
A Better Name Grammar
• Name → Title BaseName Suffix
• Name → Title BaseName
• Name → BaseName Suffix
• BaseName → Firstname Middle Surname
• Middle → Middlename
• Middle → MiddleInitial
• Title → Mr. | Mrs. | Ms. | Miss | Dr. | Gen. | …
• Suffix → Jr. | Sr. | Esq. | DDS | …
• …….
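The same grammar runs as a CFG; a sketch in the NLTK form used earlier (the lexical entries for first/middle/last names are hypothetical, and the bare-BaseName rule is one plausible member of the elided "…" above):

import nltk

name_grammar = nltk.CFG.fromstring("""
Name -> Title BaseName Suffix | Title BaseName | BaseName Suffix | BaseName
BaseName -> Firstname Middle Surname
Middle -> Middlename | MiddleInitial
Title -> 'Mr.' | 'Mrs.' | 'Ms.' | 'Miss' | 'Dr.' | 'Gen.'
Suffix -> 'Jr.' | 'Sr.' | 'Esq.' | 'DDS'
Firstname -> 'John' | 'Mary'
Middlename -> 'Quincy'
MiddleInitial -> 'Q.'
Surname -> 'Public' | 'Adams'
""")

parser = nltk.ChartParser(name_grammar)
for tree in parser.parse(["Dr.", "John", "Q.", "Public", "Jr."]):
    print(tree)
# (Name (Title Dr.)
#       (BaseName (Firstname John) (Middle (MiddleInitial Q.)) (Surname Public))
#       (Suffix Jr.))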
Next Class
• How do we use CFGs for parsing?
• Read Chapter 11