lec05-pos

Part of Speech
• Each word belongs to a word class. The word class of a word is
known as the part of speech (POS) of that word.
• Most POS tags implicitly encode fine-grained specializations of
eight basic parts of speech:
– noun, verb, pronoun, preposition, adjective, adverb,
conjunction, article
• These categories are based on morphological and distributional
similarities (not semantic similarities).
• Parts of speech are also known as:
– word classes
– morphological classes
– lexical tags
Part of Speech (cont.)
• A POS tag of a word describes the major and minor word
classes of that word.
• A POS tag of a word gives a significant amount of information
about that word and its neighbours. For example, a possessive
pronoun (my, your, her, its) most likely will be followed by a
noun, and a personal pronoun (I, you, he, she) most likely will
be followed by a verb.
• Most words have a single POS tag, but some have
more than one (2, 3, 4, …).
• For example, book/noun or book/verb
– I bought a book.
– Please book that flight.
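
A minimal sketch of this ambiguity (not from the lecture; the toy dictionary entries are illustrative assumptions):

POS = {"i": ["pronoun"], "bought": ["verb"], "a": ["article"],
       "please": ["adverb"], "book": ["noun", "verb"],
       "that": ["determiner"], "flight": ["noun"]}

for sentence in ("I bought a book", "Please book that flight"):
    print([(w, POS[w.lower()]) for w in sentence.split()])
# "book" lists both tags in each sentence; choosing the right one from
# context is the tagging problem treated later in this lecture.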
Tag Sets
• There are various tag sets to choose from.
• The choice of tag set depends on the nature of the application.
– We may use a small tag set (more general tags) or
– a large tag set (finer tags).
• Some widely used part-of-speech tag sets:
– Penn Treebank has 45 tags
– Brown Corpus has 87 tags
– C7 tag set has 146 tags
• In a tagged corpus, each word is associated with a tag from
the chosen tag set.
English Word Classes
• Parts of speech can be divided into two broad categories:
– closed class types -- such as prepositions
– open class types -- such as nouns, verbs
• Closed class words are generally also function words.
– Function words play an important role in grammar.
– Some function words are: of, it, and, you.
– Function words are usually very short and occur frequently.
• There are four major open classes:
– noun, verb, adjective, adverb
– A new word may easily enter an open class.
• Word classes may change depending on the natural language, but all natural
languages have at least two word classes: noun and verb.
Nouns
• Nouns can be divided into:
– proper nouns -- names for specific entities such as Ankara,
John, Ali
– common nouns
• Proper nouns do not take an article, but common nouns may.
• Common nouns can be divided into:
– count nouns -- they can be singular or plural -- chair/chairs
– mass nouns -- used when something is conceptualized as a
homogeneous group -- snow, salt
• Mass nouns cannot take the articles a and an, and they cannot be
plural.
Verbs
• The verb class includes words referring to actions and processes.
• Verbs can be divided into:
– main verbs -- open class -- draw, bake
– auxiliary verbs -- closed class -- can, should
• Auxiliary verbs can be divided into:
– copula -- be, have
– modal verbs -- may, can, must, should
• Verbs have different morphological forms:
– non-3rd-person-sg -- eat
– 3rd-person-sg -- eats
– progressive -- eating
– past -- ate
– past participle -- eaten
Adjectives
• Adjectives describe properties or qualities
– for color -- black, white
– for age -- young, old
• In Turkish, all adjectives can also be used as nouns.
– kırmızı kitap -- red book
– kırmızıyı -- the red one (ACC)
Adverbs
• Adverbs normally modify verbs.
• Adverb categories:
– locative adverbs -- home, here, downhill
– degree adverbs -- very, extremely
– manner adverbs -- slowly, delicately
– temporal adverbs -- yesterday, Friday
• Because of the heterogeneous nature of adverbs, some adverbs
such as Friday may be tagged as nouns.
Major Closed Classes
• Prepositions -- on, under, over, near, at, from, to, with
• Determiners -- a, an, the
• Pronouns -- I, you, he, she, who, others
• Conjunctions -- and, but, if, when
• Particles -- up, down, on, off, in, out
• Numerals -- one, two, first, second
Prepositions
• Prepositions occur before noun phrases.
• They indicate spatial or temporal relations.
• Examples:
– on the table
– under the chair
• They occur very frequently. For example, some frequency
counts from a 16-million-word corpus (COBUILD):
– of -- 540,085
– in -- 331,235
– for -- 142,421
– to -- 125,691
– with -- 124,965
– on -- 109,129
– at -- 100,169
Particles
• A particle combines with a verb to form a larger unit called
a phrasal verb.
– go on
– turn on
– turn off
– shut down
Articles
• A small closed class.
• Only three words in the class: a, an, the.
• Articles mark definiteness or indefiniteness.
• They occur very frequently. For example, some frequency counts
from a 16-million-word corpus (COBUILD):
– the -- 1,071,676
– a -- 413,887
– an -- 59,359
• Almost 10% of the words in this corpus are articles.
Conjunctions
• Conjunctions are used to combine or join two phrases, clauses, or
sentences.
• Coordinating conjunctions -- and, or, but
– join two elements of equal status
– Example: you and me
• Subordinating conjunctions -- that, who
– combine a main clause with a subordinate clause
– Example:
• I thought that you might like milk
Pronouns
• Shorthand for referring to some entity or event.
• Pronouns can be divided into:
– personal -- you, she, I
– possessive -- my, your, his
– wh-pronouns -- who, what -- Who is the president?
Tag Sets for English
• There are several widely used tag sets for part-of-speech tagging.
• The Penn Treebank tag set has 45 tags:
– IN -- preposition/subordinating conjunction
– DT -- determiner
– JJ -- adjective
– NN -- noun, singular or mass
– NNS -- noun, plural
– VB -- verb, base form
– VBD -- verb, past tense
• A sentence from the Brown corpus tagged with the
Penn Treebank tag set:
– The/DT grand/JJ jury/NN commented/VBD on/IN a/DT
number/NN of/IN other/JJ topics/NNS ./.
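
The word/TAG notation is straightforward to handle programmatically; a small sketch (the variable names are mine; splitting on the last "/" keeps the "./." token intact):

tagged = ("The/DT grand/JJ jury/NN commented/VBD on/IN a/DT "
          "number/NN of/IN other/JJ topics/NNS ./.")
pairs = [tuple(tok.rsplit("/", 1)) for tok in tagged.split()]
print(pairs[:3])   # [('The', 'DT'), ('grand', 'JJ'), ('jury', 'NN')]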
Part of Speech Tagging
• Part-of-speech tagging is simply assigning the correct part of
speech to each word in an input sentence.
• We assume that we have the following (a toy sketch of these
inputs appears after the algorithm list below):
– A set of tags (our tag set)
– A dictionary that tells us the possible tags for each word
(including all morphological variants)
– A text to be tagged
• There are different algorithms for tagging.
– Rule Based Tagging
– Statistical Tagging (Stochastic Tagging)
– Transformation Based Tagging
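
A minimal sketch of these inputs as Python data; the tag names and lexicon entries are illustrative assumptions, not a real tag set:

TAGSET = {"pronoun", "article", "noun", "verb", "verb-past",
          "auxiliary-past"}

# Toy dictionary of possible tags per word; a real lexicon would
# cover all morphological variants.
LEXICON = {"he": {"pronoun"}, "had": {"verb-past", "auxiliary-past"},
           "a": {"article"}, "book": {"noun", "verb"}}

def possible_tags(words):
    """Lookup step shared by all of the tagging approaches below."""
    return [sorted(LEXICON[w.lower()]) for w in words]

print(possible_tags("He had a book".split()))
# [['pronoun'], ['auxiliary-past', 'verb-past'], ['article'],
#  ['noun', 'verb']]

The same example sentence reappears on the rule-based tagging slides below.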
How hard is tagging?
• Most words in English are unambiguous: they have only a single tag.
• But many of the most common words are ambiguous:
– can/verb, can/auxiliary, can/noun
• The number of word types in the Brown Corpus:
– unambiguous (one tag) -- 35,340
– ambiguous (2-7 tags) -- 4,100
• 2 tags -- 3,760
• 3 tags -- 264
• 4 tags -- 61
• 5 tags -- 12
• 6 tags -- 2
• 7 tags -- 1
• While only 11.5% of word types are ambiguous, over 40% of Brown corpus
tokens are ambiguous.
Rule-Based Part-of-Speech Tagging
• The rule-based approach uses handcrafted sets of rules to tag
an input sentence.
• There are two stages in rule-based taggers:
– First stage: uses a dictionary to assign each word a list of
potential parts of speech.
– Second stage: uses a large list of handcrafted rules to
winnow this list down to a single part of speech for each word.
• ENGTWOL is a rule-based tagger:
– the first stage uses a two-level lexicon transducer;
– the second stage uses hand-crafted rules (about 1,100 rules).
After The First Stage
• Example: He had a book.
• After the first stage:
– he -- he/pronoun
– had -- have/verb-past, have/auxiliary-past
– a -- a/article
– book -- book/noun, book/verb
Tagging Rule
Rule-1:
if (the previous tag is an article)
then eliminate all verb tags
Rule-2:
if (the next tag is a verb)
then eliminate all verb tags
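
A minimal sketch of the second stage, applying just these two rules to the stage-1 output for "He had a book" (tag names continue the toy lexicon from earlier; the real ENGTWOL rule set is far larger):

def apply_rules(candidates):
    """candidates: one list of possible tags per word (stage-1 output)."""
    for i, tags in enumerate(candidates):
        prev = candidates[i - 1] if i > 0 else []
        nxt = candidates[i + 1] if i + 1 < len(candidates) else []
        # Rule-1: the previous word is unambiguously an article
        #         -> eliminate all verb tags.
        if prev == ["article"]:
            keep = [t for t in tags if not t.startswith("verb")]
            if keep:
                tags[:] = keep
        # Rule-2: the next word can be a verb -> eliminate all verb
        #         tags (an auxiliary reading survives).
        if any(t.startswith("verb") for t in nxt):
            keep = [t for t in tags if not t.startswith("verb")]
            if keep:
                tags[:] = keep
    return candidates

cands = [["pronoun"], ["auxiliary-past", "verb-past"],
         ["article"], ["noun", "verb"]]
print(apply_rules(cands))
# Rule-1 resolves "book" (preceded by the article "a"):
# [['pronoun'], ['auxiliary-past', 'verb-past'], ['article'], ['noun']]
# "had" stays ambiguous; more rules would be needed to resolve it.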
Transformation-Based Tagging
• Transformation-based tagging is also known as
Brill Tagging.
• Similar to rule-based tagging, but the rules are learned automatically
from a tagged corpus.
• These learned rules are then used in tagging.
How TBL Rules are Applied
• Before the rules are applied the tagger labels every word with
its most likely tag.
• We get these most likely tags from a tagged corpus.
• Example:
– He is expected to race tomorrow
– he/PRP is/VBZ expected/VBN to/TO race/NN tomorrow/NN
• After selecting the most likely tags, we apply the transformation rules.
– Change NN to VB when the previous tag is TO
– This rule converts race/NN into race/VB
• This may not work in every case:
– … according to race
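
A compact sketch of this procedure; the most-likely-tag table is a toy assumption standing in for unigram counts from a tagged corpus:

MOST_LIKELY = {"he": "PRP", "is": "VBZ", "expected": "VBN",
               "to": "TO", "race": "NN", "tomorrow": "NN"}

def tbl_tag(words):
    # Step 1: label every word with its most likely tag.
    tags = [MOST_LIKELY[w.lower()] for w in words]
    # Step 2: apply the learned transformation
    # "change NN to VB when the previous tag is TO".
    for i in range(1, len(tags)):
        if tags[i] == "NN" and tags[i - 1] == "TO":
            tags[i] = "VB"
    return list(zip(words, tags))

print(tbl_tag("He is expected to race tomorrow".split()))
# race/NN becomes race/VB; the same rule would fire incorrectly on
# "according to race", as noted above.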
How TBL Rules are Learned
• We will assume that we have a tagged corpus.
• Brill’s TBL algorithm has three major steps.
– Tag the corpus with the most likely tag for each word (unigram
model).
– Choose a transformation that deterministically replaces an
existing tag with a new tag such that the resulting tagged
training corpus has the lowest error rate out of all
transformations.
– Apply the transformation to the training corpus.
• These steps are repeated until a stopping criterion is reached.
• The result (which will be our tagger) will be:
– first tag using the most likely tags,
– then apply the learned transformations in order.
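
A condensed sketch of this greedy loop, restricted (as a simplifying assumption) to a single template, "change a to b when the previous tag is z"; the full algorithm searches all of the templates listed on the next slide:

def learn_transformations(gold, current, max_rules=10):
    """gold: correct tags; current: most-likely tags (modified in place)."""
    rules = []
    tagset = set(gold) | set(current)
    for _ in range(max_rules):
        best, best_gain = None, 0
        # Score every instantiation "change a to b when the previous
        # tag is z" by the net number of errors it repairs.
        for a in tagset:
            for b in tagset:
                for z in tagset:
                    gain = sum((gold[i] == b) - (gold[i] == a)
                               for i in range(1, len(current))
                               if current[i] == a and current[i - 1] == z)
                    if gain > best_gain:
                        best, best_gain = (a, b, z), gain
        if best is None:                # stopping criterion: no rule helps
            break
        a, b, z = best
        snapshot = list(current)        # apply the winner to the corpus
        for i in range(1, len(current)):
            if snapshot[i] == a and snapshot[i - 1] == z:
                current[i] = b
        rules.append(best)
    return rules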
Transformations
• A transformation is selected from a small set of templates.
Change tag a to tag b when:
- The preceding (following) word is tagged z.
- The word two before (after) is tagged z.
- One of the two preceding (following) words is tagged z.
- One of the three preceding (following) words is tagged z.
- The preceding word is tagged z and the following word is tagged w.
- The preceding (following) word is tagged z and the word
two before (after) is tagged w.
Basic Results
• We get 91% accuracy just by picking the most likely tag for each word.
• We want to improve on this baseline.
• Some taggers can reach 99% accuracy.
Statistical Part-of-Speech Tagging
• Choosing the best tag sequence $T = t_1, t_2, \dots, t_n$ for a given word
sequence $W = w_1, w_2, \dots, w_n$ (a sentence):

$$\hat{T} = \arg\max_{T} P(T \mid W)$$

By Bayes' rule:

$$\hat{T} = \arg\max_{T} \frac{P(W \mid T)\, P(T)}{P(W)}$$

Since P(W) will be the same for each tag sequence:

$$\hat{T} = \arg\max_{T} P(W \mid T)\, P(T)$$
Statistical POS Tagging (cont.)
• If we assume a tagged corpus and a trigram language model, then
P(T) can be approximated as:
$$P(T) \approx P(t_1)\, P(t_2 \mid t_1) \prod_{i=3}^{n} P(t_i \mid t_{i-2}, t_{i-1})$$

This formula is simple to evaluate: the probabilities are obtained by
simple counting over the tagged corpus (plus smoothing).
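
A sketch of this counting step, assuming the tagged corpus is given as a list of sentences, each a list of (word, tag) pairs; smoothing is omitted for brevity:

from collections import Counter

def train_tag_model(tagged_sentences):
    uni, bi, tri = Counter(), Counter(), Counter()
    for sent in tagged_sentences:
        tags = [t for _, t in sent]
        uni.update(tags)                           # supports P(t_1)
        bi.update(zip(tags, tags[1:]))             # supports P(t_2 | t_1)
        tri.update(zip(tags, tags[1:], tags[2:]))
    # P(t_i | t_{i-2}, t_{i-1}) = C(t_{i-2} t_{i-1} t_i) / C(t_{i-2} t_{i-1})
    def p_tri(t2, t1, t):
        return tri[(t2, t1, t)] / bi[(t2, t1)] if bi[(t2, t1)] else 0.0
    return p_tri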
Statistical POS Tagging (cont.)
To evaluate P(W|T), we will make the simplifying assumption that
each word depends only on its own tag:

$$P(W \mid T) \approx \prod_{i=1}^{n} P(w_i \mid t_i)$$
So, we want the tag sequence that maximizes the following quantity.
$$\left[ P(t_1)\, P(t_2 \mid t_1) \prod_{i=3}^{n} P(t_i \mid t_{i-2}, t_{i-1}) \right] \prod_{i=1}^{n} P(w_i \mid t_i)$$
The best tag sequence can be found with the Viterbi algorithm.
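
A minimal Viterbi sketch. For readability it uses a bigram tag model P(t_i | t_{i-1}) rather than the trigram model above (a trigram decoder would track pairs of previous tags as states), and the probability tables are assumed to be dicts built from counts as sketched earlier:

def viterbi(words, tagset, p_init, p_trans, p_emit):
    """p_init[t], p_trans[(prev, t)], p_emit[(word, t)]: probability dicts."""
    # V[k][t]: probability of the best tag path for words[:k+1] ending in t.
    V = [{t: p_init.get(t, 0.0) * p_emit.get((words[0], t), 0.0)
          for t in tagset}]
    back = []                              # back-pointers for path recovery
    for w in words[1:]:
        col, ptr = {}, {}
        for t in tagset:
            prev = max(tagset,
                       key=lambda s: V[-1][s] * p_trans.get((s, t), 0.0))
            col[t] = (V[-1][prev] * p_trans.get((prev, t), 0.0)
                      * p_emit.get((w, t), 0.0))
            ptr[t] = prev
        V.append(col)
        back.append(ptr)
    # Trace the best path backwards from the best final tag.
    path = [max(tagset, key=lambda t: V[-1][t])]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    return list(reversed(path))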