Part of Speech

advertisement

Word classes and part of speech tagging

Outline

Why part of speech tagging?

Word classes

Tag sets and problem definition

Automatic approaches 1: rule-based tagging

Automatic approaches 2: stochastic tagging

Automatic approaches 3: transformation-based tagging

Other issues: tagging unknown words, evaluation

Slide 1

Definition

“The process of assigning a part-of-speech or other lexical class marker to each word in a corpus”

WORDS the girl kissed the baby on the cheek

TAGS

N

V

P

DET

Slide 2

An Example

WORD the girl kissed the baby on the cheek

LEMMA the girl kiss the baby on the cheek

TAG

+DET

+NOUN

+VPAST

+DET

+NOUN

+PREP

+DET

+NOUN

Slide 3

Motivation

Speech synthesis — pronunciation

Speech recognition — class-based N-grams

Information retrieval — stemming, selection highcontent words

Word-sense disambiguation

Corpus analysis of language & lexicography

Slide 4

Outline

Why part of speech tagging?

Word classes

Tag sets and problem definition

Automatic approaches 1: rule-based tagging

Automatic approaches 2: stochastic tagging

Automatic approaches 3: transformation-based tagging

Other issues: tagging unknown words, evaluation

Slide 5

Word Classes

Basic word classes: Noun, Verb, Adjective, Adverb,

Preposition, …

Open vs. Closed classes

Open:

Nouns, Verbs, Adjectives, Adverbs.

Why “open”? (that readily accept new members)

Closed:

Definition: The category of function words--that is, parts of speech

(or word classes) that do not readily accept new members. determiners: a, an, the pronouns: she, he, I prepositions: on, under, over, near, by, …

Slide 6

Open Class Words

Every known human language has nouns and verbs

Nouns: people, places, things

Classes of nouns proper vs. common count vs. mass

Verbs: actions and processes

Adjectives: properties, qualities

Adverbs: hodgepodge!

Unfortunately, John walked home extremely slowly yesterday

Numerals: one, two, three, third, …

Slide 7

Closed Class Words

Differ more from language to language than open class words

Examples: prepositions: on, under, over, … particles: up, down, on, off, … determiners: a, an, the, … pronouns: she, who, I, ..

conjunctions: and, but, or, … auxiliary verbs: can, may should, …

Slide 8

Prepositions from CELEX

Slide 9

English Single-Word Particles

Slide 10

Pronouns in CELEX

Slide 11

Conjunctions

Slide 12

Auxiliaries

Slide 13

Outline

Why part of speech tagging?

Word classes

Tag sets and problem definition

Automatic approaches 1: rule-based tagging

Automatic approaches 2: stochastic tagging

Automatic approaches 3: transformation-based tagging

Other issues: tagging unknown words, evaluation

Slide 14

Word Classes: Tag Sets

• Vary in number of tags: a dozen to over 200

• Size of tag sets depends on language, objectives and purpose

– Some tagging approaches (e.g., constraint grammar based) make fewer distinctions e.g., conflating prepositions, conjunctions, particles

– Simple morphology = more ambiguity = fewer tags

Slide 15

Word Classes: Tag set example

PRP

PRP$

Slide 16

Example of Penn Treebank Tagging of

Brown Corpus Sentence

The/DT grand/JJ jury/NN commented/VBD on/IN a/DT number/NN of/IN other/JJ topics/NNS ./.

VB DT NN .

Book that flight .

VBZ DT NN VB NN ?

Does that flight serve dinner ?

See http://www.infogistics.com/posdemo.htm

Slide 17

The Problem

Words often have more than one word class: this

This is a nice day = PRP

This day is nice = DT

You can go this far = RB

Slide 18

Word Class Ambiguity

(in the Brown Corpus)

Unambiguous (1 tag): 35,340

Ambiguous (2-7 tags): 4,100

2 tags

3 tags

4 tags

5 tags

6 tags

7 tags

3,760

264

61

12

2

1

Slide 19

Part-of-Speech Tagging

• Rule-Based Tagger: ENGTWOL (ENGlish TWO Level analysis)

• Stochastic Tagger: HMM-based

• Transformation-Based Tagger (Brill)

Slide 20

Outline

Why part of speech tagging?

Word classes

Tag sets and problem definition

Automatic approaches 1: rule-based tagging

Automatic approaches 2: stochastic tagging

Automatic approaches 3: transformation-based tagging

Other issues: tagging unknown words, evaluation

Slide 21

Rule-Based Tagging

• Basic Idea:

– Assign all possible tags to words

– Remove tags according to set of rules of type: if word+1 is an adj, adv, or quantifier and the following is a sentence boundary and word-1 is not a verb like “consider” then eliminate non-adv else eliminate adv.

– Typically more than 1000 hand-written rules, but may be machine-learned.

Slide 22

Sample ENGTWOL Lexicon

Demo: http://www2.lingsoft.fi/cgi-bin/engtwol

Slide 23

Outline

Why part of speech tagging?

Word classes

Tag sets and problem definition

Automatic approaches 1: rule-based tagging

Automatic approaches 2: stochastic tagging

Automatic approaches 3: transformation-based tagging

Other issues: tagging unknown words, evaluation

Slide 24

Stochastic Tagging

• Based on probability of certain tag occurring given various possibilities

• Requires a training corpus

• No probabilities for words not in corpus.

• Training corpus may be different from test corpus.

Slide 25

Stochastic Tagging (cont.)

• Simple Method: Choose most frequent tag in training text for each word!

– Result: 90% accuracy

– Baseline

– Others will do better

Slide 26

Outline

Why part of speech tagging?

Word classes

Tag sets and problem definition

Automatic approaches 1: rule-based tagging

Automatic approaches 2: stochastic tagging

Automatic approaches 3: transformation-based tagging

Other issues: tagging unknown words, evaluation

Slide 27

Transformation-Based Tagging

(Brill Tagging)

• Combination of Rule-based and stochastic tagging methodologies

– Like rule-based because rules are used to specify tags in a certain environment

– Like stochastic approach because machine learning is used—with tagged corpus as input

• Input:

– tagged corpus

– dictionary (with most frequent tags)

• Usually constructed from the tagged corpus

Slide 28

Transformation-Based Tagging (cont.)

• Basic Idea:

– Set the most probable tag for each word as a start value

– Change tags according to rules of type “if word-1 is a determiner and word is a verb then change the tag to noun” in a specific order

• Training is done on tagged corpus:

– Write a set of rule templates

– Among the set of rules, find one with highest score

– Continue from 2 until lowest score threshold is passed

– Keep the ordered set of rules

• Rules make errors that are corrected by later rules

Slide 29

Outline

Why part of speech tagging?

Word classes

Tag sets and problem definition

Automatic approaches 1: rule-based tagging

Automatic approaches 2: stochastic tagging

Automatic approaches 3: transformation-based tagging

Other issues: tagging unknown words, evaluation

Slide 30

Tagging Unknown Words

• New words added to (newspaper) language 20+ per month

• Plus many proper names …

• Increases error rates by 1-2%

Method 1: assume they are nouns

Method 2: assume the unknown words have a probability distribution similar to words only occurring once in the training set.

Method 3: Use morphological information, e.g., words ending with –ed tend to be tagged VBN.

Slide 31

Evaluation

• The result is compared with a manually coded “Gold

Standard”

– Typically accuracy reaches 96-97%

– This may be compared with result for a baseline tagger (one that uses no context).

• Important: 100% is impossible even for human annotators.

• Factors that affects the performance

– The amount of training data available

– The tag set

– The difference between training corpus and test corpus

– Dictionary

– Unknown words

Slide 32

Download