Why part of speech tagging?
Word classes
Tag sets and problem definition
Automatic approaches 1: rule-based tagging
Automatic approaches 2: stochastic tagging
Automatic approaches 3: transformation-based tagging
Other issues: tagging unknown words, evaluation
Slide 1
“The process of assigning a part-of-speech or other lexical class marker to each word in a corpus”
WORDS: the girl kissed the baby on the cheek
TAGS:  {N, V, P, DET}
Slide 2
WORD:   the    girl    kissed   the    baby    on      the    cheek
LEMMA:  the    girl    kiss     the    baby    on      the    cheek
TAG:    +DET   +NOUN   +VPAST   +DET   +NOUN   +PREP   +DET   +NOUN
Slide 3
Speech synthesis — pronunciation
Speech recognition — class-based N-grams
Information retrieval — stemming, selection of high-content words
Word-sense disambiguation
Corpus analysis of language & lexicography
Slide 4
Why part of speech tagging?
Word classes
Tag sets and problem definition
Automatic approaches 1: rule-based tagging
Automatic approaches 2: stochastic tagging
Automatic approaches 3: transformation-based tagging
Other issues: tagging unknown words, evaluation
Slide 5
Basic word classes: Noun, Verb, Adjective, Adverb, Preposition, …
Open vs. Closed classes
Open:
Nouns, Verbs, Adjectives, Adverbs.
Why “open”? Because these classes readily accept new members.
Closed:
Definition: the category of function words, i.e., parts of speech (or word classes) that do not readily accept new members
determiners: a, an, the
pronouns: she, he, I
prepositions: on, under, over, near, by, …
Slide 6
Every known human language has nouns and verbs
Nouns: people, places, things
Classes of nouns: proper vs. common; count vs. mass
Verbs: actions and processes
Adjectives: properties, qualities
Adverbs: hodgepodge!
Unfortunately, John walked home extremely slowly yesterday
Numerals: one, two, three, third, …
Slide 7
Differ more from language to language than open class words
Examples:
prepositions: on, under, over, …
particles: up, down, on, off, …
determiners: a, an, the, …
pronouns: she, who, I, …
conjunctions: and, but, or, …
auxiliary verbs: can, may, should, …
Slide 8
Why part of speech tagging?
Word classes
Tag sets and problem definition
Automatic approaches 1: rule-based tagging
Automatic approaches 2: stochastic tagging
Automatic approaches 3: transformation-based tagging
Other issues: tagging unknown words, evaluation
Slide 14
• Vary in number of tags: a dozen to over 200
• Size of tag sets depends on language, objectives and purpose
– Some tagging approaches (e.g., constraint-grammar-based ones) make fewer distinctions, e.g., conflating prepositions, conjunctions, and particles
– Simple morphology = more ambiguity = fewer tags
Slide 15
[Penn Treebank tag set excerpt (table not extracted): PRP = personal pronoun, PRP$ = possessive pronoun, …]
Slide 16
The/DT grand/JJ jury/NN commented/VBD on/IN a/DT number/NN of/IN other/JJ topics/NNS ./.
Book/VB that/DT flight/NN ./.
Does/VBZ that/DT flight/NN serve/VB dinner/NN ?/.
See http://www.infogistics.com/posdemo.htm
Slide 17
Words often have more than one word class: this
This is a nice day = PRP
This day is nice = DT
You can go this far = RB
Slide 18
Unambiguous (1 tag): 35,340
Ambiguous (2-7 tags): 4,100
2 tags: 3,760
3 tags: 264
4 tags: 61
5 tags: 12
6 tags: 2
7 tags: 1
Slide 19
• Rule-Based Tagger: ENGTWOL (ENGlish TWO Level analysis)
• Stochastic Tagger: HMM-based
• Transformation-Based Tagger (Brill)
Slide 20
Why part of speech tagging?
Word classes
Tag sets and problem definition
Automatic approaches 1: rule-based tagging
Automatic approaches 2: stochastic tagging
Automatic approaches 3: transformation-based tagging
Other issues: tagging unknown words, evaluation
Slide 21
• Basic Idea:
– Assign all possible tags to words
– Remove tags according to a set of rules of the form:
  if word+1 is an adjective, adverb, or quantifier and the following word is a sentence boundary, and word-1 is not a verb like “consider”, then eliminate non-ADV tags; else eliminate the ADV tag
– Typically more than 1,000 hand-written rules, though they may also be machine-learned (a minimal sketch of the assign-then-eliminate idea follows below)
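A minimal sketch of the assign-then-eliminate idea in Python. The lexicon, the single contextual rule, and all names are illustrative assumptions, not the actual ENGTWOL lexicon or rule set.

```python
# Sketch of rule-based tagging: assign all possible tags, then prune by context.
# Toy lexicon (assumption): unknown words default to NOUN.
LEXICON = {
    "that": {"DET", "ADV", "PRON", "COMP"},
    "odd": {"ADJ"},
    "is": {"VERB"},
    "not": {"ADV"},
    "it": {"PRON"},
    ".": {"PUNCT"},
}

def initial_tags(words):
    """Step 1: assign every tag the lexicon allows for each word."""
    return [set(LEXICON.get(w.lower(), {"NOUN"})) for w in words]

def adverbial_that_rule(words, tags):
    """Step 2 (one toy rule): if the word after 'that' is ADJ/ADV and is
    followed by a sentence boundary, keep only the ADV reading of 'that';
    otherwise eliminate the ADV reading."""
    for i, w in enumerate(words):
        if w.lower() != "that":
            continue
        next_tags = tags[i + 1] if i + 1 < len(words) else set()
        at_boundary = i + 2 >= len(words) or words[i + 2] in {".", "!", "?"}
        if next_tags & {"ADJ", "ADV"} and at_boundary:
            tags[i] &= {"ADV"}      # eliminate non-ADV tags
        else:
            tags[i] -= {"ADV"}      # eliminate the ADV tag
    return tags

words = "it is not that odd .".split()
print(adverbial_that_rule(words, initial_tags(words)))
# -> [{'PRON'}, {'VERB'}, {'ADV'}, {'ADV'}, {'ADJ'}, {'PUNCT'}]
```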
Slide 22
Demo: http://www2.lingsoft.fi/cgi-bin/engtwol
Slide 23
Why part of speech tagging?
Word classes
Tag sets and problem definition
Automatic approaches 1: rule-based tagging
Automatic approaches 2: stochastic tagging
Automatic approaches 3: transformation-based tagging
Other issues: tagging unknown words, evaluation
Slide 24
• Based on the probability of a tag occurring, given the word and the surrounding context (e.g., the preceding tags)
• Requires a training corpus
• No probabilities for words not in corpus.
• Training corpus may be different from test corpus.
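As a concrete instance (the slide does not spell out the model), a bigram HMM tagger chooses the tag sequence that maximizes the product of emission and transition probabilities estimated from the training corpus:

\[ \hat{t}_1^n = \arg\max_{t_1^n} \prod_{i=1}^{n} P(w_i \mid t_i)\, P(t_i \mid t_{i-1}) \]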
Slide 25
• Simple Method: Choose most frequent tag in training text for each word!
– Result: 90% accuracy
– This serves as the baseline (a minimal sketch follows below)
– The other methods described in these slides do better
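A minimal sketch of this most-frequent-tag baseline, assuming the training corpus is a list of (word, tag) pairs; the function names, toy corpus, and the NN fallback for unknown words are illustrative assumptions.

```python
from collections import Counter, defaultdict

def train_most_frequent_tag(tagged_corpus):
    """tagged_corpus: iterable of (word, tag) pairs from the training text."""
    counts = defaultdict(Counter)
    for word, tag in tagged_corpus:
        counts[word.lower()][tag] += 1
    # For each word, keep only its single most frequent tag.
    return {w: c.most_common(1)[0][0] for w, c in counts.items()}

def tag_baseline(words, lexicon, default_tag="NN"):
    """Unknown words get the default tag (here: assume they are nouns)."""
    return [(w, lexicon.get(w.lower(), default_tag)) for w in words]

train = [("the", "DT"), ("flight", "NN"), ("book", "VB"),
         ("book", "NN"), ("book", "NN"), ("that", "DT")]
lex = train_most_frequent_tag(train)
print(tag_baseline("book that flight".split(), lex))
# -> [('book', 'NN'), ('that', 'DT'), ('flight', 'NN')]
```

Note that the baseline tags the imperative “book” as NN because NN is its most frequent tag in the toy corpus; this is exactly the kind of context-dependent error the other methods aim to fix.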
Slide 26
Why part of speech tagging?
Word classes
Tag sets and problem definition
Automatic approaches 1: rule-based tagging
Automatic approaches 2: stochastic tagging
Automatic approaches 3: transformation-based tagging
Other issues: tagging unknown words, evaluation
Slide 27
• Combines the rule-based and stochastic tagging methodologies
– Like the rule-based approach, because rules are used to specify tags in a certain environment
– Like the stochastic approach, because machine learning is used, with a tagged corpus as input
• Input:
– tagged corpus
– dictionary (with most frequent tags)
• Usually constructed from the tagged corpus
Slide 28
• Basic Idea:
– Set the most probable tag for each word as a start value
– Change tags according to rules of type “if word-1 is a determiner and word is a verb then change the tag to noun” in a specific order
• Training is done on a tagged corpus:
  1. Write a set of rule templates
  2. Among all rules instantiated from the templates, find the one with the highest score (e.g., the largest net number of errors corrected)
  3. Apply that rule, then repeat step 2 until the best rule's score falls below a threshold
  4. Keep the resulting ordered list of rules
• Rules make errors that are corrected by later rules (a sketch of rule application follows below)
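A minimal sketch of applying learned transformations at tagging time, assuming rules of the form (previous tag, from-tag, to-tag); the initial lexicon, the single rule, and the helper names are illustrative assumptions, not Brill's learned rule list.

```python
# Sketch of transformation-based (Brill-style) tagging at application time:
# start from the most frequent tag for each word, then apply an ordered
# list of learned rules. The rule below mirrors the slide's example.

def initial_tagging(words, lexicon, default_tag="NN"):
    return [lexicon.get(w.lower(), default_tag) for w in words]

# Each rule: (previous_tag, from_tag, to_tag)
RULES = [
    ("DT", "VB", "NN"),   # "if word-1 is a determiner and word is a verb, retag as noun"
]

def apply_rules(words, tags, rules):
    for prev_tag, from_tag, to_tag in rules:      # rules are applied in learned order
        for i in range(1, len(words)):
            if tags[i - 1] == prev_tag and tags[i] == from_tag:
                tags[i] = to_tag
    return tags

lexicon = {"the": "DT", "can": "VB", "rusted": "VBD"}   # toy most-frequent tags
words = "the can rusted".split()
tags = initial_tagging(words, lexicon)            # ['DT', 'VB', 'VBD']
print(apply_rules(words, tags, RULES))            # ['DT', 'NN', 'VBD']
```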
Slide 29
Why part of speech tagging?
Word classes
Tag sets and problem definition
Automatic approaches 1: rule-based tagging
Automatic approaches 2: stochastic tagging
Automatic approaches 3: transformation-based tagging
Other issues: tagging unknown words, evaluation
Slide 30
• New words are added to (newspaper) language at a rate of 20+ per month
• Plus many proper names …
• Unknown words increase error rates by 1-2%
• Method 1: assume they are nouns
• Method 2: assume the unknown words have a probability distribution similar to words only occurring once in the training set.
• Method 3: Use morphological information, e.g., words ending in -ed tend to be tagged VBN (a sketch of this idea follows below)
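A minimal sketch of Method 3, guessing a tag for an unknown word from its suffix; the suffix table, the capitalization heuristic, and the function name are illustrative assumptions, not a learned model.

```python
# Sketch of guessing tags for unknown words from morphology (Method 3).
# The suffix-to-tag table is a toy assumption, checked in order.
SUFFIX_TAGS = [
    ("ed", "VBN"),    # blarged  -> VBN
    ("ing", "VBG"),   # smorfing -> VBG
    ("ly", "RB"),     # quickly  -> RB
    ("s", "NNS"),     # flibs    -> NNS
]

def guess_unknown_tag(word, default_tag="NN"):
    if word[0].isupper():
        return "NNP"                       # capitalized unknowns: guess proper noun
    for suffix, tag in SUFFIX_TAGS:
        if word.lower().endswith(suffix):
            return tag
    return default_tag                     # fall back to Method 1: assume noun

for w in ["blarged", "Zorblax", "smorfing", "flibs"]:
    print(w, "->", guess_unknown_tag(w))
```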
Slide 31
• The result is compared with a manually coded “Gold Standard” (a minimal accuracy computation is sketched below)
– Typically accuracy reaches 96-97%
– This may be compared with result for a baseline tagger (one that uses no context).
• Important: 100% is impossible even for human annotators.
• Factors that affect performance
– The amount of training data available
– The tag set
– The difference between training corpus and test corpus
– Dictionary
– Unknown words
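A minimal sketch of the accuracy computation against a gold standard; the function and variable names, and the toy tag sequences, are illustrative assumptions.

```python
def tagging_accuracy(predicted_tags, gold_tags):
    """Token-level accuracy: fraction of tags that match the gold standard."""
    assert len(predicted_tags) == len(gold_tags)
    correct = sum(p == g for p, g in zip(predicted_tags, gold_tags))
    return correct / len(gold_tags)

gold = ["VB", "DT", "NN", "."]          # Book that flight .
pred = ["NN", "DT", "NN", "."]          # a baseline that mis-tags "book"
print(f"accuracy = {tagging_accuracy(pred, gold):.2f}")   # accuracy = 0.75
```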
Slide 32