Word-Classes and Part-of-Speech Tagging
Christopher Brewster
University of Sheffield, Computer Science Department
Natural Language Processing Group
C.Brewster@dcs.shef.ac.uk

Lecture Outline
• Definition and Example
• Motivation
• Word-classes
• A Basic Tagging System
• Transformation-Based Tagging
• Tagging Unknown Words

Definition
"the process of assigning a part-of-speech or other lexical class marker to each word in a corpus"
– D. Jurafsky and J.H. Martin, 2000, Speech and Language Processing

An Example
• WORDS → TAGS: each word of "the girl kissed the boy on the cheek" is mapped to one of the tags N, V, P, ART, …
• The girl kissed the boy on the cheek
– lemma: the girl kiss the boy on the cheek
– tag: +DET +NOUN +VPAST +DET +NOUN +PREP +DET +NOUN
from http://www.xrce.xerox.com/research/mltt/toolhome.html

Motivation: the uses of Tagging
• Speech synthesis – pronunciation
• Speech recognition – class-based N-grams
• Information retrieval – stemming
• Word-sense disambiguation
• Corpus analysis of language & lexicography

Word Classes
• Basic word classes: Noun, Verb, Adjective, Adverb, Preposition, …
• Open vs. closed classes
– Closed, e.g.
  determiners: a, an, the
  pronouns: she, he, I, others
  prepositions: on, under, over, near, by, at, from, to, with

Word Classes: Tag sets
• Vary in number of tags: from a dozen to over 200
• The size of a tag set depends on the language, objectives and purpose
– Simple morphology = more ambiguity = fewer tags
– Some tagging approaches (e.g. constraint-grammar based) make fewer distinctions, e.g. conflating adverbs, particles and interjections

Word Classes: Tag set example
CC   coordinating conjunction   and, but, or
CD   cardinal number            one, two, three
DT   determiner                 a, the
EX   existential 'there'        there
FW   foreign word               mea culpa
IN   preposition                of, in, by
JJ   adjective                  yellow
JJR  adjective, comparative     bigger
NN   noun, singular or mass     llama
NNS  noun, plural               llamas
from the Penn treebank part-of-speech tag set.
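As a toy illustration of the definition above, tagging with a Penn-style tag set can be sketched as dictionary lookup over the example sentence. The tiny lexicon here is invented for illustration; real taggers use large lexicons and context, since most words admit more than one tag.

```python
# Toy illustration of part-of-speech tagging as tag assignment.
# The tag dictionary is invented for this example; a real tagger
# would also have to choose among several candidate tags per word.

TAGS = {
    "the": "DT", "girl": "NN", "kissed": "VBD",
    "boy": "NN", "on": "IN", "cheek": "NN",
}

def tag(sentence):
    """Assign each word its dictionary tag (default NN for unknowns)."""
    return [(w, TAGS.get(w.lower(), "NN")) for w in sentence.split()]

print(tag("The girl kissed the boy on the cheek"))
```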
The Problem
• Words often have more than one word class, e.g. this:
– This is a nice day = PR
– This day is nice = ADJ
– You can go this far. = ADV

Word Class Ambiguity (in the Brown Corpus)
• Unambiguous (1 tag): 35,340
• Ambiguous (2–7 tags): 4,100
– 2 tags: 3,760
– 3 tags: 264
– 4 tags: 61
– 5 tags: 12
– 6 tags: 2
– 7 tags: 1 (still)
from DeRose (1988)

A Basic System: the PARTS program
• "PARTS – A System for Assigning Word Classes to English Texts", L. L. Cherry
• Uses a list of function words, and lists of suffixes and auxiliaries, as its key sources of information
• Many combination classes, e.g. noun_adj
• Words that are members of more than 2 classes are initially assigned unk

The PARTS program: input
• A list of function words and irregular verbs with their tags:
  able,adj   every,adj   own,adj   ago,adj_adv
  will,aux   do,auxv    be,be
  and,conj   or,conj    but,conj
  begun,ed   bitten,ed
  outside,prep   up,prep   over,prep   until,prep_adv
• A list of suffixes with the most probable tag for words with that suffix:
  ic,adj    ance,noun   ship,noun   ant,noun_adj
  age,noun  ize,verb    ment,noun   ary,adj
– Suffixes were chosen by hand
– If most words with a suffix have only 1 or 2 tags, that single or combined class is assigned; exceptions are added to an exception list
– The exception list contains many obscure words
• A text

The PARTS program: step 1 – pre-processing
1. Tokenises words and sentences
• word = string of characters separated by blanks or punctuation
• sentence = string of words ending in . ? ! (other punctuation is treated as a comma)
2. Marks capitalised words that do not start a sentence as noun_adj
3. Marks hyphenated words as noun_adj
4. Looks up function words & irregular verbs in the list

The PARTS program: step 2 – suffix analysis
1. Applies to words NOT assigned tags in step 1
2. Looks each word up in the suffix list
3. Still-unassigned words go on to step 3

The PARTS program: step 3 – word class assignment
1. Finds the verb in the sentence (using the auxiliary)
2. Finds nouns
3.
Applies a set of rules of the form verb_adj & ~a => verb, i.e. "if the word has been assigned the class verb_adj and the verb has not been recognised in the sentence, assign verb to it"

The PARTS program: results and example
• 95% correct assignment
• 41.5% of errors arise from noun–adjective confusion
• Example: They act as messengers for the legislators.
– initial tags: pronp unk prep_adv nv_pl prep_adv art nv_pl
– final tags: pron verb prep noun prep art noun

Other Methods: Stochastic Tagging
• Not based on rules, but on the probability of a certain tag occurring given various possibilities
• Requires a TRAINING CORPUS, i.e. a hand-tagged text, from which to derive the probabilities
• Problem: no probabilities for words not in the corpus
• Problem: bad results if the training corpus is very different from the test corpus

Transformation-Based Learning Tagging (Brill Tagging)
• A combination of rule-based AND stochastic tagging methodologies
– Like rule-based tagging, because rules are used to specify tags in a certain environment
– Like the stochastic approach, because machine learning is applied, using a tagged corpus as input
• Input:
– a tagged corpus
– a dictionary (with the most frequent tags)
• Baseline method: choose the most frequent tag in the training text for each word
– Result: 90% accuracy
– Reason: cf. the figures on word class ambiguity, where 90% of words have only one tag
– Therefore: this is a baseline, and any other method must do significantly better
– cf.
HMM tagging (Nick Webb's lecture)

TBL: Rule Application
• Example rule:
– Change NN to VB when the previous tag is TO
– For example, race has the following probabilities in the Brown corpus:
• P(NN|race) = .98
• P(VB|race) = .02
– So … is/VBZ expected/VBN to/TO race/NN tomorrow/NN
becomes … is/VBZ expected/VBN to/TO race/VB tomorrow/NN

TBL: Rule Learning
• 2 parts to a rule:
– a triggering environment
– a rewrite rule
• Rules are learned in an ordered sequence – whichever rule gives the best net improvement at each iteration of the learning algorithm
• Rules may interact, i.e. Rule 1 may make a change which provides the context for Rule 2 to fire
• Rules are compact (a few hundred) and can be inspected by humans (vs. the impossibility of inspecting HMM transition probabilities)

TBL: Rule Learning (2)
• Templates are like underspecified rules:
– Replace tag X with tag Y, provided tag Z or word Z' appears in some position
• The range of triggering environments, or templates (from Manning & Schutze 1999:363), is given by nine schemas, each examining the tags at one or more of the three positions before (ti-3, ti-2, ti-1) and after (ti+1, ti+2, ti+3) the current word ti

TBL: the Algorithm
• Step 1: Label every word with its most likely tag (from the dictionary)
• Step 2: Check every possible transformation & select the one which most improves the tagging (with respect to the hand-tagged corpus)
• Step 3: Re-tag the corpus, applying the rules
• Repeat steps 2–3 until some stopping criterion is reached, e.g. x% correct with respect to the training corpus
• RESULT: an ordered sequence of transformation rules

TBL: Problems
• Execution speed: the TBL tagger is slow compared to the HMM approach
– Solution: compile the rules into a Finite State Transducer (FST)
• Learning speed: Brill's implementation took over a day (600k tokens)

Tagging Unknown Words
• New words are added to (newspaper) language at 20+ per month
• Plus many proper names ….
• Increases error rates by 1–2%
• Method 1: assume they are nouns
• Method 2: assume the unknown words have a probability distribution similar to hapax legomena
• Method 3: use capitalisation, suffixes, etc. This works very well for morphologically complex languages

Further Reading
• Introductory:
– Jurafsky, Daniel & James H. Martin. Speech and Language Processing. Prentice Hall, 2000. Chapter 8, pp. 285–322
– Manning, Christopher & Hinrich Schutze. Foundations of Statistical Natural Language Processing. Chapter 10, pp. 341–380
• Texts:
– Brill, Eric. Transformation-based error-driven learning and natural language processing: A case study in part-of-speech tagging. Computational Linguistics 21:543–565
– Cherry, L. PARTS: a system for assigning word classes to English text. AT&T memorandum, 1978
– Church, K. A stochastic parts program and noun phrase parser for unrestricted text. Second Conference on Applied NLP, Austin, 1988
– Garside, Roger, Geoffrey Sampson and Geoffrey Leech (eds). The Computational Analysis of English: a corpus-based approach. London, 1987
Also check the papers referred to in the Introductory references.
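The transformation-based tagging steps described earlier (initialise every word with its most frequent tag, then apply contextual rewrite rules) can be sketched in Python. The tag dictionary and the single rule follow the race example from the TBL slides, but the code itself is an invented illustration, not Brill's implementation (which learns hundreds of such rules from a tagged corpus):

```python
# Minimal sketch of Brill-style transformation-based tagging,
# assuming a hand-built "most frequent tag" dictionary and one
# invented rewrite rule (Change NN to VB when the previous tag is TO).

MOST_FREQUENT_TAG = {
    "is": "VBZ", "expected": "VBN", "to": "TO",
    "race": "NN",      # NN is far more frequent than VB for "race"
    "tomorrow": "NN",
}

def initial_tags(words):
    """Step 1: label every word with its most likely tag."""
    return [MOST_FREQUENT_TAG.get(w, "NN") for w in words]

def apply_rule(words, tags, from_tag, to_tag, prev_tag):
    """One contextual rewrite: change from_tag to to_tag when the
    previous tag is prev_tag. `words` is passed so word-conditioned
    templates could also be supported."""
    out = list(tags)
    for i in range(1, len(tags)):
        if tags[i] == from_tag and tags[i - 1] == prev_tag:
            out[i] = to_tag
    return out

words = "is expected to race tomorrow".split()
tags = initial_tags(words)                        # race tagged NN
tags = apply_rule(words, tags, "NN", "VB", "TO")  # race re-tagged VB
print(list(zip(words, tags)))
```

Learning would repeatedly score every instantiation of the rule templates against a hand-tagged corpus and keep the best one, as in the algorithm slide.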
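Method 3 for unknown words (capitalisation and suffix clues) can likewise be sketched. The suffix table and defaults below are invented for illustration, loosely following the PARTS suffix list; they are not from any real tagger:

```python
# Sketch of guessing a tag for a word missing from the lexicon,
# using capitalisation and suffix heuristics (Method 3), with
# "assume noun" (Method 1) as the fallback. Suffix table invented.

SUFFIX_TAGS = [("ness", "NN"), ("ment", "NN"), ("ize", "VB"),
               ("ous", "JJ"), ("ly", "RB")]

def guess_tag(word, sentence_initial=False):
    """Guess a tag for an unknown word from surface clues."""
    if word[0].isupper() and not sentence_initial:
        return "NNP"        # mid-sentence capital: likely a proper name
    for suffix, tag in SUFFIX_TAGS:
        if word.lower().endswith(suffix):
            return tag
    return "NN"             # default: assume noun (Method 1)

print(guess_tag("Brewster"))   # NNP
print(guess_tag("modernize"))  # VB
```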