Ch 9 Part of Speech Tagging
(slides adapted from Dan Jurafsky, Jim Martin, Dekang Lin, Rada Mihalcea, Bonnie Dorr, and Mitch Marcus)

Parts of Speech
8 (ish) traditional parts of speech
• Noun, verb, adjective, preposition, adverb, article, interjection, pronoun, conjunction, etc.
• This idea has been around for over 2000 years (Dionysius Thrax of Alexandria, c. 100 B.C.)
• Called: parts-of-speech, lexical category, word classes, morphological classes, lexical tags, POS
• We’ll use POS most frequently

POS examples for English
N     noun         chair, bandwidth, pacing
V     verb         study, debate, munch
ADJ   adj          purple, tall, ridiculous
ADV   adverb       unfortunately, slowly
P     preposition  of, by, to
PRO   pronoun      I, me, mine
DET   determiner   the, a, that, those

Open Class Words
Every known human language has nouns and verbs
Nouns: people, places, things
• Classes of nouns
  - proper vs. common
  - count vs. mass
Verbs: actions and processes
Adjectives: properties, qualities
Adverbs: hodgepodge!
• Unfortunately, John walked home extremely slowly yesterday
Definition: An adverb is a part of speech. It is any word that modifies any other part of language: verbs, adjectives (including numbers), clauses, sentences and other adverbs, except for nouns; modifiers of nouns are primarily determiners and adjectives.

Closed Class Words
Differ more from language to language than open class words
Examples:
• prepositions: on, under, over, …
• particles: up, down, on, off, …
• determiners: a, an, the, …
• pronouns: she, who, I, …
• conjunctions: and, but, or, …
• auxiliary verbs: can, may, should, …
• numerals: one, two, three, third, …

Prepositions from CELEX

Pronouns in CELEX

Conjunctions

Auxiliaries

NLP Task I – Determining Part of Speech Tags
The Problem:
Word    POS listing in Brown Corpus
heat    noun, verb
oil     noun
in      prep, noun, adv
a       det, noun, noun-proper
large   adj, noun, adv
pot     noun

POS Tagging: Definition
The process of assigning a part-of-speech or lexical class marker to each word in a corpus:
WORDS: the koala put the keys on the table
TAGS:  N, V, P, DET

POS Tagging example
WORD    tag
the     DET
koala   N
put     V
the     DET
keys    N
on      P
the     DET
table   N

What is POS tagging good for?
Speech synthesis:
• How to pronounce “lead”?
• INsult / inSULT
• OBject / obJECT
• OVERflow / overFLOW
• DIScount / disCOUNT
• CONtent / conTENT
Stemming for information retrieval
• Knowing a word is a N tells you it gets plurals
• Can search for “aardvarks” and get “aardvark”
Parsing, speech recognition, etc.
• Possessive pronouns (my, your, her) are likely to be followed by nouns
• Personal pronouns (I, you, he) are likely to be followed by verbs

Related Problem in Bioinformatics
Durbin et al., Biological Sequence Analysis, Cambridge University Press.
Several applications, e.g. proteins
• From primary structure: ATCPLELLLD
• Infer secondary structure: HHHBBBBBC..

History (from Yair Halevi, Bar-Ilan U.)
Timeline axis: 1960, 1970, 1980, 1990, 2000
Taggers and reported accuracy:
• Greene and Rubin: Rule Based, 70%
• HMM Tagging (CLAWS): 93%-95%
• DeRose/Church: Efficient HMM, Sparse Data, 95%+
• Transformation Based Tagging (Eric Brill): Rule Based, 95%+
• Tree-Based Statistics (Helmut Schmid): Rule Based, 96%+
• Trigram Tagger (Kempe): 96%+
• Neural Network: 96%+
• Combined Methods: 98%+
Corpora and milestones:
• Brown Corpus Created (EN-US), 1 Million Words
• Brown Corpus Tagged
• LOB Corpus Created (EN-UK), 1 Million Words
• LOB Corpus Tagged
• POS Tagging separated from other NLP
• Penn Treebank Corpus (WSJ, 4.5M)
• British National Corpus (tagged by CLAWS)

British National Corpus
What is it used for?
Ultimately, its use is limited only by our imagination; if you have any need for up to 100 million words of modern British English, you can make use of the British National Corpus. The main uses of the corpus are as follows:
Reference Book Publishing
• Dictionaries, grammar books, teaching materials, usage guides, thesauri. Increasingly, publishers are referring to the use they make of corpus facilities: it's important to know how well their corpora are planned and constructed.
Linguistic Research
• Raw data for studying lexis, syntax, morphology, semantics, discourse analysis, stylistics, sociolinguistics...
Artificial Intelligence
• Extensive data test bed for program development.
Natural language processing
• Taggers, parsers, natural language understanding programs, spell checking word lists...
English Language Teaching
• Syllabus and materials design, classroom reference, independent learner research.

Penn Treebank Tagset

A Simplified Tagset for English
Tagsets for English grew progressively larger after the Brown Corpus, until the Penn Treebank project.
Brown Corpus: 87 tags
LOB Corpus: 135 tags
Lancaster UCREL group: 166 tags
London-Lund Corpus: 197 tags
UPenn Treebank: 34 tags + punctuation

Rationale behind British & European tag sets
To provide “distinct codings for all classes of words having distinct grammatical behaviour” (Garside et al. 1987)
The Lund tagset for adverbs distinguishes between
• Adjunct: Process, Space, Time
• Wh-type: Manner, Reason, Space, Time, Wh-type + 'S
• Conjunct: Appositional, Contrastive, Inferential, Listing, …
• Disjunct: Content, Style
• Postmodifier: “else”
• Negative: “not”
• Discourse Item: Appositional, Expletive, Greeting, Hesitator, …

Reasons for a Smaller Tagset
Many tags are unique to particular lexical items, and can be recovered automatically if desired.
Brown Tags For Verbs
be/BE        have/HV       sing/VB
is/BEZ       has/HVZ       sings/VBZ
was/BED      had/HVD       sang/VBD
being/BEG    having/HVG    singing/VBG
been/BEN     had/HVN       sung/VBN

Penn Treebank Tags For Verbs
be/VB        have/VB       sing/VB
is/VBZ       has/VBZ       sings/VBZ
was/VBD      had/VBD       sang/VBD
being/VBG    having/VBG    singing/VBG
been/VBN     had/VBN       sung/VBN

Task I – Determining Part of Speech Tags
The Problem:
Word    POS listing in Brown
heat    noun, verb
oil     noun
in      prep, noun, adv
a       det, noun, noun-proper
large   adj, noun, adv
pot     noun
The Old Solution: Combinatorial search.
• If each of n words has k tags on average, try the k^n combinations until one works.

NLP Task I – Determining Part of Speech Tags
Machine Learning Solutions: Automatically learn Part of Speech (POS) assignment.
• The best techniques achieve 96-97% accuracy per word on new materials, given large training corpora.

Simple Statistical Approaches: Idea 1

Simple Statistical Approaches: Idea 2
For a string of words W = w1 w2 w3 … wn, find the string of POS tags T = t1 t2 t3 … tn which maximizes P(T|W)
• i.e., the probability of tag string T given that the word string was W
• i.e., that W was tagged T

Again, The Sparse Data Problem …
A Simple, Impossible Approach to Compute P(T|W): Count up instances of the string "heat oil in a large pot" in the training corpus, and pick the most common tag assignment to the string.

A Practical Statistical Tagger

A Practical Statistical Tagger II
But we can't accurately estimate more than tag bigrams or so…
We change to a model that we CAN estimate.

A Practical Statistical Tagger III
So, for a given string W = w1 w2 w3 … wn, the tagger needs to find the string of tags T which maximizes the probability under this model.

Training and Performance
To estimate the parameters of this model, we use counts from an annotated training corpus.
Because many of these counts are small, smoothing is necessary for best results…
Such taggers typically achieve about 95-96% correct tagging, for tag sets of 40-80 tags.
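
A Practical Statistical Tagger: a sketch
The slides' own formulas are not shown in this text, so the following is only a minimal sketch of a tagger of the kind described, under stated assumptions. It assumes the usual bigram HMM factorization, in which a tag string T is scored by the product over i of P(ti | ti-1) · P(wi | ti). Both factors are estimated from counts in a tagged corpus and smoothed with add-alpha smoothing. The names train, viterbi, log_prob, TOY_CORPUS, START, and the value alpha = 0.01 are illustrative choices, not taken from the slides. The Viterbi search finds the best-scoring tag string without enumerating every tag combination.

```python
import math
from collections import Counter, defaultdict

START = "<s>"  # hypothetical pseudo-tag marking the start of a sentence

# Tiny hand-made training set (an assumption for illustration only).
TOY_CORPUS = [
    [("the", "DET"), ("koala", "N"), ("put", "V"), ("the", "DET"),
     ("keys", "N"), ("on", "P"), ("the", "DET"), ("table", "N")],
    [("heat", "V"), ("the", "DET"), ("oil", "N"), ("in", "P"),
     ("a", "DET"), ("large", "ADJ"), ("pot", "N")],
    [("the", "DET"), ("heat", "N"), ("rose", "V")],
]


def train(tagged_sentences):
    """Count tag bigrams and word-given-tag events in a tagged corpus."""
    transitions = defaultdict(Counter)  # transitions[prev_tag][tag]
    emissions = defaultdict(Counter)    # emissions[tag][word]
    vocab = set()
    for sentence in tagged_sentences:
        prev = START
        for word, tag in sentence:
            transitions[prev][tag] += 1
            emissions[tag][word] += 1
            vocab.add(word)
            prev = tag
    return transitions, emissions, sorted(emissions), vocab


def log_prob(counter, event, n_outcomes, alpha=0.01):
    """Add-alpha smoothed log relative frequency of `event`."""
    total = sum(counter.values())
    return math.log((counter[event] + alpha) / (total + alpha * n_outcomes))


def viterbi(words, model):
    """Return the tag sequence maximizing prod P(t_i|t_{i-1}) * P(w_i|t_i)."""
    if not words:
        return []
    transitions, emissions, tags, vocab = model
    n_words = len(vocab) + 1  # one extra outcome reserved for unseen words

    def score(prev, tag, word):
        return (log_prob(transitions[prev], tag, len(tags))
                + log_prob(emissions[tag], word, n_words))

    # best[i][tag] = (log prob of best path ending in `tag` at i, previous tag)
    best = [{t: (score(START, t, words[0]), None) for t in tags}]
    for word in words[1:]:
        best.append({
            t: max((best[-1][p][0] + score(p, t, word), p) for p in tags)
            for t in tags
        })

    # Follow the back-pointers from the best final tag.
    tag = max(tags, key=lambda t: best[-1][t][0])
    path = [tag]
    for column in reversed(best[1:]):
        tag = column[tag][1]
        path.append(tag)
    return list(reversed(path))


if __name__ == "__main__":
    model = train(TOY_CORPUS)
    print(viterbi("the heat rose".split(), model))  # ['DET', 'N', 'V']
    print(viterbi("heat the oil".split(), model))   # ['V', 'DET', 'N']
```

On this toy corpus the word "heat" is tagged N after a determiner and V at the start of a sentence, which is the kind of ambiguity the Brown listing for "heat" illustrates. A real tagger would need a much larger corpus and a better treatment of unknown words than a single reserved outcome.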