Word Classes and POS Tagging
Read J & M Chapter 8. You may also want to look at: http://www.georgetown.edu/faculty/ballc/ling361/tagging_overview.html

Why Do We Care about Parts of Speech?
•Pronunciation: Hand me the lead pipe.
•Predicting what words can be expected next: Personal pronoun (e.g., I, she) ____________
•Stemming: -s means singular for verbs, plural for nouns
•As the basis for syntactic parsing and then meaning extraction: I will lead the group into the lead smelter.
•Machine translation
  • (E) content +N, (F) contenu +N
  • (E) content +Adj, (F) content +Adj or satisfait +Adj

Remember the Mapping Problem
We’ve sort of ignored this issue as we’ve looked at:
•Dealing with a noisy channel
•Probabilistic techniques we can use for various subproblems
•Corpora we can analyze to collect our facts
We need to return to it now. POS tagging is the first step.

Understanding – the Big Picture
Morphology, POS Tagging, Syntax, Semantics, Discourse Integration
Generation goes backwards. For this reason, we generally want declarative representations of the facts. POS tagging is an exception to this.

Two Kinds of Issues
•Linguistic – what are the facts about language?
•Algorithmic – what are effective computational procedures for dealing with those facts?

What is a Part of Speech?
Is this a semantic distinction? For example, maybe Noun is the class of words for people, places and things. Maybe Adjective is the class of words for properties of nouns.
Consider: green book
book is a Noun
green is an Adjective
Now consider:
book worm
This green is very soothing.

Morphological and Syntactic Definition of POS
An Adjective is a word that can fill the blank in: It’s so __________.
A Noun is a word that can be marked as plural.
A Noun is a word that can fill the blank in: the __________ is
What is green?
It’s so green.
Both greens could work for the walls.
The green is a little much given the red rug.

How Many Parts of Speech Are There?
A first cut at the easy distinctions:
Open classes:
•nouns, verbs, adjectives, adverbs
Closed classes: function words
•conjunctions: and, or, but
•pronouns: I, she, him
•prepositions: with, on
•determiners: the, a, an

But It Gets Harder
provided, as in “I’ll go provided John does.”
there, as in “There aren’t any cookies.”
might, as in “I might go.” or “I might could go.”
no, as in “No, I won’t go.”

What’s a Preposition?
From the CELEX online dictionary. Frequencies are from the COBUILD 16 million word corpus.

What’s a Pronoun?
CELEX dictionary list of pronouns:

Tagsets
Brown corpus tagset (87 tags): http://www.scs.leeds.ac.uk/amalgam/tagsets/brown.html
Penn Treebank tagset (45 tags): http://www.cs.colorado.edu/~martin/SLP/Figures/ (8.6)
C7 tagset (146 tags): http://www.comp.lancs.ac.uk/ucrel/claws7tags.html

Algorithms for POS Tagging
Why can’t we just look them up in a dictionary?
•Ambiguity – In the Brown corpus, 11.5% of the word types are ambiguous (using 87 tags). Worse, 40% of the tokens are ambiguous.
•Words that aren’t in the dictionary
http://story.news.yahoo.com/news?tmpl=story&cid=578&ncid=578&e=1&u=/nm/20030922/ts_nm/iraq_usa_dc
  •One idea: P(t_i | w_i) = the probability that a random hapax legomenon in the corpus has tag t_i. Nouns are more likely than verbs, which are more likely than pronouns.
  •Another idea: use morphology.

Algorithms for POS Tagging - Knowledge
•Dictionary
•Morphological rules, e.g.,
  •_____-tion
  •_____-ly
  •capitalization
•N-gram frequencies
  •to _____
  •DET _____ N
  •But what about rare words, e.g., smelt (two verb forms, melt and past tense of smell, and one noun form, a small fish)
•Combining these
  • V _____-ing: I was gracking vs. Gracking is fun.

Algorithms for POS Tagging - Approaches
•Basic approaches
  •Rule-Based
  •Stochastic
•Do we return one best answer or several answers and let later steps decide?
•How does the requisite knowledge get entered?
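
The morphological and N-gram cues listed under "Algorithms for POS Tagging - Knowledge" (-tion, -ly, capitalization, "to _____") can be sketched as a small guesser for words that are missing from the dictionary. This is only an illustration, not a method taken from the slides: the function name `guess_tag`, the coarse tag labels, the order in which the rules fire, and the noun fallback (motivated by the remark that nouns are the most likely class for rare words) are all assumptions made for the example.

```python
def guess_tag(word, prev_word=None, sentence_initial=False):
    """Guess a coarse tag for an out-of-dictionary word.

    The rules and their ordering are illustrative assumptions,
    not a complete or tuned rule set.
    """
    lower = word.lower()
    # Capitalization away from the start of a sentence suggests a proper noun.
    if word[:1].isupper() and not sentence_initial:
        return "PROPN"
    # N-gram cue: "to _____" usually introduces a verb.
    if prev_word is not None and prev_word.lower() == "to":
        return "VERB"
    # Suffix cues: _____-tion and _____-ly.
    if lower.endswith("tion"):
        return "NOUN"
    if lower.endswith("ly"):
        return "ADV"
    # Fallback (assumption): treat any other unknown word as a noun.
    return "NOUN"


if __name__ == "__main__":
    for word, prev in [("nationalization", "the"), ("grackly", "ran"),
                       ("Austin", "in"), ("grack", "to")]:
        print(word, guess_tag(word, prev_word=prev))
```

A guesser like this returns a single answer. A variant that returns several candidate tags would leave the final choice to later steps, which is the trade-off raised under "Approaches".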
Training/Teaching an NLP Component
Each step of NLP analysis requires a module that knows what to do. How do such modules get created?
•By hand
•By training
Advantages of hand creation: based on sound linguistic principles, sensible to people, explainable.
Advantages of training from a corpus: less work, extensible to new languages, customizable for specific domains.

Training/Teaching a POS Tagger
The problem is tractable. We can do a very good job with just:
•a dictionary
•a tagset
•a large corpus, usually tagged by hand
There are only somewhere between 50 and 150 possibilities for each word, and 3 or 4 words of context is almost always enough.
The task:
____ _ __ ______ __ _ _____
What is the weather like in Austin?

Contrast with Training Other NLP Parts
The task:
____ _ __ ______ __ _ _____
What is the weather like in Austin?
The weather in Austin is like what?
[Figure: database schema diagram; labels include RainfallByStation, Months, Stations, Days, station, month, year, rainfall, City]

Rule-Based POS Tagging
Step 1: Using a dictionary, assign to each word a list of possible tags.
Step 2: Figure out what to do about words that are unknown or ambiguous. Two approaches:
•Rules that specify what to do.
•Rules that specify what not to do:
Example: Adverbial “that” rule (from ENGTWOL)
Given input: “that”
If (+1 A/ADV/QUANT) (+2 SENT-LIM) (NOT -1 SVOC/A)
Then eliminate non-ADV tags
Else eliminate ADV
It isn’t that odd vs. I consider that odd vs. I believe that he is right.

Stochastic POS Tagging
First approximation: choose the tag that is most likely for the given word.
Next try: consider N-gram frequencies and choose the tag that is most likely in the current context. Should the context be the last N words or the last N classes?
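
The first approximation above (choose the tag that is most likely for the given word) can be sketched as a most-frequent-tag lookup trained on a hand-tagged corpus. This is a minimal sketch under stated assumptions: the function names, the default tag for unseen words, and the tiny toy corpus with Penn-Treebank-style tags are all made up for illustration and are not data from the slides.

```python
from collections import Counter, defaultdict


def train_most_frequent_tag(tagged_corpus):
    """Map each word to the tag it carries most often in the corpus."""
    counts = defaultdict(Counter)
    for word, tag in tagged_corpus:
        counts[word.lower()][tag] += 1
    return {word: tags.most_common(1)[0][0] for word, tags in counts.items()}


def tag_baseline(words, model, default="NN"):
    """Tag each word independently; unseen words get the default tag (an assumption)."""
    return [(w, model.get(w.lower(), default)) for w in words]


# Toy hand-tagged corpus, invented for this example.
toy_corpus = [
    ("the", "DT"), ("race", "NN"), ("is", "VBZ"), ("over", "RB"),
    ("to", "TO"), ("race", "VB"),
    ("the", "DT"), ("race", "NN"),
]

model = train_most_frequent_tag(toy_corpus)

if __name__ == "__main__":
    print(tag_baseline(["to", "race"], model))
```

Because each word is tagged in isolation, "race" receives its most frequent tag even directly after "to". Errors of this kind are what context-based methods are meant to address.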
Next try: combine the two:
\[ P(t_i \text{ in context} \mid w_i) = \frac{P(w_i \mid t_i \text{ in context}) \, P(t_i \text{ in context})}{P(w_i)} \]
\[ t_i = \arg\max_j P(t_j \mid t_{i-1}) \, P(w_i \mid t_j) \]

Hybrids – the Brill Tagger
Learning rules stochastically: Transformation Based Learning
Step 1: Assign each word the tag that is most likely given no contextual information.
Race example: P(NN|race) = .98, P(VB|race) = .02
Step 2: Apply transformation rules that use the context that was just established.
Race example: Change NN to VB when the previous tag is TO.
Secretariat is expected to race tomorrow.
The race is already over.

Learning Brill Tagger Transformations
Three major stages:
1. Label every word with its most-likely tag.
2. Examine every possible transformation and select the one that most improves the tagging.
3. Retag the data according to this rule.
These three stages are repeated until some stopping point is reached.
The output of TBL is an ordered list of transformations, which constitute a tagging procedure that can be applied to a new corpus.

The Universe of Possible Transformations?

One or Many Answers
Example:
I’m going to water ski.
I’m going to water the lawn.
The architecture issue:
•If we just give one answer, we can follow a single path.
•If we don’t decide yet, we’ll need to manage search.

Search
•Managing search:
  •Depth-first
  •Breadth-first – chart parsing
[Figure: parse tree diagram for “I hit the boy with a bat.”]

Evaluation
•Given an algorithm, how good is it?
•What is causing the errors? Can anything be done about them?

How Good is An Algorithm?
•How good is the algorithm?
•What’s the maximum performance we have any reason to believe is achievable? (How well can people do?)
•How good is good enough? Is 97% good enough?
  •Example 1: A speech dialogue system correctly assigns a meaning to a user’s input 97% of the time.
  •Example 2: An OCR system correctly determines letters 97% of the time.
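
One way to read the contrast between the two 97% examples is that per-item accuracy compounds over a sequence. The sketch below works through that arithmetic under an assumption that is not stated in the slides: errors on different items (letters, or words being tagged) are independent. The function name and the chosen lengths are illustrative only.

```python
def whole_sequence_accuracy(per_item_accuracy, length):
    """Probability that every item in a sequence is correct,
    assuming independent errors (an assumption, not a claim from the slides)."""
    return per_item_accuracy ** length


if __name__ == "__main__":
    # 97% per letter over a 5-letter word: about 0.859 of words fully correct.
    print(round(whole_sequence_accuracy(0.97, 5), 3))
    # 97% per token over a 20-word sentence: roughly 0.54 of sentences fully correct.
    print(round(whole_sequence_accuracy(0.97, 20), 3))
```

Under this assumption, 97% per user input and 97% per letter are very different levels of end-to-end performance, so whether 97% is good enough depends on the unit being scored.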