POS Tagging and Context-Free Grammars
CS 4705

From Words to Syntactic Structures
• Words → morphological structures: cows, cowed, reformulation
• Ngrams → statistical co-occurrences: The man/unicorn said…, The man the…
• POS → word classes: DET N V, DET N DET
• Syntactic constituents → word relationships: S → NP VP, S → S conj S

POS
• Review: words of the same class behave similarly
– Are these words nouns or adjectives?
• a blue seat
• a very blue seat
• this seat is blue
• a child seat
• *a very child seat
• *this seat is child
– What are the word classes?
– How do we identify a word's POS?

How Many Word Classes Are There?
• A basic set:
– Open-class/content words: Noun, Verb, Adjective, Adverb
– Closed-class/function words: Preposition, Pronoun, Determiner, Auxiliary and copular verbs, Particle, Conjunction

Nouns
• Words that describe people, places, things, events, abstractions, activities, …: Hector, mirror, New York, cat, love, government
– Can take the possessive: Hector's, the cat's
– Can (usually) occur as plurals: governments, many New Yorks
– Can occur with determiners: the cat, Homer's Hector
• Subclasses:
– Proper nouns: Hector, New York
– Common nouns: cat, dog, football
• Mass vs. count nouns: not enumerable vs. enumerable (sand vs. cat)

Verbs
• Refer to actions, events, conditions, processes: go, kicked, think, manage, trying
• Tense: when did the action take place?
– Present: I kick (simple), I am kicking (progressive), I have kicked (perfect), I have been kicking (perfect progressive)
– Past: I kicked, I was kicking, I had kicked, I had been kicking
– Future: I will kick, I will be kicking, I will have kicked, I will have been kicking
• Aspect: the nature of the action: simple/indefinite, complete, continuing

Adjectives
• Describe properties or qualities: pretty, red, careful, cat-like, wishful, silly

Adverbs
• Modify verbs or adverbs or …
• Directional or locative: here, upward
• Degree modifiers: very, too
• Manner: slowly
• Temporals: today, now
• Are they adverbs or nouns?

Prepositions and Particles
• Prepositions: indicate spatial or temporal relations: to Boston, from Boston; in, for, with, toward, into, by
• Particles: look like prepositions or adverbs but form semantic units with their verbs
• Test: can you move the prep/particle and what follows to the front of the sentence?
– Prep: We ran up the hill. Up the hill we ran.
– Particle: We ran up the bill. *Up the bill we ran.
• Some particles with their verbs:
– run into (*Into Bill we ran)
– find out (*Out the truth we found)
– turn on (*On the light we turned)
– throw up (*Up his dinner he threw)

Determiners
• Articles: the cat, a cat, an idiot
• Possessive nouns/pronouns: her cat, Sally's cat
• Numbers: five cats
• Indefinite pronouns: each cat, some cats
• Demonstrative pronouns: that cat, those cats

Conjunctions
• Coordinate: and, but
• Subordinate/complementizers: …that the war is over, …because I love you, …unless you change your ways

Pronouns
– Personal: I, he, …
– Possessive: my, his, …
– Indefinite: someone, everyone, anybody, nothing
– Interrogative or wh-: who, whom, …
– And many more…

Auxiliary Verbs
• Indicate features of a main verb, such as tense and aspect: be (copula), have, do, can/will/shall/may (modals)
• He is silent, She has done that, We can help

• And more…
– Interjections/discourse markers
– Existential there: There is a unicorn in the garden
– Greetings, politeness terms

Part-of-Speech Tagging
• It's useful to know the POS of words in a sentence:
– Time/N flies/V like/Prep an/DET arrow/N
– Fruit/N flies/N like/V a/DET banana/N

POS Can Disambiguate
• Some words have only one POS tag: is, Mary, very, smallest
• Others have a single most likely tag: a, dog
• Many are more ambiguous: likes, bass
• But luckily, tags tend to co-occur regularly with other tags (e.g.
DET N is more likely than N DET)
– We can learn POS ngram probabilities P(ti | ti−1) from a tagged corpus, just as we learn word ngram probabilities

Approaches to POS Tagging
• Hand-written rules
• Statistical approaches (e.g., HMM-based taggers)
• Hybrid systems (e.g., Brill's TBL: transformation-based learning)

Brill Tagging: TBL
• Start with simple rules, then learn better ones from a tagged corpus:
– Init: start with a (hand-)tagged corpus and remove the tags from a copy
– Tag each word in the copy with its most likely POS (obtained from the original or another tagged corpus)
– Select the transformation that most improves tagging accuracy (compared to the original)
– Re-tag the whole corpus applying just this transformation, and put it on the list of transformations
– Compare the new tags of the copy to the original
– Again, select the transformation that most improves the accuracy of the (better) tags on the copy compared to the original
– Iterate until performance doesn't improve (no transformation improves tagging accuracy)
• Result: a tagging procedure (a set of transformations) that can be applied to new, untagged text

Transformations
• Change tag a to tag b when…

An Example
Time flies like an arrow.
1) Tag every word with its most likely tag and score:
Time/N flies/V like/V an/DET arrow/N
2) For each template, try every instantiation, apply it to the tagged corpus, and score, e.g.:
Change V to N when the preceding word is tagged V
Time/N flies/V like/N an/DET arrow/N
e.g.
Change V to Prep when the preceding word is tagged V
Time/N flies/V like/Prep an/DET arrow/N
3) Select the transformation rule that most improves the overall accuracy of POS assignments on the training corpus
4) Add the new rule to the list of transformations
5) Iterate from (2) until no transformation improves the score
• Result: an ordered list of transformation rules that can be applied sequentially to new, untagged data (after initializing with the most common tag)

Basic Constituents and Rewrite Rules
• S → NP VP
• NP → DET NOM
• NP → PropN
• NOM → N | NOM N
• DET → a | an | the
• PropN → George | Morocco
• N → cat | box
• VP → V NP
• VP → V
• V → exploded

More Constituents and Rules
• VP → V PP
• PP → Prep NP
• Prep → at | over | under | in | by

How to Write a Grammar
• Scenario: you are a lowly programmer in IT at a major financial institution in NYC. Your boss tells you the department needs to port data from an old database, in which the person-name field was not divided into multiple fields (title, first name, middle name, surname, suffix), to a new, modern database.
• Your task: separate these names into their proper fields for the new database.
• What do you do?

Solutions
• Go through the old database names one at a time and type them into the new db
• Create a script with regular expressions to search for names with different components and write each out into a standard set of fields
• Build an FST to process the names and output field-separated components
• Write a context-free grammar to parse the names into their constituents

A Name Grammar
• Name → Title Firstname Middlename Surname Honorific
• Name → Title Firstname Middlename Surname
• Name → Firstname Middlename Surname Honorific
• Name → Firstname Middlename Surname
• Name → Title Firstname MiddleInitial Surname Honorific
• …

A Better Name Grammar
• Name → Title BaseName Suffix
• Name → Title BaseName
• Name → BaseName Suffix
• BaseName → Firstname Middle Surname
• Middle → Middlename
• Middle → MiddleInitial
• Title → Mr.
| Mrs. | Ms. | Miss | Dr. | Gen. | …
• Suffix → Jr. | Sr. | Esq. | DDS | …
• …

Next Class
• How do we use CFGs for parsing
• Read Chapter 11
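The claim that tag ngram probabilities can be learned just like word ngram probabilities can be made concrete with a short sketch. The toy tagged corpus below is hypothetical, and the estimator is plain maximum likelihood over tag bigrams:

```python
from collections import Counter

# Toy tagged corpus (hypothetical data): each sentence is a list of
# (word, tag) pairs, as in "The/DET man/N said/V".
corpus = [
    [("the", "DET"), ("man", "N"), ("said", "V")],
    [("the", "DET"), ("unicorn", "N"), ("said", "V")],
    [("a", "DET"), ("man", "N"), ("ran", "V")],
]

def tag_bigram_probs(tagged_sentences):
    """Maximum-likelihood estimate of P(t_i | t_{i-1}), with <s> as start tag."""
    bigram_counts = Counter()
    prev_counts = Counter()
    for sent in tagged_sentences:
        tags = ["<s>"] + [tag for _, tag in sent]
        for prev, cur in zip(tags, tags[1:]):
            bigram_counts[(prev, cur)] += 1
            prev_counts[prev] += 1
    return {bg: n / prev_counts[bg[0]] for bg, n in bigram_counts.items()}

probs = tag_bigram_probs(corpus)
print(probs[("DET", "N")])     # 1.0: DET is always followed by N here
print(("N", "DET") in probs)   # False: N DET never occurs in this corpus
```

On this corpus DET is always followed by N, so P(N | DET) = 1.0, while the bigram N DET never occurs: exactly the "DET N more likely than N DET" regularity the slide points to.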
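One round of the TBL loop on the "Time flies like an arrow" example can also be sketched. The most-likely-tag lexicon and the single rule template here are a hypothetical miniature, not Brill's actual tagger:

```python
# One round of transformation-based learning on a single toy sentence.
# The most-likely-tag lexicon and the one rule template are hypothetical.
most_likely = {"time": "N", "flies": "V", "like": "V", "an": "DET", "arrow": "N"}
words = ["time", "flies", "like", "an", "arrow"]
gold = ["N", "V", "Prep", "DET", "N"]  # correct tags for the sentence

def apply_rule(tags, a, b, prev):
    """Template: change tag a to tag b when the preceding word is tagged prev."""
    out = list(tags)
    for i in range(1, len(out)):
        if out[i] == a and out[i - 1] == prev:
            out[i] = b
    return out

def accuracy(tags):
    return sum(t == g for t, g in zip(tags, gold)) / len(gold)

# Step 1: initialize with the most likely tag for each word (one error: like/V).
init = [most_likely[w] for w in words]

# Step 2: try every instantiation of the template, keep the best scorer.
tagset = {"N", "V", "Prep", "DET"}
best = max(
    ((a, b, prev) for a in tagset for b in tagset for prev in tagset if a != b),
    key=lambda rule: accuracy(apply_rule(init, *rule)),
)
print(best)  # ('V', 'Prep', 'V'): change V to Prep after a V
```

The winning instantiation is "change V to Prep when the preceding word is tagged V", which repairs like/V to like/Prep, the same rule chosen in the worked example in the slides.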
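As a preview of parsing, the "better name grammar" can be executed directly. The rules follow the slides; the brute-force recognizer and the small lexicon of first names and surnames are hypothetical additions for illustration:

```python
# The "better name grammar" from the slides, plus a brute-force recognizer.
# Nonterminal names follow the slides; the lexicon entries are hypothetical.
GRAMMAR = {
    "Name": [["Title", "BaseName", "Suffix"], ["Title", "BaseName"],
             ["BaseName", "Suffix"]],
    "BaseName": [["Firstname", "Middle", "Surname"]],
    "Middle": [["Middlename"], ["MiddleInitial"]],
    "Title": [["Mr."], ["Mrs."], ["Ms."], ["Miss"], ["Dr."], ["Gen."]],
    "Suffix": [["Jr."], ["Sr."], ["Esq."], ["DDS"]],
}
LEXICON = {  # hypothetical pre-terminal word lists
    "Firstname": {"George", "Sally"},
    "Middlename": {"Herbert"},
    "MiddleInitial": {"H."},
    "Surname": {"Bush", "Fields"},
}

def parses(symbol, tokens):
    """Return True if `tokens` can be derived from `symbol`."""
    if symbol in LEXICON:
        return len(tokens) == 1 and tokens[0] in LEXICON[symbol]
    if symbol in GRAMMAR:
        return any(matches(rhs, tokens) for rhs in GRAMMAR[symbol])
    return len(tokens) == 1 and tokens[0] == symbol  # a literal like "Dr."

def matches(rhs, tokens):
    """Return True if the symbol sequence rhs can derive tokens."""
    if not rhs:
        return not tokens
    # Let the first symbol consume 1..len(tokens) tokens; recurse on the rest.
    return any(parses(rhs[0], tokens[:i]) and matches(rhs[1:], tokens[i:])
               for i in range(1, len(tokens) + 1))

print(parses("Name", ["Dr.", "George", "H.", "Bush", "Jr."]))  # True
print(parses("Name", ["George", "Dr."]))                        # False
```

Trying every split point like this is exponential in general; how to parse with CFGs efficiently is exactly the topic of the next class.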