Finite-State Methods
600.465 - Intro to NLP - J. Eisner

Finite-state acceptors (FSAs)
Things you may know about FSAs:
- [FSA diagram: arcs labeled a and c; defines the language a? c* = {a, ac, acc, accc, …, ε, c, cc, ccc, …}]
- Equivalence to regexps
- Union, Kleene *, concat, intersect, complement, reversal
- Determinization, minimization
- Pumping, Myhill-Nerode

n-gram models not good enough
- Want to model grammaticality
- A “training” sentence known to be grammatical:
    BOS mouse traps catch mouse traps EOS
- A trigram model must allow these trigrams, so the resulting trigram model has to overgeneralize:
  - it allows sentences with 0 verbs: BOS mouse traps EOS
  - it allows sentences with 2 or more verbs: BOS mouse traps catch mouse traps catch mouse traps catch mouse traps EOS
- The model can’t remember whether it’s in the subject or the object (i.e., whether it’s gotten to the verb yet)

Finite-state models can “get it”
- Want to model grammaticality: BOS mouse traps catch mouse traps EOS
- Finite-state can capture the generalization here: Noun+ Verb Noun+
- [FSA diagram: preverbal states (still need a verb to reach the final state) followed by postverbal states (verbs no longer allowed), with Noun self-loops]
- Allows arbitrarily long NPs (just keep looping around for another Noun modifier), yet never forgets whether it’s preverbal or postverbal (unlike a 50-gram model)

How powerful are regexps / FSAs?
- More powerful than n-gram models
  - The hidden state may “remember” arbitrary past context
  - With k states, can remember which of k “types” of context it’s in
- Equivalent to HMMs
  - In both cases, you observe a sequence and it is “explained” by a hidden path of states; the FSA states are like HMM tags
- Appropriate for phonology and morphology
    Word = Syllable+
         = (Onset Nucleus Coda?)+
         = (C+ V+ C*)+
         = ( (b|d|f|…)+ (a|e|i|o|u)+ (b|d|f|…)* )+
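The Noun+ Verb Noun+ acceptor above is easy to simulate directly. Here is a minimal sketch in Python; the state numbering and transition table are one way to realize the diagram described on the slide, not its exact machine.

```python
# Sketch of the slide's Noun+ Verb Noun+ acceptor.
# States: 0 = start, 1 = preverbal (seen >= 1 Noun),
# 2 = postverbal (seen the Verb), 3 = final (seen >= 1 postverbal Noun).
TRANSITIONS = {
    (0, "Noun"): 1,
    (1, "Noun"): 1,   # loop: arbitrarily long subject NP
    (1, "Verb"): 2,
    (2, "Noun"): 3,
    (3, "Noun"): 3,   # loop: arbitrarily long object NP
}
FINAL = {3}

def accepts(tags):
    state = 0
    for tag in tags:
        state = TRANSITIONS.get((state, tag))
        if state is None:      # no arc: reject immediately
            return False
    return state in FINAL

print(accepts(["Noun", "Noun", "Verb", "Noun", "Noun"]))  # mouse traps catch mouse traps -> True
print(accepts(["Noun", "Noun"]))                          # 0 verbs -> False
print(accepts(["Noun", "Verb", "Noun", "Verb", "Noun"]))  # 2 verbs -> False
```

Unlike a trigram model, the machine always knows whether it is preverbal or postverbal, so the 0-verb and 2-verb sentences are correctly rejected.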
How powerful are regexps / FSAs?
- But less powerful than CFGs / pushdown automata
- Can’t do recursive center-embedding
- Hmm, humans have trouble processing those constructions too …
- Right-branching: finite-state can handle this pattern (can you write the regexp?):
    This is the rat that ate the malt.
    This is the malt that the rat ate.
    This is the cat that bit the rat that ate the malt.
    This is the dog that chased the cat that bit the rat that ate the malt.
- Center-embedded: finite-state cannot handle this pattern, which requires a CFG:
    This is the malt that the rat that the cat bit ate.
    This is the malt that [the rat that [the cat that [the dog chased] bit] ate].

How powerful are regexps / FSAs?
- But less powerful than CFGs / pushdown automata
- More important: less explanatory than CFGs
- A CFG without recursive center-embedding can be converted into an equivalent FSA, but the FSA will usually be far larger, because FSAs can’t reuse the same phrase type in different places
- [diagram: S = Noun+ Verb Noun+ spelled out as an FSA, with the Noun-loop structure duplicated before and after the Verb; more elegant: S = NP Verb NP with NP = Noun+ (using nonterminals like this is equivalent to a CFG)]

Strings vs. String Pairs
- FSA = “finite-state acceptor”: describes a language (which strings are grammatical?)
- FST = “finite-state transducer”: describes a relation (which pairs of strings are related?)
- Examples of related pairs: underlying form / surface form; sentence / translation; original / edited; …

Example: Edit Distance
- [chart: positions 0-5 in the upper string vs. positions 0-4 in the lower string]
- Cost of best path relating these two strings?

Example: Morphology
- VP[head=vouloir, …] dominates V[head=vouloir, …, tense=Present, num=SG, person=P3], realized as veut

Example: Unweighted transducer (slide courtesy of L. Karttunen, modified)
- Upper string: vouloir +Pres +Sing +P3 (canonical form + inflection codes), corresponding to V[head=vouloir, …, tense=Present, num=SG, person=P3]
- Lower string: veut (the inflected form)
- [diagram: the relevant path through the transducer pairs v o u l o i r +Pres +Sing +P3 with v e u t]
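The edit-distance slide asks for the cost of the best path relating two strings. That cost is computed by the standard dynamic program sketched below; the example strings and the unit costs are illustrative assumptions, since the slide shows only the chart axes.

```python
def edit_distance(upper, lower, sub=1, ins=1, dele=1):
    # d[i][j] = cost of the best path relating upper[:i] to lower[:j]
    m, n = len(upper), len(lower)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        d[i][0] = i * dele
    for j in range(1, n + 1):
        d[0][j] = j * ins
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if upper[i - 1] == lower[j - 1] else sub
            d[i][j] = min(d[i - 1][j - 1] + cost,  # copy or substitute
                          d[i - 1][j] + dele,      # delete from upper
                          d[i][j - 1] + ins)       # insert into lower
    return d[m][n]

print(edit_distance("clara", "caca"))  # 2: delete 'l', substitute 'r' -> 'c'
```

Each cell of the chart on the slide corresponds to one entry d[i][j], and the answer in the bottom-right corner is the cost of the best path.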
Example: Unweighted transducer (slide courtesy of L. Karttunen)
- Bidirectional: generation or analysis
- Compact and fast
- Xerox sells transducers for about 20 languages, including English, German, Dutch, French, Italian, Spanish, Portuguese, Finnish, Russian, Turkish, Japanese, …
- Research systems for many other languages, including Arabic, Malay
- [same diagram: the relevant path pairing vouloir +Pres +Sing +P3 with veut]

Regular Relation (of strings)
- Relation: like a function, but multiple outputs ok
- Regular: finite-state
- Transducer: automaton w/ outputs
- Examples: b → {b}; a → {}; aaaaa → {ac, aca, acab, acabc}
- Invertible? Closed under composition?
- [transducer diagram with arcs such as a:ε, b:b, a:a, a:c, ?:c, b:ε, ?:b, ?:a]

Regular Relation (of strings)
- Can weight the arcs: b → {b}; a → {}; aaaaa → {ac, aca, acab, acabc}
- How to find best outputs? For aaaaa? For all inputs at once?

Function from strings to …

                 Acceptors (FSAs)     Transducers (FSTs)
  Unweighted     {false, true}        strings
  Weighted       numbers              (string, num) pairs

- [example machines of each kind, with arcs like c, a, ε (unweighted FSA); c/.7, a/.5, e/.5 (weighted FSA); c:z, a:x, e:y (unweighted FST); c:z/.7, a:x/.5, e:y/.5 (weighted FST)]

Sample functions

                 Acceptors (FSAs)                       Transducers (FSTs)
  Unweighted     Grammatical?                           Markup, Correction, Translation
  Weighted       How grammatical? Better, how likely?   Good markups, Good corrections, Good translations

Terminology (acceptors)
- [diagram relating four notions: a Regexp defines a Regular language and compiles into an FSA; the FSA implements the Regexp and recognizes each String of the language]
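The “bidirectional: generation or analysis” point can be illustrated with a toy stand-in for the lexical transducer: store the relation as an explicit set of (upper, lower) string pairs. A real system compiles the relation into a single FST, but it supports the same two queries; the pouvoir/peut entry is an invented extra example.

```python
# Toy stand-in for a lexical transducer: the regular relation as an
# explicit set of (upper, lower) string pairs. The "pouvoir" entry is
# an invented extra example, not from the slides.
RELATION = {
    ("vouloir+Pres+Sing+P3", "veut"),
    ("pouvoir+Pres+Sing+P3", "peut"),
}

def generate(upper):
    """Generation: canonical form + inflection codes -> surface forms."""
    return {lo for up, lo in RELATION if up == upper}

def analyze(lower):
    """Analysis: surface form -> canonical analyses."""
    return {up for up, lo in RELATION if lo == lower}

print(generate("vouloir+Pres+Sing+P3"))  # {'veut'}
print(analyze("peut"))                   # {'pouvoir+Pres+Sing+P3'}
```

Both directions are just two ways of querying the same set of pairs, which is exactly why one transducer serves for both generation and analysis.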
Terminology (transducers)
- [the analogous diagram: a Regexp defines a Regular relation and compiles into an FST; the FST implements the Regexp and recognizes String pairs (one arrow is left as “?”)]

Perspectives on a Transducer
Remember these CFG perspectives (3 views of a context-free rule):
- generation (production): S → NP VP (randsent)
- parsing (comprehension): S → NP VP (parse)
- verification (checking): S = NP VP
Similarly, 3 views of a transducer:
- Given 0 strings, generate a new string pair (by picking a path)
- Given one string (upper or lower), transduce it to the other kind
- Given two strings (upper & lower), decide whether to accept the pair
- [path diagram: v o u l o i r +Pres +Sing +P3 paired with v e u t]
The FST just defines the regular relation (a mathematical object: a set of pairs). What’s “input” and “output” depends on what one asks about the relation. The 0, 1, or 2 given string(s) constrain which paths you can use.

Functions
- [diagram: f maps ab?d to abcd; g maps abcd to abcd]

Functions
- Function composition: f ∘ g [first f, then g: intuitive notation, but the opposite of the traditional math notation]

From Functions to Relations
- [diagram: f relates ab?d to abcd (cost 3), abed (cost 2), and abjd (cost 6); g then relates abcd to abcd (cost 4), abed to abed (cost 2), and abjd to abd (cost 8); further outputs elided as …]
- Relation composition: f ∘ g relates ab?d to abcd (cost 3+4), abed (cost 2+2), abd (cost 6+8), …
- Pick the min-cost or max-prob output: abed
- Often in NLP, all of the functions or relations involved can be described as finite-state machines, and manipulated using standard algorithms.

Building a lexical transducer (slide courtesy of L. Karttunen, modified)
- Regular Expression Lexicon: big | clear | clever | ear | fat | …
- The lexicon compiles into a Lexicon FSA; Regular Expressions for Rules compile into Rule FSTs; composing the Lexicon FSA with the Rule FSTs yields the Lexical Transducer (a single FST)
- [diagram: the lexicon FSA as a letter trie; one path through the lexical transducer pairs big +Adj +Comp with b i g g e r]
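The composed costs on the slides (3+4, 2+2, 6+8), the min-cost pick, and the inversion law can all be checked with a small sketch that represents a weighted relation as a Python dict, a stand-in for a real weighted FST.

```python
# Sketch: a weighted relation as {(input, output): cost}, using the
# slides' example costs. Composition sums costs along each two-step
# path; inversion just swaps each pair.
f = {("ab?d", "abcd"): 3, ("ab?d", "abed"): 2, ("ab?d", "abjd"): 6}
g = {("abcd", "abcd"): 4, ("abed", "abed"): 2, ("abjd", "abd"): 8}

def compose(f, g):
    h = {}
    for (x, y), c1 in f.items():
        for (y2, z), c2 in g.items():
            if y == y2:
                # keep the cheapest path for each (x, z) pair
                h[(x, z)] = min(h.get((x, z), c1 + c2), c1 + c2)
    return h

def invert(r):
    return {(z, x): c for (x, z), c in r.items()}

def best_output(r, x):
    outs = {z: c for (x2, z), c in r.items() if x2 == x}
    return min(outs, key=outs.get)  # min-cost output

fg = compose(f, g)
print(fg[("ab?d", "abed")])                       # 4, i.e. 2+2
print(best_output(fg, "ab?d"))                    # abed
print(invert(fg) == compose(invert(g), invert(f)))  # True: (f.g)^-1 = g^-1 . f^-1
```

With real FSTs the same operations exist as composition, inversion, and shortest-path algorithms; the dict version only enumerates pairs explicitly.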
Building a lexical transducer (slide courtesy of L. Karttunen, modified)
- [same trie-shaped Lexicon FSA diagram]
- Actually, the lexicon must contain elements like big +Adj +Comp, so write it as a more complicated expression:
      (big | clear | clever | fat | …) +Adj (ε | +Comp | +Sup)    adjectives
    | (ear | father | …) +Noun (+Sing | +Pl)                      nouns
    | …
- Q: Why do we need a lexicon at all?

Inverting Relations
- [diagram: f relates ab?d to abcd (cost 3), abed (cost 2), and abjd (cost 6); g relates abcd to abcd (cost 4), abed to abed (cost 2), and abjd to abd (cost 8)]
- The inverses f⁻¹ and g⁻¹ simply reverse the arrows
- (f ∘ g)⁻¹ = g⁻¹ ∘ f⁻¹, with the same summed costs: 3+4, 2+2, 6+8

Weighted version of transducer: assigns a weight to each string pair (slide courtesy of L. Karttunen, modified)
- Weighted French Transducer
- “Upper language”: être+IndP+SG+P1, suivre+IndP+SG+P1, suivre+IndP+SG+P2, suivre+Imp+SG+P2, payer+IndP+SG+P1
- “Lower language”: suis, paie, paye
- [diagram: weighted arcs (weights 4, 19, 20, 50, 12, 3) pair the upper analyses with the lower surface forms, e.g. suivre+IndP+SG+P1 with suis]

Composition Cascades
- You can build fancy noisy-channel models by composing transducers …
- Examples:
  - Phonological/morphological rewrite rules?
  - English orthography → English phonology → Japanese phonology → Japanese orthography, e.g. [Japanese script lost in extraction] ↔ goruhubaggu
  - Information extraction

FASTUS – Information Extraction (Appelt et al., 1992-?)
Input: Bridgestone Sports Co. said Friday it has set up a joint venture in Taiwan with a local concern and a Japanese trading house to produce golf clubs to be shipped to Japan.
The joint venture, Bridgestone Sports Taiwan Co., capitalized at 20 million new Taiwan dollars, will start production in January 1990 with …
Output:
  Relationship: TIE-UP
  Entities: “Bridgestone Sports Co.”, “A local concern”, “A Japanese trading house”
  Joint Venture Company: “Bridgestone Sports Taiwan Co.”
  Amount: NT$20000000

FASTUS: Successive Markups (details on subsequent slides)
  Tokenization
  .o. Multiwords
  .o. Basic phrases (noun groups, verb groups, …)
  .o. Complex phrases
  .o. Semantic Patterns
  .o. Merging different references

FASTUS: Tokenization
- Spaces, hyphens, etc.
- wouldn’t → would not
- their → them ’s
- company. → company .   (but Co. → Co.)

FASTUS: Multiwords
- “set up”, “joint venture”
- “San Francisco Symphony Orchestra,” “Canadian Opera Company” … use a specialized regexp to match musical groups
- … what kind of regexp would match company names?

FASTUS: Basic phrases
Output looks like this (no nested brackets!):
  … [NG it] [VG had set_up] [NP a joint_venture] [Prep in] …
  Company Name:  Bridgestone Sports Co.
  Verb Group:    said
  Noun Group:    Friday
  Noun Group:    it
  Verb Group:    had set up
  Noun Group:    a joint venture
  Preposition:   in
  Location:      Taiwan
  Preposition:   with
  Noun Group:    a local concern

FASTUS: Noun Groups
- Build an FSA to recognize phrases like:
    approximately 5 kg
    more than 30 people
    the newly elected president
    the largest leftist political force
    a government and commercial project
- Use the FSA for left-to-right longest-match markup
- What does the FSA look like? See next slide …
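Left-to-right longest-match markup of noun groups can be sketched with an ordinary regexp: a scan proceeds left to right, and greedy matching takes the longest match at each starting position. The tiny word lists below are invented stand-ins for the real grammar.

```python
import re

# Crude sketch of left-to-right longest-match noun-group markup.
# The word lists are invented stand-ins for the real FASTUS grammar.
DET  = r"(?:the|a|an|this|many|all)"
ADV  = r"(?:newly|approximately)"
ADJ  = r"(?:elected|largest|leftist|political|local|joint)"
NOUN = r"(?:president|force|venture|concern|people|kg)"
NG   = rf"(?:{DET}\s+)?(?:(?:{ADV}\s+)?{ADJ}\s+)*{NOUN}"

def mark_noun_groups(text):
    # re.sub scans left to right; each greedy match is the longest
    # noun group that starts at that position
    return re.sub(NG, lambda m: "[NG " + m.group(0) + "]", text)

print(mark_noun_groups("the newly elected president met a local concern"))
# [NG the newly elected president] met [NG a local concern]
```

A production system would use word-boundary anchors and part-of-speech classes rather than literal word lists, but the longest-match control regime is the same.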
FASTUS: Noun Groups
Described with a kind of non-recursive CFG (a regexp can include names that stand for other regexps):
  NG → Pronoun | Time-NP | Date-NP
  NG → (Det) (Adjs) HeadNouns
  …
  Adjs → sequence of adjectives, maybe with commas, conjunctions, adverbs
  …
  Det → DetNP | DetNonNP
  DetNP → detailed expression to match “the only five, another three, this, many, hers, all, the most, …”
  …

FASTUS: Semantic patterns
- BusinessRelationship =
    NounGroup(Company/ies) VerbGroup(Set-up) NounGroup(JointVenture) with NounGroup(Company/ies)
    | …
- ProductionActivity =
    VerbGroup(Produce) NounGroup(Product)
- Matching NounGroup(Company/ies), NounGroup(…), etc. is made easy by the processing done at a previous level
- Use this for spotting references to put in the database

Composition Cascades
- You can build fancy noisy-channel models by composing transducers …
- … now let’s turn to how you might build the individual transducers in the cascade.
- We’ll use a variety of operators that combine simpler transducers and acceptors into more complex ones. Composition is just one example.
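The FASTUS semantic patterns above run over the output of the basic-phrase level, which is why they stay finite-state. A sketch: represent chunks as (label, text) pairs and the pattern as a regexp over the label sequence. The labels and the company-extraction logic here are invented for illustration.

```python
import re

# Sketch of a FASTUS-style semantic pattern over basic-phrase output.
# Chunks are (label, text) pairs; the pattern is a regexp over the
# label sequence. The labels are invented for this sketch.
PATTERN = re.compile(r"NG-Company VG-SetUp NG-JV Prep NG-Company")

def find_tie_up(chunks):
    labels = " ".join(label for label, _ in chunks)
    if PATTERN.search(labels):
        # crude: collect every company chunk in the matched sentence
        companies = [text for label, text in chunks if label == "NG-Company"]
        return ("TIE-UP", companies)
    return None

chunks = [
    ("NG-Company", "Bridgestone Sports Co."),
    ("VG-SetUp",   "had set up"),
    ("NG-JV",      "a joint venture"),
    ("Prep",       "with"),
    ("NG-Company", "a local concern"),
]
print(find_tie_up(chunks))
```

Because the previous level has already classified each chunk, the semantic level only inspects a short label sequence, exactly the “made easy by the processing done at a previous level” point on the slide.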