Graphical Models Over String-Valued Random Variables
Jason Eisner, with Ryan Cotterell, Nanyun (Violet) Peng, Nick Andrews, Markus Dreyer, and Michael Paul
ASRU, Dec. 2015

Pronunciation Dictionaries
Probabilistic Inference of Strings
Jason Eisner, with Ryan Cotterell, Nanyun (Violet) Peng, Nick Andrews, Markus Dreyer, and Michael Paul
ASRU, Dec. 2015

[Diagram: many kinds of linguistic variables – semantics, the lexicon (word types), sentences and tokens, speech, text resources, annotation – linked by relationships such as entailment, correlation, inflection, cognates, transliteration, abbreviation, neologism, language evolution, translation, alignment, editing, quotation, discourse context, misspellings/typos, formatting, and entanglement.]
To recover variables, model and exploit their correlations.

Bayesian View of the World
observed data ← probability distribution → hidden data

Bayesian NLP
Some good NLP questions:
• Underlying facts or themes that help explain this document collection?
• An underlying parse tree that helps explain this sentence?
• An underlying meaning that helps explain that parse tree?
• An underlying grammar that helps explain why these sentences are structured as they are?
• An underlying grammar or evolutionary history that helps explain why these words are spelled as they are?

Today’s Challenge
Too many words in a language!

Natural Language is Built from Words
Can store info about each word in a table:

Index   Spelling   Meaning   Pronunciation      Syntax
123     ca         …         [si.ei]            NNP (abbrev)
124     can        …         [kɛɪn]             NN
125     can        …         [kæn], [kɛn], …    MD
126     cane       …         [keɪn]             NN (mass)
127     cane       …         [keɪn]             NN
128     canes      …         [keɪnz]            NNS
(other columns would include translations, topics, counts, embeddings, …)

Problem: Too Many Words!
• Google analyzed 1 trillion words of English text and found > 13M distinct words with count ≥ 200.
• The problem isn’t storing such a big table … it’s acquiring the info for each row separately.
• Need lots of evidence, or help from human speakers – hard to get for every word of the language, and especially hard for complex or “low-resource” languages.
• Omit rare words?
Maybe, but many sentences contain them (Zipf’s Law).

Technically speaking, # words = ∞
Really the set of (possible) words is ∑*:
• Names, neologisms, typos
• Productive processes: friend → friendless → friendlessness → friendlessnessless → … ; hand+bag → handbag (sometimes these can iterate)
• Turkish word: uygarlaştiramadiklarimizdanmişsinizcasina = uygar+laş+tir+ama+dik+lar+imiz+dan+miş+siniz+casina, “(behaving) as if you are among those whom we could not cause to become civilized”

A typical Polish verb (“to contain”)
[Table: roughly 100 inflected forms of the imperfective/perfective pair zawierać / zawrzeć, across the infinitive, present, past, future, conditional, and imperative, plus present active, present passive, past passive, and adverbial participles – e.g., present zawieram/zawieramy vs. perfective future zawrę/zawrzemy.]
100 inflected forms per verb – sort of predictable from one another! (verbs are more or less regular)

Solution: Don’t model every cell separately.
[Figure: a periodic table, with columns such as positive ions and noble gases.]

Can store info about each word in a table (the same table as before: index, spelling, meaning, pronunciation, syntax; other columns would include translations, topics, counts, embeddings, …).

What’s in the table? NLP strings are diverse …
• Use: orthographic (spelling), phonological (pronunciation), latent (intermediate steps not observed directly)
• Size: morphemes (meaningful subword units), words, multi-word phrases including “named entities”, URLs
• Language: English, French, Russian, Hebrew, Chinese, …; related languages (Romance langs, Arabic dialects, …); dead languages (common ancestors) – unobserved?; transliterations into different writing systems
• Medium: misspellings, typos, wordplay, social media

Some relationships within the table
• spelling ↔ pronunciation
• word ↔ noisy word (e.g., with a typo)
• word ↔ related word in another language (loanwords, language evolution, cognates)
• singular ↔ plural; (root, binyan) ↔ word (for example)
• underlying form ↔ surface form

Reconstructing the (multilingual) lexicon
Ultimate goal: Probabilistically reconstruct all missing entries of this infinite multilingual table, given some entries and some text.
Needed: Exploit the relationships (arrows). May have to discover those relationships.
Approach: Linguistics + generative modeling + statistical inference.
Modeling ingredients: finite-state machines, graphical models, CRP.
Inference ingredients: MCMC, BP/EP, DD.
Today’s Focus: Phonology (but the methods also apply to other relationships among strings)

What is Phonology?
Orthography: cat    Phonology: [kæt]    Phonetics: [acoustic signal]
• Phonology explains regular sound patterns
• Not phonetics, which deals with acoustics

Q: What do phonologists do?
A: They find patterns among the pronunciations of words.

A Phonological Exercise
Tenses × Verbs (some cells are unobserved):

Verb     1P Pres. Sg.   3P Pres. Sg.   Past Tense   Past Part.
TALK     [tɔk]          [tɔks]         [tɔkt]       [tɔkt]
THANK    [θeɪŋk]        [θeɪŋks]       [θeɪŋkt]     [θeɪŋkt]
HACK     [hæk]          [hæks]         [hækt]       [hækt]
CRACK    –              [kɹæks]        –            [kɹækt]
SLAP     [slæp]         –              –            [slæpt]

Matrix Completion: Collaborative Filtering
[Figure: a Users × Movies matrix of ratings with some entries missing. Each user and each movie is assigned a latent vector, e.g. [1,-4,3] and [-5,2,1]; the predicted rating is their dot product (-10) plus Gaussian noise (observed -11). The missing cells are then filled in from the vectors: Prediction!]

A Phonological Exercise (as matrix completion)
Each stem (row) and each suffix (column) gets a latent form:
Stems: /tɔk/ /θeɪŋk/ /hæk/ /kɹæk/ /slæp/
Suffixes: 1P Pres. Sg. /Ø/, 3P Pres. Sg. /s/, Past Tense /t/, Past Part. /t/
Filling in the blanks: CRACK → [kɹæk], [kɹækt]; SLAP → [slæps], [slæpt]. Prediction!

Why “talks” sounds like that
tɔk + s → Concatenate → tɔks (“talks”)
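To make the collaborative-filtering analogy concrete, here is a minimal sketch (not code from the talk; the sizes and values are illustrative) of how a missing cell is predicted from latent row and column vectors, with an observation modeled as the prediction plus Gaussian noise. In the phonological version, the latent vectors become latent morphs and the noise becomes a stochastic phonology.

```python
import numpy as np

# Sketch of matrix completion by latent factors: rating ~ user . movie + noise.
rng = np.random.default_rng(0)
n_users, n_movies, dim = 5, 4, 3
user_vecs = rng.normal(size=(n_users, dim))    # latent row (user) vectors
movie_vecs = rng.normal(size=(n_movies, dim))  # latent column (movie) vectors

def predict(u, m):
    """Predicted value for cell (u, m): a dot product of the latent vectors."""
    return float(user_vecs[u] @ movie_vecs[m])

def sample_observation(u, m, noise_sd=1.0):
    """An observed value = prediction + Gaussian noise."""
    return predict(u, m) + rng.normal(scale=noise_sd)

# After fitting the vectors to the observed cells, a missing cell is simply
# filled in with predict(u, m).
print(predict(0, 2), sample_observation(0, 2))
```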
A Phonological Exercise (more verbs)
Adding stems /koʊd/ CODE, /bæt/ BAT, /it/ EAT breaks the simple concatenation pattern:
• CODE: 3P Pres. Sg. is [koʊdz] – z instead of s
• CODE, BAT: Past is [koʊdɪd], [bætɪd] – ɪd instead of t
• EAT: Past is [eɪt] instead of itɪd; Past Part. is [itən]

Why “codes” sounds like that
koʊd + s → Concatenate → koʊd#s → Phonology (stochastic) → koʊdz (“codes”)
Modeling word forms using latent underlying morphs and phonology. Cotterell et al., TACL 2015.

Why “resignation” sounds like that
rizaign + ation → Concatenate → rizaign#ation → Phonology (stochastic) → rεzɪgneɪʃn (“resignation”)

Fragment of Our Graph for English
1) Morphemes: rizaign, s (the 3rd-person singular suffix: very common!), eɪʃən, dæmn
2) Underlying words (Concatenation): rizaign#eɪʃən, rizaign#s, dæmn#eɪʃən, dæmn#s
3) Surface words (Phonology): r,εzɪgn’eɪʃn “resignation”, riz’ajnz “resigns”, d,æmn’eɪʃn “damnation”, d’æmz “damns”

Handling Multimorphemic Words
• Matrix completion: each word built from one stem (row) + one suffix (column). WRONG
• Graphical model: a word can be built from any # of morphemes (parents). RIGHT
  Example: gə + liːb + t → gəliːbt → gəliːpt, “geliebt” (German: loved)
• Limited to concatenation? No – could extend to templatic morphology …

A (Simple) Model of Phonology
[Callback to the collaborative-filtering figure: latent vectors, dot product, Gaussian noise.]
rizaign + s → Concatenate → rizaign#s → Phonology (stochastic) Sθ → rizainz (“resigns”)

Phonology as an Edit Process
[Animation: the upper (underlying) string r i z a i g n # s is read left to right and transduced to the lower (surface) string. Each edit action is chosen conditioned on the upper left context, the upper right context, and the lower left context.]
At each step the model chooses an action with some probability – e.g., at the point where the g of rizaign is about to be deleted (g → ε):

Action    Prob
DEL       .75
COPY      .01
SUB(A)    .05
SUB(B)    .03
...       ...
INS(A)    .02
INS(B)    .01
...       ...
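The table above shows the kind of action distribution the edit model defines at one position. Below is a minimal sketch (the feature names and weights are made up, and the real model’s parameterization is richer) of how such a distribution can be computed as a softmax over actions, with log-linear scores built from the upper-left, upper-right, and lower-left contexts. In the talk’s setting, context-dependent edit probabilities like these become the parameterized arc weights of a finite-state transducer.

```python
import math
from collections import defaultdict

# Sketch of a contextual edit-action distribution (feature names/weights made up).
weights = defaultdict(float)                  # learned parameters theta
weights[("DEL", "upper=g")] = 2.0             # e.g. deleting g is favored here
weights[("COPY", "bias")] = 1.0

def features(action, upper_left, upper_right, lower_left):
    """Hypothetical feature function: conjoin the action with a few context cues."""
    yield (action, "bias")
    yield (action, "upper=" + (upper_right[:1] or "#"))      # symbol being edited
    yield (action, "next=" + (upper_right[1:2] or "#"))      # upper right context
    yield (action, "lower=" + (lower_left[-1:] or "^"))      # lower left context

def action_distribution(actions, upper_left, upper_right, lower_left):
    """Softmax over the candidate edit actions at one position."""
    scores = {a: sum(weights[f] for f in features(a, upper_left, upper_right, lower_left))
              for a in actions}
    z = sum(math.exp(s) for s in scores.values())
    return {a: math.exp(s) / z for a, s in scores.items()}

# Deciding what to do with the g of rizaign#s, having output "rizai" so far:
print(action_distribution(["COPY", "DEL", "SUB(k)", "INS(e)"],
                          upper_left="rizai", upper_right="gn#s", lower_left="rizai"))
```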
[Animation frames: the edit-process picture is annotated with its pieces – the upper string, the transduction, the surface form, the feature function, and the feature weights.]

Phonological Attributes
Binary attributes (+ and -), e.g. ±consonantal, ±voiced, ±high, ±low, ±back.

Phonology as an Edit Process: Features
• Faithfulness features – penalize changing the upper string, e.g. EDIT(g, ε), EDIT(+cons, ε), EDIT(+voiced, ε) for deleting the g.
• Markedness features – penalize dispreferred surface sequences, e.g. BIGRAM(a, i), BIGRAM(-high, -low), BIGRAM(+back, -back).

Inference for Phonology

Bayesian View of the World
observed data ← probability distribution → hidden data
Observed: the surface words r,εzɪgn’eɪʃn, d,æmn’eɪʃn, d’æmz.
Hidden: the rest of the graph – morphemes rizaign, s, eɪʃən, dæmn; underlying words rizaign#eɪʃən, rizaign#s, dæmn#eɪʃən, dæmn#s; surface words r,εzɪgn’eɪʃn, riz’ajnz, d,æmn’eɪʃn, d’æmz.
• observed data: some of the words
• hidden data: the rest of the words! all of the morphs; the parameter vectors θ, φ

Why this matters
• Phonological grammars are usually hand-engineered by phonologists.
• Linguistics goal: Create an automated phonologist?
• Cognitive science goal: Model how babies learn phonology?
• Engineering goal: Analyze and generate words we haven’t heard before?

The Generative Story (defines which iceberg shapes are likely)
1. Sample the parameters φ and θ from priors. These parameters describe the grammar of a new language: what tends to happen in the language.
2. Now choose the lexicon of morphs and words:
   – For each abstract morpheme a ∈ A, sample the morph M(a) ~ Mφ
   – For each abstract word a = a1,a2,···, sample its surface pronunciation S(a) from Sθ(· | u), where u = M(a1)#M(a2)···
3. This lexicon can now be used to communicate. A word’s pronunciation is now just looked up, not sampled; so it is the same each time it is used.

Why Probability?
• A language’s morphology and phonology are fixed, but probability models the learner’s uncertainty about what they are.
• Advantages:
  – Quantification of irregularity (“singed” vs. “sang”)
  – Soft models admit efficient learning and inference
• Our use is orthogonal to the way phonologists currently use probability to explain gradient phenomena.
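As a concrete illustration of the generative story above, here is a toy sketch in which both Mφ and Sθ are stand-ins (a random morph generator and a deletion-only “phonology”); the real model uses weighted finite-state machines for both.

```python
import random

# Toy sketch of the generative story: sample a morph for each abstract morpheme,
# concatenate morphs into an underlying form, then pass it through a stochastic
# phonology to get the surface pronunciation.  M_phi and S_theta are stand-ins.

PHONEMES = list("rizagnsed")

def sample_morph(rng):
    """Toy M_phi: a random short phoneme string (a real model would be a WFSA)."""
    return "".join(rng.choice(PHONEMES) for _ in range(rng.randint(1, 6)))

def sample_surface(underlying, rng, p_del=0.1):
    """Toy S_theta(. | u): drop the boundary symbol and independently delete each
    segment with probability p_del.  (The real model is a contextual edit process.)"""
    return "".join(c for c in underlying if c != "#" and rng.random() > p_del)

rng = random.Random(0)
morph = {a: sample_morph(rng) for a in ["RESIGN", "-S", "-ATION", "DAMN"]}
for word in [("RESIGN", "-S"), ("RESIGN", "-ATION"), ("DAMN", "-S")]:
    u = "#".join(morph[a] for a in word)           # concatenation
    s = sample_surface(u, rng)                      # stochastic phonology
    print(word, "UR:", u, "SR:", s)
```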
Basic Methods for Inference and Learning

Train the Parameters using EM (Dempster et al. 1977)
• E-Step (“inference”):
  – Infer the hidden strings (posterior distribution): given the observed surface words (r,εzɪgn’eɪʃn, d,æmn’eɪʃn, d’æmz), reconstruct a distribution over the morphemes, underlying words, and remaining surface words.
• M-Step (“learning”):
  – Improve the continuous model parameters θ, φ (gradient descent: the E-step provides supervision, e.g. the inferred alignment of rizaign#s to riz’ajnz).
• Repeat till convergence.

Directed Graphical Model (defines the probability of a candidate solution)
Inference step: Find high-probability reconstructions of the hidden variables.
1) Morphemes: rizaign, s, eɪʃən, dæmn
2) Underlying words (Concatenation): rizaign#eɪʃən, rizaign#s, dæmn#eɪʃən, dæmn#s
3) Surface words (Phonology): r,εzɪgn’eɪʃn, riz’ajnz, d,æmn’eɪʃn, d’æmz
A solution has high probability if each string is likely given its parents.

Equivalent Factor Graph (defines the probability of a candidate solution)
Same variables and topology, drawn with factors:
• Each ellipse is a random variable.
• Each square is a “factor” – a function that jointly scores the values of its few neighboring variables.

Dumb Inference by Hill-Climbing
Observed: word SRs r,εzɪgn’eɪʃn, riz’ajnz, riz’ajnd. Hidden: the morpheme URs and word URs.
• Initialize the morpheme URs with arbitrary guesses (foo, bar, s, da), so the word URs are bar#foo, bar#s, bar#da.
• Score the solution: the morph factors give modest probabilities (0.01, 8e-3, 0.05, 0.02), but the phonology factors that must map bar#foo etc. to the observed surface forms give astronomically small ones (6e-1200, 2e-1300, 7e-1100).
• Propose changes to one variable at a time, keeping those that raise the probability: bar → far → size → … → rizajn (the phonology factors rise to 2e-5, 0.01, 0.008); then foo → eɪʃn and da → d (0.001, 0.01, 0.015); then rizajn → rizajgn (0.008, 0.008, 0.013).

Can we make this any smarter?
• This naïve method would be very slow. And it could wander around forever, get stuck in local maxima, etc.
• Alas, the problem of finding the best values in a factor graph is undecidable! (Can’t even solve by brute force, because strings have unbounded length.)
• Exact methods that might not terminate (but do in practice).
• Approximate methods – which try to recover not just the best values, but the posterior distribution of values.
• All our methods are based on finite-state automata.
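Here is a minimal sketch of the hill-climbing procedure just described, over a toy factor graph: the score is a product of factor values, and single-character edits to one latent string at a time are kept only if they increase it. The score function, proposal moves, and the crude phonology factor are all illustrative stand-ins, not the talk’s model.

```python
import random

SIGMA = "abdefgijlmnorsz"

def score(morphs, observed, phonology_prob):
    """Product of factor scores: a toy morph prior times a phonology factor
    for each (underlying word, observed surface word) pair."""
    p = 1.0
    for m in morphs.values():
        p *= 0.5 ** len(m)                        # toy prior: shorter morphs preferred
    for (stem, suf), surface in observed.items():
        underlying = morphs[stem] + "#" + morphs[suf]
        p *= phonology_prob(underlying, surface)  # e.g. a stochastic edit distance
    return p

def hill_climb(morphs, observed, phonology_prob, iters=10000, seed=0):
    rng = random.Random(seed)
    best = score(morphs, observed, phonology_prob)
    for _ in range(iters):
        key = rng.choice(list(morphs))            # pick one latent string
        old = morphs[key]
        i = rng.randrange(len(old) + 1)           # propose a single-character edit
        choice = rng.random()
        if choice < 0.4 and i < len(old):
            new = old[:i] + rng.choice(SIGMA) + old[i + 1:]   # substitute
        elif choice < 0.7:
            new = old[:i] + rng.choice(SIGMA) + old[i:]       # insert
        elif len(old) > 1 and i < len(old):
            new = old[:i] + old[i + 1:]                       # delete
        else:
            continue
        morphs[key] = new
        s = score(morphs, observed, phonology_prob)
        if s > best:
            best = s                              # keep the improvement
        else:
            morphs[key] = old                     # otherwise undo
    return morphs, best

# e.g. with a crude phonology factor based on character overlap:
def crude_phonology(u, s):
    u = u.replace("#", "")
    return 0.5 ** (abs(len(u) - len(s)) + sum(a != b for a, b in zip(u, s)) + 1)

morphs = {"STEM": "foo", "S": "s", "D": "da"}
observed = {("STEM", "S"): "rizajnz", ("STEM", "D"): "rizajnd"}
print(hill_climb(morphs, observed, crude_phonology))
```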
A Generative Model of Phonology
• A directed graphical model of the lexicon (observed leaves: rˌɛzɪgnˈeɪʃən, rizˈajnz, dˈæmz).

About to sell our mathematical soul? Give up lovely dynamic programming?
[Figure: balancing “Insight” against “Big Models”.]
• Not quite! Yes, general algorithms … which call specialized algorithms as subroutines.
• Within a framework such as belief propagation, we may run
  – parsers (Smith & Eisner 2008)
  – finite-state machines (Dreyer & Eisner 2009)
• A step of belief propagation takes time O(k^n) in general
  – to update one message from a factor that coordinates n variables that have k possible values each.
  – If that’s slow, we can sometimes exploit special structure in the factor!
    • large n: a parser uses dynamic programming to coordinate many variables
    • infinite k: FSMs use dynamic programming to coordinate strings

Distribution Over Surface Form
[Figure: a weighted finite-state automaton over phonemes (a, s, r, i, e, u, h, g, n, ε) compactly represents a distribution over the surface form of a word such as “damnation”:]
UR             Prob
dæmeɪʃən       .80
dæmneɪʃən      .10
dæmineɪʃən     .001
dæmiineɪʃən    .0001
…              …
chomsky        .000001
…              …

Experimental Design
Experimental datasets – 7 languages from different families, with the # of observed words per experiment (67, 54, 43, 71 for the exercise languages; 200 to 800 for each CELEX language):
• Maori, Tangale, Indonesian, Catalan: homework exercises – can we generalize correctly from small data?
• English, Dutch, German (CELEX): can we scale up to larger datasets? Can we handle naturally occurring datasets that have more irregularity?

Evaluation Setup
Observe some surface words (r,εzɪgn’eɪʃn, riz’ajnz, d’æmz); infer the morphemes and underlying words; then predict the held-out surface words – did we guess this pronunciation right?
Exploring the Evaluation Metrics
Given the predicted distribution over the surface form (e.g., dæmeɪʃən .80, dæmneɪʃən .10, dæmineɪʃən .001, dæmiineɪʃən .0001, …, chomsky .000001, …):
• 1-best error rate – Is the 1-best correct?
• Cross Entropy – What is the probability of the correct answer?
• Expected Edit Distance – How close am I on average?
• Average over many training-test splits.

Evaluation
• Metrics (lower is always better):
  – 1-best error rate (did we get it right?)
  – cross-entropy (what probability did we give the right answer?)
  – expected edit-distance (how far away on average are we?)
  – Average each metric over many training-test splits.
• Comparisons:
  – Lower Bound: Phonology as noisy concatenation
  – Upper Bound: Oracle URs from linguists

Evaluation Philosophy
• We’re evaluating a language learner, on languages we didn’t examine when designing the learner.
• We directly evaluate how well our learner predicts held-out words that the learner didn’t see.
• No direct evaluation of intermediate steps:
  – Did we get the “right” underlying forms? Did we learn a “simple” or “natural” phonology?
  – It’s hard to judge the answers. Anyway, we only want the answers to be “yes” because we suspect that this will give us a more predictive theory. So let’s just see if the theory is predictive. Proof is in the pudding!
• Caveat: Linguists and child language learners also have access to other kinds of data that we’re not considering yet.

Results (using Loopy Belief Propagation for inference)
[Charts: German results (error bars via bootstrap resampling), CELEX results, phonological-exercise results, gold UR recovery.]

Formalizing Our Setup
Many scoring functions on strings (e.g., our phonology model) can be represented using FSMs.
What class of functions will we allow for the factors (the black squares in the factor graph)?

Real-Valued Functions on Strings
We’ll need to define some nonnegative functions: f(x) = score of a string; f(x,y) = score of a pair of strings.
• Can represent deterministic processes: f(x,y) ∈ {1,0} – is y the observable result of deleting latent material from x?
• Can represent probability distributions, e.g.,
  – f(x) = p(x) under some “generating process”
  – f(x,y) = p(x,y) under some “joint generating process”
  – f(x,y) = p(y | x) under some “transducing process”

Restrict to Finite-State Functions
[Figures: a one-tape acceptor over one string input (Boolean output); a weighted one-tape acceptor (real output, arcs such as c/.7, a/.5, e/.5); two-tape transducers over two string inputs (arcs such as c:z/.7, a:x/.5, e:y/.5).]
Path weight = product of arc weights. Score of input = total weight of accepting paths.

Example: Stochastic Edit Distance p(y|x)
An edit transducer with O(k) deletion arcs, O(k²) substitution arcs, O(k) insertion arcs, and O(k) identity arcs, for an alphabet of size k. Likely edits = high-probability arcs.

Computing p(y|x)
Given (x,y), construct a graph of all accepting paths in the original FSM.
[Figure: the alignment lattice, indexed by position in the upper string (0–4) and position in the lower string (0–5).]
These are different explanations for how x could have been edited to yield y (x-to-y alignments).
Use dynamic programming to find the highest-probability path, or the total probability of all paths.
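A minimal sketch of this dynamic program: summing the probabilities of all edit paths (alignments) that turn x into y under a memoryless stochastic edit model. The probability values below are invented; a real model learns them, and conditions them on context by using more transducer states.

```python
from functools import lru_cache

# Toy stochastic edit model: at each step choose STOP, COPY/SUB, DEL, or INS.
P_COPY, P_SUB, P_DEL, P_INS, P_STOP = 0.80, 0.04, 0.05, 0.01, 0.10
K = 26  # alphabet size, used to spread SUB/INS mass over possible letters

def edit_prob(x, y):
    """Total probability of all alignments (paths) that edit x into y."""
    @lru_cache(maxsize=None)
    def p(i, j):                       # prob of generating y[j:] from x[i:]
        total = 0.0
        if i == len(x) and j == len(y):
            total += P_STOP
        if i < len(x) and j < len(y):  # copy or substitute x[i] -> y[j]
            total += (P_COPY if x[i] == y[j] else P_SUB / (K - 1)) * p(i + 1, j + 1)
        if i < len(x):                 # delete x[i]
            total += P_DEL * p(i + 1, j)
        if j < len(y):                 # insert y[j]
            total += (P_INS / K) * p(i, j + 1)
        return total
    return p(0, 0)

print(edit_prob("rizaign#s", "rizajnz"))   # sums over all x-to-y alignments
```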
Why restrict to finite-state functions?
• Can always compute f(x,y) efficiently: construct the graph of accepting paths by FSM composition, then sum over them via dynamic programming, or by solving a linear system.
• Finite-state functions are closed under useful operations:
  – Marginalization: h(x) = ∑y f(x,y)
  – Pointwise product: h(x) = f(x) ∙ g(x)
  – Join: h(x,y,z) = f(x,y) ∙ g(y,z)

Define a function family
• Use finite-state machines (FSMs). The arc weights are parameterized. We tune the parameters to get weights that predict our training data well.
• The FSM topology defines a function family – in practice, generalizations of stochastic edit distance. So we are learning the edit probabilities. With more states, these can depend on left and right context.

Probabilistic FSTs

Computational Hardness
• Ordinary graphical model inference is sometimes easy and sometimes hard, depending on graph topology.
• But over strings, it can be hard even with simple topologies and simple finite-state factors.
• A simple model family can be NP-hard: the multi-sequence alignment problem.
  – Generalize edit distance to k strings of length O(n).
  – Dynamic programming would seek the best path in a hypercube of size O(n^k).
  – Similar to the Steiner string problem (“consensus”).
• A simple model family can be undecidable (!): Post’s Correspondence Problem (1946).
  – Given a 2-tape FSM f(x,y) of a certain form, is there a string z for which f(z,z) > 0?
  – [Figure: an example FSM f with arcs such as a:a, b:b, e:a, a:b, b:a, e:e, a:e, and a witness string z = bbaabbbaa.]
  – No Turing machine can decide this in general.
  – So no Turing machine can determine in general whether this simple factor graph (over variables z, x, y) has any positive-weight solutions.

Inference by Belief Propagation

Loopy belief propagation (BP)
• The beautiful observation (Dreyer & Eisner 2009): each message is a 1-tape FSM that scores all strings.
• Messages are iteratively updated based on other messages, according to the rules of BP.
• The BP rules require only operations under which FSMs are closed!
• Achilles’ heel: the FSMs may grow large as the algorithm iterates. So the algorithm may be slow, or not converge at all.

Computing Marginal Beliefs
[Figures: the belief at a variable such as X3 is the pointwise product of the incoming message FSAs from its neighbors (X1, X2, X4, X5, X7). Computation of the belief results in a large state space – what a hairball! Approximation required!!!]
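To make the BP rules concrete, here is a toy sketch in which each message is an explicit dictionary over a handful of strings (the factor values and message weights are made up). In the actual algorithm each message is a weighted FSA over all of ∑*, and the sum, product, and join below are carried out by the finite-state operations listed above – which is exactly why the messages can blow up, as the next slides illustrate.

```python
from collections import defaultdict

def factor_to_variable(psi, incoming):
    """BP message from a 2-ary factor psi(x, y) to variable y:
       m(y) = sum_x psi(x, y) * incoming(x)."""
    out = defaultdict(float)
    for (x, y), w in psi.items():
        out[y] += w * incoming.get(x, 0.0)
    return dict(out)

def belief(messages):
    """Belief at a variable = pointwise product of all incoming messages."""
    support = set().union(*(m.keys() for m in messages))
    return {s: prod(m.get(s, 0.0) for m in messages) for s in support}

def prod(ws):
    p = 1.0
    for w in ws:
        p *= w
    return p

# toy factor scoring (underlying, surface) pairs, and a message about the UR
psi = {("rizajn#s", "rizajnz"): 0.9, ("rizajn#s", "rizajns"): 0.1,
       ("rizajgn#s", "rizajnz"): 0.8, ("rizajgn#s", "rizajgnz"): 0.2}
mu_ur = {"rizajn#s": 0.3, "rizajgn#s": 0.7}
m_surface = factor_to_variable(psi, mu_ur)
print(m_surface)
print(belief([m_surface, {"rizajnz": 1.0}]))   # e.g. combine with an observation
```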
BP over String-Valued Variables
• In fact, with a cyclic factor graph, messages and marginal beliefs grow unboundedly complex!
[Figure: a small cyclic factor graph over X1 and X2 with factors ψ1 and ψ2; with each pass around the loop, the messages assign weight to ever-longer strings (ε, a, aa, aaa, …), so they never stop growing.]

Inference by Expectation Propagation

Expectation Propagation (EP)
• Belief at X3 will be simple!
• Messages to and from X3 will be simple!
• Exponential-family approximations inside.

Expectation propagation (EP)
• EP solves the problem by simplifying each message once it is computed: it projects the message back into a tractable family.
• In our setting, we can use n-gram models: f_approx(x) = product of the weights of the n-grams in x.
• Just need to choose weights that give a good approximation. Best to use variable-length n-grams.

Expectation Propagation (EP) in a Nutshell
[Animation: each incoming message FSA at X3 is replaced, once computed, by a small table of n-gram feature weights (e.g., foo 1.2, bar 0.5, baz 4.3), so the belief at X3 stays simple.]

Variable Order Approximations
• Use only the n-grams you really need!

Approximating beliefs with n-grams – how to optimize this?
• Option 1: Greedily add n-grams by expected count in f. Stop when adding the next batch hurts the objective.
• Option 2: Select n-grams from a large set using a convex relaxation + projected gradient (= tree-structured group lasso). Must incrementally expand the large set (“active set” method).

Results using Expectation Propagation
[Graphs: the speed ranking (upper graph) and the accuracy ranking (lower graph) are essentially opposites.]
• Trigram EP (cyan) – slow, very accurate
• Baseline (black, pruning) – slow, very accurate
• Penalized EP (red) – pretty fast, very accurate
• Bigram EP (blue) – fast but inaccurate
• Unigram EP (green) – fast but inaccurate
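Here is a toy sketch of the projection step: a complicated “message”, represented for simplicity as an explicit dictionary over strings, is approximated by a bigram model fit by matching expected bigram counts (a KL projection onto that family). The strings and weights are made up; the real system projects weighted FSAs onto variable-length n-gram families.

```python
from collections import defaultdict

BOS, EOS = "^", "$"

def project_to_bigrams(message):
    """Bigram conditional probabilities approximating the message's distribution."""
    z = sum(message.values())
    counts = defaultdict(float)
    for s, w in message.items():
        p = w / z
        padded = BOS + s + EOS
        for a, b in zip(padded, padded[1:]):
            counts[a, b] += p                 # expected bigram counts
    totals = defaultdict(float)
    for (a, _), c in counts.items():
        totals[a] += c
    return {(a, b): c / totals[a] for (a, b), c in counts.items()}

def approx_score(bigrams, s):
    """f_approx(s) = product of the weights of the n-grams in s."""
    p = 1.0
    padded = BOS + s + EOS
    for a, b in zip(padded, padded[1:]):
        p *= bigrams.get((a, b), 1e-9)
    return p

message = {"rizajnz": 0.6, "rizajns": 0.25, "rizajgnz": 0.15}  # made-up weights
bg = project_to_bigrams(message)
for s in message:
    print(s, round(approx_score(bg, s), 4))
```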
Inference by Dual Decomposition
Exact 1-best inference! (Can’t be guaranteed to terminate in general, because of undecidability – but it does terminate in practice.)

General Idea of Dual Decomposition
• Start from the full graph (morphemes → underlying words → surface words r,εzɪgn’eɪʃn, riz’ajnz, d,æmn’eɪʃn, d’æmz).
• Split it into one subproblem per observed word (Subproblems 1–4), each with its own copies of the shared morphemes.
• The subproblems may disagree, e.g. one prefers the stem rεzɪgn while another prefers rizajn.

Substring Features and Active Set
• To push the copies toward agreement, each subproblem’s objective is adjusted with weights on substring features: one copy is told “less i, a, j; more ε, ɪ, g” while another is told “less ε, ɪ, g; more i, a, j” (to match the others).

Features: “Active set” method
• How many features? Infinitely many possible n-grams!
• Trick: Gradually increase the feature set as needed (like Paul & Eisner 2012; Cotterell & Eisner 2015):
  1. Only add features on which strings disagree.
  2. Only add abcd once abc and bcd already agree.
  – Exception: add unigrams and bigrams for free.

Fragment of Our Graph for Catalan
The four observed surface words gris, grizos, grize, grizes share the stem of “grey”. Redraw the graph to focus on the stem, then separate it into 4 subproblems as before – each gets its own copy of the stem.

Iterations of dual decomposition (the dual variable is a vector of feature weights):
• Iteration 1: stem copies are ε, ε, ε, ε; no nonzero features.
• Iteration 3: g, g, g, g.
• Iteration 4: gris, griz, griz, griz; nonzero features {s, z, is, iz, s$, z$}.
• Iteration 5: gris, griz, grizo, griz; nonzero features now also include {o, zo, o$}.
• Iterations 6–13: the feature weights keep adjusting.
• Iteration 14: griz, griz, grizo, griz.
• Iteration 17: griz, griz, griz, griz.
• Iteration 18: griz, griz, griz, grize; nonzero features now also include {e, ze, e$}.
• Iterations 19–29: the feature weights keep adjusting.
• Iteration 30: griz, griz, griz, griz – Converged!
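The following toy sketch mimics the negotiation above: each subproblem repeatedly picks its preferred copy of the stem under its own score plus dual weights on unigram/bigram count features, and a subgradient step pushes each copy’s feature counts toward the average. The candidate stems, local scores, and step size are invented; the real subproblems optimize over all strings with finite-state machines and grow the feature set with the active-set method.

```python
from collections import Counter, defaultdict

# Toy dual decomposition with n-gram count features (all numbers made up).
def features(s):
    """Unigram + bigram counts, with $ marking the end of the string."""
    padded = s + "$"
    return Counter(s) + Counter(zip(padded, padded[1:]))

def solve_subproblem(candidates, lam):
    """argmax over candidate stems of local score + lambda . features."""
    return max(candidates,
               key=lambda s: candidates[s] + sum(lam[f] * c for f, c in features(s).items()))

# each subproblem's (made-up) local preferences over stem candidates
subproblems = [{"gris": 1.0, "griz": 0.9},          # copy owned by surface "gris"
               {"griz": 1.0, "gris": 0.2},          # ... by "grizos"
               {"griz": 1.0, "grize": 0.7},         # ... by "grize"
               {"griz": 1.0, "grize": 0.5}]         # ... by "grizes"
lams = [defaultdict(float) for _ in subproblems]     # one dual vector per copy
step = 0.1

for it in range(100):
    guesses = [solve_subproblem(c, lam) for c, lam in zip(subproblems, lams)]
    print("iteration", it, guesses)
    if len(set(guesses)) == 1:
        print("converged on", guesses[0])
        break
    counts = [features(g) for g in guesses]
    shared = set().union(*counts)
    for f in shared:                                  # subgradient step on lambda
        mean = sum(c[f] for c in counts) / len(counts)
        for lam, c in zip(lams, counts):
            lam[f] -= step * (c[f] - mean)            # push counts toward the average
```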
Why n-gram features?
• Positional features don’t understand insertion: to reconcile griz with giz, they would try to arrange for “r not i at position 2, i not z at position 3, z not e at position 4.”
• In contrast, our “z” feature counts the number of “z” phonemes, without regard to position. To reconcile griz with giz, we just need more r’s – these solutions already agree on the “g”, “i”, “z” counts, so they’re only negotiating over the “r” count.
• Adjust the weights λ until the “r” counts match: “I need more r’s … somewhere.”
• The next iteration agrees on all our unigram features, but may produce girz instead of griz: “I need more gr, ri, iz, less gi, ir, rz.”
  – Oops! Features matched only counts, not positions.
  – But the bigram counts are still wrong … so bigram features get activated to save the day.
  – If that’s not enough, add even longer substrings …

Results using Dual Decomposition
7 inference problems (graphs):
• EXERCISE (small): 4 languages (Catalan, English, Maori, Tangale); 16 to 55 underlying morphemes; 55 to 106 surface words.
• CELEX (large): 3 languages (English, German, Dutch); 341 to 381 underlying morphemes; 1000 surface words for each language.

Experimental Questions
• Is exact inference by DD practical? Does it converge?
• Does it get better results than approximate inference methods?
• Does exact inference help EM?

primal (function of strings x) ≤ dual (function of weights λ)
• DD seeks the best λ via a subgradient algorithm: reducing the dual objective tightens the upper bound on the primal objective.
• If λ gets all sub-problems to agree (x1 = … = xK), the constraints are satisfied, and the dual value is also the value of a primal solution – which must be the max primal! (and the min dual)

Convergence behavior (full graph)
[Graphs for Catalan, Maori, English, Tangale: the dual objective (tightening the upper bound) and the primal objective (improving the strings) approach each other.]

Comparisons
Compare DD with two types of Belief Propagation (BP) inference:

               MAP inference                       Marginal inference
Approximate    max-product BP (baseline)           sum-product BP (TACL 2015)
Exact          dual decomposition (this paper)     we don’t know how!

(Sum-product BP is a variational approximation, and MAP is a Viterbi approximation, to exact marginal inference.)

Inference accuracy
Model 1 – trivial phonology; Model 2S – oracle phonology; Model 2E – learned phonology (inference used within EM).

Method                                            Model 1, EXERCISE   Model 1, CELEX   Model 2S, CELEX   Model 2E, EXERCISE
Approximate MAP (max-product BP, baseline)        90%                 84%              99%               91%
Approximate marginal (sum-product BP, TACL 2015)  95%                 86%              96% (worse)       95%
Exact MAP (dual decomposition, this paper)        97%                 90%              99%               98%

Conclusion
• A general DD algorithm for MAP inference on graphical models over strings.
• On the phonology problem, it terminates in practice, guaranteeing the exact MAP solution.
• Improved inference for the supervised model; improved EM training for the unsupervised model.
• Try it for your own problems generalizing to new strings!

Future Work
observed data ← probability distribution → hidden data

Future: Which words are related?
• So far, we were told that “resignation” shares morphemes with “resigns” and “damnation.”
• We’d like to figure that out from raw text: related spellings + related contexts → shared morphemes?

Linguistics quiz: Find a morpheme
Blah blah blah snozzcumber blah blah blah. Blah blah blahdy abla blah blah. Snozzcumbers blah blah blah abla blah. Blah blah blah snezzcumbri blah blah snozzcumber.
• Dreyer & Eisner 2011 – “select & mutate”: many possible morphological slots.
• Andrews, Dredze, & Eisner 2014 – “select & mutate”: many possible phylogenies.
• NEW – noisy name variants in social media, e.g.: “Did [Taylor swift] just dis harry sytles”; “Lets see how bad [T Swift] will be. #grammys”; “it’s clear that [T-Swizzle] is on drugs”; “[Taylor swift] is apart of the Illuminati”; “Ladies STILL love [LL Cool James].”; “[LL Cool J] is looking realllll chizzled!”

Future: Which words are related?
• So far, we were told that “resignation” shares morphemes with “resigns” and “damnation.” We’d like to get that from raw text.
• To infer the abstract morphemes from context, we need to extend our generative story to capture regularity in morpheme sequences. Neural language models … but we must deal with an unknown, unbounded vocabulary of morphemes.

Reconstructing the (multilingual) lexicon
(The same table of word entries as before: index, spelling, meaning, pronunciation, syntax; other columns would include translations, topics, counts, embeddings, …)

Conclusions
• Unsupervised learning of how all the words in a language (or across languages) are interrelated. This is what kids and linguists do.
• Given data, estimate a posterior distribution over the infinite probabilistic lexicon, while training parameters that model how lexical entries are related (language-specific derivational processes or soft constraints).
• Starting to look feasible! We now have a lot of the ingredients – generative models and algorithms.