Factored Language Models
EE517 Presentation, April 19, 2005
Kevin Duh (duh@ee.washington.edu)

Outline
1. Motivation
2. Factored Word Representation
3. Generalized Parallel Backoff
4. Model Selection Problem
5. Applications
6. Tools

Word-based Language Models
• Standard word-based language models:

  p(w_1, \dots, w_T) = \prod_{t=1}^{T} p(w_t \mid w_1, \dots, w_{t-1}) \approx \prod_{t=1}^{T} p(w_t \mid w_{t-1}, w_{t-2})

• How do we get robust n-gram estimates of p(w_t \mid w_{t-1}, w_{t-2})?
  • Smoothing, e.g. Kneser-Ney, Good-Turing
  • Class-based language models:

  p(w_t \mid w_{t-1}) \approx p(w_t \mid C(w_t)) \, p(C(w_t) \mid C(w_{t-1}))

Limitation of Word-based Language Models
• Words are inseparable whole units: e.g. "book" and "books" are distinct vocabulary items.
• This is especially problematic in morphologically-rich languages (e.g. Arabic, Finnish, Russian, Turkish), which suffer many unseen word contexts, a high out-of-vocabulary rate, and high perplexity.
• Example: words sharing the Arabic root k-t-b:

  Kitaab        A book
  Kitaab-iy     My book
  Kitaabu-hum   Their book
  Kutub         Books

Arabic Morphology
• A word combines a pattern, a root, affixes, and particles. E.g. fa-sakan-tu (particle fa- + root sakan + affix -tu) glosses as LIVE + past + 1st-sg-past + particle: "so I lived".
• Scale: ~5000 roots, several hundred patterns, dozens of affixes.

Vocabulary Growth - full word forms
[Figure: vocabulary size (0 to 16,000) vs. number of word tokens (0 to 120K) for English and Arabic CallHome full word forms.]
Source: K. Kirchhoff et al., "Novel Approaches to Arabic Speech Recognition: Final Report from the JHU Summer Workshop 2002", JHU Tech Report, 2002.

Vocabulary Growth - stemmed words
[Figure: vocabulary size vs. number of word tokens for English and Arabic CallHome, comparing full word forms (EN words, AR words) with stems (EN stems, AR stems).]
Source: same as above.

Solution: Words as Factors
• Decompose words into "factors" (e.g. stems) and build the language model over factors: P(w | factors).
• Two approaches to decomposition:
  • Linear: prefix + stem + suffix [e.g. Geutner, 1995]
  • Parallel: each word position carries parallel factor streams [Kirchhoff et al., JHU Workshop 2002], [Bilmes & Kirchhoff, NAACL/HLT 2003]
[Figure: graphical model over parallel streams M_{t-2}, M_{t-1}, M_t; S_{t-2}, S_{t-1}, S_t; W_{t-2}, W_{t-1}, W_t.]

Factored Word Representations
• A word is a bundle of K factors:

  w \equiv \{f^1, f^2, \dots, f^K\} \equiv f^{1:K}

  p(w_1, \dots, w_T) = p(f_1^{1:K}, \dots, f_T^{1:K}) \approx \prod_{t=1}^{T} p(f_t^{1:K} \mid f_{t-1}^{1:K}, f_{t-2}^{1:K})

• Factors may be any word feature. Here we use morphological features: POS, stem, root, pattern, etc.
• E.g. P(w_t \mid w_{t-1}, w_{t-2}, s_{t-1}, s_{t-2}, m_{t-1}, m_{t-2})

Advantage of Factored Word Representations
• Main advantage: allows robust estimation of the probabilities p(f_t^{1:K} \mid f_{t-1}^{1:K}, f_{t-2}^{1:K}) using backoff.
• Word combinations in context may not be observed in the training data, but factor combinations are.
• Simultaneous class assignment:

  Word                       word          stem     root  tag
  Kitaab-iy (My book)        kitaab-iy     kitaab   ktb   noun+poss
  Kitaabu-hum (Their book)   kitaabu-hum   kitaabu  ktb   noun+poss
  Kutub (Books)              kutub         kutub    ktb   noun (pl.)
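To make the representation concrete, here is a minimal Python sketch of words as factor bundles, using the three forms from the table above. The FACTORS dictionary is a hand-written stand-in for a real morphological analyzer; the factor names simply mirror the table's columns.

  # Toy stand-in for a morphological analyzer: each surface word maps to
  # its factor bundle f^{1:K}. Entries are copied from the table above.
  FACTORS = {
      "kitaab-iy":   {"word": "kitaab-iy",   "stem": "kitaab",  "root": "ktb", "tag": "noun+poss"},
      "kitaabu-hum": {"word": "kitaabu-hum", "stem": "kitaabu", "root": "ktb", "tag": "noun+poss"},
      "kutub":       {"word": "kutub",       "stem": "kutub",   "root": "ktb", "tag": "noun (pl.)"},
  }

  def factored(sentence):
      """Map a word sequence w_1..w_T to the bundles f_1^{1:K}..f_T^{1:K}."""
      return [FACTORS[w] for w in sentence]

  # All three surface forms share the root 'ktb', so a factored LM can share
  # statistics across them even when the surface n-grams differ:
  for f in factored(["kutub", "kitaab-iy", "kitaabu-hum"]):
      print(f["word"], "->", f["root"], f["tag"])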
Example
• Training sentence: "lAzim tiqra kutubiy bi sorca" (You have to read my books quickly)
• Test sentence: "lAzim tiqra kitAbiy bi sorca" (You have to read my book quickly)
• Counts in the training data:

  Count(tiqra, kitAbiy, bi) = 0
  Count(tiqra, kutubiy, bi) > 0
  Count(tiqra, ktb, bi) > 0

• P(bi | kitAbiy, tiqra) can back off to P(bi | ktb, tiqra) to obtain a more robust estimate; this is better than P(bi | <unknown>, tiqra).

Language Model Backoff
• When an n-gram count is low, use the (n-1)-gram estimate. This ensures more robust parameter estimation under sparse data.
• Word-based LM: a single backoff path that drops the most distant word at each step:

  P(W_t | W_{t-1} W_{t-2} W_{t-3}) -> P(W_t | W_{t-1} W_{t-2}) -> P(W_t | W_{t-1}) -> P(W_t)

• Factored LM: a backoff graph, so multiple backoff paths are possible:

  F | F1 F2 F3
  F | F1 F2    F | F2 F3    F | F1 F3
  F | F1       F | F2       F | F3
  F

Choosing Backoff Paths
• Four methods for choosing a backoff path through the graph:
  1. Fixed path, chosen a priori
  2. Choose a path dynamically during training
  3. Choose multiple paths dynamically during training and combine the results (Generalized Parallel Backoff)
  4. Constrained versions of (2) or (3)

Generalized Backoff
• Katz backoff (writing d_N for the discount d_{N(w_t, w_{t-1}, w_{t-2})}):

  P_{BO}(w_t \mid w_{t-1}, w_{t-2}) =
    \begin{cases}
      d_N \, \frac{N(w_t, w_{t-1}, w_{t-2})}{N(w_{t-1}, w_{t-2})} & \text{if } N(w_t, w_{t-1}, w_{t-2}) > 0 \\
      \alpha(w_{t-1}, w_{t-2}) \, P_{BO}(w_t \mid w_{t-1}) & \text{otherwise}
    \end{cases}

• Generalized backoff over parent factors f_{P_1}, f_{P_2}:

  P_{BO}(f \mid f_{P_1}, f_{P_2}) =
    \begin{cases}
      d_N \, \frac{N(f, f_{P_1}, f_{P_2})}{N(f_{P_1}, f_{P_2})} & \text{if } N(f, f_{P_1}, f_{P_2}) > 0 \\
      \alpha(f_{P_1}, f_{P_2}) \, g(f, f_{P_1}, f_{P_2}) & \text{otherwise}
    \end{cases}

• g() can be any positive function, but some choices of g() make the backoff weight computation difficult:

  \alpha(f_{P_1}, f_{P_2}) =
    \frac{1 - \sum_{f : N(f, f_{P_1}, f_{P_2}) > 0} d_N \, \frac{N(f, f_{P_1}, f_{P_2})}{N(f_{P_1}, f_{P_2})}}
         {\sum_{f : N(f, f_{P_1}, f_{P_2}) = 0} g(f, f_{P_1}, f_{P_2})}

g() functions
• A priori fixed path: g(f, f_{P_1}, f_{P_2}) = P_{BO}(f \mid f_{P_1})
• Dynamic path, max counts: g(f, f_{P_1}, f_{P_2}) = P_{BO}(f \mid f_{P_{j^*}}) with j^* = \arg\max_j N(f, f_{P_j}). Based on raw counts, so it favors robust estimation.
• Dynamic path, max normalized counts: the same g with j^* = \arg\max_j N(f, f_{P_j}) / N(f_{P_j}). Based on maximum likelihood, so it favors statistical predictability.

Dynamically Choosing Backoff Paths During Training
• Choose the backoff path based on g() and the statistics of the data.
[Figure: backoff graph from W_t | W_{t-1} S_{t-1} T_{t-1} through the two-parent and one-parent nodes down to W_t.]

Multiple Backoff Paths: Generalized Parallel Backoff
• Choose multiple paths during training and combine the probability estimates, e.g.:

  p_{BO}(w_t \mid w_{t-1}, s_{t-1}, t_{t-1}) =
    \begin{cases}
      d_c \, p_{ML}(w_t \mid w_{t-1}, s_{t-1}, t_{t-1}) & \text{if count} \geq \text{threshold} \\
      \frac{1}{2} \left[ p_{BO}(w_t \mid w_{t-1}, s_{t-1}) + p_{BO}(w_t \mid w_{t-1}, t_{t-1}) \right] & \text{otherwise}
    \end{cases}

• Options for combination: average, sum, product, geometric mean, weighted mean.
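To make the recursion concrete, here is a small Python sketch of generalized parallel backoff with a constant discount and mean combination (one of the options just listed). The alpha normalization from the generalized-backoff formula is omitted for brevity, so this shows the control flow rather than a properly normalized model; the discount, threshold, and factor names are all illustrative assumptions.

  from collections import Counter
  from itertools import combinations

  DISCOUNT, THRESHOLD = 0.8, 1   # illustrative d_c and count threshold

  joint = Counter()  # (target, context) -> count; context = sorted factor tuple
  ctx = Counter()    # context -> count

  def observe(target, context):
      """Count the target with every sub-combination of its context factors,
      so every node in the backoff graph has statistics available."""
      items = tuple(sorted(context))
      for r in range(len(items) + 1):
          for sub in combinations(items, r):
              joint[(target, sub)] += 1
              ctx[sub] += 1

  def p_gpb(target, context):
      """Discounted ML estimate if the full context was seen often enough;
      otherwise average the estimates from dropping each factor in turn."""
      items = tuple(sorted(context))
      n = joint[(target, items)]
      if n >= THRESHOLD:
          return DISCOUNT * n / ctx[items]
      if not items:
          return 1e-9   # tiny floor for unseen unigrams, purely illustrative
      children = [items[:i] + items[i + 1:] for i in range(len(items))]
      return sum(p_gpb(target, c) for c in children) / len(children)

  # Toy usage with word (W), stem (S), and tag (T) factors:
  observe("bi", {"W:tiqra", "S:ktb", "T:verb"})
  print(p_gpb("bi", {"W:tiqra", "S:ktb"}))   # seen context: discounted ML
  print(p_gpb("bi", {"W:sorca", "S:ktb"}))   # unseen context: parallel backoff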
Summary: Factored Language Models
FACTORED LANGUAGE MODEL = Factored Word Representation + Generalized Backoff
• Factored word representation: allows a rich feature-set representation of words.
• Generalized (parallel) backoff: enables robust estimation of models with many conditioning variables.

Model Selection Problem
• For n-grams the choice is, e.g., bigram vs. trigram vs. 4-gram: a relatively easy search; just try each and note the perplexity on a development set.
• For a Factored LM we must choose:
  • Initial conditioning factors
  • Backoff graph
  • Smoothing options
• Too many options: an automatic search is needed.
• Tradeoff: the Factored LM is more general, but it is harder to select a good model that fits the data well.

Example: a Factored LM
• The initial conditioning factors, the backoff graph, and the smoothing parameters completely specify a Factored Language Model.
• E.g., with 3 factors in total:
  1. Begin with the full backoff graph structure for 3 factors.
  2. The initial conditioning factors specify the start node.
  3. Take the subgraph rooted at that new start node.
  4. Specify the backoff graph, i.e. which backoff to use at each node.
  5. Specify the smoothing for each edge.
[Figure: the full graph from W_t | W_{t-1} S_{t-1} T_{t-1} down to W_t, then the subgraph rooted at the start node W_t | W_{t-1} S_{t-1}.]

Applications for Factored LM
• Modeling of Arabic, Turkish, Finnish, German, and other morphologically-rich languages:
  [Kirchhoff et al., JHU Summer Workshop 2002], [Duh & Kirchhoff, Coling 2004], [Vergyri et al., ICSLP 2004]
• Modeling of conversational speech: [Ji & Bilmes, HLT 2004]
• Applied in speech recognition and machine translation.
• The general Factored LM tools can also produce smoothed conditional probability tables for applications outside language modeling (e.g. tagging).
• More possibilities: factors can be anything!

To explore further…
• Factored Language Models are now part of the standard SRI Language Modeling Toolkit distribution (v1.4.1), thanks to Jeff Bilmes (UW) and Andreas Stolcke (SRI).
• Downloadable at: http://www.speech.sri.com/projects/srilm/

fngram Tools

  fngram-count -factor-file my.flmspec -text train.txt
  fngram -factor-file my.flmspec -ppl test.txt

• train.txt contains the sentence "Factored LM is fun", written as factor bundles:

  W-Factored:P-adj W-LM:P-noun W-is:P-verb W-fun:P-adj

• my.flmspec, specifying a model for W with two parents (W(-1), P(-1)) and three backoff-graph node specifications:

  W: 2 W(-1) P(-1) my.count my.lm 3
    W1,P1  W1  kndiscount gtmin 1 interpolate
    P1     P1  kndiscount gtmin 1
    0      0   kndiscount gtmin 1

Turkish Language Model
• Newspaper text from the web [Hakkani-Tür, 2000]
• Train: 400K tokens / Dev: 100K / Test: 90K
• Factors from a morphological analyzer, e.g. for the word "yararmanlak":

  word: yararmanlak | root: yarar | part-of-speech: Noun | number: singular (A3sg) | case: Nom | other: Pnon | inflection groups: Noun+A3sg+Pnon+Nom, Verb+Acquire+Pos

Turkish: Dev Set Perplexity

  Ngram | Word-based LM | Hand FLM | Random FLM | Genetic FLM | ppl (%)
  2     | 593.8         | 555.0    | 556.4      | 539.2       | -2.9
  3     | 534.9         | 533.5    | 497.1      | 444.5       | -10.6
  4     | 534.8         | 549.7    | 566.5      | 522.2       | -5.0

• The Factored Language Models found by genetic algorithms perform best.
• The poor performance of the higher-order hand-designed FLMs reflects the difficulty of manual search.

Turkish: Eval Set Perplexity

  Ngram | Word-based LM | Hand FLM | Random FLM | Genetic FLM | ppl (%)
  2     | 609.8         | 558.7    | 525.5      | 487.8       | -7.2
  3     | 545.4         | 583.5    | 509.8      | 452.7       | -11.2
  4     | 543.9         | 559.8    | 574.6      | 527.6       | -5.8

• The dev set results generalize to the eval set, i.e. the genetic algorithm did not overfit.
• The best models used Word, POS, Case, and Root factors with parallel backoff.
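Since the genetic-algorithm FLMs perform best here, a toy Python sketch of how such a structure search might be organized follows. The genome encoding (one bit per candidate conditioning factor), the operators, and the dummy evaluate() are illustrative assumptions, not the actual setup of [Duh & Kirchhoff, Coling 2004]; in a real run, evaluate() would train the encoded FLM (e.g. via fngram-count) and return its dev-set perplexity.

  import random

  FACTOR_POOL = ["W-1", "S-1", "T-1", "R-1", "W-2", "S-2"]  # hypothetical candidates

  def evaluate(genome):
      """Stand-in fitness: should be the dev-set perplexity of the FLM the
      genome encodes. Here: a dummy score so the sketch runs end to end."""
      return random.uniform(400, 600) - 10 * sum(genome)

  def search(pop_size=20, generations=30, p_mutate=0.1):
      pop = [[random.randint(0, 1) for _ in FACTOR_POOL] for _ in range(pop_size)]
      for _ in range(generations):
          scored = sorted(pop, key=evaluate)            # lower perplexity = fitter
          parents = scored[: pop_size // 2]
          children = []
          while len(children) < pop_size - len(parents):
              a, b = random.sample(parents, 2)
              cut = random.randrange(1, len(FACTOR_POOL))   # one-point crossover
              child = [g ^ (random.random() < p_mutate)     # bit-flip mutation
                       for g in a[:cut] + b[cut:]]
              children.append(child)
          pop = parents + children
      best = min(pop, key=evaluate)
      return [f for f, g in zip(FACTOR_POOL, best) if g]

  print("selected conditioning factors:", search())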
Arabic Language Model
• LDC CallHome Conversational Egyptian Arabic speech transcripts
• Train: 170K words / Dev: 23K / Test: 18K
• Factors from a morphological analyzer [LDC, 1996], [Darwish, 2002], e.g. for the word "Il+dOr":

  word: Il+dOr | root: dwr | morphological tag: Noun+masc-sg+article | stem: dOr | pattern: CCC

Arabic: Dev Set and Eval Set Perplexity

Dev set perplexities:

  Ngram | Word-based LM | Hand FLM | Random FLM | Genetic FLM | ppl (%)
  2     | 229.9         | 229.6    | 229.9      | 222.9       | -2.9
  3     | 229.3         | 226.1    | 230.3      | 212.6       | -6.0

Eval set perplexities:

  Ngram | Word-based LM | Hand FLM | Random FLM | Genetic FLM | ppl (%)
  2     | 249.9         | 230.1    | 239.2      | 223.6       | -2.8
  3     | 285.4         | 217.1    | 224.3      | 206.2       | -5.0

• The best models used all available factors (Word, Stem, Root, Pattern, Morph) and various parallel backoffs.

Word Error Rate (WER) Results

  Stage | Dev: Word LM baseline | Dev: Factored LM | Eval (eval97): Word LM baseline | Eval (eval97): Factored LM
  1     | 57.3                  | 56.2             | 61.7                            | 61.0
  2a    | 54.8                  | 52.7             | 58.2                            | 56.5
  2b    | 54.3                  | 52.5             | 58.8                            | 57.4
  3     | 53.9                  | 52.1             | 57.6                            | 56.1

• Factored language models gave a 1.5% absolute improvement in WER (e.g. 57.6 -> 56.1 on the eval set at the final stage).
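A note on reading the perplexity tables: across all rows above, the ppl (%) column matches the relative perplexity reduction of the Genetic FLM over the better of the Hand and Random FLMs. That reading is inferred from the numbers rather than stated on the slides; a quick Python check:

  def rel_change(new, old):
      """Relative change in percent, as in the ppl (%) columns."""
      return 100.0 * (new - old) / old

  # Arabic eval set, trigram row: Hand 217.1, Random 224.3, Genetic 206.2
  print(round(rel_change(206.2, min(217.1, 224.3)), 1))   # -5.0, matches ppl (%)

  # Turkish dev set, trigram row: Hand 533.5, Random 497.1, Genetic 444.5
  print(round(rel_change(444.5, min(533.5, 497.1)), 1))   # -10.6, matches ppl (%)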