Advanced Artificial Intelligence
Part II. Statistical NLP: N-Grams
Wolfram Burgard, Luc De Raedt, Bernhard Nebel, Lars Schmidt-Thieme
Most slides taken from Helmut Schmid, Rada Mihalcea, Bonnie Dorr, Leila Kosseim and others

Contents
Short recap of the motivation for SNLP
Probabilistic language models
N-grams
Predicting the next word in a sentence
Language guessing
Largely chapter 6 of Statistical NLP, Manning and Schuetze, and chapter 6 of Speech and Language Processing, Jurafsky and Martin.

Motivation
Human language is highly ambiguous at all levels:
• acoustic level: recognize speech vs. wreck a nice beach
• morphological level: saw: to see (past), saw (noun), to saw (present, inf)
• syntactic level: I saw the man on the hill with a telescope
• semantic level: One book has to be read by every student

NLP and Statistics
Statistical disambiguation:
• Define a probability model for the data
• Compute the probability of each alternative
• Choose the most likely alternative

Language Models
Speech recognisers use a "noisy channel model":
   source --> sentence s --> channel --> acoustic signal a
The source generates a sentence s with probability P(s).
The channel transforms the text into an acoustic signal with probability P(a|s).
The task of the speech recogniser is to find, for a given speech signal a, the most likely sentence s:
   s = argmax_s P(s|a) = argmax_s P(a|s) P(s) / P(a) = argmax_s P(a|s) P(s)

Language Models
Speech recognisers employ two statistical models:
• a language model P(s)
• an acoustic model P(a|s)

Language Model
Definition:
• A language model is a model that enables one to compute the probability, or likelihood, of a sentence s, P(s).
Let's look at different ways of computing P(s) in the context of word prediction.

Example of a bad language model
(The slides show several illustrations of bad language models.)

A Good Language Model
Determines reliable sentence probability estimates:
• P("And nothing but the truth") ≈ 0.001
• P("And nuts sing on the roof") ≈ 0

Shannon game: Word Prediction
Predicting the next word in the sequence:
• Statistical natural language ….
• The cat is thrown out of the …
• The large green …
• Sue swallowed the large green …
• …

Claim
A useful part of the knowledge needed to allow word prediction can be captured using simple statistical techniques.
Compute:
- the probability of a sequence
- the likelihood of words co-occurring
Why would we want to do this?
• Rank the likelihood of sequences containing various alternative hypotheses
• Assess the likelihood of a hypothesis

Applications
Spelling correction
Mobile phone texting
Speech recognition
Handwriting recognition
Disabled users
…

Spelling errors
They are leaving in about fifteen minuets to go to her house.
The study was conducted mainly be John Black.
Hopefully, all with continue smoothly in my absence.
Can they lave him my messages?
I need to notified the bank of….
He is trying to fine out.

Handwriting recognition
Assume a note is given to a bank teller, which the teller reads as I have a gub. (cf. Woody Allen)
NLP to the rescue ….
• gub is not a word
• gun, gum, Gus, and gull are words, but gun has a higher probability in the context of a bank

For Spell Checkers
Collect a list of commonly substituted words:
• piece/peace, whether/weather, their/there ...
Example:
"On Tuesday, the whether …" --> "On Tuesday, the weather …"
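To make the Shannon-game idea concrete before the formal treatment below, here is a minimal sketch of word prediction from a table of conditional probabilities. The tiny probability table and the helper name predict_next are invented purely for illustration; a real model would estimate such probabilities from a corpus as described in the following sections.

```python
# Minimal sketch of the Shannon game: given a context, rank candidate next words
# by their conditional probability. The toy table below is invented for illustration.

toy_model = {
    ("large", "green"): {"tree": 0.30, "pill": 0.15, "frog": 0.10, "idea": 0.01},
    ("out", "of", "the"): {"house": 0.20, "window": 0.15, "race": 0.05},
}

def predict_next(context, model, k=3):
    """Return the k most likely next words for the given context tuple."""
    candidates = model.get(tuple(context), {})
    return sorted(candidates.items(), key=lambda kv: kv[1], reverse=True)[:k]

if __name__ == "__main__":
    print(predict_next(["large", "green"], toy_model))
    # -> [('tree', 0.3), ('pill', 0.15), ('frog', 0.1)]
```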
Language Models
How to assign probabilities to word sequences?
The probability of a word sequence w1,n is decomposed into a product of conditional probabilities (chain rule):
   P(w1,n) = P(w1) P(w2 | w1) P(w3 | w1,w2) ... P(wn | w1,n-1)
           = ∏ i=1..n P(wi | w1,i-1)

Language Models
In order to simplify the model, we assume that
• each word only depends on the 2 preceding words:
   P(wi | w1,i-1) = P(wi | wi-2, wi-1)
  (2nd order Markov model, trigram)
• the probabilities are time invariant (stationary):
   P(Wi=c | Wi-2=a, Wi-1=b) = P(Wk=c | Wk-2=a, Wk-1=b)
Final formula:
   P(w1,n) = ∏ i=1..n P(wi | wi-2, wi-1)

Simple N-Grams
An N-gram model uses the previous N-1 words to predict the next one:
• P(wn | wn-N+1 wn-N+2 … wn-1)
unigrams:    P(dog)
bigrams:     P(dog | big)
trigrams:    P(dog | the big)
quadrigrams: P(dog | chasing the big)

A Bigram Grammar Fragment
   Eat on      .16     Eat Thai       .03
   Eat some    .06     Eat breakfast  .03
   Eat lunch   .06     Eat in         .02
   Eat dinner  .05     Eat Chinese    .02
   Eat at      .04     Eat Mexican    .02
   Eat a       .04     Eat tomorrow   .01
   Eat Indian  .04     Eat dessert    .007
   Eat today   .03     Eat British    .001

Additional Grammar
   <start> I    .25     Want some           .04
   <start> I'd  .06     Want Thai           .01
   <start> Tell .04     To eat              .26
   <start> I'm  .02     To have             .14
   I want       .32     To spend            .09
   I would      .29     To be               .02
   I don't      .08     British food        .60
   I have       .04     British restaurant  .15
   Want to      .65     British cuisine     .01
   Want a       .05     British lunch       .01

Computing Sentence Probability
P(I want to eat British food)
 = P(I|<start>) P(want|I) P(to|want) P(eat|to) P(British|eat) P(food|British)
 = .25 × .32 × .65 × .26 × .001 × .60 ≈ .0000081
vs. P(I want to eat Chinese food) = .00015
Probabilities seem to capture "syntactic" facts and "world knowledge":
- eat is often followed by an NP
- British food is not too popular
N-gram models can be trained by counting and normalization.

Some adjustments
The product of probabilities leads to numerical underflow for long sentences, so instead of multiplying the probabilities we add the logs of the probabilities:
log P(I want to eat British food)
 = log(P(I|<s>)) + log(P(want|I)) + log(P(to|want)) + log(P(eat|to)) + log(P(British|eat)) + log(P(food|British))
 = log(.25) + log(.32) + log(.65) + log(.26) + log(.001) + log(.6)
 = -11.722   (natural logs)

Why use only bi- or tri-grams?
The Markov approximation is still costly. With a 20,000-word vocabulary:
• a bigram model needs to store 400 million parameters
• a trigram model needs to store 8 trillion parameters
• using a language model of order > trigram is impractical
To reduce the number of parameters, we can:
• do stemming (use stems instead of word types)
• group words into semantic classes
• treat words seen once the same as unseen words
• ...

Building n-gram Models
Data preparation:
• Decide on a training corpus
• Clean and tokenize
• How do we deal with sentence boundaries?
  I eat. I sleep.
  • without markers: (I eat) (eat I) (I sleep)
  <s> I eat <s> I sleep <s>
  • with markers: (<s> I) (I eat) (eat <s>) (<s> I) (I sleep) (sleep <s>)
Use statistical estimators:
• to derive good probability estimates based on the training data.
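Before turning to how the probabilities themselves are estimated, here is a small sketch of the bigram sentence-probability computation and the log-space trick shown above. The dictionary simply hard-codes the bigram values from the grammar fragment slides; everything else is a generic helper.

```python
import math

# Bigram probabilities taken from the grammar fragment above.
bigram_p = {
    ("<start>", "I"): 0.25,
    ("I", "want"): 0.32,
    ("want", "to"): 0.65,
    ("to", "eat"): 0.26,
    ("eat", "British"): 0.001,
    ("British", "food"): 0.60,
}

def sentence_logprob(words, p):
    """Sum of log bigram probabilities; adding logs avoids numerical underflow."""
    logprob = 0.0
    for w1, w2 in zip(["<start>"] + words, words):
        logprob += math.log(p[(w1, w2)])
    return logprob

if __name__ == "__main__":
    lp = sentence_logprob("I want to eat British food".split(), bigram_p)
    print(lp)             # approx -11.722 (natural log)
    print(math.exp(lp))   # approx 0.0000081
```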
Statistical Estimators
Maximum Likelihood Estimation (MLE)
Smoothing
• Add-one -- Laplace
• Add-delta -- Lidstone's & Jeffreys-Perks' Laws (ELE)
• ( Validation:
  » Held Out Estimation
  » Cross Validation )
• Witten-Bell smoothing
• Good-Turing smoothing
Combining Estimators
• Simple Linear Interpolation
• General Linear Interpolation
• Katz's Backoff

Statistical Estimators --> Maximum Likelihood Estimation (MLE)

Maximum Likelihood Estimation
Choose the parameter values which give the highest probability on the training corpus.
Let C(w1,..,wn) be the frequency of the n-gram w1,..,wn:
   P_MLE(wn | w1,..,wn-1) = C(w1,..,wn) / C(w1,..,wn-1)

Example 1: P(event)
In a training corpus, we have 10 instances of "come across":
• 8 times, followed by "as"
• 1 time, followed by "more"
• 1 time, followed by "a"
With MLE, we have:
• P(as | come across) = 0.8
• P(more | come across) = 0.1
• P(a | come across) = 0.1
• P(X | come across) = 0 for any other word X

Problem with MLE: data sparseness
What if a sequence never appears in the training corpus? Then P(X) = 0:
• "come across the men" --> prob = 0
• "come across some men" --> prob = 0
• "come across 3 men" --> prob = 0
MLE assigns a probability of zero to unseen events … so the probability of an n-gram involving unseen words will be zero!
Maybe with a larger corpus? Some words or word combinations are unlikely to appear, however large the corpus!
Recall:
• Zipf's law
• f ~ 1/r

Problem with MLE: data sparseness (con't)
In (Bahl et al., 1983):
• training with 1.5 million words
• 23% of the trigrams from another part of the same corpus were previously unseen
So MLE alone is not a good enough estimator.

Discounting or Smoothing
MLE is usually unsuitable for NLP because of the sparseness of the data.
We need to allow for the possibility of seeing events not seen in training.
We must use a discounting or smoothing technique:
decrease the probability of previously seen events to leave a little bit of probability for previously unseen events.

Statistical Estimators --> Smoothing --> Add-one (Laplace)

Many smoothing techniques
Add-one
Add-delta
Witten-Bell smoothing
Good-Turing smoothing
Church-Gale smoothing
Absolute discounting
Kneser-Ney smoothing
...
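Before looking at the individual smoothing methods, here is a minimal sketch of plain MLE estimation by counting and normalizing. The toy corpus is invented so that it mirrors the "come across" example above (for brevity it conditions only on the single previous word, which gives the same numbers here); the function name is of course an assumption.

```python
from collections import Counter

def mle_bigram_prob(corpus_tokens):
    """Estimate P_MLE(w2 | w1) = C(w1, w2) / C(w1) by counting and normalizing."""
    unigram = Counter(corpus_tokens[:-1])            # counts of w1 as a bigram history
    bigram = Counter(zip(corpus_tokens, corpus_tokens[1:]))
    return lambda w1, w2: bigram[(w1, w2)] / unigram[w1] if unigram[w1] else 0.0

if __name__ == "__main__":
    # Invented toy corpus: "come across" followed 8x by "as", 1x by "more", 1x by "a".
    corpus = []
    for nxt, times in [("as", 8), ("more", 1), ("a", 1)]:
        corpus += ["come", "across", nxt] * times
    p = mle_bigram_prob(corpus)
    print(p("across", "as"))    # 0.8
    print(p("across", "more"))  # 0.1
    print(p("across", "the"))   # 0.0 -- unseen: MLE assigns zero
```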
Add-one Smoothing (Laplace's law)
Pretend we have seen every n-gram at least once.
Intuitively:
• new_count(n-gram) = old_count(n-gram) + 1
The idea is to give a little bit of the probability space to unseen events.

Add-one: Example
Unsmoothed bigram counts (rows = 1st word, columns = 2nd word):

             I     want   to    eat   Chinese  food  lunch  …  Total (N)
   I         8     1087   0     13    0        0     0        3437
   want      3     0      786   0     6        8     6        1215
   to        3     0      10    860   3        0     12       3256
   eat       0     0      2     0     19       2     52       938
   Chinese   2     0      0     0     0        120   1        213
   food      19    0      17    0     0        0     0        1506
   lunch     4     0      0     0     0        1     0        459

Unsmoothed normalized bigram probabilities (each count divided by its row total N):

             I       want   to      eat     Chinese  food   lunch  …  Total
   I         .0023   .32    0       .0038   0        0      0        1
   want      .0025   0      .65     0       .0049    .0066  .0049    1
   to        .00092  0      .0031   .26     .00092   0      .0037    1
   eat       0       0      .0021   0       .020     .0021  .055     1
   Chinese   .0094   0      0       0       0        .56    .0047    1
   food      .013    0      .011    0       0        0      0        1
   lunch     .0087   0      0       0       0        .0022  0        1
(e.g. P(I|I) = 8/3437 = .0023, P(eat|I) = 13/3437 = .0038)

Add-one: Example (con't)
Add-one smoothed bigram counts:

             I     want   to    eat   Chinese  food  lunch  …  Total (N+V)
   I         9     1088   1     14    1        1     1        5053
   want      4     1      787   1     7        9     7        2831
   to        4     1      11    861   4        1     13       4872
   eat       1     1      3     1     20       3     53       2554
   Chinese   3     1      1     1     1        121   2        1829
   food      20    1      18    1     1        1     1        3122
   lunch     5     1      1     1     1        2     1        2075

Add-one normalized bigram probabilities (each smoothed count divided by its row total N+V):

             I       want    to      eat     Chinese  food    lunch   …  Total
   I         .0018   .22     .0002   .0028   .0002    .0002   .0002      1
   want      .0014   .00035  .28     .00035  .0025    .0032   .0025      1
   to        .00082  .00021  .0023   .18     .00082   .00021  .0027      1
   eat       .00039  .00039  .0012   .00039  .0078    .0012   .021       1
   Chinese   .0016   .00055  .00055  .00055  .00055   .066    .0011      1
   food      .0064   .00032  .0058   .00032  .00032   .00032  .00032     1
   lunch     .0024   .00048  .00048  .00048  .00048   .00096  .00048     1
(e.g. P(I|I) = 9/5053 = .0018, P(eat|I) = 14/5053 = .0028)

Add-one, more formally
   P_Add1(w1 w2 … wn) = ( C(w1 w2 … wn) + 1 ) / ( N + B )
N: number of n-grams in the training corpus
B: number of bins (of possible n-grams)
   B = V^2 for bigrams, B = V^3 for trigrams, etc., where V is the size of the vocabulary.

Problem with add-one smoothing
Bigrams starting with Chinese are boosted by a factor of 8! (1829 / 213)
(compare the unsmoothed and add-one smoothed count tables above)

Problem with add-one smoothing (con't)
Data from the AP from (Church and Gale, 1991):
• corpus of 22,000,000 bigrams
• vocabulary of 273,266 words (i.e. 74,674,306,760 possible bigrams, or bins)
• 74,671,100,000 bigrams were unseen
• and each unseen bigram was given a frequency of 0.000137

   f_MLE   f_empirical          f_add-one
   (freq. in training)  (freq. in held-out data)  (add-one smoothed freq.)
   0       0.000027             0.000137   <- too high
   1       0.448                0.000274   <- too low (and likewise below)
   2       1.25                 0.000411
   3       2.24                 0.000548
   4       3.23                 0.000685
   5       4.21                 0.000822

Total probability mass given to unseen bigrams = (74,671,100,000 × 0.000137) / 22,000,000 ≈ 0.465 !!!!

Problem with add-one smoothing
Every previously unseen n-gram is given a low probability, but there are so many of them that too much probability mass is given to unseen events.
Adding 1 to a frequent bigram does not change much, but adding 1 to low-count bigrams (including unseen ones) boosts them too much!
In NLP applications, which are very sparse, Laplace's law actually gives far too much of the probability space to unseen events.
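Here is a minimal sketch of conditional add-one smoothing as used in the restaurant example above. The vocabulary size V = 1616 is the one implied by the row totals (e.g. 3437 + 1616 = 5053); the function name is an assumption.

```python
def add_one_bigram_prob(bigram_count, w1_count, vocab_size):
    """Laplace-smoothed conditional bigram probability:
       P(w2 | w1) = (C(w1, w2) + 1) / (C(w1) + V)."""
    return (bigram_count + 1) / (w1_count + vocab_size)

if __name__ == "__main__":
    V = 1616  # vocabulary size implied by the restaurant tables above
    print(add_one_bigram_prob(1087, 3437, V))  # P(want | I)    ~ .22  (1088/5053)
    print(add_one_bigram_prob(0, 213, V))      # P(eat | Chinese) ~ .00055 -- unseen, but no longer zero
```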
Statistical Estimators --> Smoothing --> Add-delta (Lidstone's & Jeffreys-Perks' Laws, ELE)

Add-delta smoothing (Lidstone's law)
Instead of adding 1, add some other (smaller) positive value δ:
   P_AddD(w1 w2 … wn) = ( C(w1 w2 … wn) + δ ) / ( N + δB )
The most widely used value is δ = 0.5.
If δ = 0.5, Lidstone's law is called:
• the Expected Likelihood Estimation (ELE)
• or the Jeffreys-Perks law
   P_ELE(w1 w2 … wn) = ( C(w1 w2 … wn) + 0.5 ) / ( N + 0.5 B )
Better than add-one, but still…

Adding δ / Lidstone's law
The expected frequency of a trigram in a random sample of size N is therefore
   f*(w, w', w'') = λ f(w, w', w'') + (1 − λ) N/B,   with λ = N / (N + δB)
i.e. relative discounting.

Statistical Estimators --> Smoothing --> ( Validation: Held Out Estimation, Cross Validation )

Validation / Held-out Estimation
How do we know how much of the probability space to "hold out" for unseen events? I.e. we need a good way to guess in advance.
Held-out data:
• We can divide the training data into two parts:
  » the training set: used to build initial estimates by counting
  » the held-out data: used to refine the initial estimates (i.e. see how often the bigrams that appeared r times in the training text occur in the held-out text)

Held Out Estimation
For each n-gram w1...wn we compute:
• Ctr(w1...wn): the frequency of w1...wn in the training data
• Cho(w1...wn): the frequency of w1...wn in the held-out data
Let:
• r = the frequency of an n-gram in the training data
• Nr = the number of different n-grams with frequency r in the training data
• Tr = the sum of the counts of all n-grams in the held-out data that appeared r times in the training data:
   Tr = Σ Cho(w1...wn)  over all {w1...wn : Ctr(w1...wn) = r}
• T = the total number of n-grams in the held-out data
So:
   Pho(w1...wn) = (Tr / T) × (1 / Nr),   where r = Ctr(w1...wn)

Some explanation…
Pho(w1...wn) = (Tr / T) × (1 / Nr) is the probability mass, in the held-out data, of all n-grams appearing r times in the training data; since we have Nr different n-grams in the training data that occurred r times, we share this probability mass equally among them.
Example:
• assume r = 5 and 10 different n-grams (types) occur 5 times in training --> N5 = 10
• if all the n-grams (types) that occurred 5 times in training occurred in total (n-gram tokens) 20 times in the held-out data --> T5 = 20
• assume the held-out data contains 2000 n-grams (tokens)
   Pho(an n-gram with r = 5) = (20 / 2000) × (1 / 10) = 0.001
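The following sketch implements the held-out estimate above directly from two count dictionaries; the synthetic counts are chosen so that they reproduce the worked example (N5 = 10, T5 = 20, T = 2000). The function name and the toy data are assumptions.

```python
from collections import Counter

def held_out_probs(train_counts, held_out_counts, total_held_out):
    """Held-out estimate: all n-grams seen r times in training share the mass
       T_r / T equally, i.e. P_ho = T_r / (T * N_r), indexed by r."""
    n_r = Counter(train_counts.values())   # N_r: number of types with training count r
    t_r = Counter()                        # T_r: held-out count of those types
    for ngram, r in train_counts.items():
        t_r[r] += held_out_counts.get(ngram, 0)
    return {r: t_r[r] / (total_held_out * n_r[r]) for r in n_r}

if __name__ == "__main__":
    # 10 n-gram types occur 5 times in training; together they occur 20 times
    # in a held-out sample of 2000 n-gram tokens (invented to match the example).
    train = {f"ngram{i}": 5 for i in range(10)}
    held = {f"ngram{i}": 2 for i in range(10)}   # 10 * 2 = 20 occurrences
    print(held_out_probs(train, held, total_held_out=2000)[5])  # 0.001
```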
Cross-Validation
Held-out estimation is useful if there is a lot of data available. If not, we can use each part of the data both as training data and as held-out data.
Main methods:
• Deleted Estimation (two-way cross-validation)
  » Divide the data into part 0 and part 1
  » In one model use 0 as the training data and 1 as the held-out data
  » In another model use 1 as training and 0 as held-out data
  » Do a weighted average of the two models
• Leave-One-Out
  » Divide the data into N parts (N = number of tokens)
  » Leave 1 token out each time
  » Train N language models

Comparison
Empirical results for bigram data (Church and Gale):

   f    f_emp     f_GT      f_add1     f_held-out
   0    0.000027  0.000027  0.000137   0.000037
   1    0.448     0.446     0.000274   0.396
   2    1.25      1.26      0.000411   1.24
   3    2.24      2.24      0.000548   2.23
   4    3.23      3.24      0.000685   3.22
   5    4.21      4.22      0.000822   4.22
   6    5.23      5.19      0.000959   5.20
   7    6.21      6.21      0.00109    6.21
   8    7.21      7.24      0.00123    7.18
   9    8.26      8.25      0.00137    8.18

Dividing the corpus
Training:
• Training data (80% of total data): to build initial estimates (frequency counts)
• Held-out data (10% of total data): to refine the initial estimates (smoothed estimates)
Testing:
• Development test data (5% of total data): to test while developing
• Final test data (5% of total data): to test at the end
But how do we divide?
• Randomly select data (e.g. sentences, n-grams). Advantage: test data is very similar to training data.
• Cut large chunks of consecutive data. Advantage: results are lower, but more realistic.

Developing and Testing Models
1. Write an algorithm
2. Train it (with the training set & held-out data)
3. Test it (with the development set)
4. Note things it does wrong & revise it
5. Repeat 1-4 until satisfied
6. Only then, evaluate and publish results (with the final test set)
Better to give final results by testing on n smaller samples of the test data and averaging.

Statistical Estimators --> Smoothing --> Witten-Bell smoothing

Witten-Bell smoothing
Intuition:
• An unseen n-gram is one that just did not occur yet
• When it does happen, it will be its first occurrence
• So give to unseen n-grams the probability of seeing a new n-gram

Witten-Bell: the equations
Total probability mass assigned to zero-frequency n-grams (NB: T is the number of OBSERVED types, not V):
   T / (N + T)
So each zero-count n-gram gets the probability:
   (1 / Z) × T / (N + T)
where Z is the number of unseen (zero-count) n-grams.

Witten-Bell: why 'discounting'
Now of course we have to take away something ('discount') from the probability of the events already seen: each seen n-gram keeps probability C / (N + T) instead of C / N.

Witten-Bell for bigrams
We 'relativize' the types to the previous word; the zero-frequency probability mass must be distributed in equal parts over all unseen bigrams starting with w1.
• Z(w1): number of unseen bigram types starting with w1
   P(w2 | w1) = (1 / Z(w1)) × T(w1) / ( N(w1) + T(w1) )   for each unseen bigram
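As a sketch of the Witten-Bell bigram formulas just given, the code below computes N(w1), T(w1) and Z(w1) from raw bigram counts and returns smoothed probabilities; the demo data matches the tiny a/b/c/d example worked through on the next slides. The function name is an assumption.

```python
from collections import defaultdict

def witten_bell_bigram(bigram_counts, vocab):
    """Witten-Bell smoothed P(w2 | w1):
       seen:   C(w1,w2) / (N(w1) + T(w1))
       unseen: T(w1) / (Z(w1) * (N(w1) + T(w1)))"""
    N = defaultdict(int)      # bigram tokens starting with w1
    seen = defaultdict(set)   # bigram types starting with w1
    for (w1, w2), c in bigram_counts.items():
        N[w1] += c
        seen[w1].add(w2)

    def prob(w1, w2):
        T = len(seen[w1])
        Z = len(vocab) - T
        if (w1, w2) in bigram_counts:
            return bigram_counts[(w1, w2)] / (N[w1] + T)
        return T / (Z * (N[w1] + T))

    return prob

if __name__ == "__main__":
    # The a/b/c/d example: a was followed by a, b, c (10 times each), etc.
    counts = {("a", "a"): 10, ("a", "b"): 10, ("a", "c"): 10,
              ("b", "c"): 30, ("c", "c"): 300}
    p = witten_bell_bigram(counts, vocab={"a", "b", "c", "d"})
    print(p("a", "d"))  # ~0.091  (unseen bigram starting with a)
    print(p("b", "a"))  # ~0.011  (unseen bigram starting with b)
```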
Small example

        a    b    c    d   …   N(w1)          T(w1)         Z(w1)
                               (seen tokens)  (seen types)  (unseen types)
   a    10   10   10   0       30             3             1
   b    0    0    30   0       30             1             3
   c    0    0    300  0       300            1             3
   d    …

All unseen bigrams starting with a will share a probability mass of
   T(a) / ( N(a) + T(a) ) = 3 / (30 + 3) = 0.091
Each unseen bigram starting with a will have an equal part of this:
   P(d | a) = (1 / Z(a)) × T(a) / ( N(a) + T(a) ) = (1/1) × 0.091 = 0.091

Small example (con't)
All unseen bigrams starting with b will share a probability mass of
   T(b) / ( N(b) + T(b) ) = 1 / (30 + 1) = 0.032
Each unseen bigram starting with b will have an equal part of this:
   P(a | b) = P(b | b) = P(d | b) = (1/3) × 0.032 = 0.011

Small example (con't)
All unseen bigrams starting with c will share a probability mass of
   T(c) / ( N(c) + T(c) ) = 1 / (300 + 1) = 0.0033
Each unseen bigram starting with c will have an equal part of this:
   P(a | c) = P(b | c) = P(d | c) = (1/3) × 0.0033 = 0.0011

More formally
Unseen bigrams: to get from the probabilities back to the counts, we know that
   P(w2 | w1) = C(w2 | w1) / N(w1)        // N(w1) = number of bigrams starting with w1
so we get:
   C(w2 | w1) = P(w2 | w1) × N(w1)
              = (1 / Z(w1)) × T(w1) / ( N(w1) + T(w1) ) × N(w1)
              = T(w1) N(w1) / ( Z(w1) ( N(w1) + T(w1) ) )

The restaurant example
The original counts were:

             I    want  to   eat  Chinese food lunch …  N(w)    T(w)   Z(w)
                                                         (tokens)(types)(unseen types)
   I         8    1087  0    13   0       0    0        3437    95     1521
   want      3    0     786  0    6       8    6        1215    76     1540
   to        3    0     10   860  3       0    12       3256    130    1486
   eat       0    0     2    0    19      2    52       938     124    1492
   Chinese   2    0     0    0    0       120  1        213     20     1596
   food      19   0     17   0    0       0    0        1506    82     1534
   lunch     4    0     0    0    0       1    0        459     45     1571

T(w) = number of different seen bigram types starting with w
N(w) = number of bigram tokens starting with w
We have a vocabulary of 1616 words, so we can compute
Z(w) = number of unseen bigram types starting with w = 1616 − T(w)

Witten-Bell smoothed count
• the count of the unseen bigram "I lunch":
   T(I)/Z(I) × N(I)/(N(I)+T(I)) = 95/1521 × 3437/(3437+95) ≈ 0.06
• the count of the seen bigram "want to":
   count(want to) × N(want)/(N(want)+T(want)) = 786 × 1215/(1215+76) ≈ 739.73

Witten-Bell smoothed bigram counts:

             I      want     to      eat     Chinese  food    lunch  …  Total
   I         7.78   1057.76  .061    12.65   .06      .06     .06       3437
   want      2.82   .05      739.73  .05     5.65     7.53    5.65      1215
   to        2.88   .08      9.62    826.98  2.88     .08     12.50     3256
   eat       .07    .07      19.43   .07     16.78    1.77    45.93     938
   Chinese   1.74   .01      .01     .01     .01      109.70  .91       213
   food      18.02  .05      16.12   .05     .05      .05     .05       1506
   lunch     3.64   .03      .03     .03     .03      0.91    .03       459

Witten-Bell smoothed probabilities
Dividing each smoothed count by its row total N(w) gives the Witten-Bell normalized bigram probabilities, e.g.
   P(want | I) = 1057.76 / 3437 ≈ .3078   and   P(to | want) = 739.73 / 1215 ≈ .6088;
each row again sums to 1.

Statistical Estimators --> Smoothing --> Good-Turing smoothing

Good-Turing Estimator
Based on the assumption that words have a binomial distribution.
Works well in practice (with large corpora).
Idea:
• Re-estimate the probability mass of n-grams with zero (or low) counts by looking at the number of n-grams with higher counts:
   c* = (c + 1) × N_{c+1} / N_c
  where N_c = number of n-grams that occur c times and N_{c+1} = number of n-grams that occur c+1 times.
• Ex: the new count for bigrams that never occurred:
   c0* = (0 + 1) × N_1 / N_0
  where N_1 = number of bigrams that occurred once and N_0 = number of bigrams that never occurred.

Good-Turing Estimator (con't)
In practice c* is not used for all counts c; large counts (above a threshold k) are assumed to be reliable.
If c > k (usually k = 5):
   c* = c
If c <= k:
   c* = [ (c+1) N_{c+1}/N_c − c (k+1) N_{k+1}/N_1 ] / [ 1 − (k+1) N_{k+1}/N_1 ]   for 1 <= c <= k
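Here is a minimal sketch of the thresholded Good-Turing re-estimation above. The frequency-of-frequency numbers in the demo are purely synthetic; the function name and the exact handling of counts above k follow the slide formulas, everything else is an assumption.

```python
from collections import Counter

def good_turing_counts(ngram_counts, k=5):
    """Good-Turing adjusted count c*; counts above the threshold k are kept as-is."""
    N = Counter(ngram_counts.values())   # N[c] = number of n-gram types with count c

    def c_star(c):
        if c > k:
            return c
        # Thresholded formula for 1 <= c <= k (assumes N[c], N[c+1], N[k+1], N[1] > 0)
        discount = (k + 1) * N[k + 1] / N[1]
        return ((c + 1) * N[c + 1] / N[c] - c * discount) / (1 - discount)

    return {ng: c_star(c) for ng, c in ngram_counts.items()}

if __name__ == "__main__":
    # Synthetic counts (invented): 1000 types seen once, 400 twice, ... 20 six times.
    freq_of_freq = {1: 1000, 2: 400, 3: 200, 4: 100, 5: 50, 6: 20}
    counts = {f"ngram_{c}_{i}": c for c, n in freq_of_freq.items() for i in range(n)}
    adjusted = good_turing_counts(counts, k=5)
    print(adjusted["ngram_1_0"])            # a singleton is discounted to ~0.77
    print(1000 / sum(counts.values()))      # mass reserved for unseen n-grams: N_1 / N
```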
Statistical Estimators --> Combining Estimators

Combining Estimators
So far, we gave the same probability to all unseen n-grams.
• We have never seen the bigrams
  » journal of
  » journal from
  » journal never
  so Punsmoothed(of | journal) = Punsmoothed(from | journal) = Punsmoothed(never | journal) = 0
• All models so far will give the same probability to all 3 bigrams.
But intuitively, "journal of" is more probable because...
• "of" is more frequent than "from" & "never"
• unigram probability: P(of) > P(from) > P(never)

Combining Estimators (con't)
Observation:
• a unigram model suffers less from data sparseness than a bigram model
• a bigram model suffers less from data sparseness than a trigram model
• …
So use a lower-order model estimate to estimate the probability of unseen n-grams.
If we have several models of how the history predicts what comes next, we can combine them in the hope of producing an even better model.

Combining Estimators --> Simple Linear Interpolation

Simple Linear Interpolation
Solve the sparseness in a trigram model by mixing it with bigram and unigram models.
Also called:
• linear interpolation
• finite mixture models
• deleted interpolation
Combine linearly:
   Pli(wn | wn-2, wn-1) = λ1 P(wn) + λ2 P(wn | wn-1) + λ3 P(wn | wn-2, wn-1)
• where 0 ≤ λi ≤ 1 and Σi λi = 1

Combining Estimators --> General Linear Interpolation

General Linear Interpolation
In simple linear interpolation, the weights λi are constant, so the unigram estimate is always combined with the same weight, regardless of whether the trigram is accurate (because there is lots of data) or poor.
We can have a more general and powerful model where the λi are a function of the history h:
   Pgli(wn | wn-2, wn-1) = λ1 P(wn) + λ2(wn-1) P(wn | wn-1) + λ3(wn-2, wn-1) P(wn | wn-2, wn-1)
• where 0 ≤ λi(h) ≤ 1 and Σi λi(h) = 1
Having a specific λ(h) per n-gram is not a good idea, but we can set λ(h) according to the frequency of the n-gram.
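Before moving on to backoff, here is a minimal sketch of simple linear interpolation with fixed weights over MLE unigram, bigram and trigram components. The weights and the toy corpus are invented for illustration; as the slides note, in practice the weights would be tuned on held-out data, and the component models would themselves be smoothed.

```python
from collections import Counter

def make_interpolated_lm(tokens, lambdas=(0.1, 0.3, 0.6)):
    """P_li(w | u, v) = l1*P(w) + l2*P(w|v) + l3*P(w|u,v), with MLE components.
       The weights are illustrative only."""
    l1, l2, l3 = lambdas
    uni = Counter(tokens)
    bi = Counter(zip(tokens, tokens[1:]))
    tri = Counter(zip(tokens, tokens[1:], tokens[2:]))
    total = len(tokens)

    def prob(u, v, w):
        p1 = uni[w] / total
        p2 = bi[(v, w)] / uni[v] if uni[v] else 0.0
        p3 = tri[(u, v, w)] / bi[(u, v)] if bi[(u, v)] else 0.0
        return l1 * p1 + l2 * p2 + l3 * p3

    return prob

if __name__ == "__main__":
    text = "i want to eat british food i want to eat chinese food".split()
    p = make_interpolated_lm(text)
    print(p("want", "to", "eat"))   # high: all three components agree
    print(p("to", "eat", "thai"))   # 0.0 here: "thai" is unseen even as a unigram,
                                    # which smoothing of the components would fix
```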
Combining Estimators --> Katz's Backoff

Backoff Smoothing
Smoothing of conditional probabilities, e.g. p(Angeles | to, Los):
If "to Los Angeles" is not in the training corpus, the smoothed probability p(Angeles | to, Los) is identical to p(York | to, Los).
However, the actual probability is probably close to the bigram probability p(Angeles | Los).

(Wrong) Back-off Smoothing of trigram probabilities
   if C(w', w'', w) > 0:     P*(w | w', w'') = P(w | w', w'')
   else if C(w'', w) > 0:    P*(w | w', w'') = P(w | w'')
   else if C(w) > 0:         P*(w | w', w'') = P(w)
   else:                     P*(w | w', w'') = 1 / #words

Backoff Smoothing
Problem: this is not a probability distribution.
Solution: combination of back-off and frequency discounting:
   P(w | w1,...,wk) = C*(w1,...,wk,w) / N                 if C(w1,...,wk,w) > 0
   P(w | w1,...,wk) = α(w1,...,wk) P(w | w2,...,wk)       otherwise

Backoff Smoothing
The backoff factor α is defined such that the probability mass assigned to unobserved trigrams,
   α(w1,...,wk) × Σ_{w: C(w1,...,wk,w)=0} P(w | w2,...,wk),
is identical to the probability mass discounted from the observed trigrams,
   1 − Σ_{w: C(w1,...,wk,w)>0} P(w | w1,...,wk).
Therefore, we get:
   α(w1,...,wk) = ( 1 − Σ_{w: C(w1,...,wk,w)>0} P(w | w1,...,wk) ) / ( 1 − Σ_{w: C(w1,...,wk,w)>0} P(w | w2,...,wk) )

Other applications of LM
Author / language identification
Hypothesis: texts that resemble each other (same author, same language) share similar characteristics:
• in English, the character sequence "ing" is more probable than in French
Training phase:
• construction of the language model with pre-classified documents (known language/author)
Testing phase:
• evaluation of an unknown text (comparison with the language models)

Example: Language identification
Bigrams of characters:
• characters = 26 letters (case insensitive)
• possible variations: case sensitivity, punctuation, beginning/end-of-sentence marker, …
1. Train a language model for English, e.g.:

        A       B       C       D      …  Y       Z
   A    0.0014  0.0014  0.0014  0.0014 …  0.0014  0.0014
   B    0.0014  0.0014  0.0014  0.0014 …  0.0014  0.0014
   C    0.0014  0.0014  0.0014  0.0014 …  0.0014  0.0014
   D    0.0042  0.0014  0.0014  0.0014 …  0.0014  0.0014
   E    0.0097  0.0014  0.0014  0.0014 …  0.0014  0.0014
   …    …       …       …       …      …  …       …
   Y    0.0014  0.0014  0.0014  0.0014 …  0.0014  0.0014
   Z    0.0014  0.0014  0.0014  0.0014 …  0.0014  0.0014

2. Train a language model for French
3. Evaluate the probability of a sentence with LM-English & LM-French
4. Highest probability --> language of the sentence

Claim
A useful part of the knowledge needed to allow word prediction can be captured using simple statistical techniques.
Compute:
- the probability of a sequence
- the likelihood of words co-occurring
It can be useful to do this.
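To close, here is a minimal sketch of the language-guessing application described above: two character-bigram models (with add-one smoothing to avoid zero probabilities) score an unknown sentence, and the higher log-probability decides. The tiny training strings and all function names are invented for illustration; real models would be trained on large pre-classified corpora.

```python
import math
from collections import Counter

def char_bigram_model(text, alpha=1.0):
    """Train an add-one-smoothed character bigram model; return a log-probability scorer."""
    chars = [c for c in text.lower() if c.isalpha() or c == " "]
    bi = Counter(zip(chars, chars[1:]))
    uni = Counter(chars[:-1])
    vocab = set(chars)

    def logprob(s):
        s = [c for c in s.lower() if c.isalpha() or c == " "]
        return sum(math.log((bi[(a, b)] + alpha) / (uni[a] + alpha * len(vocab)))
                   for a, b in zip(s, s[1:]))

    return logprob

if __name__ == "__main__":
    # Tiny invented training samples, one per language.
    english = char_bigram_model("the king is singing and bringing everything")
    french = char_bigram_model("le roi chante et apporte toutes les choses")
    sentence = "something interesting is happening"
    guess = "English" if english(sentence) > french(sentence) else "French"
    print(guess)  # the 'ing'-heavy sentence should score higher under the English model
```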