Foundations of Statistical Natural Language Processing
CHAPTER 6. STATISTICAL INFERENCE: N-GRAM MODELS OVER SPARSE DATA
Pusan National University, 2014. 4. 22, Myoungjin, Jung

INTRODUCTION
Object of statistical NLP: do statistical inference for the field of natural language.
Statistical inference (broadly divided into two steps):
1. Taking some data generated by an unknown probability distribution. (A corpus is required.)
2. Making some inferences about this distribution. (The probability distribution is inferred from that corpus.)
The problem divides into three areas (the three steps of statistical language processing):
1. Dividing the training data into equivalence classes.
2. Finding a good statistical estimator for each equivalence class.
3. Combining multiple estimators.

BINS: FORMING EQUIVALENCE CLASSES
Reliability vs. discrimination
Ex) “large green ___”: tree? mountain? frog? car?
    “swallowed the large green ___”: pill? broccoli?
• Smaller n: more instances in training data, better statistical estimates (more reliability)
• Larger n: more information about the context of the specific instance (greater discrimination)

N-gram models
• “n-gram” = sequence of n words
• Predicting the next word: $P(w_n \mid w_1, \ldots, w_{n-1})$
• Markov assumption: only the prior local context (the last few words) affects the next word.
• Selecting an n, with vocabulary size = 20,000 words:

n               Number of bins
2 (bigrams)     400,000,000
3 (trigrams)    8,000,000,000,000
4 (4-grams)     1.6 x 10^17

Probability distribution: P(s), where s is a sentence
Ex. P(If you’re going to San Francisco, be sure ...) = P(If) * P(you’re | If) * P(going | If you’re) * P(to | If you’re going) * ...
Markov assumption: only the last n-1 words are relevant for a prediction
Ex. With n = 5: P(sure | If you’re going to San Francisco, be) = P(sure | San Francisco , be)

N-gram: sequence of length n with a count
Ex.
5-gram: If you’re going to San
Sequence naming: $w_1^{i-1} = w_1, \ldots, w_{i-1}$
Markov assumption formalized (only the last n-1 words are kept):
$P(w_i \mid w_1, \ldots, w_{i-1}) = P(w_i \mid w_1^{i-1}) \approx P(w_i \mid w_{i-n+1}^{i-1})$

Instead of P(s), only one conditional probability is needed: $P(w_i \mid w_{i-n+1}^{i-1})$
Simplify $P(w_i \mid w_{i-n+1}^{i-1})$ to $P(w_n \mid w_1^{n-1})$
Next word prediction (NWP) from the n-1 preceding words:
$NWP(w_1^{n-1}) = \arg\max_{w_n \in W} P(w_n \mid w_1^{n-1})$
where $W$ is the set of all words in the corpus.

Ex. The easiest way:
$P_{MLE}(w_n \mid w_1^{n-1}) = \frac{p(w_1^n)}{p(w_1^{n-1})} = \frac{c(w_1^n)}{c(w_1^{n-1})}$
$P(\text{San} \mid \text{If you're going to}) = \frac{p(\text{If you're going to San})}{p(\text{If you're going to})} = \frac{c(\text{If you're going to San})}{c(\text{If you're going to})}$

STATISTICAL ESTIMATORS
Given the observed training data, how do you develop a model (probability distribution) to predict future events? (The goal is a better probability estimate.)
Probability estimate of the target feature:
$P(w_n \mid w_1 \cdots w_{n-1}) = \frac{P(w_1 \cdots w_n)}{P(w_1 \cdots w_{n-1})}$
Estimating the unknown probability distribution of n-grams.

Notation for the statistical estimation chapter:

N               Number of training instances
B               Number of bins training instances are divided into
$w_1^n$         An n-gram w1…wn in the training text
C(w1…wn)        Freq. of n-gram w1…wn in the training text
r               Freq. of an n-gram
f(•)            Freq. estimate of a model
$N_r$           Number of bins that have r training instances in them
$T_r$           Total count of n-grams of freq. r in further data
h               ‘History’ of preceding words

Example - instances in the training corpus: “inferior to ________”

MAXIMUM LIKELIHOOD ESTIMATION (MLE)
Definition: using the relative frequency as a probability estimate.
Example: in the corpus, there are 10 training instances of the words “comes across”.
• 8 times they were followed by “as”: P(as) = 0.8
• Once each by “more” and “a”: P(more) = 0.1, P(a) = 0.1
• Any word x not among the above 3 words: P(x) = 0.0
Formula:
$P_{MLE}(w_1 \cdots w_n) = \frac{C(w_1 \cdots w_n)}{N} = \frac{r}{N}$
$P_{MLE}(w_n \mid w_1 \cdots w_{n-1}) = \frac{C(w_1 \cdots w_n)}{C(w_1 \cdots w_{n-1})}$

Example 1.
A paragraph used as training data:
The bigram model uses the preceding word to help predict the next word. (End) In general, this helps enormously, and gives us a much better model. (End) In some cases the estimated probability of the word that actually comes next has gone up by about an order of magnitude (was, to, sisters). (End) However, note that the bigram model is not guaranteed to increase the probability estimate. (End)

Word (N = 79: $w_1, \ldots, w_{51}$)

            1-gram                          2-gram                3-gram
Counts      C(the)=7, C(bigram)=2,          C(the,bigram)=2       C(the,bigram,model)=2
            C(model)=3
Estimates   P(the)=7/79, P(bigram)=2/79     P(bigram|the)=2/7     P(model|the,bigram)=2/2

LAPLACE’S LAW, LIDSTONE’S LAW AND THE JEFFREYS-PERKS LAW
Laplace’s law (1814; 1995):
$P_{LAP}(w_1 \cdots w_n) = \frac{C(w_1 \cdots w_n) + 1}{N + B} = \frac{r + 1}{N + B}$
Add a little bit of probability space to unseen events.

Word (N = 79; B = seen (51) + unseen (70) = 121)

                          MLE          Laplace’s law
P(the)                    0.0886076    0.0400000
P(bigram | the)           0.2857143    0.0050951
P(model | the, bigram)    1.0000000    0.0083089

Pages 202-203 (vocabulary yielded by the Associated Press [AP] newswire):
• Laplace’s law adds a little probability space for unseen events, but it adds too much.
• With 44 million words the vocabulary has 400,653 words, which gives about 160,000,000,000 possible bigrams.
• The number of bins therefore becomes larger than the number of training instances.
• Laplace’s law puts B in the denominator to reserve probability space for unseen events, but as a result about 46.5% of the probability space goes to unseen events.
• $N_0 \times P_{LAP}(\cdot) = 74{,}671{,}100{,}000 \times \frac{0.000137}{22{,}000{,}000} = 0.465$

Lidstone’s law (1920) and the Jeffreys-Perks law (1973):
$P_{Lid}(w_1 \cdots w_n) = \frac{C(w_1 \cdots w_n) + \lambda}{N + B\lambda}$
• Lidstone’s law: add some positive value $\lambda$.
• Jeffreys-Perks law: $\lambda = 0.5$, also called ELE (Expected Likelihood Estimation).

LIDSTONE’S LAW
Using Lidstone’s law, instead of adding one, add some smaller value $\lambda$, where the parameter $\lambda$ is in the range $0 < \lambda < 1$. Equivalently,
$P_{Lid}(w_1 \cdots w_n) = \mu \frac{C(w_1 \cdots w_n)}{N} + (1 - \mu)\frac{1}{B}$, where $\mu = \frac{N}{N + B\lambda}$.
Here, $\lambda = 0$ gives the maximum likelihood estimate, $\lambda = 1$ gives Laplace’s law, and if $\lambda$ tends to $\infty$ we have the uniform estimate $\frac{1}{B}$. $\mu$ represents the trust we have in relative frequencies.
$\lambda < 1$ implies more trust in relative frequencies than Laplace’s law, while $\lambda > 1$ represents less trust in relative frequencies. In practice, people use values of $\lambda$ in the range $0 < \lambda < 1$, a common value being $\lambda = 0.5$ (the Jeffreys-Perks law).

JEFFREYS-PERKS LAW
Using Lidstone’s law with different values of $\lambda$:

     MLE        Lidstone’s law   Jeffreys-Perks   Lidstone’s law   Laplace’s law   Lidstone’s law
     (λ = 0)    (λ = 0.3)        (λ = 0.5)        (λ = 0.7)        (λ = 1)         (λ = 2)
A    0.0886     0.0633           0.0538           0.0470           0.0400          0.0280
B    0.2857     0.0081           0.0063           0.0056           0.0051          0.0049
C    1.0000     0.0084           0.0085           0.0083           0.0083          0.0083
* A: P(the), B: P(bigram | the), C: P(model | the, bigram)

HELD OUT ESTIMATION (JELINEK AND MERCER, 1985)
For each n-gram $w_1 \cdots w_n$, let:
$C_1(w_1 \cdots w_n)$ = frequency of $w_1 \cdots w_n$ in training data
$C_2(w_1 \cdots w_n)$ = frequency of $w_1 \cdots w_n$ in held out data
Let $T_r$ be the total number of times that all n-grams that appeared r times in the training text appeared in the held out data:
$T_r = \sum_{\{w_1 \cdots w_n : C_1(w_1 \cdots w_n) = r\}} C_2(w_1 \cdots w_n)$
An estimate for the probability of one of these n-grams is:
$P_{ho}(w_1 \cdots w_n) = \frac{T_r}{N_r N}$, where $C(w_1 \cdots w_n) = r$.

Example (1-gram): full text (unseen words: unknown); training data (unseen words: 70); held out data (unseen words: 51 - ...).

• The idea is to find out how often a bigram that occurred r times in the training text occurs in additionally sampled text (further text).
• Held out estimation: a method for predicting how often a bigram that appeared r times in the training text will appear in more text.
• Test data (independent of the training data) is only 5-10% of the total data, but that is enough to be reliable.
• We want to divide the data into training data and test data (data we have examined and data we have not).
• Held out data (10%): set aside so that the n-grams can be estimated by held-out estimation.
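The held out estimate above can be sketched in a few lines of Python. This is a minimal illustration, not the slides' own code: the function names, the toy inputs, and the `num_bins` parameter (B, supplied by the caller) are assumptions, and N is taken here to be the number of n-gram instances in the held out data so that the estimates sum to one over all bins.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return the n-grams of a token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def held_out_estimates(train_tokens, heldout_tokens, n, num_bins):
    """Return {r: P_ho} for each training frequency r with N_r > 0.

    num_bins is B, the number of possible n-grams (an assumption
    supplied by the caller, needed to count the unseen bins N_0).
    """
    c_train = Counter(ngrams(train_tokens, n))
    c_held = Counter(ngrams(heldout_tokens, n))
    # Assumption: N is the number of held out n-gram instances.
    n_held = sum(c_held.values())

    # N_r: number of n-gram types with training frequency r.
    n_r = Counter(c_train.values())
    n_r[0] = num_bins - len(c_train)

    # T_r: held out occurrences of n-grams with training frequency r.
    t_r = Counter()
    for gram, count in c_held.items():
        t_r[c_train.get(gram, 0)] += count

    return {r: t_r[r] / (n_r[r] * n_held) for r in n_r if n_r[r] > 0}


# Toy usage: unigrams over a hypothetical five-word vocabulary.
estimates = held_out_estimates("a b a b c".split(), "a b c d".split(),
                               n=1, num_bins=5)
print(estimates)
```

For this toy input each type seen twice or once in training gets 0.25, and each of the two unseen types gets 0.125, which sums to one over the five bins.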
CROSS-VALIDATION (DELETED ESTIMATION; JELINEK AND MERCER, 1985)
• Use data for both training and validation.
• Divide the training data into 2 parts, A and B.
• Train on A, validate on B (Model 1).
• Train on B, validate on A (Model 2).
• Combine the two models: Model 1 + Model 2 = Final Model.

            A           B
Model 1     train       validate
Model 2     validate    train

Cross validation: the training data is used both as initial training data and as held out data.
On large training corpora, deleted estimation works better than held-out estimation.
$P_{ho}(w_1 \cdots w_n) = \frac{T_r^{01}}{N_r^0 N}$ or $\frac{T_r^{10}}{N_r^1 N}$, where $C(w_1 \cdots w_n) = r$
$P_{del}(w_1 \cdots w_n) = \frac{T_r^{01} + T_r^{10}}{N (N_r^0 + N_r^1)}$, where $C(w_1 \cdots w_n) = r$

Example: full text (unseen words: unknown); training data (unseen words: 70); A-part (unseen words: 101); B-part (unseen words: 90 - ...). With the roles swapped: B-part (unseen words: 90 - ...); A-part (unseen words: 101 + ...). [Result]

• Following the idea of held out estimation, we get the same effect by dividing the training data into two parts. This method is called cross validation.
• It is the more effective method: combining the two estimates reduces the difference between $N_r^0$ and $N_r^1$.
• On a large training corpus, deleted estimation is more reliable than held-out estimation.
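The deleted estimate $P_{del}$ can be sketched as follows. This is only an illustration under stated assumptions: the function names and the `num_bins` parameter (B) are hypothetical, and N is taken to be the average number of n-gram instances per part, which presumes two parts of roughly equal size.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return the n-grams of a token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def count_stats(train_counts, other_counts, num_bins):
    """Return (N_r, T_r) when training on one part and validating on the other."""
    n_r = Counter(train_counts.values())
    n_r[0] = num_bins - len(train_counts)  # bins unseen in the training part
    t_r = Counter()
    for gram, count in other_counts.items():
        t_r[train_counts.get(gram, 0)] += count
    return n_r, t_r


def deleted_estimates(part0_tokens, part1_tokens, n, num_bins):
    """Return {r: P_del} combining both train/validate directions."""
    c0 = Counter(ngrams(part0_tokens, n))
    c1 = Counter(ngrams(part1_tokens, n))
    n_r0, t_r01 = count_stats(c0, c1, num_bins)  # train on 0, validate on 1
    n_r1, t_r10 = count_stats(c1, c0, num_bins)  # train on 1, validate on 0
    # Assumption: N is the average number of n-gram instances per part.
    big_n = (sum(c0.values()) + sum(c1.values())) / 2

    estimates = {}
    for r in set(n_r0) | set(n_r1):
        types = n_r0[r] + n_r1[r]
        if types > 0:
            estimates[r] = (t_r01[r] + t_r10[r]) / (big_n * types)
    return estimates


# Toy usage: unigrams over a hypothetical five-word vocabulary.
print(deleted_estimates("a b a b c".split(), "a b c d a".split(),
                        n=1, num_bins=5))
```

Pooling the counts from both directions is what smooths out the difference between $N_r^0$ and $N_r^1$: a frequency class that is sparse in one part borrows evidence from the other.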
GOOD-TURING ESTIMATION (GOOD, 1953): [BINOMIAL DISTRIBUTION]
Idea: re-estimate the probability mass assigned to n-grams with zero counts.
Adjust actual counts $r$ to expected counts $r^*$ with the formula
$P_{GT} = \frac{r^*}{N}$, $r^* = (r + 1)\frac{E(N_{r+1})}{E(N_r)}$
($r^*$ is an adjusted frequency; $E$ denotes the expectation of a random variable.)

If $C(w_1^n) = 0$: $P_{GT}(w_1^n) = \frac{1 - \sum_{r=1}^{\infty} N_r \frac{r^*}{N}}{N_0} \approx \frac{N_1}{N_0 N}$
If $C(w_1^n) = r > 0$: $P_{GT}(w_1^n) = \frac{r^*}{N} = \frac{(r + 1)}{N}\frac{E(N_{r+1})}{E(N_r)}$
• When $r$ is small: $r > r^*$
• When $r$ is large: $r \approx r^*$
So counts that were over-estimated are adjusted downward.

NOTE
$P_{MLE}(w_1^n) = \frac{C(w_1^n)}{N} = \frac{r}{N}$. Drawback: it over-estimates.
[Two discounting models] (Ney and Essen, 1993; Ney et al., 1994)
Absolute discounting:
$P_{abs}(w_1^n) = \begin{cases} \frac{r - \delta}{N} & \text{if } r > 0 \\ \frac{(B - N_0)\delta}{N_0 N} & \text{if } r = 0 \end{cases}$
The over-estimated counts are lowered by $\delta$, and the mass removed is given as probability to the unseen data.
Linear discounting:
$P(w_1^n) = \begin{cases} \frac{(1 - \alpha) r}{N} & \text{if } r > 0 \\ \frac{\alpha}{N_0} & \text{if } r = 0 \end{cases}$
$r$ is scaled by $1 - \alpha$, and the mass $\alpha$ removed is given as probability to the unseen data.

NOTE
$P_{MLE}(w_1^n) = \frac{C(w_1^n)}{N} = \frac{r}{N}$. Drawback: it over-estimates.
[Natural Law of Succession] (Ristad, 1995)
$P_{NLS}(w_1^n) = \begin{cases} \frac{r + 1}{N + B} & \text{if } N_0 = 0 \\ \frac{(r + 1)(N + 1 + N_0 - B)}{N^2 + N + 2(B - N_0)} & \text{if } N_0 > 0 \text{ and } r > 0 \\ \frac{(B - N_0)(B - N_0 + 1)}{N_0 (N^2 + N + 2(B - N_0))} & \text{otherwise} \end{cases}$

COMBINING ESTIMATORS
Basic idea:
• Consider how to combine multiple probability estimates from various different models.
• How can you develop a model to utilize different length n-grams as appropriate?
Simple linear interpolation:
$P_{Li}(w_n \mid w_{n-2}^{n-1}) = \gamma_1 P_1(w_n) + \gamma_2 P_2(w_n \mid w_{n-1}) + \gamma_3 P_3(w_n \mid w_{n-2}^{n-1})$
where $0 \le \gamma_i \le 1$ and $\sum_i \gamma_i = 1$.
This is a combination of trigram, bigram and unigram models.

[Katz’s backing-off] (Katz, 1987)
$P_{back}(w_i \mid w_{i-n+1}^{i-1}) = \begin{cases} d_{w_{i-n+1}^{i-1}} \frac{C(w_{i-n+1}^{i})}{C(w_{i-n+1}^{i-1})} & \text{if } C(w_{i-n+1}^{i}) > k \\ \alpha_{w_{i-n+1}^{i-1}} P_{back}(w_i \mid w_{i-n+2}^{i-1}) & \text{otherwise} \end{cases}$
Example:
$P_{back}(w_i \mid w_{i-2}^{i-1}) = \begin{cases} P(w_i \mid w_{i-2}^{i-1}) & \text{if } C(w_{i-2}^{i}) > 0 \\ \alpha_1 P(w_i \mid w_{i-1}) & \text{if } C(w_{i-2}^{i}) = 0 \text{ and } C(w_{i-1}^{i}) > 0 \\ \alpha_2 P(w_i) & \text{otherwise} \end{cases}$

If a sequence is unseen: use a shorter sequence.
Ex.
If P(San | going to) = 0, use P(San | to).
$P_{back}(w_n \mid w_i^{n-1}) = \begin{cases} \tau(w_n \mid w_i^{n-1}) & \text{if } c(w_i^n) > 0 \\ \Lambda \cdot P_{back}(w_n \mid w_{i+1}^{n-1}) & \text{if } c(w_i^n) = 0 \end{cases}$
($\tau$: higher order prob.; $\Lambda$: weight; $P_{back}(w_n \mid w_{i+1}^{n-1})$: lower order prob.)

[General linear interpolation]
$P_{Li}(w \mid h) = \sum_{i=1}^{k} \gamma_i(h) P_i(w \mid h)$
where $0 \le \gamma_i(h) \le 1$ and $\sum_i \gamma_i(h) = 1$.

Interpolated smoothing
$P_{inter}(w_n \mid w_i^{n-1}) = \tau(w_n \mid w_i^{n-1}) + \Lambda \cdot P_{inter}(w_n \mid w_{i+1}^{n-1})$
($\tau$: higher order prob.; $\Lambda$: weight; $P_{inter}(w_n \mid w_{i+1}^{n-1})$: lower order prob.)
Seems to work better than back-off smoothing.

NOTE
Witten-Bell smoothing
$P_{WB}(w_i \mid w_{i-n+1}^{i-1}) = \Lambda_{w_{i-n+1}^{i-1}} P_{MLE}(w_i \mid w_{i-n+1}^{i-1}) + (1 - \Lambda_{w_{i-n+1}^{i-1}}) P_{WB}(w_i \mid w_{i-n+2}^{i-1})$
$= \frac{c(w_{i-n+1}^{i}) + N_{1+}(w_{i-n+1}^{i-1}, *) \, P_{WB}(w_i \mid w_{i-n+2}^{i-1})}{\sum_{w_i} c(w_{i-n+1}^{i}) + N_{1+}(w_{i-n+1}^{i-1}, *)}$
where $N_{1+}(w_{i-n+1}^{i-1}, *) = |\{w_i : c(w_{i-n+1}^{i-1}, w_i) > 0\}|$

NOTE
Absolute discounting
• Like Jelinek-Mercer, it involves interpolation of higher- and lower-order models.
• But instead of multiplying the higher-order $P_{MLE}$ by a $\lambda$, we subtract a fixed discount $\delta \in [0, 1]$ from each nonzero count:
$P_{abs}(w_i \mid w_{i-n+1}^{i-1}) = \frac{\max\{c(w_{i-n+1}^{i}) - \delta, 0\}}{\sum_{w_i} c(w_{i-n+1}^{i})} + (1 - \lambda_{w_{i-n+1}^{i-1}}) P_{abs}(w_i \mid w_{i-n+2}^{i-1})$
• To make it sum to 1:
$1 - \lambda_{w_{i-n+1}^{i-1}} = \frac{\delta}{\sum_{w_i} c(w_{i-n+1}^{i})} N_{1+}(w_{i-n+1}^{i-1}, *)$
• Choose $\delta$ using held-out estimation.

NOTE
KN (Kneser-Ney) smoothing (1995)
• An extension of absolute discounting with a clever way of constructing the lower-order (backoff) model.
• Idea: the lower-order model is significant only when the count is small or zero in the higher-order model, and so should be optimized for that purpose.
$P_{KN}(w_i \mid w_{i-n+1}^{i-1}) = \frac{\max\{c(w_{i-n+1}^{i}) - D, 0\}}{\sum_{w_i} c(w_{i-n+1}^{i})} + \frac{D}{\sum_{w_i} c(w_{i-n+1}^{i})} N_{1+}(w_{i-n+1}^{i-1}, *) \, P_{KN}(w_i \mid w_{i-n+2}^{i-1})$

NOTE
An empirical study of smoothing techniques for language modeling (1999)
For a bigram model, we would like to select a smoothed distribution
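$P_{KN}$ that satisfies the following constraint on unigram marginals for all $w_i$:
(1) Constraint:
$\sum_{w_{i-1}} P_{KN}(w_{i-1}, w_i) = \frac{c(w_i)}{\sum_{w_i} c(w_i)}$
(2) From (1):
$\frac{c(w_i)}{\sum_{w_i} c(w_i)} = \sum_{w_{i-1}} P_{KN}(w_i \mid w_{i-1}) P(w_{i-1})$
(3) From (2), taking $P(w_{i-1}) = \frac{c(w_{i-1})}{\sum_{w_{i-1}} c(w_{i-1})}$:
$c(w_i) = \sum_{w_{i-1}} P_{KN}(w_i \mid w_{i-1}) \, c(w_{i-1})$

NOTE
Substituting the KN bigram estimate into (3):
$c(w_i) = \sum_{w_{i-1}} c(w_{i-1}) \left[ \frac{\max\{c(w_{i-1}, w_i) - D, 0\}}{\sum_{w_i} c(w_{i-1}, w_i)} + \frac{D}{\sum_{w_i} c(w_{i-1}, w_i)} N_{1+}(w_{i-1}, *) \, P_{KN}(w_i) \right]$
$= \sum_{w_{i-1} : c(w_{i-1}, w_i) > 0} c(w_{i-1}) \frac{c(w_{i-1}, w_i) - D}{c(w_{i-1})} + \sum_{w_{i-1}} c(w_{i-1}) \frac{D}{c(w_{i-1})} N_{1+}(w_{i-1}, *) \, P_{KN}(w_i)$
$= c(w_i) - D \, N_{1+}(*, w_i) + D \, P_{KN}(w_i) \sum_{w_{i-1}} N_{1+}(w_{i-1}, *)$

NOTE
$N_{1+}(*, w_i) = |\{w_{i-1} : c(w_{i-1}, w_i) > 0\}|$
$N_{1+}(*, *) = \sum_{w_{i-1}} N_{1+}(w_{i-1}, *) = |\{(w_{i-1}, w_i) : c(w_{i-1}, w_i) > 0\}| = \sum_{w_i} N_{1+}(*, w_i)$
Solving the equation above for $P_{KN}(w_i)$ gives
$P_{KN}(w_i) = \frac{N_{1+}(*, w_i)}{N_{1+}(*, *)}$

NOTE
Generalizing to higher-order models, we have that
$P_{KN}(w_i \mid w_{i-n+2}^{i-1}) = \frac{N_{1+}(*, w_{i-n+2}^{i})}{N_{1+}(*, w_{i-n+2}^{i-1}, *)}$
where
$N_{1+}(*, w_{i-n+2}^{i}) = |\{w_{i-n+1} : c(w_{i-n+1}^{i}) > 0\}|$
$N_{1+}(*, w_{i-n+2}^{i-1}, *) = |\{(w_{i-n+1}, w_i) : c(w_{i-n+1}^{i}) > 0\}| = \sum_{w_i} N_{1+}(*, w_{i-n+2}^{i})$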
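To make the bigram case concrete, here is a small Python sketch of interpolated Kneser-Ney smoothing built from the two formulas above: the discounted bigram term plus $D \cdot N_{1+}(w_{i-1}, *) / c(w_{i-1})$ times the continuation probability $N_{1+}(*, w_i) / N_{1+}(*, *)$. The function name `train_kn_bigram`, the default discount of 0.75, and the treatment of unseen histories are assumptions made for illustration; they are not taken from the slides.

```python
from collections import Counter, defaultdict


def train_kn_bigram(tokens, discount=0.75):
    """Return p(word, prev): an interpolated Kneser-Ney bigram estimate."""
    bigram_counts = Counter(zip(tokens, tokens[1:]))
    history_counts = Counter()       # sum over w_i of c(w_{i-1}, w_i)
    followers = defaultdict(set)     # distinct words seen after a history
    predecessors = defaultdict(set)  # distinct histories seen before a word
    for (prev, word), count in bigram_counts.items():
        history_counts[prev] += count
        followers[prev].add(word)
        predecessors[word].add(prev)
    num_bigram_types = len(bigram_counts)  # N_{1+}(*, *)

    def p_continuation(word):
        # P_KN(w_i) = N_{1+}(*, w_i) / N_{1+}(*, *)
        return len(predecessors.get(word, ())) / num_bigram_types

    def p_kn(word, prev):
        total = history_counts.get(prev, 0)
        if total == 0:
            # Assumption of this sketch: for a history never seen as a
            # bigram prefix, use the continuation distribution alone.
            return p_continuation(word)
        discounted = max(bigram_counts.get((prev, word), 0) - discount, 0) / total
        backoff_weight = discount * len(followers[prev]) / total
        return discounted + backoff_weight * p_continuation(word)

    return p_kn


# Toy usage on the first sentence of the training paragraph.
text = "the bigram model uses the preceding word to help predict the next word"
p = train_kn_bigram(text.split())
print(p("bigram", "the"), p("word", "the"))
```

Because the continuation probabilities sum to one over the vocabulary and the backoff weight equals exactly the mass removed by discounting, the estimates for any seen history sum to one; an unseen bigram such as ("the", "word") still receives nonzero probability through the continuation term.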