N-Gram

Foundations of Statistical Natural Language Processing
CHAPTER 6. STATISTICAL INFERENCE: N-GRAM MODELS OVER SPARSE DATA
Pusan National University
2014. 4. 22
Myoungjin, Jung
INTRODUCTION

Objective of Statistical NLP

Perform statistical inference for natural language.

Statistical inference (broadly, two steps):
1. Take some data generated by an unknown probability distribution (a corpus is needed).
2. Make some inferences about this distribution (infer the probability distribution from that corpus).

This divides the problem into three areas (the three steps of statistical language processing):
1. Dividing the training data into equivalence classes.
2. Finding a good statistical estimator for each equivalence class.
3. Combining multiple estimators.
BINS : FORMING EQUIVALENCE CLASSES

Reliability vs Discrimination
Ex)“large green ___________”
tree? mountain? frog? car?
“swallowed the large green ________”
pill? broccoli?

smaller n: more instances in training data, better statistical
estimates (more reliability)

larger n: more information about the context of the specific
instance (greater discrimination)
BINS : FORMING EQUIVALENCE CLASSES

N-gram models
An “n-gram” is a sequence of n words.
Predicting the next word: $P(w_n \mid w_1, \ldots, w_{n-1})$
Markov assumption: only the prior local context (the last few words) affects the next word.
Selecting an n (vocabulary size = 20,000 words):

n             Number of bins
2 (bigrams)   400,000,000
3 (trigrams)  8,000,000,000,000
4 (4-grams)   1.6 x 10^17
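These bin counts are just V^n; a quick way to reproduce the figures (a minimal sketch in plain Python, using the slide's assumed 20,000-word vocabulary):

# Number of bins = V**n possible n-grams over a V-word vocabulary.
V = 20_000
for n in (2, 3, 4):
    print(n, V ** n)   # 400000000, 8000000000000, 160000000000000000 (= 1.6 x 10^17)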
BINS : FORMING EQUIVALENCE CLASSES
Probability distribution P(s), where s is a sentence.
Ex.
P(If you’re going to San Francisco, be sure ...)
= P(If) * P(you’re | If) * P(going | If you’re) * P(to | If you’re going) * ...

Markov assumption: only the last n-1 words are relevant for a prediction.
Ex. with n = 5:
P(sure | If you’re going to San Francisco, be)
= P(sure | San Francisco, be)
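A minimal sketch of this chain-rule factorization with the Markov truncation; cond_prob is a stand-in for whatever conditional estimator gets plugged in (it is not anything from the textbook):

def sentence_prob(tokens, cond_prob, n=5):
    """P(s) under an n-gram Markov assumption: each word is conditioned on
    at most its n-1 predecessors. cond_prob(word, context) -> P(word | context)."""
    p = 1.0
    for i, w in enumerate(tokens):
        context = tuple(tokens[max(0, i - (n - 1)):i])
        p *= cond_prob(w, context)
    return p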
BINS : FORMING EQUIVALENCE CLASSES
N-gram: a sequence of length n, with a count.
Ex. 5-gram: “If you’re going to San”

Sequence notation: $w_1^{i-1} = w_1, \ldots, w_{i-1}$

Markov assumption, formalized:
$P(w_i \mid w_1, \ldots, w_{i-1}) = P(w_i \mid w_1^{i-1}) \approx P(w_i \mid w_{i-n+1}^{i-1})$   (conditioning on only the last n-1 words)
BINS : FORMING EQUIVALENCE CLASSES

Instead of P(s), we need only one conditional probability, $P(w_i \mid w_{i-n+1}^{i-1})$,
which we simplify to $P(w_n \mid w_1^{n-1})$ (a word and its n-1 preceding words).

Next-word prediction:
$\mathrm{NWP}(w_1^{n-1}) = \arg\max_{w_n \in W} P(w_n \mid w_1^{n-1})$
where W is the set of all words in the corpus.
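A minimal sketch of this NWP argmax; the "model" here is just an assumed dictionary mapping (context, word) pairs to probabilities, with made-up numbers:

def next_word_prediction(context, model, vocab):
    """Return the w in vocab maximizing P(w | context) under `model`."""
    return max(vocab, key=lambda w: model.get((context, w), 0.0))

# toy usage
model = {(("going", "to"), "San"): 0.4, (("going", "to"), "bed"): 0.3}
print(next_word_prediction(("going", "to"), model, {"San", "bed", "the"}))   # -> San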
BINS : FORMING EQUIVALENCE CLASSES

Ex. The easiest way (relative frequencies):
$P_{MLE}(w_n \mid w_1^{n-1}) = \dfrac{P(w_1^n)}{P(w_1^{n-1})} = \dfrac{C(w_1^n)}{C(w_1^{n-1})}$

P(San | If you’re going to)
= P(If you’re going to San) / P(If you’re going to)
= C(If you’re going to San) / C(If you’re going to)
STATISTICAL ESTIMATORS

Given the observed training data:
How do you develop a model (a probability distribution) to predict future events, i.e., obtain better probability estimates?

Probability estimate of the target feature:
$P(w_n \mid w_1 \cdots w_{n-1}) = \dfrac{P(w_1 \cdots w_n)}{P(w_1 \cdots w_{n-1})}$

This means estimating the unknown probability distribution of n-grams.
STATISTICAL ESTIMATORS

Notation for the statistical estimation chapter:

N              Number of training instances
B              Number of bins the training instances are divided into
w_1^n          An n-gram w_1...w_n in the training text
C(w_1...w_n)   Frequency of the n-gram w_1...w_n in the training text
r              Frequency of an n-gram
f(.)           Frequency estimate of a model
N_r            Number of bins that have r training instances in them
T_r            Total count of n-grams of frequency r in further data
h              ‘History’ of preceding words
STATISTICAL ESTIMATORS

Example - Instances in the training corpus:
“inferior to ________”
MAXIMUM LIKELIHOOD ESTIMATION (MLE)



Definition
Use the relative frequency as a probability estimate.

Example:
In the corpus we found 10 training instances of “comes across”.
8 times it was followed by “as”: P(as) = 0.8
Once each by “more” and “a”: P(more) = 0.1, P(a) = 0.1
For any word x not among these three: P(x) = 0.0

Formula
$P_{MLE}(w_1 \cdots w_n) = \dfrac{C(w_1 \cdots w_n)}{N} = \dfrac{r}{N}$
$P_{MLE}(w_n \mid w_1 \cdots w_{n-1}) = \dfrac{C(w_1 \cdots w_n)}{C(w_1 \cdots w_{n-1})}$
MAXIMUM LIKELIHOOD ESTIMATION (MLE)
Example 1. A paragraph used as training data:

“The bigram model uses the preceding word to help predict the next word. (End) In general, this helps enormously, and gives us a much better model. (End) In some cases the estimated probability of the word that actually comes next has gone up by about an order of magnitude (was, to, sisters). (End) However, note that the bigram model is not guaranteed to increase the probability estimate. (End)”

Word counts (N = 79 tokens; word types $w_1, \ldots, w_{51}$):
1-gram: C(the) = 7, C(bigram) = 2, C(model) = 3
2-gram: C(the, bigram) = 2
3-gram: C(the, bigram, model) = 2

P(the) = 7/79, P(bigram | the) = 2/7, P(model | the, bigram) = 2/2
P(bigram) = 2/79
LAPLACE’S LAW, LIDSTONE’S LAW AND THE JEFFREYS-PERKS LAW

Laplace’s law (1814; 1995):
$P_{LAP}(w_1 \cdots w_n) = \dfrac{C(w_1 \cdots w_n) + 1}{N + B} = \dfrac{r + 1}{N + B}$

Add a little bit of probability space to unseen events
LAPLACE’S LAW, LIDSTONE’S LAW AND THE JEFFREYS-PERKS LAW

Word probabilities from the example (N = 79; B = seen (51) + unseen (70) = 121):

                          MLE         Laplace’s law
P(the)                    0.0886076   0.0400000
P(bigram | the)           0.2857143   0.0050951
P(model | the, bigram)    1.0000000   0.0083089
LAPLACE’S LAW, LIDSTONE’S LAW AND THE JEFFREYS-PERKS LAW

Pages 202-203 (the Associated Press [AP] newswire vocabulary):
Laplace’s law adds a little probability space for unseen events, but it adds far too much.
The 44-million-word AP data has a vocabulary of 400,653 words, which gives about 160,000,000,000 possible bigrams,
so the number of bins is far larger than the number of training instances.
Laplace’s law puts B in the denominator to make room for unseen events, but as a result about 46.5% of the probability space ends up assigned to unseen events:
N_0 * P_LAP(unseen bigram) = 74,671,100,000 * 0.000137 / 22,000,000 ≈ 0.465
LAPLACE’S LAW, LIDSTONE’S LAW AND THE JEFFREYS-PERKS LAW

Lidstone’s law (1920) and the Jeffreys-Perks law (1973):
$P_{Lid}(w_1 \cdots w_n) = \dfrac{C(w_1 \cdots w_n) + \lambda}{N + B\lambda}$

Lidstone’s law: add some positive value $\lambda$.
Jeffreys-Perks law: $\lambda = 0.5$; also called ELE (Expected Likelihood Estimation).
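A minimal sketch of the whole Lidstone family; with λ = 1 it reproduces Laplace's law and with λ = 0.5 the Jeffreys-Perks (ELE) estimate. The printed values match the P(the) row of the running example (C = 7, N = 79, B = 121):

def lidstone(count, N, B, lam=0.5):
    """P_Lid(w_1..w_n) = (C(w_1..w_n) + lambda) / (N + B*lambda)."""
    return (count + lam) / (N + B * lam)

print(lidstone(7, 79, 121, lam=0.0))   # MLE:            7/79      ≈ 0.0886
print(lidstone(7, 79, 121, lam=0.5))   # Jeffreys-Perks: 7.5/139.5 ≈ 0.0538
print(lidstone(7, 79, 121, lam=1.0))   # Laplace:        8/200     = 0.0400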
LIDSTONE’S LAW


Using Lidstone’s law, instead of adding one we add some smaller positive value $\lambda$:
$P_{Lid}(w_1 \cdots w_n) = \dfrac{C(w_1 \cdots w_n) + \lambda}{N + B\lambda}$
where the parameter $\lambda$ is typically in the range $0 < \lambda < 1$.
LIDSTONE’S LAW

Here,
$\lambda = 0$ gives the maximum likelihood estimate,
$\lambda = 1$ gives Laplace’s law,
and as $\lambda \to \infty$ we tend to the uniform estimate $1/B$.

$\mu = \dfrac{N}{N + B\lambda}$ represents the trust we have in relative frequencies:
$\lambda < 1$ implies more trust in relative frequencies than Laplace’s law,
while $\lambda > 1$ represents less trust in relative frequencies.

In practice, people use values of $\lambda$ between 0 and 1, a common value being $\lambda = 0.5$ (the Jeffreys-Perks law).
JEFFREYS-PERKS LAW

Using Lidstone’s law with different values of λ:

                           MLE      Lidstone  Jeffreys-Perks  Lidstone  Laplace   Lidstone
                           (λ = 0)  (λ = 0.3) (λ = 0.5)       (λ = 0.7) (λ = 1)   (λ = 2)
A: P(the)                  0.0886   0.0633    0.0538          0.0470    0.0400    0.0280
B: P(bigram | the)         0.2857   0.0081    0.0063          0.0056    0.0051    0.0049
C: P(model | the, bigram)  1.0000   0.0084    0.0085          0.0083    0.0083    0.0083
HELD OUT ESTIMATION (JELINEK AND MERCER, 1985)

For each n-gram $w_1 \cdots w_n$, let:
$C_1(w_1 \cdots w_n)$ = frequency of $w_1 \cdots w_n$ in the training data
$C_2(w_1 \cdots w_n)$ = frequency of $w_1 \cdots w_n$ in the held out data

Let $T_r$ be the total number of times that all n-grams that appeared r times in the training text appeared in the held out data.

An estimate for the probability of one of these n-grams is:
$P_{ho}(w_1 \cdots w_n) = \dfrac{T_r}{N_r N}$, where $C(w_1 \cdots w_n) = r$.
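A minimal sketch of held-out estimation for bigrams; train and heldout are assumed token lists, and N is taken as the number of bigram training instances, following the notation slide:

from collections import Counter

def held_out_bigram_probs(train, heldout):
    """P_ho(n-gram with training count r) = T_r / (N_r * N)."""
    c_train = Counter(zip(train, train[1:]))
    c_held = Counter(zip(heldout, heldout[1:]))
    N = sum(c_train.values())                 # number of training instances
    N_r = Counter(c_train.values())           # bins with r training instances
    T_r = Counter()
    for bigram, r in c_train.items():
        T_r[r] += c_held[bigram]              # held out occurrences of those bins
    return {r: T_r[r] / (N_r[r] * N) for r in sorted(N_r)}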
HELD OUT ESTIMATION (JELINEK AND MERCER, 1985)

Worked example (1-gram): the example text is split into a training part and a held out part, and the held out estimates are computed from the frequencies in the two parts for each frequency class r.
HELD OUT ESTIMATION (JELINEK AND MERCER, 1985)

• The idea is to check how often the bigrams that occurred r times in the training text occur in additionally collected (further) text.
• Held out estimation: a method that predicts, for bigrams that appeared r times in the training text, how often they will appear in further text.
• Test data (independent of the training data) is only 5-10% of the total data, but that is enough to be reliable.
• We want to divide the data into training data and test data (validated and unvalidated data).
• Held out data (about 10%) are set aside, and the held-out estimates of the n-grams are computed from them.
CROSS-VALIDATION (DELETED ESTIMATION; JELINEK AND MERCER, 1985)

Use the data for both training and validation:
• Divide the training data into 2 parts (A and B)
• Train on A, validate on B → Model 1
• Train on B, validate on A → Model 2
• Combine the two models: Model 1 + Model 2 → Final Model
CROSS-VALIDATION (DELETED ESTIMATION; JELINEK AND MERCER, 1985)

Cross-validation: the training data is used both as
• initial training data
• held out data

On large training corpora, deleted estimation works better than held-out estimation.

$P_{ho}(w_1 \cdots w_n) = \dfrac{T_r^{01}}{N_r^0 N}$ or $\dfrac{T_r^{10}}{N_r^1 N}$, where $C(w_1 \cdots w_n) = r$

$P_{del}(w_1 \cdots w_n) = \dfrac{T_r^{01} + T_r^{10}}{N (N_r^0 + N_r^1)}$, where $C(w_1 \cdots w_n) = r$

Here $N_r^a$ is the number of n-grams occurring r times in part a of the training data, and $T_r^{ab}$ is the total number of occurrences in part b of the n-grams that occur r times in part a.
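A minimal sketch of deleted estimation for bigrams, under the same assumptions as the held-out sketch above, with N taken as the total number of bigram training instances over both halves:

from collections import Counter

def deleted_estimation_bigram_probs(part_a, part_b):
    """P_del(n-gram with count r) = (T_r^01 + T_r^10) / (N * (N_r^0 + N_r^1))."""
    c_a = Counter(zip(part_a, part_a[1:]))
    c_b = Counter(zip(part_b, part_b[1:]))
    N = sum(c_a.values()) + sum(c_b.values())
    N_r0, N_r1 = Counter(c_a.values()), Counter(c_b.values())
    T_r01, T_r10 = Counter(), Counter()
    for bigram, r in c_a.items():
        T_r01[r] += c_b[bigram]               # part-B counts of r-count bigrams from part A
    for bigram, r in c_b.items():
        T_r10[r] += c_a[bigram]
    return {r: (T_r01[r] + T_r10[r]) / (N * (N_r0[r] + N_r1[r]))
            for r in set(N_r0) | set(N_r1)}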


CROSS-VALIDATION (DELETED ESTIMATION; JELINEK AND MERCER, 1985)

Worked example: the training data is split into an A part and a B part, and the counts $N_r^0$, $N_r^1$, $T_r^{01}$, $T_r^{10}$ are computed for each frequency r.
CROSS-VALIDATION (DELETED ESTIMATION; JELINEK AND MERCER, 1985)

The two directions (A as training with B held out, and B as training with A held out) are then combined to give the deleted estimates $P_{del}$.
CROSS-VALIDATION (DELETED ESTIMATION; JELINEK AND MERCER, 1985)

• With the held-out estimation idea, we get the same effect by splitting the training data into two parts; this method is called cross-validation.
• Deleted estimation is the more effective method: by combining the two directions, it reduces the effect of the difference between $N_r^0$ and $N_r^1$.
• On a large training corpus, deleted estimation is more reliable than held-out estimation.
GOOD-TURING ESTIMATION (GOOD, 1953): [BINOMIAL DISTRIBUTION]

Idea: re-estimate the probability mass assigned to n-grams with zero counts.
Adjust the actual counts r to expected counts r* with the formula:

$P_{GT} = \dfrac{r^*}{N}$
$r^* = (r + 1)\dfrac{E(N_{r+1})}{E(N_r)}$

(r* is an adjusted frequency; E denotes the expectation of a random variable)
GOOD-TURING ESTIMATION (GOOD, 1953): [BINOMIAL DISTRIBUTION]

If $C(w_1^n) = 0$:
$P_{GT}(w_1^n) = \dfrac{1 - \sum_{r=1}^{\infty} N_r \, r^*/N}{N_0} \approx \dfrac{N_1}{N_0 N}$

If $C(w_1^n) = r > 0$:
$P_{GT}(w_1^n) = \dfrac{r^*}{N} = \dfrac{(r+1)}{N}\dfrac{E(N_{r+1})}{E(N_r)}$

When r is small: $r > r^*$
When r is large: $r \approx r^*$
So counts that were over-estimated are adjusted downward (discounted).
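A minimal sketch of Good-Turing reassignment that uses the raw N_r in place of the expectations E(N_r); a real implementation would first smooth the N_r (e.g. Simple Good-Turing), which is glossed over here:

from collections import Counter

def good_turing(counts, num_unseen_bins):
    """counts: dict n-gram -> r. Returns (P_GT for each seen n-gram,
    probability of a single unseen n-gram, approx. N_1 / (N_0 * N))."""
    N = sum(counts.values())
    N_r = Counter(counts.values())
    p_seen = {}
    for ngram, r in counts.items():
        if N_r.get(r + 1):
            r_star = (r + 1) * N_r[r + 1] / N_r[r]    # r* = (r+1) N_{r+1} / N_r
        else:
            r_star = r                                # no data for N_{r+1}: leave the count as is
        p_seen[ngram] = r_star / N
    p_unseen = N_r.get(1, 0) / (num_unseen_bins * N) if num_unseen_bins else 0.0
    return p_seen, p_unseen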
NOTE

$P_{MLE}(w_1^n) = \dfrac{C(w_1^n)}{N} = \dfrac{r}{N}$
Drawback: it over-estimates the probability of seen events (and gives none to unseen ones).

[Two discounting models] (Ney and Essen, 1993; Ney et al., 1994)

Absolute discounting:
$P_{abs}(w_1^n) = \dfrac{r - \delta}{N}$, if $r > 0$
$P_{abs}(w_1^n) = \dfrac{(B - N_0)\delta}{N_0 N}$, if $r = 0$
The over-estimated counts are lowered by $\delta$, and the mass removed is redistributed as probability over the unseen data.

Linear discounting:
$P(w_1^n) = \dfrac{(1 - \alpha) r}{N}$, if $r > 0$
$P(w_1^n) = \dfrac{\alpha}{N_0}$, if $r = 0$
The counts r are scaled by $1 - \alpha$, and the mass $\alpha$ removed is redistributed as probability over the unseen data.
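The two discounting schemes as small functions (δ and α are assumed, untuned parameters; B, N, N_0 as in the notation slide):

def absolute_discounting(r, N, B, N0, delta=0.5):
    """P_abs: subtract delta from each nonzero count; the freed mass is shared by the N0 unseen bins."""
    return (r - delta) / N if r > 0 else (B - N0) * delta / (N0 * N)

def linear_discounting(r, N, N0, alpha=0.1):
    """P_lin: scale nonzero counts by (1 - alpha); the mass alpha is shared by the N0 unseen bins."""
    return (1 - alpha) * r / N if r > 0 else alpha / N0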
NOTE

$P_{MLE}(w_1^n) = \dfrac{C(w_1^n)}{N} = \dfrac{r}{N}$
Drawback: over-estimates seen events.

[Natural Law of Succession] (Ristad, 1995)
$P_{NLS}(w_1^n) = \dfrac{r + 1}{N + B}$, if $N_0 = 0$
$P_{NLS}(w_1^n) = \dfrac{(r + 1)(N + 1 + N_0 - B)}{N^2 + N + 2(B - N_0)}$, if $N_0 > 0$ and $r > 0$
$P_{NLS}(w_1^n) = \dfrac{(B - N_0)(B - N_0 + 1)}{N_0 \,(N^2 + N + 2(B - N_0))}$, otherwise
COMBINING ESTIMATORS

Basic Idea
Consider how to combine probability estimates from various different models.
How can you develop a model that uses n-grams of different lengths as appropriate?

Simple linear interpolation:
$P_{Li}(w_n \mid w_{n-2}^{n-1}) = \gamma_1 P_1(w_n) + \gamma_2 P_2(w_n \mid w_{n-1}) + \gamma_3 P_3(w_n \mid w_{n-2}^{n-1})$
where $0 \le \gamma_i \le 1$ and $\sum_i \gamma_i = 1$.
A combination of trigram, bigram, and unigram estimates.
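A minimal sketch of simple linear interpolation, assuming three already-trained component models p_uni, p_bi, p_tri and fixed weights (in practice the γ's would be tuned on held out data, e.g. with EM):

def interpolated_prob(w, w_prev2, w_prev1, p_uni, p_bi, p_tri, gammas=(0.2, 0.3, 0.5)):
    """P_Li(w | w_prev2 w_prev1) = g1*P1(w) + g2*P2(w | w_prev1) + g3*P3(w | w_prev2 w_prev1)."""
    g1, g2, g3 = gammas                       # must sum to 1
    return g1 * p_uni(w) + g2 * p_bi(w, w_prev1) + g3 * p_tri(w, w_prev2, w_prev1)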
COMBINING ESTIMATORS

[Katz’s backing-off] (Katz, 1987)

$P_{back}(w_i \mid w_{i-n+1}^{i-1}) =$
  $d_{w_{i-n+1}^{i-1}} \cdot \dfrac{C(w_{i-n+1}^i)}{C(w_{i-n+1}^{i-1})}$,  if $C(w_{i-n+1}^i) > k$
  $\alpha_{w_{i-n+1}^{i-1}} \cdot P_{back}(w_i \mid w_{i-n+2}^{i-1})$,  otherwise

Example:
$P_{back}(w_i \mid w_{i-2}^{i-1}) =$
  $P(w_i \mid w_{i-2}^{i-1})$,  if $C(w_{i-2}^i) > 0$
  $\alpha_1 P(w_i \mid w_{i-1})$,  if $C(w_{i-2}^i) = 0$ and $C(w_{i-1}^i) > 0$
  $\alpha_2 P(w_i)$,  otherwise
COMBINING ESTIMATORS
[Katz’s backing-off] (Katz, 1987)
If a sequence is unseen, use a shorter sequence.
Ex. if P(San | going to) = 0, use P(San | to).

$P_{back}(w_n \mid w_i^{n-1}) =$
  $\tau(w_n \mid w_i^{n-1})$,  if $C(w_i^n) > 0$   (higher-order probability)
  $\Lambda \cdot P_{back}(w_n \mid w_{i+1}^{n-1})$,  if $C(w_i^n) = 0$   (weight x lower-order probability)
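A minimal recursive sketch of this back-off scheme, simplified: τ is taken to be the plain relative frequency and Λ a fixed constant, so, unlike Katz's full method, the result is not a properly normalized distribution:

from collections import Counter

def build_counts(tokens, max_n=3):
    """Counts of all n-grams of orders 0..max_n (order 0, the empty tuple, is the token total)."""
    counts = Counter({(): len(tokens)})
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts

def backoff_prob(w, context, counts, weight=0.4):
    """Use the higher-order relative frequency if the full n-gram was seen,
    otherwise back off to the shortened context with a fixed weight."""
    if counts[context + (w,)] > 0 and counts[context] > 0:
        return counts[context + (w,)] / counts[context]
    if not context:
        return 0.0                            # word unseen even as a unigram
    return weight * backoff_prob(w, context[1:], counts, weight)

tokens = "if you're going to san francisco be sure to wear flowers".split()
counts = build_counts(tokens, max_n=3)
print(backoff_prob("san", ("going", "to"), counts))   # trigram seen: 1/1 = 1.0
print(backoff_prob("san", ("sure", "to"), counts))    # backs off: 0.4 * P(san | to) = 0.4 * 0.5 = 0.2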
COMBINING ESTIMATORS

[General linear interpolation]
$P_{Li}(w \mid h) = \sum_{i=1}^{k} \gamma_i(h) \, P_i(w \mid h)$
where $0 \le \gamma_i(h) \le 1$ and $\sum_i \gamma_i(h) = 1$.
COMBINING ESTIMATORS
Interpolated smoothing:
$P_{inter}(w_n \mid w_i^{n-1}) = \tau(w_n \mid w_i^{n-1}) + \Lambda \cdot P_{inter}(w_n \mid w_{i+1}^{n-1})$
(higher-order probability + weight x lower-order probability)

Seems to work better than back-off smoothing.
NOTE

Witten-Bell smoothing:
$P_{WB}(w_i \mid w_{i-n+1}^{i-1}) = \lambda_{w_{i-n+1}^{i-1}} P_{MLE}(w_i \mid w_{i-n+1}^{i-1}) + (1 - \lambda_{w_{i-n+1}^{i-1}}) P_{WB}(w_i \mid w_{i-n+2}^{i-1})$

$= \dfrac{c(w_{i-n+1}^i) + N_{1+}(w_{i-n+1}^{i-1}, \ast) \, P_{WB}(w_i \mid w_{i-n+2}^{i-1})}{\sum_{w_i} c(w_{i-n+1}^i) + N_{1+}(w_{i-n+1}^{i-1}, \ast)}$

where $N_{1+}(w_{i-n+1}^{i-1}, \ast) = |\{w_i : c(w_{i-n+1}^{i-1}, w_i) > 0\}|$
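A minimal sketch of this recursion for the bigram case, where the lower-order model is taken to be just the unigram MLE (the Counters are assumed to be built from the training tokens as shown):

from collections import Counter

def witten_bell_bigram(w, prev, unigrams, bigrams, total):
    """P_WB(w | prev) = (c(prev,w) + N1+(prev,*) * P_uni(w)) / (sum_w c(prev,w) + N1+(prev,*))."""
    n1plus = len({w2 for (p, w2) in bigrams if p == prev})            # N1+(prev, *)
    ctx = sum(c for (p, _), c in bigrams.items() if p == prev)        # sum_w c(prev, w)
    p_lower = unigrams[w] / total                                     # unigram MLE back-off
    if ctx + n1plus == 0:
        return p_lower
    return (bigrams[(prev, w)] + n1plus * p_lower) / (ctx + n1plus)

tokens = "the bigram model uses the preceding word to help predict the next word".split()
print(witten_bell_bigram("bigram", "the",
                         Counter(tokens), Counter(zip(tokens, tokens[1:])), len(tokens)))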
NOTE








Absolute discounting (interpolated form)
Like Jelinek-Mercer smoothing, it interpolates higher- and lower-order models.
But instead of multiplying the higher-order $P_{MLE}$ by a $\lambda$, we subtract a fixed discount $\delta \in [0,1]$ from each nonzero count:

$P_{abs}(w_i \mid w_{i-n+1}^{i-1}) = \dfrac{\max\{c(w_{i-n+1}^i) - \delta, 0\}}{\sum_{w_i} c(w_{i-n+1}^i)} + (1 - \lambda_{w_{i-n+1}^{i-1}}) \, P_{abs}(w_i \mid w_{i-n+2}^{i-1})$

To make it sum to 1:
$(1 - \lambda_{w_{i-n+1}^{i-1}}) = \dfrac{\delta}{\sum_{w_i} c(w_{i-n+1}^i)} \, N_{1+}(w_{i-n+1}^{i-1}, \ast)$

Choose $\delta$ using held-out estimation.
NOTE

KN smoothing (Kneser-Ney, 1995)
An extension of absolute discounting with a clever way of constructing the lower-order (back-off) model.
Idea: the lower-order model matters only when the count in the higher-order model is small or zero, and so it should be optimized for that purpose.

$P_{KN}(w_i \mid w_{i-n+1}^{i-1}) = \dfrac{\max\{c(w_{i-n+1}^i) - D, 0\}}{\sum_{w_i} c(w_{i-n+1}^i)} + \dfrac{D}{\sum_{w_i} c(w_{i-n+1}^i)} \, N_{1+}(w_{i-n+1}^{i-1}, \ast) \, P_{KN}(w_i \mid w_{i-n+2}^{i-1})$
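A minimal sketch of this interpolated Kneser-Ney recursion for the bigram case, using the continuation-count unigram $P_{KN}(w_i) = N_{1+}(\ast, w_i) / N_{1+}(\ast, \ast)$ derived on the following slides and an assumed fixed discount D:

from collections import Counter

def kneser_ney_bigram(w, prev, bigrams, D=0.75):
    """P_KN(w | prev) = max(c(prev,w)-D, 0)/c(prev,.) + (D/c(prev,.)) * N1+(prev,*) * P_cont(w),
    with P_cont(w) = N1+(*, w) / N1+(*, *)."""
    ctx = sum(c for (p, _), c in bigrams.items() if p == prev)          # c(prev, .)
    n1plus_prev = len({w2 for (p, w2) in bigrams if p == prev})         # N1+(prev, *)
    p_cont = len({p for (p, w2) in bigrams if w2 == w}) / len(bigrams)  # N1+(*, w) / N1+(*, *)
    if ctx == 0:
        return p_cont
    return max(bigrams[(prev, w)] - D, 0) / ctx + (D / ctx) * n1plus_prev * p_cont

tokens = "the bigram model uses the preceding word to help predict the next word".split()
print(kneser_ney_bigram("bigram", "the", Counter(zip(tokens, tokens[1:]))))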
NOTE


“An empirical study of smoothing techniques for language modeling” (Chen and Goodman, 1999)
For a bigram model, we would like to select a smoothed distribution $P_{KN}$ that satisfies the following constraint on unigram marginals, for all $w_i$:

(1) $\sum_{w_{i-1}} P_{KN}(w_{i-1}, w_i) = \dfrac{c(w_i)}{\sum_{w_i} c(w_i)}$   (the constraint)

(2) From (1): $\sum_{w_{i-1}} P_{KN}(w_i \mid w_{i-1}) \, P(w_{i-1}) = \dfrac{c(w_i)}{\sum_{w_i} c(w_i)}$

(3) From (2), taking $P(w_{i-1}) = c(w_{i-1}) / \sum_{w_i} c(w_i)$:
$c(w_i) = \sum_{w_{i-1}} P_{KN}(w_i \mid w_{i-1}) \, c(w_{i-1})$
NOTE

Substituting the KN bigram form into (3), and using $\sum_{w_i} c(w_{i-1}, w_i) = c(w_{i-1})$:

$c(w_i) = \sum_{w_{i-1}} c(w_{i-1}) \left[ \dfrac{\max\{c(w_{i-1}, w_i) - D, 0\}}{\sum_{w_i} c(w_{i-1}, w_i)} + \dfrac{D}{\sum_{w_i} c(w_{i-1}, w_i)} N_{1+}(w_{i-1}, \ast) \, P_{KN}(w_i) \right]$

$= \sum_{w_{i-1}:\, c(w_{i-1}, w_i) > 0} c(w_{i-1}) \dfrac{c(w_{i-1}, w_i) - D}{c(w_{i-1})} + \sum_{w_{i-1}} c(w_{i-1}) \dfrac{D}{c(w_{i-1})} N_{1+}(w_{i-1}, \ast) \, P_{KN}(w_i)$

$= c(w_i) - D \, N_{1+}(\ast, w_i) + D \, P_{KN}(w_i) \sum_{w_{i-1}} N_{1+}(w_{i-1}, \ast)$
NOTE
$N_{1+}(\ast, w_i) = |\{w_{i-1} : c(w_{i-1}, w_i) > 0\}|$

$N_{1+}(\ast, \ast) = \sum_{w_{i-1}} N_{1+}(w_{i-1}, \ast) = |\{(w_{i-1}, w_i) : c(w_{i-1}, w_i) > 0\}| = \sum_{w_i} N_{1+}(\ast, w_i)$

Solving for the lower-order (unigram) distribution gives:
$P_{KN}(w_i) = \dfrac{N_{1+}(\ast, w_i)}{N_{1+}(\ast, \ast)}$
NOTE

Generalizing to higher-order models, we have:
$P_{KN}(w_i \mid w_{i-n+2}^{i-1}) = \dfrac{N_{1+}(\ast, w_{i-n+2}^i)}{N_{1+}(\ast, w_{i-n+2}^{i-1}, \ast)}$

where
$N_{1+}(\ast, w_{i-n+2}^i) = |\{w_{i-n+1} : c(w_{i-n+1}^i) > 0\}|$
$N_{1+}(\ast, w_{i-n+2}^{i-1}, \ast) = |\{(w_{i-n+1}, w_i) : c(w_{i-n+1}^i) > 0\}| = \sum_{w_i} N_{1+}(\ast, w_{i-n+2}^i)$