791-05-ngrams

COMP 791A: Statistical Language Processing
n-gram Models over Sparse Data
Chap. 6
"Shannon Game" (Shannon, 1951)

"I am going to make a collect …"

- Predict the next word given the n-1 previous words.
- Past behavior is a good guide to what will happen in the future, as there is regularity in language.
- Determine the probability of different sequences from a training corpus.
Language Modeling

- A statistical model of word/character sequences, used to predict the next character/word given the previous ones.
- Applications:
  - Speech recognition
  - Spelling correction
    - He is trying to fine out.
    - Hopefully, all with continue smoothly in my absence.
  - Optical character recognition / Handwriting recognition
  - Statistical Machine Translation
  - …
1st approximation

- Each word has an equal probability to follow any other.
  - With 100,000 words, the probability of each of them at any given point is 0.00001.
- But some words are more frequent than others…
  - In the Brown corpus:
    - "the" appears 69,971 times
    - "rabbit" appears 11 times
Remember Zipf's Law

f × r = k

[Plot of word frequency (Freq) against rank (Rank): frequency falls off sharply as rank increases.]
Frequency of frequencies

- Most words are rare (hapax legomena),
- but common words are very common.
n-grams

- Take into account the frequency of the word in some training corpus.
  - At any given point, "the" is more probable than "rabbit".
- But that is a bag-of-words approach…
  - In "Just then, the white …", "rabbit" is the better guess.
- So the probability of a word also depends on the previous words (the history):

    P(w_n | w_1 w_2 … w_{n-1})
Problems with n-grams

- "The large green ______ ."
  - "mountain"? "tree"?
- "Sue swallowed the large green ______ ."
  - "pill"? "broccoli"?
- Knowing that Sue "swallowed" helps narrow down the possibilities.
- But how far back do we look?
Reliability vs. Discrimination

- Larger n:
  - more information about the context of the specific instance
  - greater discrimination
  - BUT:
    - too costly
      - ex: for a vocabulary of 20,000 words:
        - number of bigrams = 400 million (20,000^2)
        - number of trigrams = 8 trillion (20,000^3)
        - number of four-grams = 1.6 × 10^17 (20,000^4)
    - too many chances that the history has never been seen before (data sparseness)
- Smaller n:
  - less precision
  - BUT:
    - more instances in training data, better statistical estimates
    - more reliability

--> Markov approximation: take only the most recent history
Markov assumption

- Markov Assumption:
  - we can predict the probability of some future item on the basis of a short history
  - if the history = the last n-1 words --> (n-1)th-order Markov model, or n-gram model
- Most widely used:
  - unigram (n=1)
  - bigram (n=2)
  - trigram (n=3)
Text generation with n-grams

- n-gram models trained on 40 million words from the WSJ (a small sampling sketch follows below)
- Unigram:
  - Months the my and issue of year foreign new exchange's September were recession exchange new endorsed a acquire to six executives.
- Bigram:
  - Last December through the way to preserve the Hudson corporation N.B.E.C. Taylor would seem to complete the major central planner one point five percent of U.S.E. has already old M. X. corporation of living on information such as more frequently fishing to keep her.
- Trigram:
  - They also point to ninety point six billion dollars from two hundred four oh six three percent of the rates of interest stores as Mexico and Brazil on market conditions.
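The sampling procedure behind such generated text can be sketched in a few lines of Python. This is only an illustration: the toy corpus, the <s> markers and the fixed seed below are made up, not the WSJ model of this slide.

  # A minimal sketch (not the slide's WSJ model): sample text from bigram counts.
  import random
  from collections import defaultdict

  corpus = "<s> I eat <s> I sleep <s> I eat lunch <s> I sleep well <s>".split()

  # follow[w1][w2] = C(w1 w2)
  follow = defaultdict(lambda: defaultdict(int))
  for w1, w2 in zip(corpus, corpus[1:]):
      follow[w1][w2] += 1

  def generate(start="<s>", length=10):
      """Pick each next word with probability proportional to its bigram count."""
      out = [start]
      for _ in range(length):
          nexts = follow[out[-1]]
          if not nexts:
              break
          words, counts = zip(*nexts.items())
          out.append(random.choices(words, weights=counts)[0])
      return " ".join(out)

  random.seed(0)
  print(generate())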
Bigrams

- First-order Markov models: P(w_n | w_{n-1})
- N-by-N matrix of probabilities/frequencies
  - N = size of the vocabulary we are modeling

Bigram counts (1st word = row, 2nd word = column):

              a    aardvark  aardwolf  aback  …  zoophyte  zucchini
  a           0    0         0         0      …  8         5
  aardvark    0    0         0         0      …  0         0
  aardwolf    0    0         0         0      …  0         0
  aback       26   1         6         0      …  12        2
  …           …    …         …         …      …  …         …
  zoophyte    0    0         0         1      …  0         0
  zucchini    0    0         0         3      …  0         0
Why use only bi- or tri-grams?

- The Markov approximation is still costly with a 20,000-word vocabulary:
  - a bigram model needs to store 400 million parameters
  - a trigram model needs to store 8 trillion parameters
  - using a language model beyond trigrams is impractical
- To reduce the number of parameters, we can:
  - do stemming (use stems instead of word types)
  - group words into semantic classes
  - treat words seen once the same as unseen words
  - ...
Building n-gram Models

- Data preparation:
  - Decide on a training corpus
  - Clean and tokenize
  - How do we deal with sentence boundaries? (see the sketch after this slide)
    - I eat. I sleep.
      - without markers: (I eat) (eat I) (I sleep)
    - <s> I eat <s> I sleep <s>
      - with markers: (<s> I) (I eat) (eat <s>) (<s> I) (I sleep) (sleep <s>)
- Use statistical estimators:
  - to derive good probability estimates based on the training data.
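As a rough illustration of the data-preparation step above, the following sketch inserts the <s> boundary marker and collects bigram counts for the two-sentence example; the tokenization is deliberately naive.

  # A small sketch: add <s> boundary markers, then collect bigram counts.
  from collections import Counter

  text = "I eat. I sleep."

  # Tokenize and replace sentence boundaries with the <s> marker
  tokens = ["<s>"]
  for word in text.replace(".", " . ").split():
      tokens.append("<s>" if word == "." else word)

  # tokens == ['<s>', 'I', 'eat', '<s>', 'I', 'sleep', '<s>']
  bigrams = Counter(zip(tokens, tokens[1:]))
  print(bigrams)   # (<s>, I): 2, (I, eat): 1, (eat, <s>): 1, (I, sleep): 1, (sleep, <s>): 1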
Statistical Estimators

- Maximum Likelihood Estimation (MLE)
- Smoothing
  - Add-one -- Laplace
  - Add-delta -- Lidstone's & Jeffreys-Perks' Laws (ELE)
  - ( Validation:
    - Held Out Estimation
    - Cross Validation )
  - Witten-Bell smoothing
  - Good-Turing smoothing
- Combining Estimators
  - Simple Linear Interpolation
  - General Linear Interpolation
  - Katz's Backoff
Statistical Estimators
--> Maximum Likelihood Estimation (MLE)
Maximum Likelihood Estimation

- Choose the parameter values that give the highest probability to the training corpus.
- Let C(w_1,...,w_n) be the frequency of the n-gram w_1,...,w_n:

    P_MLE(w_n | w_1,...,w_{n-1}) = C(w_1,...,w_n) / C(w_1,...,w_{n-1})
Example 1: P(event)

- In a training corpus, we have 10 instances of "come across":
  - 8 times, followed by "as"
  - 1 time, followed by "more"
  - 1 time, followed by "a"
- With MLE, we have (see the sketch below):
  - P(as | come across) = 0.8
  - P(more | come across) = 0.1
  - P(a | come across) = 0.1
  - P(X | come across) = 0, where X ∉ {"as", "more", "a"}
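A minimal sketch of the MLE estimate for this example, assuming the "come across" continuation counts have already been collected:

  # P_MLE(w | history) = C(history, w) / C(history), with the counts of this slide.
  from collections import Counter

  continuations = Counter({"as": 8, "more": 1, "a": 1})   # C(come across, w)
  history_count = sum(continuations.values())             # C(come across) = 10

  def p_mle(word):
      return continuations[word] / history_count

  print(p_mle("as"), p_mle("more"), p_mle("a"), p_mle("the"))   # 0.8 0.1 0.1 0.0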
Example 2: P(sequence of events)

Some bigram probabilities:

  <s> I     .25      I want     .32      want to   .65      to eat     .26
  <s> I'd   .06      I would    .29      want a    .05      to have    .14
  …                  I don't    .08      …                  to spend   .09
                     …                                      …

  eat on        .16      British food        .6
  eat some      .06      British restaurant  .15
  eat British   .001     …
  …

P(I want to eat British food)
  = P(I|<s>) × P(want|I) × P(to|want) × P(eat|to) × P(British|eat) × P(food|British)
  = .25 × .32 × .65 × .26 × .001 × .6
  = .000008
Some adjustments

- Product of probabilities… numerical underflow for long sentences.
- So instead of multiplying the probabilities, we add the logs of the probabilities (see the sketch below).

log P(I want to eat British food)
  = log(P(I|<s>)) + log(P(want|I)) + log(P(to|want)) + log(P(eat|to)) + log(P(British|eat)) + log(P(food|British))
  = log(.25) + log(.32) + log(.65) + log(.26) + log(.001) + log(.6)
  = -11.722
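A short sketch of the log-probability trick, reusing the bigram probabilities of Example 2 as a hard-coded dictionary:

  import math

  bigram_p = {("<s>", "I"): .25, ("I", "want"): .32, ("want", "to"): .65,
              ("to", "eat"): .26, ("eat", "British"): .001, ("British", "food"): .6}

  sentence = ["<s>", "I", "want", "to", "eat", "British", "food"]

  # Sum of logs instead of product of probabilities (avoids underflow)
  log_p = sum(math.log(bigram_p[pair]) for pair in zip(sentence, sentence[1:]))
  print(log_p)             # ~ -11.72
  print(math.exp(log_p))   # ~ 8.1e-06, the same .000008 as before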
Problem with MLE: data sparseness

- What if a sequence never appears in the training corpus? P(X) = 0
  - "come across the men"  --> prob = 0
  - "come across some men" --> prob = 0
  - "come across 3 men"    --> prob = 0
- MLE assigns a probability of zero to unseen events…
- The probability of an n-gram involving unseen words will be zero!
- But… most words are rare (Zipf's Law),
- so n-grams involving rare words are even more rare… data sparseness.
Problem with MLE: data sparseness (con't)

- In (Bahl et al., 1983):
  - training on 1.5 million words,
  - 23% of the trigrams from another part of the same corpus were previously unseen.
- In Shakespeare's works:
  - out of 844 million possible bigrams,
  - 99.96% were never used.
- So MLE alone is not a good enough estimator.
- Solution: smoothing
  - decrease the probability of previously seen events
  - so that there is a little bit of probability mass left over for previously unseen events
  - also called discounting
Discounting or Smoothing

- MLE is usually unsuitable for NLP because of the sparseness of the data.
- We need to allow for the possibility of seeing events not seen in training.
- We must use a discounting or smoothing technique.
- Decrease the probability of previously seen events to leave a little bit of probability for previously unseen events.
Statistical Estimators
--> Smoothing: Add-one -- Laplace
Many smoothing techniques

- Add-one
- Add-delta
- Witten-Bell smoothing
- Good-Turing smoothing
- Church-Gale smoothing
- Absolute discounting
- Kneser-Ney smoothing
- ...
Add-one Smoothing (Laplace's law)

- Pretend we have seen every n-gram at least once.
- Intuitively: new_count(n-gram) = old_count(n-gram) + 1
- The idea is to give a little bit of the probability space to unseen events.
Add-one: Example

Unsmoothed bigram counts (1st word = row, 2nd word = column):

            I     want   to    eat   Chinese  food   lunch  …   Total (N)
  I         8     1087   0     13    0        0      0          3437
  want      3     0      786   0     6        8      6          1215
  to        3     0      10    860   3        0      12         3256
  eat       0     0      2     0     19       2      52         938
  Chinese   2     0      0     0     0        120    1          213
  food      19    0      17    0     0        0      0          1506
  lunch     4     0      0     0     0        1      0          459
  …

Unsmoothed normalized bigram probabilities:

            I        want   to     eat     Chinese  food    lunch  …   Total
  I         .0023    .32    0      .0038   0        0       0          1
  want      .0025    0      .65    0       .0049    .0066   .0049      1
  to        .00092   0      .0031  .26     .00092   0       .0037      1
  eat       0        0      .0021  0       .020     .0021   .055       1
  Chinese   .0094    0      0      0       0        .56     .0047      1
  food      .013     0      .011   0       0        0       0          1
  lunch     .0087    0      0      0       0        .0022   0          1
  …

  (e.g. P(I|I) = 8/3437 = .0023 and P(eat|I) = 13/3437 = .0038)
Add-one: Example (con't)

Add-one smoothed bigram counts:

            I     want   to    eat   Chinese  food   lunch  …   Total (N+V)
  I         9     1088   1     14    1        1      1          5053
  want      4     1      787   1     7        9      7          2831
  to        4     1      11    861   4        1      13         4872
  eat       1     1      3     1     20       3      53         2554
  Chinese   3     1      1     1     1        121    2          1829
  food      20    1      18    1     1        1      1          3122
  lunch     5     1      1     1     1        2      1          2075
  …

Add-one normalized bigram probabilities:

            I        want    to      eat     Chinese  food    lunch   …   Total
  I         .0018    .22     .0002   .0028   .0002    .0002   .0002       1
  want      .0014    .00035  .28     .00035  .0025    .0032   .0025       1
  to        .00082   .00021  .0023   .18     .00082   .00021  .0027       1
  eat       .00039   .00039  .0012   .00039  .0078    .0012   .021        1
  Chinese   .0016    .00055  .00055  .00055  .00055   .066    .0011       1
  food      .0064    .00032  .0058   .00032  .00032   .00032  .00032      1
  lunch     .0024    .00048  .00048  .00048  .00048   .00096  .00048      1
  …

  (e.g. P(I|I) = 9/5053 = .0018 and P(eat|I) = 14/5053 = .0028)
Add-one, more formally

    P_Add1(w_1 w_2 … w_n) = ( C(w_1 w_2 … w_n) + 1 ) / ( N + V )

- N: nb of n-grams in the training corpus starting with w_1…w_{n-1}
- V: size of the vocabulary,
  i.e. nb of possible different n-grams starting with w_1…w_{n-1},
  i.e. nb of word types
The example again

Unsmoothed bigram counts (V = 1616 word types):

            I     want   to    eat   Chinese  food   lunch  …   Total (N)
  I         8     1087   0     13    0        0      0          3437
  want      3     0      786   0     6        8      6          1215
  to        3     0      10    860   3        0      12         3256
  eat       0     0      2     0     19       2      52         938
  Chinese   2     0      0     0     0        120    1          213
  food      19    0      17    0     0        0      0          1506
  lunch     4     0      0     0     0        1      0          459
  …

P(I eat)
  = ( C(I eat) + 1 ) / (nb of bigrams starting with "I" + nb of possible bigrams starting with "I")
  = (13 + 1) / (3437 + 1616)
  = 0.0028

(see the sketch below)
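A sketch of the add-one computation, assuming only the "I" row of the restaurant counts; any word missing from the row (e.g. the hypothetical "rabbit") falls back to a count of 0.

  # Add-one smoothing on the "I" row of the restaurant bigram counts.
  V = 1616                                             # vocabulary size (word types)
  N_I = 3437                                           # nb of bigram tokens starting with "I"
  row_I = {"I": 8, "want": 1087, "eat": 13}            # non-zero counts of bigrams "I w"

  def p_add1(word):
      return (row_I.get(word, 0) + 1) / (N_I + V)

  print(p_add1("eat"))      # (13 + 1) / (3437 + 1616) = 0.00277...
  print(p_add1("rabbit"))   # unseen: 1 / 5053 = 0.000198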
Problem with add-one smoothing

- Every previously unseen n-gram is given a low probability,
- but there are so many of them that too much probability mass is given to unseen events.
- Adding 1 to a frequent bigram does not change it much,
- but adding 1 to low-count bigrams (including unseen ones) boosts them too much!
- In NLP applications, which are very sparse, Laplace's Law actually gives far too much of the probability space to unseen events.
Problem with add-one smoothing

The counts of bigrams starting with "Chinese" are boosted by a factor of 8! (1829 / 213)

Unsmoothed bigram counts (1st word = row):

            I     want   to    eat   Chinese  food   lunch  …   Total (N)
  I         8     1087   0     13    0        0      0          3437
  want      3     0      786   0     6        8      6          1215
  to        3     0      10    860   3        0      12         3256
  eat       0     0      2     0     19       2      52         938
  Chinese   2     0      0     0     0        120    1          213
  food      19    0      17    0     0        0      0          1506
  lunch     4     0      0     0     0        1      0          459

Add-one smoothed bigram counts:

            I     want   to    eat   Chinese  food   lunch  …   Total (N+V)
  I         9     1088   1     14    1        1      1          5053
  want      4     1      787   1     7        9      7          2831
  to        4     1      11    861   4        1      13         4872
  eat       1     1      3     1     20       3      53         2554
  Chinese   3     1      1     1     1        121    2          1829
  food      20    1      18    1     1        1      1          3122
  lunch     5     1      1     1     1        2      1          2075
Problem with add-one smoothing (con't)

Data from the AP newswire (Church and Gale, 1991):

- Corpus of 22,000,000 bigrams
- Vocabulary of 273,266 words (i.e. 74,674,306,760 possible bigrams, or bins)
- 74,671,100,000 bigrams were unseen
- And each unseen bigram was given a frequency of 0.000295

  f_MLE                f_empirical              f_add-one
  (freq. from          (freq. from              (add-one
  training data)       held-out data)           smoothed freq.)
  0                    0.000027                 0.000295
  1                    0.448                    0.000274
  2                    1.25                     0.000411
  3                    2.24                     0.000548
  4                    3.23                     0.000685
  5                    4.21                     0.000822

  (the add-one frequency is far too high for unseen bigrams and far too low for seen ones)

Total probability mass given to unseen bigrams =
  (74,671,100,000 × 0.000295) / 22,000,000 ≈ 99.96% !!!!
Statistical Estimators
--> Smoothing: Add-delta -- Lidstone's & Jeffreys-Perks' Laws (ELE)
Add-delta smoothing (Lidstone's law)

- Instead of adding 1, add some other (smaller) positive value δ (see the sketch below):

    P_AddD(w_1 w_2 … w_n) = ( C(w_1 w_2 … w_n) + δ ) / ( N + δV )

- The most widely used value is δ = 0.5.
- If δ = 0.5, Lidstone's Law is called:
  - the Expected Likelihood Estimation (ELE)
  - or the Jeffreys-Perks Law

    P_ELE(w_1 w_2 … w_n) = ( C(w_1 w_2 … w_n) + 0.5 ) / ( N + 0.5 V )

- Better than add-one, but still…
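The same computation generalizes to any δ; a small sketch (reusing the assumed "I" row from the add-one example) with δ = 0.5 for ELE:

  # Lidstone's law: add-delta smoothing; delta = 0.5 gives the ELE / Jeffreys-Perks estimate.
  V = 1616
  N_I = 3437
  row_I = {"I": 8, "want": 1087, "eat": 13}   # non-zero counts of bigrams "I w"

  def p_add_delta(word, delta=0.5):
      return (row_I.get(word, 0) + delta) / (N_I + delta * V)

  print(p_add_delta("eat"))            # ELE: (13 + 0.5) / (3437 + 0.5 * 1616) = 0.0032
  print(p_add_delta("eat", delta=1))   # add-one again: 0.0028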
Statistical Estimators
--> Smoothing: ( Validation: Held Out Estimation, Cross Validation )
Validation / Held-out Estimation

- How do we know how much of the probability space to "hold out" for unseen events?
- i.e. we need a good way to guess δ in advance
- Held-out data:
  - We can divide the training data into two parts:
    - the training set: used to build initial estimates by counting
    - the held-out data: used to refine the initial estimates (i.e. see how often the bigrams that appeared r times in the training text occur in the held-out text)
Held Out Estimation

- For each n-gram w_1...w_n we compute:
  - C_tr(w_1...w_n): the frequency of w_1...w_n in the training data
  - C_ho(w_1...w_n): the frequency of w_1...w_n in the held-out data
- Let:
  - r = the frequency of an n-gram in the training data
  - N_r = the number of different n-grams with frequency r in the training data
  - T_r = the sum of the counts of all n-grams in the held-out data that appeared r times in the training data:

      T_r = Σ_{w_1...w_n : C_tr(w_1...w_n) = r} C_ho(w_1...w_n)

  - T = total number of n-gram tokens in the held-out data
- So:

      P_ho(w_1...w_n) = (T_r / T) × (1 / N_r)        where r = C_tr(w_1...w_n)
Some explanation…

    P_ho(w_1...w_n) = (T_r / T) × (1 / N_r)        where r = C_tr(w_1...w_n)

- T_r / T is the probability mass, in the held-out data, of all n-grams appearing r times in the training data.
- Since we have N_r different n-grams in the training data that occurred r times, we share this probability mass equally among them: hence the factor 1 / N_r.
- Example (see the sketch below): assume
  - r = 5, and 10 different n-grams (types) occur 5 times in training --> N_5 = 10
  - all the n-grams (types) that occurred 5 times in training occurred in total (as n-gram tokens) 20 times in the held-out data --> T_5 = 20
  - the held-out data contains 2000 n-grams (tokens)

    P_ho(an n-gram with r = 5) = (20 / 2000) × (1 / 10) = 0.001
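A tiny sketch of the held-out estimate with exactly the toy numbers above (N_5 = 10, T_5 = 20, T = 2000):

  # Held-out estimation for n-grams with training frequency r.
  N_r = {5: 10}    # nb of different n-grams with training frequency r
  T_r = {5: 20}    # total held-out count of those n-grams
  T = 2000         # total nb of n-gram tokens in the held-out data

  def p_held_out(r):
      return (T_r[r] / T) * (1 / N_r[r])

  print(p_held_out(5))   # (20/2000) * (1/10) = 0.001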
Cross-Validation

- Held-out estimation is useful if there is a lot of data available.
- If not, we can use each part of the data both as training data and as held-out data.
- Main methods:
  - Deleted Estimation (two-way cross-validation)
    - Divide the data into part 0 and part 1
    - In one model, use 0 as the training data and 1 as the held-out data
    - In another model, use 1 as training and 0 as held-out data
    - Take a weighted average of the two models
  - Leave-One-Out
    - Divide the data into N parts (N = nb of tokens)
    - Leave 1 token out each time
    - Train N language models
Dividing the corpus

- Training:
  - Training data (80% of total data)
    - to build initial estimates (frequency counts)
  - Held-out data (10% of total data)
    - to refine initial estimates (smoothed estimates)
- Testing:
  - Development test data (5% of total data)
    - to test while developing
  - Final test data (5% of total data)
    - to test at the end
- But how do we divide?
  - Randomly select data (e.g. sentences, n-grams)
    - Advantage: test data is very similar to training data
  - Cut large chunks of consecutive data
    - Advantage: results are lower, but more realistic
Developing and Testing Models

1. Write an algorithm
2. Train it
   - with the training set & held-out data
3. Test it
   - with the development set
4. Note things it does wrong & revise it
5. Repeat 1-4 until satisfied
6. Only then, evaluate and publish results
   - with the final test set
   - better to give final results by testing on n smaller samples of the test data and averaging
Factors of training corpus

- Size:
  - the more, the better
  - but after a while, not much improvement…
    - bigrams (characters): after hundreds of millions of words (IBM)
    - trigrams (characters): after some billions of words (IBM)
- Nature (adaptation):
  - training on the WSJ and testing on AP??
Statistical Estimators
--> Smoothing: Witten-Bell smoothing
Witten-Bell smoothing

- Intuition:
  - An unseen n-gram is one that just did not occur yet.
  - When it does happen, it will be its first occurrence.
  - So give to unseen n-grams the probability of seeing a new n-gram.
Some intuition

- Assume these bigram counts (1st word = row, 2nd word = column):

           a    b    c    d   …   Total
    a      10   10   10   0       30
    b      0    0    30   0       30
    c      0    0    300  0       300
    …

- Observations:
  - "a" seems more promiscuous than "b"…
    - "b" has always been followed by "c",
    - but "a" seems to be followed by a wider range of words.
  - "c" seems more stubborn than "b"…
    - "c" and "b" have the same distribution,
    - but we have seen 300 instances of bigrams starting with "c", so a new bigram starting with "c" seems less likely than one starting with "b".
Some intuition (con't)

           a    b    c    d   …   Total
    a      10   10   10   0       30
    b      0    0    30   0       30
    c      0    0    300  0       300

- Intuitively:
  - "a d" should be more probable than "b d"
  - "b d" should be more probable than "c d"
  - P(d|a) > P(d|b) > P(d|c)

    P(w_2 | w_1) ∝ promiscuity of w_1 / stubbornness of w_1
Witten-Bell smoothing

- To compute the probability of a bigram w_1 w_2 we have never seen, we use:
  - promiscuity T(w_1)
    = the probability of seeing a new bigram starting with w_1
    = number of different n-grams (types) starting with w_1
  - stubbornness N(w_1)
    = number of n-gram tokens starting with w_1
- The following total probability mass will be given to all (not each) unseen bigrams:

      P(all unseen bigrams | w_1) = T(w_1) / ( N(w_1) + T(w_1) )                       for all unseen events

- This probability mass must be distributed in equal parts over all unseen bigrams:
  - Z(w_1): number of unseen n-grams (types) starting with w_1

      P(w_2 | w_1) = (1 / Z(w_1)) × T(w_1) / ( N(w_1) + T(w_1) )                       for each unseen event
Small example

         a    b    c    d   …   Total = N(w1)      T(w1)            Z(w1)
                                (nb seen tokens)   (nb seen types)  (nb unseen types)
  a      10   10   10   0       30                 3                1
  b      0    0    30   0       30                 1                3
  c      0    0    300  0       300                1                3
  …

- All unseen bigrams starting with "a" will share a probability mass of

      T(a) / ( N(a) + T(a) ) = 3 / (30 + 3) = 0.091

- Each unseen bigram starting with "a" will have an equal part of this:

      P(d | a) = (1 / Z(a)) × T(a) / ( N(a) + T(a) ) = (1/1) × 0.091 = 0.091
Small example (con't)

         a    b    c    d   …   Total = N(w1)      T(w1)            Z(w1)
  a      10   10   10   0       30                 3                1
  b      0    0    30   0       30                 1                3
  c      0    0    300  0       300                1                3
  …

- All unseen bigrams starting with "b" will share a probability mass of

      T(b) / ( N(b) + T(b) ) = 1 / (30 + 1) = 0.032

- Each unseen bigram starting with "b" will have an equal part of this:

      P(a | b) = (1 / Z(b)) × T(b) / ( N(b) + T(b) ) = (1/3) × 0.032 = 0.011
      P(b | b) = (1 / Z(b)) × T(b) / ( N(b) + T(b) ) = (1/3) × 0.032 = 0.011
      P(d | b) = (1 / Z(b)) × T(b) / ( N(b) + T(b) ) = (1/3) × 0.032 = 0.011
Small example (con't)

         a    b    c    d   …   Total = N(w1)      T(w1)            Z(w1)
  a      10   10   10   0       30                 3                1
  b      0    0    30   0       30                 1                3
  c      0    0    300  0       300                1                3
  …

- All unseen bigrams starting with "c" will share a probability mass of

      T(c) / ( N(c) + T(c) ) = 1 / (300 + 1) = 0.0033

- Each unseen bigram starting with "c" will have an equal part of this (the sketch below reproduces these numbers):

      P(a | c) = (1 / Z(c)) × T(c) / ( N(c) + T(c) ) = (1/3) × 0.0033 = 0.0011
      P(b | c) = (1 / Z(c)) × T(c) / ( N(c) + T(c) ) = (1/3) × 0.0033 = 0.0011
      P(d | c) = (1 / Z(c)) × T(c) / ( N(c) + T(c) ) = (1/3) × 0.0033 = 0.0011
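The unseen-bigram formula can be checked against this small example; a sketch assuming the vocabulary is just {a, b, c, d}:

  # Witten-Bell probability of an unseen bigram w1 w2, on the toy a/b/c counts.
  counts = {"a": {"a": 10, "b": 10, "c": 10},
            "b": {"c": 30},
            "c": {"c": 300}}
  vocab = ["a", "b", "c", "d"]

  def p_unseen(w2, w1):
      seen = counts[w1]
      N = sum(seen.values())          # nb of seen bigram tokens starting with w1
      T = len(seen)                   # nb of seen bigram types starting with w1
      Z = len(vocab) - T              # nb of unseen bigram types starting with w1
      assert w2 not in seen
      return (1 / Z) * T / (N + T)

  print(p_unseen("d", "a"))   # 3/33            ~ 0.091
  print(p_unseen("d", "b"))   # (1/3) * (1/31)  ~ 0.011
  print(p_unseen("d", "c"))   # (1/3) * (1/301) ~ 0.0011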
More formally

- Unseen bigrams:
  - To get from the probabilities back to the counts, we know that:

        P(w_2 | w_1) = C(w_2 | w_1) / N(w_1)        // N(w_1) = nb of tokens starting with w_1

  - so we get:

        C(w_2 | w_1) = P(w_2 | w_1) × N(w_1)
                     = (1 / Z(w_1)) × T(w_1) / ( N(w_1) + T(w_1) ) × N(w_1)
                     = ( T(w_1) / Z(w_1) ) × N(w_1) / ( N(w_1) + T(w_1) )
More formally (con't)

- Seen bigrams:
  - Since we added probability mass to the unseen bigrams, we must decrease (discount) the probability mass of the seen events (so that the total = 1).
  - We gave the unseen events a probability mass of T(w_1) / ( N(w_1) + T(w_1) ), so we must discount the seen events by the same factor:

        newProb(w_2 | w_1) = originalProb(w_2 | w_1) × ( 1 - T(w_1) / ( N(w_1) + T(w_1) ) )

  - so we get:

        C(w_2 | w_1) = newProb(w_2 | w_1) × N(w_1)
                     = originalProb(w_2 | w_1) × ( 1 - T(w_1) / ( N(w_1) + T(w_1) ) ) × N(w_1)
                     = originalCount(w_2 | w_1) × ( N(w_1) + T(w_1) - T(w_1) ) / ( N(w_1) + T(w_1) )
                     = originalCount(w_2 | w_1) × N(w_1) / ( N(w_1) + T(w_1) )
The restaurant example

- The original counts were:

            I     want   to    eat   Chinese  food   lunch  …   N(w)          T(w)         Z(w)
                                                                (seen bigram  (seen bigram (unseen
                                                                tokens)       types)       bigram types)
  I         8     1087   0     13    0        0      0          3437          95           1521
  want      3     0      786   0     6        8      6          1215          76           1540
  to        3     0      10    860   3        0      12         3256          130          1486
  eat       0     0      2     0     19       2      52         938           124          1492
  Chinese   2     0      0     0     0        120    1          213           20           1596
  food      19    0      17    0     0        0      0          1506          82           1534
  lunch     4     0      0     0     0        1      0          459           45           1571

- T(w) = number of different seen bigram types starting with w
- N(w) = number of bigram tokens starting with w
- We have a vocabulary of 1616 words, so we can compute
  Z(w) = number of unseen bigram types starting with w = 1616 - T(w)
Witten-Bell smoothed count

- The count of the unseen bigram "I lunch":

      T(I)/Z(I) × N(I)/(N(I)+T(I)) = (95/1521) × 3437/(3437+95) = 0.06

- The count of the seen bigram "want to":

      count(want to) × N(want)/(N(want)+T(want)) = 786 × 1215/(1215+76) = 739.73

Witten-Bell smoothed bigram counts:

            I      want     to      eat     Chinese  food    lunch   …   Total
  I         7.78   1057.76  .061    12.65   .06      .06     .06         3437
  want      2.82   .05      739.73  .05     5.65     7.53    5.65        1215
  to        2.88   .08      9.62    826.98  2.88     .08     12.50       3256
  eat       .07    .07      19.43   .07     16.78    1.77    45.93       938
  Chinese   1.74   .01      .01     .01     .01      109.70  .91         213
  food      18.02  .05      16.12   .05     .05      .05     .05         1506
  lunch     3.64   .03      .03     .03     .03      0.91    .03         459
Witten-Bell smoothed probabilities

Witten-Bell normalized bigram probabilities:

            I        want    to       eat      Chinese  food     lunch    …   Total
  I         .0022    .3078   .000002  .0037    .000002  .000002  .000002      1
  want      .00230   .00004  .6088    .00004   .0047    .0062    .0047        1
  to        .00009   .00003  .0030    .2540    .00009   .00003   .0038        1
  eat       .00008   .00008  .0021    .00008   .0179    .0019    .0490        1
  Chinese   .00812   .00005  .00005   .00005   .00005   .5150    .0042        1
  food      .0120    .00004  .0107    .00004   .00004   .00004   .00004       1
  lunch     .0079    .00006  .00006   .00006   .00006   .0020    .00006       1

  (e.g. P(I|I) = 7.78/3437 = .0022)
Statistical Estimators
--> Smoothing: Good-Turing smoothing
Good-Turing Estimator

- Based on the assumption that words have a binomial distribution.
- Works well in practice (with large corpora).
- Idea: re-estimate the probability mass of n-grams with zero (or low) counts by looking at the number of n-grams with higher counts:

      c* = (c + 1) × N_{c+1} / N_c

  where N_c = nb of n-grams that occur c times and N_{c+1} = nb of n-grams that occur c+1 times.

- Ex: new count for bigrams that never occurred:

      c_0* = (0 + 1) × N_1 / N_0 = (nb of bigrams that occurred once) / (nb of bigrams that never occurred)
Good-Turing Estimator (con't)

- In practice, c* is not used for all counts c:
  - large counts (> a threshold k) are assumed to be reliable.
- If c > k (usually k = 5):

      c* = c

- If c <= k (see the sketch below):

      c* = [ (c+1) N_{c+1}/N_c  -  c (k+1) N_{k+1}/N_1 ] / [ 1 - (k+1) N_{k+1}/N_1 ]        for 1 <= c <= k
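A sketch of the re-estimation rule above; the count-of-counts values N_c below are invented for illustration.

  # Good-Turing re-estimated count c*, with Katz's cutoff k for reliable counts.
  def good_turing(c, N, k=5):
      """N is a dict: N[c] = nb of n-grams that occur c times."""
      if c > k:
          return c                              # large counts are trusted as-is
      if c == 0:
          return N[1] / N[0]                    # (0+1) * N_1 / N_0
      adj = (k + 1) * N[k + 1] / N[1]           # renormalization term from the formula above
      return ((c + 1) * N[c + 1] / N[c] - c * adj) / (1 - adj)

  N = {0: 50000, 1: 2000, 2: 800, 3: 400, 4: 250, 5: 180, 6: 140}

  print(good_turing(0, N))   # mass shifted to never-seen n-grams
  print(good_turing(1, N))   # discounted count for singletons
  print(good_turing(9, N))   # above k: unchanged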
Statistical Estimators
--> Combining Estimators
Combining Estimators

- So far, we gave the same probability to all unseen n-grams.
  - We have never seen the bigrams:
    - journal of       P_unsmoothed(of | journal) = 0
    - journal from     P_unsmoothed(from | journal) = 0
    - journal never    P_unsmoothed(never | journal) = 0
  - All models so far will give the same probability to all 3 bigrams.
- But intuitively, "journal of" is more probable because...
  - "of" is more frequent than "from" & "never"
  - unigram probability: P(of) > P(from) > P(never)
Combining Estimators (con't)

- Observation:
  - a unigram model suffers less from data sparseness than a bigram model
  - a bigram model suffers less from data sparseness than a trigram model
  - …
- So use a lower-order model estimate to estimate the probability of unseen n-grams.
- If we have several models of how the history predicts what comes next, we can combine them in the hope of producing an even better model.
Statistical Estimators
--> Combining Estimators: Simple Linear Interpolation
Simple Linear Interpolation

- Solve the sparseness in a trigram model by mixing it with bigram and unigram models.
- Also called:
  - linear interpolation
  - finite mixture models
  - deleted interpolation
- Combine linearly (see the sketch below):

      P_li(w_n | w_{n-2}, w_{n-1}) = λ1 P(w_n) + λ2 P(w_n | w_{n-1}) + λ3 P(w_n | w_{n-2}, w_{n-1})

  where 0 ≤ λi ≤ 1 and Σi λi = 1
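A one-function sketch of the interpolation; the component probabilities and the λ weights below are placeholders, not values estimated from data.

  # Simple linear interpolation of unigram, bigram and trigram estimates.
  def p_interpolated(p_uni, p_bi, p_tri, lambdas=(0.1, 0.3, 0.6)):
      l1, l2, l3 = lambdas            # must be >= 0 and sum to 1
      return l1 * p_uni + l2 * p_bi + l3 * p_tri

  # e.g. an unseen trigram still gets some mass from the bigram and unigram models
  print(p_interpolated(p_uni=0.001, p_bi=0.01, p_tri=0.0))   # 0.0031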
Statistical Estimators
--> Combining Estimators: General Linear Interpolation
General Linear Interpolation

- In simple linear interpolation, the weights λi are constant.
- So the unigram estimate is always combined with the same weight, regardless of whether the trigram is accurate (because there is lots of data) or poor.
- We can have a more general and powerful model where the λi are a function of the history h:

      P_gli(w_n | w_{n-2}, w_{n-1}) = λ1 P(w_n) + λ2(w_{n-1}) P(w_n | w_{n-1}) + λ3(w_{n-2}, w_{n-1}) P(w_n | w_{n-2}, w_{n-1})

  where 0 ≤ λi(h) ≤ 1 and Σi λi(h) = 1

- Having a specific λ(h) per n-gram is not a good idea, but we can set λ(h) according to the frequency of the n-gram.
Statistical Estimators
--> Combining Estimators: Katz's Backoff
Katz's Backing Off Model

- Higher-order models are more reliable,
- so use a lower-order model only if necessary (see the sketch below).

      P_bo(w_n | w_{n-2}, w_{n-1}) =
          P_disc(w_n | w_{n-2}, w_{n-1})    if c(w_{n-2} w_{n-1} w_n) > k    // if the trigram was seen enough
          α1 P_disc(w_n | w_{n-1})          if c(w_{n-1} w_n) > k            // if the bigram was seen enough
          α2 P_disc(w_n)                    otherwise

- P_disc are discounted probabilities (with Good-Turing, add-one, …).
- α1 and α2 make sure the probability mass is 1 when backing off to a lower-order model.
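A rough sketch of the backoff decision only; the counts, the threshold k, the α weights and the discounted distributions are all placeholders (computing the real αs requires the discounting step described above).

  # Backoff decision: trigram if seen enough, else bigram, else unigram.
  def p_backoff(w3, w2, w1, tri_c, bi_c, p_disc_tri, p_disc_bi, p_disc_uni,
                k=0, alpha1=0.4, alpha2=0.2):
      if tri_c.get((w1, w2, w3), 0) > k:        # trigram seen often enough
          return p_disc_tri[(w1, w2, w3)]
      if bi_c.get((w2, w3), 0) > k:             # back off to the bigram
          return alpha1 * p_disc_bi[(w2, w3)]
      return alpha2 * p_disc_uni.get(w3, 1e-7)  # back off to the unigram

  tri_c = {("I", "want", "to"): 5}
  bi_c = {("want", "to"): 8, ("want", "a"): 2}
  p_tri = {("I", "want", "to"): 0.6}
  p_bi = {("want", "to"): 0.55, ("want", "a"): 0.1}
  p_uni = {"to": 0.02, "a": 0.03}

  print(p_backoff("to", "want", "I", tri_c, bi_c, p_tri, p_bi, p_uni))  # trigram used
  print(p_backoff("a", "want", "I", tri_c, bi_c, p_tri, p_bi, p_uni))   # bigram backoff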
Other applications of LM

- Author / Language identification
- Hypothesis: texts that resemble each other (same author, same language) share similar characteristics.
  - In English, the character sequence "ing" is more probable than in French.
- Training phase:
  - construction of the language model
  - with pre-classified documents (known language/author)
- Testing phase:
  - evaluation of an unknown text (comparison with the language models)
Example: Language identification

- Bigrams of characters
  - characters = 26 letters (case insensitive)
  - possible variations: case sensitivity, punctuation, beginning/end-of-sentence marker, …
1. Train a language model for English:

         A       B       C       D       …   Y       Z
  A      0.0014  0.0014  0.0014  0.0014  …   0.0014  0.0014
  B      0.0014  0.0014  0.0014  0.0014  …   0.0014  0.0014
  C      0.0014  0.0014  0.0014  0.0014  …   0.0014  0.0014
  D      0.0042  0.0014  0.0014  0.0014  …   0.0014  0.0014
  E      0.0097  0.0014  0.0014  0.0014  …   0.0014  0.0014
  …      …       …       …       …       …   …       …
  Y      0.0014  0.0014  0.0014  0.0014  …   0.0014  0.0014
  Z      0.0014  0.0014  0.0014  0.0014  …   0.0014  0.0014

2. Train a language model for French.
3. Evaluate the probability of a sentence with LM-English & LM-French (see the sketch below).
4. Highest probability --> language of the sentence.
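The whole recipe fits in a short sketch: train one character bigram model per language, score the sentence under each, and pick the winner. The two training strings below are tiny stand-ins for real corpora, and add-one smoothing (V = 27 for 26 letters plus space, one of the variations mentioned above) is assumed so that unseen character pairs do not zero out the product.

  # Character-bigram language identification with add-one smoothing.
  import math
  from collections import Counter

  def train(text):
      chars = [c for c in text.lower() if c.isalpha() or c == " "]
      return Counter(zip(chars, chars[1:])), Counter(chars)

  def log_prob(text, bigrams, unigrams, V=27):
      chars = [c for c in text.lower() if c.isalpha() or c == " "]
      return sum(math.log((bigrams[(a, b)] + 1) / (unigrams[a] + V))
                 for a, b in zip(chars, chars[1:]))

  english = train("the thing is that everything is interesting")
  french = train("cette chose est que toutes les choses sont interessantes")

  sentence = "nothing is missing"
  scores = {"English": log_prob(sentence, *english),
            "French": log_prob(sentence, *french)}
  print(max(scores, key=scores.get))   # hopefully "English"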