
Advanced Artificial Intelligence
Part II. Statistical NLP
N-Grams
Wolfram Burgard, Luc De Raedt, Bernhard
Nebel, Lars Schmidt-Thieme
Most slides taken from Helmut Schmid, Rada Mihalcea, Bonnie Dorr,
Leila Kosseim and others
Contents
• Short recap of the motivation for SNLP
• Probabilistic language models
• N-grams
• Predicting the next word in a sentence
• Language guessing

Largely Chapter 6 of Statistical NLP, Manning and Schuetze,
and Chapter 6 of Speech and Language Processing, Jurafsky and Martin.
Motivation
Human language is highly ambiguous at all levels
• acoustic level
recognize speech vs. wreck a nice beach
• morphological level
saw: to see (past), saw (noun), to saw (present, inf)
• syntactic level
I saw the man on the hill with a telescope
• semantic level
One book has to be read by every student
NLP and Statistics
Statistical Disambiguation
• Define a probability model for the data
• Compute the probability of each alternative
• Choose the most likely alternative
Language Models
Speech recognisers use a "noisy channel model":

  source s --> channel --> acoustic signal a
The source generates a sentence s with probability P(s).
The channel transforms the text into an acoustic signal
with probability P(a|s).
The task of the speech recogniser is to find for a given
speech signal a the most likely sentence s:
s = argmax_s P(s|a) = argmax_s P(a|s) P(s) / P(a)
  = argmax_s P(a|s) P(s)
Language Models
Speech recognisers employ two statistical models:
• a language model P(s)
• an acoustic model P(a|s)
Language Model
• Definition: a language model is a model that enables one to compute the probability, or likelihood, of a sentence s, P(s).
• Let's look at different ways of computing P(s) in the context of word prediction.
Example of a bad language model
A Good Language Model
• Determine reliable sentence probability estimates
  • P("And nothing but the truth") ≈ 0.001
  • P("And nuts sing on the roof") ≈ 0
Shannon Game: Word Prediction
• Predicting the next word in the sequence:
  • Statistical natural language ….
  • The cat is thrown out of the …
  • The large green …
  • Sue swallowed the large green …
  • …
Claim
• A useful part of the knowledge needed to allow word prediction can be captured using simple statistical techniques.
• Compute:
  - probability of a sequence
  - likelihood of words co-occurring
• Why would we want to do this?
  • Rank the likelihood of sequences containing various alternative hypotheses
  • Assess the likelihood of a hypothesis
Applications
• Spelling correction
• Mobile phone texting
• Speech recognition
• Handwriting recognition
• Disabled users
• …
Spelling errors
• They are leaving in about fifteen minuets to go to her house.
• The study was conducted mainly be John Black.
• Hopefully, all with continue smoothly in my absence.
• Can they lave him my messages?
• I need to notified the bank of….
• He is trying to fine out.
Handwriting recognition
• Assume a note is given to a bank teller, which the teller reads as "I have a gub." (cf. Woody Allen)
• NLP to the rescue ….
  • gub is not a word
  • gun, gum, Gus, and gull are words, but gun has a higher probability in the context of a bank
For Spell Checkers
• Collect a list of commonly substituted words
  • piece/peace, whether/weather, their/there ...
• Example:
  "On Tuesday, the whether …"
  "On Tuesday, the weather …"
Language Models
How to assign probabilities to word sequences?
The probability of a word sequence w1,n is decomposed
into a product of conditional probabilities.
P(w1,n) = P(w1) P(w2|w1) P(w3|w1,w2) ... P(wn|w1,n-1)
        = ∏i=1..n P(wi | w1,i-1)
Language Models
In order to simplify the model, we assume that
• each word only depends on the 2 preceding words
P(wi | w1,i-1) = P(wi | wi-2, wi-1)
• 2nd order Markov model, trigram
• that the probabilities are time invariant (stationary)
P(Wi=c | Wi-2=a, Wi-1=b) = P(Wk=c | Wk-2=a, Wk-1=b)
Final formula: P(w1,n) = ∏i=1..n P(wi | wi-2, wi-1)
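To make the final formula concrete, here is a minimal Python sketch (not part of the original slides) that scores a sentence with a 2nd-order Markov (trigram) model; the toy table TRIGRAM_PROBS and the `unk` fallback are invented for illustration only.

```python
# Toy trigram table; a real model would be estimated from a corpus.
TRIGRAM_PROBS = {
    ("<s>", "<s>", "i"): 0.25,
    ("<s>", "i", "want"): 0.32,
    ("i", "want", "food"): 0.05,
}

def trigram_sentence_prob(words, probs, unk=1e-6):
    """P(w1..n) ~ product over i of P(wi | wi-2, wi-1), padding with <s>."""
    padded = ["<s>", "<s>"] + [w.lower() for w in words]
    p = 1.0
    for i in range(2, len(padded)):
        # fall back to a tiny constant for unseen trigrams (smoothing comes later)
        p *= probs.get((padded[i - 2], padded[i - 1], padded[i]), unk)
    return p

print(trigram_sentence_prob(["I", "want", "food"], TRIGRAM_PROBS))  # 0.25*0.32*0.05 = 0.004
```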
Simple N-Grams
• An N-gram model uses the previous N-1 words to predict the next one:
  P(wn | wn-N+1 wn-N+2 … wn-1)
  • unigrams: P(dog)
  • bigrams: P(dog | big)
  • trigrams: P(dog | the big)
  • quadrigrams: P(dog | chasing the big)
A Bigram Grammar Fragment

  Eat on       .16     Eat Thai       .03
  Eat some     .06     Eat breakfast  .03
  Eat lunch    .06     Eat in         .02
  Eat dinner   .05     Eat Chinese    .02
  Eat at       .04     Eat Mexican    .02
  Eat a        .04     Eat tomorrow   .01
  Eat Indian   .04     Eat dessert    .007
  Eat today    .03     Eat British    .001
Additional Grammar

  <start> I     .25     Want some           .04
  <start> I'd   .06     Want Thai           .01
  <start> Tell  .04     To eat              .26
  <start> I'm   .02     To have             .14
  I want        .32     To spend            .09
  I would       .29     To be               .02
  I don't       .08     British food        .60
  I have        .04     British restaurant  .15
  Want to       .65     British cuisine     .01
  Want a        .05     British lunch       .01
Computing Sentence Probability
• P(I want to eat British food) = P(I|<start>) P(want|I) P(to|want) P(eat|to) P(British|eat) P(food|British)
  = .25 x .32 x .65 x .26 x .001 x .60 ≈ .0000081
• vs.
• P(I want to eat Chinese food) ≈ .00015
• Probabilities seem to capture "syntactic" facts and "world knowledge":
  - eat is often followed by an NP
  - British food is not too popular
• N-gram models can be trained by counting and normalization
Some adjustments
• product of probabilities… numerical underflow for long sentences
• so instead of multiplying the probs, we add the logs of the probs:
  P(I want to eat British food) is computed using
  log(P(I|<s>)) + log(P(want|I)) + log(P(to|want)) + log(P(eat|to)) + log(P(British|eat)) + log(P(food|British))
  = log(.25) + log(.32) + log(.65) + log(.26) + log(.001) + log(.6)
  = -11.722
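A small sketch of the same computation in log space, using the six bigram probabilities from the example above; summing natural logs avoids underflow for long sentences.

```python
import math

# the six bigram probabilities from the example above
probs = [0.25, 0.32, 0.65, 0.26, 0.001, 0.60]

log_prob = sum(math.log(p) for p in probs)  # natural logarithm
print(log_prob)             # ~ -11.722
print(math.exp(log_prob))   # back to a probability: ~ 8.1e-06
```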
Why use only bi- or tri-grams?
• the Markov approximation is still costly: with a 20,000-word vocabulary
  • a bigram model needs to store 400 million parameters
  • a trigram model needs to store 8 trillion parameters
  • using a language model longer than trigrams is impractical
• to reduce the number of parameters, we can:
  • do stemming (use stems instead of word types)
  • group words into semantic classes
  • treat words seen once the same as unseen words
  • ...
Building n-gram Models
• Data preparation:
  • Decide on a training corpus
  • Clean and tokenize
  • How do we deal with sentence boundaries?
    • I eat. I sleep.
      (I eat) (eat I) (I sleep)
    • <s> I eat <s> I sleep <s>
      (<s> I) (I eat) (eat <s>) (<s> I) (I sleep) (sleep <s>)
• Use statistical estimators to derive good probability estimates based on training data (a sketch of the preparation step follows below).
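A possible sketch of the data-preparation step described above; the naive tokenization and the <s> boundary markers are assumptions of this example, not a prescription from the slides.

```python
import re
from collections import Counter

def sentence_bigrams(text):
    """Naive tokenization: split sentences on .!? and wrap each one in <s> markers,
    then count the resulting bigrams."""
    bigrams = Counter()
    for sent in re.split(r"[.!?]+", text.lower()):
        tokens = sent.split()
        if not tokens:
            continue
        tokens = ["<s>"] + tokens + ["<s>"]
        bigrams.update(zip(tokens, tokens[1:]))
    return bigrams

print(sentence_bigrams("I eat. I sleep."))
# Counter({('<s>', 'i'): 2, ('i', 'eat'): 1, ('eat', '<s>'): 1, ('i', 'sleep'): 1, ('sleep', '<s>'): 1})
```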
Statistical Estimators
• Maximum Likelihood Estimation (MLE)
• Smoothing
  • Add-one -- Laplace
  • Add-delta -- Lidstone's & Jeffreys-Perks' Laws (ELE)
  • (Validation: Held Out Estimation, Cross Validation)
  • Witten-Bell smoothing
  • Good-Turing smoothing
• Combining Estimators
  • Simple Linear Interpolation
  • General Linear Interpolation
  • Katz's Backoff
Statistical Estimators
• --> Maximum Likelihood Estimation (MLE)
• Smoothing
  • Add-one -- Laplace
  • Add-delta -- Lidstone's & Jeffreys-Perks' Laws (ELE)
  • (Validation: Held Out Estimation, Cross Validation)
  • Witten-Bell smoothing
  • Good-Turing smoothing
• Combining Estimators
  • Simple Linear Interpolation
  • General Linear Interpolation
  • Katz's Backoff
Maximum Likelihood Estimation
• Choose the parameter values that give the highest probability to the training corpus
• Let C(w1,..,wn) be the frequency of the n-gram w1,..,wn:

  P_MLE(wn | w1,..,wn-1) = C(w1,..,wn) / C(w1,..,wn-1)
Example 1: P(event)
• in a training corpus, we have 10 instances of "come across"
  • 8 times, followed by "as"
  • 1 time, followed by "more"
  • 1 time, followed by "a"
• with MLE, we have:
  • P(as | come across) = 0.8
  • P(more | come across) = 0.1
  • P(a | come across) = 0.1
  • P(X | come across) = 0 for any X not in {"as", "more", "a"}
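A minimal sketch of the MLE computation for the "come across" example above; the counts are the ones from the slide.

```python
from collections import Counter

# counts of the word following "come across" in the toy training corpus
following = Counter({"as": 8, "more": 1, "a": 1})

def p_mle(word, counts):
    """P_MLE(word | history) = C(history, word) / C(history)."""
    return counts[word] / sum(counts.values())

for w in ("as", "more", "a", "the"):
    print(w, p_mle(w, following))   # 0.8, 0.1, 0.1, 0.0  (the unseen word gets zero)
```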
Problem with MLE: data sparseness
• What if a sequence never appears in the training corpus? Then P(X) = 0
  • "come across the men"  --> prob = 0
  • "come across some men" --> prob = 0
  • "come across 3 men"    --> prob = 0
• MLE assigns a probability of zero to unseen events …
• the probability of an n-gram involving unseen words will be zero!
Maybe with a larger corpus?
• Some words or word combinations are unlikely to appear, no matter how large the corpus!
• Recall:
  • Zipf's law
  • f ~ 1/r
Problem with MLE: data sparseness (con't)
• in (Bahl et al., 1983)
  • training with 1.5 million words
  • 23% of the trigrams from another part of the same corpus were previously unseen
• So MLE alone is not a good enough estimator
Discounting or Smoothing
• MLE is usually unsuitable for NLP because of the sparseness of the data
• We need to allow for the possibility of seeing events not seen in training
• Must use a discounting or smoothing technique
• Decrease the probability of previously seen events to leave a little bit of probability for previously unseen events
Statistical Estimators
• Maximum Likelihood Estimation (MLE)
• --> Smoothing
  • --> Add-one -- Laplace
  • Add-delta -- Lidstone's & Jeffreys-Perks' Laws (ELE)
  • (Validation: Held Out Estimation, Cross Validation)
  • Witten-Bell smoothing
  • Good-Turing smoothing
• Combining Estimators
  • Simple Linear Interpolation
  • General Linear Interpolation
  • Katz's Backoff
Many smoothing techniques
• Add-one
• Add-delta
• Witten-Bell smoothing
• Good-Turing smoothing
• Church-Gale smoothing
• Absolute discounting
• Kneser-Ney smoothing
• ...
Add-one Smoothing (Laplace's law)
• Pretend we have seen every n-gram at least once
• Intuitively:
  • new_count(n-gram) = old_count(n-gram) + 1
• The idea is to give a little bit of the probability space to unseen events
Add-one: Example

unsmoothed bigram counts (1st word = row, 2nd word = column):

            I     want   to    eat   Chinese  food  lunch  …   Total (N)
  I         8     1087   0     13    0        0     0          3437
  want      3     0      786   0     6        8     6          1215
  to        3     0      10    860   3        0     12         3256
  eat       0     0      2     0     19       2     52         938
  Chinese   2     0      0     0     0        120   1          213
  food      19    0      17    0     0        0     0          1506
  lunch     4     0      0     0     0        1     0          459

unsmoothed normalized bigram probabilities (each count divided by the row total):

            I               want   to     eat             Chinese  food   lunch  …   Total
  I         .0023 (8/3437)  .32    0      .0038 (13/3437) 0        0      0          1
  want      .0025           0      .65    0               .0049    .0066  .0049      1
  to        .00092          0      .0031  .26             .00092   0      .0037      1
  eat       0               0      .0021  0               .020     .0021  .055       1
  Chinese   .0094           0      0      0               0        .56    .0047      1
  food      .013            0      .011   0               0        0      0          1
  lunch     .0087           0      0      0               0        .0022  0          1
Add-one: Example (con't)

add-one smoothed bigram counts (every count incremented by 1; V = 1616):

            I     want   to    eat   Chinese  food  lunch  …   Total (N+V)
  I         9     1088   1     14    1        1     1          5053
  want      4     1      787   1     7        9     7          2831
  to        4     1      11    861   4        1     13         4872
  eat       1     1      3     1     20       3     53         2554
  Chinese   3     1      1     1     1        121   2          1829
  food      20    1      18    1     1        1     1          3122
  lunch     5     1      1     1     1        2     1          2075

add-one normalized bigram probabilities:

            I               want    to      eat             Chinese  food    lunch   …   Total
  I         .0018 (9/5053)  .22     .0002   .0028 (14/5053) .0002    .0002   .0002       1
  want      .0014           .00035  .28     .00035          .0025    .0032   .0025       1
  to        .00082          .00021  .0023   .18             .00082   .00021  .0027       1
  eat       .00039          .00039  .0012   .00039          .0078    .0012   .021        1
  Chinese   .0016           .00055  .00055  .00055          .00055   .066    .0011       1
  food      .0064           .00032  .0058   .00032          .00032   .00032  .00032      1
  lunch     .0024           .00048  .00048  .00048          .00048   .00096  .00048      1
Add-one, more formally

  P_Add1(w1 w2 … wn) = (C(w1 w2 … wn) + 1) / (N + B)

• N: number of n-grams in the training corpus
• B: number of bins (possible n-grams); B = V^2 for bigrams, B = V^3 for trigrams, etc., where V is the size of the vocabulary
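A sketch of add-one smoothing in the conditional form that the example tables above use, i.e. P(w2|w1) = (C(w1 w2) + 1) / (C(w1) + V); the numbers come from the restaurant corpus.

```python
def add_one_prob(bigram_count, context_count, vocab_size):
    """Add-one (Laplace) smoothed conditional bigram probability."""
    return (bigram_count + 1) / (context_count + vocab_size)

V = 1616                                # vocabulary size of the restaurant corpus
print(add_one_prob(1087, 3437, V))      # P(want | I)  = 1088/5053 ~ 0.215
print(add_one_prob(0, 3437, V))         # P(to | I)    = 1/5053    ~ 0.0002 (no longer zero)
```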
Problem with add-one smoothing
• the total count for bigrams starting with Chinese is boosted by a factor of about 8 (1829 / 213), so the probabilities of the seen bigrams starting with Chinese shrink by the same factor (e.g. P(food|Chinese) drops from .56 to .066)
(compare the unsmoothed and add-one smoothed bigram count tables shown on the previous slides)
Problem with add-one smoothing (con't)
• Data from the AP newswire (Church and Gale, 1991)
  • Corpus of 22,000,000 bigrams
  • Vocabulary of 273,266 words (i.e. 74,674,306,760 possible bigrams, or bins)
  • 74,671,100,000 bigrams were unseen
  • Each unseen bigram was given a frequency of 0.000137

    f_MLE (freq. from   f_empirical (freq.    f_add-one (add-one
    training data)      from held-out data)   smoothed freq.)
    0                   0.000027              0.000137   <- too high
    1                   0.448                 0.000274   <- too low
    2                   1.25                  0.000411
    3                   2.24                  0.000548
    4                   3.23                  0.000685
    5                   4.21                  0.000822

• Total probability mass given to unseen bigrams =
  (74,671,100,000 x 0.000137) / 22,000,000 ≈ 0.465 !
Problem with add-one smoothing
• every previously unseen n-gram is given a low probability
• but there are so many of them that too much probability mass is given to unseen events
• adding 1 to a frequent bigram does not change it much
• but adding 1 to low-count bigrams (including unseen ones) boosts them too much!
• In NLP applications, which are very sparse, Laplace's law actually gives far too much of the probability space to unseen events.
Statistical Estimators
• Maximum Likelihood Estimation (MLE)
• Smoothing
  • Add-one -- Laplace
  • --> Add-delta -- Lidstone's & Jeffreys-Perks' Laws (ELE)
  • Validation: Held Out Estimation, Cross Validation
  • Witten-Bell smoothing
  • Good-Turing smoothing
• Combining Estimators
  • Simple Linear Interpolation
  • General Linear Interpolation
  • Katz's Backoff
Add-delta smoothing (Lidstone's law)
• instead of adding 1, add some other (smaller) positive value δ:

  P_AddD(w1 w2 … wn) = (C(w1 w2 … wn) + δ) / (N + δB)

• the most widely used value is δ = 0.5
• if δ = 0.5, Lidstone's law is called:
  • the Expected Likelihood Estimation (ELE)
  • or the Jeffreys-Perks law

  P_ELE(w1 w2 … wn) = (C(w1 w2 … wn) + 0.5) / (N + 0.5 B)

• better than add-one, but still…
Adding δ / Lidstone's law
The expected frequency of a trigram in a random sample of size N is therefore

  f*(w,w',w'') = μ f(w,w',w'') + (1-μ) N/B,   with μ = N / (N + δB)

• i.e. relative discounting: a linear interpolation between the observed frequency and the uniform frequency N/B
Statistical Estimators
• Maximum Likelihood Estimation (MLE)
• Smoothing
  • Add-one -- Laplace
  • Add-delta -- Lidstone's & Jeffreys-Perks' Laws (ELE)
  • --> (Validation: Held Out Estimation, Cross Validation)
  • Witten-Bell smoothing
  • Good-Turing smoothing
• Combining Estimators
  • Simple Linear Interpolation
  • General Linear Interpolation
  • Katz's Backoff
Validation / Held-out Estimation
• How do we know how much of the probability space to "hold out" for unseen events?
• i.e. we need a good way to choose δ (the discount) in advance
• Held-out data:
  • We can divide the training data into two parts:
    • the training set: used to build initial estimates by counting
    • the held-out data: used to refine the initial estimates (i.e. see how often the bigrams that appeared r times in the training text occur in the held-out text)
Held Out Estimation
• For each n-gram w1...wn we compute:
  • C_tr(w1...wn), the frequency of w1...wn in the training data
  • C_ho(w1...wn), the frequency of w1...wn in the held-out data
• Let:
  • r = the frequency of an n-gram in the training data
  • N_r = the number of different n-grams with frequency r in the training data
  • T_r = the sum of the counts of all n-grams in the held-out data that appeared r times in the training data:

    T_r = Σ_{w1...wn : C_tr(w1...wn) = r} C_ho(w1...wn)

  • T = total number of n-grams in the held-out data
• So:

    P_ho(w1...wn) = (T_r / T) × (1 / N_r),   where r = C_tr(w1...wn)
Some explanation…

    P_ho(w1...wn) = (T_r / T) × (1 / N_r),   where r = C_tr(w1...wn)

• T_r / T: the probability mass in the held-out data of all n-grams appearing r times in the training data
• 1 / N_r: since we have N_r different n-grams in the training data that occurred r times, we share this probability mass equally among them
• ex: assume
  • r = 5, and 10 different n-grams (types) occur 5 times in training --> N_5 = 10
  • all the n-gram types that occurred 5 times in training occurred in total (as n-gram tokens) 20 times in the held-out data --> T_5 = 20
  • the held-out data contains 2000 n-grams (tokens)

    P_ho(an n-gram with r = 5) = (20 / 2000) × (1 / 10) = 0.001
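A sketch of held-out estimation over two count dictionaries; the variable names are mine, but N_r, T_r and T follow the definitions above, and the toy data reproduces the r = 5 example.

```python
from collections import Counter

def held_out_probs(train_counts, heldout_counts):
    """Return a dict r -> P_ho assigned to *each* n-gram seen r times in training."""
    T = sum(heldout_counts.values())          # total n-gram tokens in the held-out data
    N_r = Counter(train_counts.values())      # N_r: number of n-gram types with training count r
    T_r = Counter()                           # T_r: held-out tokens of exactly those types
    for ngram, r in train_counts.items():
        T_r[r] += heldout_counts.get(ngram, 0)
    return {r: T_r[r] / (N_r[r] * T) for r in N_r}

# Toy check mirroring the slide: 10 types seen r=5 times in training,
# 20 held-out tokens in total for them, 2000 held-out tokens overall.
train = {f"ngram{i}": 5 for i in range(10)}
heldout = Counter({f"ngram{i}": 2 for i in range(10)})
heldout["filler"] = 2000 - 20
print(held_out_probs(train, heldout)[5])   # 20 / (10 * 2000) = 0.001
```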
Cross-Validation
• Held-out estimation is useful if there is a lot of data available
• If not, we can use each part of the data both as training data and as held-out data
• Main methods:
  • Deleted Estimation (two-way cross-validation)
    • Divide the data into part 0 and part 1
    • In one model use 0 as the training data and 1 as the held-out data
    • In another model use 1 as training and 0 as held-out data
    • Take a weighted average of the two models
  • Leave-One-Out
    • Divide the data into N parts (N = nb of tokens)
    • Leave 1 token out each time
    • Train N language models
Comparison
Empirical results for bigram data (Church and Gale):

  f (training count)   f_emp      f_GT       f_add-1    f_held-out
  0                    0.000027   0.000027   0.000137   0.000037
  1                    0.448      0.446      0.000274   0.396
  2                    1.25       1.26       0.000411   1.24
  3                    2.24       2.24       0.000548   2.23
  4                    3.23       3.24       0.000685   3.22
  5                    4.21       4.22       0.000822   4.22
  6                    5.23       5.19       0.000959   5.20
  7                    6.21       6.21       0.00109    6.21
  8                    7.21       7.24       0.00123    7.18
  9                    8.26       8.25       0.00137    8.18
Dividing the corpus
• Training:
  • Training data (80% of total data): to build initial estimates (frequency counts)
  • Held-out data (10% of total data): to refine initial estimates (smoothed estimates)
• Testing:
  • Development test data (5% of total data): to test while developing
  • Final test data (5% of total data): to test at the end
• But how do we divide?
  • Randomly select data (e.g. sentences, n-grams). Advantage: test data is very similar to training data
  • Cut large chunks of consecutive data. Advantage: results are lower, but more realistic
Developing and Testing Models
1. Write an algorithm
2. Train it
   • with the training set & held-out data
3. Test it
   • with the development set
4. Note things it does wrong & revise it
5. Repeat 1-5 until satisfied
6. Only then, evaluate and publish results
   • with the final test set
   • better to give final results by testing on n smaller samples of the test data and averaging
Statistical Estimators
• Maximum Likelihood Estimation (MLE)
• Smoothing
  • Add-one -- Laplace
  • Add-delta -- Lidstone's & Jeffreys-Perks' Laws (ELE)
  • (Validation: Held Out Estimation, Cross Validation)
  • --> Witten-Bell smoothing
  • Good-Turing smoothing
• Combining Estimators
  • Simple Linear Interpolation
  • General Linear Interpolation
  • Katz's Backoff
Witten-Bell smoothing
• intuition:
  • An unseen n-gram is one that just did not occur yet
  • When it does happen, it will be its first occurrence
  • So give to unseen n-grams the probability of seeing a new n-gram
Witten-Bell: the equations
• Total probability mass assigned to zero-frequency N-grams (NB: T is the number of OBSERVED types, not V):

  T / (N + T)

• So each zero-count N-gram gets the probability (Z = number of N-grams with zero count):

  T / (Z (N + T))

Witten-Bell: why 'discounting'
• Now of course we have to take away something ('discount') from the probability of the events seen more than once:

  c* = c × N / (N + T)   for each N-gram seen c > 0 times
Witten-Bell for bigrams
• We 'relativize' the types to the previous word: the probability mass reserved for unseen bigrams starting with w1 is T(w1) / (N(w1) + T(w1))
• this probability mass must be distributed in equal parts over all unseen bigrams
  • Z(w1): number of unseen bigram types starting with w1

  P(w2|w1) = 1/Z(w1) × T(w1) / (N(w1) + T(w1))   for each unseen bigram
Small example

          a    b    c    d   …   Total = N(w1)     T(w1)            Z(w1)
                                 (nb seen tokens)  (nb seen types)  (nb unseen types)
  a       10   10   10   0       30                3                1
  b       0    0    30   0       30                1                3
  c       0    0    300  0       300               1                3
  d       …

• all unseen bigrams starting with a will share a probability mass of
  T(a) / (N(a) + T(a)) = 3 / (30 + 3) = 0.091
• each unseen bigram starting with a will have an equal part of this:
  P(d|a) = 1/Z(a) × T(a) / (N(a) + T(a)) = 1/1 × 0.091 = 0.091
Small example (con't)
(same table as above)

• all unseen bigrams starting with b will share a probability mass of
  T(b) / (N(b) + T(b)) = 1 / (30 + 1) = 0.032
• each unseen bigram starting with b will have an equal part of this:
  P(a|b) = P(b|b) = P(d|b) = 1/Z(b) × T(b) / (N(b) + T(b)) = 1/3 × 0.032 = 0.011
Small example (con't)
(same table as above)

• all unseen bigrams starting with c will share a probability mass of
  T(c) / (N(c) + T(c)) = 1 / (300 + 1) = 0.0033
• each unseen bigram starting with c will have an equal part of this:
  P(a|c) = P(b|c) = P(d|c) = 1/Z(c) × T(c) / (N(c) + T(c)) = 1/3 × 0.0033 = 0.0011
More formally
• Unseen bigrams:
  • To get from the probabilities back to the counts, we know that

    P(w2|w1) = C(w2|w1) / N(w1)        // N(w1) = nb of bigram tokens starting with w1

  • so we get:

    C(w2|w1) = P(w2|w1) × N(w1)
             = 1/Z(w1) × T(w1) / (N(w1) + T(w1)) × N(w1)
             = T(w1)/Z(w1) × N(w1) / (N(w1) + T(w1))
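A sketch of Witten-Bell smoothed bigram counts following the formulas above; N(w1), T(w1) and Z(w1) are derived from a raw bigram-count table and the vocabulary size, and the toy data is row "a" of the small example.

```python
def witten_bell_count(bigram_counts, w1, w2, vocab_size):
    """Witten-Bell smoothed count of the bigram (w1, w2).
    bigram_counts maps (first_word, second_word) -> raw count."""
    row = {b: c for (a, b), c in bigram_counts.items() if a == w1}
    N = sum(row.values())        # N(w1): bigram tokens starting with w1
    T = len(row)                 # T(w1): seen bigram types starting with w1
    Z = vocab_size - T           # Z(w1): unseen bigram types starting with w1
    c = bigram_counts.get((w1, w2), 0)
    if c > 0:
        return c * N / (N + T)                 # discount the seen bigrams
    return (T / Z) * (N / (N + T))             # equal share of the reserved mass

# row "a" of the small example, with vocabulary {a, b, c, d}
counts = {("a", "a"): 10, ("a", "b"): 10, ("a", "c"): 10}
print(witten_bell_count(counts, "a", "d", vocab_size=4))   # unseen: 3/1 * 30/33 ~ 2.73
print(witten_bell_count(counts, "a", "a", vocab_size=4))   # seen:  10 * 30/33 ~ 9.09
# dividing by N(a) = 30 turns these counts into probabilities, e.g. 2.73/30 ~ 0.091
```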
The restaurant example
• The original counts were:

            I     want   to    eat   Chinese  food  lunch  …   N(w)           T(w)          Z(w)
                                                               (seen bigram   (seen bigram  (unseen
                                                                tokens)        types)        bigram types)
  I         8     1087   0     13    0        0     0          3437           95            1521
  want      3     0      786   0     6        8     6          1215           76            1540
  to        3     0      10    860   3        0     12         3256           130           1486
  eat       0     0      2     0     19       2     52         938            124           1492
  Chinese   2     0      0     0     0        120   1          213            20            1596
  food      19    0      17    0     0        0     0          1506           82            1534
  lunch     4     0      0     0     0        1     0          459            45            1571

• N(w) = number of bigram tokens starting with w
• T(w) = number of different seen bigram types starting with w
• we have a vocabulary of 1616 words, so we can compute
  Z(w) = number of unseen bigram types starting with w = 1616 - T(w)
Witten-Bell smoothed count
• the count of the unseen bigram "I lunch":
  T(I)/Z(I) × N(I) / (N(I) + T(I)) = 95/1521 × 3437 / (3437 + 95) = 0.06
• the count of the seen bigram "want to":
  count(want to) × N(want) / (N(want) + T(want)) = 786 × 1215 / (1215 + 76) = 739.73
Witten-Bell smoothed bigram counts:

            I      want     to      eat     Chinese  food    lunch   …   Total
  I         7.78   1057.76  .061    12.65   .06      .06     .06         3437
  want      2.82   .05      739.73  .05     5.65     7.53    5.65        1215
  to        2.88   .08      9.62    826.98  2.88     .08     12.50       3256
  eat       .07    .07      19.43   .07     16.78    1.77    45.93       938
  Chinese   1.74   .01      .01     .01     .01      109.70  .91         213
  food      18.02  .05      16.12   .05     .05      .05     .05         1506
  lunch     3.64   .03      .03     .03     .03      0.91    .03         459
Witten-Bell smoothed probabilities
Witten-Bell normalized bigram probabilities (each smoothed count divided by the row total):

            I                  want    to      eat     Chinese  food    lunch   …   Total
  I         .0023 (7.78/3437)  .3078   .00002  .0037   .00002   .00002  .00002      1
  want      .0023              .00004  .6088   .00004  .0047    .0062   .0047       1
  to        .00088             .00002  .0030   .2540   .00088   .00002  .0038       1
  eat       .00007             .00007  .0207   .00007  .0179    .0019   .0490       1
  Chinese   .0082              .00005  .00005  .00005  .00005   .5150   .0043       1
  food      .0120              .00003  .0107   .00003  .00003   .00003  .00003      1
  lunch     .0079              .00007  .00007  .00007  .00007   .0020   .00007      1
Statistical Estimators
• Maximum Likelihood Estimation (MLE)
• Smoothing
  • Add-one -- Laplace
  • Add-delta -- Lidstone's & Jeffreys-Perks' Laws (ELE)
  • Validation: Held Out Estimation, Cross Validation
  • Witten-Bell smoothing
  • --> Good-Turing smoothing
• Combining Estimators
  • Simple Linear Interpolation
  • General Linear Interpolation
  • Katz's Backoff
Good-Turing Estimator
• Based on the assumption that words have a binomial distribution
• Works well in practice (with large corpora)
• Idea:
  • Re-estimate the probability mass of n-grams with zero (or low) counts by looking at the number of n-grams with higher counts
  • Ex:

    c* = (c + 1) × N_{c+1} / N_c

    N_{c+1}: nb of n-grams that occur c+1 times
    N_c:     nb of n-grams that occur c times

    new count for bigrams that never occurred:  c*_0 = (0 + 1) × N_1 / N_0

    N_1: nb of bigrams that occurred once
    N_0: nb of bigrams that never occurred
Good-Turing Estimator (con't)
• In practice c* is not used for all counts c
• large counts (> a threshold k) are assumed to be reliable
• If c > k (usually k = 5):  c* = c
• If c <= k:

  c* = [ (c+1) N_{c+1}/N_c  -  c (k+1) N_{k+1}/N_1 ] / [ 1 - (k+1) N_{k+1}/N_1 ],   for 1 <= c <= k
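A sketch of the basic Good-Turing re-estimation c* = (c+1) N_{c+1} / N_c (without the threshold-k correction above); it assumes every needed N_{c+1} is non-zero, which only holds for toy data, so real implementations smooth the N_c values first.

```python
from collections import Counter

def good_turing_counts(ngram_counts):
    """Map each observed raw count c to the Good-Turing re-estimate
    c* = (c+1) * N_{c+1} / N_c (only where N_{c+1} > 0)."""
    N_c = Counter(ngram_counts.values())
    return {c: (c + 1) * N_c[c + 1] / N_c[c] for c in N_c if N_c[c + 1] > 0}

counts = {"a b": 1, "b c": 1, "c d": 1, "d e": 2, "e f": 2, "f g": 3}
print(good_turing_counts(counts))
# c=1 -> 2 * N_2/N_1 = 2*2/3 ~ 1.33 ;  c=2 -> 3 * N_3/N_2 = 3*1/2 = 1.5
```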
Statistical Estimators
• Maximum Likelihood Estimation (MLE)
• Smoothing
  • Add-one -- Laplace
  • Add-delta -- Lidstone's & Jeffreys-Perks' Laws (ELE)
  • (Validation: Held Out Estimation, Cross Validation)
  • Witten-Bell smoothing
  • Good-Turing smoothing
• --> Combining Estimators
  • Simple Linear Interpolation
  • General Linear Interpolation
  • Katz's Backoff
Combining Estimators
• so far, we gave the same probability to all unseen n-grams
  • we have never seen the bigrams
    journal of       P_unsmoothed(of | journal) = 0
    journal from     P_unsmoothed(from | journal) = 0
    journal never    P_unsmoothed(never | journal) = 0
  • all the smoothing models seen so far will give the same probability to all 3 bigrams
• but intuitively, "journal of" is more probable because...
  • "of" is more frequent than "from" & "never"
  • unigram probability: P(of) > P(from) > P(never)
Combining Estimators (con't)
• observation:
  • a unigram model suffers less from data sparseness than a bigram model
  • a bigram model suffers less from data sparseness than a trigram model
  • …
• so use a lower-order model to estimate the probability of unseen n-grams
• if we have several models of how the history predicts what comes next, we can combine them in the hope of producing an even better model
Statistical Estimators
• Maximum Likelihood Estimation (MLE)
• Smoothing
  • Add-one -- Laplace
  • Add-delta -- Lidstone's & Jeffreys-Perks' Laws (ELE)
  • Validation: Held Out Estimation, Cross Validation
  • Witten-Bell smoothing
  • Good-Turing smoothing
• Combining Estimators
  • --> Simple Linear Interpolation
  • General Linear Interpolation
  • Katz's Backoff
Simple Linear Interpolation
• Solve the sparseness of a trigram model by mixing it with bigram and unigram models
• Also called:
  • linear interpolation
  • finite mixture models
  • deleted interpolation
• Combine linearly:

  P_li(wn|wn-2,wn-1) = λ1 P(wn) + λ2 P(wn|wn-1) + λ3 P(wn|wn-2,wn-1)

  where 0 <= λi <= 1 and Σi λi = 1
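A sketch of simple linear interpolation with fixed weights; the component probabilities and the λ values below are toy numbers, and in practice the λs are tuned on held-out data.

```python
def interpolated_prob(p_uni, p_bi, p_tri, lambdas=(0.2, 0.3, 0.5)):
    """P_li(w | w-2, w-1) = l1*P(w) + l2*P(w|w-1) + l3*P(w|w-2,w-1)."""
    l1, l2, l3 = lambdas
    assert abs(l1 + l2 + l3 - 1.0) < 1e-9, "weights must sum to 1"
    return l1 * p_uni + l2 * p_bi + l3 * p_tri

# Even if the trigram was never seen (p_tri = 0), the estimate stays non-zero:
print(interpolated_prob(p_uni=0.001, p_bi=0.01, p_tri=0.0))   # 0.0002 + 0.003 = 0.0032
```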
Statistical Estimators
• Maximum Likelihood Estimation (MLE)
• Smoothing
  • Add-one -- Laplace
  • Add-delta -- Lidstone's & Jeffreys-Perks' Laws (ELE)
  • Validation: Held Out Estimation, Cross Validation
  • Witten-Bell smoothing
  • Good-Turing smoothing
• Combining Estimators
  • Simple Linear Interpolation
  • --> General Linear Interpolation
  • Katz's Backoff
General Linear Interpolation
• In simple linear interpolation, the weights λi are constant
• So the unigram estimate is always combined with the same weight, regardless of whether the trigram estimate is accurate (because there is lots of data) or poor
• We can have a more general and powerful model where the λi are a function of the history h:

  P_gli(wn|wn-2,wn-1) = λ1 P(wn) + λ2(wn-1) P(wn|wn-1) + λ3(wn-2,wn-1) P(wn|wn-2,wn-1)

  where 0 <= λi(h) <= 1 and Σi λi(h) = 1
• Having a specific λ(h) per n-gram history is not a good idea, but we can set λ(h) according to the frequency of the n-gram
Statistical Estimators
• Maximum Likelihood Estimation (MLE)
• Smoothing
  • Add-one -- Laplace
  • Add-delta -- Lidstone's & Jeffreys-Perks' Laws (ELE)
  • Validation: Held Out Estimation, Cross Validation
  • Witten-Bell smoothing
  • Good-Turing smoothing
• Combining Estimators
  • Simple Linear Interpolation
  • General Linear Interpolation
  • --> Katz's Backoff
Backoff Smoothing
Smoothing of Conditional Probabilities
p(Angeles | to, Los)
If "to Los Angeles" is not in the training corpus,
the smoothed probability p(Angeles | to, Los) is
identical to p(York | to, Los).
However, the actual probability is probably close to
the bigram probability p(Angeles | Los).
Backoff Smoothing
(Wrong) Back-off smoothing of trigram probabilities:

  if C(w', w'', w) > 0:
      P*(w | w', w'') = P(w | w', w'')
  else if C(w'', w) > 0:
      P*(w | w', w'') = P(w | w'')
  else if C(w) > 0:
      P*(w | w', w'') = P(w)
  else:
      P*(w | w', w'') = 1 / #words
Backoff Smoothing
Problem: not a probability distribution
Solution: combination of back-off and frequency discounting:

  P(w | w1,...,wk) = C*(w1,...,wk,w) / N                 if C(w1,...,wk,w) > 0
  P(w | w1,...,wk) = α(w1,...,wk) P(w | w2,...,wk)       otherwise
Backoff Smoothing
The backoff factor α is defined such that the probability mass assigned to unobserved trigrams,

  Σ_{w: C(w1,...,wk,w)=0} α(w1,...,wk) P(w | w2,...,wk),

is identical to the probability mass discounted from the observed trigrams,

  1 - Σ_{w: C(w1,...,wk,w)>0} P(w | w1,...,wk).

Therefore, we get:

  α(w1,...,wk) = ( 1 - Σ_{w: C(w1,...,wk,w)>0} P(w | w1,...,wk) ) / ( 1 - Σ_{w: C(w1,...,wk,w)>0} P(w | w2,...,wk) )
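A sketch of the back-off idea for bigrams backing off to unigrams. For brevity it frees probability mass with a simple absolute discount rather than the exact discounting scheme on the slide, but the α normalization follows the same "leftover mass" logic derived above.

```python
from collections import Counter

class BackoffBigramLM:
    """Toy bigram model that backs off to unigrams (absolute discounting is an
    assumption of this sketch; the alpha normalization mirrors the slide)."""

    def __init__(self, tokens, discount=0.5):
        self.d = discount
        self.uni = Counter(tokens)
        self.bi = Counter(zip(tokens, tokens[1:]))
        self.N = len(tokens)

    def p_uni(self, w):
        return self.uni[w] / self.N

    def p_bi(self, w, prev):
        c_prev = self.uni[prev]
        c = self.bi[(prev, w)]
        if c > 0:                                   # seen: discounted ML estimate
            return (c - self.d) / c_prev
        # unseen: back off to the unigram, scaled so the distribution sums to 1
        seen = {v for (p, v) in self.bi if p == prev}
        reserved = self.d * len(seen) / c_prev              # mass taken from seen bigrams
        backoff_mass = 1.0 - sum(self.p_uni(v) for v in seen)
        return reserved * self.p_uni(w) / backoff_mass

tokens = "the cat sat on the mat the cat ate".split()
lm = BackoffBigramLM(tokens)
print(lm.p_bi("sat", "cat"))   # seen bigram: (1 - 0.5) / 2 = 0.25
print(lm.p_bi("ate", "the"))   # unseen bigram, estimated via backoff (~0.056)
```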
Other applications of LM
• Author / language identification
• hypothesis: texts that resemble each other (same author, same language) share similar characteristics
  • in English, the character sequence "ing" is more probable than in French
• Training phase:
  • construction of the language model with pre-classified documents (known language/author)
• Testing phase:
  • evaluation of an unknown text (comparison with the language models)
Example: Language identification
• bigrams of characters
  • characters = 26 letters (case insensitive)
  • possible variations: case sensitivity, punctuation, beginning/end of sentence marker, …

       A       B       C       D       …   Y       Z
  A    0.0014  0.0014  0.0014  0.0014  …   0.0014  0.0014
  B    0.0014  0.0014  0.0014  0.0014  …   0.0014  0.0014
  C    0.0014  0.0014  0.0014  0.0014  …   0.0014  0.0014
  D    0.0042  0.0014  0.0014  0.0014  …   0.0014  0.0014
  E    0.0097  0.0014  0.0014  0.0014  …   0.0014  0.0014
  …    …       …       …       …       …   …       …
  Y    0.0014  0.0014  0.0014  0.0014  …   0.0014  0.0014
  Z    0.0014  0.0014  0.0014  0.0014  …   0.0014  0.0014

1. Train a language model for English (a character-bigram table as above)
2. Train a language model for French
3. Evaluate the probability of a sentence with LM-English & LM-French
4. Highest probability --> language of the sentence
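A sketch of character-bigram language guessing along the lines above: train one add-one-smoothed character-bigram model per language and pick the language with the higher log-probability. The two training strings are toy stand-ins for real corpora.

```python
import math
from collections import Counter

ALPHABET = "abcdefghijklmnopqrstuvwxyz "

def char_bigram_model(text):
    """Add-one smoothed character-bigram log-probabilities."""
    text = "".join(ch for ch in text.lower() if ch in ALPHABET)
    counts = Counter(zip(text, text[1:]))
    context = Counter(text[:-1])
    V = len(ALPHABET)
    return lambda a, b: math.log((counts[(a, b)] + 1) / (context[a] + V))

def score(sentence, model):
    sentence = "".join(ch for ch in sentence.lower() if ch in ALPHABET)
    return sum(model(a, b) for a, b in zip(sentence, sentence[1:]))

english = char_bigram_model("the king is singing and laughing this morning")
french = char_bigram_model("le roi chante et rit ce matin dans le jardin")

test = "singing in the morning"
# the test sentence shares many character bigrams ("in", "ng", "th", ...) with the English text
print("English" if score(test, english) > score(test, french) else "French")
```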
Claim
• A useful part of the knowledge needed to allow word prediction can be captured using simple statistical techniques.
• Compute:
  - probability of a sequence
  - likelihood of words co-occurring
• It can be useful to do this.