CSC 9010: Special Topics. Natural Language Processing.
Paula Matuszek, Mary-Angela Papalaskari
Spring, 2005
Based on McCoy, http://www.cis.udel.edu/~mccoy/courses/cisc882.03f/lectures/lect5-ngrams.ppt/
• I am going to say some phrases. Write down the next word or two that occur to you.
– Microsoft announced a new security ____
– NHL commissioner cancels rest ____
– One Fish, ______
– Colorless green ideas ______
– Conjunction Junction, what’s __________
– Oh, say, can you see, by the dawn’s ______
– After I finished my homework I went _____.
• Clearly, at least some of us have the ability to predict future words in an utterance.
• How?
– Domain knowledge
– Syntactic knowledge
– Lexical knowledge
• A useful part of the knowledge needed for word prediction can be captured using simple statistical techniques
• In particular, we'll rely on the notion of the probability of a sequence (a phrase, a sentence)
• Why do we want to predict a word, given some preceding words?
– Rank the likelihood of sequences containing various alternative hypotheses, e.g., for automatic speech recognition or OCR.
Theatre owners say popcorn/unicorn sales have doubled...
– Assess the likelihood/goodness of a sentence, e.g. for text generation or machine translation
Como mucho pescado ("I eat a lot of fish") vs. the mistranslation At the most fished.
Examples of real-word errors that word prediction could help catch:
• They are leaving in about fifteen minuets to go to her house.
• The study was conducted mainly be John Black.
• The design an construction of the system will take more than a year.
• Hopefully, all with continue smoothly in my absence.
• Can they lave him my messages?
• I need to notified the bank of….
• He is trying to fine out.
Example from Dorr, http://www.umiacs.umd.edu/~bonnie/courses/cmsc723-04/lecture-notes/Lecture5.ppt
• Fundamental tool in NLP
• Main idea:
– Some words are more likely than others to follow each other
– That likelihood can be estimated fairly accurately.
• In other words, you can build a language model
Adapted from Hearst, http://www.sims.berkeley.edu/courses/is290-2/f04/lectures/lecture4.ppt
• N-Grams are sequences of tokens.
• The N stands for how many terms are used
– Unigram: 1 term
– Bigram: 2 terms
– Trigram: 3 terms
• You can use different kinds of tokens
– Character based n-grams
– Word-based n-grams
– POS-based n-grams
• N-Grams give us some idea of the context around the token we are looking at (a small extraction sketch follows below).
Adapted from Hearst, http://www.sims.berkeley.edu/courses/is290-2/f04/lectures/lecture4.ppt
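To make the token choices above concrete, here is a minimal Python sketch (not from the original slides; the example strings are arbitrary) that extracts character-based and word-based n-grams. POS-based n-grams would be built the same way from a sequence of tags.

```python
def char_ngrams(text, n):
    """Return character-based n-grams of the given text."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]

def word_ngrams(tokens, n):
    """Return word-based n-grams from a list of tokens."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "colorless green ideas sleep furiously".split()
print(word_ngrams(tokens, 2))      # bigrams: ('colorless', 'green'), ('green', 'ideas'), ...
print(char_ngrams("unicorn", 3))   # character trigrams: 'uni', 'nic', 'ico', ...
```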
• A language model is a model that lets us compute the probability, or likelihood, of a sentence S, P(S).
• N-Gram models use the previous N-1 words in a sequence to predict the next word
– unigrams, bigrams, trigrams,…
• How do we construct or train these language models?
– Count frequencies in very large corpora
– Determine probabilities using Markov models, similar to POS tagging (a minimal counting sketch follows below).
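As a hedged illustration of the "count frequencies in a corpus" step, the sketch below builds bigram counts and maximum-likelihood conditional probabilities in Python. The toy corpus, the <s>/</s> markers, and the function names are assumptions for illustration, not part of the lecture.

```python
from collections import Counter

# Toy "corpus" for illustration only; a real model needs a very large corpus.
corpus = ["<s> the mythical unicorn grazed </s>",
          "<s> the mythical beast slept </s>"]

unigrams, bigrams = Counter(), Counter()
for sentence in corpus:
    tokens = sentence.split()
    unigrams.update(tokens)
    bigrams.update(zip(tokens, tokens[1:]))

# Maximum-likelihood bigram estimate: P(w_n | w_{n-1}) = c(w_{n-1}, w_n) / c(w_{n-1})
def p_bigram(prev, word):
    return bigrams[(prev, word)] / unigrams[prev] if unigrams[prev] else 0.0

print(p_bigram("the", "mythical"))      # 1.0 in this toy corpus
print(p_bigram("mythical", "unicorn"))  # 0.5
```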
• What is a word?
– e.g., are cat and cats the same word?
– September and Sept ?
– zero and oh ?
– Are symbols like _ , * , ' , or ( words?
– How many words are there in don't? In gonna?
– In Japanese and Chinese text -- how do we identify a word?
• Sentence: unit of written language
• Utterance: unit of spoken language
• Word form: the inflected form that appears in the corpus
• Lemma: an abstract form shared by word forms having the same stem, part of speech, and word sense
• Types: number of distinct words in a corpus (vocabulary size)
• Tokens: total number of words
• Assume a language has V word types in its lexicon. How likely is word x to follow word y?
– Simplest model of word probability: 1/V
– Alternative 1: estimate the likelihood of x occurring in new text from its general frequency of occurrence in a corpus (unigram probability): popcorn is more likely to occur than unicorn
– Alternative 2: condition the likelihood of x occurring on the context of previous words (bigrams, trigrams, …): mythical unicorn is more likely than mythical popcorn (a numeric sketch follows below)
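A tiny numeric sketch of the three alternatives above; every count and the vocabulary size here are hypothetical, chosen only to show how the three estimates differ.

```python
# Hypothetical counts and sizes, for illustration only (not from any real corpus).
V = 10_000                  # word types in the lexicon
N = 1_000_000               # tokens in the corpus
c_unicorn = 50              # count of "unicorn"
c_mythical = 100            # count of "mythical"
c_mythical_unicorn = 20     # count of the bigram "mythical unicorn"

p_uniform = 1 / V                            # simplest model: every type equally likely
p_unigram = c_unicorn / N                    # Alternative 1: general frequency
p_bigram = c_mythical_unicorn / c_mythical   # Alternative 2: conditioned on "mythical"

print(p_uniform, p_unigram, p_bigram)        # 0.0001 5e-05 0.2
```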
• Compute the product of component conditional probabilities?
– P(the mythical unicorn) = P(the) P(mythical | the) P(unicorn | the mythical)
• The longer the sequence, the less likely we are to find it in a training corpus
P( Most biologists and folklore specialists believe that in fact the mythical unicorn horns derived from the narwhal )
• Solution: approximate using n-grams
• Approximate P(w_n | w_1 … w_{n-1}) by P(w_n | w_{n-1})
– P( unicorn | the mythical ) by P( unicorn | mythical )
• Markov assumption : the probability of a word depends only on the probability of a limited history
• Generalization: the probability of a word depends only on the probability of the n previous words
– Trigrams, 4-grams, …
– The higher n is, the more data needed to train
– The higher n is, the sparser the matrix.
• Leads us to backoff models
• For N-gram models:
– P(w_n | w_1 … w_{n-1}) ≈ P(w_n | w_{n-N+1} … w_{n-1})
– P(w_{n-1}, w_n) = P(w_n | w_{n-1}) P(w_{n-1})
– By the Chain Rule we can decompose a joint probability, e.g. P(w_1, w_2, w_3):
P(w_1, w_2, …, w_n) = P(w_1 | w_2, …, w_n) P(w_2 | w_3, …, w_n) … P(w_{n-1} | w_n) P(w_n)
• For bigrams, then, the probability of a sequence is just the product of the conditional probabilities of its bigrams:
P(the, mythical, unicorn) = P(unicorn | mythical) P(mythical | the) P(the | <start>)
P(w_1 … w_n) ≈ ∏_{k=1..n} P(w_k | w_{k-1})
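The bigram product formula above translates directly into a few lines of code. This is a minimal sketch; the probability values and the <start> marker are illustrative assumptions.

```python
# Hypothetical bigram probabilities, for illustration only.
bigram_p = {
    ("<start>", "the"): 0.2,
    ("the", "mythical"): 0.01,
    ("mythical", "unicorn"): 0.1,
}

def sequence_probability(words, bigram_p):
    """P(w_1 ... w_n) ~= product over k of P(w_k | w_{k-1}), starting from <start>."""
    prob, prev = 1.0, "<start>"
    for w in words:
        prob *= bigram_p.get((prev, w), 0.0)  # unseen bigrams get probability 0 here
        prev = w
    return prob

print(sequence_probability(["the", "mythical", "unicorn"], bigram_p))  # 0.2 * 0.01 * 0.1 = 0.0002
```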
– P(I want to eat Chinese food) =
P(I | <start>) * P(want | I) * P(to | want) *
P(eat | to) * P(Chinese | eat) * P(food | Chinese)
[Tables omitted: bigram counts and bigram probabilities, indexed by the (N-1)th term (rows) and the Nth term (columns).]
– P(I want to eat Chinese food) =
P(I | <start>) * P(want | I) * P(to | want) *
P(eat | to) * P(Chinese | eat) * P(food | Chinese)
• = .25 * .32 * .65 * .26 * .02 * .56 ≈ .00015
• P(I want to eat British food) = P(I | <start>)
P(want | I) P(to | want) P(eat | to)
P(British | eat) P(food | British) =
.25 * .32 * .65 * .26 * .001 * .60 ≈ .0000081
• vs. P(I want to eat Chinese food) ≈ .00015, so the Chinese-food sentence is judged roughly 20 times more likely
• Probabilities seem to capture "syntactic" facts and "world knowledge":
– eat is often followed by an NP
– British food is not too popular
(The snippet below just multiplies out the two products quoted above.)
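The two sentence probabilities above can be checked by simply multiplying the conditional probabilities quoted on these slides; this tiny snippet does exactly that.

```python
from math import prod

# Conditional probabilities as quoted above (BeRP bigram estimates).
chinese = prod([0.25, 0.32, 0.65, 0.26, 0.02, 0.56])    # P(I want to eat Chinese food)
british = prod([0.25, 0.32, 0.65, 0.26, 0.001, 0.60])   # P(I want to eat British food)
print(f"{chinese:.6f} vs {british:.7f}")                 # ~0.000151 vs ~0.0000081
```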
• What's being captured with ...
– P(want | I) = .32
– P(to | want) = .65
– P(eat | to) = .26
– P(food | Chinese) = .56
– P(lunch | eat) = .055
• What about...
– P(I | I) = .0023
– P(I | want) = .0025
– P(I | food) = .013
• BeRP (the Berkeley Restaurant Project) is a testbed for speech recognition.
• We are getting:
– P(I | I) = .0023 ("I I I I want")
– P(I | want) = .0025 ("I want I want")
– P(I | food) = .013 ("the kind of food I want is ...")
• In other words, our corpus includes disfluencies. A bad choice if we wanted to process written text!
• As we increase the value of N, the accuracy of the n-gram model increases, since the choice of the next word becomes increasingly constrained
• Generating sentences with random unigrams (a minimal sampling sketch follows these examples)...
– Every enter now severally so, let
– Hill he late speaks; or! a more to leg less first you enter
• With bigrams...
– What means, sir. I confess she? then all sorts, he is trim, captain.
– Why dost stand forth thy canopy, forsooth; he is this palpable hit the King Henry.
• Trigrams
– Sweet prince, Falstaff shall die.
– This shall forbid it should be branded, if renown made it empty.
• Quadrigrams
– What! I will go seek the traitor Gloucester.
– Will you not tell me who I am?
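The Shakespeare sentences above come from sampling generated text out of n-gram models. Below is a minimal sketch of the bigram case: repeatedly sample the next word given the previous one. The bigram counts shown are made up for illustration; the lecture's examples were generated from a full Shakespeare corpus.

```python
import random
from collections import Counter

# Hypothetical bigram counts, for illustration only.
bigram_counts = Counter({("<s>", "sweet"): 3, ("sweet", "prince"): 2,
                         ("sweet", "sorrow"): 1, ("prince", "</s>"): 2,
                         ("sorrow", "</s>"): 1})

def generate(bigram_counts, max_len=20):
    """Generate a sentence by repeatedly sampling the next word given the previous one."""
    word, out = "<s>", []
    for _ in range(max_len):
        candidates = {w2: c for (w1, w2), c in bigram_counts.items() if w1 == word}
        if not candidates:
            break
        word = random.choices(list(candidates), weights=list(candidates.values()))[0]
        if word == "</s>":
            break
        out.append(word)
    return " ".join(out)

print(generate(bigram_counts))  # e.g. "sweet prince" or "sweet sorrow"
```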
• There are 884,647 tokens, with 29,066 word form types, in about a one million word Shakespeare corpus
• Shakespeare produced 300,000 bigram types out of 844 million possible bigrams: so, 99.96% of the possible bigrams were never seen (have zero entries in the table)
• Quadrigrams are worse: what comes out looks like Shakespeare because it is Shakespeare
• If we repeated the Shakespeare experiment but trained our n-grams on a
Wall Street Journal corpus, what would we get?
• This has major implications for corpus selection or design
• A few events occur with high frequency
• Many events occur with low frequency
• You can quickly collect statistics on the high frequency events
• You might have to wait an arbitrarily long time to get valid statistics on low frequency events
• Some of the zeroes in the table are really zeros, but others are simply low-frequency events you haven't seen yet. How do we address this?
Zipf’s Law
George Kingsley Zipf (1902-1950) noted that for many frequency distributions, the n -th largest frequency is proportional to a negative power of the rank order n.
Let t range over the set of unique events. Let f(t) be the frequency of t and let r(t) be its rank. Then, for all t:
r(t) ≈ c * f(t)^(-b) for some constants b and c.
• Applies to a surprising range of things, including word frequencies in corpora (see the sketch below).
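A quick way to see Zipf's law in a corpus is to sort word frequencies and check that rank times frequency stays roughly constant. A minimal sketch, assuming some plain-text file corpus.txt (the filename is an assumption):

```python
from collections import Counter

# Empirical check of Zipf's law on any plain-text file.
text = open("corpus.txt", encoding="utf-8").read().lower().split()
freqs = sorted(Counter(text).values(), reverse=True)

for rank in (1, 10, 100, 1000):
    if rank <= len(freqs):
        # Under Zipf's law, rank * frequency should stay roughly constant (b near 1).
        print(rank, freqs[rank - 1], rank * freqs[rank - 1])
```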
• Every n-gram training matrix is sparse, even for very large corpora (Zipf's law)
• Solution: estimate the likelihood of unseen n-grams
• Problems: how do you adjust the rest of the corpus to accommodate these
‘phantom’ n-grams?
• Methods to handle this are called smoothing .
Smoothing is like Robin Hood: Steal from the rich and give to the poor (in probability mass)
From Snow, http://www.stanford.edu/class/linguist236/lec11.ppt
• For unigrams:
– Add 1 to every word (type) count
– Adjust by the factor N/(N + V), where N = number of tokens and V = number of types
– Smoothed count (adjusted so the counts still sum to N): c_i* = (c_i + 1) * N / (N + V)
– Normalize by N to get the new unigram probability: p_i* = (c_i + 1) / (N + V)
• For bigrams:
– Add 1 to every bigram count: c(w_{n-1} w_n) + 1
– Increment each first-word (unigram) count by the vocabulary size: c(w_{n-1}) + V
– Smoothed bigram probability: p*(w_n | w_{n-1}) = (c(w_{n-1} w_n) + 1) / (c(w_{n-1}) + V) (a sketch follows below)
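A minimal sketch of the add-one (Laplace) bigram estimate just described, assuming unigram and bigram Counter objects like those in the earlier counting sketch:

```python
# Add-one (Laplace) smoothed bigram estimate, reusing Counter-style counts as above.
# p*(w_n | w_{n-1}) = (c(w_{n-1} w_n) + 1) / (c(w_{n-1}) + V)
def laplace_bigram(prev, word, bigrams, unigrams):
    V = len(unigrams)                   # vocabulary size (number of word types)
    return (bigrams[(prev, word)] + 1) / (unigrams[prev] + V)

# Even a bigram never seen in training now receives a small non-zero probability.
```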
– Discount: ratio of new counts to old (e.g., add-one smoothing changes the count of the BeRP bigram (to | want) from 786 to 331, a discount d_c = .42, and p(to | want) from .65 to .28)
– But this changes counts drastically:
• too much weight given to unseen n-grams
• in practice, unsmoothed bigrams often work better!
• A zero n-gram is just an n-gram you haven’t seen yet…but every n-gram in the corpus was unseen once…so...
– How many times did we see an n-gram for the first time?
Once for each n-gram type (T)
– Estimate the total probability of unseen bigrams as T / (N + T)
– View training corpus as series of events, one for each token
(N) and one for each new type (T)
• We can divide that probability mass equally among unseen bigrams, or we can condition the probability of an unseen bigram on the first word of the bigram (see the sketch below)
• Discount values for Witten-Bell are much more reasonable than Add-One
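A hedged sketch of the simple, unconditioned Witten-Bell estimate described above: reserve probability mass T/(N+T) for unseen bigrams and split it equally among them. The function and variable names are assumptions for illustration.

```python
# Simple Witten-Bell-style estimate, following the unconditioned version above.
def witten_bell(bigrams, vocab_size):
    N = sum(bigrams.values())          # total bigram tokens observed
    T = len(bigrams)                   # number of distinct bigram types seen
    Z = vocab_size ** 2 - T            # number of possible-but-unseen bigram types
    p_seen = {bg: c / (N + T) for bg, c in bigrams.items()}
    p_unseen_each = T / (N + T) / Z if Z else 0.0   # unseen mass split equally
    return p_seen, p_unseen_each
```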
• For, e.g., a trigram model:
– Compute unigram, bigram, and trigram probabilities
– In use:
• Where the trigram is unavailable, back off to the bigram if available; otherwise use the unigram probability
• E.g., an omnivorous unicorn (a minimal sketch follows below)
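A minimal sketch of the back-off idea just described: use the trigram estimate when its count is non-zero, otherwise fall back to the bigram, then the unigram. A full back-off model (e.g. Katz back-off) also discounts the higher-order estimates so the probabilities still sum to one; that step is omitted here.

```python
# Minimal back-off sketch: prefer the trigram estimate, fall back to bigram, then unigram.
# Assumes Counter-style trigram, bigram, and unigram counts and total token count N.
def backoff_prob(w1, w2, w3, trigrams, bigrams, unigrams, N):
    if trigrams[(w1, w2, w3)] > 0:
        return trigrams[(w1, w2, w3)] / bigrams[(w1, w2)]
    if bigrams[(w2, w3)] > 0:
        return bigrams[(w2, w3)] / unigrams[w2]
    return unigrams[w3] / N            # unigram estimate as a last resort
```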
• N-gram probabilities can be used to estimate the likelihood
– Of a word occurring in a context of the previous N-1 words
– Of a sentence occurring at all
• Smoothing techniques deal with problems of unseen words in a corpus
• N-grams are useful in a wide variety of
NLP tasks