Liz Salesky  Math 20, Project Paper
N-gram Language Models
If someone asked you to fill in the blank in the sentence ‘How do we predict what word
comes ____’ you would undoubtedly answer with ‘next.’ Even for a more ambiguous sentence
like ‘I need to study for my Probability ____’ we can immediately limit our responses to
semantically similar words like ‘final,’ ‘exam,’ or ‘test.’ Why is this? Why can we listen to music
and be able to roughly sing along, though we might not have heard the song before? Our
brains think probabilistically, though we may not consciously realize it. This is particularly
obvious when it comes to our language usage. Native speakers of a language seem to have a
built-in grammar; we may not be able to explain why we can use certain words in some
situations, and not others, or why a certain preposition needs to come after a given word or
phrase, but somehow we’ve been trained from a young age to realize how to sequence words.
Searching for the right word, or even correct letter when we’re spelling, we are looking for
words that fit a certain context. A popular application is reading text without vowels; ‘w cn stll
rd wtht ny vwls,’ because after seeing and hearing so many examples of the English language,
we intuitively know the probability of certain vowels in certain locations and can fill in the
gaps. Computational linguistics has found hosts of ways to use statistically organized
sequences of words; machine translation, speech recognition, and text prediction, to just
name a few. Language models comprising sequences of n words, or n-grams, can be used to
describe certain aspects of a corpus, but can also predict new forms based on what has come
before. The n-gram models discussed in this paper can predict the most probable next word
given a history, using simple conditional probability, as well as tell us how predictable a large
text is given a unit length of n words.
At base, to compute probability, we must count something. In computational
linguistics, we are counting words! In talking about n-grams, we’re concerned with the
frequencies of words in a corpus, and as the value of n gets larger, we’re also concerned with
the relative frequencies of words. We want to know, given a history of n-1 words, how
probable it is that the nth is a particular word. This practice has many very prominent
applications, like machine translation, text prediction, spelling correction, and more, which
we will look at later. Let us now look at how we determine the probability of the nth word. At
the heart of the matter is relative frequency – out of all these possible outcomes, what is the
probability that we have outcome x? So, if we have a text, what is the probability of the word
‘the?’ Intuitively, we know that it is just the count of ‘the’ out of the total number of words in the text: p(the) = c(the)/N, where c(w) is the number of times a word w occurs and N is the total number of words. Now we already know how to construct a unigram language model! We are
looking at the probability of a single word, or gram, within a language model, which is our
corpus, constructed via probability. A unigram model is a 0th order Markov approximation; it
actually involves no context at all, since we are looking at a single word. The probability of any
word is just its relative frequency within the corpus, as the above formula suggests. Looking
at a 7-word sentence like ‘the girl is going to the store’ with a unigram language model, we’re
asking ‘Given this sentence, what is the likelihood of this single word?’ So, we may create a
chart with each of the 7 words, marking their probabilities as the number of times they
appear in the sentence out of the total number of words in the sentence. We see that every
word has a probability of 1/7, then, except for ‘the,’ which has a probability of 2/7. Note that the total number of words in the sentence is not the total number of unique words, or types, but rather the number of word occurrences, or tokens.
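To make this concrete, here is a minimal sketch of the unigram computation in Python (the language used for this project, though this is illustrative code rather than the actual 20project.py):

    from collections import Counter

    # Our 7-word example sentence, split into tokens
    tokens = 'the girl is going to the store'.split()

    counts = Counter(tokens)   # c(w): how many times each type w occurs
    total = len(tokens)        # N: the total number of tokens

    # Unigram probability: p(w) = c(w) / N
    unigram_p = {word: count / total for word, count in counts.items()}

    print(unigram_p['the'])    # 2/7, about 0.2857
    print(unigram_p['girl'])   # 1/7, about 0.1429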
Progressing, though, we want to know how likely it is, based on the text we’re looking at, that we see both the words ‘the’ and ‘and.’
Note that we’re not yet talking about distributions; we don’t need to see them next to each
other, we just need to see both of them throughout the text. This alone is actually quite useful
in language identification! Seeing patterns of words and word structure can separate related
languages, like Dutch and Afrikaans, even though the two languages are mutually intelligible.
This joint probability is given by p(the, and) = p(the) * p(and) = [c(the)/N] * [c(and)/N] (treating the two words as independent, as a unigram model does), and we can infer what will happen as we
increase the number of words we’re asking about. Let us put these into context – literally. If
we’re also concerned about the context a word is in, we’re asking ‘what is the probability of
word x given that word y precedes it?’ But, we recognize that this is just conditional
probability! Given our formula above for the probability of x, the definition of conditional probability tells us that this question translates to p(w_n | w_n-1) = c(w_n-1, w_n) / c(w_n-1).1 So, we now know how to construct a
bigram language model! A bigram model is a 1st order Markov approximation; as the formula
above shows us, it asks for the probability of a word x, given its predecessor y in the text. This
is the number of times that we see both words, out of the total number of times we see the
predecessor; we’re looking for the percentage of times that x follows y out of all the times that
something follows y. Continuing on, we note that a trigram model could be constructed using
p(w_n | w_n-2, w_n-1) = c(w_n-2, w_n-1, w_n) / c(w_n-2, w_n-1), and so we can generalize our cases. A unigram model gives us p(X_n | X_n-1, X_n-2, ..., X_1) = p(X_n), while a bigram model is represented by p(X_n | X_n-1, X_n-2, ..., X_1) = p(X_n | X_n-1), a trigram by p(X_n | X_n-1, X_n-2, ..., X_1) = p(X_n | X_n-1, X_n-2), and an N-gram model in general conditions only on the previous N-1 words: p(X_n | X_n-1, X_n-2, ..., X_1) = p(X_n | X_n-1, X_n-2, ..., X_n-(N-1)).
This follows directly from the definition of conditional probability. Before we move on to the significance of such models, let us note a helpful chain rule. In performing the calculations we just listed, as n increases, we aren’t losing information; if we’re looking at a trigram model, we first need the word x in order to consider the probability of y given x, which in turn we need in order to consider the probability of z given x and y. So, the joint probability of a trigram, or three-word sequence, expands to p(x, y, z) = p(x) * p(y|x) * p(z|x, y), and our generalized case is p(x_1, x_2, x_3, ..., x_n) = p(x_1) * p(x_2|x_1) * p(x_3|x_1, x_2) * ... * p(x_n|x_1, x_2, ..., x_n-1).
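As an illustration of the bigram formula and the chain rule together, here is a hedged Python sketch (again illustrative code, not the project program) that estimates p(w_n | w_n-1) = c(w_n-1, w_n) / c(w_n-1) from counts and then scores a short word sequence under the first-order Markov approximation:

    from collections import Counter

    tokens = 'the girl is going to the store'.split()

    unigram_counts = Counter(tokens)
    bigram_counts = Counter(zip(tokens, tokens[1:]))

    def p_next(word, prev):
        # p(w_n | w_n-1) = c(w_n-1, w_n) / c(w_n-1)
        return bigram_counts[(prev, word)] / unigram_counts[prev]

    def p_sequence(words):
        # Chain rule under the bigram (1st order Markov) approximation:
        # p(w_1, ..., w_n) ~= p(w_1) * p(w_2|w_1) * ... * p(w_n|w_n-1)
        prob = unigram_counts[words[0]] / len(tokens)
        for prev, word in zip(words, words[1:]):
            prob *= p_next(word, prev)
        return prob

    print(p_next('girl', 'the'))              # c(the, girl) / c(the) = 1/2
    print(p_sequence('the girl is'.split()))  # (2/7) * (1/2) * (1/1)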
In using n-grams, we must ask ourselves how much history we should use. What is the
perfect balance between added information and additional computation? So, while it is actually quite useful to know the probability of a certain sequence, we need to ask ourselves what n-grams are actually telling us about a text as a whole. In fact, constructing an n-gram model for
a text can answer many, varied questions! How complex is English? How compressible? How
predictable? How redundant? How efficient? All of these can be answered by the probabilities
of the n-grams a text contains. A particular measure for this is entropy. Entropy is, broadly, as Merriam-Webster puts it, ‘the degree of disorder or uncertainty in a system.’2 Here, our
system is the n-gram model trained on our text. Any text is going to be uncertain to some
degree – there isn’t an infinite history, with all words distributed as they are in grammatical
speech around the world; that would be impossible. But, looking at the entropy of a text for
different values of n in n-gram models can allow us to formulate hypotheses about the
predictability and uniformity of a text, and comparing these different entropy values can
1 Manning, Christopher D., and Hinrich Schütze. Foundations of Statistical Natural Language Processing. Cambridge, MA: MIT Press, 1999.
2 http://www.merriam-webster.com/dictionary/entropy
further those intuitions. So, what is entropy in language and how do we calculate it? Entropy
is ‘a convenient mathematical way to estimate the uncertainty, or difficulty of making
predictions, under some probabilistic model.’3 It’s telling us how hard it is to predict the word
that comes next. Its formula is H(p(X)) = –Σ_i p(X = x_i) * log2 p(X = x_i).4 Looking at this closely, we
notice that it is a negative weighted average of the logged probabilities of our words – the
negative is there because the log gives us a negative value. The log inside the sum shows us
that the entropy of a text actually reflects the text’s uncertainty in the number of bits it would
take to store the text, if we did so in the most efficient way possible (by grouping sequences of
words of length n that occur frequently, so that they only need to be stored once). So, entropy
is also a lower bound on the number of bits it would take to encode information. Let’s look at
a unigram model. The entropy of a unigram language model is calculated by summing the weighted quantity –p(x_i) * log2 p(x_i) over each type x_i occurring in the document. The
maximum possible entropy occurs when the possibilities are uniformly distributed, as we saw
when we said the maximum value of p*q in calculating confidence intervals was ¼. So, for a unigram model, this means that each type occurs the same number of times, not dependent on
order. A unigram model, as we’ve said, does not use context. Even with entropy, we see that
the value is independent of order; we might as well have just had a list with the same words.
Taking our 7-word example sentence from above, with the same probabilities, we see we have H(p(X)) = –[p(x_1) log2 p(x_1) + p(x_2) log2 p(x_2) + … + p(x_7) log2 p(x_7)], or H(p(X)) = –[5*(0.1429 log2 0.1429) + 0.2857 log2 0.2857], or 2.5217 bits.
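A small Python sketch of this unigram-entropy calculation (following the formula above; illustrative code, not the project program) is:

    import math
    from collections import Counter

    tokens = 'the girl is going to the store'.split()
    counts = Counter(tokens)
    total = len(tokens)

    # H = -sum over types x_i of p(x_i) * log2 p(x_i)
    entropy = -sum((c / total) * math.log2(c / total) for c in counts.values())

    print(entropy)   # about 2.52 bits, matching the hand calculation above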
The entropy of a bigram model is a little trickier; we’re weighting the number of times that a word x occurs
first in a bigram out of all the words in the text, but then we have the weighted probabilities of
each of the ‘next words.’ So, if our first set of bigrams were ‘the’ followed by ‘dursleys’ ½ of
the time, ‘cupboard’ ¼ of the time, and ‘stairs’ the other ¼, we would need to know as well
that ‘the_x’ occurs 16 times in a text with 40 words, and then we would be able to begin to
compute the entropy as such; H(p(X)) = –[16/40 ( ½ log2 ½ + ¼ log2 ¼ + ¼ log2 ¼ ) + … +
p(xy) log2 p(xy)]. We’re taking a weighted average of the weighted averages of the bits needed to encode
each bigram. Discovering the number of bits necessary to encode all bigrams that start with a
certain word has replaced the number of bits necessary to encode a unigram, as we might
have suspected. To put these calculations into context, let us look at the results for a large text in Python.
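A corresponding sketch for the bigram entropy (illustrative code; here the weights are taken over all bigrams rather than all words, which is essentially the same thing for a long text) weights the entropy of each history’s next-word distribution by how often that history occurs:

    import math
    from collections import Counter, defaultdict

    def bigram_entropy(tokens):
        # Group the counts of next words by their history word
        followers = defaultdict(Counter)
        for prev, word in zip(tokens, tokens[1:]):
            followers[prev][word] += 1

        num_bigrams = len(tokens) - 1
        entropy = 0.0
        for prev, next_counts in followers.items():
            c_prev = sum(next_counts.values())   # how often prev occurs as a history
            weight = c_prev / num_bigrams        # p(prev)
            # Entropy of the next-word distribution given this history
            h_given_prev = -sum((c / c_prev) * math.log2(c / c_prev)
                                for c in next_counts.values())
            entropy += weight * h_given_prev
        return entropy

    print(bigram_entropy('the girl is going to the store'.split()))

A computation of this kind, run over the chapter file rather than this toy sentence, is what produces the bigram figure reported in the results below.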
The sample text that I used for my computations was the second chapter of the first
Harry Potter book, Harry Potter and the Sorcerer’s Stone, file sschap2.txt. It is approximately
18 pages in the book, depending on the edition. It contains 3821 words, 972 of which are
unique. When speaking about the English language, we know that some words will occur
more often than others, but until we sit down and see numbers like these two, which are
vastly different, we don’t realize quite how large the disparity is. Of the 972 unique words in
this text, only 393 occur more than once, meaning 579 words occur only once out of 3821
total words in this chapter. They account for only 15% of the text! As well, only 50 words
occur more than 10 times, and only 26 occur more than 20 times. What does this mean for our
n-grams? The vast majority of the words we see are going to be repeated at some point in the
chapter; so, if we are looking at bigrams or larger models, there are going to be multiple
options to predict after a given word. Every such option increases the entropy, or uncertainty,
3 Irvine, Ann. "Information Theory." Hanover, NH. 18 Oct. 2011. Lecture.
4 Shannon, C. E. "Prediction and Entropy of Printed English." Bell System Technical Journal, 15 Sept. 1950.
of the text. This might lead one to conjecture that the bigram model will have a higher entropy than the unigram model, which does not use context. But really, under a unigram model any word is just as likely to follow any other word, so ignoring context leaves far more uncertainty; conditioning on the previous word makes the bigram model much less uncertain. This is reflected in the entropies of the unigram and bigram models for Harry Potter, which are 8.131 bits and 2.830 bits, respectively. Before describing the models more
thoroughly, though, let us note that context does not just constitute the history before a word,
but also the location of a word in a sentence. Certain words are bound to be more likely at the
beginnings or ends of sentences than others, and similarly some words are more likely to
occur around punctuation, which marks boundaries. For this project, I stripped out punctuation and did not make use of it, but did choose to mark the beginnings of sentences by inserting the token BOS. In fact, if we look at the top six most frequent words,
we see that they are ‘BOS,’ ‘the,’ ‘a,’ ‘he,’ ‘and,’ and ‘Harry.’ These occur 196, 180, 93, 90, 84 and
73 times in this text, respectively. Remembering that entropy is a weighted average, we notice
that words with higher frequencies will be weighted more heavily, because they occur more
often in the text. This means that their high conditional entropies (there are multiple different bigrams starting with the same word) will influence the text’s entropy more, while the 579 words that occur only once, whose following words then have a probability of 1, are overshadowed. This trend ties into my choice to use only unigram and bigram models in my
program; first, the calculation of entropy for a bigram model already takes several seconds,
and secondly, part of the pull of these models is that they can be used to make predictions! In
my program, because bigrams are the largest n-gram stored, and the history used to calculate
the probability of a bigram is a unigram, the prediction in the program can only take in a
single word, like ‘aunt,’ which predicts ‘petunia.’ To ask what would come after an n-gram
would be only to ask what would come after the last word. However, we can ask a model to
evaluate the probability of a sentence that it hasn’t seen yet; an n-gram model will break the
sentence into sequences of length n, but the probability of a sentence given an n-gram model
will be 0 if any of the n-grams that make up the sentence are not in the model. Lambda-smoothing can be used to rectify this, as can using a lower-order model or more data,5 but for
this project, I wanted to keep calculations a little more straightforward, since this is about the
probability rather than the linguistics.
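To illustrate the prediction and sentence-scoring ideas just described, here is a hedged Python sketch in the same spirit as, but not copied from, 20project.py; the file sschap2.txt is the chapter file described above, the preprocessing here is simpler than the BOS marking used in the project (just lowercasing and whitespace splitting), and the add-lambda smoothing is one simple way to avoid zero probabilities for unseen bigrams:

    import math
    from collections import Counter

    def train(tokens):
        unigrams = Counter(tokens)
        bigrams = Counter(zip(tokens, tokens[1:]))
        return unigrams, bigrams

    def predict_next(word, unigrams, bigrams):
        # Most probable next word: the w maximizing c(word, w) / c(word)
        candidates = {w2: c for (w1, w2), c in bigrams.items() if w1 == word}
        return max(candidates, key=candidates.get) if candidates else None

    def sentence_logprob(words, unigrams, bigrams, lam=0.1):
        # Add-lambda smoothed bigram probability of a word sequence:
        # p(w | prev) = (c(prev, w) + lam) / (c(prev) + lam * V)
        V = len(unigrams)
        logp = 0.0
        for prev, word in zip(words, words[1:]):
            p = (bigrams[(prev, word)] + lam) / (unigrams[prev] + lam * V)
            logp += math.log2(p)
        return logp

    # Simple preprocessing: lowercase and split on whitespace
    tokens = open('sschap2.txt').read().lower().split()
    unigrams, bigrams = train(tokens)

    print(predict_next('aunt', unigrams, bigrams))   # expected: 'petunia'
    print(sentence_logprob('harry had never seen'.split(), unigrams, bigrams))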
One clear and seemingly ever-present application of n-gram modeling is text
prediction. The text prediction or T9 option on a lot of phones is really just modeled on n-gram probabilities! When you’re typing with certain keys, the algorithm is suggesting the
most likely combination of the characters you’ve pressed given the previous word typed, or
possibly multiple words typed. My phone, for example, uses a bigram model – I can tell
because it currently suggests ‘nog’ as the next word no matter what word I enter before ‘egg.’
In a similar vein, n-grams are used in spelling correction. A lot of the time, one can catch
spelling errors by having an algorithm consult a stored dictionary that will notice when a
word has been typed that it doesn’t recognize. For example, a red underline appears in Microsoft Word because the typed word is not in Word’s built-in dictionary. However, perhaps a more efficient (storage-wise, at the very least)
method is to use n-grams! Context-sensitive spelling error correction, though, uses probability to detect mistakes and suggest alternatives based on the previous words;
5 Manning, Christopher D., and Hinrich Schütze. Foundations of Statistical Natural Language Processing. Cambridge, MA: MIT Press, 1999.
this involves distinguishing between ‘he’s sitting over their,’ ‘he’s sitting over they’re,’ and ‘he’s
sitting over there.’ Another similar application is one of the most groundbreaking aspects of
the computational linguistics field: statistical machine translation. Google Translate, for
example, is one of the best statistical machine translators out there. N-grams here are perhaps
most obviously useful in providing not just good matches to the English text or the foreign
language text, but both! This means distinguishing word order, prepositions, affixes, and various other language features in order to say that ‘TV in Jon appeared’ and ‘TV on Jon appeared’ are much less likely than ‘Jon appeared on TV.’ N-grams are also useful in
augmentative communication systems for the disabled, speech recognition, and other
language processes with noisy data in which words’ context can be used to determine the
most likely intended input, as well as more fundamental natural language processing
applications like part-of-speech tagging and language generation.6
The famous linguist Noam Chomsky once said that “it must be recognized that the
notion ‘probability of a sentence’ is an entirely useless one, under any known interpretation of
this term.”7 Chomsky almost certainly over-generalized, but as we have come to see, n-gram
models are just approximations of language, because we can’t possibly feed them infinite
vocabulary and grammar. This is not to demean their use, though, since people compose
language in much the same way models do – probabilistically. This means that not only can
probability be used to describe language, but it can also be used to tell us something about
language itself! N-gram models can be used to reveal patterns within a single language, or cross-linguistically, that would be much harder to notice without such programs. Language is systematic, but not completely so. There is no universal theory of language, and we cannot necessarily get from any word to any other word, depending on our position within a sentence; language is not ergodic. Probability and statistics can do much more than just
describe the rules that regulate these changes. The program 20project.py can accurately tell
us that ‘petunia’ should follow ‘aunt,’ or that ‘had’ will most likely follow ‘harry.’ Different
values of n give us very different models, though. A unigram language model will just as soon
suggest ‘the’ follow ‘the’ because of its frequency, whereas a bigram model might predict that
‘dursleys’ comes next. I think that n-grams are a really fascinating application of conditional probability because of this: they use something relatively simple to describe a very
complex entity, language, in a very approachable way and allow us to learn new things, like
the uncertainty of a text, that we might not have been able to see as easily otherwise.
6 Language generation: this program generates text in multiple languages based on n-gram character- and word-based models trained on prominent novels and other large texts. http://johno.jsmf.net/knowhow/n-grams/index.php
7 Jurafsky, Dan, and James H. Martin. Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition. Upper Saddle River, NJ: Pearson Prentice Hall, 2008. Print. 57.
The entropy of the unigram model for the file is 8.131 bits.
The entropy of the bigram model for the file is: 2.830 bits.
The 10 most frequent unigrams and their frequencies in this text are:
BOS      196
         186
the      180
a         93
he        90
and       84
harry     73
was       70
had       65
to        63
The number of tokens (words) in this text is 3821 and the number of types (unique words) is
972. Of those 972 types, only 393 occur more than once, only 50 occur more than 10 times,
and only 26 occur more than 20 times!
The 10 most frequent bigrams and their frequencies in this text are:
BOS_             93
aunt_petunia     22
on_the           17
in_the           17
uncle_vernon     16
it_was           14
the_dursleys     13
he_had           13
of_the           12
had_been         12
The word entered was aunt, and the word the bigram model predicts will come next is
petunia.