
Introduction
Imagine having the capability to predict the next word out of somebody’s mouth. In some cases,
this is not as difficult as one might think. Consider the following sentence:
“When I left the house this morning I forgot to lock the _____.”
In cases like this, we know that some words, such as door, are far more likely to occupy the blank than
are others, such as photosynthesis or a. The point is that it is easier to predict the next word of a
sentence if we know which words preceded it. This is the motivation behind n-gram models (example
concept from J&S, 2008).
Using n-gram models, provided that we have a sufficiently large corpus of training data, we can
predict the nth word in a sequence of n words by looking at the previous n–1 words in the sequence and
determining, based on what we have seen in our corpus, which word is most likely to come next.
Returning to our example sentence above, we can deduce, given the first 12 words of the sentence, that
the probability that the word occupying the blank will be door, P(door|w1,…,w12), is quite high.
However, since most sentences are novel and unique, we have a severe data sparsity problem (Pinker
??). To ameliorate this sparsity, we make a Markov assumption – that is, we approximate this
probability by instead conditioning on a shorter sequence of words prior to the blank (for our purposes,
this sequence will be between 0 and 2 words). Thus, we can approximate P(door|When, I, left, the,
house, this, morning, I, forgot, to, lock, the) by calculating P(door|lock, the). Since it is far more likely
that the 2-gram “lock the” will have occurred in our training corpus, we are more likely to be able to
compute the probability of the 3-gram in which we are interested: “lock the door”.
This statistical modeling of word sequences is called n-gram modeling, and we can compute the
probabilities of an entire sentence by multiplying together the conditional probability of each n-gram in
the sentence. With a large corpus of text to train on, we can use n-grams to build language models,
which can tell us the probability of a given sentence or even an entire document.
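As a concrete illustration, the following Python sketch estimates a trigram probability from raw counts and multiplies trigram probabilities across a sentence. The toy corpus, the whitespace tokenization, and the function names are invented for illustration and are not part of our actual system (and no smoothing is applied here).

from collections import Counter

def ngrams(tokens, n):
    # Slide a window of length n across the token list.
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

# Toy corpus, purely for illustration.
corpus = "i forgot to lock the door . i forgot to lock the car .".split()
trigram_counts = Counter(ngrams(corpus, 3))
bigram_counts = Counter(ngrams(corpus, 2))

def trigram_prob(w1, w2, w3):
    # Maximum-likelihood estimate: count(w1 w2 w3) / count(w1 w2).
    return trigram_counts[(w1, w2, w3)] / bigram_counts[(w1, w2)]

print(trigram_prob("lock", "the", "door"))   # P(door | lock, the) = 0.5 on this toy corpus

# Sentence probability: the product of the conditional trigram probabilities.
sentence = "i forgot to lock the door .".split()
p = 1.0
for w1, w2, w3 in ngrams(sentence, 3):
    p *= trigram_prob(w1, w2, w3)
print(p)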
N-gram models have many applications in natural language processing, such as speech
recognition, where the aim is to find the most likely sequence of words reflecting the input audio signal.
The task to which we apply n-gram models, however, is that of text genre classification, where the goal
is to build a classifier that is able to determine the genre or topic of a given text. This technology is
essential for medical researchers, among others, who need to locate research papers dealing with
specialized topics in the multitude of available medical literature. Several artificial intelligence
techniques, such as decision trees, maximum entropy models, and perceptrons, have been employed in
devising solutions to this problem (Manning and Schutze, 1999). Given that the research topic in which
we are primarily interested is n-gram language modeling, however, our approach to text genre
classification relies primarily on these models.
For our project, we build n-gram language models from four corpora consisting of news text,
political blog text, transcripts of telephone conversations, and fiction, respectively. We then use these
models to build a classifier, which classifies documents into one of the genres listed above by
calculating which model best predicts the document. Our results show that classifiers built using this
method can achieve reasonably high accuracy. Below we discuss in more detail the techniques we use
to build the n-gram language models, the challenges in pre-processing the corpora, the parameters we
varied in order to maximize the accuracy of our text genre classifier, as well as our experiments and the
degree to which the accuracy of our classifier improves as the amount of training data is increased.
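The decision rule itself can be sketched as follows; this is an illustrative outline only, and the model interface (one function per genre returning P(w3 | w1 w2)) is a placeholder rather than our actual code. The sketch sums log probabilities instead of multiplying raw probabilities, which is mathematically equivalent and avoids numerical underflow.

import math

def classify(document_trigrams, models):
    # `models` maps a genre name to a function giving P(w3 | w1 w2);
    # both the names and the interface are placeholders for illustration.
    best_genre, best_score = None, float("-inf")
    for genre, prob_fn in models.items():
        # Sum of log probabilities = log of the product over the document's trigrams.
        score = sum(math.log(prob_fn(w1, w2, w3)) for w1, w2, w3 in document_trigrams)
        if score > best_score:
            best_genre, best_score = genre, score
    return best_genre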
Methods
Before jumping straight to the classification part of the problem, the first step is to decide
which corpora would be useful for building the language models.
Common natural language processing practice requires that a percentage of the data be reserved
and hidden from training. This test set is usually in a 1:9 ratio with the training set. For our purposes it
was also necessary to reserve an additional ten percent of the data as a development set for tuning the
lambda parameters used in the interpolation step described below. These parameters weight the
contributions of the unigram, bigram, and trigram probabilities so that the model generalizes better to
unseen data. While this may seem counterintuitive because it decreases the training-to-test ratio, the
reasoning is to be able to see, on a sliding scale, how much training data is needed to accurately
classify a genre.
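A minimal sketch of such a 10-10-80 split, assuming the corpus has already been broken into one sentence per line, might look like the following; the shuffling, seed, and helper name are illustrative rather than our actual code.

import random

def split_corpus(sentences, seed=0):
    # Shuffle, then carve off 10% for testing, 10% for development,
    # and keep the remaining 80% for training.
    random.Random(seed).shuffle(sentences)
    tenth = len(sentences) // 10
    test = sentences[:tenth]
    dev = sentences[tenth:2 * tenth]
    train = sentences[2 * tenth:]
    return train, dev, test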
When the data is portioned out into the 10-10-80 percentages, there must be enough words in
each set to make the study meaningful. While we gathered as many corpora as possible, some sources,
like the famous Brown Corpus, were left out because of their formatting, size, or availability.
The corpora drawn from the American National Corpus (ANC) contain around 11 million
words: the Switchboard corpus with around 3 million words, the New York Times corpus with around
3.2 million words, and the Slate Magazine corpus with around 4.3 million. We also used the fiction
section of the Gutenberg Corpus, excluding the works of Shakespeare, tallying in at around 2 million
words. This made four genres for classification, selected in part because their word counts are roughly
comparable.
Before tossing all the data at the classification code, the next step was to go through the data set
and normalize it. There is a great deal of variability in the types of corpora available, so it was
imperative that all the genres have the same formatting, thus preserving the n-grams.
Because each of the corpora represents different phenomena in language, each had its own
complications for normalizing. For example, corpora that come from spoken language may not have the
breadth of vocabulary that would be helpful for the study and may have a heightened count of
disfluencies. Disfluencies are problematic because there is often no recognizable pattern for
algorithmically editing them out of the data set. Another problematic corpus could be a volume of
works from an era that has gone out of style. Dated word forms could skew the counts and not
yield a realistic model of modern usage. Prose and poetic styles often make use of metaphor and do not
always portray the standard usage of a word or sequence. These are the reasons the works of
Shakespeare were omitted. These are a few of the situations that we foresaw as possible problems,
and they helped to guide the data collection process.
Most of the normalizing required using regular expressions to adjust the formatting to
something the language model code was expecting. This involved breaking every sentence onto its own
line and removing extraneous characters. An initial pass was done to pull out the cleanest sentences and
separate them from those that would need more elaborate processing. Punctuation was removed, and
special cases like 's, 't, and n't were split off and turned into their own tokens.
Some situations proved more challenging than others, such as proper names with titles, as in
“Mr. Smith”. We did not want to see “Mr.” on its own line, as that would ruin the n-grams for the
rest of the sentence. Mistakes like these are common and tend to compound. To handle this, the
punctuation would be stripped and an extra pass afterwards would go through looking for this type of
instance and reapply the period. This is important because a generic name like “Mr. Smith” could
appear across genres, but the tokenizations “Mr . Smith”, “Mr Smith”, and <“Mr.”><“Smith”> would
all have been treated as different.
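The patterns below are a simplified sketch of this kind of normalization: one pass protects title abbreviations such as “Mr.” before punctuation is stripped, and another splits clitics into their own tokens. The abbreviation list, the placeholder token, and the regular expressions are illustrative, not the exact ones we used.

import re

ABBREVIATIONS = {"Mr.", "Mrs.", "Dr.", "Ms."}  # illustrative list

def normalize(line):
    # Protect title abbreviations so their period is not treated as
    # ordinary punctuation (the "Mr. Smith" case described above).
    for abbr in ABBREVIATIONS:
        line = line.replace(abbr, abbr[:-1] + "<PERIOD>")
    # Split clitics such as 's and n't into their own tokens.
    line = re.sub(r"(\w)('s|n't|'t)\b", r"\1 \2", line)
    # Remove remaining punctuation (the protected periods survive).
    line = re.sub(r"[^\w\s<>']", " ", line)
    # Restore the protected periods on the abbreviations.
    line = line.replace("<PERIOD>", ".")
    return line.split()

print(normalize("Mr. Smith said he's sorry, isn't he?"))
# ['Mr.', 'Smith', 'said', 'he', "'s", 'sorry', 'is', "n't", 'he']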
At times the tagged corpora became sloppy, and all that could be done was to skip over the
offending text. While the Gutenberg corpus was generally cleaner, the ANC had situations in which
strings would be enclosed in the tag brackets. When the code runs over this, the outcome can be
surprising. Given the time constraints, we did our best to eyeball the corpora and pin down as many of
these flaws as possible, or just cut them out altogether.
After preparing the corpora, we built language models for each corpus. The first step in building
our models was to take counts of all n-grams in each corpus. We use a trigram model (n=3), so for each
trigram we see in the text, we keep a count of how many times that trigram occurred in the text. We also
keep track of counts of counts of trigrams – that is, we keep track of the number of trigrams occurring
only once in the training set, the number of trigrams occurring twice, etc. We do the same for unigrams
(n=1) and bigrams (n=2), as we will need these counts for the smoothing and interpolation steps
described below. The following histograms show the distribution of our trigram counts for each
corpus: the x-axis bins the count values, and the y-axis shows how many distinct trigrams occurred
that many times.
[INSERT HISTOGRAMS FOR TRIGRAM COUNTS FROM EACH CORPUS]
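A sketch of this counting step, assuming the normalized corpora are lists of token lists, could look like the following; the helper names are illustrative rather than our actual code.

from collections import Counter

def count_ngrams(sentences, n):
    # `sentences` is a list of token lists produced by the normalization step.
    counts = Counter()
    for tokens in sentences:
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts

def counts_of_counts(counts):
    # N_c: how many distinct n-grams were seen exactly c times.
    return Counter(counts.values())

# Illustrative usage on a tiny corpus.
sentences = [["we", "lock", "the", "door"], ["they", "lock", "the", "door"]]
tri = count_ngrams(sentences, 3)
print(counts_of_counts(tri))   # Counter({1: 2, 2: 1})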
The probability our language model assigns to a given sentence is equal to the product of the
probabilities assigned by the model to each of the n-grams in that sentence. This means we need to
assign some probability to words and word sequences that turn up in our test set that we did not
encounter when training our model on the corpora, because otherwise the probability of these
sequences will be zero, which will cause the probability of the entire sentence to be zero. Since we
could not possibly encounter all words, much less all possible bigrams and trigrams, in our training
data, we need to estimate the probability of seeing new n-grams. We therefore "smooth" our n-gram
counts, that is, we take some of the probability mass assigned to the word sequences we did observe in
the training set and reassign it to sequences that we did not observe in the training set.
We used the Good-Turing (GT) smoothing technique, which is based on using the frequency of
n-grams that occur only once in the training set to estimate the probability of n-grams we have not yet
seen (Jurafsky & Martin, 2008). Let N_c be the number of trigrams that occur in the training set exactly
c times. In GT smoothing, we pretend that instead of seeing each of these trigrams c times, we actually
saw them c* times, where c* = (c+1) · N_{c+1} / N_c (Jurafsky & Martin, 2008). We calculate these c*
counts for all values of c from 0 up to a maximum value that we set, called GT-max. Replacing our original
counts by c* counts reduces some of the probability mass assigned to the n-grams occurring c times and
allows us to reassign this leftover probability mass to n-grams which occurred zero times in the training
set. The remaining probability mass turns out to be N_1/N, where N is the total number of n-grams in
the training set; in other words, we estimate the probability of seeing a previously unseen n-gram from
the frequency of n-grams that occurred exactly once in the training set (Jurafsky & Martin, 2008).
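A sketch of this Good-Turing adjustment might look like the following, with the counts-of-counts table and the GT-max cutoff passed in; the function name and interface are illustrative.

def good_turing(counts_of_counts, total_ngrams, gt_max):
    # counts_of_counts maps c -> N_c, the number of distinct n-grams seen
    # exactly c times; total_ngrams is N, the total number of n-gram tokens.
    # Returns the adjusted counts c* = (c + 1) * N_{c+1} / N_c for
    # 1 <= c <= GT-max (raw counts are kept above the cutoff), along with
    # the leftover probability mass N_1 / N reserved for unseen n-grams.
    c_star = {}
    for c in range(1, gt_max + 1):
        n_c = counts_of_counts.get(c, 0)
        n_next = counts_of_counts.get(c + 1, 0)
        c_star[c] = (c + 1) * n_next / n_c if n_c else float(c)
    p_unseen = counts_of_counts.get(1, 0) / total_ngrams
    return c_star, p_unseen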
After counting all our unigrams, bigrams, and trigrams, and smoothing those counts, we can
calculate trigram probabilities using the formula P(w3 | w1 w2) = c*(w1 w2 w3) / c*(w1 w2), where
c*(·) denotes the smoothed count. For bigrams, we use P(w2 | w1) = c*(w1 w2) / c*(w1), and for
unigrams, P(w1) = c*(w1) / Σ_c (N_c · c*).
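Put together, a smoothed trigram probability can be computed roughly as follows. The handling of unseen trigrams (which receive the reserved N_1/N mass) is omitted here for brevity, and the names and interfaces are illustrative.

def adjusted(count, c_star, gt_max):
    # Use the Good-Turing adjusted count below the GT-max cutoff,
    # and the raw count above it.
    return c_star.get(count, float(count)) if count <= gt_max else float(count)

def smoothed_trigram_prob(w1, w2, w3, tri_counts, bi_counts, c_star_tri, c_star_bi, gt_max):
    # P(w3 | w1 w2) = c*(w1 w2 w3) / c*(w1 w2)
    tri = adjusted(tri_counts.get((w1, w2, w3), 0), c_star_tri, gt_max)
    bi = adjusted(bi_counts.get((w1, w2), 0), c_star_bi, gt_max)
    return tri / bi if bi else 0.0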
Now we use interpolation to further improve our model, that is, we reassign probabilities to
each trigram using the following formula:
P(w3 | w1 w2) = 1P(w3) + 2P(w3 | w2) + 3P(w3 | w1 w2)
where 1, 2, and 3 are the weights we apply to the smoothed unigram, bigram, and trigram
probabilities, respectively. These three lambdas must sum to 1. We chose values for 1, 2, and 3 by
trial and error, using a development set – see the results section for details.
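A sketch of the interpolation step, with the unigram, bigram, and trigram probability functions passed in as placeholders:

def interpolated_prob(w1, w2, w3, p_uni, p_bi, p_tri, lambdas):
    # P(w3 | w1 w2) = λ1·P(w3) + λ2·P(w3 | w2) + λ3·P(w3 | w1 w2),
    # where the three lambdas sum to 1 and are tuned on the development set.
    l1, l2, l3 = lambdas
    assert abs((l1 + l2 + l3) - 1.0) < 1e-9
    return l1 * p_uni(w3) + l2 * p_bi(w2, w3) + l3 * p_tri(w1, w2, w3)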