Introduction

Imagine having the capability to predict the next word out of somebody's mouth. In some cases, this is not as difficult as one might think. Consider the following sentence: "When I left the house this morning I forgot to lock the _____." In cases like this, we know that some words, such as door, are far more likely to occupy the blank than others, such as photosynthesis or a. The point is that it is easier to predict the next word of a sentence if we know which words preceded it. This is the motivation behind n-gram models (example adapted from Jurafsky & Martin, 2008). Using n-gram models, provided that we have a sufficiently large corpus of training data, we can predict the nth word in a sequence by looking at the previous n–1 words and determining, based on what we have seen in our corpus, which word is most likely to come next.

Returning to our example sentence, we can deduce, given the first 12 words of the sentence, that the probability that the word occupying the blank will be door, P(door|w1,…,w12), is quite high. However, since most sentences are novel and unique, we face a severe data sparsity problem (Pinker ??). To ameliorate this sparsity, we make a Markov assumption – that is, we approximate this probability by instead conditioning on a shorter sequence of words immediately preceding the blank (for our purposes, this sequence will be between 0 and 2 words). Thus, we can approximate P(door|When, I, left, the, house, this, morning, I, forgot, to, lock, the) by calculating P(door|lock, the). Since it is far more likely that the 2-gram "lock the" will have occurred in our training corpus, we are more likely to be able to compute the probability of the 3-gram in which we are interested: "lock the door". This statistical modeling of word sequences is called n-gram modeling, and we can compute the probability of an entire sentence by multiplying together the conditional probability of each n-gram in the sentence. With a large corpus of text to train on, we can use n-grams to build language models, which can tell us the probability of a given sentence or even an entire document.

N-gram models have many applications in natural language processing, such as speech recognition, where the aim is to find the most likely sequence of words reflecting the input audio signal. The task to which we apply n-gram models, however, is text genre classification, where the goal is to build a classifier that can determine the genre or topic of a given text. This technology is essential for medical researchers, among others, who need to locate research papers on specialized topics within the vast body of available medical literature. Several artificial intelligence techniques, such as decision trees, maximum entropy models, and perceptrons, have been employed to address this problem (Manning and Schütze, 1999). Given that the research topic in which we are primarily interested is n-gram language modeling, however, our approach to text genre classification relies primarily on these models. For our project, we build n-gram language models from four corpora consisting of news text, political blog text, transcripts of telephone conversations, and fiction, respectively. We then use these models to build a classifier, which assigns documents to one of the genres listed above by calculating which model best predicts the document. Our results show that classifiers built using this method can achieve reasonably high accuracy.
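As a concrete, if toy, illustration of this trigram approximation, the following Python sketch shows how a sentence probability is assembled as a product of trigram conditional probabilities estimated from counts. The counts here are hypothetical and not drawn from our corpora, and the sketch omits the smoothing described in the Methods section.

    # Hypothetical trigram and bigram counts; a real model derives these from a corpus.
    trigram_counts = {("lock", "the", "door"): 8, ("lock", "the", "car"): 2}
    bigram_counts = {("lock", "the"): 10}

    def trigram_prob(w1, w2, w3):
        # Maximum-likelihood estimate of P(w3 | w1 w2) = C(w1 w2 w3) / C(w1 w2).
        return trigram_counts.get((w1, w2, w3), 0) / bigram_counts[(w1, w2)]

    def sentence_prob(words):
        # Markov approximation: multiply the conditional probability of each
        # trigram in the sentence instead of conditioning on the full history.
        p = 1.0
        for i in range(2, len(words)):
            p *= trigram_prob(words[i - 2], words[i - 1], words[i])
        return p

    print(trigram_prob("lock", "the", "door"))  # 0.8 under these toy counts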
Below we discuss in more detail the techniques we use to build the n-gram language models, the challenges in pre-processing the corpora, the parameters we varied in order to maximize the accuracy of our text genre classifier, as well as our experiments and the degree to which the accuracy of our classifier improves as the amount of training data is increased.

Methods

Before turning to the classification part of the problem, the first step is to decide which corpora would be useful for building the language models. Common natural language processing practice requires that a percentage of the data be reserved and hidden as a test set, usually in a 1:9 ratio with the training set. In keeping with our research focus, it was also necessary to reserve an additional ten percent of the data as a development set for tuning the lambda parameters, which weight the contributions of the unigram, bigram, and trigram models during interpolation (described below). While this may seem counterintuitive, since it decreases the ratio of training to test data, the arrangement lets us observe, on a sliding scale, how much training data is needed to classify a genre accurately. When the data is portioned out into this 10-10-80 split, there must be enough words in each set to make the study meaningful.

While we gathered as many corpora as possible, some sources, such as the well-known Brown Corpus, were left out due to formatting, size, or availability. Most of the data we used comes from the American National Corpus (ANC), which contains around 11 million words. From the ANC we used the Switchboard corpus (around 3 million words), the New York Times corpus (around 3.2 million words), and the Slate Magazine corpus (around 4.3 million words). We also used the fiction section of the Gutenberg Corpus, excluding the works of Shakespeare, totaling around 2 million words. This gave us four genres for classification, selected in part because they have roughly the same word count.

Before feeding the data to the classification code, the next step was to go through each data set and normalize it. There is a great deal of variability in the types of corpora available, so it was imperative that all the genres share the same formatting in order to preserve comparable n-grams. Because each corpus represents different phenomena in language, each had its own complications for normalization. For example, corpora derived from spoken language may not have the breadth of vocabulary that would be helpful for the study and tend to contain a heightened number of disfluencies. Disfluencies are problematic because there is often no recognizable pattern for algorithmically editing them out of the data set. Another problematic corpus could be a volume of works from an era whose language has gone out of style: dated word forms could skew the counts and fail to yield a realistic model of modern usage. Prose and poetic styles often make use of metaphor and do not always reflect the standard usage of a word or sequence; these considerations are why the works of Shakespeare were omitted. These are a few of the situations we foresaw as possible problems, and they helped guide the data collection process.

Most of the normalization relied on regular expressions to adjust the formatting into the form the language model code expected. This involved breaking every sentence onto its own line and removing extraneous characters, as sketched below. An initial pass separated the cleanest sentences from those that would need more elaborate processing.
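The exact clean-up scripts were tailored to each corpus, but a minimal sketch of this kind of regular-expression normalization (with illustrative patterns, not our production rules) might look like the following:

    import re

    def normalize(text):
        # Remove extraneous characters; the real patterns were adapted to each
        # corpus's markup, so this character class is only illustrative.
        text = re.sub(r"[^A-Za-z0-9.!?'\s]", " ", text)
        # Break the text so that each sentence ends up on its own line.
        text = re.sub(r"(?<=[.!?])\s+", "\n", text)
        # Collapse runs of whitespace within each line.
        return "\n".join(" ".join(line.split()) for line in text.splitlines())

    print(normalize('He said, "Lock the door!"  I forgot to.'))
    # He said Lock the door!
    # I forgot to.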
Punctuation was removed, and special cases like 's, 't, n't, etc. were split off and turned into their own tokens. Some situations proved more challenging than others, such as proper names with titles, as in "Mr. Smith". We did not want to see "Mr." on its own line, as that would ruin the n-grams for the rest of the sentence. Mistakes like these are common and tend to compound. To handle this, the punctuation was stripped and an extra pass afterwards looked for this type of instance and reapplied the period. This is important because a generic name like "Mr. Smith" could appear across genres, but the tokenizations "Mr . Smith", "Mr Smith", and <"Mr."><"Smith"> would all be counted as different. At times the markup in a corpus was sloppy and all that could be done was to skip over it. While the Gutenberg corpus was generally cleaner, the ANC contained situations in which stray strings were enclosed in the tag brackets, and when the code runs over these the outcome can be surprising. Given the time constraints, we did our best to eyeball the corpora and pin down as many of these flaws as possible, or simply cut them out altogether.

After preparing the corpora, we built language models for each corpus. The first step in building our models was to take counts of all n-grams in each corpus. We use a trigram model (n=3), so for each trigram we see in the text, we keep a count of how many times that trigram occurred. We also keep track of counts of counts of trigrams – that is, the number of trigrams occurring only once in the training set, the number of trigrams occurring twice, and so on. We do the same for unigrams (n=1) and bigrams (n=2), as we will need these counts for the smoothing and interpolation steps described below. The following histograms show the frequencies of our trigram counts for each corpus: the x-axis shows bins of count values c, and the y-axis shows the number of trigrams that occurred c times.

[INSERT HISTOGRAMS FOR TRIGRAM COUNTS FROM EACH CORPUS]

The probability our language model assigns to a given sentence is equal to the product of the probabilities assigned by the model to each of the n-grams in that sentence. This means we need to assign some probability to words and word sequences that turn up in our test set but that we did not encounter when training our model on the corpora; otherwise the probability of these sequences would be zero, which would make the probability of the entire sentence zero. Since we could not possibly encounter all words, much less all possible bigrams and trigrams, in our training data, we need to estimate the probability of seeing new n-grams. We therefore "smooth" our n-gram counts; that is, we take some of the probability mass assigned to the word sequences we did observe in the training set and reassign it to sequences that we did not observe. We used the Good-Turing (GT) smoothing technique, which uses the frequency of n-grams that occur only once in the training set to estimate the probability of n-grams we have not yet seen (Jurafsky & Martin, 2008). Let Nc be the number of trigrams that occur c times in the training set. In GT smoothing, we pretend that instead of seeing each of these trigrams c times, we actually saw them c* times, where c* = (c+1)Nc+1/Nc (Jurafsky & Martin, 2008). We calculate these c* counts for all values of c from 0 up to a maximum value that we set, called GT-max.
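A compact Python sketch of the counting and Good-Turing adjustment described above might look like the following. The function names, the GT-max of 5, and the fallback for empty Nc bins are illustrative assumptions rather than our exact implementation, and the c = 0 case, which uses N1/N as explained below, is left out here.

    from collections import Counter

    def ngram_counts(tokens, n):
        # Count every n-gram (as a tuple of tokens) in the token sequence.
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

    def counts_of_counts(counts):
        # Nc = the number of distinct n-grams that occurred exactly c times.
        return Counter(counts.values())

    def good_turing_adjusted(counts, gt_max=5):
        # c* = (c + 1) * N(c+1) / Nc for 1 <= c <= GT-max; above GT-max, or when
        # an Nc bin is empty, we fall back to the raw count c.
        n_c = counts_of_counts(counts)
        c_star = {}
        for c in range(1, gt_max + 1):
            if n_c.get(c) and n_c.get(c + 1):
                c_star[c] = (c + 1) * n_c[c + 1] / n_c[c]
            else:
                c_star[c] = float(c)
        return c_star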
Replacing our original counts with c* counts removes some of the probability mass assigned to the n-grams occurring c times and allows us to reassign this leftover probability mass to n-grams that occurred zero times in the training set. The leftover probability mass turns out to be N1/N, which means we predict that the probability of seeing a previously unseen n-gram is equal to the probability of seeing an n-gram that occurred exactly once in the training set (Jurafsky & Martin, 2008). After counting all our unigrams, bigrams, and trigrams, and smoothing those counts, we can calculate trigram probabilities using the formula P(w3 | w1 w2) = c*(w1 w2 w3) / c*(w1 w2). For bigrams, we use P(w2 | w1) = c*(w1 w2) / c*(w1), and for unigrams, P(w1) = c*(w1) / N*, where N* is the sum over all values of c of Nc·c* (the total adjusted token count). Finally, we use interpolation to further improve our model; that is, we reassign the probability of each trigram using the following formula: P(w3 | w1 w2) = λ1·P(w3) + λ2·P(w3 | w2) + λ3·P(w3 | w1 w2), where λ1, λ2, and λ3 are the weights we apply to the smoothed unigram, bigram, and trigram probabilities, respectively. These three lambdas must sum to 1. We chose values for λ1, λ2, and λ3 by trial and error, using a development set – see the results section for details.
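The interpolation step can be sketched as below, where p_uni, p_bi, and p_tri are assumed to be dictionaries mapping word tuples to the smoothed probabilities described above; the parameter names and the placeholder lambda values are ours, with the actual weights chosen on the development set.

    def interpolated_trigram_prob(w1, w2, w3, p_uni, p_bi, p_tri,
                                  lambdas=(0.1, 0.3, 0.6)):
        # P(w3 | w1 w2) = l1*P(w3) + l2*P(w3 | w2) + l3*P(w3 | w1 w2),
        # with the three lambda weights summing to 1.
        # p_bi[(w2, w3)] holds the smoothed conditional P(w3 | w2), and
        # p_tri[(w1, w2, w3)] holds the smoothed conditional P(w3 | w1 w2).
        l1, l2, l3 = lambdas
        return (l1 * p_uni.get((w3,), 0.0)
                + l2 * p_bi.get((w2, w3), 0.0)
                + l3 * p_tri.get((w1, w2, w3), 0.0))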