Katherine Brainard
224N Final Project
Classifying Parts of Speech Based on Sparse Data

Introduction:

Classification of parts of speech is an interesting topic to look into for several reasons. First, the problem has many different aspects and admits a multitude of approaches, which makes it intriguing. Second, it has practical implications for other areas of machine translation and language acquisition: if the part-of-speech tagging of corpora could be automated, researchers' access to test and training data would increase dramatically.

The part of the classification process I focused on is how to deal with rare words. Approaches tried in the past include forcing rare words into their own grammatical category, or classifying them along with the rest of the data by exploiting the fact that many rare words have morphologies that let the classifier predict their type. These approaches are detailed in a 2003 paper by Alexander Clark. The approach I took is a combination of the two: the classifier learns to cluster the frequent data while the rare words are kept in their own cluster, and after the other clusters have been learned, it retroactively places the rare words in categories based on their morphology and the contexts in which they occurred. The contextual information is of course limited, since these words have not appeared very often. This approach should work well because it addresses both of the difficulties rare words present: it is easier to classify them once a classification framework already exists, and it is harder to build such a framework while also having to deal with the rare words.

Algorithms and Methods:

The classifier uses clustering to assign a given token to a particular category. After a few trial runs, and after looking at the number of clusters used in published papers, I settled on 23 clusters. The clusters are absolute, in that a token can belong to only one cluster. This is, of course, a simplification of any actual language, but it is still powerful enough for trying different methods of classifying the data.

In more detail, the classifier first computes frequency counts and context information for every token type in the input file. It then generates random clusters and iteratively moves similar word types into the same cluster, terminating when the assignments stabilize. Similarity between a word and a cluster is determined from their contextual information. This process is not applied to any word that occurred fewer than 5 times in the training data; such words are placed in their own cluster and are not allowed to move to another one. After the initial clustering finishes, the classifier combines information about the clusters and the rare words to attempt to assign each rare word to a category. If it has insufficient information to pick a category, the classifier leaves the word where it is. Feedback about the performance of the classifier comes from running two different evaluation criteria against it.

The classifier relies heavily on hash maps and sets to achieve reasonable performance. Even so, it is fairly slow. This seems acceptable, however, since few applications of machine classification of parts of speech require real-time responses from the program.
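To make the clustering stage concrete, here is a minimal sketch in Python. The actual implementation is not included in this report, so everything in the sketch is an assumption made for illustration: the names (cluster_types, cluster_centroids, similarity, MIN_FREQ, NUM_CLUSTERS), the use of immediate left/right neighbor counts as "context information," and cosine similarity as the similarity measure. The sketch shows only the hold-out scheme described above: frequent types move among 22 ordinary clusters, while rare types stay parked in a reserved 23rd cluster.

    import random
    from collections import Counter, defaultdict

    NUM_CLUSTERS = 23   # 22 ordinary clusters plus one reserved for rare words
    MIN_FREQ = 5        # types seen fewer than 5 times are held out
    RARE = NUM_CLUSTERS - 1

    def similarity(a, b):
        """Cosine similarity between two sparse context-count vectors."""
        dot = sum(v * b.get(k, 0) for k, v in a.items())
        norm = (sum(v * v for v in a.values()) ** 0.5 *
                sum(v * v for v in b.values()) ** 0.5)
        return dot / norm if norm else 0.0

    def cluster_centroids(assign, contexts, words):
        """Pool the context counts of every type currently in each cluster."""
        cents = defaultdict(Counter)
        for w in words:
            cents[assign[w]].update(contexts[w])
        return cents

    def cluster_types(tokens, max_iters=50, stop_frac=0.02):
        """Cluster word types by context; rare types stay in the reserved cluster."""
        freqs = Counter(tokens)
        # Context information: counts of each type's immediate neighbors.
        contexts = defaultdict(Counter)
        for prev, cur in zip(tokens, tokens[1:]):
            contexts[cur]['L', prev] += 1
            contexts[prev]['R', cur] += 1

        frequent = [w for w in freqs if freqs[w] >= MIN_FREQ]
        # Random initial assignment for frequent types; rare types are held out.
        assign = {w: (random.randrange(RARE) if freqs[w] >= MIN_FREQ else RARE)
                  for w in freqs}

        for _ in range(max_iters):
            cents = cluster_centroids(assign, contexts, frequent)
            moved = 0
            for w in frequent:
                best = max(range(RARE),
                           key=lambda c: similarity(contexts[w], cents[c]))
                if best != assign[w]:
                    assign[w], moved = best, moved + 1
            # Stop "almost at convergence": under 2% of types still shifting.
            if moved < stop_frac * len(frequent):
                break
        return assign, contexts

After this loop ends, the second stage described above would assign each rare type to whichever learned cluster best matches its (limited) context counts and, where used, its morphology; that step is omitted from the sketch for brevity.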
Development:

During development, I tried several other evaluation functions. It turned out to be difficult to pick a function that did not rate degenerate clusterings highly but still correlated positively with the performance of the classifier. The two kinds of degenerate output I wanted to avoid were classifying all tokens as one type and classifying each token as its own type. I kept data from one of these earlier functions because it took a probabilistic approach different from that of the better-designed evaluation function I came up with later. This earlier function (called eval1 below) reads in some test sentences and then computes the probability of this data given a particular clustering as follows:

    P(s) = ∏_i P(g(s_i) | g(s_{i-1})),

where g(s_i) is the grammatical category the clustering assigns to token s_i. This function clearly has the problem that if all the data is in one category, every sentence will have probability 1. In practice, however, it correlates well with the second evaluation function I used. That function (eval2) compares the induced clusters to already-tagged data from the Penn Treebank. The tagged data supplies its own categories, and eval2 measures the distribution of each category over the classifier's clusters, both in terms of how many clusters the category is spread across and how much of each of those clusters is assigned to that category. So a classifier that splits the proper nouns into two clusters, each containing only proper nouns, is considered "better" than one that mixes half of the proper nouns with a group of adjectives in a single cluster.

Originally, I used morphological data in the clustering of both frequent and infrequent words. However, the performance cost was prohibitively high, so I ended up using morphological data only in the classification of infrequent word types. This had some interesting results, which are discussed below.

To improve running time, data that is time-consuming to calculate and used at all frequently, such as frequencies and conditional probabilities, is cached by the program, even though this increases memory usage. The tradeoff seemed reasonable because the program does not take up huge amounts of space but is still fairly slow even with the caching (testing it on 26,000 tokens took over 20 minutes). Further, I run the clustering algorithm only almost to convergence: when fewer than 2% of the tokens are shifting between clusters, I stop early. In the small tests I ran, this did not significantly affect the evaluation of the clustering.
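As an illustration, here is a minimal sketch of eval1 under the same assumptions as the clustering sketch above (the assign map and RARE and NUM_CLUSTERS constants come from it, and test sentences are lists of tokens). How the conditional probabilities were estimated is not specified in this report, so the bigram-count estimate and the smoothing constant below are my additions.

    from collections import Counter

    def eval1(sentences, assign, smooth=1e-6):
        """Average of P(s) = prod_i P(g(s_i) | g(s_{i-1})) over test sentences."""
        # Map each sentence to its sequence of cluster labels; unknown words
        # fall into the reserved rare cluster.
        label_seqs = [[assign.get(w, RARE) for w in sent] for sent in sentences]

        # Estimate P(cluster | previous cluster) from cluster-label bigrams.
        bigrams, unigrams = Counter(), Counter()
        for labels in label_seqs:
            unigrams.update(labels[:-1])
            bigrams.update(zip(labels, labels[1:]))

        total = 0.0
        for labels in label_seqs:
            p = 1.0
            for prev, cur in zip(labels, labels[1:]):
                p *= ((bigrams[prev, cur] + smooth) /
                      (unigrams[prev] + smooth * NUM_CLUSTERS))
            total += p
        return total / len(sentences)

Note that this exhibits exactly the degenerate case flagged above: with everything in one cluster, each conditional probability is 1, so every sentence scores 1.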
Results:

Holding out the less frequent data produced improvements in the resulting classification. The improvements are summarized in the following table; for consistency, I only considered the held-out model in which only contextual information is used to add the less frequent words to the clusters. The data used were sentences from the Treebank with 10,000 token types. In this table, and in the rest of the results, I show the outputs of eval1 and eval2 scaled to a range of 0 to 100, to make them more easily comparable. The raw outputs of eval1 were in the range 0 to .002, and those of eval2 were from -500 to 0.

              With held-out data    Without held-out data
    Eval 1            73                     45
    Eval 2            92.5                   84

When holding out infrequent words, though, it is important to realize that the data eventually used to classify these words has an immense impact on the quality of the resulting clustering. The most bizarre feature of this was that, for eval2, the worst way to classify these words was to look at both morphological and contextual data. Looking at just one or the other was much better, with contextual data alone better than morphological data alone. This is probably a consequence of the fact that the clusters themselves were built using only contextual information. These results are summarized in the following chart:

[Chart: scaled eval1 and eval2 scores for rare-word classification using morphological and contextual data, morphological data only, and contextual data only.]

To see how the model scaled, I also tried running it on two other training sets of different sizes. For the data above, eval1's test data was the same as the training data, which is what produced the very high results for context-only classification; the same holds for the 10,000-token data point below. For the others, the test data was separate from the training data, although drawn from the same corpus. These results are summarized in the following chart:

[Chart: baseline, before-rare-words, and final scores for eval1 and eval2 on Treebank (10,000 tokens), Bllip (20,000 tokens), and Treebank (26,000 tokens).]

As the chart shows, with the exception of eval1 tested on its own training data for the 10,000-token Treebank set, the final results are fairly stable, and they improve dramatically when the infrequent words are added to the clusters.

Discussion:

This model assumes that rare words are not simultaneously grammatical oddities. If they were, then looking at a word's morphological and contextual information would not help match it to a particular grammatical category's pattern. Fortunately, rare words are usually not irregular, so this assumption is generally valid. The nature of the model also clearly assumes that there are local contextual similarities between parts of speech; it would not work for a language in which a word's grammatical category is determined by something many words away in the sentence.

Alexander Clark describes much of the past work on dealing with infrequent words in his paper Combining Distributional and Morphological Information for Part of Speech Induction. His paper helped shape what I looked at by outlining the general approaches that can be taken toward classifying grammatical categories. It also outlined various evaluation strategies, which I used to help develop eval1.

One area for further exploration is fuzzy clusters. Clark's paper mentions this briefly, but more investigation in that area is clearly possible. Letting token types have a distribution over clusters, or belong to more than one cluster, would certainly complicate the convergence procedure, but it should produce slightly better results, since in real language token types really do belong to more than one grammatical category, contrary to what this classifier assumes.
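As a pointer for that extension, here is a minimal sketch of what a soft assignment might look like, reusing the hypothetical similarity function, context counts, and cluster centroids from the clustering sketch earlier. The softmax form and the temperature parameter are my choices for illustration, not anything taken from this project or from Clark's paper.

    import math

    def soft_assign(word, contexts, cents, temperature=1.0):
        """Distribution over the ordinary clusters for one word type,
        instead of a single hard cluster label."""
        scores = [similarity(contexts[word], cents[c]) / temperature
                  for c in range(RARE)]
        m = max(scores)                       # subtract max for stability
        exps = [math.exp(s - m) for s in scores]
        z = sum(exps)
        return [e / z for e in exps]

The convergence procedure would then need to move probability mass rather than whole types between clusters, which is the complication noted above.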