Katherine Brainard
224N Final Project
Classifying Parts of Speech Based on Sparse Data
Introduction:
Classification of parts of speech is an interesting topic to investigate for several reasons.
First, the problem has many different aspects and admits a multitude of different approaches,
which makes it intriguing. Second, it has practical implications for other areas such as
machine translation and language acquisition. If the part-of-speech tagging of corpora could be
automated, that would dramatically increase researchers’ access to test and training data. The
part of the classification process that I focused on was how to deal with rare words. Approaches
that have been tried in the past include forcing rare words into their own grammatical category,
or trying to classify them along with the rest of the data by exploiting the fact that many rare
words have morphologies that allow the classifier to predict what their type is. These approaches
are detailed in a 2003 paper by Alexander Clark. The approach I took was a combination of these
two – the classifier learns to cluster the frequent data with the rare words kept in their own
cluster, and then after the other clusters have been learned, the classifier retroactively places the
rare words in categories based on their morphology and the context in which they occurred. The
contextual information is of course limited since the words have not appeared very often. This
approach should work well because it addresses the two problems rare words create: they are
easier to classify once a classification framework already exists, and building such a
framework is harder when the rare words must be included in it.
Algorithms and Methods:
The classifier uses clustering to assign a given token to a particular category. After a few
trial runs, and looking at the number of clusters used in papers, I wound up using 23 clusters for
the classifier. The clusters are absolute, in that a token can belong to only one cluster. This is, of
course, a simplification of any actual language, but it is still a sufficiently powerful setting in
which to try different methods of classifying the data.
In more detail, the classifier first determines frequency counts and context information
for every token type in the input file. It then generates random clusters and gradually moves
similar word types into the same cluster; once the assignments stop changing, the algorithm
terminates. Similarity between a word and a cluster is determined from their context
information. This process is not applied to any words that occurred fewer than 5 times in the
training data – these words are placed in their own cluster and are not allowed to move to another
cluster.
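As a rough illustration (not the project's actual code), the clustering loop might look like the Python sketch below. The neighbour-based context features, the overlap similarity, and the helper names are my assumptions, since the report does not specify them.

```python
import random
from collections import Counter, defaultdict

MIN_FREQ = 5                  # words seen fewer than 5 times are held out
NUM_CLUSTERS = 23             # number of clusters used by the classifier
RARE_CLUSTER = NUM_CLUSTERS   # extra cluster that holds the rare words

def context_counts(tokens):
    """Count left and right neighbours for every token type
    (a simple stand-in for the report's 'context information')."""
    contexts = defaultdict(Counter)
    for i, tok in enumerate(tokens):
        if i > 0:
            contexts[tok]["L:" + tokens[i - 1]] += 1
        if i + 1 < len(tokens):
            contexts[tok]["R:" + tokens[i + 1]] += 1
    return contexts

def similarity(word_ctx, cluster_ctx):
    """Overlap between a word's context counts and a cluster's pooled counts
    (the report does not name its actual similarity measure)."""
    return sum(min(word_ctx[c], cluster_ctx[c]) for c in word_ctx)

def cluster(tokens, max_iters=50, stop_fraction=0.02):
    freqs = Counter(tokens)
    contexts = context_counts(tokens)
    frequent = [w for w in freqs if freqs[w] >= MIN_FREQ]
    rare = [w for w in freqs if freqs[w] < MIN_FREQ]

    # Random initial clusters for frequent types; rare types stay in their own cluster.
    assign = {w: random.randrange(NUM_CLUSTERS) for w in frequent}
    assign.update({w: RARE_CLUSTER for w in rare})

    for _ in range(max_iters):
        # Pool the context counts of each cluster's current members.
        cluster_ctx = {k: Counter() for k in range(NUM_CLUSTERS)}
        for w in frequent:
            cluster_ctx[assign[w]].update(contexts[w])

        # Move each frequent word type to its most similar cluster.
        moved = 0
        for w in frequent:
            best = max(range(NUM_CLUSTERS),
                       key=lambda k: similarity(contexts[w], cluster_ctx[k]))
            if best != assign[w]:
                assign[w], moved = best, moved + 1

        # Stop once fewer than 2% of the word types are still shifting.
        if moved < stop_fraction * len(frequent):
            break
    return assign, contexts
```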
After the initial clustering finishes, the classifier combines information about the clusters
and the rare words to attempt to assign each of these words to a category. If it has insufficient
information to pick a category, the classifier leaves the words where they are.
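Continuing the sketch above, the rare-word step might look like the following; the suffix-based morphological features and the evidence threshold are illustrative assumptions, not the classifier's actual rules.

```python
from collections import Counter, defaultdict

def suffix_features(word, max_len=3):
    """Crude morphological cues: the word's final character n-grams
    (the report's actual morphological features are not specified)."""
    return Counter("SUF:" + word[-n:] for n in range(1, min(max_len, len(word)) + 1))

def place_rare_words(assign, contexts, min_evidence=1):
    """Retroactively move each rare word into the most similar learned cluster,
    using its (limited) contexts plus its morphology; words with no usable
    evidence stay in the rare cluster. Reuses similarity() from the sketch above."""
    cluster_ctx = defaultdict(Counter)
    cluster_suf = defaultdict(Counter)
    for w, k in assign.items():
        if k != RARE_CLUSTER:
            cluster_ctx[k].update(contexts[w])
            cluster_suf[k].update(suffix_features(w))

    for w, k in list(assign.items()):
        if k != RARE_CLUSTER:
            continue
        scores = {c: similarity(contexts[w], cluster_ctx[c]) +
                     similarity(suffix_features(w), cluster_suf[c])
                  for c in range(NUM_CLUSTERS)}
        best, best_score = max(scores.items(), key=lambda kv: kv[1])
        if best_score >= min_evidence:   # otherwise: insufficient information, leave it
            assign[w] = best
    return assign
```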
Feedback about the performance of the classifier is achieved by running two different
evaluation criteria against it.
The classifier relies heavily on hash maps and sets to achieve reasonable performance.
Even so, the speed at which it operates is fairly slow. This seems to be acceptable, however,
since there are few applications of machine classification of parts of speech that require real-time
responses from the program.
Development:
During development, I tried several other evaluation functions. It turned out to be somewhat
difficult to pick a function that did not assign a high value to degenerate clusterings but still
correlated positively with the performance of the classifier. The two degenerate clusterings I
wanted to avoid converging towards were classifying all tokens as one type and classifying each
token as its own type. I kept data from one of these earlier functions because it took a probabilistic approach
different from that of the better-designed evaluation function I came up with later. This earlier
function (called eval1 later on) reads in some test sentences, and then figures out the probability
of this data given a particular clustering as follows:
P(sentence s) = Π_i P(g(s_i) | g(s_{i-1})), where g(s_i) is the grammatical category the clustering
assigns to token s_i.
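As an illustration, eval1 could be implemented roughly as below (continuing the earlier sketch). The report does not say how the transition probabilities P(g(s_i) | g(s_{i-1})) were estimated, so estimating them from the test sentences themselves is an assumption.

```python
from collections import Counter, defaultdict

def eval1(test_sentences, assign):
    """Score a clustering by the average probability it assigns to test
    sentences, where each sentence's probability is the product of
    cluster-to-cluster transition probabilities P(g(s_i) | g(s_{i-1}))."""
    # Estimate transition counts between the clusters assigned to adjacent tokens.
    trans = defaultdict(Counter)
    for sent in test_sentences:
        tags = [assign.get(tok, RARE_CLUSTER) for tok in sent]
        for prev, cur in zip(tags, tags[1:]):
            trans[prev][cur] += 1

    # Average per-sentence probability under those transition estimates.
    total = 0.0
    for sent in test_sentences:
        tags = [assign.get(tok, RARE_CLUSTER) for tok in sent]
        p = 1.0
        for prev, cur in zip(tags, tags[1:]):
            denom = sum(trans[prev].values())
            p *= trans[prev][cur] / denom if denom else 0.0
        total += p
    return total / len(test_sentences)
```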
This function clearly has the problem that if all data is in one category, all sentences will
have probability 1. However, in practice, this function correlates well with the second evaluation
function I used. This function (eval2) compares the clustered categories to already-tagged data
from the Penn Treebank. The tagged data has its own categories, and the function measured the
distribution of each category over the classifier's clusters, both in terms of the number of
clusters that the category was spread across and how much of each of those clusters was assigned
to that category. So a classifier that split proper nouns across two clusters containing only
proper nouns would be considered "better" than a classifier that mixed half of the proper nouns
with a set of adjectives into one cluster.
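The report only describes eval2's behaviour qualitatively, so the exact scoring below is one plausible realisation under my own assumptions:

```python
from collections import Counter, defaultdict

def eval2(assign, gold_tags):
    """Compare clusters to gold Treebank tags: for each tag, penalise both the
    number of clusters the tag is spread across and the impurity of those
    clusters. Scores are <= 0, with 0 being a perfect match."""
    tag_to_clusters = defaultdict(Counter)   # tag -> which clusters it lands in
    cluster_to_tags = defaultdict(Counter)   # cluster -> mixture of tags it holds
    for w, tag in gold_tags.items():         # gold_tags: token type -> Treebank tag
        if w in assign:
            tag_to_clusters[tag][assign[w]] += 1
            cluster_to_tags[assign[w]][tag] += 1

    score = 0.0
    for tag, clusters in tag_to_clusters.items():
        total = sum(clusters.values())
        for c, n in clusters.items():
            spread = 1.0 - n / total          # penalty for splitting the tag across clusters
            impurity = 1.0 - cluster_to_tags[c][tag] / sum(cluster_to_tags[c].values())
            score -= n * (spread + impurity) / total
    return score
```

Under this scoring, a tag split across two pure clusters is penalised only for the split, while mixing it with other tags in a single cluster adds an impurity penalty as well, matching the proper-noun example above.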
Originally, I had used morphological data in both the clustering of frequent and
infrequent data. However, the performance cost associated with this was prohibitively high, so I
ended up using morphological data only in the classification of infrequent word types. This had
some interesting results, which are discussed below. In the interest of improving running time,
data that is time-consuming to calculate and used fairly frequently, such as frequencies and
conditional probabilities, is cached by the program even though this increases memory usage. This
tradeoff seemed reasonable because the program doesn't take up a huge amount of space, but is
still fairly slow even with the caching (testing it on 26,000 tokens took over 20 minutes).
Further, I run the clustering algorithm only approximately to convergence: once fewer than 2% of
the tokens are shifting between clusters, I stop early. In small tests that I ran, this didn't
significantly impact the evaluation of the clustering.
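A minimal sketch of the caching tradeoff described above; the class name and the particular quantities cached are hypothetical, not the program's actual design.

```python
from collections import Counter
from functools import lru_cache

class CachedStats:
    """Cache frequencies and conditional probabilities after their first
    computation, trading memory for speed."""

    def __init__(self, tokens):
        self.tokens = tokens
        self.freqs = Counter(tokens)                     # computed once up front
        self.bigrams = Counter(zip(tokens, tokens[1:]))  # adjacent-pair counts

    @lru_cache(maxsize=None)
    def cond_prob(self, word, prev):
        """P(word | prev), memoised so repeated lookups are cheap."""
        return self.bigrams[(prev, word)] / self.freqs[prev] if self.freqs[prev] else 0.0
```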
Results:
Holding out the less frequent data produced improvements in the resulting classification. The
improvements are summarized in the following table; to keep the results consistent, I only
considered the held-out model in which contextual information alone is used to add the
less-frequent words to the clusters. The data used were sentences from the Treebank with 10,000
token types. In this table, and in the rest of the results, I show the outputs of eval1 and eval2
scaled to a range of 0 to 100 to make them more easily comparable. The raw outputs of eval1 were
in the range 0 to .002, and those of eval2 were from -500 to 0.
          Final result with     Final result without
          held-out data         holding out data
Eval 1    73                    45
Eval 2    92.5                  84
When holding out infrequent words, though, it is important to realize that the data used to
eventually classify these words impacts the quality of the resulting clustering immensely. The
most bizarre feature of this was that, for eval2, the worst way to classify these words was by
looking at both morphological and contextual data. Looking at just one or the other was much
better, and of the two, contextual data outperformed morphological data. This is probably because
the clusters were built by looking only at contextual information. These results are summarized
in the following chart:
[Chart: Eval1 and Eval2 scores (scaled 0-100) for classifying the rare words using morphology & context, morphology only, and context only.]
To see how the model scaled, I also tried running it on two other training sets of different
sizes. For the data above, eval1's test data was the same as the training data, which is what
produced the very high results for context-only classification. This is also the case for the
10,000-token data point below. For the other sets, the test data was separate from the training
data, although both were drawn from the same corpus. These results are summarized in the
following chart:
[Chart: baseline, before-rare-word, and final Eval1 and Eval2 scores (scaled 0-100) for Treebank (10,000 tokens), Bllip (20,000 tokens), and Treebank (26,000 tokens).]
As the chart shows, with the exception of testing on the training data for eval1 and the 10,000
token Treebank data, the final results are fairly stable, and seem to improve dramatically when
the infrequent words are added to the clusters.
Discussion:
This model assumes that rare words are not simultaneously grammatical oddities. If they
were, then looking at the morphological and contextual information of the word would not help
match the word to a particular grammatical category’s pattern. Fortunately, rare words are
usually not irregular, so this assumption is generally valid.
The nature of the model also clearly assumes that there are local contextual similarities
between parts of speech – it would not work for a language where the grammatical category is
determined by something many words away in the sentence.
Alexander Clark describes much of the past work that has been done on dealing with
infrequent words in his paper Combining Distributional and Morphological Information for Part
of Speech Induction. His paper helped shape what I looked at, by outlining the general
approaches that can be taken towards classifying grammatical categories. His paper also outlined
various evaluation strategies, which I used to help develop eval1.
One area where further exploration could take place is the use of fuzzy clusters. Clark's
paper mentions this briefly, but more investigation in that area is clearly possible. Letting token
types have a distribution over clusters, or belong to more than one cluster, would definitely
complicate the convergence procedure, but should produce slightly better results. This is because
token types really do belong to more than one grammatical category in real life, contrary to what
this classifier supposes.
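As one illustration of the fuzzy-cluster idea (not something the current classifier does), a token type could be given a softmax distribution over clusters instead of a single hard assignment, reusing the similarity() helper sketched earlier:

```python
import math

def soft_assign(word_ctx, cluster_ctx, temperature=1.0):
    """Return a probability distribution over clusters for one token type,
    obtained by a softmax over its similarity to each cluster."""
    scores = {k: similarity(word_ctx, cluster_ctx[k]) / temperature
              for k in cluster_ctx}
    z = sum(math.exp(s) for s in scores.values())
    return {k: math.exp(s) / z for k, s in scores.items()}
```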