Natural Language Clustering of Large Random Corpuses

Natural Language Processing using Hierarchical Clustering
Benjamin Arai Chris Baron
{barai, cbaron}@cs.ucr.edu
ABSTRACT
Although there are various methods for indexing semantic data, there are no efficient algorithms for determining whether word co-occurrences contain useful relationships with highly subjective meanings, such as those found in surveys and polls. To date it has been difficult to extract this information from corpora because of the large amounts of computation and memory required for computing semantic spaces of lexical co-occurrences. In this paper we present a method for searching corpora to detect lexical co-occurrences that provide significant and useful associations in a reasonable amount of time, given only a large corpus as input.
HAL (Hyperspace Analogue to Language) is a
procedure that processes large corpora of text into
numerical vectors (vector representation of word
meanings), which can be used for determining word
relationships. These vectors can be used for creating high
dimensional spaces for analyzing statistical relationships
of words and phrases. From this high dimensional space
semantic information can be extracted. This method requires no human intervention [1].
Beyond this extraction method, little work has been done to interpret the data in a way that can be easily understood by human analysts. This paper presents a technique for reducing semantic space dimensionality in
technique for reducing semantic space dimensionality in
conjunction with a clustering technique to produce
accurate hierarchical clusters regardless of the number of
dimensions. These clusters may then be used for detecting
word and language relations.
1 INTRODUCTION
Current methods for detecting word co-occurrences are far from perfect. This paper presents a method for visualizing corpus data using techniques rooted in the HAL algorithm and expanded upon through clustering methods to create a more concrete relationship scheme of interpretable features. By itself, however, this method is far from optimal. To address this issue, various data reduction techniques have been tested and examined to increase overall speed and accuracy.
The question remains of how to create a dynamic clustering scheme that can handle the subjective and nonlinear nature of human language. We have created a method for clustering this data which allows results to be interpreted and scrutinized in a hierarchical fashion.
2 RELATED WORK
This paper expands upon the work of Kevin Lund
and Curt Burgess entitled “Producing high-dimensional
semantic spaces from lexical co-occurrences”. Their work
includes the “examination of a method for creating a
simulation that exhibits some of the characteristics of
human semantic memory, a simulation that develops
through the analysis of human experience with the world
in the form of natural language text.” [1]. This work provides the basis upon which our clustering techniques build.
Methods for clustering high-dimensional corpora exist using various forms of semantic relation detection, but they always include some manual method for determining valid relations. These methods, though useful, have failed to produce results that are representative of an entire corpus and valid beyond subjective judgment [6].
Nearest-neighbor methods used so far have been
useful for detecting valid co-occurrences but only in
circumstances where a co-occurrence is known to exist
and the number of clusters is already known [3].
High-dimensional datasets also suffer from the fact that, as the number of dimensions increases, the relative distances between objects become nearly uniform, calling the validity of detected co-occurrences into question.
The paper is structured as follows: the remainder of Section 2 provides background on natural language processing and dimension reduction. Section 3 explains how databases are an integral part of our technique. Section 4 describes the methods used for formatting and sampling corpus data sets. Section 5 explains how we eliminate unnecessary data. Section 6 describes how the algorithm works. Section 7 presents experiments of clustering sample corpora using various techniques. In Section 8 we present our results. Lastly, in Sections 9 and 10 we analyze our results and conclude with future possibilities.
2.1 Singular Value Decomposition (SVD)
SVD is a convenient automated way to reduce
dimensionality using statistical techniques and can be
applied to almost any kind of data. SVD focuses on
combining like words into single dimensions recursively.
This can be used as a basis for determining if a word is
related to other words [4].
Linear algebra is used to create word vectors where
each vector position represents its distance to another
word. Since SVD not only reduces the original matrix but
also changes the coefficients in the new condensed word
vectors, a new representation of the word vectors is
created [4].
SVD is a very useful dimension reduction
procedure. By reducing noise, it allows for analysis of
systematic similarities between vectors in the sub-matrices
it produces. When attempting to categorize information
into a small and static number of groups, focusing on
overall and reliable similarity is important, and SVD is a
good way to do this.
The kinds of natural language processing in which we are interested do not lend themselves as well to this kind of dimension reduction, for two reasons. First, when doing natural language processing beyond simple categorization, the systematic (but small) differences, or the similarities within a very small context or subset of words, are more important. This is especially true for ambiguity resolution and for predicting what word will come next. Consider the following example.
Imagine an algorithm that takes several words' vectors as input and returns a binary vector with a 1 for every element where each input vector has a positive value and a 0 if any one of the input vectors has a zero for that element. The output vector is the subset of contexts which all of the input vectors share. As the number of input vectors grows, or as the vectors become less related, the output vector becomes more and more sparse. Additionally, if the overall vector size is small (because of dimension reduction) there will be more 1's overall, making the output less distinctive and less useful for predicting unique occurrences for that particular set of inputs.
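A minimal sketch of this intersection, assuming the word vectors are NumPy co-occurrence arrays (the words and values are made up for illustration):

import numpy as np

def shared_contexts(vectors):
    # Element i of the result is 1 only if every input vector is positive
    # at position i; a single zero anywhere forces a 0.
    stacked = np.vstack(vectors)
    return (stacked > 0).all(axis=0).astype(int)

# The more vectors we intersect (or the less related they are),
# the sparser the shared-context vector becomes.
smelly = np.array([3, 0, 1, 0, 2])
apple = np.array([1, 2, 4, 0, 1])
print(shared_contexts([smelly, apple]))  # -> [1 0 1 0 1]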
The second reason has to do with mapping the original columns. In the HAL algorithm, the dimensions are meaningful. When dimension reduction is performed (via SVD, PCA, convolution, or some other transformation, such as the hidden-layer units in the Simple Recurrent Networks described below), it is harder to analyze each unit or dimension and understand what role it is playing in the bigger picture. It is useful to know, for a given concept, where it has high values and where it has low values. For example, we have noticed that for ambiguous words (like bank) many of the highest-value elements are useful for disambiguating the word (e.g. money, teller, savings, river, side). From a psycholinguistic standpoint, this is very informative for understanding the nature of ambiguity and how it is resolved.
These word-specific mappings may be very important when combined with the first reason; therefore we do not use SVD. Imagine an algorithm that provides the “current semantic state” of the system by producing a linear combination of its inputs. For example, the vectors for smelly and apple would be combined in a specific way (by upward weighting the dimensions that are relevant to those two specific concepts). This new vector for the combined concept smelly apple can then be combined with further vectors. The number of dimensions (contexts) that could unify and be relevant to smelly apple and a third concept is going to be small and very context specific. Reducing the dimensions (and eliminating the small variance that could be applicable to that small subset of semantic space) would likely eliminate all hope of finding a useful solution.
SVD could not be implemented in our project because most corpora used for testing were extremely large, and the resulting matrix could not fit into memory. Even if memory were not an issue, SVD requires an extensive amount of computation, and the result would take an intractable amount of time to calculate.
2.2 Simple Recurrent Networks
SRNs are a special class of multilayer neural networks. Specifically, they have a context layer that takes the output of the hidden layer at time t and feeds it back in as additional input at time t+1. Thus, the hidden units of the network form a representation of the current input that is dependent on both the current input and whatever came before. SRNs have been used to model semantic and grammatical structure [8][9] and syntactic structure [7]. They are related to our work in the sense that context is crucial to providing information about what a word means or what is likely to follow a given set of input.
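A minimal sketch of a single SRN time step, assuming NumPy and randomly initialized weights (purely illustrative, not the networks used in [7][8][9]):

import numpy as np

rng = np.random.default_rng(0)
vocab, hidden = 300, 50                    # assumed layer sizes
W_in = rng.normal(size=(hidden, vocab))    # input -> hidden
W_ctx = rng.normal(size=(hidden, hidden))  # context -> hidden
W_out = rng.normal(size=(vocab, hidden))   # hidden -> output

def step(x, context):
    # x: one-hot vector for the current word; context: previous hidden state.
    h = np.tanh(W_in @ x + W_ctx @ context)  # depends on the input AND what came before
    y = np.exp(W_out @ h)
    return y / y.sum(), h                    # next-word distribution, new context layer

context = np.zeros(hidden)
x = np.zeros(vocab)
x[42] = 1.0                                  # an arbitrary current word
probs, context = step(x, context)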
The SRNs are fed a training data set composed of a set of words, with each word assigned a random input layer pattern. Elman [8] fed the network a set of simple 3-word sentences built from no more than 300 words. The network's task was to learn which word would come next, which it learned to do with proficiency. Analyzing the hidden layer for a given input word (the vector of hidden layer unit activation levels) after learning made it possible to examine what kinds of representations were created for each word. Cluster analysis of those vectors produced clusters that look like those we create with HAL.
Elman also showed that SRNs are capable of segmenting speech. If you are trying to guess what letter will come next in an input stream, there is considerable variation in the transitional probability structure. Specifically, at the end of a word it is harder to guess what letter will come next than within a word (e.g. within a word, f-a-s-t-e- strongly constrains the next letter, whereas at a word boundary, such as after d-o or f-a-s-t-e-r, almost any letter can follow). The network learns these patterns, and its error rate at predicting the next letter can be used to determine where word boundaries are placed. Recent work with infants [13] has shown that infants are sensitive to these transitional probabilities and use them to segment speech.
Christiansen and Chater [7] used SRNs to analyze the ability of networks to learn recursively embedded patterns within natural language. The linguist Chomsky, a prominent critic of associative learning approaches, pointed out that a grammar more powerful than a finite state grammar is necessary for producing infinitely embedded sentences, of which he claimed humans were capable. Christiansen and Chater showed that the context- and time-dependent structure of SRN representations could learn sentences with several center-embedded clauses. They also showed that humans (despite Chomsky's intuitions) are not capable of processing and understanding deeply center-embedded sentences (e.g. The shot the soldier the mosquito the boy the girl kissed swatted bit fired missed). They concluded that while such a grammar may be necessary in order to have an algorithm that can parse infinitely embedded clauses, it is unnecessary to posit one for human language functioning.
2.3 Latent Semantic Analysis
LSI (Latent Semantic Indexing) is an algorithm designed by several researchers at Bell Labs in the late 1980s [4]. LSA (Latent Semantic Analysis) is the theoretical framework and collection of techniques for analyzing the matrices produced by LSI, so in practice the two terms refer to the same approach. Psychology articles almost exclusively refer to LSA, as it covers both the matrix construction and the different techniques that have been developed for analyzing the data.
LSA has several steps. First, a large corpus (divided
into n documents) and a vocabulary list (composed of m
words) are used as input. Word frequencies for each of
the words are tabulated in each document, producing a
vector of length n for each word, or an m x n matrix.
Several matrix transformation steps are then used. First, all
the frequencies are converted to log10 frequencies. Next,
SVD is performed on the matrix, which reduces the
original m x n matrix into three smaller matrices:
k x m matrix: a reduced-dimensionality matrix containing the most significant, least variant information for each word
k x n matrix: a reduced-dimensionality matrix containing the most significant, least variant information for each document
1 x k matrix: scaling values that give the weight of each dimension
For most psychological research, only the k x m matrix is used. In most published LSA papers, k = 300. The cosines between these word vectors are used to measure the similarity between items: words have higher cosines to the degree that they are correlated, or predictive of each other's co-occurrence, across the entire set of documents.
LSA cosines have been used to predict categorization results, priming, and many other psychological tasks where similarity is a factor. Centroids (averages) of word vectors have also been computed to generate meanings for larger units (phrases, sentences, adjective-noun pairs) with some success.
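As an illustration, here is a minimal sketch of the LSA pipeline described above (count matrix, log transform, truncated SVD, cosine comparison), assuming NumPy; the toy matrix and the small k are ours, not from any published system:

import numpy as np

# counts: m words x n documents term-frequency matrix (toy values).
counts = np.array([[3.0, 0.0, 1.0],
                   [0.0, 2.0, 4.0],
                   [1.0, 1.0, 0.0],
                   [2.0, 0.0, 2.0]])

logged = np.log10(counts + 1.0)                 # log-transform the raw frequencies
U, S, Vt = np.linalg.svd(logged, full_matrices=False)

k = 2                                           # published LSA work typically uses k = 300
word_vecs = U[:, :k] * S[:k]                    # reduced word representations (m x k)

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

print(cosine(word_vecs[0], word_vecs[3]))       # similarity of word 0 and word 3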
3 DATABASE INTEGRATION
The corpora range in size from thousands to over one million words. In order to handle these large corpus sizes, special data structures and storage methods had to be developed to store and search the data efficiently.
Creating the word co-occurrences uses large amounts of disk space to store the calculated co-occurrence distances. This data is then loaded from disk into the DBMS via a bulk-loading utility.
The DBMS chosen for storage is PostgreSQL. PostgreSQL's ability to perform reliable transactions with delayed write support is important for achieving reasonable insertion speeds. This was especially beneficial in situations where delaying writes to disk could speed up inserts by writing them in batches instead of one record at a time.
4 CORPUS SAMPLING
In order to find interesting lexical co-occurrences, a broad range of corpora were used, varying in both the number of words and dictionary size. The content ranged from unstructured data, such as Usenet groups, to themed data, such as fictional stories and books of the Bible. This large variation in corpora allowed us to test a wide range of data and also to ensure that co-occurrences were representative of normal language use rather than specific to a single corpus.
Since the results may vary widely based upon the contents of a corpus, three different corpora were used: two themed texts, Moby Dick and the Bible, and one un-themed corpus of Usenet data. Each corpus was sampled at sizes of 500, 1,000, 10,000, 25,000, 250,000, and 1,000,000 words, depending on the maximum number of words in the original corpus. For corpora larger than a million words, an additional test was run on the entire dataset regardless of its total size.
5 DATA REDUCTION
The task of data reduction is to retrieve a subset of the corpus' original words in order to remove low-impact words and to reduce time and complexity constraints [3]. In addition to the Porter Stemming algorithm, several other techniques were tested to reduce unnecessary information.
5.1 Data Cleaning
In order to get compelling results, a cleaning method must be implemented to remove non-words and other data which might create errors or skewed results. Two cleaning methods were implemented: the first removes all non-words but leaves in punctuation, and the second removes punctuation, leaving only whole words.
The first method keeps punctuation and treats each punctuation mark, with the exception of the hyphen character, as an individual word. We hoped punctuation would contribute to defining the structure of the corpus.
The second method assumes that punctuation does not contribute to the final structure of the corpus and therefore should not be included in the cleaned dataset.
After testing both methods, we concluded that including punctuation improves the accuracy of the co-occurrence values regardless of corpus size or theme.
5.2 Porter Stemming Algorithm
Words with a common stem usually have close or similar meanings. The ability to remove suffixes in an automated fashion is very important for information cleaning and standardization. The dimensional structure of a corpus can frequently be reduced using the Porter Stemming algorithm, which reduces a set of words to a single common stem [2].
Given a random corpus, there are many words which are essentially the same but appear different because of their suffixes. Ignoring suffixes and evaluating only the root of each word may be beneficial when recording co-occurrences, because there is little logical difference between words with different suffixes apart from the context in which they are used.
After testing both with and without Porter Stemming, we concluded that Porter Stemming is not beneficial for hierarchically clustering co-occurrences, because the suffixes in a corpus play a large part in determining grammatical structure.
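For reference, a minimal sketch of the stemming step we tested, assuming NLTK's implementation of the Porter algorithm (any implementation of [2] behaves similarly):

from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
for token in ["looked", "looking", "connection", "connected"]:
    # Inflectional variants collapse onto a shared stem, shrinking the
    # vocabulary but also discarding the suffixes that carry grammar.
    print(token, "->", stemmer.stem(token))
# looked -> look, looking -> look, connection -> connect, connected -> connect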
5.3 High-Frequency Words
High-frequency words that appear very often in a corpus are not important for comparison but are very useful for capturing the structure of the corpus. The occurrence of a high-frequency word does not imply meaning by itself, but it does contribute to the co-occurrence counts of other words. This is important because, by contributing to the co-occurrence values of the other, lower-frequency words, it still contributes to the overall value and structure of the corpus.
5.4 Low-Frequency Words
Low-frequency words, which appear very infrequently in comparison to other words in a corpus, are considered to have very low co-occurrence values because they contribute little information about the location of other words in the corpus. Unlike high-frequency words, low-frequency words bear little or no value for clustering word pairs. Low-frequency words are therefore removed, and only the top n words for a given corpus are used for clustering.
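A minimal sketch of this frequency cut, assuming tokenized text and the standard library (the default of 5,000 words matches the threshold used in Section 6):

from collections import Counter

def top_n_vocabulary(tokens, n=5000):
    # Count every token, then keep only the n most frequent words;
    # everything else is treated as too sparse to cluster.
    counts = Counter(tokens)
    return {word for word, _ in counts.most_common(n)}

tokens = "in the beginning was the word and the word was with god".split()
print(top_n_vocabulary(tokens, n=5))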
6 ALGORITHM
The algorithm has three distinct phases: the first involves cleaning and ordering the data, the second removes useless and other low-impact data, and the final phase is clustering the data.
Cleaning the data involves the obvious removal of any outlier data, such as numbers or corpus-specific material such as titles in play scripts. This process is straightforward but requires meticulous attention to ensure that cleaning the corpus does not affect or skew the results of the clustering.
The data reduction phase involves removing
sparse words. For all tests, we select the 5,000 most used
words, and ignore the rest.
The final and most important phase is clustering. Since accuracy is the most important property of the clustering for our purposes, the method used is hierarchical clustering. This method guarantees that the points grouped together are deterministically the closest points (by average linkage), which differentiates it from other algorithms: our algorithm provides deterministic measures of closeness and distance.
6.1 Extracting Whole Vectors
For each unique word chosen, a row vector is created which contains a value for every other word, including itself. For example, the word w1 compared to w2 yields a different vector than w2 compared to w1. Each vector value is the sum of the total co-occurrences of w1 and x, where x ranges over the words in the corpus.
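A minimal sketch of building these ordered co-occurrence row vectors with a sliding window, in the spirit of HAL [1]; the window size and distance weighting below are assumptions, not the exact parameters of our runs:

from collections import defaultdict

def cooccurrence_rows(tokens, window=10):
    # rows[w1][w2] accumulates how strongly w2 follows w1 inside the window;
    # nearer neighbours get larger weights, and (w1, w2) is kept separate
    # from (w2, w1).
    rows = defaultdict(lambda: defaultdict(float))
    for i, w1 in enumerate(tokens):
        for offset in range(1, window + 1):
            if i + offset >= len(tokens):
                break
            rows[w1][tokens[i + offset]] += window - offset + 1
    return rows

rows = cooccurrence_rows("the quick brown fox jumps over the lazy dog".split(), window=4)
print(dict(rows["the"]))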
6.2 Single Word Statistics
In order to reduce sparse data, each word is evaluated to count its co-occurrences in the corpus and to determine how sparse it is. Words that are too sparse are eliminated; more precisely, only the frequent words are kept.
6.3 Normalization of Vectors
Detecting the closeness of a given word pair using the raw co-occurrence values is less than optimal. For example, a co-occurrence value of 1 does not tell you whether the frequency of the word is high enough for the value to be meaningful: a word that occurs only once and happens to co-occur with another word cannot be distinguished from a word pair in which one word occurs many times in the corpus but falls inside the co-occurrence window only once [1].
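The normalization itself can take several forms; the sketch below shows one simple choice, dividing each co-occurrence count by the frequency of the row word, which is an assumption on our part rather than the exact scheme used:

from collections import Counter

def normalize_rows(rows, tokens):
    # Divide each co-occurrence count by how often the row word occurs, so a
    # single lucky co-occurrence of a rare word is not mistaken for a strong
    # association.
    freq = Counter(tokens)
    return {w1: {w2: count / freq[w1] for w2, count in neighbours.items()}
            for w1, neighbours in rows.items()}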
6.4 Vector Distances
Now that these vectors have been created, a distance metric is needed for determining the closeness of word pairs. For the purposes of this paper we use the city-block (Manhattan) distance.
In creating the distance metric, the important decisions are whether to measure distances using row vectors, column vectors, or both (appending the row vector to the column vector), and whether to use the raw co-occurrence vectors or the normalized vectors.
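A minimal sketch of the city-block comparison over row, column, or concatenated vectors, assuming the co-occurrence data is held in a NumPy matrix (the mode names are ours):

import numpy as np

def city_block(a, b):
    # Sum of absolute coordinate differences (Manhattan distance).
    return np.abs(a - b).sum()

def word_distance(cooc, i, j, mode="row"):
    # cooc: the N x N ordered co-occurrence matrix; i, j: word indices.
    if mode == "row":
        a, b = cooc[i, :], cooc[j, :]
    elif mode == "column":
        a, b = cooc[:, i], cooc[:, j]
    else:  # "both": append the row vector to the column vector
        a = np.concatenate([cooc[i, :], cooc[:, i]])
        b = np.concatenate([cooc[j, :], cooc[:, j]])
    return city_block(a, b)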
6.5 Hierarchical Clustering
The clustering method chosen for the corpora was hierarchical clustering. Hierarchical clustering is a bottom-up approach that recursively merges clusters until all of the points have been merged, with distances calculated using a predetermined metric. The cluster distances in the resulting hierarchy satisfy the ultrametric inequality:
dist(c1, c3) ≤ max(dist(c1, c2), dist(c2, c3))
There are several methods for determining the distance between clusters in hierarchical clustering: single, complete, and average linkage, which differ only in how the distance between two clusters is calculated.
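For reference, the three linkage rules are defined as follows, where d is the underlying point metric (city-block in our case) and A and B are clusters:

single linkage: dist(A, B) = min of d(a, b) over a in A, b in B
complete linkage: dist(A, B) = max of d(a, b) over a in A, b in B
average linkage: dist(A, B) = (1 / (|A| |B|)) * sum of d(a, b) over a in A, b in B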
An advantage of hierarchical clustering is that the determination of a good cluster is left to the person viewing the results. This type of evaluation plays an important part in finding meaning in word pairs because of the role of human interpretation [5]. Other clustering methods, such as the K-means algorithm, are not as forgiving as hierarchical clustering because they require the number of clusters to be set in advance.
Hierarchical clustering is also completely deterministic and, in our experiments, clustered the high-dimensional data most accurately.
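A minimal sketch of this clustering step, assuming SciPy's hierarchical clustering with average linkage over city-block distances (the library choice and the toy vectors are ours, not the paper's implementation):

import numpy as np
from scipy.cluster.hierarchy import dendrogram, linkage
from scipy.spatial.distance import pdist

# Rows of `vectors` stand in for the (normalized) co-occurrence vectors of the words.
words = ["client", "server", "data", "river", "water"]
vectors = np.random.default_rng(0).random((len(words), 50))

distances = pdist(vectors, metric="cityblock")         # pairwise city-block distances
merges = linkage(distances, method="average")          # deterministic average-linkage merges
tree = dendrogram(merges, labels=words, no_plot=True)  # dendrogram structure for inspection
print(tree["ivl"])                                     # leaf order of the resulting tree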
7 EXPERIMENT
The experimental process is a multi-step process focusing on the formatting, storing, and finally clustering of corpus data.
7.1 Corpus Formatting
The corpus is first formatted using several layers of cleaning and purification. The first pass clears all non-word data from the dataset, including headers, page numbers, and any other data that has no relation to the main corpus.
The next phase replaces all punctuation and numbers with pre-defined markers, which can be represented in the clustering application as words in their own right. Some examples of these replacements are provided below.
“.” = <PERIOD>
“0-9” = <NUMBER>
“!” = <EXCLAMATION>
Punctuation and numbers are assumed to play roughly
the same role as high-frequency words. The idea behind
keeping punctuation and numbers is that even though they
may not have any direct value in terms of meaning, they
do play an important role in contributing to the structure of
the corpus.
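A minimal sketch of this marker substitution using regular expressions; the marker names follow the table above, while the rest of our cleaning pipeline is more involved than shown here:

import re

MARKERS = [
    (re.compile(r"[0-9]+"), " <NUMBER> "),
    (re.compile(r"\."), " <PERIOD> "),
    (re.compile(r"!"), " <EXCLAMATION> "),
]

def format_corpus(text):
    # Replace punctuation and numbers with standalone marker tokens so the
    # clustering application can treat them as words in their own right.
    for pattern, marker in MARKERS:
        text = pattern.sub(marker, text)
    return text.split()

print(format_corpus("Call me Ishmael. Some years ago!"))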
7.2 Results Storage
Since the amount of storage required to hold both the unique word key and the resulting co-occurrence data is too large to fit into main memory, a method had to be devised to store the data efficiently without paying the slowdown associated with slow disk access times. The storage method chosen was a standard DBMS, PostgreSQL, selected for its ability to perform reliable transactions and handle large tables.
The database contains three main tables. The first table, “wordid”, contains the association between words and unique ids. The “wordcount” table contains the total number of occurrences of each word in the corpus; this is used to determine the top 5,000 words for clustering. The final table, “worddist”, contains the co-occurrence values for each word pair along with the window, or band, they are associated with. The window or band represents the area, or frame, in which the co-occurrence took place.
This method causes a slight slowdown in overall performance but allows corpora limited only by the capacity of the PostgreSQL database. Using a DBMS also reduces the overall system requirements of the standalone application, because little memory is required for analyzing and parsing the data.
7.3 Clustering
The data for clustering is retrieved directly from the database. The input data is an N x N matrix where each position holds the sum of the co-occurrences of a pair of unique words in a specified order, so, for example, the pair “fish” and “frog” is distinct from the pair “frog” and “fish”.
Once the values have been populated, the hierarchical clustering algorithm is executed and a dendrogram is produced. We analyzed the results by hand and compared clusters across corpora of different sizes and from different datasets to determine whether the experiment is useful for a given type and size of dataset.
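A sketch of assembling the N x N input matrix from the “worddist” table before clustering, under the same schema assumptions as above:

import numpy as np
import psycopg2

def load_matrix(conn, n):
    # Build the ordered N x N co-occurrence matrix: entry (i, j) sums the
    # co-occurrence values of word i followed by word j over all bands.
    matrix = np.zeros((n, n))
    cur = conn.cursor()
    cur.execute("SELECT id1, id2, SUM(value) FROM worddist "
                "WHERE id1 < %s AND id2 < %s GROUP BY id1, id2", (n, n))
    for id1, id2, total in cur:
        matrix[id1, id2] = total
    return matrix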
8 RESULTS
Sample results from our most accurate runs can be seen in Figures 1 through 3. Figure 1 presents a high-level view of the overall structure, and Figures 2 and 3 show partial zooms of the end nodes of a tree. From such results one can see how related words are grouped together: client, server, data, and project are words related to technical workplaces, while computer, system software, services, and internet are all words related to an internet service provider.
Figure 1 (high level view)
Figure 2 (low level dendrogram)
Figure 3 (low level dendrogram)
9 ANALYSIS
The first result obtained from the experiment showed that the size of the corpus directly correlates with the accuracy of the co-occurrence values. This was especially apparent in corpora smaller than 1,000,000 words: any such corpus produced spurious results at best, while corpora larger than about 1,000,000 words yielded results more aligned with expected word associations. Our best results came from corpora of around 10 million words or greater.
We found no substantial difference in using row
or column co-occurrence vectors, or using a combination
of both.
Leaving in punctuation and not using Porter
Stemming helped to preserve the grammatical associations
between word pairs.
Almost every randomly chosen sub-tree of the hierarchy contained obviously valid word pairs, but a few additional words were included which had little or no association with the other words in the group.
10 CONCLUSION
The Porter Stemming suffix algorithm proved unbeneficial and in many cases problematic. Suffix removal is not a perfect procedure and did not always produce favorable results: removing suffixes does reduce the corpus word base, but it also seems to truncate meaning indiscriminately. This was apparent in several misclassifications; for example, the words looker and looked were both trimmed to the root word look. Even though looker is slang, this still represents a misclassification of the corpus language.
As the results have shown, there is great promise in the use of hierarchical clustering in high-dimensional spaces. The results show that hierarchical clustering not only recovers obvious, logical co-occurrences but also tends to cluster small but relevant groups of words together.
The choice among the various clustering distance metrics tends to make little difference in the resulting hierarchical clusters; in every case the results were promising. This was especially apparent in the co-occurrences where the data had a specific theme, such as court cases, sports, or technology.
11 FUTURE WORK
Since a corpus sample can grow almost without limit, data reduction is important for creating data sets of low enough dimensionality to process on standard machines. In addition, the ability not only to create clusters of words but also to group the clusters themselves would be very helpful for exploring semantic meaning in future research.
The method presented has been tested on a small subset of available corpora. A wider range of corpus samples, such as political text, might offer promising results in meaning-based search and language modeling.
12 Bibliography
[1] Lund, K., & Burgess, C. (1996). Producing high-dimensional semantic spaces from lexical co-occurrence. Behavior Research Methods, Instruments, & Computers, 28(2), 203-208.
[2] Porter, M. F. (1980). An algorithm for suffix stripping. Program, 14(3), 130-137.
[3] Isbell, C. L., Jr., & Viola, P. (1998). Restructuring sparse high dimensional data for effective retrieval. AI Memo AIM-1636.
[4] Landauer, T. K., & Dumais, S. T. (1997). A solution to Plato's problem: The Latent Semantic Analysis theory of acquisition, induction and representation of knowledge. Psychological Review, 104, 211-240.
[5] Johnson, S. C. (1967). Hierarchical clustering schemes. Psychometrika, 32, 241-254.
[6] Landauer, T. K., Foltz, P. W., & Laham, D. (1998). Introduction to Latent Semantic Analysis. Discourse Processes, 25, 259-284.
[7] Christiansen, M. H., & Chater, N. (1999). Toward a
connectionist model of recursion in human linguistic
performance. Cognitive Science, 23, 157-205.
[8] Elman, J. L. (1990). Finding structure in time.
Cognitive Science, 14, 179-211.
[9] Elman, J. L. (1991). Distributed representations, simple recurrent networks, and grammatical structure. Machine Learning, 7, 195-224.
[10] Furnas, G. W., Deerwester, S., Dumais, S. T., Landauer, T. K., Harshman, R. A., Streeter, L. A., & Lochbaum, K. E. (1988). Information retrieval using a singular value decomposition model of latent semantic structure. Proceedings of the 11th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, May 1988.
[11] Landauer, T. K., & Dumais, S. T. (1997). A solution to Plato's problem: The Latent Semantic Analysis theory of acquisition, induction, and representation of knowledge. Psychological Review, 104, 211-240.
[12] Steyvers, M., & Griffiths, T. (in press). Probabilistic topic models. Forthcoming book on LSA.
[13] Saffran, J.R., Aslin, R.N., & Newport, E.L. (1996).
Statistical learning by 8-month old infants. Science,
274, 1926-1928.