Natural Language Processing using Hierarchical Clustering

Benjamin Arai and Chris Baron
{barai, cbaron}@cs.ucr.edu

ABSTRACT
Although there are various methods for indexing semantic data, there are no efficient algorithms for determining whether word co-occurrences contain useful relationships with highly subjective meanings for analyses such as surveys and polls. To this point it has been difficult to extract such information from corpora because of the large amounts of computation and memory required for computing semantic spaces of lexical co-occurrences. In this paper we present a method for searching corpora to detect lexical co-occurrences that provide significant and useful associations in a reasonable amount of time, given only a large corpus as input. HAL (Hyperspace Analogue to Language) is a procedure that processes large corpora of text into numerical vectors (vector representations of word meanings) which can be used for determining word relationships. These vectors can be used to create high-dimensional spaces for analyzing the statistical relationships of words and phrases, and from such a space semantic information can be extracted with no human intervention [1]. Beyond this extraction method, little work has been done to interpret the data in a way that can be easily understood by human analysts. This paper presents a technique for reducing semantic space dimensionality in conjunction with a clustering technique to produce accurate hierarchical clusters regardless of the number of dimensions. These clusters may then be used for detecting word and language relations.

1 INTRODUCTION
Current methods for detecting word co-occurrences are far from perfect. This paper presents a method for visualizing corpus data that is rooted in the HAL algorithm and expanded upon through clustering methods to create a more concrete relationship scheme of interpretable features. This method is far from optimal by itself; to address this issue, various data reduction techniques have been tested and examined to increase overall speed and accuracy. The question remains of how to create a dynamic clustering scheme that handles the subjective and nonlinear nature of human language. We have created a method for clustering this data which allows results to be interpreted and scrutinized in a hierarchical fashion.

2 RELATED WORK
This paper expands upon the work of Kevin Lund and Curt Burgess entitled "Producing high-dimensional semantic spaces from lexical co-occurrence". Their work is an "examination of a method for creating a simulation that exhibits some of the characteristics of human semantic memory, a simulation that develops through the analysis of human experience with the world in the form of natural language text" [1]. Their work provides the basis upon which our clustering techniques build. Methods for clustering high-dimensional corpora exist using various forms of semantic relation detection, but they always include some manual step for determining valid relations. These methods, though useful, have failed to produce results that are relative to an entire corpus and valid beyond subjective means [6]. Nearest-neighbor methods used so far have been useful for detecting valid co-occurrences, but only in circumstances where a co-occurrence is known to exist and the number of clusters is already known [3].
Furthermore, high-dimensional datasets suffer as the number of dimensions increases: the relative distance of each object to every other object becomes nearly uniform, calling the validity of detected co-occurrences into question.

The paper is structured as follows. Section 2 presents related work and background on natural language processing and dimension reduction. Section 3 explains how databases are an integral part of our technique. Section 4 describes the methods used for formatting and sampling corpus data sets. Section 5 explains how we eliminated unnecessary data. Section 6 describes how the algorithm works. Section 7 presents experiments clustering sample corpora using various techniques. Section 8 presents our results. Lastly, Sections 9 through 11 analyze the results, draw conclusions, and outline future work.

2.1 Singular Value Decomposition (SVD)
SVD is a convenient automated way to reduce dimensionality using statistical techniques and can be applied to almost any kind of data. SVD focuses on recursively combining like words into single dimensions, which can be used as a basis for determining whether a word is related to other words [4]. Linear algebra is used to create word vectors where each vector position represents the distance to another word. Since SVD not only reduces the original matrix but also changes the coefficients in the new condensed word vectors, a new representation of the word vectors is created [4].

SVD is a very useful dimension reduction procedure. By reducing noise, it allows for analysis of systematic similarities between vectors in the sub-matrices it produces. When attempting to categorize information into a small and static number of groups, focusing on overall and reliable similarity is important, and SVD is a good way to do this. The kind of natural language processing in which we are interested, however, does not lend itself as well to this kind of dimension reduction, for two reasons.

First, when doing natural language processing beyond simple categorization, the systematic (but small) differences, or the similarities within a very small context or subset of words, are more important. This is especially true for ambiguity resolution and for predicting what word will come next. Imagine an algorithm that takes several words' vectors as input and returns a binary vector with a 1 for every element where each input vector has a positive value and a 0 wherever any of the input vectors has a zero. The output vector would be the subset of contexts which all of the input vectors share. As the input vectors become more numerous or less related, the output vector becomes sparser. Additionally, if the overall vector size is small (because of dimension reduction), there will be more 1's overall, making the output less distinctive and less useful for predicting unique occurrences for that particular set of inputs.
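As a minimal sketch of this shared-context intersection (the two word vectors, their six dimensions, and the word names below are hypothetical, purely for illustration):

    import numpy as np

    def shared_contexts(vectors):
        # 1 where every input vector is positive, 0 where any input vector is zero.
        stacked = np.vstack(vectors)
        return (stacked > 0).all(axis=0).astype(int)

    # Hypothetical six-dimensional co-occurrence vectors for two words.
    smelly = np.array([3, 0, 1, 4, 0, 2])
    apple = np.array([1, 2, 0, 5, 0, 7])
    print(shared_contexts([smelly, apple]))   # -> [1 0 0 1 0 1]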
The second reason has to do with mapping the original columns. In the HAL algorithm, the dimensions are meaningful. When dimension reduction (SVD, PCA, convolution, or some other transformation, such as the hidden-layer units in the Simple Recurrent Networks described below) is performed, it becomes harder to analyze each unit or dimension and understand what role it plays in the bigger picture. It is useful to know, for a given concept, where it has high values and where it has low values. For example, we have noticed that for ambiguous words (like bank) many of the highest-value elements are useful for disambiguating the word (e.g., money, teller, savings, river, side). From a psycholinguistic standpoint, this is very informative for understanding the nature of ambiguity and how it is resolved. These word-specific mappings may be very important when combined with the first reason, and therefore we do not use SVD. Imagine an algorithm that provides the "current semantic state" of the system by producing a linear combination of its input. For example, the vectors for smelly and apple would be combined in a specific way (by upward weighting the dimensions that are relevant to those two specific concepts). This new vector for the combined concept smelly apple can then be combined with new vectors. The number of dimensions (contexts) that could unify and be relevant to smelly apple and a third concept is going to be small and very context specific. Reducing the dimensions (and eliminating the small variance that could be applicable to that small subset of semantic space) could well eliminate all hope of finding a useful solution.

In addition, SVD could not be implemented in our project due to the fact that most corpora used for testing were extremely large and the resulting matrix could not fit into memory. Even if memory were not an issue, SVD requires an extensive amount of computation, and the result would take an intractable amount of time to calculate.

2.2 Simple Recurrent Networks
SRNs are a special class of multilayer neural networks. Specifically, they have a context layer that takes the output of the hidden layer at time t and feeds it in as additional input at time t+1. Thus, the hidden units of the network form a representation of the current input that is dependent on both the current input and whatever came before. SRNs have been used to model semantic and grammatical structure [8][9] and syntactic structure [7]. They are related to our work in the sense that context is crucial to providing information about what a word means or what is likely to follow a given set of input.

The SRNs are fed a training set of words, each word being assigned a random input-layer pattern. Elman [8] fed the network a set of simple three-word sentences drawn from a vocabulary of no more than 300 words; the network's task was to learn which word would come next, which it learned to do with proficiency. Analyzing the hidden layer for a given input word (the vector of hidden-layer unit activation levels) after learning made it possible to examine what kinds of representations were created for each word, and cluster analysis of those vectors produced groupings that look like those we create with HAL.

Elman also showed that SRNs are capable of segmenting speech. When trying to guess what letter will come next in an input stream, there is considerable variation in the transitional probability structure. Specifically, at the end of a word it is harder to guess what letter will come next than within a word (e.g., d-o- has a smaller set of possible next letters than the completed f-a-s-t-e-r-). The networks learn these patterns, and their error rates at predicting the next letter can be used to determine where word boundaries are placed. Recent work with infants [13] has shown that infants are sensitive to these transitional probabilities and use them to segment speech.
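For readers unfamiliar with the architecture, the following is a minimal sketch of the Elman-style context layer described at the start of this subsection; the layer sizes, random initialization, and tanh nonlinearity are illustrative assumptions, and no training step is shown:

    import numpy as np

    class SimpleRecurrentNetwork:
        # Toy SRN: the hidden state at time t is fed back as context input at time t+1.
        def __init__(self, n_in, n_hidden, n_out, seed=0):
            rng = np.random.default_rng(seed)
            self.W_in = rng.normal(scale=0.1, size=(n_hidden, n_in))
            self.W_ctx = rng.normal(scale=0.1, size=(n_hidden, n_hidden))
            self.W_out = rng.normal(scale=0.1, size=(n_out, n_hidden))
            self.context = np.zeros(n_hidden)   # copy of the hidden layer from the previous step

        def step(self, x):
            hidden = np.tanh(self.W_in @ x + self.W_ctx @ self.context)
            self.context = hidden               # becomes additional input at the next time step
            return self.W_out @ hidden          # scores over the next-word vocabulary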
Christiansen and Chater [7] used SRNs to analyze the ability of networks to learn recursively embedded patterns within natural language. The linguist Chomsky, a prominent critic of associative learning approaches, argued that more than a finite-state grammar is necessary for producing the infinitely embedded sentences of which he claimed humans were capable. Christiansen and Chater showed that the context- and time-dependent structure of SRN representations could learn sentences with several center-embedded clauses. They also showed that humans (despite Chomsky's intuitions) are not capable of processing and understanding sentences with multiple center-embedded clauses (e.g., "The shot the soldier the mosquito the boy the girl kissed swatted bit fired missed"). They concluded that while such a grammar may be necessary in order to have an algorithm that can parse infinitely embedded clauses, it is unnecessary to posit one for human language functioning.

2.3 Latent Semantic Analysis
LSI (Latent Semantic Indexing) is an algorithm designed by several researchers at Bell Labs in the late 1980s [4]. LSA (Latent Semantic Analysis) is the theoretical framework and collection of techniques for analyzing the matrices produced by LSI, so the two are essentially one and the same. Psychology articles almost exclusively refer to LSA, since it covers both the matrix construction and the different techniques developed for analyzing the data.

LSA has several steps. First, a large corpus (divided into n documents) and a vocabulary list (composed of m words) are used as input. The frequency of each word is tabulated in each document, producing a vector of length n for each word, or an m x n matrix. Several matrix transformations are then applied. First, all the frequencies are converted to log10 frequencies. Next, SVD is performed on the matrix, which reduces the original m x n matrix into three smaller matrices:

- a k x m matrix: a reduced-dimensionality matrix containing the most significant, least variant information for each word;
- a k x n matrix: a reduced-dimensionality matrix containing the most significant, least variant information for each document;
- a 1 x k matrix: scaling values that give the weight of each dimension.

For most psychological research, only the k x m matrix is used, and in most published LSA papers k = 300. The cosines between these word vectors are used to measure the similarity between items: words have higher cosines to the degree that they are correlated with, or predictive of, each other's co-occurrence across the entire set of documents. LSA cosines have been used to predict categorization results, priming, and many other psychological tasks where similarity is a factor. Centroids (averages) of word vectors have also been computed to generate meanings for larger units (phrases, sentences, adjective-noun pairs) with some success.
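As a minimal sketch of this pipeline, assuming a tiny hypothetical word-by-document count matrix and a much smaller k than the usual 300 (the +1 inside the log is an assumption to avoid taking the log of zero):

    import numpy as np

    counts = np.array([[4, 0, 1],        # hypothetical m x n word-by-document frequencies
                       [2, 3, 0],
                       [0, 5, 2]], dtype=float)
    logged = np.log10(counts + 1)        # log10 transform

    U, s, Vt = np.linalg.svd(logged, full_matrices=False)
    k = 2                                # reduced dimensionality
    word_vectors = U[:, :k] * s[:k]      # one k-dimensional vector per word

    def cosine(a, b):
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

    print(cosine(word_vectors[0], word_vectors[1]))   # similarity of the first two words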
3 DATABASE INTEGRATION
The corpora range in size from thousands to over one million words. In order to handle these large corpus sizes, special data structures and storage methods had to be developed to store and search the data efficiently. Creating the word co-occurrences uses large amounts of disk space to store the calculated co-occurrence distances. These data are then loaded from disk into the DBMS via a bulk loading utility. The DBMS chosen for storage is PostgreSQL. PostgreSQL's ability to perform reliable transactions with delayed write support is important for achieving reasonable insertion speeds; delaying writes to disk speeds up inserts by applying them in batches instead of one record at a time.

4 CORPUS SAMPLING
In order to find interesting lexical co-occurrences, a broad range of corpora was used, varying in both the number of words and the dictionary size. The content ranged from sporadic data such as Usenet groups to themed data such as fictional stories and books of the Bible. This large variation allowed us to test a wide range of data and to ensure that the detected co-occurrences were representative of normal language use rather than of a single corpus. Since the results may vary widely based upon the contents of a corpus, three different corpora were used: two themed texts, Moby Dick and the Bible, and one un-themed corpus of Usenet data. Each corpus was sampled at sizes of 500, 1,000, 10,000, 25,000, 250,000, and 1,000,000 words, depending on the maximum number of words in the original corpus. For corpora larger than a million words, an additional test was run on the entire dataset regardless of its total size.

5 DATA REDUCTION
The task of data reduction is to retrieve a subset of the corpus's original words in order to remove low-impact words and to reduce time and complexity constraints [3]. In addition to the Porter Stemming algorithm, several other techniques were tested to reduce unnecessary information.

5.1 Data Cleaning
In order to get compelling results, a cleaning method must be implemented to remove non-words and other data which might create errors or skew results. Two cleaning methods were implemented: the first removes all non-words but leaves in punctuation, and the second removes punctuation, leaving only whole words. The first method treats each punctuation mark (with the exception of the hyphen character) as an individual word; we hoped that punctuation would contribute to defining the structure of the corpus. The second method assumes that punctuation does not contribute to the final structure of the corpus and therefore should not be included in the cleaned dataset. After testing both methods, we concluded that including punctuation improves the accuracy of the co-occurrence values regardless of corpus size or theme.

5.2 Porter Stemming Algorithm
Words with a common stem usually have close or similar meanings, and the ability to remove suffixes in an automated fashion is very important for information cleaning and standardization. The dimensional structure of a corpus is frequently reduced with the Porter Stemming algorithm, which collapses a set of words into a single common stem [2]. In a random corpus there are many words which are effectively the same but appear different because of their suffixes. Ignoring suffixes and evaluating only the root of each word may be beneficial when recording co-occurrences, because there is no logical difference between words with different suffixes except for the context in which they are used. After testing both with and without Porter Stemming, we concluded that Porter Stemming is not beneficial for hierarchically clustering co-occurrences, because suffixes play a large part in determining the grammatical structure of a corpus.
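For reference, a minimal sketch of the kind of suffix stripping described above, using NLTK's PorterStemmer as an assumed stand-in (the paper does not prescribe a particular implementation):

    from nltk.stem import PorterStemmer

    stemmer = PorterStemmer()
    tokens = ["looked", "looking", "looks"]
    print([stemmer.stem(t) for t in tokens])   # -> ['look', 'look', 'look']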
5.3 High-Frequency Words
High-frequency words that appear very often in a corpus are not important for comparison by themselves, but they are very useful in considering the structure of the corpus. The occurrence of a high-frequency word does not imply meaning on its own, yet it contributes to the co-occurrence values of other, lower-frequency words and thus to the overall value and structure of the corpus.

5.4 Low-Frequency Words
Low-frequency words that appear very infrequently in comparison to other words in a corpus have very low co-occurrence values, because they contribute little information about the location of other words in the corpus. Unlike high-frequency words, low-frequency words bear little or no value for clustering word pairs. Low-frequency words are therefore removed, and only the top n words for a given corpus are used for clustering.

6 ALGORITHM
The algorithm has three distinct phases. The first phase cleans and orders the data, the second removes useless and other low-impact data, and the final phase clusters the data. Cleaning the data involves the removal of outlier data such as numbers or corpus-specific data such as the titles in play scripts. This process is straightforward but requires meticulous attention to ensure that cleaning the corpus does not affect or skew the clustering results. The data reduction phase removes sparse words; for all tests, we select the 5,000 most used words and ignore the rest. The final and most important phase is clustering. Since the most important aspect of clustering here is accuracy, the method used is hierarchical clustering. This method ensures that the closeness found for any given group is deterministically that of the closest points (by average linkage). This differentiates our approach from other algorithms because it provides deterministic results of closeness and distance.

6.1 Extracting Whole Vectors
For each unique word chosen, a row vector is created which contains a value for every other word, including the word itself. For example, the vector for w1 compared to w2 differs from the vector for w2 compared to w1. Each vector value is the sum of the total co-occurrences of w1 and x, where x is each word in the corpus.

6.2 Single Word Statistics
In order to reduce sparse data, each word is evaluated to count its co-occurrences in the corpus and to determine how sparse it is. Words that are too sparse are eliminated; more precisely, only the frequent words are kept.

6.3 Normalization of Vectors
Detecting the closeness of a given word pair using the raw co-occurrence values is less than optimal. For example, a co-occurrence value of 1 does not tell you whether the frequency of the word is high enough for the value to be meaningful: a word that occurs only once and happens to fall within another word's co-occurrence window cannot be distinguished from a word that occurs many times in the corpus but appears within the co-occurrence window only once [1].

6.4 Vector Distances
Once these vectors have been created, a distance metric is needed for determining the closeness of word pairs; for the purposes of this paper we use the City Block (L1) distance. In creating the distance metric, the important questions are whether to measure distances using row, column, or combined (row appended to column) vectors, and whether to use the raw co-occurrence vectors or the normalized vectors.
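The following is a minimal sketch of the steps in Sections 6.1 through 6.4, assuming a HAL-style forward window of five words and raw (unnormalized) counts; the window size, weighting, and example sentence are illustrative rather than the exact parameters of our runs:

    from collections import defaultdict

    def cooccurrence_vectors(tokens, window=5):
        # For each word, count the words that follow it within the window,
        # weighted so that closer words contribute more (only the forward
        # direction is shown; HAL also records the preceding direction).
        vectors = defaultdict(lambda: defaultdict(float))
        for i, w in enumerate(tokens):
            for d in range(1, window + 1):
                if i + d < len(tokens):
                    vectors[w][tokens[i + d]] += window + 1 - d
        return vectors

    def city_block(vec_a, vec_b, vocab):
        # City Block (L1) distance between two sparse co-occurrence vectors.
        return sum(abs(vec_a.get(x, 0.0) - vec_b.get(x, 0.0)) for x in vocab)

    tokens = "the cat sat on the mat near the dog".split()
    vectors = cooccurrence_vectors(tokens)
    print(city_block(vectors["cat"], vectors["dog"], set(tokens)))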
6.5 Hierarchical Clustering
The clustering method chosen for the corpora was hierarchical clustering, a bottom-up approach that recursively merges clusters until all of the points have been merged. Distances are calculated using a predetermined metric, and the resulting cluster distances satisfy the ultrametric bound

    dist(c1, c3) <= max(dist(c1, c2), dist(c2, c3))

There are several methods for determining the distance between clusters in hierarchical clustering: single, complete, and average linkage. An advantage of hierarchical clustering is that the determination of a good cluster is left to the person viewing the results; this type of evaluation plays an important part in finding meanings in word pairs because of human interpretation [5]. Other clustering methods, such as the K-means algorithm, are not as forgiving as hierarchical clustering because they require the number of clusters to be set in advance. Hierarchical clustering is completely deterministic and clusters high-dimensional data accurately.

7 EXPERIMENT
The experiment is a multi-step process focusing on formatting, storing, and finally clustering the corpus data.

7.1 Corpus Formatting
The corpus is first formatted using several layers of cleaning and purification. The first pass clears all non-word data from the dataset, including headers, page numbers, and any other data that has no relation to the main body of the corpus. The next phase replaces all punctuation and numbers with pre-defined markers, which the clustering application can treat as words or entities in their own right. Some examples of these replacements are provided below.

    "."   = <PERIOD>
    "0-9" = <NUMBER>
    "!"   = <EXCLAMATION>

Punctuation and numbers are assumed to play roughly the same role as high-frequency words: even though they may not have any direct value in terms of meaning, they play an important role in contributing to the structure of the corpus.

7.2 Results Storage
Since the amount of storage required for holding both the unique word key and the resulting co-occurrence data is too large to fit into main memory, a method had to be devised to store the data efficiently without paying the full penalty of slow disk access times. The method chosen was a standard DBMS, PostgreSQL, selected for its ability to perform reliable transactions and handle large tables. The database contains three main tables. The first table, "wordid", contains the association between words and their unique ids. The second, "wordcount", contains the total number of occurrences of each word in the corpus and is used to determine the top 5,000 words for clustering. The final table, "worddist", contains the co-occurrence values for each pair of words along with the window (or band) they are associated with; the window represents the frame in which the co-occurrence took place. This approach causes a slight slowdown in overall performance but supports corpora limited only by the limitations of the PostgreSQL database. Using a DBMS also reduces the overall system requirements of the standalone application, because little memory is required for analyzing and parsing the data.
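A minimal sketch of this three-table layout and the batched loading it enables, using psycopg2; the column names, types, and connection string are illustrative assumptions rather than the exact schema of our application:

    import psycopg2

    conn = psycopg2.connect("dbname=corpus user=postgres")   # hypothetical connection string
    cur = conn.cursor()

    cur.execute("CREATE TABLE wordid (word TEXT, id INTEGER)")          # word <-> unique id
    cur.execute("CREATE TABLE wordcount (id INTEGER, total INTEGER)")   # occurrences per word
    cur.execute("CREATE TABLE worddist (id1 INTEGER, id2 INTEGER, "
                "band INTEGER, value REAL)")                            # windowed co-occurrence values

    # Bulk load pre-computed co-occurrence values from a tab-separated file
    # rather than inserting one record at a time.
    with open("worddist.tsv") as f:
        cur.copy_from(f, "worddist", sep="\t")

    conn.commit()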
7.3 Clustering
The data for clustering is retrieved directly from the database. The input is an N x N matrix where each position holds the sum of the co-occurrences of a pair of unique words in a specified order; the pair "fish" and "frog", for example, is distinct from the pair "frog" and "fish". Once the matrix has been populated, the hierarchical clustering algorithm is executed and a dendrogram is output. We analyzed the results by hand, comparing clusters across corpora of different sizes and different datasets to determine whether the technique is useful for a given type and size of dataset.
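A minimal sketch of this clustering step, using SciPy's hierarchical clustering with average linkage and the City Block metric on a small hypothetical co-occurrence matrix (the words and values are invented for illustration):

    import numpy as np
    import matplotlib.pyplot as plt
    from scipy.cluster.hierarchy import dendrogram, linkage

    words = ["client", "server", "data", "river", "water"]
    # Hypothetical matrix of summed co-occurrence values; each row is a word vector.
    cooc = np.array([[0, 8, 6, 1, 0],
                     [8, 0, 7, 0, 1],
                     [6, 7, 0, 1, 1],
                     [1, 0, 1, 0, 9],
                     [0, 1, 1, 9, 0]], dtype=float)

    Z = linkage(cooc, method="average", metric="cityblock")   # average linkage, City Block distance
    dendrogram(Z, labels=words)
    plt.show()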
8 RESULTS
Sample results from our most accurate runs can be seen in Figures 1 through 3. Figure 1 presents a high-level view of the overall structure, and Figures 2 and 3 show partial zooms of the end nodes of a tree. From such results it can be seen how related words are grouped together: client, server, data, and project are words related to technical workplaces, while computer, system, software, services, and internet are all words related to an internet service provider.

Figure 1 (high level view)
Figure 2 (low level dendrogram)
Figure 3 (low level dendrogram)

9 ANALYSIS
The first result obtained from the experiment was that the size of the corpus directly correlates with the accuracy of the co-occurrence values. This was especially apparent in corpora smaller than 1,000,000 words, which produced spurious results at best. For corpora larger than about 1,000,000 words the results were more closely aligned with expected word associations, and our best results came from corpora of around 10 million words or greater. We found no substantial difference between using row or column co-occurrence vectors or a combination of both. Leaving in punctuation and not using Porter Stemming helped to preserve the grammatical associations between word pairs. Almost every randomly chosen hierarchical sub-tree contained obviously valid word pairs, but a few additional words were included which had little or no association with the other words in the group.

10 CONCLUSION
The usage of the Porter Stemming suffix algorithm proved unbeneficial and in many cases problematic: suffix removal is not a perfect process, and while it does reduce the corpus word base it also seems to truncate meaning indiscriminately. This was apparent through several misclassifications; for example, the words looker and looked were both trimmed to the root word look. Even though looker is slang, this still represents a misclassification of the corpus language. As the results have shown, there is great promise in the usage of hierarchical clustering in high-dimensional spaces. The results show that hierarchical clustering not only creates obvious, logical co-occurrences but also tends to cluster small, relevant groups of words together. The choice among the various clustering distance metrics made little difference in creating the hierarchical clusters; in every case the results were promising. This was especially true for co-occurrences where the data had a specific theme such as court cases, sports, or technology.

11 FUTURE WORK
Since a corpus sample can reach an almost limitless size, data reduction is important for creating data sets of low enough dimension to compute on standard machines. In addition, the ability not only to create clusters of words but also to group them would be very helpful in exploring semantic meaning in future research. The method presented has been tested on a small subset of available corpora; a wider range of corpus samples, such as political text, might offer promising results in meaning-based search and language modeling.

12 Bibliography
[1] Lund, K., & Burgess, C. (1996). Producing high-dimensional semantic spaces from lexical co-occurrence. Behavior Research Methods, Instruments, & Computers, 28(2), 203-208.
[2] Porter, M. F. (1980). An algorithm for suffix stripping. Program, 14(3), 130-137.
[3] Isbell, C. L., Jr., & Viola, P. (1998). Restructuring sparse high dimensional data for effective retrieval. MIT AI Memo AIM-1636.
[4] Landauer, T. K., & Dumais, S. T. (1997). A solution to Plato's problem: The Latent Semantic Analysis theory of acquisition, induction, and representation of knowledge. Psychological Review, 104, 211-240.
[5] Johnson, S. C. (1967). Hierarchical clustering schemes. Psychometrika, 32, 241-254.
[6] Landauer, T. K., Foltz, P. W., & Laham, D. (1998). Introduction to Latent Semantic Analysis. Discourse Processes, 25, 259-284.
[7] Christiansen, M. H., & Chater, N. (1999). Toward a connectionist model of recursion in human linguistic performance. Cognitive Science, 23, 157-205.
[8] Elman, J. L. (1990). Finding structure in time. Cognitive Science, 14, 179-211.
[9] Elman, J. L. (1991). Distributed representations, simple recurrent networks, and grammatical structure. Machine Learning, 7, 195-224.
[10] Furnas, G. W., Deerwester, S., Dumais, S. T., Landauer, T. K., Harshman, R. A., Streeter, L. A., & Lochbaum, K. E. (1988). Information retrieval using a singular value decomposition model of latent semantic structure. Proceedings of the 11th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval.
[11] Landauer, T. K., & Dumais, S. T. (1997). A solution to Plato's problem: The Latent Semantic Analysis theory of acquisition, induction, and representation of knowledge. Psychological Review, 104, 211-240.
[12] Steyvers, M., & Griffiths, T. (in press). Probabilistic topic models. Forthcoming book on LSA.
[13] Saffran, J. R., Aslin, R. N., & Newport, E. L. (1996). Statistical learning by 8-month-old infants. Science, 274, 1926-1928.