A WORD-BASED SOFT CLUSTERING ALGORITHM FOR DOCUMENTS
King-Ip Lin, Ravikumar Kondadadi
Department of Mathematical Sciences,
The University of Memphis,
Memphis, TN 38152, USA.
{linki, kondadir}@msci.memphis.edu
Abstract:
Document clustering is an important tool for
applications such as Web search engines. It enables the
user to have a good overall view of the information
contained in the documents. However, existing algorithms
have various drawbacks; hard clustering algorithms
(where each document belongs to exactly one cluster)
cannot detect the multiple themes of a document, while
soft clustering algorithms (where each document can
belong to multiple clusters) are usually inefficient. We
propose WBSC (Word-based Soft Clustering), an
efficient soft clustering algorithm based on a given
similarity measure. WBSC uses a hierarchical approach to
cluster documents having similar words. WBSC is very
effective and efficient when compared with existing hard
clustering algorithms like K-means and its variants.
Keywords:
Document clustering, Word based clustering,
soft clustering
1. Introduction
Document clustering is a very useful tool in today’s
world where a lot of documents are stored and retrieved
electronically. It enables one to discover hidden similarity
and key concepts. Moreover, it enables one to summarize
a large number of documents using key or common
attributes of the clusters. Thus clustering can be used to
categorize document databases and digital libraries, as
well as to provide useful summary information about the
categories for browsing purposes. For instance, a typical
search on the World Wide Web can return thousands of
documents. Automatic clustering of the documents
enables the user to have a clear and easy grasp of what
kind of documents are retrieved, providing tremendous
help for him/her to locate the right information.
Many clustering techniques have been developed and
they can be applied to clustering documents. [3, 6]
contain examples of using such techniques. Most of these
traditional approaches use documents as the basis of
clustering. In this paper, we take an alternative approach,
word-based soft clustering (WBSC). Instead of
comparing documents and clustering them directly, we
cluster the words used in such documents. For each word,
we form a cluster containing documents that have that
word. After that, we combine clusters that contain similar
sets of documents, by an agglomerative approach [5]. The
resulting clusters will contain documents that share a
similar set of words.
We believe this approach has many advantages.
Firstly, it allows a document to reside in multiple clusters.
This allows one to capture documents that contain
multiple topics. Secondly, it lends itself to a natural
representation of the clusters, namely the words
associated with the clusters themselves. Also, the running
time is linear in the number of documents. Even
though the running time grows quadratically with the
number of clusters, this number is bounded by the number
of words, which has a rather fixed upper bound (the
number of words in the language). This allows the
algorithm to scale well.
The rest of the paper is organized as follows: Section
2 summarizes related work in the field. Section 3
describes our algorithm in more detail. Section 4 provides
some preliminary experimental results. Section 5 outlines
future direction of this work.
2. Related work
Clustering algorithms have been developed and used
in many fields. [4] provides an extensive survey of
various clustering techniques. In this section, we highlight
work done on document clustering.
Many clustering techniques have been applied to
clustering documents. For instance, Willett [9] provided a
survey on applying hierarchical clustering algorithms to
clustering documents. Cutting et al. [1] adapted various
partition-based clustering algorithms to clustering
documents. Two of the techniques are Buckshot and
Fractionation. Buckshot selects a small sample of
documents to pre-cluster them using a standard clustering
algorithm and assigns the rest of the documents to the
clusters formed. Fractionation splits the N documents into
‘m’ buckets where each bucket contains N/m documents.
Fractionation takes an input parameter , which indicates
the reduction factor for each bucket. The standard
clustering algorithm is applied so that if there are ‘n’
documents in each bucket, they are clustered into n/
clusters. Now each of these clusters is treated as if they
were individual documents and the whole process is
repeated until there are only ‘K’ clusters.
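As a toy illustration (not code from Cutting et al.), the sketch below counts how many rounds of Fractionation's repeated bucket-and-cluster step are needed to reach K clusters, assuming each round shrinks the number of items by roughly the reduction factor ρ; the integer-division rounding is our assumption.

```python
# Illustrative only: each Fractionation round reduces the current item
# count by roughly the factor rho, until only k clusters remain.
def fractionation_rounds(n_docs, rho, k):
    items, rounds = n_docs, 0
    while items > k:
        # a bucket of n items is clustered into about n/rho clusters
        items = max(k, items // rho)
        rounds += 1
    return rounds

print(fractionation_rounds(1000, 10, 10))  # 1000 -> 100 -> 10: 2 rounds
```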
A relatively new technique was proposed by Zamir
and Etzioni [10]. They introduced the notion of
phrase-based document clustering: they use a generalized
suffix tree to obtain information about the phrases and use
them to cluster the documents.
3. WBSC
As the name suggests, WBSC uses a word-based
approach to build clusters. It first forms initial clusters of
the documents, with each cluster representing a single
word. For instance, WBSC forms a cluster for the word
‘tiger’ made up of all the documents that contain the word
‘tiger’. After that, WBSC merges similar clusters –
clusters are similar if they contain a similar set of
documents – using a hierarchical based approach until
some stopping criterion is reached. At the end, the
clusters are displayed based on the words associated with
them.
Cluster Initialization:
We represent each cluster as a vector of
documents. For each word, we create a bit vector (called
term vector), with each dimension representing whether a
certain document is present or absent (denoted by 1 and 0,
respectively). This step requires only one pass through the
set of the documents. However, we do maintain some sort
of index (e.g. a hash table) on the list of words in the
document to speed up the building process.
Notice that the number of clusters is independent
of the number of documents. However, it grows with the
total number of words in all the documents. Thus, we do
not consider stop-words -- words that either carry no
meaning (like prepositions) or very common words. The
second idea can be extended: words that appear in too
many documents are not going to help clustering.
Similarly, if a word appears in only one document, its
cluster is unlikely to merge with other clusters and
remains useless for clustering. Thus, we remove all initial
clusters that contain only one document or more than half
of the total number of documents.
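A minimal sketch of this initialization step, using Python sets of document ids in place of the bit vectors described above (an implementation choice of ours, made for readability); the stop-word list is a placeholder.

```python
from collections import defaultdict

STOP_WORDS = {"the", "a", "of", "in", "to"}  # placeholder stop-word list

def initial_clusters(docs):
    """docs: dict mapping doc_id -> iterable of words."""
    term_vector = defaultdict(set)         # word -> set of doc ids
    for doc_id, words in docs.items():     # one pass over the documents
        for w in set(words):
            if w.lower() not in STOP_WORDS:
                term_vector[w.lower()].add(doc_id)
    # prune words appearing in only one document or in more than half
    # of all documents, as neither helps clustering
    half = len(docs) / 2
    return {w: d for w, d in term_vector.items() if 1 < len(d) <= half}

clusters = initial_clusters({
    1: ["tiger", "jungle"], 2: ["tiger", "zoo"],
    3: ["zoo", "cage"], 4: ["the", "cat"]})
print(sorted(clusters))  # ['tiger', 'zoo']
```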
Cluster Building:
Once the initial clusters are formed, we apply a
hierarchical-based algorithm to combine the clusters.
Here we need a measure of similarity between any pair of
clusters. Let us assume we have two clusters A and B,
with n1 and n2 documents respectively (without loss of
generality, assume n1 <= n2), and with c documents in
common. Intuitively, we would like to combine the two
clusters if they have many common documents relative to
the sizes of the clusters. We first tried c / min(n1, n2) as
our similarity measure. However, this measure does not
perform well. After trying different measures, we settled
on the Tanimoto measure [2]:

    c / (n1 + n2 - c).
WBSC iterates over the clusters, using this similarity
measure to decide which to merge: every pair of clusters
is considered, and a pair is merged if its similarity
exceeds a pre-defined threshold. Also, there are cases
where one cluster is simply a subset of the other; we
merge them in that case as well. Currently we use 0.33
as our threshold. We are experimenting with different
ways of finding a better threshold.
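A sketch of one merging pass under these rules (pairwise Tanimoto above the 0.33 threshold, or one document set contained in the other); the exact iteration and termination details are ours, as the paper does not fully specify them. Clusters are (word_list, doc_set) pairs so that merged clusters keep the words of both, as described below.

```python
def tanimoto(a, b):
    c = len(a & b)
    return c / (len(a) + len(b) - c)

# One merging pass, a sketch: clusters is a list of (word_list, doc_set)
# pairs; a pair is merged when its Tanimoto similarity exceeds the
# threshold or when one document set is a subset of the other.
def merge_pass(clusters, threshold=0.33):
    merged, used = [], [False] * len(clusters)
    for i, (wi, di) in enumerate(clusters):
        if used[i]:
            continue
        words, docs = list(wi), set(di)
        for j in range(i + 1, len(clusters)):
            if used[j]:
                continue
            wj, dj = clusters[j]
            if tanimoto(docs, dj) > threshold or dj <= docs or docs <= dj:
                words += wj          # the new cluster keeps both word sets
                docs |= dj
                used[j] = True
        merged.append((words, docs))
    return merged

out = merge_pass([(["tiger"], {1, 2}), (["lion"], {1, 2, 3}),
                  (["zoo"], {5, 6})])
print(out)  # tiger and lion merge ({1,2} is a subset of {1,2,3}); zoo stays
```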
Displaying results
In the algorithm, we keep track of the words that
are used to represent each cluster. When two clusters
merge, the new cluster acquires the words from both
clusters. Thus at the end of the algorithm, each cluster
will have a set of representative words. This set can be
used to provide a description of the cluster. Also, we
found that many clusters have very few words associated
with them, as they have either never been merged or have
been merged only two or three times. Our results
show that discarding such clusters does not affect the
results significantly. In fact, it allows the results to be
interpreted more clearly.
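The discarding step can be sketched as a simple filter; the cutoff of three representative words is an illustrative value of ours, not one stated in the paper.

```python
# Drop clusters that accumulated too few representative words during
# merging; the min_words cutoff is illustrative, not from the paper.
def displayable(clusters, min_words=3):
    return [(words, docs) for words, docs in clusters
            if len(words) >= min_words]

kept = displayable([(["tiger", "lion", "zoo"], {1, 2, 3}),
                    (["ohm"], {4, 5})])
print(kept)  # only the three-word cluster survives
```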
4. Experiments & Results:
This section describes the results of the various
experiments that we carried out. In order to evaluate the
performance of WBSC, we compared it with other
algorithms like K-Means, Fractionation and Buckshot [1].
Experimental Setup:
Our test bed consists of two separate datasets:
- Web: This contains 2000 documents downloaded
from the Web. They are from different categories
such as Food, Cricket, Astronomy, Clustering,
Genetic Algorithms, Baseball, Movies, XML,
Salsa (Dance) and Olympics. We used search
engines to help us locate the appropriate
documents.
- Newsgroups: This contains 2000 documents from
the UCI KDD archive (http://kdd.ics.uci.edu/) [8].
It has documents retrieved from 20 different
newsgroups.
All the experiments were carried out on a PC with a
733 MHz processor and 260 MB of RAM. We ran
experiments on document sets of different sizes, running
the algorithm to
get the clusters and tested the effectiveness of clustering
(the types of clusters formed), both qualitatively and
quantitatively. We also compared the execution times of
all the algorithms for document sets of different sizes.
Effectiveness of Clustering:
We ran many experiments with different numbers
of documents taken from the above-mentioned test bed.
All the algorithms were run to produce the same number
of clusters with the same input parameters. WBSC formed
clusters for each of the different categories in the
document sets, while the other algorithms (K-Means,
Fractionation and Buckshot) did not. In addition, the other
algorithms formed clusters with documents of many
categories that are not related to each other. Figure 1
shows sample output from one of the experiments with
500 documents picked from 10 of the categories of the
newsgroups data set. The categories include atheism,
graphics, x-windows, hardware, space, electronics,
Christianity, guns, baseball and hockey.
Category: Cluster keywords
Hockey: BTW, remind, NHL, hockey, Canada, Canucks, teams, players
Christian: Human, Christian, bible, background, research, errors, April, President, members, member, private, community
Baseball: Cult, MAGIC, baseball, Bob, rules, poor, teams, players, season
Hardware: Analog, digital, straight, board, chip, processor
Electronics: Planes, pilots, ride, signals, offered
Windows-x: Distribution, experience, anti, message, Windows, error, week, X Server, Microsoft, problems
Guns: America, carry, isn, weapons, science, armed
Space: Military, fold, NASA, access, backup, systems, station, shuttle
Atheism: Notion, soul, several, hell, Personally, created, draw, exists, god
Graphics: Drivers, SVGA, conversion, VGA, compression, cpu, tools
Atheism: Keith, Jon, attempt, questions
Guns: Limit, Police, Illegal, law
Electronics: Ohm, impedance, circuit, input
Hockey: Season, Lemieux, Bure, runs, wings

Figure 1: Clusters formed by WBSC on a sample from the UCI data archive
WBSC formed 14 clusters. It formed more than
one cluster for some of the categories, like Hockey and
Guns. The document set on hockey has different
documents related to the Detroit Red Wings and the
Vancouver Canucks, and our algorithm formed two
different clusters for these two sets. We ran the other
algorithms with 10 as the required number of clusters to
be formed. Many of the resulting clusters have documents
that are not at all related to each other (they have
documents from different categories).
To show the effectiveness of our algorithm, we
ran the various algorithms on our web data set, where
some documents are in multiple categories (for example,
a document related to both baseball and movies should be
in all three clusters: baseball, movies, and
baseball-movies). We cannot present the full results due
to space limitations. K-Means, Fractionation and
Buckshot formed only clusters about baseball and movies
and did not form a cluster for documents related to both
categories. However, WBSC formed a cluster for
baseball-movies and put the documents related to both
categories in that cluster. This shows the effectiveness of
our method.
We also measured the effectiveness of WBSC
quantitatively. We compared the clusters formed against
the documents in the original categories and matched the
clusters with the categories one-to-one. For each
matching, we counted the number of documents that are
common to the corresponding clusters. The matching with
the largest number of common documents is used to
measure the effectiveness. This matching can be found by
a maximum-weight bipartite matching algorithm [7]. We
return the number of documents in the matching; the
more documents that are matched, the more the clusters
resemble the original categories. For our algorithm, and
for the purpose of this comparison, we assigned each
document to the cluster with the largest similarity value.
Figure 2 shows the number of matches with the
original categories for the different algorithms on
different sets of documents.

[Figure 2: Comparison of quality of clusters produced by
different clustering algorithms. The chart plots the
number of matches (per 100) against the number of
documents (100 to 500) for WBSC, K-Means, Buckshot
and Fractionation.]
We can clearly see that WBSC outperforms the
other algorithms in effectiveness.
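The one-to-one matching score described above can be sketched as follows. The paper finds the best assignment with a maximum-weight bipartite matching algorithm [7]; the brute force over permutations below is our substitute, shown only because it is short and adequate for small examples, and the overlap matrix is illustrative.

```python
from itertools import permutations

# overlap[i][j] counts documents shared by cluster i and original
# category j; the score is the best one-to-one cluster-to-category
# assignment. Brute force over permutations stands in for the
# maximum-weight bipartite matching algorithm used in the paper.
def matching_score(overlap):
    n = len(overlap)
    return max(sum(overlap[i][p[i]] for i in range(n))
               for p in permutations(range(n)))

overlap = [[40, 5],
           [10, 35]]
print(matching_score(overlap))  # identity assignment: 40 + 35 = 75
```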
Execution Time:
We also measured the execution time of the
various algorithms. Figure 3 gives the comparison of
execution times. As the graph shows, WBSC outperforms
almost all the other algorithms in execution time,
especially as the number of documents increases.

[Figure 3: Execution times of the various clustering
algorithms. The chart plots execution time in minutes
(0 to 140) against the number of documents (200 to 1400)
for WBSC, K-Means, Buckshot and Fractionation.]

Web Search Results:
We also tested WBSC on the results from Web
search engines. We downloaded the documents returned
by the Google search engine (www.google.com) and
applied WBSC to them. Space limitations prohibit us
from showing all the results; here we show some of the
clusters found by WBSC when clustering the top 100
URLs returned for the search term "cardinal". The
categories correspond to the common usages of the word
"cardinal" in documents on the Web (the name of
baseball teams, a nickname for schools, a kind of bird,
and Catholic clergy). Figure 4 shows some of the clusters
formed by WBSC.

Related topic: Cluster keywords and sample documents
Roman Catholic Church cardinals: Bishops, Catholic, world, Roman, Church, cardinals, College, Holy. Sample documents: 1. Cardinals in the Catholic Church (Catholicpages.com); 2. The Cardinals of the Holy Roman Church (library)
Cardinal (bird): Dark, Bird, female, color. Sample documents: 1. Cardinal Page; 2. First Internet Bird Club
St. Louis Cardinals (MLB team): Benes, rookie, players, hit, innings, runs, Garrett, teams, league, Saturday. Sample documents: 1. St. Louis Cardinals Notes; 2. National League Baseball - Brewers vs. Cardinals

Figure 4: Clusters formed for the search term "cardinals" by WBSC

This shows that our algorithm is effective even on
Web search results.
5. Conclusions
In this paper, we presented WBSC, a word-based
document-clustering algorithm. We have shown that it is
a promising means for clustering documents as well as
for presenting the clusters in a meaningful way to the
user. We have compared WBSC with other existing
algorithms and have shown that it performs well, both in
terms of running time and the quality of the clusters
formed.
Future work includes tuning the parameters of
the algorithm, as well as adopting new similarity
measures and data structures to speed up the algorithm.
6. References:
[1] Douglass R. Cutting, David R. Karger, Jan O.
Pedersen, John W. Tukey, Scatter/Gather: A Clusterbased Approach to Browsing Large Document
Collections, In Proceedings of the Fifteenth Annual
International ACM SIGIR Conference, pp 318-329, June
1992.
[2] P. M. Dean (Ed.), Molecular Similarity in Drug
Design, Blackie Academic & Professional, 1995, pp
111-137.
[3] D. R. Hill, A vector clustering technique, in:
Samuelson (Ed.), Mechanized Information Storage,
Retrieval and Dissemination, North-Holland, Amsterdam,
1968.
[4] A.K. Jain, M.N. Murty and P.J. Flynn, Data
Clustering: A Review, ACM Computing Surveys. 31(3):
264-323, Sept 1999.
[5] F. Murtagh, A Survey of Recent Advances in
Hierarchical Clustering Algorithms, The Computer
Journal, 26(4): 354-359, 1983.
[6] J. J. Rocchio, Document retrieval systems –
optimization and evaluation, Ph.D. Thesis, Harvard
University, 1966.
[7] Robert E. Tarjan, Data Structures and
Network Algorithms, Society for Industrial and Applied
Mathematics, 1983.
[8] http://kdd.ics.uci.edu/databases/20newsgroups/20newsgroups.html,
last visited November 18th, 2000.
[9] P. Willett, Recent trends in hierarchical
document clustering: a critical review, Information
Processing and Management, 24: 577-597, 1988.
[10] O. Zamir, O. Etzioni, Web document
clustering: a feasibility demonstration, in Proceedings of
the 19th International ACM SIGIR Conference on
Research and Development in Information Retrieval
(SIGIR '98), 1998, pp 46-54.