A WORD-BASED SOFT CLUSTERING ALGORITHM FOR DOCUMENTS
King-Ip Lin, Ravikumar Kondadadi
Department of Mathematical Sciences,
The University of Memphis,
Memphis, TN 38152, USA.
{linki, kondadir}@msci.memphis.edu
Abstract:
Document clustering is an important tool for
applications such as Web search engines. It enables the
user to have a good overall view of the information
contained in the documents. However, existing algorithms
have various drawbacks; hard clustering algorithms
(where each document belongs to exactly one cluster)
cannot detect the multiple themes of a document, while
soft clustering algorithms (where each document can
belong to multiple clusters) are usually inefficient. We
propose WBSC (Word-based Soft Clustering), an
efficient soft clustering algorithm based on a given
similarity measure. WBSC uses a hierarchical approach to
cluster documents having similar words. WBSC is very
effective and efficient when compared with existing hard
clustering algorithms like K-means and its variants.
Keywords:
Document clustering, Word based clustering,
soft clustering
1. Introduction
Document clustering is a very useful tool in today’s
world where a lot of documents are stored and retrieved
electronically. It enables one to discover hidden similarity
and key concepts. Moreover, it enables one to summarize
a large number of documents using key or common
attributes of the clusters. Thus clustering can be used to
categorize document databases and digital libraries, as
well as to provide useful summary information about the
categories for browsing purposes. For instance, a typical
search on the World Wide Web can return thousands of
documents. Automatic clustering of the documents
enables the user to have a clear and easy grasp of what
kind of documents are retrieved, providing tremendous
help for him/her to locate the right information.
Many clustering techniques have been developed and
they can be applied to clustering documents. [3, 6]
contain examples of using such techniques. Most of these
traditional approaches use documents as the basis of
clustering. In this paper, we take an alternative approach,
word-based soft clustering (WBSC). Instead of
comparing documents and clustering them directly, we
cluster the words used in such documents. For each word,
we form a cluster containing documents that have that
word. After that, we combine clusters that contain similar
sets of documents, by an agglomerative approach [5]. The
resulting clusters will contain documents that share a
similar set of words.
We believe this approach has many advantages.
Firstly, it allows a document to reside in multiple clusters.
This allows one to capture documents that contain
multiple topics. Secondly, it lends itself to a natural
representation of the clusters, namely the words
associated with the clusters themselves. Also, the running
time is linear in the number of documents. Even
though the running time grows quadratically with the
number of clusters, this number is bounded by the number
of words, which has a rather fixed upper bound (the
number of words in the language). This allows the
algorithm to scale well.
The rest of the paper is organized as follows: Section
2 summarizes related work in the field. Section 3
describes our algorithm in more detail. Section 4 provides
some preliminary experimental results. Section 5 outlines
future direction of this work.
2. Related work
Clustering algorithms have been developed and used
in many fields. [4] provides an extensive survey of
various clustering techniques. In this section, we highlight
work done on document clustering.
Many clustering techniques have been applied to
clustering documents. For instance, Willett [9] provided a
survey on applying hierarchical clustering algorithms to
clustering documents. Cutting et al. [1] adapted various
partition-based clustering algorithms to clustering
documents. Two of the techniques are Buckshot and
Fractionation. Buckshot selects a small sample of
documents to pre-cluster them using a standard clustering
algorithm and assigns the rest of the documents to the
clusters formed. Fractionation splits the N documents into
‘m’ buckets where each bucket contains N/m documents.
Fractionation takes an input parameter , which indicates
the reduction factor for each bucket. The standard
clustering algorithm is applied so that if there are ‘n’
documents in each bucket, they are clustered into n/
clusters. Now each of these clusters is treated as if they
were individual documents and the whole process is
repeated until there are only ‘K’ clusters.
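As a toy illustration (not code from Cutting et al.), the sketch below counts how many rounds of Fractionation's repeated bucket-and-cluster step are needed to reach K clusters, assuming each round shrinks the number of items by roughly the reduction factor ρ; the integer-division rounding is our assumption.

```python
# Illustrative only: each Fractionation round reduces the current item
# count by roughly the factor rho, until only k clusters remain.
def fractionation_rounds(n_docs, rho, k):
    items, rounds = n_docs, 0
    while items > k:
        # a bucket of n items is clustered into about n/rho clusters
        items = max(k, items // rho)
        rounds += 1
    return rounds

print(fractionation_rounds(1000, 10, 10))  # 1000 -> 100 -> 10: 2 rounds
```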
A relatively new technique was proposed by Zamir
and Etzioni [10]. They introduced the notion of
phrase-based document clustering: they use a generalized
suffix tree to obtain information about the phrases and use
them to cluster the documents.
3. WBSC
As the name suggests, WBSC uses a word-based
approach to build clusters. It first forms initial clusters of
the documents, with each cluster representing a single
word. For instance, WBSC forms a cluster for the word
‘tiger’ made up of all the documents that contain the word
‘tiger’. After that, WBSC merges similar clusters –
clusters are similar if they contain a similar set of
documents – using a hierarchical based approach until
some stopping criterion is reached. At the end, the
clusters are displayed based on the words associated with
them.
Cluster Initialization:
We represent each cluster as a vector of
documents. For each word, we create a bit vector (called
term vector), with each dimension representing whether a
certain document is present or absent (denoted by 1 and 0,
respectively). This step requires only one pass through the
set of the documents. However, we do maintain some sort
of index (e.g. a hash table) on the list of words in the
document to speed up the building process.
Notice that the number of clusters is independent
of the number of documents. However, it grows with the
total number of words in all the documents. Thus, we do
not consider stop-words -- words that either carry no
meaning (like prepositions) or very common words. The
second idea can be extended: words that appear in too
many documents are not going to help clustering.
Similarly, if a word appears in only one document, its
cluster is unlikely to merge with other clusters and
remains useless for clustering. Thus, we remove all initial
clusters that contain only one document or more than half
of the total number of documents.
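A minimal sketch of this initialization step, using Python sets of document ids in place of the bit vectors described above (an implementation choice of ours, made for readability); the stop-word list is a placeholder.

```python
from collections import defaultdict

STOP_WORDS = {"the", "a", "of", "in", "to"}  # placeholder stop-word list

def initial_clusters(docs):
    """docs: dict mapping doc_id -> iterable of words."""
    term_vector = defaultdict(set)         # word -> set of doc ids
    for doc_id, words in docs.items():     # one pass over the documents
        for w in set(words):
            if w.lower() not in STOP_WORDS:
                term_vector[w.lower()].add(doc_id)
    # prune words appearing in only one document or in more than half
    # of all documents, as neither helps clustering
    half = len(docs) / 2
    return {w: d for w, d in term_vector.items() if 1 < len(d) <= half}

clusters = initial_clusters({
    1: ["tiger", "jungle"], 2: ["tiger", "zoo"],
    3: ["zoo", "cage"], 4: ["the", "cat"]})
print(sorted(clusters))  # ['tiger', 'zoo']
```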
Cluster Building:
Once the initial clusters are formed, we apply a
hierarchical-based algorithm to combine the clusters.
Here we need a measure of similarity between any pair of
clusters. Let us assume we have two clusters A and B,
with n1 and n2 documents respectively (without loss of
generality, assume n1 <= n2), and with c documents in
common. Intuitively, we would like to combine the two
clusters if they have many common documents relative to
the sizes of the clusters. We first tried c / min(n1, n2) as
our similarity measure. However, this measure does not
perform well. After trying different measures, we settled
on the Tanimoto measure [2]:

    c / (n1 + n2 - c).
WBSC iterates over the clusters, using this similarity
measure to decide which to merge: every pair of clusters
is considered, and a pair is merged if its similarity
exceeds a pre-defined threshold. Also, there are cases
where one cluster is simply a subset of the other; we
merge them in that case as well. Currently we use 0.33
as our threshold. We are experimenting with different
ways of finding a better threshold.
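A sketch of one merging pass under these rules (pairwise Tanimoto above the 0.33 threshold, or one document set contained in the other); the exact iteration and termination details are ours, as the paper does not fully specify them. Clusters are (word_list, doc_set) pairs so that merged clusters keep the words of both, as described below.

```python
def tanimoto(a, b):
    c = len(a & b)
    return c / (len(a) + len(b) - c)

# One merging pass, a sketch: clusters is a list of (word_list, doc_set)
# pairs; a pair is merged when its Tanimoto similarity exceeds the
# threshold or when one document set is a subset of the other.
def merge_pass(clusters, threshold=0.33):
    merged, used = [], [False] * len(clusters)
    for i, (wi, di) in enumerate(clusters):
        if used[i]:
            continue
        words, docs = list(wi), set(di)
        for j in range(i + 1, len(clusters)):
            if used[j]:
                continue
            wj, dj = clusters[j]
            if tanimoto(docs, dj) > threshold or dj <= docs or docs <= dj:
                words += wj          # the new cluster keeps both word sets
                docs |= dj
                used[j] = True
        merged.append((words, docs))
    return merged

out = merge_pass([(["tiger"], {1, 2}), (["lion"], {1, 2, 3}),
                  (["zoo"], {5, 6})])
print(out)  # tiger and lion merge ({1,2} is a subset of {1,2,3}); zoo stays
```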
Displaying results
In the algorithm, we keep track of the words that
are used to represent each cluster. When two clusters
merge, the new cluster acquires the words from both
clusters. Thus at the end of the algorithm, each cluster
will have a set of representative words. This set can be
used to provide a description of the cluster. Also, we
found that many clusters have very few words associated
with them, as they have either never been merged or have
been merged only two or three times. Our results
show that discarding such clusters does not affect the
results significantly. In fact, it allows the results to be
interpreted more clearly.
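The discarding step can be sketched as a simple filter; the cutoff of three representative words is an illustrative value of ours, not one stated in the paper.

```python
# Drop clusters that accumulated too few representative words during
# merging; the min_words cutoff is illustrative, not from the paper.
def displayable(clusters, min_words=3):
    return [(words, docs) for words, docs in clusters
            if len(words) >= min_words]

kept = displayable([(["tiger", "lion", "zoo"], {1, 2, 3}),
                    (["ohm"], {4, 5})])
print(kept)  # only the three-word cluster survives
```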
4. Experiments & Results:
This section describes the results of the various
experiments that we carried out. In order to evaluate the
performance of WBSC, we compared it with other
algorithms like K-Means, Fractionation and Buckshot [1].
Experimental Setup:
Our test bed consists of two separate datasets:
- Web: This contains 2000 documents downloaded
from the Web. They are from different categories
such as Food, Cricket, Astronomy, Clustering,
Genetic Algorithms, Baseball, Movies, XML,
Salsa (Dance) and Olympics. We used search
engines to help us locate the appropriate
documents.
- Newsgroups: This contains 2000 documents from
the UCI KDD archive (http://kdd.ics.uci.edu/) [8].
It has documents retrieved from 20 different
newsgroups.
All the experiments were carried out on a PC with a
733 MHz processor and 260 MB of RAM. We ran
experiments on document sets of different sizes, running
the algorithm to
get the clusters and tested the effectiveness of clustering
(the types of clusters formed), both qualitatively and
quantitatively. We also compared the execution times of
all the algorithms for document sets of different sizes.
Effectiveness of Clustering:
We ran many experiments with different numbers
of documents taken from the above-mentioned test bed.
All the algorithms were run to produce the same number
of clusters with the same input parameters. WBSC formed
clusters for each of the different categories in the
document sets, while the other algorithms (K-Means,
Fractionation and Buckshot) did not. In addition, the other
algorithms formed clusters with documents of many
categories that are not related to each other. Figure 1
shows sample output from one of the experiments with
500 documents picked from 10 of the categories of the
newsgroups data set. The categories include atheism,
graphics, x-windows, hardware, space, electronics,
Christianity, guns, baseball and hockey.
Category: Cluster keywords
Hockey: BTW, remind, NHL, hockey, Canada, Canucks, teams, players
Christian: Human, Christian, bible, background, research, errors, April, President, members, member, private, community
Baseball: Cult, MAGIC, baseball, Bob, rules, poor, teams, players, season
Hardware: Analog, digital, straight, board, chip, processor
Electronics: Planes, pilots, ride, signals, offered
Windows-x: Distribution, experience, anti, message, Windows, error, week, X Server, Microsoft, problems
Guns: America, carry, isn, weapons, science, armed
Space: Military, fold, NASA, access, backup, systems, station, shuttle
Atheism: Notion, soul, several, hell, Personally, created, draw, exists, god
Graphics: Drivers, SVGA, conversion, VGA, compression, cpu, tools
Atheism: Keith, Jon, attempt, questions
Guns: Limit, Police, Illegal, law
Electronics: Ohm, impedance, circuit, input
Hockey: Season, Lemieux, Bure, runs, wings

Figure 1: Clusters formed by WBSC on a sample from the UCI data archive
WBSC formed 14 clusters. It formed more than
one cluster for some of the categories, like Hockey and
Guns. The document set on hockey has different
documents related to the Detroit Red Wings and the
Vancouver Canucks, and our algorithm formed two
different clusters for these two sets. We ran the other
algorithms with 10 as the required number of clusters to
be formed. Many of the resulting clusters have documents
that are not at all related to each other (they have
documents from different categories).
To show the effectiveness of our algorithm, we
ran the various algorithms on our web data set, where
some documents are in multiple categories (for example,
a document related to both baseball and movies should be
in all three clusters: baseball, movies, and
baseball-movies). We cannot present the full results due
to space limitations. K-Means, Fractionation and
Buckshot formed only clusters about baseball and movies
and did not form a cluster for documents related to both
categories. However, WBSC formed a cluster for
baseball-movies and put the documents related to both
categories in that cluster. This shows the effectiveness of
our method.
We also measured the effectiveness of WBSC
quantitatively. We compared the clusters formed against
the documents in the original categories and matched the
clusters with the categories one-to-one. For each
matching, we counted the number of documents that are
common to the corresponding clusters. The matching with
the largest number of common documents is used to
measure the effectiveness. This matching can be found by
a maximum-weight bipartite matching algorithm [7]. We
return the number of documents in the matching; the
more documents that are matched, the more the clusters
resemble the original categories. For our algorithm, and
for the purpose of this comparison, we assigned each
document to the cluster with the largest similarity value.
Figure 2 shows the number of matches with the
original categories for the different algorithms on
different sets of documents.

[Figure 2: Comparison of quality of clusters produced by
different clustering algorithms. The chart plots the
number of matches (per 100) against the number of
documents (100 to 500) for WBSC, K-Means, Buckshot
and Fractionation.]
We can clearly see that WBSC outperforms the
other algorithms in effectiveness.
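The one-to-one matching score described above can be sketched as follows. The paper finds the best assignment with a maximum-weight bipartite matching algorithm [7]; the brute force over permutations below is our substitute, shown only because it is short and adequate for small examples, and the overlap matrix is illustrative.

```python
from itertools import permutations

# overlap[i][j] counts documents shared by cluster i and original
# category j; the score is the best one-to-one cluster-to-category
# assignment. Brute force over permutations stands in for the
# maximum-weight bipartite matching algorithm used in the paper.
def matching_score(overlap):
    n = len(overlap)
    return max(sum(overlap[i][p[i]] for i in range(n))
               for p in permutations(range(n)))

overlap = [[40, 5],
           [10, 35]]
print(matching_score(overlap))  # identity assignment: 40 + 35 = 75
```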
Execution Time:
We also measured the execution time of the
various algorithms. Figure 3 gives the comparison of
execution times. As the graph shows, WBSC outperforms
almost all the other algorithms in execution time,
especially as the number of documents increases.

[Figure 3: Execution times of the various clustering
algorithms. The chart plots execution time in minutes
(0 to 140) against the number of documents (200 to 1400)
for WBSC, K-Means, Buckshot and Fractionation.]

Web Search Results:
We also tested WBSC on the results from Web
search engines. We downloaded the documents returned
by the Google search engine (www.google.com) and
applied WBSC to them. Space limitations prohibit us
from showing all the results; here we show some of the
clusters found by WBSC when clustering the top 100
URLs returned for the search term "cardinal". The
categories correspond to the common usages of the word
"cardinal" in documents on the Web (the name of
baseball teams, a nickname for schools, a kind of bird,
and Catholic clergy). Figure 4 shows some of the clusters
formed by WBSC.

Related topic: Cluster keywords and sample documents
Roman Catholic Church cardinals: Bishops, Catholic, world, Roman, Church, cardinals, College, Holy. Sample documents: 1. Cardinals in the Catholic Church (Catholicpages.com); 2. The Cardinals of the Holy Roman Church (library)
Cardinal (bird): Dark, Bird, female, color. Sample documents: 1. Cardinal Page; 2. First Internet Bird Club
St. Louis Cardinals (MLB team): Benes, rookie, players, hit, innings, runs, Garrett, teams, league, Saturday. Sample documents: 1. St. Louis Cardinals Notes; 2. National League Baseball - Brewers vs. Cardinals

Figure 4: Clusters formed for the search term "cardinals" by WBSC

This shows that our algorithm is effective even on
Web search results.
5. Conclusions
In this paper, we presented WBSC, a word-based
document-clustering algorithm. We have shown that it is
a promising means for clustering documents as well as
for presenting the clusters in a meaningful way to the
user. We have compared WBSC with other existing
algorithms and have shown that it performs well, both in
terms of running time and the quality of the clusters
formed.
Future work includes tuning the parameters of
the algorithm, as well as adopting new similarity
measures and data structures to speed up the algorithm.
6. References:
[1] Douglass R. Cutting, David R. Karger, Jan O.
Pedersen, John W. Tukey, Scatter/Gather: A Clusterbased Approach to Browsing Large Document
Collections, In Proceedings of the Fifteenth Annual
International ACM SIGIR Conference, pp 318-329, June
1992.
[2] P. M. Dean (Ed.), Molecular Similarity in Drug
Design, Blackie Academic & Professional, 1995, pp
111-137.
[3] D. R. Hill, A vector clustering technique, in:
Samuelson (Ed.), Mechanized Information Storage,
Retrieval and Dissemination, North-Holland, Amsterdam,
1968.
[4] A.K. Jain, M.N. Murty and P.J. Flynn, Data
Clustering: A Review, ACM Computing Surveys. 31(3):
264-323, Sept 1999.
[5] F. Murtagh, A Survey of Recent Advances in
Hierarchical Clustering Algorithms, The Computer
Journal, 26(4): 354-359, 1983.
[6] J. J. Rocchio, Document retrieval systems –
optimization and evaluation, Ph.D. Thesis, Harvard
University, 1966.
[7] Robert E. Tarjan, Data Structures and
Network Algorithms, Society for Industrial and Applied
Mathematics, 1983.
[8] http://kdd.ics.uci.edu/databases/20newsgroups/20newsgroups.html,
last visited November 18th, 2000.
[9] P. Willett, Recent trends in hierarchical
document clustering: a critical review, Information
Processing and Management, 24: 577-597, 1988.
[10] O. Zamir, O. Etzioni, Web document
clustering: a feasibility demonstration, in Proceedings of
the 19th International ACM SIGIR Conference on
Research and Development in Information Retrieval
(SIGIR '98), 1998, pp 46-54.