Bring Order to Text Corpora Using Automatically Generated Concept

advertisement
Bring Order to Corpora Using Automatically Generated Concept
Hierarchies
Chulin Meng
Graduate School of Library and Information Science, University of Illinois at Urbana-Champaign, 501 E Daniel St,
Champaign, IL 61820. Email: cmeng@uiuc.edu
Abstract
It is about using an automatic generated top-down treelike concept hierarchy to improve the presentation of
document collections, which may be search results of the
web, a community information repository, or a large-scale
digital library.
We proposed a way to build concept hierarchy based on a
modified document clustering algorithm and term cooccurrence analysis. We use singular value decomposition
technique to select out the informative descriptors before
we actually forming the clusters. We also present the
preliminary results of the evaluation of the usefulness of
the derived concept hierarchies.
Introduction
As the size of information collections grows, it is always
a frustrating experience to find relevant information. Most
information retrieval systems respond to user query with a
long ranked list of documents with useful information
buried in the irrelevant documents. We propose a way to
organize documents into an automatically generated
concept hierarchy.
Hierarchical structures, such as library classification
systems or web directories, are common and successful
tools for organizing large amount of information. However,
these manually created systems require a lot of initial effort
to create and are difficult to maintain. People are
conducting researches on using automatically generated
subject hierarchies to organize text collections.
Document clustering techniques have long been used to
organize documents into topical categories. However,
although hierarchical clustering algorithms always generate
some kinds of hierarchies, the descriptors sometime are
hard to understand, which is due to the fact that the clusters
might be meaningless even if their internal cohesiveness
and external distinctiveness are high according to some
distance measure.
Inspired by Vivisimo’s interesting idea of “conceptual
clustering” - good clusters possess good concise
descriptions, [Vivisimo 04] we proposed a modified
document clustering algorithm to construct a concept
hierarchy with good descriptors. The basic idea here is to
find informative phrases before actually forming the
clusters. We implement our algorithm and run on search
results of web search engines.
General approach
Our purpose is to construct a browsable tree-like concept
hierarchy and use it to organize documents. In this way, we
provide user with an overview of the topics and help them
identify the specific group of documents they were looking
for. Thus it is very important to have an understandable
overall
structures
and
informative
descriptors.
Unfortunately, the problem of the quality of cluster
descriptions seems to have been neglected in previous
proposed algorithms.
The majority of currently used text clustering algorithms
form the clusters first, then based on the content in the
clusters to determine labels. This may result in some
groups' descriptions being meaningless to the users. To
avoid such problems, we adopt a different approach to find
and describe clusters. When designing the clustering
algorithm, more attention is paid to ensuring that both
contents and descriptors of the resulting groups are
meaningful to the users.
The general idea here is to find meaningful descriptions
of clusters first, and then, determine the content of each
cluster based on the descriptors. The crucial point of this
approach is the careful selection of cluster labels. We use
the Latent Semantic Indexing (LSI) technique to ensure
that the labels both differ from each other and at the same
time cover most of the topics present in the input
collection. Past researches [Bellegarda et al., 00] [Berry et
al., 99] [Landauer et al., 98] on LSI show that it could
identify the abstract concepts that represent the documents.
In LSI, text documents and the words comprising these
documents are analyzed using singular value
decomposition. LSI can identify the abstract concepts
based on the co-occurrence of words in the documents. We
believe that these abstract concepts are good candidates of
cluster descriptors.
After the group descriptors are identified, we use a vector
space model to determine the cluster content. Then, we use
co-occurrence analysis technique to determine the
generality and specificity of the group descriptors and form
the concept hierarchy. As proved by Sanderson and Croft,
this method is simple but effective. [Sanderson et al., 99]
This way, the hierarchical structure not only relies on the
actual content of groups, but also on the semantic
relationships between the descriptors.
T. Landauer, P. Foltx, and D. Laham, (1998) “Introduction to
latent semantic analysis,” Discourse Processes, vol. 25, pp.
259–284.
System implementation
Vivisimo, (2004). http://vivisimo.com/
We decide to run our system on web search results, since
the search results are more likely to form good clusters.
The first step is to remove the HTML codes and nonletter characters from the snippets returned by search
engine. Stemming and stop-word list are also applied to the
snippets to get rid of the possible noise.
The second step is to use suffix arrays to discover those
terms and phrases that exceed the term frequency threshold
to be candidate of group descriptors. The Singular Value
Decomposition is employed to obtain abstract concepts for
cluster labels.
In the third step, a vector space model is used to assign
the input snippets to the cluster labels induced in the
previous phase. The input snippets are matched to every
descriptor and the snippet could be in multiple groups.
Finally, the selected descriptors are organized
hierarchically using co-occurrence analysis techniques and
presented to users in the form of a hierarchical menu.
Discussions and future work
We did a small-scale user study about the usefulness of
the concept hierarchy generated by our system. The study
shows that our approach is capable of generating
meaningful and reasonably described hierarchical clusters.
There is more room for possible improvement. In next
phase, we plan to conduct further assessments, which will
perform the comparison to the ranked list by search engine
and clustering interface by other clustering methods.
And for the algorithm itself, we will also explore if some
techniques, such as lexical reference system or statistical
model of language, could help to make the concept
hierarchies better.
REFERENCES
J. Bellegarda, (2000) “Exploiting latent semantic information in
statistical language modeling,” Proc. IEEE, vol. 88, no. 8, pp.
1279–1296.
Michael Berry, Zlatko Drmavc, and Elizabeth R. Jessup, (1999)
"Matrices, vector spaces, and information retrieval", SIAM
Review, 41(2):335-362.
Sanderson and Croft. (1999) "Deriving Concept Hierarchies from
Text". In: Proceedings of the 22nd Annual International ACM
SIGIR Conference on Research and Development in
Information Retrieval, 206-213, Berkeley, CA, ACM.
Download