Bring Order to Corpora Using Automatically Generated Concept Hierarchies Chulin Meng Graduate School of Library and Information Science, University of Illinois at Urbana-Champaign, 501 E Daniel St, Champaign, IL 61820. Email: cmeng@uiuc.edu Abstract It is about using an automatic generated top-down treelike concept hierarchy to improve the presentation of document collections, which may be search results of the web, a community information repository, or a large-scale digital library. We proposed a way to build concept hierarchy based on a modified document clustering algorithm and term cooccurrence analysis. We use singular value decomposition technique to select out the informative descriptors before we actually forming the clusters. We also present the preliminary results of the evaluation of the usefulness of the derived concept hierarchies. Introduction As the size of information collections grows, it is always a frustrating experience to find relevant information. Most information retrieval systems respond to user query with a long ranked list of documents with useful information buried in the irrelevant documents. We propose a way to organize documents into an automatically generated concept hierarchy. Hierarchical structures, such as library classification systems or web directories, are common and successful tools for organizing large amount of information. However, these manually created systems require a lot of initial effort to create and are difficult to maintain. People are conducting researches on using automatically generated subject hierarchies to organize text collections. Document clustering techniques have long been used to organize documents into topical categories. However, although hierarchical clustering algorithms always generate some kinds of hierarchies, the descriptors sometime are hard to understand, which is due to the fact that the clusters might be meaningless even if their internal cohesiveness and external distinctiveness are high according to some distance measure. Inspired by Vivisimo’s interesting idea of “conceptual clustering” - good clusters possess good concise descriptions, [Vivisimo 04] we proposed a modified document clustering algorithm to construct a concept hierarchy with good descriptors. The basic idea here is to find informative phrases before actually forming the clusters. We implement our algorithm and run on search results of web search engines. General approach Our purpose is to construct a browsable tree-like concept hierarchy and use it to organize documents. In this way, we provide user with an overview of the topics and help them identify the specific group of documents they were looking for. Thus it is very important to have an understandable overall structures and informative descriptors. Unfortunately, the problem of the quality of cluster descriptions seems to have been neglected in previous proposed algorithms. The majority of currently used text clustering algorithms form the clusters first, then based on the content in the clusters to determine labels. This may result in some groups' descriptions being meaningless to the users. To avoid such problems, we adopt a different approach to find and describe clusters. When designing the clustering algorithm, more attention is paid to ensuring that both contents and descriptors of the resulting groups are meaningful to the users. The general idea here is to find meaningful descriptions of clusters first, and then, determine the content of each cluster based on the descriptors. The crucial point of this approach is the careful selection of cluster labels. We use the Latent Semantic Indexing (LSI) technique to ensure that the labels both differ from each other and at the same time cover most of the topics present in the input collection. Past researches [Bellegarda et al., 00] [Berry et al., 99] [Landauer et al., 98] on LSI show that it could identify the abstract concepts that represent the documents. In LSI, text documents and the words comprising these documents are analyzed using singular value decomposition. LSI can identify the abstract concepts based on the co-occurrence of words in the documents. We believe that these abstract concepts are good candidates of cluster descriptors. After the group descriptors are identified, we use a vector space model to determine the cluster content. Then, we use co-occurrence analysis technique to determine the generality and specificity of the group descriptors and form the concept hierarchy. As proved by Sanderson and Croft, this method is simple but effective. [Sanderson et al., 99] This way, the hierarchical structure not only relies on the actual content of groups, but also on the semantic relationships between the descriptors. T. Landauer, P. Foltx, and D. Laham, (1998) “Introduction to latent semantic analysis,” Discourse Processes, vol. 25, pp. 259–284. System implementation Vivisimo, (2004). http://vivisimo.com/ We decide to run our system on web search results, since the search results are more likely to form good clusters. The first step is to remove the HTML codes and nonletter characters from the snippets returned by search engine. Stemming and stop-word list are also applied to the snippets to get rid of the possible noise. The second step is to use suffix arrays to discover those terms and phrases that exceed the term frequency threshold to be candidate of group descriptors. The Singular Value Decomposition is employed to obtain abstract concepts for cluster labels. In the third step, a vector space model is used to assign the input snippets to the cluster labels induced in the previous phase. The input snippets are matched to every descriptor and the snippet could be in multiple groups. Finally, the selected descriptors are organized hierarchically using co-occurrence analysis techniques and presented to users in the form of a hierarchical menu. Discussions and future work We did a small-scale user study about the usefulness of the concept hierarchy generated by our system. The study shows that our approach is capable of generating meaningful and reasonably described hierarchical clusters. There is more room for possible improvement. In next phase, we plan to conduct further assessments, which will perform the comparison to the ranked list by search engine and clustering interface by other clustering methods. And for the algorithm itself, we will also explore if some techniques, such as lexical reference system or statistical model of language, could help to make the concept hierarchies better. REFERENCES J. Bellegarda, (2000) “Exploiting latent semantic information in statistical language modeling,” Proc. IEEE, vol. 88, no. 8, pp. 1279–1296. Michael Berry, Zlatko Drmavc, and Elizabeth R. Jessup, (1999) "Matrices, vector spaces, and information retrieval", SIAM Review, 41(2):335-362. Sanderson and Croft. (1999) "Deriving Concept Hierarchies from Text". In: Proceedings of the 22nd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, 206-213, Berkeley, CA, ACM.