TEXTUAL INFORMATION CLUSTERING AND VISUALIZATION FOR KNOWLEDGE DISCOVERY AND MANAGEMENT

Xavier Polanco

INTRODUCTION

We are concerned with the design and development of computer-based information analysis tools that combine cluster analysis, computational linguistics and artificial intelligence techniques. On the technology side, a computer-based information analysis system is an integrated environment that assists a user in carrying out the complex process of converting information from textual data sources into knowledge.

TEXT MINING

Text mining consists of extracting information from hidden patterns in large textual collections. A very large amount of information is available in textual form in databases and online information sources. At this scale, manual analysis and effective extraction of useful information are not possible, so automatic tools for analyzing large textual collections are needed. The results can be important both for the analysis of the collection and for providing intelligent navigation and browsing methods (Feldman et al., 1998; Landau et al., 1998). The text mining process can be organized roughly into five major steps: [1] Data Selection, [2] Term Extraction and Filtering, [3] Data Clustering, [4] Cluster Mapping or Visualization, [5] Result Interpretation (Polanco and François, 2000b).

CLUSTERING

The aim of our activity is to analyze information by computer, using cluster analysis together with cartography (or mapping) algorithms that represent the generated clusters in the form of maps. We have applied this approach to the domain of scientific and technical information, i.e. publications and patents stored in databases (Polanco et al., 1995; 1998a; 1998b). The analysis of the textual information is divided into two phases.
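
As an illustration of steps [2] to [4] of the text mining process listed above, here is a minimal sketch in Python. It is not the system described in this paper: the toy collection, the stopword list, the function names and the choice of k-means with cosine similarity are all assumptions made for the example. The "map" is reduced to a table of pairwise similarities between cluster centroids, which is the information a two-dimensional layout would try to preserve.

```python
import math
import re
from collections import Counter

# Hypothetical toy collection standing in for step [1], Data Selection.
DOCS = [
    "neural network clustering of patents",
    "clustering patents with a neural network",
    "map visualization for knowledge discovery",
    "knowledge discovery through map visualization",
]
STOPWORDS = {"of", "with", "a", "for", "through"}  # assumed, illustrative only


def extract_terms(docs, min_df=2):
    """Step [2]: tokenize, drop stopwords, keep terms found in >= min_df documents."""
    tokenized = [
        [w for w in re.findall(r"[a-z]+", d.lower()) if w not in STOPWORDS]
        for d in docs
    ]
    doc_freq = Counter(t for toks in tokenized for t in set(toks))
    vocab = sorted(t for t, n in doc_freq.items() if n >= min_df)
    vectors = [[toks.count(t) for t in vocab] for toks in tokenized]
    return vocab, vectors


def cosine(u, v):
    """Cosine similarity between two term vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def kmeans(vectors, seeds, iterations=10):
    """Step [3]: unsupervised clustering; seeds are indices of the initial centroids."""
    centroids = [list(vectors[i]) for i in seeds]
    labels = [0] * len(vectors)
    for _ in range(iterations):
        # Assign each document to its most similar centroid.
        labels = [
            max(range(len(centroids)), key=lambda k: cosine(v, centroids[k]))
            for v in vectors
        ]
        # Recompute each centroid as the mean of its member vectors.
        for k in range(len(centroids)):
            members = [v for v, lab in zip(vectors, labels) if lab == k]
            if members:
                centroids[k] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids


def cluster_map(centroids):
    """Step [4], stand-in: pairwise centroid similarities instead of a drawn map."""
    return [[round(cosine(a, b), 3) for b in centroids] for a in centroids]


vocab, vectors = extract_terms(DOCS)
labels, centroids = kmeans(vectors, seeds=(0, 2))
print(labels)
print(cluster_map(centroids))
```

On this toy collection the two patent documents fall into one cluster and the two visualization documents into the other. Step [5], Result Interpretation, is left to the analyst, who would read the clusters and their relative positions.
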
The first phase generates the clusters using clustering procedures in which learning is unsupervised (the user does not define classes), while the second positions the clusters on a global map in order to display the topical organization of knowledge. Both phases are data driven. A hypertext interface generator provides the user with a user-friendly interface that displays the global map, the topics (clusters) and the document set, and thereby gives access to useful information organized by topic.

Artificial neural networks (ANNs) are a useful class of models consisting of layers of nodes. Our interest in ANNs is based on the links that exist between data analysis and ANN approaches in the areas of clustering and mapping (Kohonen, 1997; Polanco et al., 1998a; 1998b; 2000a; 2000b).

INFORMATION VISUALIZATION

Information visualization means "using vision to think" (Card et al., 1999). We are concerned with cartography algorithms that represent the clusters in the form of maps. The maps studied in Polanco et al. (1998b) are not only a means of visualization. They are also an analysis tool insofar as they allow users to evaluate the relative position of clusters (or topics) in the multidimensional space of representation. As we have observed, we must deal with the problem of the readability of such maps.

The maps are "visualization-based analysis tools." In the context of data mining and knowledge discovery in databases, Brachman and Anand (1996) have noted that "The visualization produced is by itself a model, and the user can examine the visualization to determine its explanatory power (...) Appropriate display of data points and their relationships can give the analyst insight that is virtually impossible to get from looking at tables of output or simple summary statistics.
In fact, for some tasks, appropriate visualization is the only thing needed to solve a problem or confirm a hypothesis, even though we do not usually think of picture-drawing as a kind of analysis."

KNOWLEDGE DISCOVERY

As a framework for what knowledge discovery in databases (KDD) means, we summarize here the view of Brachman and Anand (1996). They invite us to look at KDD as a human-centered process. A KDD system is a technical means of supporting the discovery of knowledge by a user. In a given context, the output of the knowledge discovery process would more typically be the specification for a knowledge discovery application. The goal of designing KDD as a process is to help us better understand how to do knowledge discovery, and how to support human analysts to best advantage. Without human analysts, KDD is unthinking. It is crucial to emphasize the key role played by humans in knowledge discovery.

It is important to understand who the user is and what tasks the user performs. We assume that our user is not a business end-user but the "analyst," so it is the analyst's needs and tasks that determine our focus. The analyst "analyzes" the data using data analysis and visualization tools. This analysis leads the analyst to some sort of "insight" about the data. The analyst then uses presentation tools to disseminate this insight to a broader audience, that is, the parties that generated the original goal of the analysis.

KNOWLEDGE MANAGEMENT

We would add to the information analysis a formalized operator for processing the knowledge produced by experts when they analyze the clusters. Knowledge management presupposes that it is implemented by a system. The system must be able to process the results of the knowledge organization, allowing not only exploration and visualization but also operations on this knowledge.
The system must be able to manage at least three types of data that we wish to combine: clusters, classes and the bibliographic or textual data (which may themselves be of different types). The idea is therefore to model not only the bibliographic data and the clusters obtained from this data, as is currently done, but also the classes of knowledge obtained from the experts' analysis of the clusters.

Generally speaking, a knowledge management system (KMS) is concerned with the identification, acquisition, development, diffusion, use, and preservation of the enterprise’s knowledge. Without going into details, we can accept for our purpose the following general concept: "Knowledge management is the formal management of knowledge for facilitating creation, access, and reuse of knowledge, typically using advanced technology" (O'Leary, 1998a; 1998c). Our aim is to capitalize on the expert knowledge produced in a science and technology watch analysis, as a way of reusing this knowledge in new actions concerning the same domain. In our opinion, science and technology watch and knowledge management will reach their full advantage only if these tasks are fully integrated.

A more operational definition of knowledge management is in terms of converting and connecting: "Knowledge management is a process of converting knowledge from the sources accessible to an organization and connecting people with that knowledge" (O’Leary, 1998b). The functions of a KMS are then converting knowledge from textual data sources and connecting people with that knowledge.

Science and technology watch in the broadest sense can be considered as the observation and follow-up of scientific and technological changes in order to alert decision-makers to the consequences of scientific and technological issues and trends.

REFERENCES

Brachman R. J., Anand T., 1996, “The Process of Knowledge Discovery in Databases.” In U. M. Fayyad, G. Piatetsky-Shapiro, P. Smyth, R.
Uthurusamy (eds.), Advances in Knowledge Discovery and Data Mining, Menlo Park, Calif., AAAI Press / The MIT Press, pp. 37-57.

Card S. K., Mackinlay J. D., Shneiderman B. (eds.), 1999, Readings in Information Visualization. San Francisco, Calif., Morgan Kaufmann Publishers Inc.

Feldman R., Aumann Y., Zilberstein A., Ben-Yehuda Y., 1998, “Trend Graphs: Visualizing the Evolution of Concept Relationships in Large Document Collections.” In J.M. Zytkow and M. Quafafou (eds.), Principles of Data Mining and Knowledge Discovery. Berlin, Springer Verlag, pp. 38-46.

Kohonen T., 1997, Self-Organizing Maps. Berlin, Springer.

Landau D., Feldman R., Aumann Y., Fresko M., Lindell Y., Lipshtat O., Zamir O., 1998, “TextVis: An Integrated Visual Environment for Text Mining.” In J.M. Zytkow and M. Quafafou (eds.), Principles of Data Mining and Knowledge Discovery. Berlin, Springer Verlag, pp. 56-64.

O'Leary D. E., 1998a, “Knowledge-Management Systems: Converting and Connecting.” IEEE Intelligent Systems, vol. 13, no. 3, pp. 30-33.

O'Leary D. E., 1998b, “Using AI in Knowledge Management: Knowledge Bases and Ontologies.” IEEE Intelligent Systems, vol. 13, no. 3, pp. 34-39.

O'Leary D. E., 1998c, “Enterprise Knowledge Management.” Computer, vol. 31, no. 3, pp. 54-61.

Polanco X., Grivel L., Royauté J., 1995, “How To Do Things with Terms in Informetrics: Terminological Variation and Stabilization as Science Watch Indicators.” In Proceedings of the Fifth International Conference of the International Society for Scientometrics and Informetrics. Edited by M.E.D. Koenig and A. Bookstein. Medford, N.J., Learned Information Inc., pp. 435-444.

Polanco X., François C., Keim J-P., 1998a, “Artificial Neural Network Technology for the Classification and Cartography of Scientific and Technical Information.” Scientometrics, vol. 41, no. 1, pp. 69-82.

Polanco X., François C.,
Ould Louly A., 1998b, “For Visualization-Based Analysis Tools in Knowledge Discovery Process: A Multilayer Perceptron versus Principal Components Analysis - A Comparative Study.” In J.M. Zytkow and M. Quafafou (eds.), Principles of Data Mining and Knowledge Discovery. Second European Symposium, PKDD’98, Nantes, France, 23-26 September 1998. Lecture Notes in Artificial Intelligence 1510. Subseries of Lecture Notes in Computer Science. Berlin, Springer, pp. 28-37.

Polanco X., François C., Lamirel J. Ch., 2000a, “Using Artificial Neural Networks for Mapping Science.” In Book of Abstracts of the Sixth International Conference on Science and Technology Indicators. Leiden, The Netherlands, 24-27 May, p. 89.

Polanco X., François C., 2000b, “Data Clustering and Cluster Mapping or Visualization in Text Processing and Mining.” In Proceedings of the Sixth International Conference of the International Society of Knowledge Organization, July 10-13, 2000, in Toronto, Canada, pp. 359-365.