TEXTUAL INFORMATION CLUSTERING AND VISUALIZATION
FOR KNOWLEDGE DISCOVERY AND MANAGEMENT
Xavier Polanco
INTRODUCTION
We are concerned with the design and development of computer-based information analysis tools in
which cluster analysis, computational linguistics and artificial intelligence techniques are combined. On
the technology side, a computer-based information analysis system is an integrated environment that
assists a user in carrying out the complex process of converting information from textual data sources
into knowledge.
TEXT MINING
Text mining consists of extracting information from hidden patterns in large textual collections. A
very large amount of information is available in textual form in databases and online information sources.
In this context, manual analysis and effective extraction of useful information are not possible, so it is
necessary to provide automatic tools for analyzing large textual collections. The results can be important
both for the analysis of the collection and for providing intelligent navigation and browsing methods
(Feldman et al., 1998; Landau et al., 1998).
The text mining process can be organized roughly into five major steps: [1] Data Selection, [2] Term
Extraction and Filtering, [3] Data Clustering, [4] Cluster Mapping or Visualization, [5] Result
Interpretation (Polanco and François, 2000b).
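As an illustration of steps [1] and [2], the sketch below turns a small selection of documents into term
vectors. It is only a simplified stand-in: the stopword list, the regular-expression tokenizer, the
document-frequency threshold `min_df` and all function names are assumptions made for this example, not
the linguistic term extraction used by the authors.

```python
import re
from collections import Counter

# Hypothetical stopword list, kept tiny for the example.
STOPWORDS = {"the", "of", "and", "a", "in", "to", "for", "is", "on"}

def extract_terms(text):
    """Step 2a: lowercase the text and keep alphabetic tokens that are not stopwords."""
    return [t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOPWORDS]

def filter_terms(docs_terms, min_df=2):
    """Step 2b: keep only terms that occur in at least min_df documents."""
    df = Counter(t for terms in docs_terms for t in set(terms))
    return sorted(t for t, n in df.items() if n >= min_df)

def index_documents(docs, min_df=2):
    """Return the filtered vocabulary and one term-count vector per document."""
    docs_terms = [extract_terms(d) for d in docs]
    vocab = filter_terms(docs_terms, min_df)
    vectors = [[terms.count(t) for t in vocab] for terms in docs_terms]
    return vocab, vectors

# Step 1 (data selection) is represented here by a hand-picked list of titles.
docs = [
    "Neural networks for clustering of patents",
    "Clustering and mapping of scientific publications",
    "Mapping patents with neural networks",
]
vocab, vectors = index_documents(docs)
print(vocab)
print(vectors)
```

The resulting vectors are the kind of input that the clustering step [3] would consume.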
CLUSTERING
The aim of our activity is to analyze information by computer, using cluster analysis and cartography
(or mapping) algorithms that represent the generated clusters in the form of maps. We have applied this
approach to the domain of scientific and technical information, i.e. publications and patents stored in
databases (Polanco et al., 1995; 1998a; 1998b).
The analysis of the textual information is divided into two phases. The first involves generating
clusters using clustering procedures in which learning is unsupervised (the user does not define
classes), while the second consists of positioning the clusters on a global map in order to display the
topical organization of knowledge. These two phases are data driven. A hypertext interface generator
then provides the user with a user-friendly interface that displays the global map, the topics or clusters,
and the document set, and gives access to useful information organized by topics (clusters).
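The first, unsupervised phase can be sketched with a plain k-means procedure. K-means is used here only
as a familiar stand-in for "clustering procedures"; the source does not say which algorithm is applied,
and the function names, the toy term vectors and the deterministic initialization are assumptions of this
example.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two term vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def mean(vectors):
    """Component-wise mean of a non-empty list of vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def kmeans(vectors, k, iterations=20):
    """Unsupervised clustering: the user fixes k but supplies no class labels."""
    # Deterministic initialization (assumption): the first k documents seed the centroids.
    centroids = [list(v) for v in vectors[:k]]
    labels = [0] * len(vectors)
    for _ in range(iterations):
        # Assignment step: attach every document to its nearest centroid.
        labels = [
            min(range(k), key=lambda c: squared_distance(v, centroids[c]))
            for v in vectors
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [v for v, label in zip(vectors, labels) if label == c]
            if members:
                centroids[c] = mean(members)
    return labels, centroids

# Toy term-count vectors for four documents (invented for illustration).
vectors = [
    [3, 2, 0, 0],
    [0, 0, 2, 3],
    [2, 3, 0, 1],
    [0, 1, 3, 2],
]
labels, centroids = kmeans(vectors, k=2)
print(labels)
print(centroids)
```

The cluster centroids returned here are what the second phase would position on the global map.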
Artificial neural networks (ANNs) are a useful class of models consisting of layers of nodes. Our
interest in ANNs is based on the links that exist between data analysis and ANN approaches in the
areas of clustering and mapping (Kohonen, 1997; Polanco et al., 1998a; 1998b; 2000a; 2000b).
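One ANN model that combines clustering and mapping is the self-organizing map described by Kohonen
(1997). The following is a minimal sketch of such a map, assuming a small rectangular grid, a Gaussian
neighborhood and linearly decaying learning rate and radius; these settings, the function names and the
toy data are illustrative choices, not the configuration used in the cited studies.

```python
import math
import random

def best_matching_unit(weights, x):
    """Return the grid position whose weight vector is closest to x."""
    return min(weights, key=lambda u: sum((w - v) ** 2 for w, v in zip(weights[u], x)))

def train_som(data, rows, cols, epochs=50, lr0=0.5, seed=0):
    """Train a rows x cols self-organizing map on the input vectors."""
    rng = random.Random(seed)  # fixed seed so the run is reproducible
    dim = len(data[0])
    weights = {(r, c): [rng.random() for _ in range(dim)]
               for r in range(rows) for c in range(cols)}
    radius0 = max(rows, cols) / 2
    for epoch in range(epochs):
        decay = 1 - epoch / epochs
        lr = lr0 * decay                    # learning rate shrinks over time
        radius = max(radius0 * decay, 0.5)  # neighborhood radius shrinks too
        for x in data:
            bmu = best_matching_unit(weights, x)
            for (r, c), w in weights.items():
                grid_d2 = (r - bmu[0]) ** 2 + (c - bmu[1]) ** 2
                h = math.exp(-grid_d2 / (2 * radius ** 2))
                # Pull the unit toward x, more strongly when it is near the winner.
                for i in range(dim):
                    w[i] += lr * h * (x[i] - w[i])
    return weights

# Toy input vectors scaled to [0, 1] (invented for illustration).
data = [
    [1.0, 0.9, 0.0, 0.1],
    [0.9, 1.0, 0.1, 0.0],
    [0.0, 0.1, 1.0, 0.9],
    [0.1, 0.0, 0.9, 1.0],
]
som = train_som(data, rows=3, cols=3)
positions = [best_matching_unit(som, x) for x in data]
print(positions)
```

Each input is assigned a position on the grid, so the same structure serves both as a set of clusters
(the units) and as a map (their grid layout).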
INFORMATION VISUALIZATION
Information visualization is using vision to think (Card et al., 1999). We are concerned with
cartography algorithms that represent the clusters in the form of maps. The maps studied in Polanco et
al. (1998b) are not only a means of visualization. They are also an analysis tool insofar as they allow
users to evaluate the relative position of clusters (or topics) in the multidimensional space of
representation. As we have observed, we must deal with the problem of readability of such maps.
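To make the idea of positioning clusters on a map concrete, the sketch below projects cluster centroids
from the multidimensional term space onto two axes by principal components analysis, one of the two
projection methods named in the title of Polanco et al. (1998b). It is not the authors' implementation:
the power-iteration solver, its fixed start vector, the function names and the centroid values are
assumptions made for this example.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def center(rows):
    """Subtract the column means so that the cloud of points is centered on the origin."""
    means = [sum(col) / len(rows) for col in zip(*rows)]
    return [[x - m for x, m in zip(row, means)] for row in rows]

def covariance(centered):
    n, dim = len(centered), len(centered[0])
    return [[sum(r[i] * r[j] for r in centered) / (n - 1) for j in range(dim)]
            for i in range(dim)]

def leading_eigenvector(matrix, iterations=200):
    """Power iteration from a fixed start vector (chosen for reproducibility)."""
    v = [1.0 / (i + 1) for i in range(len(matrix))]
    norm = dot(v, v) ** 0.5
    v = [x / norm for x in v]
    for _ in range(iterations):
        w = [dot(row, v) for row in matrix]
        norm = dot(w, w) ** 0.5
        if norm == 0:
            break
        v = [x / norm for x in w]
    return v

def deflate(matrix, v):
    """Remove the component along v so the next power iteration finds the second axis."""
    eigenvalue = dot(v, [dot(row, v) for row in matrix])
    dim = len(matrix)
    return [[matrix[i][j] - eigenvalue * v[i] * v[j] for j in range(dim)]
            for i in range(dim)]

def map_clusters(centroids):
    """Return one (x, y) map coordinate per cluster centroid."""
    centered = center(centroids)
    cov = covariance(centered)
    axis1 = leading_eigenvector(cov)
    axis2 = leading_eigenvector(deflate(cov, axis1))
    return [(dot(r, axis1), dot(r, axis2)) for r in centered]

# Four cluster centroids in a four-term space (invented for illustration).
centroids = [
    [2.5, 2.5, 0.0, 0.5],
    [0.0, 0.5, 2.5, 2.5],
    [1.0, 1.0, 1.0, 1.0],
    [2.0, 0.0, 2.0, 0.0],
]
coords = map_clusters(centroids)
for point in coords:
    print(point)
```

Clusters that lie close together in the term space receive nearby map coordinates, which is what lets the
user read relative positions directly from the map.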
The maps are "visualization-based analysis tools." In the context of data mining and knowledge
discovery in databases, Brachman and Anand (1996) have noted that "The visualization produced is by
itself a model, and the user can examine the visualization to determine its explanatory power (...)
Appropriate display of data points and their relationships can give the analyst insight that is virtually
impossible to get from looking at tables of output or simple summary statistics. In fact, for some tasks,
appropriate visualization is the only thing needed to solve a problem or confirm a hypothesis, even though
we do not usually think of picture-drawing as a kind of analysis."
KNOWLEDGE DISCOVERY
As a framework for what knowledge discovery in databases (KDD) means, we summarize here the
view of Brachman and Anand (1996). They invite us to look at KDD as a human-centered process. A KDD
system is a technical means of supporting the discovery of knowledge by a user. In a given context, the
output of the knowledge discovery process would more typically be the specification for a knowledge
discovery application. The goal of describing KDD as a process is to help us better understand how to do
knowledge discovery and how best to support the human analysts. Without human analysts, KDD is
unthinkable. It is crucial to emphasize the key role played by humans in knowledge discovery.
It is important to understand who the user is and what tasks the user performs. We assume that
our user is not a business end-user but the "analyst," so it is the analyst's needs and tasks that
guide our attention. The analyst "analyzes" the data using data analysis and visualization tools. This
analysis leads the analyst to some sort of "insight" about the data. The analyst then uses presentation tools
to disseminate this insight to a broader audience, that is, the parties that generated the original goal of
the analysis.
KNOWLEDGE MANAGEMENT
We would add to the information analysis a formalized operator for processing the knowledge
produced by experts when they analyze the clusters. Knowledge management presupposes a system that
implements it. The system must be able to process the results of the knowledge organization,
allowing not only exploration and visualization but also operations on this knowledge. The system must
be able to manage at least three types of data that we wish to combine: clusters, classes, and the
bibliographic or textual data (which may themselves be of different types). The idea is therefore to model
not only the bibliographic data and the clusters obtained from this data, as is currently done, but also
the classes of knowledge obtained from the experts' analysis of the clusters.
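The three types of data could be modeled along the following lines. This is a hypothetical sketch: the
class names, fields and the `documents_for_class` operation are invented here to illustrate how
bibliographic data, clusters and expert-defined classes might be linked, and do not describe an existing
system.

```python
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class Document:
    """A bibliographic or textual record."""
    doc_id: str
    title: str

@dataclass
class Cluster:
    """A topic produced by the clustering phase."""
    cluster_id: str
    label: str
    doc_ids: List[str] = field(default_factory=list)

@dataclass
class KnowledgeClass:
    """A class of knowledge defined by an expert who analyzed one or more clusters."""
    name: str
    comment: str
    cluster_ids: List[str] = field(default_factory=list)

class KnowledgeStore:
    """Holds the three data types and supports operations that combine them."""

    def __init__(self) -> None:
        self.documents: Dict[str, Document] = {}
        self.clusters: Dict[str, Cluster] = {}
        self.classes: Dict[str, KnowledgeClass] = {}

    def documents_for_class(self, name: str) -> List[Document]:
        """Example operation: all documents reachable from an expert class via its clusters."""
        found: Dict[str, Document] = {}
        for cluster_id in self.classes[name].cluster_ids:
            for doc_id in self.clusters[cluster_id].doc_ids:
                found[doc_id] = self.documents[doc_id]
        return list(found.values())

store = KnowledgeStore()
store.documents["d1"] = Document("d1", "Neural networks for patent mapping")
store.documents["d2"] = Document("d2", "Clustering scientific publications")
store.clusters["c1"] = Cluster("c1", "neural mapping", ["d1"])
store.clusters["c2"] = Cluster("c2", "document clustering", ["d1", "d2"])
store.classes["methods"] = KnowledgeClass("methods", "expert note", ["c1", "c2"])
titles = [d.title for d in store.documents_for_class("methods")]
print(titles)
```

Keeping expert classes as first-class records, rather than as annotations on clusters, is what would allow
them to be queried and reused in later analyses of the same domain.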
Generally speaking, a knowledge management system (KMS) is concerned with the identification,
acquisition, development, diffusion, use, and preservation of the enterprise’s knowledge. Without going
into details, we can accept for our purpose the following general concept: "Knowledge management is the
formal management of knowledge for facilitating creation, access, and reuse of knowledge, typically using
advanced technology" (O'Leary, 1998a; 1998c). Our aim is to capitalize on the expert knowledge produced
in a science and technology watch analysis, as a way of reusing this knowledge in new actions concerning
the same domain. In our opinion, science and technology watch and knowledge management will yield
their full benefit only if these tasks are fully integrated.
A more operational definition of knowledge management is in terms of converting and connecting.
"Knowledge management is a process of converting knowledge from the sources accessible to an
organization and connecting people with that knowledge" (O’Leary, 1998b). The functions of a KMS are
then to convert knowledge from textual data sources and to connect people with that knowledge.
Science and technology watch in the broadest sense can be considered as the observation and
monitoring of scientific and technological changes in order to alert decision-makers to the
consequences of scientific and technological issues and trends.
REFERENCES
Brachman R. J. Anand T., 1996, “The Process of Knowledge Discovery in Databases.” In U. M.
Fayyad, G. Piatetsky-Shapiro, P. Smyth, R. Uthurusamy (eds), Advances in Knowledge Discovery and Data
Mining, Menlo Park, Calif., AAAI Press / The MIT Press, pp. 37-57.
Card S. K. Mackinlay J. D. Shneiderman B. (eds), 1999, Readings in Information Visualization. San
Francisco, Calif., Morgan Kaufmann Publishers Inc.
Feldman R. Aumann Y. Zilberstein A. Ben-Yehuda Y., 1998, “Trend Graphs: Visualizing the
Evolution of Concept Relationships in Large Document Collections.” In J.M. Zytkow and M. Quafafou
(eds) Principles of Data Mining and Knowledge Discovery. Berlin: Springer Verlag, pp. 38-46.
Kohonen T., 1997, Self-Organizing Maps, Berlin, Springer.
Landau D. Feldman R. Aumann Y. Fresko M. Lindell Y. Lipshtat O. Zamir O., 1998, “TextVis: An
Integrated Visual Environment for Text Mining.” In J.M. Zytkow and M. Quafafou (eds) Principles of Data
Mining and Knowledge Discovery. Berlin: Springer Verlag, pp. 56-64.
O'Leary D. E., 1998a, “Knowledge-Management Systems: Converting and Connecting.” IEEE
Intelligent Systems, vol. 13, n° 3, pp. 30-33.
O'Leary D. E., 1998b, “Using AI in Knowledge Management: Knowledge Bases and Ontologies.”
IEEE Intelligent Systems, vol. 13, n° 3, pp. 34-39.
O'Leary D. E., 1998c, “Enterprise Knowledge Management.” Computer, vol. 31, n° 3, pp. 54-61.
Polanco X. Grivel L. Royauté J., 1995, “How To Do Things with Terms in Informetrics:
Terminological Variation and Stabilization as Science Watch Indicators.” In Proceedings of the Fifth
International Conference of the International Society for Scientometrics and Informetrics. Edited by M.E.D. Koenig and
A. Bookstein. Medford, N.J., Learned Information Inc., pp. 435-444.
Polanco X. François C. Keim J-P., 1998a, “Artificial Neural Network Technology for the
Classification and Cartography of Scientific and Technical Information.” Scientometrics, vol. 41, n° 1,
pp. 69-82.
Polanco X. François C. Ould Louly A., 1998b, “For Visualization-Based Analysis Tools in Knowledge
Discovery Process: A Multilayer Perceptron versus Principal Components Analysis - A Comparative
Study.” In J.M. Zytkow and M. Quafafou (eds) Principles of Data Mining and Knowledge Discovery. Second
European Symposium, PKDD’98, Nantes, France, 23-26 September 1998. Lecture Notes in Artificial
Intelligence 1510. Subseries of Lecture Notes in Computer Science. Berlin, Springer, pp. 28-37.
Polanco X. François C. Lamirel J. Ch., 2000a, “Using Artificial Neural Networks for Mapping
Science.” In Book of Abstracts of the Sixth International Conference on Science and Technology Indicators. Leiden,
The Netherlands, 24-27 May, p. 89.
Polanco X. François C., 2000b, “Data Clustering and Cluster Mapping or Visualization in Text
Processing and Mining.” In Proceedings of the Sixth International Conference of the International Society for Knowledge
Organization, July 10-13, 2000, in Toronto, Canada, pp. 359-365.