Text Mining - Beyond Information Extraction Towards Text Information Exploitation

Project Proposal for "Future and Emerging Technologies" in the EU-IST Programme

S. Staab [1], R. Studer (Karlsruhe University)
K. Markert, B. Webber (University of Edinburgh)
N. Kushmerick (University College Dublin)
B. Bremdal, R. Engels (Cognit a.s)

http://www.aifb.uni-karlsruhe.de/~sst/Research/Projects/TextMining/

1 Abstract

Motivation: The revolutionary step from printed text to digital documents has led to an explosive growth of knowledge available (semi-)publicly through the internet or through community and corporate intranets. With this flood of potentially useful information comes the urgent need to sift through it, find the golden nuggets of information and analyze them in order to make informed decisions.

Problem: So far, work in text mining has mostly concentrated on purposes like extracting information from texts, summarizing the relevant information, or answering questions on texts. However, this information extraction-based vision, which has been elaborated, e.g., in approaches for message understanding, mostly neglects the amount and complexity of information that the user must deal with and act upon. In contrast, the further connotations of text mining that go well beyond information extraction (the aggregation and analysis of information into golden nuggets of knowledge that may lead to informed actions) have hardly been investigated so far.

Objective: In our project we want to go beyond information extraction towards text information exploitation. This means we want to aggregate extracted information in order to deduce knowledge that may not have been in the mind of the authors of the individual texts. For example, a decade ago there were a lot of separate medical research reports describing symptoms of a particular type of migraine headache, such as a lack of magnesium (among others).
However, the causal link between magnesium deficits and migraine headaches was implicit, and sometimes it was given only indirectly via other symptoms. A manual literature search found that the literature reported eight different types of direct and indirect links between magnesium deficits and migraine headaches. This result strongly supported the hypothesis that a lack of magnesium causes this type of migraine headache, a hypothesis that was readily confirmed in subsequent medical experiments and thus became very valuable to know. However, though all the information was in the documents, the knowledge about the potential causal link was very hard to discover.

[1] Contact: Steffen Staab, AIFB, Karlsruhe University, D-76128 Karlsruhe, email: staab@aifb.uni-karlsruhe.de, Tel.: +49(0)721/608 7363, Fax.: +49(0)721/ 693717

Text mining in the sense we envision will allow for applications that handle tasks like finding causal links described in texts (semi-)automatically. Once the text mining application has been set up by a systems engineer, the naive user lets the system extract information, aggregate it and discover those nuggets of knowledge that may actually help to solve a problem.

Methods: The objective just described will be put to work in a realistic environment. This means we must consider:
1. Real-world texts: We must include semi-structured information such as that given in de facto web standards like HTML and XML. This also implies that we must try to extract information from layout structures and from more rigid formats, such as tables appearing in natural language texts.
2. Integrating Information Extraction Techniques: We need a broad basis for information extraction in order to solve the information exploitation task. For this purpose, we want to build on the experience gained with TREC- and MUC-style approaches as well as with machine learning techniques that have been applied for wrapping semi-structured data.
This requires strong competence in the fields of Information Retrieval and Extraction, Computational Linguistics and Machine Learning.
3. Domain Ontology: In order to go beyond "simple" phrase extraction, we need a domain ontology that acts as a semantic mediator between different information extraction methods, that allows for knowledge discovery at different levels of granularity and that allows for mappings between different terminologies. This domain ontology may be coded manually up to a point, but ultimately it needs to be extended (semi-)automatically, including proposals for ontology induction by knowledge discovery techniques.
4. Text mining as a semi-automatic process: We consider text mining a semi-automatic process that is designed and set up with a particular application and particular topics in mind. The design involves the construction of a domain ontology and a domain lexicon, the formulation and/or learning of interesting structures with computational linguistics and/or information retrieval techniques, and the exploration of the corresponding results. Once the domain-specific text mining application is set up, the naive user may run it to extract information and, in particular, to find associations and rules that were not present in the original texts, but that could only be found by considering, integrating and comparing various text sources. This approach parallels the development in data mining, where the utopia of a fully automatic knowledge discovery process has been given up in favor of a highly successful process model that distinguishes between design and use.
5. Discovering knowledge: Finally, we actually need to apply machine learning techniques to aggregate and analyse extracted information, yielding those "golden nuggets of knowledge" we are looking for in the first place.
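
As a minimal sketch of the aggregation idea behind item 5 and the migraine example, the following Python fragment combines cause-effect pairs extracted from separate documents and reports chains that no single document states. The triple format, the function name find_indirect_links and the placeholder terms factor_x, factor_y and factor_z are assumptions made for illustration; they are not taken from the proposal or from the medical literature. A real system would presumably use the domain ontology to map differing terminologies onto common concepts before aggregating.

```python
from collections import defaultdict

# Hypothetical output of an upstream extraction step:
# (document id, cause, effect). All terms are illustrative placeholders.
EXAMPLE_EXTRACTIONS = [
    ("doc1", "magnesium deficit", "factor_x"),
    ("doc2", "factor_x", "migraine"),
    ("doc3", "magnesium deficit", "factor_y"),
    ("doc4", "factor_y", "migraine"),
    ("doc5", "factor_z", "migraine"),
]


def find_indirect_links(extractions, source, target):
    """Return direct and one-step indirect links from source to target.

    Each result is a pair (chain, supporting_document_ids).
    """
    # Aggregate: map each cause to its effects and the supporting documents.
    graph = defaultdict(lambda: defaultdict(set))
    for doc_id, cause, effect in extractions:
        graph[cause][effect].add(doc_id)

    links = []
    # Direct link, stated in at least one document.
    if target in graph[source]:
        links.append(((source, target), sorted(graph[source][target])))
    # Indirect links: source -> intermediate -> target, possibly across documents.
    for mid, docs_a in graph[source].items():
        if mid != target and target in graph.get(mid, {}):
            docs = sorted(docs_a | graph[mid][target])
            links.append(((source, mid, target), docs))
    return links


if __name__ == "__main__":
    for chain, docs in find_indirect_links(
        EXAMPLE_EXTRACTIONS, "magnesium deficit", "migraine"
    ):
        print(" -> ".join(chain), "supported by", ", ".join(docs))
```

The sketch only follows chains with a single intermediate term and treats every extracted pair as equally reliable; weighting links by the number of supporting documents, or following longer chains, would be natural extensions.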
Scenario: As an interesting case study we choose the mining of annual business reports and analysts' reports available on the internet that comment on companies from a particular area (e.g., telecommunication). This scenario is very appropriate, because:
1. It allows the observation of competitors and the detection of trends that are extremely important for decision makers, such as trends in organizational structures or in markets and products.
2. The understanding of these texts is most valuable if pieces of information are compared with each other: annual changes or variations between companies are most interesting. This is exactly the type of implicit knowledge our methods are supposed to discover.
3. The setting is observed and understood well enough by professionals to verify the techniques we develop. Since AIFB is part of a business faculty and Cognit has some experience with this domain, we can also contribute sufficient expertise in this area.

2 Chances for Europe

Our proposal aims to make the most knowledge out of the flood of information found in large collections of text. Numerous chances and possibilities arise from this information exploitation approach. Several major ones are:
1. Informed decisions supporting competitiveness: In order to be competitive, European businesses must be able to recognize and react quickly to new challenges and trends and make informed decisions. Large collections of texts implicitly contain this type of knowledge, but it is not accessible with current tools and techniques. Information exploitation from texts will offer an extremely valuable means of providing such mission-critical knowledge from the internet and from the corporate intranet, thus keeping European businesses competitive and flexible.
2. Individual learning: The individual will have tools that allow him/her easy access to large collections of text and that give him/her the possibility to discover and understand knowledge that is hidden in texts. In a connected society where everyone works with the net, such tools will become indispensable in order to bridge the widening gap between computer literates and those with less experience with computers.
3. Progress in Research: Though our scenario develops a particular business case, the migraine example makes it easy to see that many research fields may profit similarly from text information exploitation, simply because research articles in all sciences and arts are being put on the internet.

All these factors are critical for developing knowledge of Europeans and for Europeans. Informed decisions, individual learning and improved research all work together in keeping Europe competitive and in helping computer users keep track of the flood of texts on the World Wide Web.

3 Contribution of Partners

We consider text mining to be a knowledge acquisition process that should be facilitated by learning approaches and by the techniques found in information retrieval and computational linguistics. Hence, the consortium includes partners with different backgrounds:

Karlsruhe University, Institute for Applied Informatics, will focus on knowledge acquisition tools, techniques and methodologies.
The University of Edinburgh, Division of Informatics, will concentrate on computational linguistics tools and techniques for information extraction. Important for them is ....
University College Dublin, Department of Computer Science, will contribute competence in machine learning and integration of semi-structured data. .....
Cognit a.s has broad experience with information retrieval techniques that they also combine with machine learning approaches....

The work packages will be split along the following lines:

[Figure: work packages by partner (Univ. Karlsruhe, Univ. College Dublin, Cognit, Univ. Edinburgh). Labels recoverable from the figure: Real-world texts; Syntax, Semantics (with Layout) of XML text; IR methods on XML texts; Learning from structured documents; Extracting information with IR methods; Wrappers; Integrating IE techniques with ontologies; Domain ontology; Ontology modeling and induction; Ontology induction; Text Mining Methodology as Process; Information Exploitation; Knowledge Discovery from IR and CL data; Reusable components?]

4 Persons' Profiles

Knowledge Acquisition (Karlsruhe University):
Prof. Dr. Rudi Studer holds a chair for knowledge management at Karlsruhe University. He has carried out research and organized numerous activities in the fields of knowledge acquisition, knowledge management and data mining for over 20 years.
Dr. Steffen Staab is senior researcher and lecturer at Karlsruhe University, where he is also project manager for Karlsruhe in the government-funded project GETESS, which aims at an information extraction system for tourism. Dr. Staab received his M.S.E. from the Univ. of Pennsylvania, USA, and his Dr. rer. nat. from Freiburg University, Germany. His dissertation dealt with semantic and ontological issues of text understanding. A paper about knowledge representation and reasoning won him a best paper award at the European Conference on AI '98. His research interests and organizational activities now include topics in knowledge management and knowledge discovery.

Computational Linguistics (University of Edinburgh):
Prof. Dr. Bonnie Webber...
Dr. Katja Markert....

Machine Learning (University College Dublin):
Dr. Nicholas Kushmerick is College Lecturer in the Department of Computer Science, University College Dublin, Ireland. Dr. Kushmerick received his Ph.D. in 1997 at the University of Washington, and his dissertation was nominated for the ACM Distinguished Dissertation award. Dr.
Kushmerick has worked in the areas of planning, machine learning, and information extraction, integration, and retrieval. His work has been published in several international journals, and he has been on the organizing committee of numerous conferences and workshops. Dr. Kushmerick's current work focuses on the use of machine learning to scale up knowledge engineering on the Internet, in service of problems such as information extraction and designing intelligent browsing assistants.

Information Retrieval and Extraction (Cognit a.s):
Dr. Bernt Arild Bremdal studied Marine Technology in Trondheim, Norway. After finishing his MSc at the NTNU, he wrote his PhD at the same university. He received his PhD in 1988 for work on the application of artificial intelligence, rule-based and object-oriented programming in project planning. After affiliations with a variety of companies, he co-founded CognIT a.s, which he now directs. He is the author of more than 50 articles and published reports on computer applications in engineering and industry, design and planning, object-oriented technology and artificial intelligence. His most recent publication is Braunschweig and Bremdal, "AI in the Petroleum Industry", Volume 2, Edition Technip, 1996.
Dr. Robert Engels studied Psychology and Computer Science at the University of Amsterdam, NL. He wrote his MSc thesis on applications of Inductive Logic Programming in Stockholm, Sweden. In 1999 he received his PhD from the University of Karlsruhe for research conducted in the area of Knowledge Discovery and Data Mining. He has (co-)authored a variety of papers and organised several international and national (German) workshops on practical applications of Data Mining. Currently he is affiliated with CognIT as a senior systems architect.

5 Partner Addresses

Dr. Steffen Staab, Prof. Dr. Rudi Studer
Institute for Applied Informatics and Formal Description Methods (AIFB), Karlsruhe University, D-76128 Karlsruhe, Germany
http://www.aifb.uni-karlsruhe.de/WBS
mailto:staab@aifb.uni-karlsruhe.de,studer@aifb.uni-karlsruhe.de

Dr. Katja Markert, Prof. Dr. Bonnie Webber
Division of Informatics, University of Edinburgh, 80 South Bridge, Edinburgh EH1 1HN, Scotland
http://www.informatics.ed.ac.uk/research/irr/
mailto:markert@cogsci.ed.ac.uk,bonnie@dai.ed.ac.uk

Dr. Nicholas Kushmerick
Department of Computer Science, University College Dublin, Dublin 4, Ireland
http://www.cs.ucd.ie/staff/nick/
mailto:nick@ucd.ie

Dr. Robert Engels, Dr. Bernt Bremdal
Cognit a.s, P.B. 610, N-1754 Halden, Norway
http://www.cognit.no/
mailto:robert.engels@cognit.no,bernt.bremdal@cognit.no

We should have a section "open research issues in detail that we want to tackle" in a longer version.

Research Issues: Open research issues in this field are manifold, e.g.:
1. Extracting the semantics of layout with computational linguistics, aligning semi-structured data with the corresponding ontology information
2. Aligning several information extraction techniques (TREC-style) with integrating techniques: ontology, machine learning, information retrieval and extraction, computational linguistics (learning with ontologies, inducing ontologies from computational linguistics and information extraction techniques, aligning wrapper induction with ontologies, applying information extraction measures to the syntactic and semantic level,