Text Mining - Beyond Information Extraction
Towards Text Information Exploitation
Project Proposal for „Future and Emerging Technologies“
in the EU-IST Programme
S. Staab, R. Studer
K. Markert, B. Webber
N. Kushmerick
B. Bremdal, R. Engels
Karlsruhe University
University of Edinburgh
University College Dublin
Cognit a.s
1 Abstract
Motivation: The revolutionary step from printed text to digital documents has led to an
explosive growth of knowledge available (semi-)publicly through the internet or through
community and corporate intranets. With this flood of potentially useful information,
there comes the urgent need to sift through it, find the golden nuggets of information and
analyze it for making informed decisions.
Problem: So far, work in text mining has mostly concentrated on tasks such as extracting information from texts, summarizing the relevant information, or answering questions about texts. However, this information extraction-based vision, which has been elaborated, e.g., in approaches for message understanding, mostly neglects the amount and
complexity of information that the user must deal with and act upon. In contrast, the
further connotations of text mining that go well beyond information extraction, namely the
aggregation and analysis of information into golden nuggets of knowledge that may
lead to informed actions, have hardly been investigated so far.
Objective: In our project we want to go beyond information extraction towards text information exploitation. This means we want to aggregate extracted information in order
to deduce knowledge that may not have been in the mind of the authors of the individual texts.
For example, a decade ago there were many separate medical research reports describing symptoms of a particular type of migraine headache, such as a lack of magnesium (among
others). However, the causal link between magnesium deficits and migraine headaches
was implicit, and sometimes it was given only indirectly via other symptoms. A manual
literature search found that the literature reported eight different types of direct and indirect links between magnesium deficits and migraine headaches. This result strongly supported the hypothesis that a lack of magnesium causes this type of migraine headache,
a hypothesis which easily proved true in subsequent medical experiments and, thus, became very valuable to know. However, though all the information was in the documents,
the knowledge about the potential causal link was very hard to discover.
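
The aggregation step in this example can be illustrated with a minimal Python sketch. It assumes that an information extraction step has already produced (document, cause, effect) triples. The triples, the document identifiers and the function name indirect_links are invented placeholders for illustration and are not taken from the medical literature. The sketch merely looks for an intermediate concept B such that one document links A to B and a different document links B to C.

```python
from collections import defaultdict

# Hypothetical extracted relations: (document id, cause, effect).
# All values are invented placeholders, not real medical findings.
RELATIONS = [
    ("doc1", "magnesium deficit", "condition_x"),
    ("doc2", "condition_x", "migraine"),
    ("doc3", "magnesium deficit", "condition_y"),
    ("doc4", "condition_y", "migraine"),
    ("doc5", "stress", "condition_z"),
]


def indirect_links(relations, start, goal):
    """Return sorted (intermediate, doc_start, doc_goal) triples.

    A link is reported when one document states start -> B and a
    different document states B -> goal.
    """
    outgoing = defaultdict(list)  # cause -> [(effect, document id)]
    for doc, cause, effect in relations:
        outgoing[cause].append((effect, doc))

    links = []
    for middle, doc_a in outgoing[start]:
        for effect, doc_b in outgoing[middle]:
            if effect == goal and doc_a != doc_b:
                links.append((middle, doc_a, doc_b))
    return sorted(links)
```

Calling indirect_links(RELATIONS, "magnesium deficit", "migraine") lists the intermediate concepts together with the two documents that jointly support each indirect link. A real system would additionally have to map different wordings to the same concept before such a join becomes possible.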
Text mining in the sense we envision will allow for applications that handle tasks like
finding causal links described in texts (semi-)automatically. Once the text mining application has been set up by a systems engineer, the naive user lets the system extract information, aggregate it and discover those nuggets of knowledge that may actually help
him/her to solve a problem.
The objective just described will be put to work in a realistic environment. This means
we must consider:
1. Real-world texts: This means we must include semi-structured information such as
given in de facto web standards, like HTML and XML. This also implies that we
must try to extract information from layout structures and from more rigid formats,
such as tables appearing in natural language texts.
2. Integrating Information Extraction Techniques: We need a broad basis for information extraction in order to solve the information exploitation task. For this purpose,
we want to build on the experience gained with TREC- and MUC-style approaches as well as with machine learning techniques that have been applied
for wrapping semi-structured data. This requires strong competence in the fields of
Information Retrieval and Extraction, Computational Linguistics and Machine Learning.
3. Domain Ontology: In order to go beyond “simple” phrase extraction, we need a domain ontology that acts as a semantic mediator between different information extraction methods, that allows for knowledge discovery at different levels of granularity,
and that allows for mappings between different terminologies. This domain ontology
may be coded manually up to a point, but ultimately it needs to be extended (semi-)automatically, including proposals for ontology induction by knowledge discovery
4. Text mining as a semi-automatic process: We consider text mining a semi-automatic
process that is designed and set up with a particular application and particular topics
in mind.
The design involves the construction of a domain ontology and a domain lexicon, the
formulation and/or learning of interesting structures with computational linguistics
and/or information retrieval techniques and the exploration of the corresponding results. Once the domain-specific text mining application is set up, the naive user may
run it to extract information and, in particular, to find associations and rules that
were not present in the original texts, but that could only be found by considering, integrating and comparing various text sources.
This approach parallels the development in data mining, where the utopia of a fully
automatic knowledge discovery process has been given up in favor of a highly successful process model that distinguishes between design and use.
5. Discovering knowledge: Finally, we actually need to apply machine learning techniques to aggregate and analyse the extracted information, yielding the “golden nuggets
of knowledge” we are looking for in the first place.
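
The role of the domain ontology in point 3 (mapping between terminologies and supporting discovery at different levels of granularity) can be illustrated with a minimal Python sketch. The concept names, the synonym table and the function names below are illustrative assumptions and do not describe an ontology from the project. Terms extracted by different methods are first mapped to a shared concept and can then be counted at a more general level.

```python
# Hypothetical mini-ontology: every name below is an illustrative assumption.
SYNONYMS = {
    "mobile phone": "MobilePhone",
    "cell phone": "MobilePhone",
    "handset": "MobilePhone",
    "landline": "FixedLinePhone",
}
IS_A = {
    "MobilePhone": "TelecomProduct",
    "FixedLinePhone": "TelecomProduct",
    "TelecomProduct": "Product",
}


def to_concept(term):
    """Map a surface term to its ontology concept, or None if unknown."""
    return SYNONYMS.get(term.lower().strip())


def ancestors(concept):
    """Return the chain of more general concepts, most specific first."""
    chain = []
    while concept in IS_A:
        concept = IS_A[concept]
        chain.append(concept)
    return chain


def mentions_of(terms, level):
    """Count terms whose concept is `level` or a specialisation of it."""
    total = 0
    for term in terms:
        concept = to_concept(term)
        if concept is not None and (concept == level or level in ancestors(concept)):
            total += 1
    return total
```

With the terms "Cell phone", "handset" and "landline", counting at the level MobilePhone and at the level TelecomProduct gives different aggregates from the same extracted data, which is the kind of granularity shift the ontology is meant to enable.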
Scenario: As an interesting case study we choose the mining of annual business reports
and analysts' reports available on the internet that comment on companies from a particular area (e.g., telecommunication). This scenario is very appropriate, because
1. It allows the observation of competitors and the detection of trends that are extremely
important for decision makers, such as trends in organizational structures or in markets and products.
2. Understanding these texts is most valuable if pieces of information are compared with
each other: annual changes or variations between companies are most interesting.
This is exactly the type of implicit knowledge our methods are supposed to discover.
3. The setting is sufficiently well observed and understood by professionals to allow verification of the techniques we develop. Since AIFB is integrated in a business faculty and
since Cognit has some experience with this domain, we can also contribute a satisfactory
amount of experience in this area.
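
The comparison mentioned in point 2 can be sketched in Python as follows. The sketch assumes that figures have already been extracted from the reports as (company, year, attribute, value) tuples. The company names, the attribute and all values are invented for illustration, and the function name yearly_changes is hypothetical.

```python
# Hypothetical extracted facts: (company, year, attribute, value).
# All names and numbers are invented for illustration.
FACTS = [
    ("CompanyA", 1998, "employees", 1000),
    ("CompanyA", 1999, "employees", 1200),
    ("CompanyB", 1998, "employees", 500),
    ("CompanyB", 1999, "employees", 450),
]


def yearly_changes(facts, attribute):
    """Return {company: [(year_from, year_to, relative_change), ...]}.

    Year pairs whose starting value is zero are skipped, because the
    relative change is undefined for them.
    """
    series = {}
    for company, year, attr, value in facts:
        if attr == attribute:
            series.setdefault(company, {})[year] = value

    changes = {}
    for company, by_year in series.items():
        years = sorted(by_year)
        changes[company] = [
            (prev, cur, (by_year[cur] - by_year[prev]) / by_year[prev])
            for prev, cur in zip(years, years[1:])
            if by_year[prev] != 0
        ]
    return changes
```

Neither report alone states how the two companies developed relative to each other. The comparison only becomes visible once the extracted values are placed side by side.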
2 Chances for Europe
Our proposal aims at extracting the most knowledge from the flood of
information found in large collections of text. Numerous chances and possibilities
arise from this information exploitation approach; several major ones are:
1. Informed decisions supporting competitiveness: In order to be competitive, European
businesses must be able to recognize and react quickly to new challenges and trends
and make informed decisions. Large collections of texts implicitly contain this type
of knowledge, but it is not accessible with current tools and techniques. Information
exploitation from texts will offer an extremely valuable means of providing such mission-critical knowledge from the internet and from the corporate intranet, thus keeping European businesses competitive and flexible.
2. Individual learning: The individual will have tools that allow him/her easy access
to large collections of text and that give him/her the possibility to discover and
understand knowledge that is hidden in texts. In a connected society where everyone
works with the net, such tools will become indispensable in order to bridge the widening gap between computer literates and those with less experience with computers.
3. Progress in Research: Though our scenario develops a particular business case, the migraine example makes it easy to see that many research fields may
profit similarly from text information exploitation, simply because research articles in
all sciences and arts are being put on the internet.
All these factors are critical to developing knowledge of Europeans and for Europeans. Informed decisions, individual learning and improved research all work together in keeping Europe competitive and in helping the computer user to keep track of the flood of
texts on the world wide web.
3 Contribution of Partners
We consider text mining as being a knowledge acquisition process that should be facilitated by learning approaches and by the techniques found in information retrieval and
computational linguistics. Hence, the consortium includes partners with different backgrounds:
Karlsruhe University, Institute for Applied Informatics, will focus on knowledge acquisition tools, techniques and methodologies.
The University of Edinburgh, Division of Informatics, will concentrate on computational linguistics tools and techniques for information extraction. Important for them is
University College Dublin, Department of Computer Science, will contribute competence in machine learning and integration of semi-structured data. .....
Cognit a.s has broad experience with information retrieval techniques that they also
combine with machine learning approaches....
The work packages will be split along the following lines:
[Figure: split of work packages among Univ. Karlsruhe, Univ. Edinburgh, Univ. College Dublin and Cognit. Recoverable labels: IR methods on XML texts; syntax and semantics of XML text (with layout); integrating IE wrappers; learning from extracting information with IR methods; ontology modeling and induction; text mining methodology as process; knowledge discovery from IR and CL data; reusable components?]
4 Persons' Profiles
Knowledge Acquisition (Karlsruhe University):
Prof. Dr. Rudi Studer holds the chair for knowledge management at Karlsruhe University.
He has carried out research and organized numerous activities in the fields of knowledge
acquisition, knowledge management and data mining for over 20 years.
Dr. Steffen Staab is senior researcher and lecturer at Karlsruhe University, where he is
also project manager for Karlsruhe in the government-funded project GETESS, which
aims at an information extraction system for tourism. Dr. Staab received his M.S.E. from the
Univ. of Pennsylvania, USA, and his Dr. rer. nat. from Freiburg University, Germany.
His dissertation dealt with semantic and ontological issues of text understanding.
A paper about knowledge representation and reasoning won him a best paper award at
the European Conference on AI '98. His research interests and organizational activities
now include topics in knowledge management and knowledge discovery.
Computational Linguistics (University of Edinburgh):
Prof. Dr. Bonnie Webber...
Dr. Katja Markert....
Machine Learning (University College Dublin):
Dr. Nicholas Kushmerick is College Lecturer in the Department of Computer Science,
University College Dublin, Ireland. Dr. Kushmerick received his Ph.D. in 1997 at the
University of Washington, and his dissertation was nominated for the ACM
Distinguished Dissertation award. Dr. Kushmerick has worked in the areas of planning,
machine learning, and information extraction, integration, and retrieval. His work
has been published in several international journals, and he has been on the organizing
committee of numerous conferences and workshops. Dr. Kushmerick’s current work
focuses on the use of machine learning to scale up knowledge engineering on the
Internet, in service of problems such as information extraction and designing intelligent
browsing assistants.
Information Retrieval and Extraction (Cognit a.s):
Dr. Bernt Arild Bremdal: Studied Marine Technology in Trondheim, Norway. After finishing his MSc at the NTNU, he wrote his PhD at the same university. He received his PhD on
the application of artificial intelligence, rule-based and object-oriented programming in
project planning in 1988. After affiliations with a variety of companies, he co-founded and now directs CognIT a.s. He is the author of more than 50 articles and published reports
on computer applications in engineering and industry, design and planning, object-oriented technology and artificial intelligence. His most recent publication is Braunschweig
and Bremdal, “AI in the Petroleum Industry.” Volume 2. Edition Technip 1996.
Dr. Robert Engels: Studied Psychology and Computer Science at the University of Amsterdam, NL. He wrote his MSc thesis on applications of Inductive Logic Programming in Stockholm, Sweden. In 1999 he received his PhD from the University of Karlsruhe for
research conducted in the area of Knowledge Discovery and Data Mining. He has (co-)authored a variety of papers and organised several international and national (German)
workshops on practical applications of Data Mining. He is currently affiliated with CognIT as a senior systems architect.
5 Partner Addresses
Dr. Steffen Staab, Prof. Dr. Rudi Studer
Institute for Applied Informatics and Formal Description Methods (AIFB),
Karlsruhe University, D-76128 Karlsruhe, Germany
Dr. Katja Markert, Prof. Dr. Bonnie Webber
Division of Informatics, University of Edinburgh, 80 South Bridge
Edinburgh EH1 1HN, Scotland
Dr. Nicholas Kushmerick
Department of Computer Science, University College Dublin, Dublin 4, Ireland
Dr. Robert Engels, Dr. Bernt Bremdal
Cognit a.s, P.B. 610, N-1754 Halden, Norway
Research Issues:
Open research issues in this field are manifold, e.g.:
1. Extracting the semantics of layout with computational linguistics, aligning semi-structured data with the corresponding ontology information
2. Aligning several information extraction techniques (TREC-style) with
Integrating techniques: ontology, machine learning, information retrieval and extraction, computational linguistics (learning with ontologies, inducing ontologies from
computational linguistics and information extraction techniques, aligning wrapper
induction with ontologies, applying information extraction measures to the syntactic
and semantic level,