
From: KDD-97 Proceedings. Copyright © 1997, AAAI (www.aaai.org). All rights reserved.
Deep Knowledge Discovery from Natural Language Texts
Udo Hahn & Klemens Schnattinger
Computational Linguistics Lab - Text Knowledge Engineering Group
Freiburg University, Werthmannplatz, D-79085 Freiburg, Germany
(hahn,schnattinger)@coling.uni-freiburg.de
http://www.coling.uni-freiburg.de
Abstract
We introduce a knowledge-based approach to deep knowledge discovery from real-world natural language texts. Data mining, data interpretation, and data cleaning are all incorporated in cycles of quality-based terminological reasoning processes. The methodology we propose identifies new knowledge items and assimilates them into a continuously updated domain knowledge base.
Introduction
The work reported in this paper is part of a large-scale project aiming at the development of SYNDIKATE, a German-language text knowledge assimilation system (Hahn, Schnattinger, & Romacker 1996). Two real-world application domains are currently under active investigation - test reports on information technology products (101 documents with 10^5 words) and, the major application, medical reports (approximately 120,000 documents with 10^7 words). The task of the system is to aggressively assimilate any facts, propositions and evaluative assertions it can glean from the source texts and feed them into a shared text knowledge pool. This goal is actually even more ambitious than the MUC task (for a survey, cf. (Grishman & Sundheim 1996)), which requires mapping natural language texts onto a highly selective and fixed set of knowledge templates in order to extract factual knowledge items only.
Given that only a few of the relevant domain concepts can be supplied in the hand-coded initial domain knowledge base, a tremendous concept learning problem arises for text knowledge acquisition systems. Any other KDD system running on natural language text input for the purpose of knowledge extraction also faces the challenge of an open set of knowledge templates; even more so when it is specifically targeted at new knowledge items. In order to break the high complexity barrier of a system integrating text understanding and concept learning under realistic conditions, we supply a natural language parser (Neuhaus & Hahn 1996) that is inherently robust and has various strategies to get nearly optimal results out of deficient, i.e., underspecified knowledge sources in terms of partial, limited-depth parsing. The price we pay for this approach is underspecification and uncertainty associated with the knowledge we extract from texts. To cope with these problems, we build on expressively rich knowledge representation models of the
underlying domain (Hahn, Klenner, & Schnattinger 1996). Accordingly, we provide a start-up core ontology (such as the Penman Upper Model (Bateman et al. 1990)) in the format of terminological assertions. The task of the module we describe in this paper is then to position new knowledge items which occur in a text in that concept hierarchy and to link them with valid conceptual roles, role fillers and role filler constraints; hence, deep knowledge discovery.
Concept hypotheses reflecting different conceptual readings for new knowledge items emerge and are updated on the basis of two types of evidence. First, we consider the type of linguistic construction in which an unknown lexical item occurs in the text (we assume that the context and type of grammatical construction forces a particular interpretation on the unknown item); second, conceptual criteria are accounted for which reflect structural patterns of consistency, mutual justification, analogy, etc. of concept hypotheses with concept descriptions already available in the domain knowledge base. Both kinds of evidence are represented by a set of quality labels. The general concept learning problem can then be viewed as a quality-based decision task which is decomposed into three constituent parts: the continuous generation of quality labels for single concept hypotheses (reflecting the reasons for their formation and their significance in the light of other hypotheses), the estimation of the overall credibility of single concept hypotheses (taking the available set of quality labels for each hypothesis into account), and the computation of a preference order for the entire set of competing hypotheses (based on these accumulated quality judgments) to select the most plausible ones. These phases directly correspond to the major steps underlying KDD procedures (Fayyad, Piatetsky-Shapiro, & Smyth 1996), viz. data mining, data interpretation, and data cleaning, respectively.
A Scenario for Deep Knowledge Discovery
In order to illustrate our problem, consider the following knowledge discovery scenario. Suppose your knowledge of the information technology domain tells you that Aquarius is a company. In addition, you incidentally know that AS-168 is a computer system manufactured by Aquarius. You know absolutely nothing, however, about Megaline. Imagine, one day your favorite computer magazine features an article starting with "The Megaline unit by Aquarius ...". Has your knowledge increased, and what did you learn already from just this phrase?
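What a system can conclude from just this phrase can be made concrete with a toy terminological knowledge base. The taxonomy, role constraint, and function names below are illustrative assumptions for this scenario, not the actual SYNDIKATE knowledge base:

```python
# Toy taxonomy (child -> parent); an illustrative assumption only.
ISA = {"OBJECT": "TOP", "HARDWARE": "OBJECT", "COMPUTER": "HARDWARE",
       "PRINTER": "HARDWARE", "SOFTWARE": "OBJECT", "COMPANY": "TOP"}
# Concepts whose instances may carry a PRODUCT-OF role (assumed constraint).
HAS_PRODUCT_OF = {"HARDWARE", "COMPUTER", "PRINTER", "SOFTWARE"}

def subsumed_by(concept, ancestor):
    """True iff `ancestor` transitively subsumes `concept` in the taxonomy."""
    while concept != "TOP":
        if concept == ancestor:
            return True
        concept = ISA[concept]
    return concept == ancestor

def hypothesize(roles):
    """All concepts consistent with the roles observed for the unknown item."""
    # "The Megaline unit ..." suggests some kind of OBJECT.
    candidates = {c for c in ISA if subsumed_by(c, "OBJECT")}
    if "PRODUCT-OF" in roles:  # "... by Aquarius" fills a PRODUCT-OF role
        candidates &= HAS_PRODUCT_OF
    return candidates

# After reading "The Megaline unit by Aquarius":
space = hypothesize({"PRODUCT-OF": "AQUARIUS"})
print(sorted(space))  # ['COMPUTER', 'HARDWARE', 'PRINTER', 'SOFTWARE']
```

Even this single phrase already excludes concepts like COMPANY, while the remaining hypothesis space is narrowed further by later utterances.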
We focus here on the issues of learning accuracy and the learning rate. Due to the given learning environment, the measures we apply deviate from those commonly used in the machine learning community. In concept learning algorithms like IBL (Aha, Kibler, & Albert 1991) there is no hierarchy of concepts. Hence, any prediction of the class membership of a new instance is either true or false. However, as such hierarchies naturally emerge in terminological frameworks, a prediction can be more or less precise, i.e., it may approximate the goal concept at different levels of specificity. This is captured by our measure of learning accuracy which takes into account the conceptual distance of a hypothesis to the goal concept of an instance rather than simply relating the number of correct and false predictions, as in IBL.
In our approach, learning is achieved by the refinement of multiple hypotheses about the class membership of an instance. Then, the measure of learning rate we propose is concerned with the reduction of hypotheses as more and more information becomes available about one particular new instance. In contrast, IBL-style algorithms consider only one concept hypothesis per learning cycle and their notion of learning rate relates to the increase of correct predictions as more and more instances are being processed.

We considered a total of 101 texts taken from a corpus of information technology magazines. For each of them 5 to 15 learning steps were considered. A learning step is operationalized by the representation structure that results from the semantic interpretation of an utterance which contains the unknown lexical item. Since the unknown item is usually referred to several times in a text, several learning steps result. For instance, the learning steps associated with our scenario are given by: MEGALINE INST-OF UNIT and MEGALINE PRODUCT-OF AQUARIUS.

Learning Accuracy
In a first series of experiments, we investigated the learning accuracy of the system, i.e., the degree to which the system correctly predicts the concept class which subsumes the target concept under consideration. Learning accuracy for a single lexical item (LA) is here defined as (n being the number of concept hypotheses for that item):

  LA := (1/n) * SUM_{i in {1..n}} LA_i,  with
  LA_i := CP_i / SP_i            if the prediction in hypothesis i is correct, and
  LA_i := CP_i / (FP_i + DP_i)   otherwise.

SP_i specifies the length of the shortest path (in terms of the number of nodes traversed) from the TOP node of the concept hierarchy to the maximally specific concept subsuming the instance to be learned in hypothesis i; CP_i specifies the length of the path from the TOP node to that concept node in hypothesis i which is common both for the shortest path (as defined above) and the actual path to the predicted concept (whether correct or not); FP_i specifies the length of the path from the TOP node to the predicted (in this case false) concept and DP_i denotes the node distance between the predicted node and the most specific concept correctly subsuming the target in hypothesis i, respectively.

As an example, consider Fig. 1. If we assume MEGALINE to stand for a COMPUTER, then hypothesizing HARDWARE (correct, but too general) receives an LA = .75 (note that OBJECT ISA TOP), while hypothesizing PRINTER (incorrect, but still not entirely wrong) receives an LA = .6.

Fig. 2 depicts the learning accuracy curve for the entire data set (Figure 2: Learning Accuracy). It starts at LA values in the interval between 48% and 54% for LA -, LA TH and LA CB in the first learning step. LA CB gives the accuracy rate for the full qualification calculus including threshold and credibility criteria, LA TH considers linguistic criteria only, while LA - depicts the accuracy values without incorporating quality criteria at all, though terminological reasoning is still employed. In the final step, LA rises up to 79%, 83% and 87% for LA -, LA TH and LA CB, respectively. Hence, the pure terminological reasoning machinery which does not incorporate the qualification calculus always achieves an inferior level of learning accuracy compared with the learner equipped with it.
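The path-based LA definition can be checked directly on a small hierarchy. The hierarchy below (TOP, OBJECT, HARDWARE, COMPUTER, PRINTER) is our reconstruction of the Fig. 1 example, an assumption consistent with the reported values of .75 and .6:

```python
# Reconstructed fragment of the Fig. 1 hierarchy (child -> parent).
PARENT = {"OBJECT": "TOP", "HARDWARE": "OBJECT",
          "COMPUTER": "HARDWARE", "PRINTER": "HARDWARE"}

def path_to_top(concept):
    """Nodes from TOP down to `concept`, TOP first."""
    nodes = [concept]
    while concept != "TOP":
        concept = PARENT[concept]
        nodes.append(concept)
    return list(reversed(nodes))

def learning_accuracy(predicted, target):
    sp = path_to_top(target)     # SP_i: shortest path TOP -> target
    fp = path_to_top(predicted)  # FP_i: path TOP -> predicted concept
    cp = 0                       # CP_i: common prefix of both paths, in nodes
    for a, b in zip(sp, fp):
        if a != b:
            break
        cp += 1
    if predicted in sp:          # correct prediction, possibly too general
        return cp / len(sp)
    dp = len(fp) - cp            # DP_i: distance predicted -> last correct subsumer
    return cp / (len(fp) + dp)

print(learning_accuracy("HARDWARE", "COMPUTER"))  # 0.75
print(learning_accuracy("PRINTER", "COMPUTER"))   # 0.6
```

Both values of the paper's example are reproduced: a correct but too general hypothesis is discounted by its distance to the goal, an incorrect one additionally by its false-path excursion.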
Learning Rate
The learning accuracy focuses on the predictive power and validity of the learning procedure. By considering the learning rate (LR), we supply data from the stepwise reduction of alternatives in the learning process. Fig. 3 depicts the mean number of transitively included concepts for all considered hypothesis spaces per learning step (Figure 3: Learning Rate); each concept hypothesis relates to a concept which transitively subsumes various subconcepts, hence pruning of concept subhierarchies reduces the number of concepts being considered as hypotheses. Note that the most general concept hypothesis in our example denotes OBJECT which currently includes 196 concepts. In general, we observed a decisive slope of the curve for the learning rate. After the first step, slightly less than 50% of the included concepts are pruned (with 93, 94 and 97 remaining concepts for LR CB, LR TH and LR -, respectively). Summarizing this evaluation experiment, the system yields competitive accuracy rates (a mean of 87%), while at the same time exhibiting significant and valid reductions of the predicted concepts.
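The pruning effect that the learning rate measures can be sketched on a toy hierarchy. This is an illustrative assumption; the real hypothesis spaces start at 196 concepts under OBJECT:

```python
# Toy hierarchy (parent -> children); illustrative assumption only.
CHILDREN = {"OBJECT": ["HARDWARE", "SOFTWARE"],
            "HARDWARE": ["COMPUTER", "PRINTER"],
            "SOFTWARE": [], "COMPUTER": [], "PRINTER": []}

def transitive_size(concept):
    """Number of concepts transitively subsumed by `concept`, itself included."""
    return 1 + sum(transitive_size(c) for c in CHILDREN[concept])

def learning_rate(spaces_per_step):
    """Concepts still under consideration at each learning step."""
    return [sum(transitive_size(h) for h in space) for space in spaces_per_step]

# Step 1: everything under OBJECT is possible; later steps prune subhierarchies.
steps = [["OBJECT"], ["HARDWARE"], ["COMPUTER"]]
print(learning_rate(steps))  # [5, 3, 1]
```

Pruning a single general hypothesis thus removes its entire subhierarchy from consideration, which is why the curve drops so sharply after the first step.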
Related Work
The issue of text analysis is only rarely dealt with in the KDD community. The reason for this should be fairly obvious. Unlike pre-structured data repositories (e.g., schemata and relations in database systems), data mining in textual sources requires determining content-based formal structures in text strings prior to putting KDD procedures to work. Similar to approaches in the field of information retrieval (Dumais 1990), statistical methods of text structuralization are favored in the KDD framework (Feldman & Dagan 1995). While this leads to the determination of associative relations between lexical items, it does not allow the identification and relation of particular facts, assertions or even evaluations to particular concepts. If this kind of deep knowledge is to be discovered, sophisticated natural language processing methodologies must come into play.
Our approach bears a close relationship to the work of, e.g., (Rau, Jacobs, & Zernik 1989) and (Hastings 1996), who aim at the automated learning of word meanings from the textual context using a knowledge-intensive approach. But our work differs from theirs in that the need to cope with several competing concept hypotheses and to aim at a reason-based selection is not an issue in those studies. In the SCISOR system (Rau, Jacobs, & Zernik 1989), e.g., the selection of hypotheses depends only on an ongoing discrimination process based on the availability of linguistic and conceptual clues, but does not incorporate a dedicated inferencing scheme for reasoned hypothesis selection. The difference in learning performance, in the light of our evaluation study discussed in the previous section at least, amounts to the considerable difference between LA - (plain terminological reasoning) and LA CB values (terminological meta-reasoning based on the qualification calculus). Acquiring knowledge from real-world textual input usually provides the learner with only sparse, highly fragmentary clues, such that multiple concept hypotheses are likely to be derived from that input. So we stress the need for a hypothesis generation & evaluation component as an integral part of large-scale real-world text understanders operating in unknown domains with knowledge deficits.
This requirement also distinguishes our approach from the currently active field of information extraction (IE) (e.g., (Appelt et al. 1993)). The IE task is defined in terms of a fixed set of a priori templates which have to be instantiated (i.e., filled with factual knowledge items) in the course of text analysis. In contradistinction to our approach, no new templates have to be created.
Conclusion
We have introduced a new quality-based knowledge discovery methodology, the constituent parts of which can be equated with the major steps underlying KDD procedures (Fayyad, Piatetsky-Shapiro, & Smyth 1996): the generation of quality labels relates to the data mining (pattern extraction) phase, the estimation of the overall credibility of a single concept hypothesis refers to the data interpretation phase, while the selection of the most suitable concept hypothesis corresponds to the data cleaning mode.
Our approach is quite knowledge-intensive since knowledge discovery is fully integrated in the text understanding mode. No specialized learning algorithm is needed, since concept formation is turned into an inferencing task carried out by the classifier of a terminological reasoning system. Quality labels can be chosen from any knowledge source that seems convenient, thus ensuring easy adaptability. These labels also achieve a high degree of pruning of the search space for hypotheses in very early phases of the learning cycle. This is of major importance for considering our approach a viable contribution to KDD methodology.
Acknowledgments. We would like to thank our colleagues in the CLIF group for fruitful discussions and instant support, in particular Joe Bush who polished the text as a native speaker. K. Schnattinger is supported by a grant from DFG (Ha 2097/3-1).
References
Aha, D.; Kibler, D.; and Albert, M. 1991. Instance-based learning algorithms. Machine Learning 6:37-66.

Appelt, D.; Hobbs, J.; Bear, J.; Israel, D.; and Tyson, M. 1993. FASTUS: A finite-state processor for information extraction from real-world text. In IJCAI'93 - Proc. 13th Intl. Joint Conf. on Artificial Intelligence, 1172-1178.

Bateman, J.; Kasper, R.; Moore, J.; and Whitney, R. 1990. A general organization of knowledge for natural language processing: The PENMAN upper model. Technical report, USC/ISI.

Dumais, S. 1990. Text information retrieval. In Helander, M., ed., Handbook of Human-Computer Interaction, 673-700. Amsterdam: North-Holland.

Fayyad, U.; Piatetsky-Shapiro, G.; and Smyth, P. 1996. From data mining to knowledge discovery in databases. AI Magazine 17(3):37-54.

Feldman, R., and Dagan, I. 1995. Knowledge discovery in textual databases (KDT). In KDD'95 - Proc. 1st Intl. Conf. on Knowledge Discovery and Data Mining.

Grishman, R., and Sundheim, B. 1996. Message understanding conference - 6: A brief history. In COLING'96 - Proc. 16th Intl. Conf. on Computational Linguistics, 466-471.

Hahn, U., and Schnattinger, K. 1997. A qualitative growth model for real-world text knowledge bases. In RIAO'97 - Proc. 5th Conf. on Computer-Assisted Information Searching on the Internet.

Hahn, U.; Klenner, M.; and Schnattinger, K. 1996. Learning from texts: A terminological metareasoning perspective. In Wermter, S.; Riloff, E.; and Scheler, G., eds., Connectionist, Statistical and Symbolic Approaches to Learning for Natural Language Processing, 453-468. Berlin: Springer.

Hahn, U.; Schnattinger, K.; and Romacker, M. 1996. Automatic knowledge acquisition from medical texts. In AMIA'96 - Proc. AMIA Annual Fall Symposium. Beyond the Superhighway: Exploiting the Internet with Medical Informatics, 383-387.

Hastings, P. 1996. Implications of an automatic lexical acquisition system. In Wermter, S.; Riloff, E.; and Scheler, G., eds., Connectionist, Statistical and Symbolic Approaches to Learning in Natural Language Processing, 261-274. Berlin: Springer.

Neuhaus, P., and Hahn, U. 1996. Trading off completeness for efficiency: The PARSETALK performance grammar approach to real-world text parsing. In FLAIRS'96 - Proc. 9th Florida Artificial Intelligence Research Symposium, 60-65.

Rau, L.; Jacobs, P.; and Zernik, U. 1989. Information extraction and text summarization using linguistic knowledge acquisition. Information Processing & Management 25(4):419-428.

Schnattinger, K., and Hahn, U. 1996. A terminological qualification calculus for preferential reasoning under uncertainty. In KI'96 - Proc. 20th Annual German Conf. on Artificial Intelligence, 349-362. Berlin: Springer.