Applying Link Analysis on Automatically Extracted Information from... Within KNOW-IT 1 (.KNOWledge

From: AAAI Technical Report FS-98-01. Compilation copyright © 1998, AAAI (www.aaai.org). All rights reserved.
Applying Link Analysis on Automatically Extracted Information from Texts
1 (.KNOWledgeBase _Information Tools)
Within KNOW-IT
Woojin Paik, Elizabeth Liddy, Eileen Allen, Eric Brown, Andrew Farris,
Robert Irwin, Jennifer Liddy, and Ian Niles
TextWiseInc.
2-212Center for Scienceand Technology
Syracuse, NY13244
{woojin,liz, eileen, eric, drew,rjirwin, jennifer, ihniles}@textwise.com
Introduction
A robust information extraction system, which can
accurately and rapidly convert vast amountsof textual
data into a semantic networktype knowledge
representation scheme,will be a useful preprocessor for
various link analysis applications. In this paper, we
describe an end-to-end knowledgediscovery system
which takes raw texts as input and generates a knowledge
base whichcan be utilized for a variety ofinforrnation
access systems, such as a visual knowledgebase browser
and a question-answering system.
mentionjust a few examples. Anotherdistinctive aspect
of the extraction componentof KNOW-IT
is that it
constitutes a unique compromisebetween deep, domaindependent information extraction and shallow, domainindependent extraction. KNOW-IT’s
pattern-matching
heuristics and sense-disambiguation moduleprovide a
representation of texts of any subject area. However,
political content (particularly that of international
significance) is provided with an enriched representation
by meansof handcrafted templates for verbs and nouns
relating, respectively, to political events and actors.
Wewish to explore the possibility of using our
visual knowledgebase browser as a link analyzer, and
showthat it is possible to apply link analysis techniques
to vast amountsof textual data through an automatic
information extraction system.
Oncethe extractions are available, they are loaded
into a database wherethey can be accessed by the
browser. The key innovation of the browser is that it
seamlessly incorporates the organizational principles of
both an ontology and a flat semantic network. One
windowof the browser allows the user to navigate
through the hierarchy of meaningrelationships between
concepts, while a second windowpermits one to expand
any node of this hierarchy into a flat networkrepresenting
all of the fact-based relationships involving this node.
Thus, the metaphorbehind the browseris to structure
information in the form of a three-dimensional cone,
whosedepth corresponds to the semantic relationships
between concepts and whosebreadth represents the
factual information extracted from text. Since this
browseris a graphically navigable, automatically
constructed index of all of the free-grained content
expressedby input text, it will enable the analyst in a very
short period of time to target precisely that information
whichis relevant to a critical decision.
First,. we will explain the KNOW-IT
system. Then,
underlying natural language processing techniques which
comprisethe information extraction systemwill be
described. Finally, browsingwill be discussed in the
context of information retrieval, and the visual knowledge
base browserwill be described.
KNOW-IT Overview
KNOW-IT
is an integrated suite of robust tools which
transform vast, unmanageablebodies of raw textual data
into accessible and surveyableintelligence. The initial
componentof the system is an information extraction
module which maps open source documents into
structured, unmnbiguous
representations of their content.
Unlike most other information extraction systems, the
input is not restricted to newswiretext - it will encompass
everything fi’om State Departmenttravel advisories, to
online battlespace manuals, to WWW
homepages, to
Aside from being a stand-alone system, KNOW-IT
can also be used to enhancethe performanceof other
knowledge-based systems. Since the KNOW-IT
extractions are sets of triples whichcan be easily mapped
1 This research and developmentproject has been supported in part by RomeLaboratory, USAFunder contract F30602-96C-0164 and DARPA
under the High Performance KnowledgeBase (HPKB)project.
78
to both frame-based and logic-based knowledge
representations, these extractions can be used by any
knowledge-basedsystem to complementits own
specialized base of common-senseor domainknowledge.
Oncethe extractions have been loaded into a database,
they can be called by a knowiedge-basedsystem
wheneverit requires a repository of general, factual
information to answer a query, perform an inference, or
locate instances of a conceptor relation.
then locating and recording the associated linguistic
constructions, it is possible to construct complex,in-depth
histographies of events and their specific relation to a
given proper namethat can span several decades.
In addition, our approachutilizes more general
semantic relations to link concepts whichoccur in texts
in contrast to the relations which are used in MUC.
For
example, MUCused semantic relations such as ’weapons
used’ or ’victim’ in the ’terrorism in South America’
domain. However,we limited the semantic relations to
fairly generic onessuch as affiliation, agent, duration,
location, or point-in-time. Thesedifferences allowed
simplification of the developedinformation extraction
algorithm and the application of the information
extraction system to domain-independentdata.
Information Extraction
In recent years, there has been increasedinterest in textual
information extraction research using natural language
processing techniques. The most commonmediumof
storing knowledgeis texts; textual informationextraction
is an approach to acquire knowledgefrom texts.
The application of natural language processing
techniques for knowledge discovery in the KNOW-IT
system can be summarizedas conversion from 1) ’text’ to
the ’extraction’ output; 2) ’extraction’ output to ’semantic
representation’; and 3) ’semanticrepresentation’ to
’knowledgeproduct’.
Manynoticeable research efforts have been in
Message Understanding Conferences (MUC).The goal
of MUC
(Chincor, 1992) was to automatically extract
information from newstexts to populate databases.
Participants of MUC
were given the task of extracting
information about clearly defined event types (or
domains) such as "terrorism in South America." For each
event type, the MUC
participants were given predeterminedcategories of information that their systems
were required to extract.
The Figure 1 shows input text to the KNOW-IT
information extraction system. The pseudo-Standard
Generalized MarkupLanguage (SGML)tags, which are
enclosed between angle brackets, such as <HL>signals
the beginning of the headline and <SO,DATE>
shows
that the source and date information of the documentwill
follow the code.
The approaches which were employed in MUC
dependon the careful analysis of commonterminologies
which are used in each event type. Thus, every
participating systemhas to be reworkedeither to capture
the typical roles of the exhaustivelist of entities which
have potential to occur in the designatedevent or to
identify all possible verbs whichcan be used to describe
the event and the associated roles of the syntactic
arguments of the verbs. These processes can take long
periods of time, varying from a few weeksto several
months.
<HL>
Annan to stay in Baghdad Friday-Sunday
<SO, DATE>AFX-Europe, 2/18/98
During a meeting in Tehran with Iraqi Foreign Minister
Mohammed
Said al-Sahhaf, Kharazi called for
cooperation with the UNand denouncedthe foreign
military presence in the Gulf.
Iran, which opposes the growingUSmilitary might in the
Gulf, says that a U.S. attack on Iraq wouldserve Israel’s
interests.
Figure 1. Text
Mostparticipating systems were successful in
extracting relevant information; however,given that there
are almost infinite numberof event types or subject
domains,it does not seem feasible to build a domainindependenttextual information extraction system by
following MUC’sone domain at a time approach.
The documentprocessing module, which consists
of various natural language processing programs,
identifies various fields, clauses, parts-of-speech (POS),
and punctuation in a text, and annotates a documentwith
identifying tags for these units. Theidentification process
occurs at the sentence, paragraph, and discourse levels
and is a fundamentalprecursor to the concept-relationconcept (CRC)extraction step. The extracted CRCsform
semantic networks which can be used for link analysis or
other knowledgediscovery processes.
Thus, we have taken an alternative approachto
extract domainindependent, time-stamped information
from various types of texts including newstexts by
extracting information about namedentities without
recognizing domainevents (Paik, 1994.) This particular
approach takes advantages of the commonpractice among
writers of including predictable information-rich
linguistic constructions in close proximityto related
proper names. By recognizing proper namesin a texts,
79
<PN>
0
30
31
7
1
1
7
2
30
31
7
3
50
4
6
5
7
7
6
category 30 stands for the nameof a person, 31 stands for
a title, 7 stands for anameof a country, and 1 stands for a
nameof a city. Thus, ’Tehran’ is categorized as a name
of a city (1) whichis located in Iran whichis a country
(7). Multi-part namessuch as ’Iraqi Foreign Minister
Mohammed
Said al-Sahhaf’ are identified as one unit and
categorized by each part.
Kamal Kharazi
Foreign Minister
Iran
Tehran
Iran
MohammedSaid al-Sahhaf
Foreign Minister
Iran
United Nations
Persian Gulf
Iran
United States
Finally, the KNOW-IT
document processor
delineates the discourse-level organization of a
document’s content by assigning discourse component
tags, such as ’<FA>’whichstands for factual information
and ’<AN/FU>’
which represents for analysis and future,
to each clause in the text.
<FA> DuringliN alDT meetinglNNin[IN TehranlNPI1
with
Iraqi Foreign Minister_Mohammed_Said_alSahhaflNPI2,I, KharazilNPI0calledlVBD
for[IN cooperationlNN withlIN the[DT UN[NP[3and[CC
denounced]VBDthelDT
<cn>foreign[JJ military[JJ presence]NN</cn> in]IN
thelDT GUlI]NP[4 </FA>
Basedon the automatically annotated syntactic,
semantic, and discourse level information about the text
(shownin Figure 2 in conjunction with the output from
semantic parsing), CRCtriples which form the semantic
representation of the documentcontent are extracted from
the text. CRCtriples consist of semantic relations, which
are widely used in ConceptualGraphrelated studies
(Sowa, 1984), and concepts, which are linked by the
semantic relations.
<FA> IranlNP[5 ,I, whichlWDT
opposeslVBZthelDT
growinglVt3G USINPI6
<cn>military[NN mightJNN</en> inlIN thelDT
Gult]NP[4,[, says[VBZ</FA>
<AN/FU>
thatJWDT a[DT <en> U.S.JNP[6 attackJNN
</en> on[iN Iraq[NP]7 wouldJMD
servelVB<cn>Israel[NP[8 ’s[POSinterests[NNS </cn> .].
</AN/FU>
Figure 2. ’extraction’ output
For example, if’A opposes B’ then ’A’ is the agent
of ’opposing’ and ’B’ is the object of ’opposing’. This
semantic information is represented as the following:
AGENT(oppose, A)
OBJECT(oppose, B)
During a meeting in Tehran with Iraqi Foreign Minister
Mohammed
Said al-Sahhaf, Kharazi called for
cooperation with the UNand denouncedthe foreign
military presencein the Gulf.
Figure 2 showsthe annotated result from the POS
tagger which assigns one of 48 possible grammatical
forms such as preposition (IN), determiner (DT),
singular noun (NN)to words. Each word in the text
followed by a ’[’ symboland a POStag.
AGENT
(call for; Kamal
Kharazi[person)
OBJECT
(call fol; cooperatiolO
ASSOCIATION
(cooperation,UnitedNations[organization)
Certain phrases such as ’foreign military presence’
or ’military might’ are enclosed within the ’<cn>’ and
’</cn>’ tags, whichrepresents the boundariesof a type of
noun phrase called complexnominals.
Iran, which opposes the growingUSmilitary might in the
Gulf, says that a U.S. attack on Iraq wouldserve Israel’s
interests.
AGENT
(oppose,h’anIcounoy)
OBJECT
(oppose,militaly migho
AFFILIATION
(militaO, might,UnitedStates[counOy)
LOCATION
(militaly
might,PersianGuljqregiol
0
Proper nmnes(PN) including multi-part names are
identified and categorized according to a previously
defined PNclassification scheme(Paik, et al, 1996.) ’NP’
and ’NPS’ are the POStags for proper names and each
tag is followed by a numericid. For example,’Tehran’ is
followedby singular PNPOStag, ’NP’, and the id, ’ 1 ’.
This id ’ 1’ is indication that the informationabout the PN
identified as 1 is shownin record number1 in the PN
table at the beginning of the annotated document.The
SGML
tag, ’<PN>’signals the beginning of the PNtable.
The first columnof the table is the record number.The
second columnis the PNcategory id. For example, PN
Figure 3. Semantic Representation
Figure 3 showspartial output of the CRCextraction
and the resulting semanticrepresentation of the text.
Currently, the accuracy of CRCextraction, based
on testing data whichare not used for training, for texts
80
such as reports, briefings, and area studies, is 90%and the
coverageof the extraction is 75%.Accuracyis defined as
a numberof correctly extracted CRCsdivided by the
ntunber of all extracted CRCs.Coverageis defined as the
numberof correctly extracted CRCsdivided by the
numberof all correct CRCsin the testing data.
at a time. Browsingcan be viewedas an alternative
approachto the traditional informationretrieval tasks.
According to Cove &Walsh (1987), browsing can
be divided into three categories:
¯
Browsingis described as a type of searching where
the initial criterion is only partially defined or known,and
is characterized by the presenceof a search goal but not a
strategy.
¯
¯
Finally, Figure 4 showsa schematic view of a
semantic networkwhichis a collection of extracted
CRCs.
Originally we developedthe visualization
capability as a knowledgebase inspector/editor for the
KNOW-IT
system. The knowledge base inspector/editor
was designed to be used by domainexperts to inspect or
verify the integrity or the correcmessof an automatically
constructed knowledgebase, as well as to edit the
knowledgebase by adding or deleting entries.
Subsequently, we have determined that KNOW-IT’s
knowledgebase editor/inspector functions well as a
semanticnetworkvisualizer, and is able to support all
three browsingactivities described above. Thus, it is
useful as a browserfor informationretrieval tasks.
Persian
Gulf
milita~,
might IX, ac "1
" growing
I \t.e,
I
"St
Figure 5 is a snap shot of the browser.This
snapshot utihzes a knowledgebase concerning Iran. The
knowledgebase was constructed by extracting
information from documentswhich contain the country
name, Iran, from NewYork Times documents published
between July 1994 and December1996. There were
1,458 documentsin this data set.
United
States
A windowin the browser, labeled as the Hyperbolic
Browser,located on the upperleft side of Figure 5, is the
ontology navigation windowdescribed in the KNOW-IT
Overviewsection. This exampleshows the instances
from the Religion portion of the ontology which occurred
in the Iran knowledgebase. To explore these instances
further, users can select one concept/nodefrom the
window.Anyconcept/node in the left windowcan be
selected in turn to further explore or drill downand
retrieve relevant information about a specific concept.
attack I
Israel’s
interest
[/nefacto
Figure 4. KnowledgeProduct
Browsing
of Semantic
Searchbrowsing,whichis directed, specific, or goaloriented;
Generalpurposebrowsing, which is semi-directed
predictive, or purposive; and
Serendipity browsing, which is undirected and not
goal oriented
The windowwith the label, ConceptVisualzer, on
the right side of the screen, is the fiat semanticnetwork
representing all of the fact-based relationships basedon
the selected concept/node from the ontology window.
Currently, the windowshowsthe state whena user has
selected ’Muslim’from the left windowand selected
’Party of God’ successively from the initial right window
view about ’Muslim.’ The Concept Visualizer window
displays related informationabout a guerilla group called,
’Party of God’,includingthe facts that Iran supports the
group, and the group is a radical anti-Israel movement.
Networks
Traditional task-oriented, informationretrieval systems
have query-based, command-driven
user interfaces. Users
are required to express and submit their information needs
in explicitly formedBooleanor natural languagequeries
to the information retrieval systems. The systems then
bring back a list of documentswhichare possibly relevant
to users’ queries. Theusers usually go throughthe list to
find whatis relevant by reading documentsin the list one
81
[] DEATH
TOLLREACHES
00,; SOME
SEEIRANIAN
INFunt~to~i~
Ne-’~YOr~
TiiT~e$ 07e~5/~4
2g
Theconcepts displayed currer~tty were found in the~fo/Iowing sentences:
On Monday~newspaperher~.Clarin, pubHshed~e~cerpts
t-it sa
BUENOS
AIRES, Argentina - As the death toll reached 80 on Venezue~nintelligence
Monday,Argentines asserted that lran wielded a secret hand J Caracasand undergroun~ce|~sin South America of thep~-~ of God, the
truck bombthat demolished the main community
fundamen~list guerrilla group supported by iron__
Argentine Jewsone weekago.
"Suspicions fall very heavily on tz~n, ° RubenEzra Beraja, a ~
Close
Jewish group leader, said on Mondayafter meeting with the
Argentine president, Carlos Saul Menern.Beraja is president of
Delegation of Argentine Jewish Associations, a
groups that was housedin the building destroyed 8~
"
In addition to the confirmed dead, 24 people are listed as
missing and 52 remain in hospitals.
OnHonday, the Argentine judge investigating
the bombing. Juan
Close
J Java,~plet Window
Figure 5 Visual KnowledgeBase Browser
Conclusion
The bottom left windowshowsthe full text view of
the documentwhichis the source of the extracted
information and the bottom right windowshowsa specific
sentence from which the visualized information was
extracted.
Wehave discussed automatic information extraction
system, whichis a part of the larger knowledgediscovery
system, and the browser, whichvisualizes the semantic
networkrepresentation of the extracted information.
Recently, we have constructed a knowledgebase
from one year’s worth ofa newswire which consists of
410,000 documents. Fromthis documentset, 4.2 million
CRCswere extracted and formed a knowledgebase.
There were about 1.2 million unique concepts in the
extracted CRCs.
Webelieve that the browsingwhich begins with the
selection of a concept in the ontology windowsupports
the previously mentioned General browsing and
Serendipity browsingtypes. In addition, the browser has
the functionality whichallows users to type in a concept
they wish to display, and this feature supports Search
browsing.
82
Paik, W., Liddy, E.D., Yu, E. & McKenna,M.
"Categorizing and Standardizing Proper Nounsfor
Efficient Information Retrieval," CorpusProcessing for
Lexical Acquisition, Boguraev, B. &Pustejovsky, J.
(eds.), Cambridge,MA:MITPress, pp. 62-73, 1996.
Webelieve our domain-independent information
extraction systemcan be used to extend the potential of
link analysis to informationwhichresides in a vast
numberof textual documents.In addition, our unique
combination of ontology and semantic network based
visual browsingstrategy mightserve as an interesting
methodto conduct link analysis.
Paik, W. "Chronological Information Extraction System
(CIES)," Dagstuhl-Seminar-Report: 79 on Summarizing
Text for Intelligent Communication,EndresNiggermeyer,B., Hobbs, J. & Jones, K.S. (eds.), Wadern,
Germany:IBFI, 1994.
References
Chincor, N. "MUC-4Evaluation Metrics," Proceedings
of the Fourth Message Understanding Conference (MUC4), McLean,VAJune 16-18, 1992.
Sowa,J., Conceptual Structures: Information Processing
in Mind and Machine, Reading, MA:Addison-Wesley,
1984
Cove, J.F. & Walsh, B.C. "Browsing as a Meansof
Online Text Retrieval," Information Services and Use,
1987.
83