From: AAAI Technical Report FS-98-01. Compilation copyright © 1998, AAAI (www.aaai.org). All rights reserved. Applying Link Analysis on Automatically Extracted Information from Texts 1 (.KNOWledgeBase _Information Tools) Within KNOW-IT Woojin Paik, Elizabeth Liddy, Eileen Allen, Eric Brown, Andrew Farris, Robert Irwin, Jennifer Liddy, and Ian Niles TextWiseInc. 2-212Center for Scienceand Technology Syracuse, NY13244 {woojin,liz, eileen, eric, drew,rjirwin, jennifer, ihniles}@textwise.com Introduction A robust information extraction system, which can accurately and rapidly convert vast amountsof textual data into a semantic networktype knowledge representation scheme,will be a useful preprocessor for various link analysis applications. In this paper, we describe an end-to-end knowledgediscovery system which takes raw texts as input and generates a knowledge base whichcan be utilized for a variety ofinforrnation access systems, such as a visual knowledgebase browser and a question-answering system. mentionjust a few examples. Anotherdistinctive aspect of the extraction componentof KNOW-IT is that it constitutes a unique compromisebetween deep, domaindependent information extraction and shallow, domainindependent extraction. KNOW-IT’s pattern-matching heuristics and sense-disambiguation moduleprovide a representation of texts of any subject area. However, political content (particularly that of international significance) is provided with an enriched representation by meansof handcrafted templates for verbs and nouns relating, respectively, to political events and actors. Wewish to explore the possibility of using our visual knowledgebase browser as a link analyzer, and showthat it is possible to apply link analysis techniques to vast amountsof textual data through an automatic information extraction system. Oncethe extractions are available, they are loaded into a database wherethey can be accessed by the browser. The key innovation of the browser is that it seamlessly incorporates the organizational principles of both an ontology and a flat semantic network. One windowof the browser allows the user to navigate through the hierarchy of meaningrelationships between concepts, while a second windowpermits one to expand any node of this hierarchy into a flat networkrepresenting all of the fact-based relationships involving this node. Thus, the metaphorbehind the browseris to structure information in the form of a three-dimensional cone, whosedepth corresponds to the semantic relationships between concepts and whosebreadth represents the factual information extracted from text. Since this browseris a graphically navigable, automatically constructed index of all of the free-grained content expressedby input text, it will enable the analyst in a very short period of time to target precisely that information whichis relevant to a critical decision. First,. we will explain the KNOW-IT system. Then, underlying natural language processing techniques which comprisethe information extraction systemwill be described. Finally, browsingwill be discussed in the context of information retrieval, and the visual knowledge base browserwill be described. KNOW-IT Overview KNOW-IT is an integrated suite of robust tools which transform vast, unmanageablebodies of raw textual data into accessible and surveyableintelligence. The initial componentof the system is an information extraction module which maps open source documents into structured, unmnbiguous representations of their content. Unlike most other information extraction systems, the input is not restricted to newswiretext - it will encompass everything fi’om State Departmenttravel advisories, to online battlespace manuals, to WWW homepages, to Aside from being a stand-alone system, KNOW-IT can also be used to enhancethe performanceof other knowledge-based systems. Since the KNOW-IT extractions are sets of triples whichcan be easily mapped 1 This research and developmentproject has been supported in part by RomeLaboratory, USAFunder contract F30602-96C-0164 and DARPA under the High Performance KnowledgeBase (HPKB)project. 78 to both frame-based and logic-based knowledge representations, these extractions can be used by any knowledge-basedsystem to complementits own specialized base of common-senseor domainknowledge. Oncethe extractions have been loaded into a database, they can be called by a knowiedge-basedsystem wheneverit requires a repository of general, factual information to answer a query, perform an inference, or locate instances of a conceptor relation. then locating and recording the associated linguistic constructions, it is possible to construct complex,in-depth histographies of events and their specific relation to a given proper namethat can span several decades. In addition, our approachutilizes more general semantic relations to link concepts whichoccur in texts in contrast to the relations which are used in MUC. For example, MUCused semantic relations such as ’weapons used’ or ’victim’ in the ’terrorism in South America’ domain. However,we limited the semantic relations to fairly generic onessuch as affiliation, agent, duration, location, or point-in-time. Thesedifferences allowed simplification of the developedinformation extraction algorithm and the application of the information extraction system to domain-independentdata. Information Extraction In recent years, there has been increasedinterest in textual information extraction research using natural language processing techniques. The most commonmediumof storing knowledgeis texts; textual informationextraction is an approach to acquire knowledgefrom texts. The application of natural language processing techniques for knowledge discovery in the KNOW-IT system can be summarizedas conversion from 1) ’text’ to the ’extraction’ output; 2) ’extraction’ output to ’semantic representation’; and 3) ’semanticrepresentation’ to ’knowledgeproduct’. Manynoticeable research efforts have been in Message Understanding Conferences (MUC).The goal of MUC (Chincor, 1992) was to automatically extract information from newstexts to populate databases. Participants of MUC were given the task of extracting information about clearly defined event types (or domains) such as "terrorism in South America." For each event type, the MUC participants were given predeterminedcategories of information that their systems were required to extract. The Figure 1 shows input text to the KNOW-IT information extraction system. The pseudo-Standard Generalized MarkupLanguage (SGML)tags, which are enclosed between angle brackets, such as <HL>signals the beginning of the headline and <SO,DATE> shows that the source and date information of the documentwill follow the code. The approaches which were employed in MUC dependon the careful analysis of commonterminologies which are used in each event type. Thus, every participating systemhas to be reworkedeither to capture the typical roles of the exhaustivelist of entities which have potential to occur in the designatedevent or to identify all possible verbs whichcan be used to describe the event and the associated roles of the syntactic arguments of the verbs. These processes can take long periods of time, varying from a few weeksto several months. <HL> Annan to stay in Baghdad Friday-Sunday <SO, DATE>AFX-Europe, 2/18/98 During a meeting in Tehran with Iraqi Foreign Minister Mohammed Said al-Sahhaf, Kharazi called for cooperation with the UNand denouncedthe foreign military presence in the Gulf. Iran, which opposes the growingUSmilitary might in the Gulf, says that a U.S. attack on Iraq wouldserve Israel’s interests. Figure 1. Text Mostparticipating systems were successful in extracting relevant information; however,given that there are almost infinite numberof event types or subject domains,it does not seem feasible to build a domainindependenttextual information extraction system by following MUC’sone domain at a time approach. The documentprocessing module, which consists of various natural language processing programs, identifies various fields, clauses, parts-of-speech (POS), and punctuation in a text, and annotates a documentwith identifying tags for these units. Theidentification process occurs at the sentence, paragraph, and discourse levels and is a fundamentalprecursor to the concept-relationconcept (CRC)extraction step. The extracted CRCsform semantic networks which can be used for link analysis or other knowledgediscovery processes. Thus, we have taken an alternative approachto extract domainindependent, time-stamped information from various types of texts including newstexts by extracting information about namedentities without recognizing domainevents (Paik, 1994.) This particular approach takes advantages of the commonpractice among writers of including predictable information-rich linguistic constructions in close proximityto related proper names. By recognizing proper namesin a texts, 79 <PN> 0 30 31 7 1 1 7 2 30 31 7 3 50 4 6 5 7 7 6 category 30 stands for the nameof a person, 31 stands for a title, 7 stands for anameof a country, and 1 stands for a nameof a city. Thus, ’Tehran’ is categorized as a name of a city (1) whichis located in Iran whichis a country (7). Multi-part namessuch as ’Iraqi Foreign Minister Mohammed Said al-Sahhaf’ are identified as one unit and categorized by each part. Kamal Kharazi Foreign Minister Iran Tehran Iran MohammedSaid al-Sahhaf Foreign Minister Iran United Nations Persian Gulf Iran United States Finally, the KNOW-IT document processor delineates the discourse-level organization of a document’s content by assigning discourse component tags, such as ’<FA>’whichstands for factual information and ’<AN/FU>’ which represents for analysis and future, to each clause in the text. <FA> DuringliN alDT meetinglNNin[IN TehranlNPI1 with Iraqi Foreign Minister_Mohammed_Said_alSahhaflNPI2,I, KharazilNPI0calledlVBD for[IN cooperationlNN withlIN the[DT UN[NP[3and[CC denounced]VBDthelDT <cn>foreign[JJ military[JJ presence]NN</cn> in]IN thelDT GUlI]NP[4 </FA> Basedon the automatically annotated syntactic, semantic, and discourse level information about the text (shownin Figure 2 in conjunction with the output from semantic parsing), CRCtriples which form the semantic representation of the documentcontent are extracted from the text. CRCtriples consist of semantic relations, which are widely used in ConceptualGraphrelated studies (Sowa, 1984), and concepts, which are linked by the semantic relations. <FA> IranlNP[5 ,I, whichlWDT opposeslVBZthelDT growinglVt3G USINPI6 <cn>military[NN mightJNN</en> inlIN thelDT Gult]NP[4,[, says[VBZ</FA> <AN/FU> thatJWDT a[DT <en> U.S.JNP[6 attackJNN </en> on[iN Iraq[NP]7 wouldJMD servelVB<cn>Israel[NP[8 ’s[POSinterests[NNS </cn> .]. </AN/FU> Figure 2. ’extraction’ output For example, if’A opposes B’ then ’A’ is the agent of ’opposing’ and ’B’ is the object of ’opposing’. This semantic information is represented as the following: AGENT(oppose, A) OBJECT(oppose, B) During a meeting in Tehran with Iraqi Foreign Minister Mohammed Said al-Sahhaf, Kharazi called for cooperation with the UNand denouncedthe foreign military presencein the Gulf. Figure 2 showsthe annotated result from the POS tagger which assigns one of 48 possible grammatical forms such as preposition (IN), determiner (DT), singular noun (NN)to words. Each word in the text followed by a ’[’ symboland a POStag. AGENT (call for; Kamal Kharazi[person) OBJECT (call fol; cooperatiolO ASSOCIATION (cooperation,UnitedNations[organization) Certain phrases such as ’foreign military presence’ or ’military might’ are enclosed within the ’<cn>’ and ’</cn>’ tags, whichrepresents the boundariesof a type of noun phrase called complexnominals. Iran, which opposes the growingUSmilitary might in the Gulf, says that a U.S. attack on Iraq wouldserve Israel’s interests. AGENT (oppose,h’anIcounoy) OBJECT (oppose,militaly migho AFFILIATION (militaO, might,UnitedStates[counOy) LOCATION (militaly might,PersianGuljqregiol 0 Proper nmnes(PN) including multi-part names are identified and categorized according to a previously defined PNclassification scheme(Paik, et al, 1996.) ’NP’ and ’NPS’ are the POStags for proper names and each tag is followed by a numericid. For example,’Tehran’ is followedby singular PNPOStag, ’NP’, and the id, ’ 1 ’. This id ’ 1’ is indication that the informationabout the PN identified as 1 is shownin record number1 in the PN table at the beginning of the annotated document.The SGML tag, ’<PN>’signals the beginning of the PNtable. The first columnof the table is the record number.The second columnis the PNcategory id. For example, PN Figure 3. Semantic Representation Figure 3 showspartial output of the CRCextraction and the resulting semanticrepresentation of the text. Currently, the accuracy of CRCextraction, based on testing data whichare not used for training, for texts 80 such as reports, briefings, and area studies, is 90%and the coverageof the extraction is 75%.Accuracyis defined as a numberof correctly extracted CRCsdivided by the ntunber of all extracted CRCs.Coverageis defined as the numberof correctly extracted CRCsdivided by the numberof all correct CRCsin the testing data. at a time. Browsingcan be viewedas an alternative approachto the traditional informationretrieval tasks. According to Cove &Walsh (1987), browsing can be divided into three categories: ¯ Browsingis described as a type of searching where the initial criterion is only partially defined or known,and is characterized by the presenceof a search goal but not a strategy. ¯ ¯ Finally, Figure 4 showsa schematic view of a semantic networkwhichis a collection of extracted CRCs. Originally we developedthe visualization capability as a knowledgebase inspector/editor for the KNOW-IT system. The knowledge base inspector/editor was designed to be used by domainexperts to inspect or verify the integrity or the correcmessof an automatically constructed knowledgebase, as well as to edit the knowledgebase by adding or deleting entries. Subsequently, we have determined that KNOW-IT’s knowledgebase editor/inspector functions well as a semanticnetworkvisualizer, and is able to support all three browsingactivities described above. Thus, it is useful as a browserfor informationretrieval tasks. Persian Gulf milita~, might IX, ac "1 " growing I \t.e, I "St Figure 5 is a snap shot of the browser.This snapshot utihzes a knowledgebase concerning Iran. The knowledgebase was constructed by extracting information from documentswhich contain the country name, Iran, from NewYork Times documents published between July 1994 and December1996. There were 1,458 documentsin this data set. United States A windowin the browser, labeled as the Hyperbolic Browser,located on the upperleft side of Figure 5, is the ontology navigation windowdescribed in the KNOW-IT Overviewsection. This exampleshows the instances from the Religion portion of the ontology which occurred in the Iran knowledgebase. To explore these instances further, users can select one concept/nodefrom the window.Anyconcept/node in the left windowcan be selected in turn to further explore or drill downand retrieve relevant information about a specific concept. attack I Israel’s interest [/nefacto Figure 4. KnowledgeProduct Browsing of Semantic Searchbrowsing,whichis directed, specific, or goaloriented; Generalpurposebrowsing, which is semi-directed predictive, or purposive; and Serendipity browsing, which is undirected and not goal oriented The windowwith the label, ConceptVisualzer, on the right side of the screen, is the fiat semanticnetwork representing all of the fact-based relationships basedon the selected concept/node from the ontology window. Currently, the windowshowsthe state whena user has selected ’Muslim’from the left windowand selected ’Party of God’ successively from the initial right window view about ’Muslim.’ The Concept Visualizer window displays related informationabout a guerilla group called, ’Party of God’,includingthe facts that Iran supports the group, and the group is a radical anti-Israel movement. Networks Traditional task-oriented, informationretrieval systems have query-based, command-driven user interfaces. Users are required to express and submit their information needs in explicitly formedBooleanor natural languagequeries to the information retrieval systems. The systems then bring back a list of documentswhichare possibly relevant to users’ queries. Theusers usually go throughthe list to find whatis relevant by reading documentsin the list one 81 [] DEATH TOLLREACHES 00,; SOME SEEIRANIAN INFunt~to~i~ Ne-’~YOr~ TiiT~e$ 07e~5/~4 2g Theconcepts displayed currer~tty were found in the~fo/Iowing sentences: On Monday~newspaperher~.Clarin, pubHshed~e~cerpts t-it sa BUENOS AIRES, Argentina - As the death toll reached 80 on Venezue~nintelligence Monday,Argentines asserted that lran wielded a secret hand J Caracasand undergroun~ce|~sin South America of thep~-~ of God, the truck bombthat demolished the main community fundamen~list guerrilla group supported by iron__ Argentine Jewsone weekago. "Suspicions fall very heavily on tz~n, ° RubenEzra Beraja, a ~ Close Jewish group leader, said on Mondayafter meeting with the Argentine president, Carlos Saul Menern.Beraja is president of Delegation of Argentine Jewish Associations, a groups that was housedin the building destroyed 8~ " In addition to the confirmed dead, 24 people are listed as missing and 52 remain in hospitals. OnHonday, the Argentine judge investigating the bombing. Juan Close J Java,~plet Window Figure 5 Visual KnowledgeBase Browser Conclusion The bottom left windowshowsthe full text view of the documentwhichis the source of the extracted information and the bottom right windowshowsa specific sentence from which the visualized information was extracted. Wehave discussed automatic information extraction system, whichis a part of the larger knowledgediscovery system, and the browser, whichvisualizes the semantic networkrepresentation of the extracted information. Recently, we have constructed a knowledgebase from one year’s worth ofa newswire which consists of 410,000 documents. Fromthis documentset, 4.2 million CRCswere extracted and formed a knowledgebase. There were about 1.2 million unique concepts in the extracted CRCs. Webelieve that the browsingwhich begins with the selection of a concept in the ontology windowsupports the previously mentioned General browsing and Serendipity browsingtypes. In addition, the browser has the functionality whichallows users to type in a concept they wish to display, and this feature supports Search browsing. 82 Paik, W., Liddy, E.D., Yu, E. & McKenna,M. "Categorizing and Standardizing Proper Nounsfor Efficient Information Retrieval," CorpusProcessing for Lexical Acquisition, Boguraev, B. &Pustejovsky, J. (eds.), Cambridge,MA:MITPress, pp. 62-73, 1996. Webelieve our domain-independent information extraction systemcan be used to extend the potential of link analysis to informationwhichresides in a vast numberof textual documents.In addition, our unique combination of ontology and semantic network based visual browsingstrategy mightserve as an interesting methodto conduct link analysis. Paik, W. "Chronological Information Extraction System (CIES)," Dagstuhl-Seminar-Report: 79 on Summarizing Text for Intelligent Communication,EndresNiggermeyer,B., Hobbs, J. & Jones, K.S. (eds.), Wadern, Germany:IBFI, 1994. References Chincor, N. "MUC-4Evaluation Metrics," Proceedings of the Fourth Message Understanding Conference (MUC4), McLean,VAJune 16-18, 1992. Sowa,J., Conceptual Structures: Information Processing in Mind and Machine, Reading, MA:Addison-Wesley, 1984 Cove, J.F. & Walsh, B.C. "Browsing as a Meansof Online Text Retrieval," Information Services and Use, 1987. 83