From: AAAI Technical Report FS-98-01. Compilation copyright © 1998, AAAI (www.aaai.org). All rights reserved. Automatic Information Extraction from Documents: A Tool for Intelligence and Law Enforcement Analysts Richard Lee Sterling Software 1650TysonsBlvd. #800 McLeanVA22102 rice@mclean.sterling.corn betweenthe entities and events. In the later systems, the messagesare also stored directly in that same data base. Typically,the entities of interest include: Abstract Wehavedevelopeda numberof systemsto aid analysts in trackingthe activities and associationsof variouscriminal elements. Thesesystemseach consist of a relational data base containing the details on the relevant entities, activities, andassociations,assortedtools for retrievingand analyzingthose details, includinglink analysis, anda tool for automaticallyextractingthose details fromdocuments. ¯ Individuals ¯ Organizations (govermnent, commercial, military, extralegal, etc) ¯ Places (anything from street addresses or coordinates to continents) ¯ Facilities (factories, airports, hotels, warehouses,etc) ¯ Documents(passports, driver’s licenses, bank books, etc) ¯ Money ¯ Vehicles(air, land, or sea) ¯ Drugs ¯ Weapons Introduction Onseveral projects, Sterling Software has been leading the development of software systems for the intelligence analyst. These systems have been designed to assist the analyst in discovering and keeping track of individuals and organizations involved in various activities such as drug trafficking, terrorism, insurgency, weaponsproliferation, etc. Theyalso provide the ability to tie these individuals and organizations to the facilities, vehicles, etc, they use, and to the events and activities they are involvedin. Eachsystem is a collection of tools integrated around a relational data base. The tools provide a variety of waysto search the data base and to view selected data from it; they include both a link display tool and a geographic display tool. Each system has a facility for storing incoming documents(messages) containing raw, free-text data. One key tool uses Natural Language Understanding (NLU) technology to automatically extract all the above-listed types of information from those messages and store that information appropriately in the data base. For each entity, the data base can contain the appropriate detailed information. The exact nature of the details dependsof course on the type of entity. For an individual, for example,the data base typically includes name,aliases, gender, marital status, citizenship, race, date of birth, occupation, hair color, eye color, height, and weight. This set of entities is largely independentof a particular analytical shop’s area of interest. The events, however, tend to be moreclosely tied to that area. For Counter-Drug analysts, the events include: ¯ Processing, purchasing, transporting, etc of drugs ¯ Planning, meeting, or communicatingabout any of the above ¯ Arrest of traffickers or seizure of drugs, money, weapons, etc The Data For Counter-Terrorism analysts, on the other hand, the events include: As noted, a relational data base forms the core of each of these systems. In that data base, all the pertinent information about each of the items of interest is stored. These items consist of entities, events, and associations ¯ ¯ ¯ ¯ ¯ Copyright ©1998,American Association for Artificial Intelligence (x~vw.aaai.org). Allrightsreserved. 63 Killing, kidnapping, hostage-taking, etc of people Bombing,hijacking, etc of buildings and vehicles Buying, stealing, etc of weaponsand money Training in weaponsand tactics Arrest, conviction, punishmentetc of terrorists ¯ Other government actions knownto provoke terrorist retaliation mappedto a single data base record, with each slot mapped to a data base field, but it is often morecomplicated. Each item in an association record is of course stored in the data base using the key for the correspondingitem record. The associations can be groupedinto three categories: ¯ Entity-to-entity ¯ Entity-to-event ¯ Event-to-event Technical Approach The IE’s analysis of text is accomplishedby applying large collections of patterns. The patterns for recognizing entity references use a combination of internal structures and external contextual clues, each involving both syntactic and semantic information about words and phrases. For example, the phrase "Analyst John Q. Jacobson of Marigold Financial Corp" contains both internal and external indicators for both the individual and the organization. Each reference thus recognized is reducedreplaced by a single token - with the pieces of recognized information (name, title, occupation, etc) distributed appropriately into slots in the attached frame. After the entity analysis is done, the relation analysis uses patterns which look for text containing the right combinations of relation-specifying words and phrases, entity tokens, and other bits of text. Simple examples would be "John’s wife Mary", "John, an analyst for Foobar Inc.", "membersof the Garcia Group Larry, Moe, and Curly", "an aircraft belonging to John", etc. Certain clauses involving state verbs are also recognized as expressing relations; for example, "John owns a Cessna", "John was born in Cleveland", "Boston is the headquarters of Foobar Inc.", etc. Each recognized chunk of text is again reduced, to a token representing the "head" entity, with a frame representing the relation information attached. Finally, event analysis uses clause-level patterns which look for appropriate combinations of verbs, prepositions, etc, along with the entity (plus data and time) tokens, recognize references to events of interest. These patterns also indicate howthe entities should fill the roles of the events; the analysis constructs the event frames accordingly,with a slot for each role. Each of these three stages actually involves multiple steps of pattern application and reduction. This is because Entity-to-entity associations indicate relations, the nature of which depend on the types of entities involved: Typical relations between two individuals might include blood relative, spouse, employee, neighbor, etc. Typical relations between an individual and an organization might include owner, head, employee, etc. Typical relations betweenan individual or organization and a vehicle might include owner and user. Typical relations between an individual and a place might include place of birth, residence, current location, etc. Entity-to-event associations indicate the roles the entities play in the event. For example, a bombingevent would typically have roles for bomber, victim, object, weapon, place, etc., each of which could be filled by one of more entities of the appropriate type(s). Each event would also have "roles" for any date and time information. Event-to-event associations include tying each communicating,meeting, planning (etc) event to the events discussed. Whenthe messagesare stored in the same data base, one special type of association ties each detailed item record to the message(s) the information camefrom. The Extraction On each of these projects, we have developed (or are developing) an Information Extraction (IE) tool to run as background process, for near-real-time extraction of information from incoming messages. The IE processes a single message at a time, finding all the kinds of information described above and generating the appropriate data base entries containing that information. The IE operates by first looking for phrases containing all the references to the entities of interest, plus any date and time references. It then analyzes the phrases and clauses containingthose references to fred all the entity-toentity associations (relations). It then analyzes the clauses for events of interest, assigning each entity reference, date and time in the clause to the appropriate role in the event. For each item found, it constructs a frame - a representation which categorizes each piece of pertinent informationby putting it into the appropriate slot. Onceall the item (entity, event, association) references have been thus identified and represented as frames, those that appear to actually refer to the same item are merged, with the result that (ideally) exactly one frame is left for each distinct extracted item. These frames are then translated into data base records. Typically, each frame is ¯ Entity references, particularly places, can be nested inside others ("Japan National Railways .... U.S. dollars", "MiamiInternational Airport") ¯ Entity references can provide strong contextual clues for others ("Suitomo Bank’s Tomiko Hashimoto .... a hospital in Kasumigaseki") ¯ Relation phrases can be nested (" John Doe, of Bostonbased Foobar Corp") 64 DATE: 27 Examples Here is a very brief example message body fragment, to demonstratethe processing steps and show the wealth of informationthat is extractedby an IE tool. PILOT PABLO GARCIA, COLOMBIAN,PPT 2324224, ARRIVED AT MIAMI INTERNATIONAL AIRPORT ON 27 JUN 97. HE WAS ARRESTED BY U.S. CUSTOMS AGENTS WHEN 300 KGS OF COCAINE WAS FOUND IN THE SPARE FUEL TANK OF HIS CESSNA FIREBAT. Theinitial processingstages reducethe entity, date, and time references, as well as key verb constructs, into tokens with attached frames. This actually involves recognizing place references (" MIAMI", "U.S.") inside larger entity references; the requisite relation flames are also constructed as part of reducingthe larger reference. The result (not showing the contents of the flames) can representedthus: <INDIVIDUAL> <ARRIVE> AT <FACILITY> <DATE>. <INDIVIDUAL> <ARREST-pas> <ORGANIZATION>WHEN<DRUGS> <FIND-pas> THE SPARE FUEL TANK OF <INDIVIDUAL> <VEHICLE>. ON BY IN ’s In this example,there is only one additional relation found, reducingthe last three tokens to just <VEHICLE>. The event processing finds a Movement event in the first sentence, and an Arrest event plus an (implied) Seizure event in the second. The merging of flames then decides that the three <INDIVIDUAL> frames all refer to the sameentity, which requires doing the pronominal referenceresolution. At this point, the frameswouldlook somethinglike this: INDIVIDUAL: NAME: PABLO GARCIA OCCUPATION: PILOT CITIZENSHIP: COLOMBIA ORGANIZATION: NAME: CUSTOMS TYPE: LEA COUNTY: US DRUGS: TYPE: COCAINE QUANTITY: 300 UNIT: KG VEHICLE: TYPE: AIR MANUFACTURER: CESSNA MODELNAME: FIREBAT MODELNUMBER: X J3 RELATION: TYPE: HASDOC ENT1 : <INDIVIDUAL> ENT2: <DOCUMENT> RELATION: TYPE: LOCATED ENT1: <FACILITY> ENT2: <PLACE> RELATION: TYPE: OWNS ENTI: <INDIVIDUAL> ENT2: <VEHICLE> EVENT: TYPE: MOVEMENT AGENT: <INDIVIDUAL> DESTINATION: <PLACE> DESTFAC: <FACILITY> ENDDATE: <DATE> EVENT: TYPE: ARREST ARRESTEE: <INDIVIDUAL> ARRESTER: <ORGANIZATION> PLACE: <PLACE> FACILITY: <FACILITY> BEGINDATE: <DATE> ENDDATE: <DATE> DOCUMENT: TYPE: PASSPORT NUMBER:2324224 PLACE: CITY: MIAMI COUNTRY: US EVENT: TYPE: SEIZURE SEIZEE: <DRUGS> SEIZER: <ORGANIZATION> PLACE: <PLACE> FACILITY: <FACILITY> BEG1NDATE: <DATE> ENDDATE: <DATE> FACILITY: TYPE: AIRPORT NAME: MIAMI INTERNATIONAL AIRPORT DATE: YEAR: 1997 MONTH: 06 65 Notice that while realizing the implied seizure is not difficult for IEC, the cross-sentence reasoning required to realize that the Movement event was actually a Transshipment involving the drugs and the aircraft is currently beyondits capabilities. from crop cultivation and precursor chemical production to delivery of the drugs into the United States. CDISwas designed around a Sybase relational data base which stores detailed information on all the entity types listed above, a dozen narcotics-related event types as outlined above, and the full assortmentof relations and roles. Wedeveloped and integrated into CDIS an IE tool called the Automated Templating System (ATS) which extracted all the relevant information from the flee-text portion of four different types of intelligence messages. On a test of entity extraction skills, the ATSscored much better than several counter-drug analysts on both speed and accuracy. Alta Analytics integrated their commerciallink analysis tool, Netmap, into CDIS. That tool displays myriads of entities, events, and associations placed in the data base by analysts and ATS. More Examples Here are a few more examples of relation phrases, to demonstrate the variety and nesting complexity that is handled by the IEC: 1. ANALYSTJOHN DOE OF XYZ CORP 2. JOHN DOE, WHOIS A MEMBEROF THE GARCIACARTAGENA GROUP, 3. JOHN’S WIFE MARY, A SECRETARY AT XYZ CORP, 4. THE GUZMANORGANIZATION, LOCATED IN MEDELLIN, 5. A CESSNA OWNED BY BOSTON-BASED XYZ CORP MUC6 The second system was developed for the DARPAsponsored Sixth Message Understanding Conference (MUC-6). It extracted and templated information Individuals, Organizations, Locations, Money,Dates, and Times, from Wall Street Journal articles. Its accuracy on those tasks compared very well with those of other participants. These would result in frames (and then data base records) for the obviousentities, and for relations: 1. Employee(individual - organization) 2. Employee(individual - organization) 3. Spouse(individual - individual); Employee(individual - organization) 4. Location (organization - place) 5. Location (organization - place); Owner(organization vehicle) ... respectively. (Obviously,the relation types are not cast in stone. If the analysts’ needs and the data base design warrant it, the distinction between "member" and "employee", or the more specific "analyst" and "secretary", can be maintained by the IEC.) Note that the result of reducing phrases such as the aboveis a token representing the head of the phrase, which is still available for use in event patterns. For example,if the larger text of the 3rd fragment was "JOHN’SWIFE MARY, A SECRETARY AT XYZ CORP, WAS ARRESTED FRIDAYFOR ..." then, in addition to the relation frames described above an Arrest Event frame with the Individual frame for "Mary"filling the Arrestee role would be produced. Example MDITDS The third system we developed an IE tool for was the Migration Defense Intelligence Threat Data System (MDITDS).This system is being built around a Memex data base - not, strictly speaking, a relational data base engine, but the design has separate tables for the primary entity types, a table for events, and the crucial Association table. The Information Extraction Component (IEC) was developed to extract the usual details on all the types of entities listed above, the various event types relevant to Counter-Terrorism analysts, and the usual assortment of relations and roles. Weexpect that the IEC will be delivered shortly, with the knowledge bases needed for extracting from newswire articles; work on knowledge bases for intelligence messages may be continued at a future time. The link analysis tool for MDITDS is being developed by OrionScientific. Systems Unnamed The fourth system, as yet unnamed,is in the early stages of design and prototype. It will be built around an Oracle relational data base, the design of whichwill be similar to that of the data bases for CDISand MDITDS and will reflect the lessons learned on those projects. CDIS The first system Sterling Software developed an IE capability for is called the Counter-Drug Intelligence System (CDIS), which was developed to support analysts tracking the entire spectrumof narcotics-related activities 66 The IE tool development will be a continuation of the work begun under MDITDS; that continuation will consist primarily of: References Osterholtz, L.; Lee, R.; and McNeilly, C. 1995. The Automated Templating System for Database Update from Unformatted Message Traffic. In 1995 ONDCP International Technology Symposium. ¯ Reworking of the code for greater modularity and portability ¯ Knowledgebase development for intelligence messages ¯ Knowledgebase and code development for additional event types ¯ Workon improving the handling of coreference Conclusions Lee, R. 1995. An NLToolset-Based System for MUC-6. In Proceedings of the SL, cth Message Understanding Conference, 249-261. Columbia, Maryland: DARPA; Morgan KaufmannPublishers. for Link Analysis Integrating an information extraction tool and a link analysis tool with a relational data base is a natural and proven way to provide the latter with plenty of usefully structured data to work with. The association information automatically placed in the data base by the former tool is the natural sourceof link data to drive that analysis. The bottom-up nature of our approach to information extraction means that entity extraction happens first, without regard to event extraction. Since the set of entity types varies little across problemdomains,it is relatively easy to port that portion of an IE from one application to another, as long as the style of documentsbeing processed is similar. Onelimitation of the IE tool is that it makesno effort to decide whether an item it has found in a message is the "same"as another item already in the data base; it errs on the side of assuming it is not. Thus, each message that refers to a "Pablo Garcia", or a "Cessna Firebat", for example, will result in a new record for that item being generated. It is up to the analyst to use various tools including link analysis - to decide that tworecords refer to the same item. This leads to a useful synergy between the two tools and the analyst. Wehave demonstrated that the state of the art in Artificial Intelligence - specifically, Natural Language Understanding - is advanced enough that we can implement a practical Information Extraction tool which populates relational data bases with detailed information from free-text messages. This information includes not just entities but events and associations. The information is structured in a way which enables useful analysis by meansof link analysis and other tools. Acknowledgments With the exception of the few weeks work on MUC-6,all work described herein was carried out on various USDoD contracts. This is published with the sponsoring agencies’ permissions. All versions of the IE tool were developed using the NLToolset, which is a product of Lockheed-Martin. 67