T-REX: A Domain-Independent System for Automated Cultural Information Extraction

Massimiliano Albanese
V.S. Subrahmanian
University of Maryland Institute for Advanced Computer Studies
College Park, Maryland, USA
Cognitive Architecture for Reasoning about Adversaries
Introduction
▪ Several applications require the ability to extract fine-grained
information from huge text collections
» Intelligence agencies may need detailed information about
diverse cultural groups around the world in order to understand
and model their behavior
» A real-time “violence-watch” around the world would require
the ability to identify several attributes for every “violent event”
reported in the online press
▪ Traditional search engines
» Are not able to provide such information without sorting through
a long list of documents
» Are not able to integrate information from different sources
Key contributions
▪ Domain-independent framework for information extraction
» A schema describing the information the user wants to extract is
provided as an input
▪ Key features
» Scalability: the system is designed to scale massively to large volumes
of data
• It currently searches through 109 online news sites from 66 countries
around the world, processing about 45,000 articles/day (about 10 million
distinct URLs explored so far, with 7 million triples extracted)
» Multilingual support: the system is designed to work with different
languages
• English, Spanish and Chinese
» Flexibility: several elements can be easily customized
• List of sources, topics of interest, type of information to extract
T-REX architecture
Crawling and parsing
Multilingual Annotation Interface
Sentence being annotated
Parse tree edit panel
List of triples that can be extracted from the sentence
Constraint selection panel
Annotation Process: Motivation
▪ The same fact can be reported in many slightly different ways
» At least 73 civilians were killed February 1 in simultaneous suicide
bombings at a Hilla market
» More than 73 civilians were massacred in February in suicide attacks at
a Hilla marketplace
» 74 people were killed on February 1, 2007 in multiple bombings at a
Hilla market
▪ Other similar events may be reported through similar sentences, describing
the same set of attributes
» About 23 U.S. soldiers were killed in August 2005 in a suicide attack in
Baghdad
▪ Sentences describing the same type of fact in slightly different ways can be
grouped into a single class
» Learning an “extraction rule” for each class of interest to a given
application makes it possible to extract the desired information from any article
Annotation Process: Step 1
At least 73 civilians were killed
February 1 in simultaneous suicide
bombings at a Hilla market
The annotator is presented with
one or more parse trees for the
sample sentence
Annotation Process: Step 2
The annotator marks as
“variable” all the nodes that
may have different text in other
sentences of the same class
Annotation Process: Step 3
If needed, the annotator adds
constraints to variable nodes
Annotation Process: Constraints
▪ IS_ENTITY
» restricts a noun phrase to be a “named entity”
▪ IS_DATE
» restricts a noun phrase to be a temporal expression
▪ X_VERBS
» restricts a verb to be any member of a class X of verbs
• e.g. the constraint MURDER_VERBS requires a verb to be any of
the following: kill, assassinate, murder, execute, etc.
▪ X_NOUNS
» restricts a noun to be any member of a class X of nouns
• e.g. the constraint ATTACK_NOUNS requires a noun to be any of
the following: assault, attack, clash, etc.
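Each constraint above can be read as a predicate over a parse-tree node. The following is a minimal Python sketch under several assumptions: the Node type, the candidate_lemmas helper and the CONSTRAINTS table are hypothetical names, the word classes contain only the members listed on this slide, and the IS_ENTITY and IS_DATE checks are crude stand-ins for a real named-entity recognizer and temporal tagger, not the tests T-REX actually uses.

```python
import re
from dataclasses import dataclass


@dataclass
class Node:
    label: str  # Treebank tag, e.g. "NP" or "VBN"
    text: str   # surface text covered by the node


# Word classes: only the members named on the slide (the real classes are larger).
MURDER_VERBS = {"kill", "assassinate", "murder", "execute"}
ATTACK_NOUNS = {"assault", "attack", "clash"}

# Crude temporal pattern (assumption): a month name or a four-digit year.
DATE_RE = re.compile(
    r"\b(January|February|March|April|May|June|July|August|"
    r"September|October|November|December)\b|\b(19|20)\d{2}\b"
)


def candidate_lemmas(word):
    """Toy lemmatizer: the word itself plus versions with common suffixes removed."""
    w = word.lower()
    forms = {w}
    for suffix in ("ed", "d", "es", "s", "ing"):
        if w.endswith(suffix):
            forms.add(w[: -len(suffix)])
    return forms


def is_entity(node):
    # Stand-in for named-entity recognition: a noun phrase with a capitalized token.
    return node.label.startswith("NP") and any(
        tok[:1].isupper() for tok in node.text.split()
    )


def is_date(node):
    return node.label.startswith("NP") and DATE_RE.search(node.text) is not None


def in_word_class(words, label_prefix):
    """Build an X_VERBS / X_NOUNS style predicate for a given word class."""
    def check(node):
        if not node.label.startswith(label_prefix) or not node.text.split():
            return False
        head = node.text.split()[-1]  # assumption: the last token is the head word
        return bool(candidate_lemmas(head) & words)
    return check


CONSTRAINTS = {
    "IS_ENTITY": is_entity,
    "IS_DATE": is_date,
    "MURDER_VERBS": in_word_class(MURDER_VERBS, "VB"),
    "ATTACK_NOUNS": in_word_class(ATTACK_NOUNS, "N"),
}
```

For instance, CONSTRAINTS["MURDER_VERBS"](Node("VBN", "killed")) holds, while the same check on Node("VBN", "wounded") does not.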
Annotation Process: Step 4
The annotator describes the
semantics of the annotated sentence
in terms of triples, mapping attributes
to variable nodes
Annotations in Multiple Languages
English
Spanish (Español)
Chinese simplified (中文)
Rule Extraction Engine
▪ An extraction rule has the form Head ← Body
▪ A rule is learned through the following steps
» abstraction
• each variable node is assigned a numeric
identifier, its text and child nodes are
removed
› the model becomes independent of the
particular sentence
» body definition
• the body of the rule is built by serializing
the parse tree of the annotated sentence
in Treebank II Style
» head definition
• the head is defined as a conjunction of
RDF statements, one for each triple
defined in the last step of the annotation
process
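The three steps above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the T-REX implementation: the Node class, the function names, the "Var#n" placeholder syntax in the serialized body, and the "AND" / "<-" notation for the rule are all hypothetical, and the head triples are shown as plain tuples rather than real RDF statements.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    label: str                             # Treebank tag
    text: Optional[str] = None             # word, for leaf nodes
    children: List["Node"] = field(default_factory=list)
    variable: bool = False                 # marked by the annotator
    var_id: Optional[int] = None           # assigned during abstraction


def abstract(node, counter=None):
    """Abstraction: number variable nodes (pre-order) and drop their text and children."""
    if counter is None:
        counter = [0]
    if node.variable:
        counter[0] += 1
        return Node(node.label, variable=True, var_id=counter[0])
    return Node(node.label, node.text, [abstract(c, counter) for c in node.children])


def serialize(node):
    """Body definition: bracketed serialization in the style of Treebank II."""
    if node.variable:
        return f"({node.label} Var#{node.var_id})"
    if not node.children:
        return f"({node.label} {node.text})"
    return f"({node.label} " + " ".join(serialize(c) for c in node.children) + ")"


def learn_rule(tree, triples):
    """Head definition: one statement per annotated (subject, attribute, variable) triple."""
    body = serialize(abstract(tree))
    head = " AND ".join(f"({s}, {p}, Var#{v})" for s, p, v in triples)
    return f"{head} <- {body}"


# Toy annotated sentence: "soldiers were killed in Baghdad"
tree = Node("S", children=[
    Node("NP", "soldiers", variable=True),
    Node("VP", children=[
        Node("VBD", "were"),
        Node("VBN", "killed", variable=True),
        Node("PP", children=[Node("IN", "in"), Node("NP", "Baghdad", variable=True)]),
    ]),
])
rule = learn_rule(tree, [("KillingEvent", "victim", 1), ("KillingEvent", "location", 3)])
```

Because the variable nodes lose their text and children, the resulting body no longer mentions "soldiers", "killed" or "Baghdad", which is what makes the rule independent of the particular sentence.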
Rule Matching Engine (1/2)
▪ Extracts RDF triples by matching sentences from the texts being
analyzed against the set of extraction rules
Continuously fetches documents
relevant to the application of interest
If the parse tree of a sentence satisfies
the condition in the body of a rule, an
RDF triple is instantiated for each
statement in the head of the rule
CompareNodes() determines if the
parse tree of a sentence satisfies the
condition in the body of a rule
Rule Matching Engine (2/2)
▪ CompareNodes() recursively explores the parse tree of the
sentence being processed and the annotated parse tree of a rule
Checks satisfaction of constraints
for variable nodes
Checks constant nodes
Pairwise compares child nodes
of non-terminal nodes
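The three checks listed above suggest a recursive comparison along the following lines. This Python sketch is only one plausible reading of the slide: the Node class, the surface helper, the bindings dictionary and the requirement that labels match at every node are assumptions, and the constraint shown is a toy stand-in for a verb class such as MURDER_VERBS.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional


@dataclass
class Node:
    label: str                             # Treebank tag
    text: str = ""                         # word, for leaf nodes
    children: List["Node"] = field(default_factory=list)
    var_id: Optional[int] = None           # set only on variable nodes of a rule
    constraint: Optional[Callable[["Node"], bool]] = None


def surface(node):
    """Text covered by a node: its own word, or the joined words of its leaves."""
    if not node.children:
        return node.text
    return " ".join(surface(c) for c in node.children)


def compare_nodes(rule, sent, bindings):
    """Return True if the sentence subtree satisfies the rule subtree.

    Variable bindings are collected in `bindings`; on failure the dictionary
    may hold partial bindings, so callers should pass a fresh one per attempt.
    """
    if rule.label != sent.label:
        return False
    if rule.var_id is not None:
        # Variable node: check its constraint (if any), then bind the text.
        if rule.constraint is not None and not rule.constraint(sent):
            return False
        bindings[rule.var_id] = surface(sent)
        return True
    if not rule.children:
        # Constant node: the sentence must have the same word here.
        return not sent.children and rule.text.lower() == sent.text.lower()
    # Non-terminal node: compare child nodes pairwise.
    if len(rule.children) != len(sent.children):
        return False
    return all(compare_nodes(r, s, bindings)
               for r, s in zip(rule.children, sent.children))


def murder_verb(node):
    # Toy constraint (assumption): inflected forms listed explicitly.
    return surface(node).lower() in {"killed", "murdered"}


rule = Node("S", children=[
    Node("NP", var_id=1),
    Node("VP", children=[Node("VBD", "were"),
                         Node("VBN", var_id=2, constraint=murder_verb)]),
])
sentence = Node("S", children=[
    Node("NP", children=[Node("NNS", "soldiers")]),
    Node("VP", children=[Node("VBD", "were"), Node("VBN", "killed")]),
])
bindings: Dict[int, str] = {}
matched = compare_nodes(rule, sentence, bindings)
```

Here matched is True and bindings maps variable 1 to "soldiers" and variable 2 to "killed"; replacing "killed" with a verb outside the class makes the comparison fail.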
Example of Matching
e.g. “About 23 U.S. soldiers were killed August 23 in a suicide attack in Baghdad”
The sentence satisfies the body of the rule
Var#1 = “About 23”
Var#2 = “U.S. soldiers”
Var#3 = “were”
Var#4 = “killed”
Var#5 = “August 23”
Var#6 = “a suicide attack”
Var#7 = “Baghdad”
(KillingEvent9,victim,U.S. soldiers)
(KillingEvent9,numberOfVictims,about 23)
(KillingEvent9,date,August 23)
(KillingEvent9,location,Baghdad)
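Once the variables are bound, producing the triples is a simple substitution into the head of the rule. The Python sketch below replays the example above; the HEAD template, the instantiate function and the way the event identifier is built are assumptions made for illustration, and the bound text is copied verbatim, with no normalization of case or number format.

```python
# Variable bindings obtained by matching the example sentence against the rule.
bindings = {
    1: "About 23",
    2: "U.S. soldiers",
    3: "were",
    4: "killed",
    5: "August 23",
    6: "a suicide attack",
    7: "Baghdad",
}

# Head of the rule (assumed shape): one (attribute, variable id) pair per triple.
HEAD = [
    ("victim", 2),
    ("numberOfVictims", 1),
    ("date", 5),
    ("location", 7),
]


def instantiate(event_type, event_number, head, bindings):
    """Create one (subject, attribute, value) triple per statement in the head."""
    subject = f"{event_type}{event_number}"
    return [(subject, attribute, bindings[var]) for attribute, var in head]


triples = instantiate("KillingEvent", 9, HEAD, bindings)
for triple in triples:
    print(triple)
```

Variables 3, 4 and 6 take part in the match but are not referenced by the head, so they contribute no triple.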
Example of extracted data (1/2)
At least 22 Hindus were killed by suspected Muslim militants in India's Jammu and Kashmir state
Monday, the police said
Event data
Example of extracted data (2/2)
Link depth 2 from
Pushtuns
T-REX implementation
▪ The implementation of T-REX consists of several components
running on different nodes of a distributed system
» Multilingual Annotation Interface: a web-based tool that is part
of the web interface of T-REX (implemented as a Java Applet)
» Annotated RDF Database System for storage of annotated
RDF triples: the underlying relational DBMS is PostgreSQL 8.2
» Rule Matching Engine: a pipeline of several components
• Crawler: explores news sources for relevant documents
• Parsers for every language: process sentences from relevant
documents, producing constituent trees in Treebank II Style
• Extractor: implements the Rule Matching Engine logic
▪ Distribution, database partitioning, and multithreading ensure
scalability
Conclusions
▪ We have presented a general, multilingual and flexible
framework for information extraction
» Domain-specific applications are enabled by targeting the
extraction to the instantiation of a schema of interest
» Addition of other languages is a relatively simple task, once a set
of linguistic resources is available for those languages
▪ We have implemented a complex prototype that has proved to
» effectively extract information for different applications
» scale massively
▪ Future efforts will be devoted to
» defining pruning strategies to make the extraction process faster
» defining strategies to manage inconsistencies in the extracted data
» extending the system to other languages (mainly Asian languages)