
Discovering and Utilizing Structure in Large Unstructured Text Datasets

Eugene Agichtein
Math and Computer Science Department
Information Extraction Example

Information extraction systems represent text in structured form:

  May 19 1995, Atlanta -- The Centers for Disease Control
  and Prevention, which is in the front line of the world's
  response to the deadly Ebola epidemic in Zaire,
  is finding itself hard pressed to cope with the crisis…

The information extraction system produces the relation
Disease Outbreaks in The New York Times:

  Date        Disease Name      Location
  Jan. 1995   Malaria           Ethiopia
  July 1995   Mad Cow Disease   U.K.
  Feb. 1995   Pneumonia         U.S.
  May 1995    Ebola             Zaire
How can information extraction help?

Converting a large text collection into a structured relation can:

  … allow precise and efficient querying
  … allow returning answers instead of documents
  … support powerful query constructs
  … allow data integration with (structured) RDBMS
  … provide input to data mining & statistical analysis
Goal: Detect, Monitor, Predict Outbreaks

[Architecture diagram] Multiple information extraction systems
(IE Sys 1-4) feed Data Integration, Data Mining, and Trend Analysis,
which in turn support Detection, Monitoring, and Prediction.
Input sources include:
  - Current patient records: diagnosis, physician's notes,
    lab results/analysis, …
  - Hospital records, 911 calls, traffic accidents, …
  - Historical news, breaking news stories, wire, alerts, …
Challenges in Information Extraction

Portability
  - Reduce effort to tune for new domains and tasks
  - MUC systems: experts would take 8-12 weeks to tune

Scalability, Efficiency, Access
  - Enable information extraction over large collections
  - 1 sec / document * 5 billion docs = 158 CPU years

Approach: learn from data ("Bootstrapping")
  - Snowball: Partially Supervised Information Extraction
  - Querying Large Text Databases for Efficient Information Extraction
Outline

- Information extraction overview
- Partially supervised information extraction
  - Adaptivity
  - Confidence estimation
- Text retrieval for scalable extraction
  - Query-based information extraction
  - Implicit connections/graphs in text databases
- Current and future work
  - Inferring and analyzing social networks
  - Utility-based extraction tuning
  - Multi-modal information extraction and data mining
  - Authority/trust/confidence estimation
What is “Information Extraction”

As a task: filling slots in a database from sub-segments of text.

  October 14, 2002, 4:00 a.m. PT

  For years, Microsoft Corporation CEO Bill Gates railed against the
  economic philosophy of open-source software with Orwellian fervor,
  denouncing its communal licensing as a "cancer" that stifled
  technological innovation.

  Today, Microsoft claims to "love" the open-source concept, by which
  software code is made public to encourage improvement and
  development by outside programmers. Gates himself says Microsoft
  will gladly disclose its crown jewels--the coveted code behind the
  Windows operating system--to select customers.

  "We can be open source. We love the concept of shared source," said
  Bill Veghte, a Microsoft VP. "That's a super-important shift for us
  in terms of code access."

  Richard Stallman, founder of the Free Software Foundation,
  countered saying…

IE fills the slots NAME, TITLE, ORGANIZATION:

  NAME               TITLE     ORGANIZATION
  Bill Gates         CEO       Microsoft
  Bill Veghte        VP        Microsoft
  Richard Stallman   founder   Free Soft..
What is “Information Extraction”

As a family of techniques:

  Information Extraction =
    segmentation + classification + association + clustering

Applied to the same story: segmentation finds the mentions,
classification types them, association links them into relations, and
clustering groups coreferent mentions (marked * below):

  * Microsoft Corporation
    CEO
    Bill Gates
  * Microsoft
    Gates
  * Microsoft
    Bill Veghte
  * Microsoft
    VP
    Richard Stallman
    founder
    Free Software Foundation
IE in Context

[Pipeline diagram] Spider → Filter by relevance → IE (Segment,
Classify, Associate, Cluster) → Load DB → Query, Search, Data mine.
Supporting steps: create ontology, label training data, train
extraction models. The document collection feeds the pipeline; the
database holds the extracted results.
Information Extraction Tasks

Extracting entities and relations
  - Entities
    - Named (e.g., Person)
    - Generic (e.g., disease name)
  - Relations
    - Entities related in a predefined way (e.g., Location of a
      Disease outbreak)
    - Discovered automatically

Common information extraction steps:
  - Preprocessing: sentence chunking, parsing, morphological analysis
  - Rules/extraction patterns: manual, machine learning, and hybrid
  - Applying extraction patterns to extract new information
  - Postprocessing and complex extraction (not covered):
    - Co-reference resolution
    - Combining relations into events, rules, …
Two kinds of IE approaches

Knowledge Engineering
  - rule based
  - developed by experienced language engineers
  - makes use of human intuition
  - requires only a small amount of training data
  - development can be very time consuming
  - some changes may be hard to accommodate

Machine Learning
  - uses statistics or other machine learning
  - developers do not need language-engineering (LE) expertise
  - requires large amounts of annotated training data
  - some changes may require re-annotation of the entire training corpus
  - annotators are cheap (but you get what you pay for!)
Extracting Entities from Text

Approaches to entity extraction, illustrated on the sentence
"Abraham Lincoln was born in Kentucky.":

  - Classify pre-segmented candidates: a classifier assigns a class
    to each candidate phrase.
  - Lexicons: test membership in a list (Alabama, Alaska, …,
    Wisconsin, Wyoming).
  - Boundary models: classifiers detect entity BEGIN and END
    boundaries.
  - Sliding window: classify each window of words, trying alternate
    window sizes.
  - Finite state machines: find the most likely state sequence for
    the word sequence.
  - Context free grammars: parse the sentence (NNP NNP V P … into NP,
    VP, PP, S).
  …and beyond.

Any of these models can be used to capture words, formatting, or both.
Hidden Markov Models

An HMM is both a graphical model and a finite state model: the state
sequence s_1 s_2 … generates the observation sequence o_1 o_2 …
through state transitions and per-state emissions.

  P(s, o) = ∏_{t=1}^{|o|} P(s_t | s_{t-1}) · P(o_t | s_t)

Parameters, for all states S = {s1, s2, …}:
  - Start state probabilities: P(s_t)
  - Transition probabilities: P(s_t | s_{t-1})
  - Observation (emission) probabilities: P(o_t | s_t), usually a
    multinomial over an atomic, fixed alphabet

Training: maximize the probability of the training observations
(w/ prior).
IE with Hidden Markov Models

Given a sequence of observations:

  Yesterday Lawrence Saul spoke this example sentence.

and a trained HMM, find the most likely state sequence (Viterbi):

  argmax_s P(s, o)

Any words said to be generated by the designated "person name" state
are extracted as a person name:

  Person name: Lawrence Saul
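To make the decoding step concrete, here is a minimal Viterbi sketch
in Python. The two-state model (PERSON vs. OTHER), its probabilities,
and the capitalization-based emission heuristic are all invented for
illustration; they are not the trained models described on these slides.

    import math

    # Toy HMM: states, start/transition/emission probabilities. All
    # numbers are invented for illustration, not learned from data.
    states = ["OTHER", "PERSON"]
    start = {"OTHER": 0.8, "PERSON": 0.2}
    trans = {"OTHER": {"OTHER": 0.9, "PERSON": 0.1},
             "PERSON": {"OTHER": 0.4, "PERSON": 0.6}}

    def emit(state, word):
        # Toy emission heuristic: capitalized words look more like names.
        capitalized = word[0].isupper()
        if state == "PERSON":
            return 0.9 if capitalized else 0.1
        return 0.3 if capitalized else 0.7

    def viterbi(words):
        # delta[s]: best log-probability of a state path ending in s.
        delta = {s: math.log(start[s]) + math.log(emit(s, words[0]))
                 for s in states}
        backpointers = []
        for word in words[1:]:
            prev, delta, step = delta, {}, {}
            for s in states:
                best = max(states,
                           key=lambda p: prev[p] + math.log(trans[p][s]))
                delta[s] = (prev[best] + math.log(trans[best][s])
                            + math.log(emit(s, word)))
                step[s] = best
            backpointers.append(step)
        # Follow back-pointers to recover the most likely state sequence.
        last = max(states, key=lambda s: delta[s])
        path = [last]
        for step in reversed(backpointers):
            path.append(step[path[-1]])
        return list(reversed(path))

    words = "Yesterday Lawrence Saul spoke this example sentence".split()
    # Words tagged PERSON form the extracted person name.
    print(list(zip(words, viterbi(words))))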
HMM Example: “Nymble”  [Bikel et al. 1998], [BBN “IdentiFinder”]

Task: Named Entity Extraction. States: start-of-sentence,
end-of-sentence, Person, Org, Other (plus five other name classes).
Train on 450k words of news wire text.

Transition probabilities: P(s_t | s_{t-1}, o_{t-1}), backing off to
P(s_t | s_{t-1}), then P(s_t).
Observation probabilities: P(o_t | s_t, s_{t-1}) or
P(o_t | s_t, o_{t-1}), backing off to P(o_t | s_t), then P(o_t).

Results:

  Case    Language   F1
  Mixed   English    93%
  Upper   English    91%
  Mixed   Spanish    90%

Other examples of shrinkage for HMMs in IE: [Freitag and McCallum '99]
Relation Extraction

Extract structured relations from text:

  May 19 1995, Atlanta -- The Centers for Disease Control
  and Prevention, which is in the front line of the world's
  response to the deadly Ebola epidemic in Zaire,
  is finding itself hard pressed to cope with the crisis…

The information extraction system produces the relation
Disease Outbreaks in The New York Times:

  Date        Disease Name      Location
  Jan. 1995   Malaria           Ethiopia
  July 1995   Mad Cow Disease   U.K.
  Feb. 1995   Pneumonia         U.S.
  May 1995    Ebola             Zaire
Relation Extraction

Typically requires entity tagging as preprocessing.

Knowledge Engineering
  - Rules defined over lexical items:
    “<company> located in <location>”
  - Rules defined over parsed text:
    “((Obj <company>) (Verb located) (*) (Subj <location>))”
  - Proteus, GATE, …

Machine Learning-based
  - Learn rules/patterns from examples:
    Dan Roth 2005, Cardie 2006, Mooney 2005, …
  - Partially-supervised: bootstrap from “seed” examples:
    Agichtein & Gravano 2000, Etzioni et al. 2004, …

Recently, hybrid models [Feldman 2004, 2006]
Comparison of Approaches

Significant effort: use “language-engineering” environments to help
experts create extraction patterns
  - GATE [2002], Proteus [1998]

Substantial effort: train the system over manually labeled data
  - Soderland et al. [1997], Muslea et al. [2000], Riloff et al. [1996]

Minimal effort: exploit large amounts of unlabeled data
  - DIPRE [Brin 1998], Snowball [Agichtein & Gravano 2000]
  - Etzioni et al. (’04): KnowItAll: extracting unary relations
  - Yangarber et al. (’00, ’02): pattern refinement, generalized
    names detection
The Snowball System: Overview

Starting from a handful of example tuples (1), Snowball (2) scans the
text database (3) and produces an extended relation with confidence
scores:

  Organization         Location        Conf
  Microsoft            Redmond         1
  IBM                  Armonk          1
  Intel                Santa Clara     1
  AG Edwards           St Louis        0.9
  Air Canada           Montreal        0.8
  7th Level            Richardson      0.8
  3Com Corp            Santa Clara     0.8
  3DO                  Redwood City    0.7
  3M                   Minneapolis     0.7
  MacWorld             San Francisco   0.7
  ...                  ...             ...
  157th Street         Manhattan       0.52
  15th Party Congress  China           0.3
  15th Century Europe  Dark Ages       0.1
Snowball: Getting User Input  (ACM DL 2000)

[Snowball loop: Get Examples → Find Example Occurrences in Text →
Tag Entities → Generate Extraction Patterns → Extract Tuples →
Evaluate Tuples]

User input:
  - a handful of example instances:

      Organization   Headquarters
      Microsoft      Redmond
      IBM            Armonk
      Intel          Santa Clara

  - integrity constraints on the relation,
    e.g., Organization is a “key”, Age > 0, etc…
Snowball: Finding Example Occurrences

Can use any full-text search engine to find occurrences of the seed
tuples (Microsoft/Redmond, IBM/Armonk, Intel/Santa Clara) in the text
database:

  Computer servers at Microsoft’s headquarters in Redmond…
  In mid-afternoon trading, shares of Redmond, WA-based Microsoft Corp
  The Armonk-based IBM introduced a new line…
  Change of guard at IBM Corporation’s headquarters near Armonk, NY ...
Snowball: Tagging Entities

Named entity taggers can recognize Dates, People, Locations,
Organizations, … (MITRE’s Alembic, IBM’s Talent, LingPipe, …).
The tagger marks the organization and location mentions (bracketed
below) in each occurrence:

  Computer servers at [Microsoft]’s headquarters in [Redmond]…
  In mid-afternoon trading, shares of [Redmond, WA]-based [Microsoft Corp]
  The [Armonk]-based [IBM] introduced a new line…
  Change of guard at [IBM Corporation]’s headquarters near [Armonk, NY] ...
Snowball: Extraction Patterns

  Computer servers at Microsoft’s headquarters in Redmond…

General extraction pattern model:

  acceptor0, Entity, acceptor1, Entity, acceptor2

Acceptor instantiations (a sketch of the vector-space acceptor
follows below):
  - String match (accepts the string “’s headquarters in”)
  - Vector-space (~ vector [(’s, 0.5), (headquarters, 0.5), (in, 0.5)])
  - Sequence classifier (Prob(T = valid | ’s, headquarters, in))
    - HMMs, sparse sequences, Conditional Random Fields, …
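As a sketch of the vector-space acceptor: contexts and patterns are
L2-normalized term vectors compared by cosine similarity. The helper
names and the matching threshold are my own illustrative choices, not
Snowball's published implementation details; the 0.57 and 0.5 weights
do fall out of the normalization, matching the slides' examples.

    import math
    from collections import Counter

    def context_vector(terms):
        # Represent a context (the words around/between two tagged
        # entities) as an L2-normalized bag-of-words vector.
        counts = Counter(terms)
        norm = math.sqrt(sum(c * c for c in counts.values()))
        return {t: c / norm for t, c in counts.items()}

    def match(pattern_vec, context_vec):
        # Cosine similarity between pattern centroid and candidate context.
        return sum(w * context_vec.get(t, 0.0)
                   for t, w in pattern_vec.items())

    # Pattern from "Microsoft's headquarters in Redmond"-style contexts:
    pattern = context_vector(["'s", "headquarters", "in"])      # weights 0.57
    # Candidate: ORGANIZATION "'s new headquarters in" LOCATION:
    candidate = context_vector(["'s", "new", "headquarters", "in"])  # 0.5

    sim = match(pattern, candidate)
    print(round(sim, 2))  # 0.87; accept if sim exceeds a chosen threshold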
Snowball: Generating Patterns

1. Represent occurrences as vectors of tags and terms.
2. Cluster similar occurrences:

  ORGANIZATION {<’s 0.57>, <headquarters 0.57>, <in 0.57>} LOCATION
  ORGANIZATION {<’s 0.57>, <headquarters 0.57>, <near 0.57>} LOCATION

  LOCATION {<- 0.71>, <based 0.71>} ORGANIZATION
  LOCATION {<- 0.71>, <based 0.71>} ORGANIZATION
Snowball: Generating Patterns (cont.)

3. Create patterns as filtered cluster centroids:

  ORGANIZATION {<’s 0.71>, <headquarters 0.71>} LOCATION
  LOCATION {<- 0.71>, <based 0.71>} ORGANIZATION
Vector Space Clustering

[Figure: occurrence context vectors clustered in vector space]
Snowball: Extracting New Tuples

Match tagged text fragments against the patterns. For the fragment

  Google ’s new headquarters in Mountain View are …

the candidate context is ORGANIZATION {<’s 0.5>, <new 0.5>,
<headquarters 0.5>, <in 0.5>} LOCATION {<are 1>}, which matches:

  P1: ORGANIZATION {<’s 0.71>, <headquarters 0.71>} LOCATION   Match = 0.8
  P2: ORGANIZATION {<located 0.71>, <in 0.71>} LOCATION        Match = 0.4
  P3: LOCATION {<- 0.71>, <based 0.71>} ORGANIZATION           Match = 0
Snowball: Evaluating Patterns

Automatically estimate pattern confidence against the current seed
tuples (IBM/Armonk, Intel/Santa Clara, Microsoft/Redmond):

  P4: ORGANIZATION {<, 1>} LOCATION

  “IBM, Armonk, reported…”               → (IBM, Armonk): Positive
  “Intel, Santa Clara, introduced...”    → (Intel, Santa Clara): Positive
  “Bet on Microsoft”, New York-based
  analyst Jane Smith said...             → (Microsoft, New York): Negative
                                           (the seed says Microsoft → Redmond)

  Conf(P4) = Positive / Total = 2/3 = 0.66
Snowball: Evaluating Tuples

Automatically evaluate tuple confidence:

  Conf(T) = 1 − ∏_i ( 1 − Conf(P_i) · Match(P_i) )

A tuple has high confidence if generated by high-confidence patterns.
Example: the candidate tuple (3COM, Santa Clara) is extracted by
P4 (Conf 0.66, Match 0.4) and P3 (Conf 0.95, Match 0.8):

  P4: ORGANIZATION {<, 1>} LOCATION
  P3: LOCATION {<- 0.75>, <based 0.75>} ORGANIZATION

  Conf(T) = 1 − (1 − 0.66·0.4)(1 − 0.95·0.8) ≈ 0.83
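This noisy-or style combination is a one-liner; a minimal sketch,
assuming each extraction is given as a (pattern confidence, match
score) pair. The computed value, 0.82, is within rounding of the
slide's 0.83.

    def tuple_confidence(extractions):
        # Conf(T) = 1 - prod(1 - Conf(P_i) * Match(P_i))
        conf = 1.0
        for pattern_conf, match in extractions:
            conf *= 1.0 - pattern_conf * match
        return 1.0 - conf

    # (3COM, Santa Clara) extracted by P4 (conf 0.66, match 0.4)
    # and P3 (conf 0.95, match 0.8), as in the example above.
    print(round(tuple_confidence([(0.66, 0.4), (0.95, 0.8)]), 2))  # 0.82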
Snowball: Evaluating Tuples (cont.)

  Organization         Headquarters    Conf
  Microsoft            Redmond         1
  IBM                  Armonk          1
  Intel                Santa Clara     1
  AG Edwards           St Louis        0.9
  Air Canada           Montreal        0.8
  7th Level            Richardson      0.8
  3Com Corp            Santa Clara     0.8
  3DO                  Redwood City    0.7
  3M                   Minneapolis     0.7
  MacWorld             San Francisco   0.7
  ...                  ...             ...
  157th Street         Manhattan       0.52
  15th Party Congress  China           0.3
  15th Century Europe  Dark Ages       0.1

Keep only the high-confidence tuples for the next iteration; start
the new iteration with the expanded example set, and iterate until no
new tuples are extracted.
Pattern-Tuple Duality

A “good” tuple:
  - Extracted by “good” patterns
  - Tuple weight ↔ goodness

A “good” pattern:
  - Generated by “good” tuples
  - Extracts “good” new tuples
  - Pattern weight ↔ goodness

Edge weight: match/similarity of tuple context to pattern.
How to Set Node Weights

Constraint violation (from before):
  - Conf(P) = log(Pos) · Pos / (Pos + Neg)
  - Conf(T) = 1 − ∏_i ( 1 − Conf(P_i) · Match(P_i) )

HITS [Hassan et al., EMNLP 2006]:
  - Conf(P) = ∑ Conf(T)
  - Conf(T) = ∑ Conf(P)

URNS [Downey et al., IJCAI 2005]

EM-Spy [Agichtein, SDM 2006]:
  - Unknown tuples = Neg
  - Compute Conf(P), Conf(T)
  - Iterate
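A sketch of the HITS-style mutual reinforcement on the pattern-tuple
graph. The edge representation and the max-normalization step are my
assumptions (the normalization is standard for HITS), not details
taken from the cited paper.

    def hits_confidences(edges, n_iter=20):
        # edges: dict mapping (pattern, tuple) -> match weight.
        # Conf(P) = sum of Conf(T) over tuples it extracts; vice versa.
        patterns = {p for p, _ in edges}
        tuples_ = {t for _, t in edges}
        p_conf = {p: 1.0 for p in patterns}
        t_conf = {t: 1.0 for t in tuples_}
        for _ in range(n_iter):
            # Patterns are scored by the tuples they extract...
            p_conf = {p: sum(w * t_conf[t]
                             for (p2, t), w in edges.items() if p2 == p)
                      for p in patterns}
            # ...and tuples by the patterns that extract them.
            t_conf = {t: sum(w * p_conf[p]
                             for (p, t2), w in edges.items() if t2 == t)
                      for t in tuples_}
            # Normalize so scores stay bounded (standard HITS step).
            for conf in (p_conf, t_conf):
                norm = max(conf.values()) or 1.0
                for k in conf:
                    conf[k] /= norm
        return p_conf, t_conf

    edges = {("P1", "Microsoft/Redmond"): 0.8,
             ("P1", "3COM/Santa Clara"): 0.7,
             ("P2", "3COM/Santa Clara"): 0.4,
             ("P2", "15th Century Europe/Dark Ages"): 0.3}
    print(hits_confidences(edges))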
Evaluating Patterns and Tuples: Expectation Maximization

EM-Spy Algorithm:
  1. “Hide” the labels for some seed tuples (the “spies”).
  2. Iterate the EM algorithm to convergence on tuple/pattern
     confidence values.
  3. Set the threshold t so that 90% of the spy tuples score above t.
  4. Re-initialize Snowball using the new seed tuples.

  Organization         Headquarters    Initial   Final
  Microsoft            Redmond         1         1
  IBM                  Armonk          1         0.8
  Intel                Santa Clara     1         0.9
  AG Edwards           St Louis        0         0.9
  Air Canada           Montreal        0         0.8
  7th Level            Richardson      0         0.8
  3Com Corp            Santa Clara     0         0.8
  3DO                  Redwood City    0         0.7
  3M                   Minneapolis     0         0.7
  MacWorld             San Francisco   0         0.7
  …                    …               0         …
  157th Street         Manhattan       0         0.52
  15th Party Congress  China           0         0.3
  15th Century Europe  Dark Ages       0         0.1
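The spy-threshold step reduces to picking a confidence cutoff; a
schematic sketch, assuming EM has already assigned confidences to the
hidden seed tuples. Function names and the example scores are invented.

    def spy_threshold(spy_confidences, keep_fraction=0.9):
        # Pick a threshold t so that `keep_fraction` of the hidden seed
        # tuples ("spies") score at or above t; tuples below t are
        # treated as unreliable when Snowball is re-initialized.
        ranked = sorted(spy_confidences, reverse=True)
        cutoff_index = int(len(ranked) * keep_fraction) - 1
        return ranked[max(cutoff_index, 0)]

    # Confidences that EM assigned to the hidden seed tuples:
    spies = [0.95, 0.9, 0.85, 0.8, 0.8, 0.75, 0.7, 0.65, 0.6, 0.2]
    print(spy_threshold(spies))  # 0.6: 90% of spies score at or above it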
Adapting Snowball for New Relations

Large parameter space:
  - Initial seed tuples (randomly chosen, multiple runs)
  - Acceptor features: words, stems, n-grams, phrases, punctuation, POS
  - Feature selection techniques: OR, NB, Freq, “support”, combinations
  - Feature weights: TF*IDF, TF, TF*NB, NB
  - Pattern evaluation strategies: NN, constraint violation, EM, EM-Spy

Automatically estimate parameter values:
  - Estimate operating parameters based on occurrences of seed tuples
  - Run cross-validation on hold-out sets of seed tuples for optimal
    performance
  - Seed occurrences that do not have close “neighbors” are discarded
Example Task: DiseaseOutbreaks  (SDM 2006)

  Proteus:  0.409
  Snowball: 0.415
Snowball Used in Various Domains

- News: NYT, WSJ, AP [DL’00, SDM’06]
  - CompanyHeadquarters, MergersAcquisitions, DiseaseOutbreaks
- Medical literature: PDR, Micromedex… [Thesis]
  - AdverseEffects, DrugInteractions, RecommendedTreatments
- Biological literature: GeneWays corpus [ISMB’03]
  - Gene and Protein Synonyms
Outline

- Information extraction overview
- Partially supervised information extraction
  - Adaptivity
  - Confidence estimation
- Text retrieval for scalable extraction
  - Query-based information extraction
  - Implicit connections/graphs in text databases
- Current and future work
  - Inferring and analyzing social networks
  - Utility-based extraction tuning
  - Multi-modal information extraction and data mining
  - Authority/trust/confidence estimation
Extracting A Relation From a Large Text Database

  Text Database → Information Extraction System → Structured Relation

- Brute force approach: feed all docs to the information extraction
  system. Expensive for large collections.
- Often only a tiny fraction of the documents are useful.
- Many databases are not crawlable.
- Often a search interface is available, with an existing keyword index.
- How to identify “useful” documents?
An Abstract View of Text-Centric Tasks
[Ipeirotis, Agichtein, Jain, Gravano, SIGMOD 2006]

  Text Database → Extraction System → Output tuples
  1. Retrieve documents from the database
  2. Process documents
  3. Extract output tuples

  Task                     “tuple”
  Information Extraction   Relation Tuple
  Database Selection       Word (+Frequency)
  Focused Crawling         Web Page about a Topic
Executing a Text-Centric Task

  Text Database → Extraction System → Output tuples
  (retrieve documents, process them, extract output tuples)

Similar to the relational world: two major execution paradigms
  - Scan-based: retrieve and process documents sequentially
  - Index-based: query the database (e.g., [case fatality rate]),
    retrieve and process the documents in the results
  → the underlying data distribution dictates what is best

Unlike the relational world:
  - Indexes are only “approximate”: the index is on keywords, not on
    tuples of interest
  - The choice of execution plan affects output completeness
    (not only speed)
Scan

Scan retrieves and processes documents sequentially (until reaching
the target recall):

  Execution time = |Retrieved Docs| · (R + P)

where R is the time for retrieving a document and P the time for
processing a document.

Question: how many documents does Scan retrieve to reach the target
recall?

Filtered Scan uses a classifier to identify and process only
promising documents (details in paper).
Iterative Query Expansion

1. Query the database with seed tuples (e.g., [Ebola AND Zaire]).
2. Process the retrieved documents.
3. Extract tuples from the docs (e.g., <Malaria, Ethiopia>).
4. Augment the seed tuples with the new tuples, and repeat.

  Execution time = |Retrieved Docs| · (R + P) + |Queries| · Q

where R is the time for retrieving a document, P the time for
processing a document, and Q the time for answering a query.

Question: how many queries and how many documents does Iterative Set
Expansion need to reach the target recall?
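Both cost formulas drop directly into code; a small sketch in which
all timing constants and document/query counts are invented for
illustration.

    def scan_time(retrieved_docs, r, p):
        # Execution time = |Retrieved Docs| * (R + P)
        return retrieved_docs * (r + p)

    def iterative_expansion_time(retrieved_docs, queries, r, p, q):
        # Execution time = |Retrieved Docs| * (R + P) + |Queries| * Q
        return retrieved_docs * (r + p) + queries * q

    # Hypothetical workload: reaching the same recall takes Scan
    # 100,000 documents, while querying reaches it with 5,000
    # documents and 500 queries.
    R, P, Q = 0.1, 1.0, 0.5   # secs per retrieval, processing, query
    print(scan_time(100_000, R, P))                       # 110000.0 s
    print(iterative_expansion_time(5_000, 500, R, P, Q))  # 5750.0 s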
QXtract: Querying Text Databases for Robust Scalable Information
EXtraction

User-provided seed tuples:

  DiseaseName   Location   Date
  Malaria       Ethiopia   Jan. 1995
  Ebola         Zaire      May 1995

Query Generation → Queries → Search Engine over the Text Database →
Promising Documents → Information Extraction System → Extracted
Relation:

  DiseaseName      Location   Date
  Malaria          Ethiopia   Jan. 1995
  Ebola            Zaire      May 1995
  Mad Cow Disease  The U.K.   July 1995
  Pneumonia        The U.S.   Feb. 1995

Problem: learn keyword queries to retrieve “promising” documents.
Learning Queries to Retrieve Promising Documents

1. Get a document sample with “likely negative” and “likely positive”
   examples (seed sampling via the search engine).
2. Label the sample documents using the information extraction system
   as an “oracle”: documents that yield tuples are positive (+), the
   rest negative (−).
3. Train classifiers to “recognize” useful documents.
4. Generate queries from the classifier model/rules.
Training Classifiers to Recognize “Useful” Documents

Document features: words.

  D1 (+): disease, reported, epidemic, expected, area
  D2 (+): virus, reported, expected, infected, patients
  D3 (−): products, made, used, exported, far
  D4 (−): past, old, homerun, sponsored, event

  Ripper (rules):     disease AND reported => USEFUL
  SVM (term weights): disease, reported, epidemic, infected, virus (+);
                      products, exported, used, far (−)
  Okapi (IR):         virus 3, infected 2, sponsored −1
Generating Queries from Classifiers

  Ripper:     disease AND reported
  SVM:        epidemic; virus
  Okapi (IR): virus; infected
  QCombined:  disease AND reported; epidemic; virus
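A sketch of query generation from a linear (SVM-style) classifier,
under the assumption that queries are simply the top positively
weighted terms; the weights below are invented.

    def queries_from_linear_model(term_weights, k=2):
        # Turn a linear classifier over word features into keyword
        # queries: take the k terms with the largest positive weights.
        ranked = sorted(term_weights.items(),
                        key=lambda kv: kv[1], reverse=True)
        return [term for term, weight in ranked[:k] if weight > 0]

    # Invented SVM-style weights over document words:
    weights = {"disease": 1.4, "epidemic": 1.2, "virus": 1.1,
               "reported": 0.9, "products": -0.8, "sponsored": -1.0}
    print(queries_from_linear_model(weights, k=2))  # ['disease', 'epidemic']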
[Screenshot: SIGMOD 2003 Demonstration]
An Even Simpler Querying Strategy: “Tuples”

1. Convert the given tuples into queries (e.g., “Ebola” AND “Zaire”).
2. Retrieve the matching documents via the search engine.
3. Extract new tuples from the documents with the information
   extraction system, and iterate.

  DiseaseName        Location   Date
  Ebola              Zaire      May 1995
  Malaria            Ethiopia   Jan. 1995
  hemorrhagic fever  Africa     May 1995
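A minimal sketch of the Tuples loop, where search(query) and
extract(doc) are placeholders standing in for the full-text search
engine and the IE system, not real APIs.

    def tuples_strategy(seed_tuples, search, extract, max_iterations=100):
        # Iteratively turn known tuples into queries, harvest new tuples.
        known = set(seed_tuples)
        frontier = list(seed_tuples)
        for _ in range(max_iterations):
            if not frontier:
                break  # no unqueried tuples left: reachable set exhausted
            t = frontier.pop()
            query = " AND ".join(f'"{field}"' for field in t)
            for doc in search(query):        # e.g. "Ebola" AND "Zaire"
                for new_tuple in extract(doc):
                    if new_tuple not in known:
                        known.add(new_tuple)
                        frontier.append(new_tuple)
        return known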
Comparison of Document Access Methods

[Chart: recall (%) vs. MaxFractionRetrieved (5%, 10%, 25%) for
QXtract, Manual, Tuples, and Baseline]

- QXtract: 60% of the relation extracted from 10% of the documents
  in a 135,000 newspaper article database.
- Tuples strategy: recall at most 46%.
Predicting Recall of the Tuples Strategy  (WebDB 2003)

Starting from a seed tuple can end in SUCCESS (the queries keep
reaching new documents and tuples) or FAILURE (the iteration dies
out). Can we predict if Tuples will succeed?
Using the Querying Graph for Analysis

We need to compute:
  - the number of documents retrieved after sending Q tuples as
    queries (estimates time), and
  - the number of tuples that appear in the retrieved documents
    (estimates recall).

The querying graph links tuples (e.g., <SARS, China>, <Ebola, Zaire>,
<Malaria, Ethiopia>, <Cholera, Sudan>, <H5N1, Vietnam>) to the
documents d1 … d5 that contain them.

To estimate these quantities we need the:
  - degree distribution of the tuples discovered by retrieving
    documents;
  - degree distribution of the documents retrieved by the tuples.

(Not the same as the degree distribution of a randomly chosen tuple
or document: it is easier to discover documents and tuples with high
degrees.)
Information Reachability Graph

Tuples t1 … t5 and documents d1 … d5 form a bipartite querying graph;
collapsing it onto the tuples gives the reachability graph. For
example, t1 retrieves document d1, which contains t2, so t2, t3, and
t4 are “reachable” from t1.
Connected Components

  In → Core (strongly connected) → Out

- In: tuples that retrieve other tuples but are not themselves
  reachable
- Core: tuples that retrieve other tuples and each other (strongly
  connected)
- Out: reachable tuples that do not retrieve tuples in the Core
Sizes of Connected Components

How many tuples are in the largest Core + Out? Starting from a tuple
t0 in In, the querying process can reach the Core and Out components.

Conjecture:
  - The degree distribution in reachability graphs follows a
    “power law.”
  - Then the reachability graph has at most one giant component.

Define Reachability as the fraction of tuples in the largest
Core + Out.
NYT Reachability Graph: Outdegree Distribution

[Log-log plots for MaxResults=10 and MaxResults=50: the outdegree
distribution matches a power law]
NYT: Component Size Distribution

[Component size plots]
  MaxResults=10: CG / |T| = 0.297  (most tuples not “reachable”)
  MaxResults=50: CG / |T| = 0.620  (“reachable”)
Connected Components Visualization

[Visualization: DiseaseOutbreaks, New York Times 1995]
Estimating Reachability

In a power-law random graph G, a giant component CG emerges* if d
(the average outdegree) > 1. Then:

  - Estimate: Reachability ~ CG / |T|
  - Depends only on d (the average outdegree)

* For β < 3.457; Chung and Lu, Annals of Combinatorics, 2002
Estimating Reachability Algorithm

1. Pick some random tuples.
2. Use the tuples to query the database.
3. Extract tuples from the matching documents to compute the
   reachability graph edges.
4. Estimate the average outdegree (e.g., d = 1.5 in the sample graph).
5. Estimate reachability using the results of Chung and Lu, Annals of
   Combinatorics, 2002.

A sketch of these steps follows below.
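A sketch of steps 1-4, again with search and extract as placeholder
hooks; step 5 is reduced here to the slide's d > 1 giant-component
condition rather than the full Chung-Lu estimate.

    import random

    def estimate_avg_outdegree(known_tuples, search, extract,
                               sample_size=50):
        # Sample tuples, query with each, and count the distinct other
        # tuples found in the matching documents (the outdegree).
        sample_size = min(sample_size, len(known_tuples))
        sample = random.sample(list(known_tuples), sample_size)
        total_out = 0
        for t in sample:
            query = " AND ".join(f'"{field}"' for field in t)
            reached = set()
            for doc in search(query):
                reached.update(extract(doc))
            reached.discard(t)
            total_out += len(reached)
        return total_out / sample_size

    def predicts_giant_component(d):
        # Slide's condition: a giant component emerges when the average
        # outdegree d > 1 (for power-law exponent beta < 3.457).
        return d > 1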
Estimating Reachability of NYT

[Chart: estimated reachability vs. MaxResults (MR = 1, 10, 50, 100,
200, 1000) for sample sizes S = 10, 50, 100, 200 and the real graph;
the value 0.46 is marked]

Approximate reachability is estimated after ~50 queries. This can be
used to predict the success (or failure) of a Tuples querying
strategy.
Outline

- Information extraction overview
- Partially supervised information extraction
  - Adaptivity
  - Confidence estimation
- Text retrieval for scalable extraction
  - Query-based information extraction
  - Implicit connections/graphs in text databases
- Current and future work
  - Adaptive information extraction and tuning
  - Authority/trust/confidence estimation
  - Inferring and analyzing social networks
  - Multi-modal information extraction and data mining
Goal: Detect, Monitor, Predict Outbreaks

[Architecture diagram] IE Sys 1-4 feed Data Integration, Data Mining,
and Trend Analysis for Detection, Monitoring, and Prediction. Sources:
  - Hospital records, 911 calls, traffic accidents, …
  - Current patient records: diagnosis, physician's notes,
    lab results/analysis, …
  - Historical news, breaking news stories, wire, alerts, …
Adaptive, Utility-Driven Extraction

Extract relevant symptoms and modifiers from text:
  - Physician notes, patient narrative, call transcripts

Call transcripts: a difficult extraction problem
  - Not grammatical, dialogue, speech-to-text unreliable, …
  - Use partially supervised techniques to learn extraction patterns

One approach:
  - Link together (when possible) the call transcript and the patient
    record (e.g., by time, address, and patient name)
  - Correlate patterns in the transcript with diagnosis/symptoms
  - Fine-grained learning: can automatically train for each symptom,
    group of patients, etc.
Authority, Trust, Confidence

How reliable are the signals emitted by information extraction?

Dimensions of trust/confidence:
  - Source reliability: diagnosis vs. notes vs. 911 calls
  - Tuple extraction confidence
  - Source extraction difficulty
Source Confidence Estimation  (CIKM 2005)

A task is “easy” when context term distributions diverge from the
background, e.g., contexts such as “President George W Bush’s
three-day visit to India” vs. background terms (the, to, and, said,
’s, company, won, president, mrs).

Quantify as relative entropy (Kullback-Leibler divergence):

  KL(LM_C || LM_BG) = Σ_{w ∈ V} LM_C(w) · log( LM_C(w) / LM_BG(w) )

After calibration, the metric predicts whether a task is “easy” or
“hard”.
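The divergence itself is a short sum; a sketch with eps-smoothing
added (my assumption, to keep the logarithm finite) and made-up term
frequencies.

    import math

    def kl_divergence(lm_context, lm_background, vocab, eps=1e-9):
        # KL(LM_C || LM_BG) = sum_w LM_C(w) * log(LM_C(w) / LM_BG(w)),
        # with eps-smoothing so unseen background terms don't blow up.
        return sum(
            lm_context[w] * math.log((lm_context[w] + eps) /
                                     (lm_background.get(w, 0.0) + eps))
            for w in vocab
            if lm_context.get(w, 0.0) > 0.0
        )

    # Made-up unigram language models: extraction contexts vs. background.
    lm_c = {"president": 0.3, "visit": 0.3, "the": 0.2, "to": 0.2}
    lm_bg = {"the": 0.4, "to": 0.3, "said": 0.2, "president": 0.1}
    print(kl_divergence(lm_c, lm_bg, set(lm_c) | set(lm_bg)))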
Inferring Social Networks

Explicit networks:
  - Patient records: family, geographical entities in structured and
    unstructured portions

Implicit connections:
  - Extract events (e.g., “went to restaurant X yesterday”)
  - Extract relationships (e.g., “I work in Kroeger’s in Toco Hills”)
Modeling Social Networks for Epidemiology, Security, …

[Figure: email exchange mapped onto cubicle locations]
Improve Prediction Accuracy

Suppose we managed to:
  - automatically identify people currently sick or about to get
    sick, and
  - automatically infer (part of) their social network.

Can we improve prediction for the dynamics of an outbreak?
Multimodal Information Extraction and Data Mining

Develop joint models over structured data and text, e.g., lab results
and symptoms extracted from text.

One approach: mutual reinforcement
  - Co-training: train classifiers on redundant views of the data
    (e.g., structured & unstructured)
  - Bootstrap on examples proposed by both views
  - More generally: graphical models
Summary

- Information extraction overview
- Partially supervised information extraction
  - Adaptivity
  - Confidence estimation
- Text retrieval for scalable extraction
  - Query-based information extraction
  - Implicit connections/graphs in text databases
- Current and future work
  - Adaptive information extraction and tuning
  - Authority/trust/confidence estimation
  - Inferring and analyzing social networks
  - Multi-modal information extraction and data mining
Thank You

Details: papers, other talk slides:
http://www.mathcs.emory.edu/~eugene/